Show HN: Paper to HTML Converter (papertohtml.org)
153 points by codeviking on Sept 15, 2021 | 58 comments


Hi all,

I’m one of the engineers at AI2 that helped make this happen. We’re excited about this for several reasons, which I’ll explain below.

Most academic papers are currently inaccessible. This means, for instance, that researchers who are visually impaired can't access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.

We think this is partly because the PDF format isn't easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they're well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author's part.

We’re also excited about distributing papers in their HTML form as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and want to provide interactive, ML provided enhancements to the reading experience like those provided via the Semantic Reader.

We’re eager to hear what you think, and happy to answer questions.


Do you remove the pdf files we send to your servers?

Edit: per https://allenai.org/terms point 5, you own all the uploads! So if by mistake we send a medical PDF, for example, or something else that falls under the GDPR, we can't ask you to delete it???? Wtfffff


We don't retain the uploaded document. We cache the extracted content to make things more efficient.

See https://papertohtml.org/about:

> What data do we keep? We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document.

Also, we can delete the extracted data on request. Just send a note to accessibility@semanticscholar.org.

Sorry for the confusion!


Ah okay, thank you.

>Also, we can delete the extracted data on request.

Just to be 100% clear, you are referring to the cached extracted data, right?


Yup, that's right.


Thank you very much!


Is there any thought about presenting the papers as TEI XML with XSLT to display the paper in a browser or screenreader? TEI provides pagination support (needed for citing page numbers, because most of academia still needs that) and extensive semantic markup for things like bibliographic information. It also serves as one data model that can be converted easily with existing tools (XSLT) to provide many representations for humans, while also serving as a machine-parsable text for datamining. Digital humanities has made heavy use of TEI for years, and this project seems like it could benefit from it.
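
A minimal sketch of what that could look like with lxml, assuming a TEI file and some TEI-to-HTML stylesheet (the file names here are just placeholders):

    # Sketch: transform a TEI XML document into HTML with an XSLT stylesheet.
    # "tei-to-html.xsl" stands in for any TEI-to-HTML stylesheet, e.g. one
    # derived from the TEI Stylesheets project.
    from lxml import etree

    tei = etree.parse("paper.tei.xml")        # TEI representation of the paper
    transform = etree.XSLT(etree.parse("tei-to-html.xsl"))

    html = transform(tei)
    with open("paper.html", "wb") as out:
        out.write(etree.tostring(html, pretty_print=True, method="html"))

The same TEI file can then feed text-mining pipelines directly, without another extraction pass.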


"We're eager to hear what you think, ..."

I think I will stick with pdftohtml, pdftotext, and pdfimages from Poppler: https://en.wikipedia.org/wiki/Poppler_(software). These take seconds, not minutes.
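
For anyone who wants to script that, a rough sketch (assuming the Poppler command-line tools are installed and on PATH; file names are just examples):

    # Rough sketch: drive the Poppler CLI tools from Python.
    import subprocess

    pdf = "paper.pdf"

    # Plain text, preserving the page layout
    subprocess.run(["pdftotext", "-layout", pdf, "paper.txt"], check=True)

    # A single-document HTML rendering without frames
    subprocess.run(["pdftohtml", "-s", "-noframes", pdf, "paper"], check=True)

    # Dump embedded images as PNG files prefixed with "fig"
    subprocess.run(["pdfimages", "-png", pdf, "fig"], check=True)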

From a user perspective, I don't understand why you wouldn't release the source code and let people compile a native application. (Did I miss the link to the source code?) Instead it looks like this is just a means of collecting free data (metadata, more training data, data from submitted papers by default) every time someone submits a paper.


I've never actually questioned the why, so maybe you could shine some light... why are they usually published as PDFs?


I always assumed the main reason for using PDFs is that an author/distributor can be pretty sure they're rendered almost exactly the same (fonts, layout) no matter which viewer they're viewed with.

This probably evokes some sense of authenticity: like a physical paper document, it has exactly one appearance.


There are also the annotation features in PDFs, which allow me to highlight text and add a comment.

I don't know of any more convenient way to directly attach my thoughts to a specific portion of text. (If there is, I'd genuinely like to know).

And it even works well across multiple devices: I have a folder on my PC that's synced with my phone via Syncthing, containing mostly PDFs (saved web pages, papers, books, ...), and the annotations I make in those PDFs on my phone are directly available on my PC ... all without using some cloud bullsh*.


Unfortunately for my mental health, my thesis was exactly about converting arXiv papers to modern-looking HTML, and there are so many more broken, unjust, and ugly things in academia than using PDFs...

Regarding your question, I'd say it is a natural continuation of the centuries-long tradition of writing on actual paper. The invention of TeX made it easier to produce more papers; then came PDF, and you could produce virtual papers. Also, science journals pretty much have a monopoly on scientific knowledge distribution, and they are mostly paper too.


Y'know, that's a good question. I'm not sure I know the answer.

My guess is it's largely for historical reasons. At the time most venues were organized, PDF was probably the best (or only) mechanism for sharing documents for print distribution.

But we think it's time to change that :).


What alternative do you have? Word file?

PDF is the only widely supported format that can guarantee an accurate reprint.


HTML. As long as the information is preserved, the exact layout is not significant, and a fixed layout actively harms viewing the content on many devices.

Here's the output of this tool on this PDF - https://arxiv.org/pdf/1909.00031.pdf - the content is preserved and the text is readable: https://imgur.com/a/EPCaWxP


Are papers printed anymore?

HTML for text.

SVGs for diagrams.

Equations can be exported as images if needed.


> Are papers printed anymore?

You know what, they are.

I like the print format for reading purposes, even if it's on my epaper tablet. The other day, when I took an 8-hour train ride, I printed out several papers on my b&w laserjet to read. And it's more difficult to read diagrams these days because people make them all in colour, sometimes in ways that are very difficult to read when converted to b&w.

I find it a real tragedy that all these efforts to turn papers into dynamic content, which I wholeheartedly applaud, ignore the still very relevant use case of printing. Every preview mechanism for camera-ready papers should include a b&w print-preview mode.

The other advantage of PDF is that "page count" still means something. There's a reason journals limit page count, and it's not because it adds a few kbs to the download. It's because long-winded papers that don't get to the point need editing.


That's the idea!

If all goes well we won't need this software anymore. In a best case scenario the publishers start accepting HTML, and gone are the days of having to convert PDFs to something better...!


How do you define pages in HTML?


We don't. We extract the content and present it as a single document.

Page anchors can be used for navigating between sections. We present a table of contents that makes this easy. For instance:

https://papertohtml.org/paper?id=6f9fc51102cf49bff4f4e2b3367...
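
To illustrate the idea (this is not the exact markup the tool emits), a tiny sketch of a table of contents driven by in-page anchors:

    # Illustrative sketch only: build a TOC that links to anchored sections
    # within a single HTML document.
    sections = ["Introduction", "Methods", "Results", "Discussion"]

    def slug(title):
        return title.lower().replace(" ", "-")

    toc = "<nav><ul>" + "".join(
        f'<li><a href="#{slug(s)}">{s}</a></li>' for s in sections
    ) + "</ul></nav>"

    body = "".join(
        f'<section id="{slug(s)}"><h2>{s}</h2><p>...</p></section>'
        for s in sections
    )

    print("<article>" + toc + body + "</article>")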


Great initiative, HTML is the way to go!

It would be great if you could add some basic CSS rules for print. Right now navigation elements are needlessly repeated on each page, obscuring the content.

Also, you forgot to include the bold and italic webfonts, so you have faux styles for all headings and emphasis.


I'd love to see a way to re-export a paper into a digital-friendly format, say epub/mobi to use on my e-reader.

Any plans on that?


You could give Calibre a try. The result will probably be a long way from perfect for complicated documents, but it does work reasonably well for most things. Formulas don't translate well, unfortunately.
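
If you want to script it, Calibre also ships a command-line converter. A minimal sketch, assuming Calibre is installed and ebook-convert is on PATH (paths are just examples):

    # Minimal sketch: batch-convert documents to EPUB with Calibre's
    # ebook-convert command-line tool.
    import subprocess
    from pathlib import Path

    for src in Path("papers").glob("*.html"):
        dst = src.with_suffix(".epub")
        subprocess.run(["ebook-convert", str(src), str(dst)], check=True)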


Looks great! Have you considered linking this up to something like arxiv or other preprint sites?


There's already this for arXiv: https://www.arxiv-vanity.com/

Their job is a little bit easier because arXiv papers have the .tex source available, so you can use one of the various tex2html variants, instead of having to extract the paper's contents from a rendered PDF.


Yup, we're definitely thinking about this.

Our focus right now is on providing a tool folks can run on whatever papers they have access to. For instance, some researchers might have access to documents that aren't available to the public. We want them to be able to run this against those.

That said, as we expand the effort I imagine we'll eventually pre-convert things that are publicly available, like those on arXiv, etc.


This seems to be pdftohtml combined with GROBID [1].

It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].

[1]: https://grobid.readthedocs.io/en/latest/

[2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...


Yup, right now we use GROBID, do some post-processing, and combine the output with other extraction techniques. For instance, we use a model to extract document figures [1], so that we can render them in the resulting HTML document.

Also, we're working hard on a new extraction mechanism that should allow us to replace GROBID [2].

There are a lot of really smart people at AI2 working on this; I'm excited to see the resulting improvements and the cool things (like this) that we build with the results!

[1]: https://api.semanticscholar.org/CorpusID:4698432

[2]: https://api.semanticscholar.org/CorpusID:235265639
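
For readers who haven't used GROBID: you typically run it as a service and ask it for a TEI representation of the PDF. A minimal sketch of that call (default local port per the GROBID docs; this is not the exact pipeline behind papertohtml.org):

    # Minimal sketch: request the full-text TEI for a PDF from a locally
    # running GROBID server.
    import requests

    with open("paper.pdf", "rb") as f:
        resp = requests.post(
            "http://localhost:8070/api/processFulltextDocument",
            files={"input": f},
        )
    resp.raise_for_status()

    with open("paper.tei.xml", "w", encoding="utf-8") as out:
        out.write(resp.text)  # TEI XML; later steps turn this into HTML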


> It seems to me the masheen learningz technikz...

Off-topic low-value comment, but I'm now going to be getting a T-shirt made with the caption "i can haz masheen learningz?"


Cool project, though the name was confusing for me: I believe that to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to HTML?


Thanks for the feedback. There's two hard problems n' all that... :)


Great site, congrats!

One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?).

I see 421 requests and 150 MB loaded. As it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, potentially using lazy loading (i.e. not loading images outside of the viewport), and potentially using GCP's (or another provider's) CDN offering?

Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.

The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)

Cheers and congrats again

P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

[0] https://papertohtml.org/gallery

[1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...


> One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?).

Yup. There's no CDN or anything like that right now. We kept things simple to get this out the door. But we definitely intend to make improvements like this as we improve the tool.

The more adoption we see, the more it motivates these types of fixes!

> P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

Thanks for the catch. As you noted, there are still a fair number of extraction errors for us to correct!


Another sample paper that caused some trouble with figure extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf

Very cool project, looking forward to seeing how it develops!


Thanks, I'll pass this example along!


> have you considered using JPEGs instead of PNGs

For thumbs of text papers, perhaps a GIF or PNG would be smaller than a JPEG while retaining pixel-accurate crispness?
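
It's easy to check empirically. A quick sketch with Pillow (the thumbnail path is a placeholder):

    # Quick sketch: compare the on-disk size of the same thumbnail saved
    # as PNG vs. JPEG.
    import os
    from PIL import Image

    img = Image.open("thumb.png").convert("RGB")
    img.save("thumb_out.png", "PNG", optimize=True)
    img.save("thumb_out.jpg", "JPEG", quality=80)

    for path in ("thumb_out.png", "thumb_out.jpg"):
        print(path, os.path.getsize(path), "bytes")

PNG tends to do well on flat, text-heavy renders, while JPEG usually wins on photographic content.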


This is amazing! It will finally make my (offline-only) Kindle display scientific papers. I took a random link off arXiv and it worked like a charm, including the TOC. Will this be open-sourced?


You may check out https://arxiv-vanity.com as well. It's open source; conversion rates are close to 70% on random arXiv papers if I'm not mistaken, but it can hardly be called stable.


There is an offline solution if you are looking for one: the app is Calibre. It is basically an ebook manager & extras. It can convert a PDF into mobi, customizable based on your preferences, and they have a preset for Kindles. It also works with DRM'ed files via the DeDRM plugin, and Calibre can export directly to your Kindle. A fair warning: don't use Calibre if you have your own structured ebook folder. The app will import everything and keep it within its own database folder, thus doubling the space used.


Yay, glad to hear it! If you end up viewing one of these on your Kindle, let us know how well (or not) things work.

We're not sure if it's something that we can distribute as OSS just yet. It relies on a few internal libraries that would also need to be publicly released, so it's not as simple as adjusting a single repository's visibility.


See also KOReader [0], if jailbreaking is an option for you. The built-in column splitter works pretty well for the papers I've used it to read.

[0] https://github.com/koreader/koreader


I've used KOReader in the past, and it's awesome! Keeping the jailbreak when my Kindle randomly decides to update itself, not so much. (Yes, I followed the instructions to disable updates, but it still somehow managed to update.) At some point it becomes too much of a hassle.

Though OP has their Kindle offline all the time, so it's not an issue for them.


It's gotten a lot better since we entered the KindleBreak era. The community went nuclear, and now instead of applying various hacks to try and prevent updates from being downloaded, the jailbreak package includes a little service that (as I understand it) watches the disk and immediately deletes anything that looks like an update package. The MobileRead "Open Sesame!" thread [0] has all the modern tooling in one place, if you're interested.

[0] https://www.mobileread.com/forums/showthread.php?t=320564


(HTML->Mobi is totally possible)


I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.

But it's clearly a nice idea and I can't wait for such tools to work better!


> all of the math and code parts were broken.

Yup, this is a known issue that we're working towards fixing.

> But clearly it is a nice idea and I can't wait that such tools work better!

Glad to hear it!


For non-reflow conversion there is pdf2htmlEX: the original https://github.com/coolwanglu/pdf2htmlEX is discontinued, but development continues under https://github.com/pdf2htmlEX/pdf2htmlEX

Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html


Looks exactly like the type of crunch work ML would do, but have you considered using brute-force converters like latexml or pandoc where appropriate?
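
For reference, the pandoc route (which only works when you have the LaTeX source, not just the PDF) looks roughly like this; flags per pandoc's manual, file names are examples:

    # Rough sketch: convert LaTeX source to standalone HTML, with MathJax
    # for the math, using pandoc's CLI.
    import subprocess

    subprocess.run(
        ["pandoc", "paper.tex", "--standalone", "--mathjax", "-o", "paper.html"],
        check=True,
    )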


Yup, we've tried a lot of different tools in combination. All of them have their own trade-offs and extraction errors.

This system uses GROBID and some extraction techniques of our own. We're working on a GROBID replacement too, which should help us make things better.


I tried several physics papers and none of them had any equations extracted. Does it by design have problems with LaTeX equations?


Yup, this is a known limitation:

> What are the limitations? There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content is either extracted with low fidelity or not being extracted at all from PDFs. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.

But we intend to fix this!


I am so amazed at the work you guys are doing at AI2 & the Semantic Scholar project. You guys are really fixing a broken system of research and discovery, one that suffers from organizational design principles based on university library index-card filing cabinets, magnified by exponential content growth.

Can't wait to see what people do with this...


Thanks!

There's a lot of amazing people here, doing really great work. It's a really inspiring place to be. I feel really lucky to work with such great people on interesting, important problems.

Also, I should mention...we're hiring!

https://allenai.org/careers#current-openings


When are we, as a people, going to ditch PDF? It's an awful format.

My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway. For what, eye candy?

It's time to move on. #ditchpdf


Haven't tried it yet, but a very cool concept.

As per other recent discussions on HN I think the general accessibility of academic papers is ripe for improvement.


Please make it popular in the research field so you can spin up your own Sci-Hub!


Retro mode should be the default.


I agree!

Maybe we'll work on vi bindings next...



