I’m one of the engineers at AI2 that helped make this happen. We’re excited about this for several reasons, which I’ll explain below.
Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.
We think this is partly because the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions: there are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.
We’re also excited about distributing papers in HTML form, as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and we want to provide interactive, ML-provided enhancements to the reading experience, like those provided via the Semantic Reader.
We’re eager to hear what you think, and happy to answer questions.
Do you remove the PDF files we send to your servers?
Edit: per https://allenai.org/terms, point 5, you own all the uploads! So if we send a medical PDF by mistake, for example, or something else that falls under GDPR, we can't ask you to delete it????
> What data do we keep? We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document.
Also, we can delete the extracted data on request. Just send a note to accessibility@semanticscholar.org.
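To make that concrete, here's a minimal sketch of a content-addressed cache along the lines described above. The names (process_upload, extract_html) are illustrative, not our actual implementation:

    import hashlib

    def extract_html(pdf_bytes: bytes) -> str:
        """Hypothetical stand-in for the real PDF-to-HTML extraction step."""
        return "<article>...</article>"

    # Extraction output is keyed by a hash of the uploaded bytes; the
    # upload itself is discarded. A cached result can only be reached
    # by uploading a byte-identical document.
    _cache: dict[str, str] = {}

    def process_upload(pdf_bytes: bytes) -> str:
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in _cache:
            _cache[key] = extract_html(pdf_bytes)  # keep output, drop upload
        return _cache[key]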
Is there any thought about presenting the papers as TEI XML with XSLT to display the paper in a browser or screenreader? TEI provides pagination support (needed for citing page numbers, because most of academia still needs that) and extensive semantic markup for things like bibliographic information. It also serves as one data model that can be converted easily with existing tools (XSLT) to provide many representations for humans, while also serving as a machine-parsable text for datamining. Digital humanities has made heavy use of TEI for years, and this project seems like it could benefit from it.
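Part of the appeal is how little code the presentation step needs once the TEI exists. A rough sketch with Python's lxml (both file names are placeholders):

    from lxml import etree

    # Sketch: render a TEI document to HTML with an XSLT stylesheet.
    # "paper.tei.xml" and "tei-to-html.xsl" are placeholder file names.
    tei = etree.parse("paper.tei.xml")
    to_html = etree.XSLT(etree.parse("tei-to-html.xsl"))
    result = to_html(tei)
    print(etree.tostring(result, pretty_print=True).decode())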
From a user's perspective, I don't understand why not release the source code and let people compile a native application. (Did I miss the link to the source code?) Instead it looks like this is just a means of collecting free data (metadata, more training data, data from submitted papers by default) every time someone submits a paper.
I always assumed the main reason for using PDFs is that an author/distributor can be pretty sure they're rendered almost exactly the same (fonts, layout) no matter which viewer they're viewed with.
This probably evokes some sense of authenticity. Like a physical paper document, it has exactly one appearance.
There's also the annotation features in PDFs which allow me to highlight text and add a comment.
I don't know of any more convenient way to directly attach my thoughts to a specific portion of text. (If there is, I'd genuinely like to know).
And it even works well across multiple devices: I have a folder on my PC that's synced with my phone via Syncthing, containing mostly PDFs (saved web pages, papers, books, ...), and the annotations I make in those PDFs on my phone are directly available on my PC ... all without using some cloud bullsh*.
Unfortunately for my mental health, my thesis was about exactly this: converting arXiv papers to modern-looking HTML. And there are so many more broken, unjust, and ugly things in academia than the use of PDFs...
Regarding your question, I'd say it's a natural continuation of a centuries-long tradition of writing on actual paper. The invention of TeX made it easier to produce more papers; then came PDF, and you could produce virtual papers. Also, science journals pretty much have a monopoly on scientific knowledge distribution, and they are mostly paper too.
Y'know, that's a good question. I'm not sure I know the answer.
My guess is it's largely for historical reasons. At the time most venues were organized, PDF was probably the best (or only) mechanism for sharing documents intended for print distribution.
I like print format for reading purposes, even if it's on my epaper tablet. The other day when I took a train for 8 hours, I printed out several papers on my b&w laserjet to read. And diagrams are more difficult these days because people make them all in colour, sometimes in ways that are very hard to read once converted to b&w.
I find it a real tragedy that all these efforts to turn papers into dynamic content, which I wholeheartedly applaud, ignore the still very relevant use case of printing. Every preview mechanism for camera-ready papers should include a b&w print-preview mode.
The other advantage of PDF is that "page count" still means something. There's a reason journals limit page count, and it's not because it adds a few kbs to the download. It's because long-winded papers that don't get to the point need editing.
If all goes well we won't need this software anymore. In a best case scenario the publishers start accepting HTML, and gone are the days of having to convert PDFs to something better...!
It would be great if you could add some basic CSS rules for print. Right now navigation elements are needlessly repeated on each page, obscuring the content.
Also, you forgot to include bold and italic webfonts, so all headings and emphasis get faux styles.
You could give Calibre a try. The result will probably be a long way from perfect for complicated documents, but it does work reasonably well for most things. Unfortunately, formulas don't translate well.
Their job is a little bit easier because arXiv papers have the .tex source available, so you can use one of the various tex2html variants, instead of having to extract the paper's contents from a rendered PDF.
Our focus right now is on providing a tool folks can run on whatever papers they have access to. For instance, some researchers might have access to documents that aren't available to the public. We want them to be able to run this against those.
That said, as we expand the effort I imagine we'll eventually pre-convert things that are publicly available, like those on arXiv, etc.
Yup, right now we use GROBID, do some post-processing, and combine the output with other extraction techniques. For instance, we use a model to extract document figures [1], so that we can render them in the resulting HTML document.
Also, we're working hard on a new extraction mechanism that should allow us to replace GROBID [2].
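If anyone wants to poke at the first stage themselves, GROBID runs locally and exposes a REST API. A rough sketch (localhost:8070 is GROBID's default port; the file name is a placeholder):

    import requests

    # Sketch: send a PDF to a locally running GROBID server and get
    # back TEI XML describing the document's structure.
    with open("paper.pdf", "rb") as pdf:
        resp = requests.post(
            "http://localhost:8070/api/processFulltextDocument",
            files={"input": pdf},
            timeout=120,
        )
    resp.raise_for_status()
    tei_xml = resp.text  # structured text, headers, references, etc.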
There are a lot of really smart people at AI2 working on this. I'm excited to see the resulting improvements and the cool things (like this) that we build with the results!
Cool project, though the name was confusing for me: I believe that to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to HTML.
One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?)
I see 421 requests and 150 MB loaded. Since these are mostly thumbnails, have you considered using JPEGs instead of PNGs, lazy-loading images outside the viewport, and putting things behind GCP's (or another provider's) CDN offering?
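For the PNG-to-JPEG piece, even a one-off Pillow pass would probably shrink the payload substantially; a rough sketch (the directory name is made up):

    from pathlib import Path
    from PIL import Image

    # Sketch: re-encode PNG thumbnails as JPEGs, which are typically
    # a fraction of the size for figure-like images.
    for png in Path("thumbnails").glob("*.png"):
        img = Image.open(png).convert("RGB")  # JPEG has no alpha channel
        img.save(png.with_suffix(".jpg"), "JPEG", quality=80, optimize=True)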
Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.
The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)
Cheers and congrats again
P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"
> One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?)
Yup. There's no CDN or anything like that right now. We kept things simple to get this out the door. But we definitely intend to make improvements like this as we improve the tool.
The more adoption we see, the more it motivates these types of fixes!
> P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"
Thanks for the catch. As you noted, there are still a fair number of extraction errors for us to correct!
This is amazing! It will finally make my (offline-only) Kindle display scientific papers. I took a random arXiv link and it worked like a charm, including the TOC. Will this be open-sourced?
You may want to check out https://arxiv-vanity.com as well. It's open source; conversion rates are close to 70% on random arXiv papers if I'm not mistaken, but it can hardly be called stable.
There is an offline solution if you're looking for one: the app is Calibre. It's basically an ebook manager & extras. It can convert a PDF into MOBI, customizable to your preferences, and it has a preset for Kindles. It also works with DRM'ed files via the DeDRM plugin, and Calibre can export directly to your Kindle. A fair warning: don't use Calibre if you've carefully structured your ebook folder. The app will import everything and keep it within its own database folder, thus doubling the space used.
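If you'd rather script it, Calibre also ships an ebook-convert CLI; a rough sketch driving it from Python (file names are examples):

    import subprocess

    # Sketch: convert a PDF to MOBI with Calibre's bundled
    # ebook-convert, using its Kindle output profile.
    subprocess.run(
        ["ebook-convert", "paper.pdf", "paper.mobi",
         "--output-profile", "kindle"],
        check=True,
    )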
Yay, glad to hear it! If you end up viewing one of these on your Kindle, let us know how well (or not) things work.
We're not sure if it's something that we can distribute as OSS just yet. It relies on a few internal libraries that would also need to be publicly released, so it's not as simple as adjusting a single repository's visibility.
I've used KOReader in the past, and it's awesome! Keeping the jailbreak when my Kindle randomly decides to update itself, not so much. (Yes, I followed the instructions to disable updates, but it still somehow managed to update.) At some point it becomes too much of a hassle.
Though OP keeps their Kindle offline all the time, so that's not an issue for them.
It's gotten a lot better since we entered the KindleBreak era. The community went nuclear, and now instead of applying various hacks to try and prevent updates from being downloaded, the jailbreak package includes a little service that (as I understand it) watches the disk and immediately deletes anything that looks like an update package. The MobileRead "Open Sesame!" thread [0] has all the modern tooling in one place, if you're interested.
I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.
But clearly it's a nice idea, and I can't wait for such tools to work better!
This looks exactly like the type of crunch work ML is suited for, but have you considered using brute-force converters like LaTeXML or pandoc where appropriate?
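Both are essentially one-liners when the .tex source is available. A rough sketch via subprocess, assuming paper.tex is the root source file:

    import subprocess

    # pandoc: standalone HTML, with formulas rendered as MathML
    subprocess.run(
        ["pandoc", "paper.tex", "-s", "--mathml", "-o", "paper.html"],
        check=True,
    )

    # LaTeXML: .tex -> XML, then XML -> HTML
    subprocess.run(["latexml", "--destination=paper.xml", "paper.tex"],
                   check=True)
    subprocess.run(["latexmlpost", "--destination=paper.html", "paper.xml"],
                   check=True)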
> What are the limitations?
There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content is either extracted with low fidelity or not extracted at all from PDFs. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.
I am so amazed at the work you're doing at AI2 and on the Semantic Scholar project. You're really fixing a broken system of research and discovery, one whose organizing principles date back to university library card catalogs and haven't kept pace with exponential content growth.
There are a lot of amazing people here doing great work. It's an inspiring place to be, and I feel lucky to work with such great people on interesting, important problems.