
I'm very interested in getting into archival (getting started this month after a few more conversations).

Your buy button[0] is broken. You're potentially missing out on a few sales due to this.

Are 2x 4GB SD cards sufficient for your purposes? I've been quoted 50MB TIFF images as a standard, and a lot of books wouldn't fit at that size without swapping SD cards.

[0] http://store.diybookscanner.org/
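For the SD card question, a rough capacity estimate helps. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes), one card per camera, and that every image is exactly 50MB; real cards lose some space to filesystem overhead, so treat the result as an upper bound.

```python
# Back-of-the-envelope: how many 50MB TIFFs fit on 2x 4GB SD cards?
# Assumptions (not from the thread): decimal units, no filesystem overhead.
GB = 10**9
MB = 10**6


def images_per_card(card_bytes: int, image_bytes: int) -> int:
    """Whole images that fit on one card, ignoring filesystem overhead."""
    return card_bytes // image_bytes


card_bytes = 4 * GB
image_bytes = 50 * MB

per_card = images_per_card(card_bytes, image_bytes)
total_pages = 2 * per_card  # two cards, one page image per shot

print(per_card, total_pages)  # 80 160
```

At that size a book of more than about 160 pages would need a card swap, or storage somewhere other than the cards.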



If you use pi-scan, the images are saved to a USB drive instead of the SD cards.


Archiving what? Just curious.


I want to digitize the entire linguistic and spoken corpus of a critically endangered language[0] and convert it to a searchable format. The goals are to aid language revival, support academic research, and make sure an informed debate can occur when modern usages of the language differ from traditional ones.

Most of the printed books are scattered but available. It's akin to an iceberg, though: there's a significant amount of 'submerged' knowledge about the language in written manuscripts and recorded audio, and this is where a lot of the value comes from. Printed texts are primarily religious, so capturing the colloquial usages of words and phrases is very useful.

Many manuscripts aren't digitized at all, or are available and need transcription.

The language is relatively well-recorded (dating back to at least the late 16th century in written form), and yet small enough that a comprehensive reference is viable: estimates of about 5MM words crop up, but even three times that could easily fit in memory on a Digital Ocean droplet, even if fully POS tagged[1]. Texts are also mostly in the public domain, and there are a lot of bilingual texts (which act as a Rosetta Stone).

[0] https://en.wikipedia.org/wiki/Manx_language#Revival

[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging
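As a rough sanity check on the "fits in memory" claim, here is a minimal sketch of the arithmetic. The per-word and per-tag byte counts are my own assumptions for a compact encoding, not measured figures; a naive in-memory structure (for example, a Python list of tuples) would use several times more.

```python
# Rough size of a POS-tagged corpus at 3x the ~5MM word estimate.
# bytes_per_word and bytes_per_tag are illustrative assumptions only.
MB = 10**6


def corpus_bytes(words: int, bytes_per_word: int, bytes_per_tag: int) -> int:
    """Estimated storage for a corpus where each word carries one tag."""
    return words * (bytes_per_word + bytes_per_tag)


estimated_words = 5_000_000
tripled_words = 3 * estimated_words

size = corpus_bytes(tripled_words, bytes_per_word=12, bytes_per_tag=4)
size_mb = size / MB

print(f"{tripled_words:,} words -> about {size_mb:.0f} MB")
```

Under these assumptions the tagged corpus comes to a few hundred megabytes, which is in line with the claim, though the real figure depends on the encoding and any indexes built on top.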

EDIT: More than happy to talk in depth about this if anyone wants, via comments, or email on my profile.



