You will need a lot of disk storage, right?

LEARAX · on Jan 20, 2021

There are different extractors/services, and you can toggle them pretty easily. By default it screenshots everything, exports a PDF, saves like 4 different HTML copies, and submits the link to the Wayback Machine. It also tries to extract important text and stores that separately. You could easily configure it to only extract text, turn off some HTML extractors, or disable the PDF and screenshot captures if you want to prioritize disk space.
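
A minimal sketch of what such a low-disk setup might look like, written as a small Python helper that only prints the commands rather than running them. The `SAVE_*` option names and the `archivebox config --set` form are assumed from ArchiveBox's configuration options around the v0.4/v0.5 releases; `LOW_DISK_OVERRIDES` and `config_commands` are hypothetical names for illustration, so check `archivebox config` for what your version actually supports.

```python
# Hypothetical helper: builds `archivebox config --set` commands for a
# text-first, low-disk setup. The SAVE_* names are an assumption based on
# ArchiveBox's configuration options of that era, not a verified list.
LOW_DISK_OVERRIDES = {
    "SAVE_PDF": False,         # skip the PDF export
    "SAVE_SCREENSHOT": False,  # skip the screenshot capture
    "SAVE_DOM": False,         # drop one of the HTML copies
    "SAVE_READABILITY": True,  # keep the extracted article text
}


def config_commands(overrides):
    """Return one `archivebox config --set KEY=VALUE` command per override."""
    return [
        f"archivebox config --set {key}={value}"
        for key, value in overrides.items()
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for command in config_commands(LOW_DISK_OVERRIDES):
        print(command)
```

Printing instead of executing keeps the sketch safe to run anywhere; you would paste the commands into a shell inside your archive directory once you have confirmed the option names.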

flas9sd · on Jan 19, 2021

It doesn't show in the screenshot in the article, but in Aug 2020 ArchiveBox implemented the "readability article text extractor"; see the description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work: https://github.com/pirate/readability-extractor

By extracting only text and article images you could go deep into an archive. If you skip images, much more so.

Ace_Archer · on Jan 19, 2021

That probably depends on the scope of what you're looking to archive. If you're looking to make a local backup of your bookmarks folder (as one of the intentions seems to be), it's probably not an unreasonable amount of storage. Maybe a few GB at most (if you have a moderate to large bookmarks folder), depending on how many sites there are and how heavy they are.

reefab · on Jan 20, 2021

For reference, archivebox uses 250GB for 5000 links in my setup.

mosselman · on Jan 20, 2021

That is an insane amount of storage for so few links. Is your setup somehow very greedy?

Saving an article-only view (images + text) should probably do better.

I suspect your numbers come from JavaScript, CSS, etc.? Is there a way for ArchiveBox to not download React 5000 times but share source files? Most likely, the custom bundles that sites compile will make this impossible most of the time. Just thinking out loud here.

nikisweeting · on Jan 20, 2021

It's recommended to run it on a compressed filesystem like ZFS. On mine it's using ~75GB for ~3000 URLs. It varies greatly depending on the content; usually the vast majority of storage is from video/audio ripped with youtube-dl.
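
As a rough back-of-the-envelope comparison of the two figures quoted in this thread (250GB for 5000 links, and ~75GB for ~3000 URLs on a compressed filesystem), the per-link averages can be worked out as below. The function name `mb_per_link` is made up for illustration, and decimal units (1 GB = 1000 MB) are an assumption.

```python
def mb_per_link(total_gb, links):
    """Average storage per archived link in MB (assumes 1 GB = 1000 MB)."""
    return total_gb * 1000 / links


# Figures as reported in the thread; both are approximate.
reports = {
    "250GB for 5000 links": mb_per_link(250, 5000),
    "~75GB for ~3000 URLs (compressed filesystem)": mb_per_link(75, 3000),
}

for setup, average in reports.items():
    print(f"{setup}: ~{average:.0f} MB per link")
```

That works out to roughly 50 MB and 25 MB per link respectively. The averages say little about a typical page, since a handful of video/audio downloads can dominate the total.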