I found that implementing a local cache on the runners has been helpful. Ingress/egress on local network is hella slow, especially when each build has ~10-20GB of artifacts to manage.
ZeroFS looks really good. I know a bit about this design space but hadn't run across ZeroFS yet. Do you do testing of the error recovery behavior (connectivity etc)?
This has been mostly manual testing for now.
ZeroFS currently lacks automatic fault injection and proper crash tests, and it’s an area I plan to focus on.
SlateDB, the lower layer, already does DST as well as fault injection though.
A lot of it is wasted in build time though, due to a lack of appropriate caching facilities with GitHub actions.
[0] https://github.com/Barre/ZeroFS/tree/main/.github/workflows