What file formats are your existing datasets in? I also work on data processing in a scientific domain where HDF5 is a common format. Unfortunately DuckDB doesn't support HDF5 out of the box, and the existing HDF5 extension wasn't fast enough and didn't have the features I needed, so I made a new one based on the C++ extension template. I'd love to collaborate on it if anyone is interested.
That's really fascinating. Is your format open source? I don't know if I'd have overlapping needs for something like that (though I did investigate HDF5 early on, and it seemed very promising as a place to store our outputs), but I'd be curious to explore it and see what you're doing with it.
Right now we typically read from CSV or Excel, because that's what the scientists prefer to work with. For better or worse. There's a bit of Parquet kicking around. Our wrappers around DuckDB's import handling are very, very thin. DuckDB handles just about everything seamlessly.
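To give a rough idea of how thin such a wrapper can be, here is a minimal sketch, not our actual code. The `READERS` mapping and the `import_sql` helper are hypothetical names made up for illustration, and the `.xlsx` entry assumes a recent DuckDB with the excel extension installed and loaded. The helper only builds the SQL string, so it runs without DuckDB installed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a DuckDB table function.
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".xlsx": "read_xlsx",  # assumption: DuckDB's excel extension is loaded
}


def import_sql(path: str, table: str) -> str:
    """Build a CREATE TABLE statement that loads `path` into `table`."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    # Escape quotes so the path and table name are safe to embed in SQL.
    quoted_path = path.replace("'", "''")
    quoted_table = table.replace('"', '""')
    reader = READERS[suffix]
    return (
        f'CREATE OR REPLACE TABLE "{quoted_table}" AS '
        f"SELECT * FROM {reader}('{quoted_path}')"
    )
```

The resulting string can then be handed to something like `duckdb.sql(...)`, which leaves schema and type detection to DuckDB itself.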