OP here.
I took the unofficial IKEA US dataset (originally scraped by jeffreyszhou) and converted all 30,511 products into a flat, markdown-like protocol called CommerceTXT.
The goal: See whether a flatter structure is more token-efficient for LLM context windows.
The results:
- Size: 30k products across 632 categories.
- Efficiency: The text version uses ~24% fewer tokens (~3.6M tokens saved in total) compared to the equivalent minified JSON (rough measurement sketch below).
- Structure: Files are organized in folders (e.g. /products/category/), which helps with testing hierarchical retrieval routers (routing sketch below).
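A rough sketch of how the token comparison can be reproduced, assuming tiktoken's cl100k_base encoding. The two renderings below (field names and values included) are illustrative only, not the actual CommerceTXT spec:

```python
# Compare token counts of minified JSON vs. a flat, markdown-like rendering.
import json
import tiktoken

# Hypothetical product record (values are placeholders).
product = {
    "name": "BILLY",
    "type": "Bookcase",
    "price": {"amount": 59.99, "currency": "USD"},
    "category": "bookcases-shelving-units",
    "measurements": {"width_cm": 80, "depth_cm": 28, "height_cm": 202},
}

# Minified JSON baseline.
as_json = json.dumps(product, separators=(",", ":"))

# Flat, key-value rendering in the spirit of the post (illustrative only).
as_flat = """\
NAME: BILLY
TYPE: Bookcase
PRICE: 59.99 USD
CATEGORY: bookcases-shelving-units
MEASUREMENTS: 80x28x202 cm
"""

enc = tiktoken.get_encoding("cl100k_base")
json_tokens = len(enc.encode(as_json))
flat_tokens = len(enc.encode(as_flat))
print(f"JSON: {json_tokens} tokens, flat: {flat_tokens} tokens "
      f"({1 - flat_tokens / json_tokens:.0%} fewer)")
```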
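And a minimal sketch of the kind of hierarchical routing the folder layout enables. The folder names, the .txt extension, and the keyword-overlap scoring are all assumptions; a real router would more likely use an LLM or embeddings to pick the category:

```python
# Two-stage routing over a /products/<category>/ layout:
# 1) pick the category folder that best matches the query,
# 2) return the product files inside it.
from pathlib import Path

def route(query: str, root: Path = Path("products")) -> list[Path]:
    words = set(query.lower().split())
    categories = [p for p in root.iterdir() if p.is_dir()]
    if not categories:
        return []
    # Score each category by word overlap with its hyphenated folder name.
    best = max(categories, key=lambda c: len(words & set(c.name.lower().split("-"))))
    return sorted(best.glob("*.txt"))

# e.g. route("tall bookcases for a small office")
```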
The link goes to the dataset on Hugging Face, which has the full benchmarks.
Parser code is here: https://github.com/commercetxt/commercetxt
Happy to answer questions about the conversion logic!
For context, Google’s indexers already use structured product data like this to surface pricing: https://developers.google.com/search/docs/appearance/structu...