You might want to overlap the chunks in the first pass, since something could get lost at the chunk boundaries. I'm not any sort of expert on this, but it seems like an obvious pitfall of limited context length.
I really like this idea. It's basically applying the same principles used in image-based nets - i.e. sliding-window convolutional kernels - to text.
Yes, it's a great idea, and I have a version that is basically a convolution over the transcript. It works much better than the current version - it can automatically create cohesive chapters and summaries of those chapters - however, it consumes an order of magnitude more ChatGPT API calls, making it uneconomical (for now!)
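For anyone curious what that looks like in practice, here's a minimal sketch of the sliding-window / overlapping-chunk idea. It assumes the transcript is a list of (start_seconds, text) segments; the window and stride sizes are made up for illustration, and call_llm is just a stand-in for whatever ChatGPT API call you're using:

```python
from typing import List, Tuple

Segment = Tuple[float, str]  # (start time in seconds, caption text)

def sliding_windows(transcript: List[Segment],
                    window_s: float = 300.0,
                    stride_s: float = 150.0) -> List[List[Segment]]:
    """Yield overlapping windows so a topic cut off at one boundary
    still appears in full in at least one window."""
    if not transcript:
        return []
    end = transcript[-1][0]
    windows = []
    start = 0.0
    while start <= end:
        window = [seg for seg in transcript if start <= seg[0] < start + window_s]
        if window:
            windows.append(window)
        start += stride_s  # stride < window => consecutive windows overlap
    return windows

def summarize_windows(transcript: List[Segment], call_llm) -> List[str]:
    """Summarize each overlapping window with one LLM call (call_llm is a placeholder)."""
    summaries = []
    for window in sliding_windows(transcript):
        text = " ".join(t for _, t in window)
        summaries.append(call_llm(f"Summarize this part of the transcript:\n{text}"))
    return summaries
```

With a stride of half the window you cover every boundary, but you also roughly double the number of API calls compared to non-overlapping chunks, which is where the extra cost comes from.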
Thanks for the kind words. I built it on a few cross-country plane rides and now I mostly just leave it alone. The infrastructure and tooling we have these days is so incredible.
Sure. The old one just splits the transcript into 5-minute chunks and summarizes each of those. The reason this sucks is that a 5-minute chunk can contain multiple topics, or the same topic can be spread across multiple chunks.
This dumb technique is actually pretty useful for a lot of people though, and has the advantages of being super easy to parallelize and requiring only 1 pass through the data.
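Roughly something like this - cut the transcript into non-overlapping 5-minute buckets and summarize each one on its own. Since the chunks don't depend on each other, the calls are trivially parallelizable; call_llm is again a placeholder for the ChatGPT API call, and the transcript is the same list of (start_seconds, text) segments as above:

```python
from concurrent.futures import ThreadPoolExecutor

def five_minute_chunks(transcript, chunk_s: float = 300.0):
    """Group transcript segments into non-overlapping 5-minute buckets of text."""
    chunks = {}
    for start, text in transcript:  # transcript: list of (start_seconds, text)
        chunks.setdefault(int(start // chunk_s), []).append(text)
    return [" ".join(texts) for _, texts in sorted(chunks.items())]

def summarize_naive(transcript, call_llm, max_workers: int = 8):
    """One pass through the data: one independent LLM call per chunk, run in parallel."""
    prompts = [f"Summarize this 5-minute transcript chunk:\n{chunk}"
               for chunk in five_minute_chunks(transcript)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))
```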
The more advanced technique does a first pass over large chunks of the transcript, creating a list of chapters for each chunk. It then combines those into a single canonical chapter list with timestamps (it usually takes the model a few tries to get that right). Finally it does a second pass through the transcript, summarizing the content of each chapter.
The end result is a lot more useful, but is way slower and more expensive.
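To make the structure concrete, here's a hedged sketch of that two-pass shape - not the actual code. The prompts, retry count, and chapter format (title + start timestamp in seconds) are all assumptions, and call_llm is once more a placeholder for the ChatGPT API call:

```python
import json

def propose_chapters(chunks, call_llm):
    """Pass 1: ask for candidate chapters (title + start time) in each large chunk."""
    candidates = []
    for chunk in chunks:
        candidates.append(call_llm(
            "List the chapters in this transcript chunk as JSON "
            '[{"title": ..., "start": seconds}, ...]:\n' + chunk))
    return candidates

def merge_chapters(candidates, call_llm, retries: int = 3):
    """Combine per-chunk lists into one canonical chapter list,
    retrying when the model returns something that isn't valid JSON."""
    prompt = ("Merge these chapter lists into a single canonical list "
              "with timestamps, as JSON:\n" + "\n".join(candidates))
    for _ in range(retries):
        try:
            return json.loads(call_llm(prompt))
        except json.JSONDecodeError:
            continue
    raise ValueError("model never produced a valid chapter list")

def summarize_chapters(transcript, chapters, call_llm):
    """Pass 2: summarize the transcript text that falls inside each chapter.
    Assumes chapters are sorted by their 'start' timestamp."""
    bounds = [c["start"] for c in chapters] + [float("inf")]
    out = []
    for i, chapter in enumerate(chapters):
        text = " ".join(t for start, t in transcript
                        if bounds[i] <= start < bounds[i + 1])
        out.append({"title": chapter["title"],
                    "summary": call_llm("Summarize this chapter:\n" + text)})
    return out
```

The per-chunk chapter proposals could come from either chunking scheme above; the second pass is what adds most of the extra cost, since every chapter gets its own call.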