> If anything it makes them more important because you can't systematically arrive at good ones
Important and easy to make are not the same.
I never said prompts didn’t matter, just that they’re so easy to make and so similar to others that they aren’t a moat.
> I may be wrong - but I'll speculate you work on infra and have never had to build a (real) application that is trying to achieve a business outcome.
You’re very wrong. Don’t make assumptions like this. I’ve been a full stack (mostly backend) dev for about 15 years and started working with natural language processing back in 2017, a few years after word2vec was first published.
Prompts are not difficult; they are time consuming. It’s all trial and error. Data entry is also time consuming, but it isn’t difficult and doesn’t provide any moat.
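To make “trial and error” concrete, here’s roughly the loop that all prompt work reduces to. This is a hypothetical sketch: `call_model()` and `score()` are placeholder stubs standing in for a real LLM call and whatever eval metric you’d actually use.

```python
# Hypothetical sketch of the trial-and-error loop prompt work reduces to:
# try variants, score them against a fixed test set, keep the best.
# call_model() and score() are placeholder stubs, not real APIs.

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call (OpenAI, Anthropic, a local model...).
    return prompt.upper()

def score(output: str, expected: str) -> float:
    # Stand-in metric; real evals use exact match, rubrics, or judge models.
    return 1.0 if expected.lower() in output.lower() else 0.0

test_cases = [
    {"inputs": {"text": "LLMs predict tokens."}, "expected": "tokens"},
    {"inputs": {"text": "RAG adds retrieved context."}, "expected": "context"},
]

def evaluate(template: str) -> float:
    total = 0.0
    for case in test_cases:
        output = call_model(template.format(**case["inputs"]))
        total += score(output, case["expected"])
    return total / len(test_cases)

candidates = [
    "Summarize: {text}",
    "You are a careful editor. Summarize in one sentence: {text}",
    "List the key claim in: {text}",
]

best = max(candidates, key=evaluate)
print(best)
```

Nothing in that loop is hard; it’s just tedious, and any competitor can run the same loop.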
> that is hard to replicate.
Because there are so many factors at play _besides_ prompting. Prompting is the easiest thing to do in any agent or RAG pipeline. It’s all the other settings and infra that are difficult to tune to replicate a given result (good chunking of documents, ensuring only high-quality data gets into the system in the first place, etc.).
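Even chunking alone has several knobs that both pipelines would need to match exactly. A rough sketch (the numbers are arbitrary, not recommendations):

```python
# Rough sketch of fixed-size chunking with overlap. Chunk size, overlap,
# and the split unit (chars vs. tokens vs. sentences) all change what the
# retriever returns; the defaults here are arbitrary.

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

# Two pipelines using chunk(doc, 800, 100) vs. chunk(doc, 512, 64) will
# retrieve different passages for the same query, even with identical
# prompts, embeddings, and models.
```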
Not to mention needing to know the exact model and seed used.
Nothing on ChatGPT is reproducible, for example, simply because they include the timestamp in their system prompt.
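To illustrate: the OpenAI API does expose a `seed` parameter for best-effort determinism, but a timestamp in the system prompt defeats it, because the input itself changes between runs. A sketch, assuming the current `openai` Python client:

```python
# Sketch: even with a fixed seed and temperature 0, a timestamp in the
# system prompt changes the input between runs, so outputs need not match.
# Assumes the openai Python client; seed is best-effort determinism only.
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    system = f"Current date: {datetime.now(timezone.utc).isoformat()}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        seed=42,        # fixed seed...
        temperature=0,  # ...and greedy-ish sampling...
        messages=[
            {"role": "system", "content": system},  # ...but this differs per run
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Calling ask("same question") twice gives different inputs whenever the
# timestamp differs, so reproducibility is gone before sampling even starts.
```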
> Good prompts -> good output -> good training data -> good model.
This is not correct at all. I’m going to assume you made a mistake, since this reads as though you think models are trained on their own output, and we know that synthetic datasets make for poor training data. I feel like you should know that.
A good model will give good output. Good output can be directed and refined with good prompting.
It’s not hard to make good prompts, just time consuming.
> but we know that synthetic datasets make for poor training data
This is a silly generalization. Just google "synthetic data for training LLMs" and you'll find a bunch of papers on it. Here's a decent survey: https://arxiv.org/pdf/2404.07503
It's very likely o1 used synthetic data to train the model and/or the reward model they used for RLHF. Why do you think they don't show the reasoning chains...? They literally tell you: competitive reasons.
Arxiv is free; pick up some papers. Good deep learning texts are free; pick some up.
Sure, hand-wave away my entire comment as “nonsense” and ignore how statistics works.
Training a model on synthetic data (obviously) amplifies whatever bias is present in the initial dataset[1], making for poor training data.
IIRC (this subject is a little fuzzy for me), using synthetic data for RLHF is equivalent to just using DPO, so if they did RLHF it probably wasn’t with synthetic data. They may have gone with DPO, though.
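For reference, the DPO objective itself is simple. A minimal PyTorch sketch of the loss, given summed per-sequence log-probs from the trained policy and a frozen reference model:

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023). Inputs are the
# summed log-probs of chosen/rejected responses under the trained policy
# and a frozen reference model; beta controls the implicit KL strength.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected, scaled by beta.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```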
Did you read this paper? No one is suggesting o1 was trained with 100% synthetic data, or 50%, or anything of that nature. Generalizing "synthetic data is bad" from "training exclusively or mostly on synthetic data is bad" is dumb.
Researchers are using synthetic data to train LLMs, especially for fine-tuning, and especially instruction fine-tuning. You are not up to date with recent work on LLMs.
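The common pattern (Self-Instruct / Alpaca style) is to have a strong teacher model generate instruction/response pairs, filter them, and fine-tune a smaller model on the result. A hand-wavy sketch; `generate()` and `looks_reasonable()` are hypothetical stand-ins:

```python
# Hand-wavy sketch of Self-Instruct-style synthetic data generation: a
# strong teacher model produces instruction/response pairs, which are
# filtered and written out for supervised fine-tuning. generate() and
# looks_reasonable() are hypothetical stand-ins, not real APIs.
import json

SEED_TASKS = [
    "Explain what a hash table is.",
    "Write a regex that matches ISO 8601 dates.",
]

def generate(prompt: str) -> str:
    # Stand-in for a call to a strong teacher model.
    return f"(teacher model answer to: {prompt})"

def looks_reasonable(instruction: str, response: str) -> bool:
    # Stand-in filter: dedupe, length checks, heuristics, or a judge
    # model. The filtering step is where most of the real work is.
    return len(response) > 10

dataset = []
for task in SEED_TASKS:
    response = generate(task)
    if looks_reasonable(task, response):
        dataset.append({"instruction": task, "output": response})

with open("synthetic_sft.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```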
I think what actually matters is the "input" and the "interaction". The prompt is just one of them. The key is that you put how you think and how you solve the problem into it and build a system. Not just computer systems, either: "multi-agent" setups and "human society" are also systems.
Prompts provide no moat.