I asked why it said it was Claude, and it said it made a mistake, it's actually ...

SkyPuncher · 2025-07-29T23:53:11 1753833191

LLMs don’t know who they are.

This comes up all the time on Cursor forums. People gripe that their premium Sonnet 4 Max requests say they’re 3.5.

Realistically, the LLMs just don’t know who they are.

XenophileJKO · 2025-07-28T18:49:23 1753728563

Routing can happen at the request level.

littlestymaar · 2025-07-28T19:54:03 1753732443

Then you need to reprocess the previous conversation from scratch when switching from one provider to another, which sounds very expensive for no reason.

majormajor · 2025-07-29T04:15:17 1753762517

Take a look at the API calls you'd use to build your own chatbot on top of any of the available models. Like https://docs.anthropic.com/en/api/messages or https://platform.openai.com/docs/api-reference/chat - you send the message history each time. You can even lie about that message history!

You can utilize caching like https://platform.openai.com/docs/guides/prompt-caching and note that "Cache hits are only possible for exact prefix matches within a prompt" and that the cache contains "Messages: The complete messages array, encompassing system, user, and assistant interactions." and "Prompt Caching does not influence the generation of output tokens or the final response provided by the API. Regardless of whether caching is used, the output generated will be identical." So it's matching and avoiding the full reprocessing, but in a functionally identical way as reprocessing the whole conversation from scratch. Consider if the server with your conversation history crashed. It would be replayed on one without the cache.

littlestymaar · 2025-07-29T15:51:56 1753804316

Exactly, but caching doesn't work if you switch between providers in the middle of the conversation, which is my entire point.

majormajor · 2025-07-30T04:30:20 1753849820

If you're selectively faking things you don't care. You may not even be aware because the caching is transparent to you and you send the whole set of messages to the system each time either way. From the perspective of the person faking their model to look better than it is, it requires no special implementation changes.

And if you're faking your model to look better than it is, you probably aren't sending every call out to the paid 3rd party, you're more likely intentionally only using it to guide your model periodically.

littlestymaar · 2025-07-30T06:19:38 1753856378

> because the caching is transparent to you

It isn't when you look at your invoices though.

> aren't sending every call out to the paid 3rd party, you're more likely intentionally only using it to guide your model periodically.

I'd you do that, you're going to have to pay each token multiple times: both as inferred token on your model, and as input tokens on the third party and your model.

If the conversation are long enough (I didn't do the math but I suspect they don't even need to be that long) it's going to be costlier than just using the paid model with caching.

omneity · 2025-07-28T20:12:43 1753733563

Conversations are always "reprocessed from scratch" on every message you send. LLMs are practically stateless and the conversation is the state, as in nothing is kept in memory between two turns.

littlestymaar · 2025-07-29T15:50:02 1753804202

> LLMs are practically stateless

This isn't true of any practical implementation: for a particular conversation, KV Cache is the state. (Indeed there's no state across conversations, but that's irrelevant to the discussion).

You can drop it after each response, but doing so increase the amount of token you need to process by a lot in multi-turn conversations.

And my point was that storing the KV cache for the duration of the conversation isn't possible if you switch between multiple providers in a single conversation.

Jabrov · 2025-07-28T20:40:57 1753735257

Not exactly true ... KV and prompt caching is a thing

yahoozoo · 2025-07-28T21:06:02 1753736762

Assuming you include the same prompts in the new request that were cached in the previous ones.

throw310822 · 2025-07-28T22:57:28 1753743448

As far as I understand, the entire chat is the prompt. So at the each round, the previous chat up to that point could already be cached. If I'm not wrong, Claude APIs require an explicit request to cache the prompt, while OpenAI's handle this automatically.

littlestymaar · 2025-07-29T13:14:17 1753794857

I don't understand how you are downvoted…

actinium226 · 2025-07-29T11:14:48 1753787688

You should try gaslighting it and asking why it said it's GLM.