my impression is that most CL these days is existing large closed-source codebases, hence the price tag for those compilers (you're not trying it out for a bit, you're funding the compiler devs to work full-time on the issues you're actually having) and relatively little open-source activity for "finished" things -- if you're developing against internal libraries, it's hard to open-source just the part you intend to
(work at a CL shop; mostly SBCL users, but maybe 1/3 of people are die-hard ACL fans)
how does this compare to MoSA (arXiv:2505.00315)? do you require that there's a single contiguous window? and do you literally predict on position, or with a computed feature?
I predict a specific location, then put a window around it. You can of course predict a different location per head, or multiple window locations per head. The cost is negligible (a single emb×1 linear), so attention becomes a fixed cost per token, just like traditional windowed attention.

This doesn't solve memory consumption on its own, because you still have a KV cache, unless you only do attention over the initial embeddings, at which point you don't need the cache, just the token history. That's the tack I'm taking now, since I have other ways of providing long context at deeper layers that remain O(1) per predicted token and are parallelizable like standard attention.

I think this kind of architecture is the future: infinite context, fixed-size state, O(1) prediction, and externalized memory are all possible, and together they break the current context, memory, and compute problems. It's clear that once these kinds of models (mine or someone else's with the same properties) are properly tuned and well trained, token caching will be dead.
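To make the idea concrete, here is a minimal numpy sketch of the scheme described above: each query token predicts a window centre via one extra linear projection (the "single emb×1" cost), and attention is computed only over a fixed-size window around that centre. All names here are hypothetical illustrations, not code from any real model.

```python
import numpy as np

def predicted_window_attention(q, k, v, w_pos, window=4):
    """Windowed attention where each token predicts its own window centre.

    q, k, v: (T, d) arrays; w_pos: (d,) projection that predicts a
    position, i.e. the single emb x 1 linear mentioned above."""
    T, d = q.shape
    out = np.zeros_like(v)
    # One dot product per token, squashed to a position in [0, T-1].
    centres = (1.0 / (1.0 + np.exp(-(q @ w_pos)))) * (T - 1)
    for t in range(T):
        c = int(round(centres[t]))
        lo, hi = max(0, c - window), min(T, c + window + 1)
        # Standard softmax attention, but only over the predicted window,
        # so the cost per token is fixed regardless of sequence length.
        scores = q[t] @ k[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ v[lo:hi]
    return out, centres
```

A per-head variant would just use one `w_pos` (or several, for multiple windows) per head. Attending only over initial embeddings, as described above, means `k` and `v` come straight from the token embeddings, so no KV cache is needed.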
The point of an IRB is to act as an outside reviewer of _ethics_. IRBs aren't some checklist thing admin put in to protect the University's reputation, they exist as a direct reaction to huge amounts of unethical human experimentation occurring last century.
The point of an IRB is to stop you from nonconsensually sterilizing people. As long as the system stops that from happening, I don't care about the paperwork. It's not my concern.
The "ethical" issues with this study do not rise to the level that I care, so the only objection is that they didn't get the IRB to rubber stamp it beforehand, which I also don't care about.
There's usually an if(temp == 0) to change sampling methods to "highest probability" -- if you remove that conditional but otherwise keep the same math, that's not deterministic either.
If you remove the conditional and keep the same math, you divide by zero and get NaNs. In the limit as temperature goes to zero, you do in fact get maximum-likelihood sampling.
I'd assume that's just an optimization? Why bother sorting the entire list if you're just gonna pick the top token, linear time versus whatever your sort time is.
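A small sketch of the sampler being discussed, with the usual `temp == 0` special case. This is an illustrative implementation, not code from any particular inference library: dropping the conditional really does divide by zero (the max logit gives 0/0, a NaN), while the special case is a plain argmax, which is linear time with no sort or normalization needed.

```python
import numpy as np

def sample(logits, temp, rng):
    """Temperature sampling with the temp == 0 greedy special case."""
    if temp == 0:
        # Greedy decoding: O(n) argmax, no sort, no softmax.
        return int(np.argmax(logits))
    # Shift by the max for numerical stability before exponentiating.
    probs = np.exp((logits - logits.max()) / temp)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Without the conditional, `(logits - logits.max()) / 0.0` evaluates to `-inf` everywhere except the max entry, which is `0/0 = nan`, so the resulting distribution is garbage rather than merely non-deterministic. Taking `temp` small but nonzero instead recovers greedy behaviour, matching the limit argument above.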
Having said that, of course it's only as deterministic as the hardware itself is.
The likelihood that the top two logits are close enough to be hardware-dependent is pretty low. IIUC, it's more of an issue when you are using other picking methods.
In for example llama.cpp? Specific to the architecture or in general? Could you point out where this is happening? Not that I don't believe you, but I haven't seen that myself, and would appreciate learning deeper how it works.