my impression is that most CL these days is existing large closed-source codebases, hence the price tag for those compilers (you're not trying it out for a bit, you're funding the compiler devs to work full-time on the issues you're actually having) and relatively little open-source activity for "finished" things -- if you're developing against internal libraries, it's hard to open-source just the part you intend to
(work at a CL shop; mostly SBCL users, but maybe 1/3 of people are die-hard ACL fans)
how does this compare to MoSA (arXiv:2505.00315)? do you require that there's a single contiguous window? and do you literally predict on position, or with a computed feature?
I predict a specific location, then put a window around it. You can of course predict a different location per head, or multiple window locations per head. The cost is negligible (a single emb×1 linear), so attention becomes a fixed cost per token, just like traditional windowed attention.

This doesn't solve memory consumption on its own, because you still have a KV cache, unless you only do attention over the initial embeddings, at which point you don't need the cache, just the token history. That's the tack I'm taking now, since I have other ways of providing long context at deeper layers that remain O(1) per predicted token and are parallelizable like standard attention.

I think this kind of architecture is the future: infinite context, fixed-size state, O(1) prediction, and externalized memory are all possible, and together they break the current context, memory, and compute problems. It's clear that once these kinds of models (mine or someone else's with the same properties) are properly tuned and well trained, token caching will be dead.
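To make the idea concrete, here is a minimal numpy sketch of the scheme described above: each query token predicts a window centre via one extra linear projection (the "single emb×1" cost), and attention is computed only over a fixed-size window around that centre. All names here are hypothetical illustrations, not code from any real model.

```python
import numpy as np

def predicted_window_attention(q, k, v, w_pos, window=4):
    """Windowed attention where each token predicts its own window centre.

    q, k, v: (T, d) arrays; w_pos: (d,) projection that predicts a
    position, i.e. the single emb x 1 linear mentioned above."""
    T, d = q.shape
    out = np.zeros_like(v)
    # One dot product per token, squashed to a position in [0, T-1].
    centres = (1.0 / (1.0 + np.exp(-(q @ w_pos)))) * (T - 1)
    for t in range(T):
        c = int(round(centres[t]))
        lo, hi = max(0, c - window), min(T, c + window + 1)
        # Standard softmax attention, but only over the predicted window,
        # so the cost per token is fixed regardless of sequence length.
        scores = q[t] @ k[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ v[lo:hi]
    return out, centres
```

A per-head variant would just use one `w_pos` (or several, for multiple windows) per head. Attending only over initial embeddings, as described above, means `k` and `v` come straight from the token embeddings, so no KV cache is needed.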
The point of an IRB is to act as an outside reviewer of _ethics_. IRBs aren't some checklist thing admin put in to protect the University's reputation, they exist as a direct reaction to huge amounts of unethical human experimentation occurring last century.
The point of an IRB is to stop you from nonconsensually sterilizing people. As long as the system stops that from happening, I don't care about the paperwork. It's not my concern.
The "ethical" issues with this study do not rise to the level that I care, so the only objection is that they didn't get the IRB to rubber stamp it beforehand, which I also don't care about.
There's usually an if(temp == 0) to change sampling methods to "highest probability" -- if you remove that conditional but otherwise keep the same math, that's not deterministic either.
If you remove the conditional and keep the same math, you divide by zero and get NaNs. In the limit as temperature goes to zero, you do in fact get maximum-likelihood sampling.
I'd assume that's just an optimization? Why bother sorting the entire list if you're just gonna pick the top token, linear time versus whatever your sort time is.
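A small sketch of the sampler being discussed, with the usual `temp == 0` special case. This is an illustrative implementation, not code from any particular inference library: dropping the conditional really does divide by zero (the max logit gives 0/0, a NaN), while the special case is a plain argmax, which is linear time with no sort or normalization needed.

```python
import numpy as np

def sample(logits, temp, rng):
    """Temperature sampling with the temp == 0 greedy special case."""
    if temp == 0:
        # Greedy decoding: O(n) argmax, no sort, no softmax.
        return int(np.argmax(logits))
    # Shift by the max for numerical stability before exponentiating.
    probs = np.exp((logits - logits.max()) / temp)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Without the conditional, `(logits - logits.max()) / 0.0` evaluates to `-inf` everywhere except the max entry, which is `0/0 = nan`, so the resulting distribution is garbage rather than merely non-deterministic. Taking `temp` small but nonzero instead recovers greedy behaviour, matching the limit argument above.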
Having said that, of course it's only as deterministic as the hardware itself is.
The likelihood that the top two logits are close enough to be hardware-dependent is pretty low. IIUC, it's more of an issue when you are using other picking methods.
In for example llama.cpp? Specific to the architecture or in general? Could you point out where this is happening? Not that I don't believe you, but I haven't seen that myself, and would appreciate learning deeper how it works.