Hacker News | neonbjb's comments

> So it made me wonder. Is Brainf*ck the ultimate test for AGI?

Absolutely not. I'd bet a lot of money this could be solved with a decent amount of RL compute. None of the stated problems are actually issues with LLMs after on-policy training is performed.


> None of the stated problems are actually issues with LLMs after on-policy training is performed

But still, isn't it a major weakness that they have to RL on everything that doesn't have much data? That really weakens the case for calling it true AGI.


No.

AGI would be a universal learner, not a magic genie. It still needs to do learning (RL or otherwise) in order to do new tasks.


> It still needs to do learning (RL or otherwise) in order to do new tasks.

Why? As in: why isn't reading the Brainfuck documentation enough for Gemini to learn Brainfuck? I'd allow for a 3-7 day learning curve, like perhaps a human would need, but why do you need to kinda redo the whole model (or big parts of it) just so it can learn Brainfuck or some other tool? Either the learning (RL or otherwise) needs to become far more efficient than it is today (it currently takes weeks? months? billions of dollars?) or it isn't AGI, I would say. Not in a practical/economic sense, and I believe not in the philosophical sense of how we all envisioned true generality.


Yes, we do. If you worked at Google you'd know Moma. Our Moma is an internal version of chat. It is very good.


I also work at OpenAI. Every tender offer has made full payouts to previous employees. Sorry to ruin your witch hunt.


I think the fact that you consider that a defense is a good illustration of why I had to ask that question. ("Yes, the gun is on the table, but the trigger has never been pulled. Sorry to ruin your witch hunt.")


I work for OpenAI.

o4-mini gets much closer (but I'm pretty sure it fumbles at the last moment): https://chatgpt.com/share/680031fb-2bd0-8013-87ac-941fa91cea...

We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.


Thanks for the reply. I am not sure vision is the failing point here so much as logic. I routinely try to get these models to solve difficult puzzles or coding challenges (the kind that a good undergrad math major could probably solve, but that most people would struggle with). They almost always fail, even with help.

For example, Jane Street's monthly puzzles. Surprisingly, the new o3 was able to solve this month's (previous models were not), which was an easier one. Believe me, I am not trying to minimize the overall achievement -- what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.

That might not even be enough, but it should be the minimum bar for even having the conversation.


> what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems

But why? Why should Artificial General Intelligence require things a good chunk of humans wouldn't be able to do? Are those people no longer general intelligences?

I'm not saying this definition is 'wrong', but you have to realize that at this point the individual words of that acronym no longer mean anything.


Sure, there's no authority who stamps the official definition.

I'll make my case. To me, if you look at how the phrase is usually used -- "when humans have achieved AGI...", etc. -- it evokes a science fiction turning point that implies superhuman performance in more or less every intellectual task. It's general, after all. I think of HAL or the movie Her. It's not "Artificial General Just-Like-Most-People-You-Know Intelligence". Though we are not there yet, either, if you consider the full spectrum of human abilities.

Few things would demonstrate general superhuman reasoning ability more definitively than machines producing new, useful, influential math results at a faster rate than people. With that achieved, you would expect it could start writing fiction and screenplays and comedy as well as people too (it's still very far from that, imo). But maybe not; maybe those skills develop at different paces, in which case I still wouldn't want to call it AGI. But I think truly conquering mathematics would get me there.


A standard term people use for what you describe is superintelligence, not AGI.

Current frontier models are better than the average human in many skills but worse in others. Ethan Mollick calls it a “jagged frontier”, which sounds about right.


You're describing superintelligence. This is why these conversations always need to start with definitions.


You're missing the fact that requests are batched. It's 70 tokens per second for you, but also for 10s-100s of other paying customers at the same time.
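
Back-of-the-envelope, the batching math looks something like this (the 70 tok/s per-user figure is from above; the batch size and price are purely illustrative assumptions):

    # Back-of-envelope batched-inference economics. The 70 tok/s per-user
    # figure is from the thread; batch size and price are assumptions.
    per_user_toks_per_sec = 70
    batch_size = 64        # assumed concurrent requests sharing the hardware
    price_per_mtok = 10.0  # assumed $ per million output tokens

    aggregate_toks_per_sec = per_user_toks_per_sec * batch_size  # 4480 tok/s
    revenue_per_hour = aggregate_toks_per_sec * 3600 / 1e6 * price_per_mtok
    print(f"{aggregate_toks_per_sec} tok/s aggregate -> ${revenue_per_hour:.0f}/hr")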


All these efficiencies just increase OpenAI's margin on inference. Of course it's not "one cluster per customer" and of course a customer can't saturate a cluster by themselves, my illustration was only to point out that the economics work.


You wouldn't do that to this model. It finds its own mistakes and corrects them as it is thinking through things.


No model is perfect; the less I can see into what it's “thinking”, the less productively I can use it. So much for interpretability.


I'm James Betker.

Of course architecture matters in this regard, lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household where one has a severe disability.

What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.

Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks: (1) a Mamba SSM and a Transformer, both on the Pile; (2) two Transformers, one trained on the Pile, the other on Reddit comments. All are trained to the same MMLU performance.

I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
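
A rough sketch of the kind of comparison I mean, using public Pile-trained checkpoints as stand-ins (the model names and prompts are just illustrative; this is an eyeball test, not a rigorous metric):

    # Sample from a Mamba SSM and a Transformer, both trained on the Pile,
    # and compare their typical completions. Checkpoint names are illustrative
    # stand-ins; any comparably sized Pile-trained pair would do.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    prompts = [
        "The most important thing to understand about deep learning is",
        "In a surprising turn of events, scientists discovered",
    ]

    def sample(model_name):
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        outs = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            gen = model.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.9)
            outs.append(tok.decode(gen[0], skip_special_tokens=True))
        return outs

    # Case (1): same data, different architectures.
    # Prediction: the typical completions look nearly identical.
    mamba_outs = sample("state-spaces/mamba-1.4b-hf")  # SSM, Pile-trained
    pythia_outs = sample("EleutherAI/pythia-1.4b")     # Transformer, Pile-trained

    for a, b in zip(mamba_outs, pythia_outs):
        print("MAMBA:", a, "\nTRANSFORMER:", b, "\n---")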


There aren't many people who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments these days.

You, sir, are my hero.


Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.


If Mamba really was as capable as a Transformer on tasks requiring accurately attending to long context, then there'd be no need for Jamba (a Mamba+Transformer hybrid).

Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...


Yeah, I'm not sure how someone could interpret what you said in the way people are citing here. It's actually obvious that you are right in the context of data in LLMs. Look at Llama 3, for example: there are minimal architectural changes, yet its performance is almost at the level of GPT-4. The biggest change was in the dataset.


As an employee of OpenAI: fuck you and your condescending conclusions about my peers and my motivations.


I’m curious about your perceptions of the (median) motivations of OpenAI employees - although of course I understand if you don’t feel free to say anything.


"condescending conclusions" - ask anyone outside of tech how they feel when we talk to them...


Users here often get the narrative and motivations deeply wrong; I wouldn't take it too personally. (Speaking as a peer.)


I don't think the plan is an occasional rocket launch long term. I think the plan is to launch rockets as fast as humanly possible.


Boca Chica is a test facility. SpaceX is only allowed 5 launches of the Starship second stage and 5 launches of the full stack each year (page 17 in [1]). More frequent launches will require a new environmental review. They plan for production launches to occur out of Florida, and have already begun permitting and building a launch site there.

[1] https://www.faa.gov/sites/faa.gov/files/2022-06/PEA_for_Spac...


They've built Starbase there, with all the rocket assembly facilities and the new expansions. Unless they're planning on shipping all the rockets via barge, I think it's safe to assume they'll send them to Florida by launching from Boca Chica.


They're building factories at Cape Canaveral too; the Starbase facilities have been designated as prototyping facilities. Any changes to production would be made, tested, and optimized there first, then brought over to other production facilities like Cape Canaveral. Although initially they might still have to transport boosters and ships via barge to the Cape.


The facilities at Starbase are far more than prototyping facilities. If that were the case, they wouldn't have needed a Megabay, let alone be almost completely finished building a second. They would have just stood pat with the High Bay etc. Same with the new buildings replacing the temporary tents. Starbase will be cranking out LOTS of boosters and Starships.


When I say prototyping facilities, I mean as prototyping for the other factories. As Musk has previously said, since they want to mass produce Starships and boosters, the challenge is to design "the machine that builds the machine". That's why I mentioned changes to production being tested and optimized there rather than changes to the vehicles specifically.


Has Musk, Shotwell or anyone at SpaceX ever said that this was the case? I haven't seen any sign of this at all. And again, if Starbase was just for prototyping workflows, there's no need for two Megabays.


I could've sworn that Musk mentioned that Starbase would be mainly relegated to prototyping after the FAA limited them to 5 launches/year out of there. I think it was in the presentation where they announced the plans to work with T-Mobile? I'll see if I can dig it up later.


Why not ship by barge? It'll be far cheaper than the fees and labour costs associated with a rocket launch. Both Brownsville and Cape Canaveral are well set up for barge shipping.

Also, they're expanding their factory in Florida, so they could be built on site.


There's that protected-from-open-ocean barge lane (I forget what it's called) that I think runs right from Brownsville to Cape Canaveral, or close enough; I think they plan on using that.

Edit: https://en.wikipedia.org/wiki/Intracoastal_Waterway


Sure you could ship by barge. It's just that SpaceX has never said that's the plan. Looking at the expanding factory footprint at Boca Chica, it's obvious that this isn't just a test facility, but a production facility. My hunch is that the current limits of 5 per year will be expanded once they prove it "safe" to launch from the site. They'll boil that frog slowly.


While you may be right, I just want to note that SpaceX isn't just expanding its factory footprint at Boca Chica. They will be manufacturing Starship in Florida as well and have already started expanding there too.


They will be barging ships to Florida, and long term, when you have multiple launches per day, they will be launching ships from barges.


They've pretty much scrapped their initial plans to use old oil platforms for launches, and the size of a barge needed to launch (and retrieve the first stage) would be tremendous. That idea is a complete non-starter.


They did scrap their initial plans, because they won't need that capability until they're launching like 1000 times per year. That might not be until one or two decades from now. Would it make sense to try to maintain that capability starting now and then just let it sit idle for two decades?

They have mentioned recently that they do still plan to go to barge launching when they get to 1000 flights per year.

The idea is not crazy, at least not any more than anything else related to Starship. But it is not needed right now.


Not from Boca. Boca is a testing ground, since they aren't allowed to test elsewhere.


It's cool that this is starting to approach real-time video territory (real-time video is ~30 images per second; this claims close to 1 image/sec).


Once you reach a few fps, you can use other techniques to interpolate frames.
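
The crudest version is just cross-fading between consecutive frames (a naive sketch; real interpolators use optical flow or learned models like RIFE/FILM, and do far better):

    # Naive frame interpolation: linearly blend each pair of consecutive
    # frames to multiply the frame rate. Only a baseline -- optical-flow or
    # learned interpolators handle actual motion far better.
    import numpy as np

    def interpolate(frames, factor=2):
        """Insert factor-1 blended frames between each consecutive pair."""
        out = []
        for a, b in zip(frames[:-1], frames[1:]):
            out.append(a)
            for i in range(1, factor):
                t = i / factor
                out.append((1 - t) * a + t * b)
        out.append(frames[-1])
        return out

    # e.g. 1 fps of generated 512x512 RGB frames -> 4 fps via blending
    frames = [np.random.rand(512, 512, 3).astype(np.float32) for _ in range(5)]
    smooth = interpolate(frames, factor=4)  # 4*(5-1)+1 = 17 frames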


SD is trained on static images and doesn't have a clue about motion. Interpolation won't help with that. See Deforum videos: they don't make any sense. Fast generation will help you iterate faster, though.

