
That bike example struck me as both underwhelming (for the headline demo video) and even confusing.

1. It's not smart enough to recognize from the initial image that this is a bolt-style seat lock (which a human can).

2. The manual is not shown to the viewer, so I can't infer how the model knows this is a 4mm bolt (or if it is just guessing given that's the most likely one).

3. I don't understand how it can know the toolbox contains metric Allen wrenches.

Additionally, is this just the same vision model that exists in Bing Chat?



The bike shown in the first image is a Specialized Sirrus X. You can make out from the image of the manual that it says "spacer/axle/bolt specifications". Searching for this yields the following Specialized bike manual, which is similar: https://www.manualslib.com/manual/1974494/Specialized-Epic-E... -- there are some notable differences, but the Specialized Sirrus X manuals that are online aren't in the same style.

The prior page (8) shows "SEAT COLLAR 4mm HEX", and based on an image search for "seat collar", the part in question matches.

In terms of the toolbox, note that it only identified the location of the Allen wrench set. The advice was just "Within that set, find the 4 mm Allen (Hex) key". Had they replied with "I don't see any sizes in mm", the conversation could've continued with "Your Allen keys might be using SAE sizing. A compatible size will be 5/32, do you see that in your set?"
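The metric-to-SAE substitution above is simple arithmetic: 4 mm is about 0.157 in, and 5/32 in works out to about 3.969 mm. A minimal sketch, assuming a typical fractional-inch hex key set (the names `SAE_HEX_SIZES` and `nearest_sae` are hypothetical, and real kits vary):

```python
from fractions import Fraction

MM_PER_INCH = 25.4

# Assumed typical SAE (fractional-inch) hex key sizes; actual kits vary.
SAE_HEX_SIZES = [
    Fraction(1, 16), Fraction(5, 64), Fraction(3, 32), Fraction(7, 64),
    Fraction(1, 8), Fraction(9, 64), Fraction(5, 32), Fraction(3, 16),
    Fraction(7, 32), Fraction(1, 4), Fraction(5, 16), Fraction(3, 8),
]


def nearest_sae(metric_mm: float) -> tuple[Fraction, float]:
    """Return the SAE size closest to metric_mm, plus that size in mm."""
    best = min(SAE_HEX_SIZES, key=lambda s: abs(float(s) * MM_PER_INCH - metric_mm))
    return best, float(best) * MM_PER_INCH


size, size_mm = nearest_sae(4.0)
print(f"{size} in = {size_mm:.3f} mm")  # 5/32 in = 3.969 mm
```

Since 5/32 in is slightly smaller than 4 mm, it can sit a little loose in the bolt head, so the metric key is still the better choice when available.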


It bugged me that they made no mention of torque. The manual is really clear on that part with a big warning:

> WARNING! Correct tightening force on fasteners (nuts, bolts, screws) on your bicycle is important for your safety. If too little force is applied, the fastener may not hold securely. If too much force is applied, the fastener can strip threads, stretch, deform or break. Either way, incorrect tightening force can result in component failure, which can cause you to lose control and fall. Where indicated, ensure that each bolt is torqued to specification. The following is a summary of torque specifications in this manual...

The seat collar also probably has the max torque printed on it.

When they asked if they had the right tool, I would have preferred to see an answer along the lines of "ideally you should be using a torque wrench. You can use the wrench you have currently, but be careful not to overtighten."


> The seat collar also probably has the max torque printed on it.

Nope. There's no need for a torque wrench on that one.


Ah, good find. Yeah, I tried Bing, and it was able to read a photo of that manual page and understand that the seat collar takes a 4mm hex wrench (though it hallucinated a torque of 5 Nm instead of the correct 6.2 Nm, suggesting its table reading is imperfect).

Toolbox: I just found it too strong a claim to say you have the right tool when it really doesn't know that. :)

In the end, it does feel like the image reader is just bolted onto an LLM: basically doing object recognition and dumping the features into the LLM prompt.


Like a basic CLIP description: Tools, yellow toolbox, DEWALT, Allen wrenches, instruction manual. And then just using those keywords in the prompt. Yes, you’re right, it does feel like that.
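To make the "bolted on" speculation concrete, here is a small sketch of that kind of pipeline. Everything in it is hypothetical: `ImageTags`, `tag_image_once`, `build_prompt`, and the fixed label list are illustrative stand-ins, not how OpenAI's or Bing's models actually work (that isn't public), and other replies in this thread argue the real model does more than this.

```python
from dataclasses import dataclass


@dataclass
class ImageTags:
    """Output of a hypothetical one-shot, CLIP-style tagger."""
    labels: list[str]


def tag_image_once(image_id: str) -> ImageTags:
    # Hypothetical stand-in for a vision tagger: returns fixed labels
    # for illustration instead of actually analyzing image_id.
    return ImageTags(labels=["tools", "yellow toolbox", "DEWALT",
                             "Allen wrenches", "instruction manual"])


def build_prompt(tags: ImageTags, question: str) -> str:
    # The LLM only ever sees these keywords, never the pixels.
    return f"Image contains: {', '.join(tags.labels)}.\nUser: {question}"


tags = tag_image_once("toolbox.jpg")
print(build_prompt(tags, "Do I have the right tool?"))
```

In a design like this the tags are computed once and reused for every follow-up, so a question such as "which sizes are stamped on the keys?" can only be answered from the labels, never by looking at the image again.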


A few of these wouldn't be possible with something like that. Look at the last picture, the graph analysis.

https://imgur.com/a/iOYTmt0


Yep. This example basically convinced me that they were unable to figure out anything actually useful to do with the model's new capabilities. Which makes me wonder how capable the new model in fact is.


Yah, pretty sure it is the same feature that's been in Bing Chat for 2 months now. It really feels like there's only one pass of feature extraction from the image, preventing any detailed analysis beyond a coarse "what do you see". (Follow-up questions about things it likely didn't parse get heavily hallucinated answers.)

This is why they can't extract the seat post information directly from the bike when the user asks. There's no "going back and looking at the image".

Edit: nope, it's a better image analyzer than Bing


>Yah, pretty sure it is the same feature that's been in Bing Chat for 2 months now.

It's not. Feel free to try these queries:

https://twitter.com/ComicSociety/status/1698694653845848544?... (comic book page in particular, from a Be My Eyes user)

Or these https://imgur.com/a/iOYTmt0 (graph analysis in particular, last example) and see Bing fail them.


Right. It appeared that the response to the first image and question would have been the same if the image hadn't been provided.

I wasn't impressed with the demo, but we'll see what real-world results look like.
