Hacker News

But this content is presumably in its training set, no? I'd be interested if you did the same task for a collection of books published more recently than the model's last release.


To test this hypothesis, I just took the complete book "Advances in Green and Sustainable Nanomaterials" [0] and pasted it into the prompt, asking Gemini: "What absorbs thermal radiations and converts it into electrical signals?".

It replied: "The text indicates that graphene sheets present high optical transparency and are able to absorb thermal radiations with high efficacy. They can then convert these radiations into electrical signals efficiently."

Screenshot of the PDF with the relevant sentence highlighted: https://i.imgur.com/G3FnYEn.png

[0] https://www.routledge.com/Advances-in-Green-and-Sustainable-...


Ask it what material absorbs “infrared light” efficiently.

To me, that’s useful intelligence. I can already search text for verbatim matches, I want the AI to understand that “thermal radiations” and “infrared light” are the same thing.


> Answer the following question using verbatim quotes from the text above: "What material absorbs infrared light efficiently?"

> "Graphene is a promising material that could change the world, with unlimited potential for wide industrial applications in various fields... It is the thinnest known material with zero bandgaps and is incredibly strong, almost 200 times stronger than steel. Moreover, graphene is a good conductor of heat and electricity with very interesting light absorption properties."

Interestingly, the first sentence of the response actually occurs directly after the latter part of the response in the original text.

Screenshot from the document: https://i.imgur.com/5vsVm5g.png

Edit: asking it "What absorbs infrared light and converts it into electrical signals?" yields "Graphene sheets are highly transparent presenting high optical transparency, which absorbs thermal radiations with high efficacy and converts it into electrical signals efficiently." verbatim.


Fair point, but I also think something that's /really/ clear is that LLMs don't understand (and probably cannot). It's doing highly contextual text retrieval based on natural language processing for the query; it's not understanding what the paper means and producing insights.


Honestly I think testing these on fiction books would be more impressive. The graphene thing I'm sure shows up in some research papers.


Gemini works with brand new books too; I've seen multiple demonstrations of it. I'll try hunting one down. Side note: this experiment is still insightful even using model training material. Just compare its performance with the uploaded book(s) against its performance without them.
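A minimal sketch of that with/without comparison, assuming the `google-generativeai` Python client (the model name, prompt wording, and helper names here are illustrative, not from the thread):

```python
# Sketch of the "with vs. without uploaded book" comparison described above.
# build_prompts is a hypothetical helper; the client usage in run_comparison
# follows the google-generativeai library but is untested here.

def build_prompts(book_text: str, question: str) -> tuple[str, str]:
    """Return (with_context, without_context) prompts for the same question."""
    with_context = (
        f"{book_text}\n\n"
        f'Answer the following question using verbatim quotes from the text above: "{question}"'
    )
    without_context = f'Answer from your own knowledge: "{question}"'
    return with_context, without_context


def run_comparison(book_text: str, question: str) -> tuple[str, str]:
    # Requires `pip install google-generativeai` and a configured API key.
    import google.generativeai as genai

    model = genai.GenerativeModel("gemini-1.5-pro")
    prompts = build_prompts(book_text, question)
    return tuple(model.generate_content(p).text for p in prompts)
```

If the two answers match closely even without the book in context, that's evidence the material was already in the training set.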


I would hope that Byung-Chul Han would not be in the training set (at least not without his permission), given he's still alive and not only is the legal question still open but it's also definitely rude.

This doesn't mean you're wrong, though.


It's pretty easy to confirm that copyrighted material is in the training data. See the NYT lawsuit against OpenAI for example.


Part of that back-and-forth is the claim that "this specific text was copied a lot all over the internet, making it show up more in the output", which means it's not a useful guide to cases where a single copy was added to The Pile and not removed before the model was trained.

(Or worse, that Google already had a copy because of Google Books and didn't think "might training on this explode in our face like that thing with the Street View WiFi scanning?")



