this post was submitted on 10 Jul 2023
Technology
I tested by asking ChatGPT 3.5 specific questions about The Bedwetter, and it seems like it was not trained on the full text of the book. I asked it what the first sentence is, and then what the second paragraph is, and it gave plausible but incorrect answers. I asked it for the table of contents, and then whether a specific chapter was in the book, and it said "my responses are generated based on pre-existing data and do not have real-time access to specific book content". I asked who wrote the foreword and who wrote the afterword. It said Patton Oswalt wrote the foreword and that there is no afterword. In reality, Sarah wrote the foreword and God wrote the afterword.
ChatGPT conversation
Table of contents and first chapter from Google Books.
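For anyone who wants to repeat this kind of check less by eye, here is a rough sketch of how a model's answer could be scored against the real text. It only compares two strings you paste in yourself; it does not call ChatGPT or fetch the book. The function names `overlap_ratio` and `longest_shared_run` are made up for this example, and a low score only suggests the passage was not reproduced verbatim, not that the book was absent from the training data.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(text.lower().split())


def overlap_ratio(model_answer: str, reference: str) -> float:
    """Similarity between 0.0 and 1.0; 1.0 means identical after normalization."""
    a, b = normalize(model_answer), normalize(reference)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def longest_shared_run(model_answer: str, reference: str) -> str:
    """Longest stretch of characters that appears in both texts."""
    a, b = normalize(model_answer), normalize(reference)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


if __name__ == "__main__":
    # Hypothetical placeholder strings, not quotes from the book or from ChatGPT.
    answer = "the quick brown fox"
    actual = "a quick brown dog"
    print(overlap_ratio(answer, actual))
    print(repr(longest_shared_run(answer, actual)))
```

A long shared run across several paragraphs would point toward verbatim memorization; a plausible but different answer, like the ones described above, would score low.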
LLMs compress data; there’s no way ChatGPT could remember every detail of the book alongside all the other information it stores in its encodings. The issue isn’t whether the entire text of the book is contained within the encodings; it’s whether it was trained on the book in the first place.
GPT-3 is roughly 800GB, while the entirety of the English Wikipedia is around 10GB compressed. So yeah, it doesn't store every detail of everything, but LLMs do memorize a lot of things verbatim. Also see https://bair.berkeley.edu/blog/2020/12/20/lmmem/
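As a back-of-the-envelope check on that size comparison, here is a small sketch. It assumes the published 175 billion parameter count for the largest GPT-3 model and simply multiplies by bytes per weight; the 10GB Wikipedia figure is taken from the comment as-is, and the helper name `weights_size_gb` is made up for this example. Real checkpoints can differ depending on precision and what else is stored with them.

```python
GPT3_PARAMETERS = 175_000_000_000  # published parameter count for the largest GPT-3 model
WIKIPEDIA_COMPRESSED_GB = 10       # figure quoted in the comment, not independently checked


def weights_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Raw size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


fp32_gb = weights_size_gb(GPT3_PARAMETERS, 4)  # 32-bit floats: 700.0 GB
fp16_gb = weights_size_gb(GPT3_PARAMETERS, 2)  # 16-bit floats: 350.0 GB

print(f"fp32 weights: {fp32_gb:.0f} GB")
print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"fp32 weights vs. Wikipedia: {fp32_gb / WIKIPEDIA_COMPRESSED_GB:.0f}x")
```

At 32-bit precision this lands in the same ballpark as the figure above, so the weights have room for far more than one compressed Wikipedia, even though that is nowhere near enough to hold everything in the training data verbatim.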