this post was submitted on 29 Nov 2023
360 points (98.9% liked)
Privacy
31975 readers
579 users here now
A place to discuss privacy and freedom in the digital world.
Privacy has become a very important issue in modern society, with companies and governments constantly abusing their power, more and more people are waking up to the importance of digital privacy.
In this community everyone is welcome to post links and discuss topics related to privacy.
Some Rules
- Posting a link to a website containing tracking isn't great, if contents of the website are behind a paywall maybe copy them into the post
- Don't promote proprietary software
- Try to keep things on topic
- If you have a question, please try searching for previous discussions, maybe it has already been answered
- Reposts are fine, but should have at least a couple of weeks in between so that the post can reach a new audience
- Be nice :)
Related communities
Chat rooms
-
[Matrix/Element]Dead
much thanks to @gary_host_laptop for the logo design :)
founded 5 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
I bet these are instances of over training where the data has been input too many times and the phrases stick.
Models can do some really obscure behavior after overtraining. Like I have one model that has been heavily trained on some roleplaying scenarios that will full on convince the user there is an entire hidden system context with amazing persistence of bot names and story line props. It can totally override system context in very unusual ways too.
I've seen models that almost always error into The Great Gatsby too.
This is not the case in language models. While computer vision models train over multiple epochs, sometimes in the hundreds or so (an epoch being one pass over all training samples), a language model is often trained on just one epoch, or in some instances up to 2-5 epochs. Seeing so many tokens so few times is quite impressive actually. Language models are great learners and some studies show that language models are in fact compression algorithms which are scaled to the extreme so in that regard it might not be that impressive after all.
How many times do you think the same data appears after a model has as many datasets as OpenAI is using now? Even unintentionally, there will be some inevitable overlap. I expect something like data related to OpenAI researchers to reoccur many times. If nothing else, overlap in redundancy found in foreign languages could cause overtraining. Most data is likely machine curated at best.