I was trying to solve a betweenle a couple of weeks ago, had it down to the words immediately before and after the word-for-the-day, and couldn't think of it. I went to three different AI engines and asked them what word was between those two, alphabetically. All three engines repeatedly gave me "answers" that did not fall between the two words. Like, I'd ask "what 5-letter English words are alphabetically between amber and amble?" and they'd suggest aisle or armor. None of them understood 'alphabetically'.
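For reference, the check itself is trivial for a program, since it's just plain string comparison. A minimal Python sketch (using the words from my example):

```python
# Toy check for "does this word fall alphabetically between two others?"
# Plain lexicographic comparison is all the puzzle needs.
def between(lo: str, hi: str, candidate: str) -> bool:
    return lo < candidate < hi

print(between("amber", "amble", "ambit"))  # True
print(between("amber", "amble", "aisle"))  # False
print(between("amber", "amble", "armor"))  # False
```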
Try asking one to write a sentence that ends with the letter "r", or a poem that rhymes.
They know words as black boxen with weights attached for how likely they are to appear in certain contexts. Prediction happens by comparing the chain of these boxes leading up to the current cursor and using weights and statistics to fill in the next box.
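Roughly what that looks like in code, as a bare-bones sketch: this is a plain Markov chain with literal counts, not how a real LLM is built (those use learned embeddings and attention), but it shows the "boxes with weights" idea.

```python
import random
from collections import Counter, defaultdict

# Bare-bones word-level Markov chain: each word is an opaque "box", and all
# the model learns is how often one box follows a given chain of boxes.
def train(text, order=2):
    words = text.split()
    counts = defaultdict(Counter)
    for i in range(len(words) - order):
        counts[tuple(words[i:i + order])][words[i + order]] += 1
    return counts

def predict_next(counts, context):
    options = counts.get(tuple(context))
    if not options:
        return None
    words, weights = zip(*options.items())
    # Fill in the next box according to the observed frequencies.
    return random.choices(words, weights=weights)[0]

corpus = "the cat sat on the mat and the cat ate the fish"
model = train(corpus)
print(predict_next(model, ["the", "cat"]))  # "sat" or "ate", weighted by count
```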
They don't understand that those words are made of letters unless they have been programmed to break each word down into its component letters/syllables. None of them have been programmed to do this because that increases the already astronomical compute and training costs.
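As a toy illustration of why (real systems use subword tokenizers like BPE rather than whole words, but the effect is similar): by the time text reaches the model, the letters are already gone.

```python
# Toy "tokenizer": the model only ever sees opaque integer IDs, so the
# letters inside each word are invisible to it.
vocab = {"amber": 0, "ambit": 1, "amble": 2, "aisle": 3, "armor": 4}

def encode(text):
    return [vocab[word] for word in text.split()]

print(encode("amber ambit amble"))  # [0, 1, 2] -- nothing letter-shaped left to reason about
```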
About a decade ago I played with a language model whose Markov chain did predictions based on what letter came next instead of what word came next (a pretty easy modification of the base code). Working at the letter scale, it was surprisingly good at putting sentences and grammar together, comparable to the word-level version. It was also horribly inefficient to train (which is saying something in comparison to word-level prediction LLMs), because it needs to consider many more units (letters vs. words) leading up to the current one to maintain the same coherence. If the Markov chain was looking at the past 10 words, a word-level predictor has 10 boxes to factor into its calculations and training. If those words average 5 letters, then letter-level prediction needs to consider at least 50 boxes to maintain the same awareness of context within a sentence or paragraph. That's a five-fold increase in memory footprint, and an even greater increase in compute time (since most operations scale at least linearly with context length, and sometimes worse).
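It was something along these lines; not the original code, just a minimal reconstruction of the idea with a toy corpus.

```python
import random
from collections import Counter, defaultdict

# Minimal letter-level Markov chain: the same machinery as the word-level
# sketch above, but the units are single characters, so the context must
# span far more units to "see" the same stretch of text.
def train_chars(text, order=6):
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, seed, length=80):
    out = seed
    order = len(seed)
    for _ in range(length):
        options = counts.get(out[-order:])
        if not options:
            break
        chars, weights = zip(*options.items())
        out += random.choices(chars, weights=weights)[0]
    return out

# A real run would need vastly more text than this toy corpus.
corpus = "the cat sat on the mat and the cat ate the fish on the mat " * 20
model = train_chars(corpus, order=6)
print(generate(model, corpus[:6]))
```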
Taking that efficiency hit would let LLMs grasp sub-word concepts like alphabetization, rhyming, and root words, but the expense and energy requirements aren't worth this modest expansion of understanding.
Adding a GPT (Generative Pre-trained Transformer) just adds some plasticity to those weights and statistics beyond the Markov chain example I use above.
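A rough sketch of what that plasticity looks like: a single causal self-attention step computes context-dependent weights instead of fixed counts. Toy sizes, with random matrices standing in for learned ones.

```python
import numpy as np

# Minimal single-head causal self-attention: every position computes
# context-dependent weights over the positions before it, then mixes
# their value vectors accordingly.
def attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    causal_mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)  # only look backwards
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, 8-dim embeddings (toy sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```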
I just tried ChatGPT with all of these failure points: a 5-letter word that fits alphabetically between amber and amble (it gave ambit), a sentence that ends in the letter R (it ended with "door"), and a poem that rhymes (AABBCCBB). These things appear to be getting ironed out quite quickly. I don't think it'll be much longer before we'll have to make some serious concessions.