this post was submitted on 26 Aug 2023
67 points (94.7% liked)

Artificial Intelligence

all 15 comments
[–] Ubermeisters@lemmy.zip 21 points 1 year ago* (last edited 1 year ago) (3 children)

I still don't see how adding a tiny little line of code to robots.txt means OpenAI can no longer scrape data. Seems like they can still go in and harvest info manually, right? And that's not exactly a large list of companies.
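
For reference, the line in question is a short robots.txt rule naming OpenAI's crawler user agent, GPTBot. Here is a minimal sketch of how a compliant crawler would read it, using Python's standard `urllib.robotparser`. The `example.com` URL and the `SomeOtherBot` name are placeholders for illustration, not taken from the article, and the helper `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rule a site adds to robots.txt to opt out of OpenAI's crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and second bot name, used only for illustration.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
print(gptbot_allowed, other_allowed)
```

Note that the check runs inside the crawler. The site only publishes the rule; nothing in it stops a client that never performs the check.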

[–] isaiah@lemmy.world 35 points 1 year ago (2 children)

That's the honor system, for sure. OpenAI has promised that their bots will honor this line in robots.txt. But unless these companies have implemented some detect-and-block method on their own, there is nothing physically stopping the bots from gathering data anyway.
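
As a rough idea of what a detect-and-block method could look like, here is a minimal sketch of a WSGI middleware that refuses requests whose User-Agent header contains a blocked token. The names `BLOCKED_AGENT_TOKENS`, `block_bots`, and `hello_app` are made up for this example, and matching on a User-Agent substring is an assumption; real deployments may rely on IP ranges or other signals instead.

```python
# Assumption: bots are identified by a substring of the User-Agent header.
BLOCKED_AGENT_TOKENS = ("gptbot",)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so requests from blocked agents get a 403."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site: always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


app = block_bots(hello_app)
```

Even this only stops bots that identify themselves honestly. A scraper that sends a browser-like User-Agent passes straight through.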

[–] Ubermeisters@lemmy.zip -5 points 1 year ago (2 children)

Exactly, so this is purely performative on their part. Businesses shouldn't virtue signal, it's a little pathetic.

[–] senoro@lemmy.ml 6 points 1 year ago

Companies do honour robots.txt. Maybe not the small project made by some guy somewhere. But large companies do.

[–] dezmd@lemmy.world 6 points 1 year ago (2 children)

That's not really virtue signaling unless the org is using it for PR reasons; it's just asking others to respect your wishes in a cooperative community sense rather than as a legal demand. This is a more technical side of things than the politics everyone injects.

I'm a proponent of these LLaMA systems; they are really just the next iteration of search systems. Just like search engines, they use traffic and server time with their queries, and it's good manners for everyone to follow the robots.txt limits of every site, but under an open internet a third party is still free to read the site for whatever reason. If you don't want to take part in the open community part of the open internet, you don't have to expose anything at all to public access that can be scraped.

I rarely read paywalled news sites because they opt not to be part of the open community of information sharing that our open internet represents.

[–] pjhenry1216@kbin.social 1 points 1 year ago (1 children)

next iteration of Search systems.

Except it doesn't credit the source nor direct traffic. So... almost an entirely different beast.

[–] hemko@lemmy.dbzer0.com 4 points 1 year ago

That depends fully on the implementation. Bing does give you sources, but ChatGPT generates "original content" based on all the shit it's scraped.

[–] Burp@kbin.social -1 points 1 year ago

A bit of a tangent, but I've recently shifted my focus to reading content behind paywalls and have noticed a significant improvement in the quality of information compared to freely accessible sources. The open internet does offer valuable content, but there's often a notable difference in journalistic rigor when a subscription fee is involved. I suspect that this disparity might contribute to the public's vulnerability to disinformation, although I haven't fully explored that theory.

[–] agressivelyPassive@feddit.de 7 points 1 year ago

Robots.txt never was any hurdle, it's just a flag which you are free to ignore.

[–] foggy@lemmy.world 7 points 1 year ago (2 children)

Why wouldn't they? It's totally legal to write up something that visits pages in a genuine browser and takes all the content from the page source.

[–] Ubermeisters@lemmy.zip 4 points 1 year ago* (last edited 1 year ago) (1 children)

This attitude right here is my point. Thanks for unintentionally making my point for me ;)

legal =/= right

I'm so tired of people thinking the boundaries of the law are the boundaries of what's socially acceptable; they aren't. The boundary of the law is where we get so fed up with you that we arrest your ass. The grey area in the middle, where you are a shit human but not doing anything illegal, is not a place to brag about being.

[–] foggy@lemmy.world 2 points 1 year ago

I was intentionally making your point!

And the boundary of laws, where fines are the punishment, is simply wealth.

If breaking a law whose punishment is a fine earns me more than the fine sets me back, then it's a no-brainer, it's profit.

[–] pjhenry1216@kbin.social 2 points 1 year ago

Not if there's a EULA forbidding it. That's part of the reason for robots.txt. It's sort of the agreement bots have to pass through, versus the one a person sees.

[–] ares35@kbin.social 11 points 1 year ago

And if you mark those in this list that are using AI, how long till the Venn diagram is a circle?