I work as an AI engineer, and any company that has a legal department that isn't braindead isn't stealing content. The vast majority of content posted online is posted under a Creative Commons licence, which allows other people to do basically whatever they want with that content, so it's in no way, shape or form stealing.

[–] spacesweedkid27@lemmy.world 2 points 1 year ago* (last edited 1 year ago) (1 children)

Yeah, I know, but the press picks a few rare cases and claims they apply to everyone.

You know how it is.

[–] Shazbot@lemmy.world 4 points 1 year ago

It really depends on what the site's terms of service/usage agreement say about the content posted on the site.

For example, a site like ArtStation has users agree to a usage license that lets the site host and transmit the images in their portfolios. However, there is no language saying that visitors are also granted these rights. So you may be able to collect the site text under fair use, but the art itself requires reaching out to the artist for anything other than personal/educational use.

[–] r1veRRR@feddit.de 8 points 1 year ago

I remember when copying data wasn't theft, and the entire internet gave the IP holders shit for the horrible copyright laws...

[–] Randomperson1234@lemmy.world 1 points 1 year ago (1 children)

If it's published online, it's not theft. That's like saying that if I publish a book and someone uses it to learn a language, then they are stealing my book.

[–] spacesweedkid27@lemmy.world 9 points 1 year ago (1 children)

Ok, then picture this: a web scraper copies code whose copyright notice says that it is forbidden to modify the code, to publish it under a different name, or to sell it (for example, a method that calculates the inverse square root really fast).

If that code is used as training data, most language models may actually paste it back or write it with little change. That is because most language models are not built around a purpose given by the user, like an AI that is supposed to look at pictures of dogs and cats and decide which is which. Instead, they are based on the following scheme:

Read previous text
Predict what letter will follow
Repeat until user interferes.
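
To make those three steps concrete, here is a minimal sketch in Python. It is a toy character-level counting model, not how production language models are built (those use neural networks and predict tokens rather than letters). The names `train` and `generate` are made up for this example.

```python
import random
from collections import Counter, defaultdict

def train(text, order):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, prompt, order, max_chars=40, seed=0):
    """Extend `prompt` one character at a time (toy stand-in for a language model)."""
    rng = random.Random(seed)
    out = prompt
    for _ in range(max_chars):            # 3. repeat (here: until a length limit)
        context = out[-order:]            # 1. read the previous text
        followers = counts.get(context)
        if not followers:                 # context never seen in training: stop
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]  # 2. predict the next letter
    return out

counts = train("abcabcabc", order=2)
result = generate(counts, "ab", order=2, max_chars=4)
print(result)  # prints "abcabc"
```

The loop only ever sees the text so far and picks a likely next character, which is why the output depends so heavily on what was in the training data.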

Because language models work this way, if I trained one only on the novel "Alice in Wonderland", for example, there is a high probability that the model would reproduce parts of it.
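
A rough illustration of that effect, again with a toy character-counting model instead of a real neural network: the training text is just the opening sentence of the book (public domain), and the context length `ORDER = 8` and the helper names are choices made for this sketch. Because every 8-character context in that sentence has exactly one possible continuation, the model can only repeat what it has seen.

```python
from collections import Counter, defaultdict

# Opening words of "Alice's Adventures in Wonderland" (public domain).
TEXT = ("Alice was beginning to get very tired of sitting by her sister "
        "on the bank, and of having nothing to do")

def train(text, order):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def continue_greedily(counts, prompt, order, max_chars):
    """Always append the most frequent follower of the current context."""
    out = prompt
    for _ in range(max_chars):
        followers = counts.get(out[-order:])
        if not followers:          # unseen context (e.g. end of the text): stop
            break
        out += followers.most_common(1)[0][0]
    return out

ORDER = 8  # assumption for this sketch: 8 characters of context
model = train(TEXT, ORDER)
completion = continue_greedily(model, TEXT[:ORDER], ORDER, max_chars=len(TEXT))
print(completion == TEXT)  # prints True: the training text comes back verbatim
```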

But there is a way to reduce this problem: if we broaden the training data a lot, the chance that the output would be considered plagiarism goes down.
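
A small sketch of why more data helps, using the same kind of toy counting model (the sentences, `ORDER = 3` and the function names are invented for the example). It computes the probability that sampling reproduces one target sentence exactly, first with only that sentence as training data, then with a second sentence added that shares some contexts. This is only an illustration of the idea; real models can still memorize text, especially text that appears many times in the training data.

```python
from collections import Counter, defaultdict

def train(documents, order):
    """Count followers of each `order`-character context across all documents."""
    counts = defaultdict(Counter)
    for text in documents:
        for i in range(len(text) - order):
            counts[text[i:i + order]][text[i + order]] += 1
    return counts

def verbatim_probability(counts, text, order):
    """Probability that sampling reproduces `text` exactly, given its first `order` chars."""
    prob = 1.0
    for i in range(len(text) - order):
        followers = counts.get(text[i:i + order], Counter())
        total = sum(followers.values())
        if total == 0:
            return 0.0
        prob *= followers[text[i + order]] / total
    return prob

ORDER = 3
target = "alice was tired"
others = ["alice was happy and alice was bored"]  # hypothetical extra training data

narrow = train([target], ORDER)
broad = train([target] + others, ORDER)

p_narrow = verbatim_probability(narrow, target, ORDER)
p_broad = verbatim_probability(broad, target, ORDER)
print(p_narrow, p_broad)
```

With only the target sentence, the probability is 1.0. With the extra sentence, the context "as " now has three possible followers ("t", "h", "b"), so the probability of copying the target exactly drops to about one third.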

Upon closer inspection there is another problem, though: AI (at this point in time) has no influence from outside in the sense that humans do. A human lives every day with the chance that something happens that changes their brain through emotion, for example a chronic depression. This influences a person's output not only through their symptoms, but also in the way they would write text, for example.

The consequence is that an artist can express this emotional change through their art.

Every day influences the artist differently and inspires them with unseen and new thoughts.

AI (today) has the problem that it just retells the stories it has heard, again and again, like Aristotle (?) says.

At this point in time, AI is missing that outside influence.

If you want a model that can give you things that have never been written before, or that don't even resemble anything that has ever existed, you have to give it these outside influences.

And there is a big problem coming up: yes, this process could be implemented by training the AI further after it has already been launched, using reinforcement learning, but that would still need data input from humans, which is really annoying.

A way to make it easier would be to give the AI a device to run on, with sensors as well as output devices, so that it can learn from its sensors and use this information in a continued training phase after launch to gather more data and make the current events it perceives relevant.

As you can see, if we did that, we would have an AI that could do anything and learn from anything, which makes it really really fucking dangerous, but also really really fucking interesting.

[–] PopShark@lemmy.world 1 points 1 year ago (1 children)

lol that’s funny, my machine learning course used Alice in Wonderland as an example for us to train our projects on

[–] spacesweedkid27@lemmy.world 2 points 1 year ago (1 children)

That wasn't a coincidence. I watched a video some time ago where they used TensorFlow to do exactly that.

(I'm doing my CS bachelor's btw)

[–] PopShark@lemmy.world 1 points 1 year ago

Oh OK nice! Good luck!!

[–] FakinUpCountryDegen@lemmy.world 1 points 1 year ago

Well, theft of the labor of contributors to give to non-contributors is communism ... So, your statement is true, it's just broader than that.

[–] TimeNaan@lemmy.world 17 points 1 year ago

Online communism unless you actually want something for free.