Why would they need to scrape the fediverse when they can get all of the data, and more, through federation? Also, this anti-scraping stance toward a public, transparent protocol is really weird: openness is the whole point of the protocol.
Complaining that data available on the public internet is being read seems very strange. Whatever happened to "Information wants to be free" or "The Net interprets censorship as damage and routes around it"?
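To make the "it's already public" point concrete, here is a minimal sketch in Python, assuming the requests library and a Mastodon instance that still allows unauthenticated API access (mastodon.social is just an example host):

    # Any federating Mastodon server exposes its public timeline over
    # plain HTTP; no credentials and no HTML scraping required.
    import requests

    resp = requests.get(
        "https://mastodon.social/api/v1/timelines/public",
        params={"limit": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for status in resp.json():
        print(status["account"]["acct"], status["url"])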
The information is used to build monopolies that strangle the independent web.
By definition there can be only one monopoly in each domain. In the AI world it's more like several 'fortresses'. Together they are ruining the click economy, the same economy that almost eliminated printed books and magazines. Well, attention is a limited resource.
The main difference is that the click economy did not rely on printed books and magazines' continued existence. It could produce its own original information: a magazine author could become a blogger and still write their own café reviews.
Generative AI still relies on the work of the creators whose livelihood it threatens for its training data. It still relies on someone else experiencing the real world, and describing it for them. It just denies them their audience or the fruit of their labour.
Someone here put it nicely: AI companies are eating their seed corn.
But restricting the flow of information is a really weird way of handling this issue. It's like digging potholes in the road just because you're upset that Teslas drive on it.
It's not that important now that AI has taken off. New models can be trained completely on generated data. That will give them their core abilities. As for real-world knowledge: whatever humans can gather, models can too.
> New models can be trained completely on generated data.
How does that account for all the things that change in the world, but in ways only humans can observe?
How can AI discover that a beloved tourist destination has turned to crap, or that the best vacuum cleaner of 2022 has a new challenger, or that German tipping culture is shifting, or that the café down the road has great banana bread but is a little loud on Saturdays?
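A toy illustration of the point being argued over, in Python with numpy; the "teacher" function is a stand-in for an existing model, not anyone's actual pipeline. The student fits only generated data and recovers the teacher exactly, but it also learns nothing the teacher didn't already know, which is the objection above:

    import numpy as np

    rng = np.random.default_rng(0)

    def teacher(x):
        # Stand-in for an already-trained model.
        return 3.0 * x + 1.0

    x_synth = rng.uniform(-1.0, 1.0, 1000)  # generated inputs, no human data
    y_synth = teacher(x_synth)              # generated labels

    # "Student" model: least squares fit on the synthetic corpus only.
    A = np.stack([x_synth, np.ones_like(x_synth)], axis=1)
    w, b = np.linalg.lstsq(A, y_synth, rcond=None)[0]
    print(w, b)  # ~3.0, ~1.0: the teacher is cloned, not exceeded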
Or it is being used to build the most useful information indexing and search algorithms ever created.
Until it starves out the websites and communities that provide the training data.
The circle of life.
"AI" corporations aren't just "scraping" the fediverse. They are DDOSing independent websites all over the internet. Blocking and hampering their scrapers is often the best and only solution for some small indie sites to remain financially viable. These companies are destroying the commons.
Even Hacker News users report being affected: https://news.ycombinator.com/item?id=43397361
There are countless examples of "AI" scrapers DDoSing independent websites, if you care to search for them.
Note: I do not endorse the linked blogger.
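For anyone running such a site, here is a minimal nginx sketch of the blocking approach, matching on self-identified crawler user agents. GPTBot, CCBot, ClaudeBot and Bytespider are real published agent names, but the list is illustrative and only stops crawlers that identify themselves honestly:

    # Inside a server {} block: refuse self-identified AI crawlers.
    if ($http_user_agent ~* (GPTBot|CCBot|ClaudeBot|Bytespider)) {
        return 403;
    }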
Yes, obviously. More people should scrape and archive the Fediverse.
Yes, obviously, next question.
People view robots.txt and llms.txt as some kind of binding contract. They're not, and expecting companies to follow them is naive.
Any data that is put on the public internet WILL be scraped and used for LLM training.
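For reference, the opt-out being discussed looks like this in robots.txt. GPTBot and CCBot are real crawler names, and, as the comment says, honoring it is entirely voluntary:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /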
Either that, or to continue building the shadow profiles we know they build, and to gain intelligence on their enemies and possible enemies of the current admin.
Nobody cares about robots.txt, nor should they.
I will never not be amused by people clutching pearls about this.