I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts. Sometimes I even quote from them. A lot of them I didn't even pay for, I borrowed them from libraries or friends.
Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps? An implement that does the same but (A) at scale and (B) without thought or free will or agency, completely at the bidding of its operator, for profit, has no such protections. Instead, the operator carries all responsibility (in this case, Meta).
If a software service had legal protections like that, sure, I could build one that returns you any book you request and say that the service had integrated it into its worldview. Who can check, eh?
* Actually, in some countries you could be in trouble for reading a book and incorporating it into your worldview, to say nothing about quoting it, but let’s set that aside.
>Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps?
Not a relevant factor when it comes to copyright law. Fair use (the doctrine most applicable here) applies regardless of whether you're a student incorporating news articles into your work or Google making thumbnails and displaying them in its search results.
This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result). And even then it has been in legal trouble, in some countries (like Australia) more than others.
Furthermore:
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
I do not see “automated generation of derivative works of arbitrary nature” in it.
>This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result).
The point isn't that AI training is legal because it's like generating thumbnails; that is being argued in the courts right now. The point is that fair use exemptions aren't limited to "being a conscious human being enjoying human rights", as Google generating thumbnails and snippets using computers shows.
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
> The point is that fair use exemptions isn't limited to "being a conscious human being enjoying human rights"
Sure. However, my point is that this is not fair use*, so other principles need to be applied. Whether the legal systems of various countries find that fair use applies here or not, I agree we have yet to see.
* At least in cases where it’s an LLM operated at scale for profit (which I suppose would not hold for Meta’s models if they were truly open, but that’s not the case if they require obtaining a license in some conditions).
>Sure. However, my point is that this is not fair use (at least in cases where it’s an LLM operated for profit), so other principles need to be applied.
This isn't a complete argument. Most of the AI companies' case relies on the claim that AI models are "transformative". That's a plausible claim, and as Perfect 10 v. Google and Authors Guild, Inc. v. Google, Inc. have shown, being a for-profit company is hardly a disqualification from fair use protection.
“Transformative” is always a grey area. If my service just returns you the book you requested, but in upper case, then it has been transformed.
But sure, the “transformative” argument is the one that could apply (and I believe Google did use it to argue its case), if it can be shown that an LLM cannot verbatim reproduce a given work (which, incidentally, is something that you, a warm-blooded fleshy human with agency who has the freedom to read books, cannot do, but LLMs have been shown to do).
That said, the relevant laws existed before LLMs, and many are outdated. If the goal is to balance reasonable uses against protecting the original output of authors that ultimately drives innovation and creativity, I am not sure the preexisting laws are continuing to fulfil their function, but that’s my opinion.
>But sure, the “transformative” argument is the one that could apply (and even I believe Google used it to argue its case), if it can be shown that an LLM can not verbatim reproduce a given work.
You have to try pretty hard to get LLMs to reproduce a work verbatim, especially any lengthy passages that aren't famous (and thus re-quoted on the internet a bazillion times). Moreover just because LLMs can reproduce a work verbatim if you try hard enough doesn't mean it's not transformative. Google search snippets and google book search has been ruled "transformative" by the courts, but if you tried hard enough you can use them to extract the entire work.
>That said, the relevant laws existed before LLMs, and many are outdated. If the goal is to balance reasonable uses against protecting the original output of authors that ultimately drives innovation and creativity, I am not sure the preexisting laws are continuing to fulfil their function, but that’s my opinion.
AFAIK the era of mining the public internet or published works for AI training data is over, or at least coming to an end. Everything that could be mined has already been mined, and besides, the internet is getting increasingly polluted by AI output. Private training data is where it's at now, whether that means sourcing document troves from companies (e.g. emails, documentation, source code, etc.) or paying "AI annotators" to produce training data for you. If the argument is that human authors should get a cut of AI profits because their works were "stolen" to train the models, it's going to be an increasingly losing argument, because it doesn't have a leg to stand on for private training data.
> If the argument is that human authors should get a cut of AI profits because their works were "stolen" to train the models, this is going to be an increasingly losing argument, because it doesn't have a leg to stand on for private training data.
The argument can be made that LLMs could not be created without expropriating the original works of all the authors they were trained on, and that argument would in fact be true and have quite sturdy legs as far as I’m concerned.
It’s not some historical episode from forgotten times; it started less than half a decade ago, and I would be surprised if it’s not still ongoing (your argument about synthetic training data is forward-looking).
>The argument can be made that LLMs could not be created without expropriating the original works of all the authors they were trained on, and that argument would in fact be true and have quite sturdy legs as far as I’m concerned.
That makes as much sense as "American industry was built on the backs of British inventors (back in the day, the US was the 'China' of IP), so Britain should get perpetual (?) royalties from the US economy".
So we’re back to the human vs. unthinking machine distinction. American inventors were human. We’re going in circles, and this article was hidden on HN anyway.
> I do not see “automated generation of derivative works of arbitrary nature” in it
The “automated” isn’t really key. If you read a book, and learn from it, and are able to use that knowledge in other contexts, should you pay a licensing fee? It doesn’t matter if “you” is a human or machine.
“Automated” is key. You are not an automaton, not a machine; you do not scale infinitely with compute power. Unlike a machine, you have free will and agency, and the legal frameworks of developed countries grant you human rights that include freedom. That was, in fact, my entire point.
I just don’t get your angle. My point was that the human is the one who has certain freedoms and the one who bears responsibility. If you read a bunch of books in a bookstore without buying them and use your imperfect memory of them to do your job better and get paid more, it is shady, but if you are not shooed away by the store owner you have the freedom to do it. No one can extract the books you already read from your brain, and you did not sign an NDA. But if you set up an industrial-scale book scanner in the same store, the owner will call the police on you, and you cannot point fingers and say the scanner “reads” books and incorporates them into its worldview just like you do. Because the scanner is not human and you are, you’re the one responsible for operating it.
I see no difference with cryptotokens here: the human has the freedom to do things, and the human is responsible for them if those things are bad. (Unlike with LLMs, though, theft of property is pretty much always a crime, whereas reading a book in a shop without buying it is not.)
Maybe not but even if Meta did buy 1 copy of every book I doubt it would stop anyone from making bad analogies to theft. (Not that the analogy on the other side to a human reading is any better.)
>You have bought the text so you have the readright, but you do not have the copyright.
You do however, have the right to make derivative works based on the contents of the book. You reading a physics textbook doesn't mean you can't write a blog post about gravity or whatever, and you reading harry potter doesn't mean you can't write a series of fantasy books involving a young wizard trying to fight an evil wizard.
Last I looked, machines are considered unable to create copyrightable content, so your attempt to compare that to LLMs might not work in court.
> The application was denied because, based on the applicant’s representations in the application, the examiner found that the work contained no human authorship. After a series of administrative appeals, the Office’s Review Board issued a final determination affirming that the work could not be registered because it was made “without any creative contribution from a human actor.”
>Last I looked, machines are considered unable to create copyrightable content, so your attempt to compare that to LLMs might not work in court.
That just means whatever they produce can't be copyrighted, not that they can't produce derivative works. Courts have upheld the right for google to produce thumbnails of copyrighted works, even though the procedure for producing thumbnails is done by a computer and thus can't be copyrighted.
Sure, which is a thing we've sort of agreed on as a society based on human consumption and creativity. Not that we all agree, and not that this agreement is free of influence from megacorps. But the context in which we've enacted copyright law is based on values related to human consumption and creativity.
Maybe we will end up agreeing that we just want to stick with those same laws for machine consumption and creativity. But maybe we won't since they are quite different things.
What if we could buy the books for one human, have that human read all the books, and then somehow clone this human in a way that they remember the book contents?
I think we should pay authors a fair wage based on some measure of the quality of the content instead of how well the book sells.
There's no reason for Harry Potter, for example, to be 10,000 times more valuable than a book on quantum mechanics just because the former is more popular and the latter covers a more obscure topic.
> I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts.
That... doesn't make it okay...
> A lot of them I didn't even pay for, I borrowed them from libraries or friends.
This 2nd sentence doesn't fit your first. What is your message?
The "and then fed them into an AI" part of "facebook pirated a bunch of books and then fed them into an AI" is irrelevant. It would be equally illegal if they pirated them and then sat around reading them. Unless you somehow hope that the entirety of copyright will be overturned by this court case (not a chance), you should strongly hope that facebook loses, because the alternative is literally "rules for thee but not for me", where corps can pirate whatever they want but nothing changes for ordinary citizens.
So first there was ‘corporations are people’, and now we have ‘computers are people’.
So I expect to see either that you are no longer allowed to own computer software, or a return of slavery.
Also, if we find indecent portrayal of minors in a data centre, I expect we treat it as a strict liability crime and the entire data centre, or the corporation that owns it, gets a long prison sentence, just like a human would. However that is supposed to work.
This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
Extrapolated out into some new future a hundred years from now when we have embodied AI humanoids walking alongside us, would it be weird if those humanoids were barred from buying a new book or charged a different rate than the humans they coexist with?
I’m still deciding how I feel about some of this too.
If we are going to afford models like this treatment equivalent to sentient beings in this regard, why not in others? In your extrapolation, these AIs walking among us are the property of giant tech companies…
Valid point. And to some extent a lot of existing licensing models cover this for humans too (are you using this thing for yourself, or are you using this thing on behalf of your company).
There will be a lot to figure out over the coming years.
> This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
I'm not even against this, to a point. The issue is what comes after. The monetization. The enshittification. The derivatives in place of real creativity.
Even if we could all train evenly, those with money will always win out on the execution. You can't possibly believe just "letting things play out" as they have been so far, but with less copyright guardrails, is the solution here?
It's a straw man, though: whether or not AI should be allowed to learn from books is irrelevant to the point that Meta stole tens of thousands of books to accomplish this. A fact they've admitted to, and which, even had they not, would be trivially proven.
They're not being charged, that would be a vast improvement over reality.
At ten dollars per book, that'd just be a few hundred thousand dollars. They spent way more than that training the model, and probably will spend more on legal fees in this case.
But if they had done that, I bet they would have been sued anyway.
Even accepting that, the law should be encouraging creative output by individuals and there is justifiable fear that this will be used to bypass protections designed to reward such behavior.
For a more direct counterexample, I can memorize something and type it back out, but if it is copyrighted the law doesn’t make an exception just because it passed through my head.
If the AI is able to type back out a duplicate of the training data, then I agree that's copyright infringement. If it just learns from the data like a human with normal memory reading a large amount of material, then I don't see it. That's normally the case. There have been experiments where someone managed to make an AI spit out near-copies, but it's not the default situation and seems preventable.
I do agree that we should encourage human creativity. But if AI isn't making copies, and the output of AI isn't awarded copyright (as is currently the case) then I think humans still have sufficient reward.
The courts have already ruled that training on data is similar to reading it (sufficiently transformative) to be considered fair use, in the same way that I cannot claim a copyright on your brain because you read this comment.
On the other hand, they torrented books and then open sourced LLM weights. No punishment is too severe for that!
If you still don’t understand, I strongly suggest watching Max Headroom, “Lessons”, which you can get here:
I think he’s saying he works for Meta, and that when the company's employees committed mass copyright violation that was OK, because someone once read Winnie the Pooh to him at story hour at the library.
Honestly, if people continue to conflate human development with mega corps trawling copyrighted material to build a mathematical model, then wrapping it up and charging a subscription for it, there's really not much you, I or anyone else can do to avoid the inevitable fallout, and we deserve everything we get for it.
I agree with you until this part. There comes a time where I don't think I deserve to get my eyes poked out just because other people find that fashionable.
OP is making the spurious argument that technology should have the same ethical entitlements as humans. It's on par with "information wants to be free".
I don't read it as an ethical argument, it's an argument about the purpose of copyright. Copyright is intended to restrict reproduction of a work for the purpose of incentivizing the creation of new works. Copyright is not intended to restrict the transmission of knowledge.
My thoughts as well. I just prefer to remove the nuance on these types of things. If OP wants to draw a line and clearly state "I'm with the corpo robots" that's fine. Just state it plainly so I can proceed accordingly.