I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts. Sometimes I even quote from them. A lot of them I didn't even pay for, I borrowed them from libraries or friends.
Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps? An implement that does the same but (A) at scale and (B) without thought or free will or agency, completely at the bidding of its operator, for profit, has no such protections. Instead, the operator carries all responsibility (in this case, Meta).
If a software service had legal protections like that, sure, I could build one that returns you any book you request and say that the service had integrated it into its worldview. Who can check, eh?
* Actually, in some countries you could be in trouble for reading a book and incorporating it into your worldview, to say nothing about quoting it, but let’s set that aside.
>Have you considered that you have some traits that make you eligible to read books and access information freely in the country you live in*? Something about being a conscious human being enjoying human rights, perhaps?
Not a relevant factor when it comes to copyright law. Fair use (the doctrine most applicable here) applies regardless of whether you're a student incorporating news articles into your work or Google making thumbnails and displaying them in its search results.
This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result). And even then it has been in legal trouble, in some countries (like Australia) more than others.
Furthermore:
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
I do not see “automated generation of derivative works of arbitrary nature” in it.
>This is not a good analogy. Google does not display the contents to any significant degree (you have to visit the search result).
The point isn't that AI training is legal because it's like generating thumbnails; that is being argued in the courts right now. The point is that fair use exemptions aren't limited to "being a conscious human being enjoying human rights", as Google generating thumbnails and snippets using computers shows.
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
> The point is that fair use exemptions isn't limited to "being a conscious human being enjoying human rights"
Sure. However, my point is that this is not fair use*, so other principles need to be applied. Whether the legal systems of various countries find that fair use applies here or not, I agree we have yet to see.
* At least in cases where it’s an LLM operated at scale for profit (which I suppose would not hold for Meta’s models if they were truly open, but that’s not the case if they require obtaining a license in some conditions).
>Sure. However, my point is that this is not fair use (at least in cases where it’s an LLM operated for profit), so other principles need to be applied.
This isn't a complete argument. Most of the AI companies' case relies on the claim that AI models are "transformative". That's a plausible claim, and as Perfect 10 v. Google and Authors Guild, Inc. v. Google, Inc. have shown, being a for-profit company is hardly a disqualification from fair use protection.
“Transformative” is always a grey area. If my service just returns you the book you requested, but in upper case, then it has been transformed.
But sure, the “transformative” argument is the one that could apply (and I believe Google did use it to argue its case), if it can be shown that an LLM cannot verbatim reproduce a given work (which, incidentally, is something that you, a warm-blooded fleshy human with agency who has the freedom to read books, cannot do, but LLMs have been shown to do).
That said, the relevant laws existed before LLMs, and many are outdated. If the goal is to balance reasonable uses against protecting the original output of authors that ultimately drives innovation and creativity, I am not sure the preexisting laws are continuing to fulfil their function, but that’s my opinion.
>But sure, the “transformative” argument is the one that could apply (and even I believe Google used it to argue its case), if it can be shown that an LLM can not verbatim reproduce a given work.
You have to try pretty hard to get LLMs to reproduce a work verbatim, especially any lengthy passages that aren't famous (and thus re-quoted on the internet a bazillion times). Moreover just because LLMs can reproduce a work verbatim if you try hard enough doesn't mean it's not transformative. Google search snippets and google book search has been ruled "transformative" by the courts, but if you tried hard enough you can use them to extract the entire work.
>That said, the relevant laws existed before LLMs, and many are outdated. If the goal is to balance reasonable uses against protecting the original output of authors that ultimately drives innovation and creativity, I am not sure the preexisting laws are continuing to fulfil their function, but that’s my opinion.
AFAIK the era of mining the public internet or published works for AI training data is over, or at least coming to an end. Everything that could be mined has already been mined, and besides, the internet is getting increasingly polluted by AI output. Private training data is where it's at now, whether that means sourcing document troves from companies (e.g. emails, documentation, source code, etc.) or paying "AI annotators" to produce training data for you. If the argument is that human authors should get a cut of AI profits because their works were "stolen" to train the models, it's going to be an increasingly losing argument, because it doesn't have a leg to stand on for private training data.
> If the argument is that human authors should get a cut of AI profits because their works were "stolen" to train the models, this is going to be an increasingly losing argument, because it doesn't have a leg to stand on for private training data.
The argument can be made that LLMs could not be created without expropriating the original works of all the authors they were trained on, and that argument would in fact be true and have quite sturdy legs as far as I’m concerned.
It’s not some historical episode from forgotten times; it started less than half a decade ago, and I would be surprised if it’s not still ongoing (your argument about synthetic training data is forward-looking).
>The argument can be made that LLMs could not be created without expropriating the original works of all the authors they were trained on, and that argument would in fact be true and have quite sturdy legs as far as I’m concerned.
That makes as much sense as "American industry was built on the backs of British inventors (back in the day, the US was the 'China' of IP), so Britain should get perpetual (?) royalties from the US economy".
So we’re back to the human vs. unthinking machine distinction. American inventors were human. We’re going in circles, and this article was hidden on HN anyway.
> I do not see “automated generation of derivative works of arbitrary nature” in it
The “automated” isn’t really key. If you read a book, and learn from it, and are able to use that knowledge in other contexts, should you pay a licensing fee? It doesn’t matter if “you” is a human or machine.
“Automated” is key. You are not an automaton, not a machine; you do not scale infinitely with compute power. Unlike a machine, you have free will and agency, and the legal frameworks of developed countries grant you human rights that include freedom. That was, in fact, my entire point.
I just don’t get your angle. My point was that the human is the one who has certain freedoms and the one who bears responsibility. If you read a bunch of books in a bookstore without buying them and use your imperfect memory of them to do your job better and get paid more, it is shady, but if you are not shooed away by the store owner you have the freedom to do it. No one can extract the books you already read from your brain, and you did not sign an NDA. But if you set up an industrial-scale book scanner in the same store, the owner will call the police on you, and you cannot point fingers and say the scanner “reads” books and incorporates them into its worldview just like you do. Because the scanner is not human and you are, you’re the one responsible for operating it.
I see no difference with cryptotokens here: the human has the freedom to do things, and the human is responsible for them if those things are bad. (Unlike with LLMs, though, theft of property is pretty much always a crime, whereas reading a book in a shop without buying it is not.)
Maybe not but even if Meta did buy 1 copy of every book I doubt it would stop anyone from making bad analogies to theft. (Not that the analogy on the other side to a human reading is any better.)
>You have bought the text so you have the readright, but you do not have the copyright.
You do however, have the right to make derivative works based on the contents of the book. You reading a physics textbook doesn't mean you can't write a blog post about gravity or whatever, and you reading harry potter doesn't mean you can't write a series of fantasy books involving a young wizard trying to fight an evil wizard.
Last I looked, machines are considered unable to create copyrightable content, so your attempt to compare that to LLMs might not work in court.
> The application was denied because, based on the applicant’s representations in the application, the examiner found that the work contained no human authorship. After a series of administrative appeals, the Office’s Review Board issued a final determination affirming that the work could not be registered because it was made “without any creative contribution from a human actor.”
>Last I looked, machines are considered unable to create copyrightable content, so your attempt to compare that to LLMs might not work in court.
That just means whatever they produce can't be copyrighted, not that they can't produce derivative works. Courts have upheld the right for google to produce thumbnails of copyrighted works, even though the procedure for producing thumbnails is done by a computer and thus can't be copyrighted.
Sure, which is a thing we've sort of agreed on as a society based on human consumption and creativity. Not that we all agree, and not that this agreement is free of influence from megacorps. But the context in which we've enacted copyright law is based on values related to human consumption and creativity.
Maybe we will end up agreeing that we just want to stick with those same laws for machine consumption and creativity. But maybe we won't since they are quite different things.
What if we could buy the books for one human, have that human read all the books, and then somehow clone this human in a way that they remember the book contents?
I think we should pay authors a fair wage based on some measure of the quality of the content instead of how well the book sells.
There's no reason for Harry Potter, for example, to be 10,000 times more valuable than a book on quantum mechanics just because the former is more popular and the latter covers a more obscure topic.
> I stole a lot of books too, reading them and all. Just integrated them into my worldview, and don't pay a license fee when I use the ideas in new contexts.
That... doesn't make it okay...
> A lot of them I didn't even pay for, I borrowed them from libraries or friends.
This 2nd sentence doesn't fit your first. What is your message?
The "and then fed them into an AI" part of "facebook pirated a bunch of books and then fed them into an AI" is irrelevant. It would be equally illegal if they pirated them and then sat around reading them. Unless you somehow hope that the entirety of copyright will be overturned by this court case (not a chance), you should strongly hope that facebook loses, because the alternative is literally "rules for thee but not for me", where corps can pirate whatever they want but nothing changes for ordinary citizens.
So first there was ‘corporations are people’, and now we have ‘computers are people’.
So I expect to see either that you are no longer allowed to own computer software, or a return of slavery.
Also, if we find indecent portrayal of minors in a data centre, I expect we treat it as a strict liability crime and the entire data centre, or the corporation that owns it, gets a long prison sentence, just like a human would. However that is supposed to work.
This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
Extrapolated out into some new future a hundred years from now when we have embodied AI humanoids walking alongside us, would it be weird if those humanoids were barred from buying a new book or charged a different rate than the humans they coexist with?
I’m still deciding how I feel about some of this too.
If we are going to afford models like this treatment equivalent to sentient beings in this regard, why not in others? In your extrapolation, these AIs walking among us are the property of giant tech companies…
Valid point. And to some extent a lot of existing licensing models cover this for humans too (are you using this thing for yourself, or are you using this thing on behalf of your company).
There will be a lot to figure out over the coming years.
> This viewpoint is one widely shared inside the AI community — that AI systems should be able to learn from material just as humans do.
I'm not even against this, to a point. The issue is what comes after. The monetization. The enshittification. The derivatives in place of real creativity.
Even if we could all train evenly, those with money will always win out on the execution. You can't possibly believe just "letting things play out" as they have been so far, but with less copyright guardrails, is the solution here?
It's a straw man, though: whether or not AI should be allowed to learn from books is irrelevant to the point that Meta stole tens of thousands of books to accomplish this. A fact they've admitted to, and which, even had they not, would be trivially proven.
They're not being charged, that would be a vast improvement over reality.
At ten dollars per book, that'd just be a few hundred thousand dollars. They spent way more than that training the model, and probably will spend more on legal fees in this case.
But if they had done that, I bet they would have been sued anyway.
Even accepting that, the law should be encouraging creative output by individuals and there is justifiable fear that this will be used to bypass protections designed to reward such behavior.
For a more direct counterexample, I can memorize something and type it back out, but if it is copyrighted the law doesn’t make an exception just because it passed through my head.
If the AI is able to type back out a duplicate of the training data, then I agree that's copyright infringement. If it just learns from the data like a human with normal memory reading a large amount of material, then I don't see it. That's normally the case. There have been experiments where someone managed to make an AI spit out near-copies, but it's not the default situation and seems preventable.
I do agree that we should encourage human creativity. But if AI isn't making copies, and the output of AI isn't awarded copyright (as is currently the case) then I think humans still have sufficient reward.
The courts have already ruled that training on data is similar to reading it (sufficiently transformative) to be considered fair use, in the same way that I cannot claim a copyright on your brain because you read this comment.
On the other hand, they torrented books and then open sourced LLM weights. No punishment is too severe for that!
If you still don’t understand, I strongly suggest watching Max Headroom, “Lessons”, which you can get here:
I think he’s saying he works for Meta, and that when the company's employees committed mass copyright violation that was OK, because someone once read Winnie the Pooh to him at story hour at the library.
Honestly, if people continue to conflate human development with mega corps trawling copyrighted material to build a mathematical model, then wrapping it up and charging a subscription for it, there's really not much you, I or anyone else can do to avoid the inevitable fallout, and we deserve everything we get for it.
I agree with you until this part. There comes a time where I don't think I deserve to get my eyes poked out just because other people find that fashionable.
OP is making the spurious argument that technology should have the same ethical entitlements as humans. It's on par with "information wants to be free".
I don't read it as an ethical argument, it's an argument about the purpose of copyright. Copyright is intended to restrict reproduction of a work for the purpose of incentivizing the creation of new works. Copyright is not intended to restrict the transmission of knowledge.
My thoughts as well. I just prefer to remove the nuance on these types of things. If OP wants to draw a line and clearly state "I'm with the corpo robots" that's fine. Just state it plainly so I can proceed accordingly.