*> But the latest and greatest software trend–generative AI–is in danger of bein...

rubinelli · on Jan 22, 2023

De minimis is a longstanding defense in copyright law. If you are copying very little from very many works, as is the case when you turn multiple petabytes into a few gigabytes of neural network weights, you are in the clear. The problem arises when models overfit and spit out almost perfect copies of the training data.

belorn · on Jan 22, 2023

Copyright doesn't have an explicit size, but rather uses size as one of many indicators.

For example, I could take a massive 8k video and covert it into a very small 144p youtube video. Am I in the clear simply because the output is tiny compared to the input? Similar I could take a huge studio master copy of a song and convert it to a very small and rather compressed (distorted) mp3.

I partially agree that some of the problem is when perfect copies are spit out by the models, but I do think there is a bigger problem. Copyright is a complex concept that can't be defined exclusively by a single metric like size, and any mathematically definition will in the end be killed if large copyright holders feel threatened by it.

sdenton4 · on Jan 22, 2023

Thumbnail images don't violate copyright, and are a very helpful comparison case to consider.

"Transformative Use" is a major consideration in fair use copyright: https://en.wikipedia.org/wiki/Transformative_use

ML models do not supplant the pre-existing work, and provide fundamentally new modalities. Transformative use seems like a slam dunk to me, but I guess we'll see what the Supremes decide in twenty years or so...

YellowStuDregg · on Jan 22, 2023

So is market harm

"Some courts have held this factor to be the most important in the analysis."

https://ilt.eff.org/Copyright__Fair_Use.html#Market_Harm

SxC97 · on Jan 22, 2023

I’m unclear about this. Let’s say a movie comes out and I make a YouTube review using brief clips or screenshots from the movie. Since my review is transformative, I should be in the clear (I think?).

But when it comes to market harm, does the tone of my review effect the enforceability of copyright?

As in, if my review is negative it would harm the market for people going to watch the movie vs a positive review right?

belorn · on Jan 22, 2023

Reviews have a distinct "character of use", one of the four cornerstones of fair use exceptions.

A review can be commercial, can cause significant harm to the market, can include substantial amount of the work, and yet the character of use can be significant enough to convince a judge that a exemption should be applied. Since judges historically has come to this conclusion there exist now legal precedence. With precedence we can make some general conclusions which tell us that reviews are in general exempted when using other peoples copyrighted work for the purpose of reviews.

This character of use is very different then if I convert a studio record of a song into mp3 and publish it on p2p sharing site. Judges has historically viewed the character of use in those situation as not being worth giving exemptions.

wnkrshm · on Jan 22, 2023

I am not a lawyer but I'd think:

You're not directly competing with the movie though, your work is a review, not a feature film.

If you were to make a parody movie from the material of the movie itself, directly taking scenes and altering them to your liking but still relying on the viewer recognizing the original in it, you'd have a harder time, I think.

Animats · on Jan 22, 2023

There's a Stable Diffusion example where, having been trained on too many Getty Images pictures stamped with their logo, the system generated new images with Getty Images logos.[1] That's a bit embarrassing. There are code generation examples where copyright notices appeared in the output. A plagiarism detection system to insure that the output is sufficiently different from any single training input ought to be possible.

[1] https://petapixel.com/2023/01/17/getty-images-is-suing-ai-im...

phoe-krk · on Jan 22, 2023

Yes, agreed, I don't think the problem is with networks that mix tons of input data in a way that doesn't heavily derive from one or a couple of sources. The currently available models do not have overfitting solved, though, and this technological imperfection also has direct practical (and legal) consequences.

thewakalix · on Jan 22, 2023

How are you using the word “copy”? It doesn’t seem to match the standard meaning. For instance, most people would not consider a brief summary of a movie’s plot to be a “copy” of that movie, or protected under copyright.

phoe-krk · on Jan 22, 2023

If you have an image, then train a neural network on that image, then use the neural network to reconstruct that image in detail, then the NN by definition contains enough information used to reconstruct that image - hence, a copy.

With NNs trained on thousands or millions of data entries, this concept becomes fuzzy in the same way as you described - a short summary likely wouldn't be considered a copy, just like a 64x64 generated thumbnail wouldn't be considered in the same way a 4096x4096 hi-res image.

coeneedell · on Jan 22, 2023

The thing is, the “good” models can’t reconstruct the image in detail. It’s considered a sign of “overfitting” if you reconstruct the input exactly. Even if you put the exact query that was associated with that image, you’ll get the weighted average (feature-wise) image associated with the query. This applies to all like machine learning models without loss of generality.

wnkrshm · on Jan 22, 2023

That doesn't mean I can't recover the image (or at least get really close to one) using a different query, does it? Edit: It's nonlinear after all.

oneeyedpigeon · on Jan 22, 2023

Sure, but I could write a program to spew out an unbounded number of images containing random pixels. It could create an image that is identical to a copyrighted image, but if I just keep that image on my hard drive, have I violated copyright? I don't think I would be, but if I started distributing them, yes I would.

Krasnol · on Jan 22, 2023

> If you have an image, then train a neural network on that image, then use the neural network to reconstruct that image in detail

I haven't seen that happening since the discussion started. Most of the complains I saw aimed at things like "it stole my style" not "it reproduced my art".

Do you have any examples?

greysphere · on Jan 22, 2023

It's more, 'this product is profiting from my labor without my consent (ie paying me).'

In music you aren't allowed to use the same notes, even if you played them on a trumpet with a swing beat, while the source was on the piano very staccato.

While we don't have the same vocabulary for art, it's not unreasonable to expect similar protections.

wvenable · on Jan 22, 2023

What you describe for music is already outrageous -- why do you think that needs to be extended to everything?

https://www.vice.com/en/article/wxepzw/musicians-algorithmic...

greysphere · on Jan 22, 2023

It takes work to create/identify/classify information, both in the economic and physics sense. That work should be allowed the same protections we do other forms of work.

Your example is one where nearly no work was done, thus it doesn't deserve much value. "Let a = the set of all songs" doesn't help me find new songs I like. A songwriter does that work. Another artist that takes and uses and resells that work (without consent), is stealing that work.

To me it's funny that nearly all the problem with the current team of AI generation would be solved if the model generators simply licensed the content they train on. "But that would cost too much" Ok, just use public domain work, "But that wouldn't be as good" Oh so you are saying the work has value, but you are unwilling to pay for it, and instead your scheme is to just take it. That seems like a good definition of stealing - not paying for something that has value.

Krasnol · on Jan 24, 2023

> Your example is one where nearly no work was done, thus it doesn't deserve much value

You are aware that there is very expensive art out there where the artist did not much work. Like painting a canvas in one colour or throwing an item in the corner of a museum.

According to you, that would not deserve much value but it does have a lot value in reality.

In fact "value" is what somebody else gives to the piece of art.

A prompted AI artwork made by me may have more value to me than all the art in the Louvre.

The discussion here continues to turn around copies when it's not a copy those algorithms generate.

wvenable · on Jan 22, 2023

> Another artist that takes and uses and resells that work (without consent), is stealing that work.

Another artist accidentally uses a melody from another song (because it's a finite set) and are sued for all their income is a horrible system. The winners aren't the people producing value, it's the people who got there first and are now profiting off other people's work.

greysphere · on Jan 23, 2023

If I grew up under a rock, somehow became a self taught musician, and ended up authoring a song that had recognizable components from Happy Birthday, then even still the author of Happy Birthday, having established that melody so successfully in the public zeitgeist, reasonably should benefit.

This is so common the recording industry itself has established rules for sampling and licensing and covers and what not. Are there some folks out there abusing the system, for sure. But overall its goal is to maximize the value produced by the recording industry, which very much includes the people who 'got their first' who built foundations for future artists. To me, this all seems basically reasonable.

wvenable · on Jan 23, 2023

Copyright is supposed to promote the creation of new works. You just described a system where a song written well over 100 years ago is preferred over over a new artist creating a new work.

Turing_Machine · on Jan 22, 2023

> to reconstruct that image in detail

Pretty much none of these systems "reconstruct an image in detail".

ninjin · on Jan 22, 2023

Honestly, can people stop speaking in absolutes regarding these systems? We (researchers and non-researchers alike) are gradually trying to comprehend exactly how much they generalise and memorise, but this is darn hard work and it is not our fault that several major tech giants decided to deploy and profit from these models long before the scientific and legal landscape was clear. Somepalli et al. (2022) [1] for example is a fairly strong argument against your statement above.

[1]: https://arxiv.org/abs/2212.03860

The fact is that these systems are complex, new, and interesting. However, it is not the fault of small-time programmers and artists that modern copyright law is a major, overreaching mess that is now finally greatly affecting what the big corporations want to do. They are getting sued? Cry me a river… Perhaps they will finally stop backing the American-led copyright lobby then?

Turing_Machine · on Jan 22, 2023

> is a fairly strong argument against your statement above.

From a quick skim of this paper, they apparently used toy models with a few hundred to a few thousand images in the training set. For the ones with as few as a few thousand training images, they rarely or never saw exact duplicates.

For instance, in their figure 4, they show exact duplicates for the training set with only 300 images (well, duh), and didn't find any exact duplicates for the training set with only 3,000.

I'm not sure I'd call this a "strong argument" when applied to models with millions or billions of images. Quite the contrary. LAION-5B (used in Stable Diffusion) was trained on 5 billion image/caption pairs.

ninjin · on Jan 23, 2023

Firstly, thank you for engaging in a discussion. Secondly, I am not an expert in image processing, rather my focus in on language. Thus my intuitions will not work as much in my favour in this domain, although the models do have similarities.

They explore a range of sizes and I do not think it is fair to to only highlight the smallest ones. They do explore a 12M subset of LAION in Section 7 for a model that was trained on 2B images. Yes, it is not an ideal experimental setup to use a subset (they admit this) and far from LAION-5B, but it is a fair stab at this kind of analysis and is likely to lead to further explorations.

Let us return though to your claim, which is what I objected to: “Pretty much none of these systems ‘reconstruct an image in detail’.” I think it is fair to say that this work certainly makes me doubt whether none of these systems (even the larger ones) exhibit behaviour that may limit their generalisability or cross the boundary of what is legally considered derivative work.

You may very well be right that once we scale to billions of images this behaviour is improved (or maybe even disappears), but to the best of my knowledge we do not know if this is the case and we do not know when, how, and why it occurs if it does occur. I remain a firm believer that these kinds of models are the future as there is little evidence that we have reached their limits, but I will continue to caution anyone that talks in absolutes until there is solid evidence to support those claims.

CuriouslyC · on Jan 22, 2023

Incorrect. Training images are used to generate a latent manifold which might not contain any of the images in the original training set to within a meaningful delta unless they're massively overrepresented or cliche.

phoe-krk · on Jan 22, 2023

This might be technically correct but doesn't seem to matter much in practice because of how many practical cases of NN outputting copyright-infriging content there are. These include examples in the recent lawsuits that made the rounds on HN. Either even the best specialists behind these NNs cannot make the NNs not contain data that is "massively overrepresented or cliche", or they are unwilling to.

throwaway675309 · on Jan 22, 2023

"Many practical cases of NN outputting copyright-infriging content there are"

I have seen very few practical cases of generative models such as copilot and even less of a stable diffusion of reproducing original copyrighted works in exact detail, and the few that I did encounter were instructed torturously to do so, which strikes me as highly contrived.

CuriouslyC · on Jan 22, 2023

Or they're focusing on performance to get the tools to the place where they're good enough to start being adopted. Once getting sued is more of a problem than not having a product at all, it's not hard to switch gears. I imagine there are a number of ways to avoid storing "too close" copies in a model that have various tradeoffs, I'm sure they'll be adopted quickly in the face of litigation.

fzil · on Jan 22, 2023

> the recent lawsuits that made the rounds on HN

I happened to have missed those discussions, do you have some links you can point towards? thanks!

phoe-krk · on Jan 22, 2023

https://stablediffusionlitigation.com/

https://githubcopilotlitigation.com/

Both along with their respective HN threads.

AuryGlenz · on Jan 22, 2023

I’m still not seeing those examples, at least in the Stable Diffusion link.

Even in my overfitted dreambooth model of my wife it doesn’t pop out the exact same portraits.

fzil · on Jan 22, 2023

thanks a lot!

danbolt · on Jan 22, 2023

> to within a meaningful delta

This feels like an argument for communism being the most productive system in theory. Most of the time I feel like I see uninspired material that’s tracing it’s own training data.

CuriouslyC · on Jan 22, 2023

How is that any different from looking at human art?

danbolt · on Jan 22, 2023

When humans do it within a delta, it’s considered copyright infringement.

Machines currently are adept at making copies within a delta, hence articles such as this to limit copyright so the people who operate said machines can profit.

tomrod · on Jan 22, 2023

Transgressors, perhaps, but not criminals. AI, especially generative AI, are not lossy copies but rather synthesizers.

If anything, your comment highlights the need to evolve IP norms and laws.

tshaddox · on Jan 22, 2023

> But they are compressed lossy copies of all that data!

Sounds similar to, for instance, songs written by people who have heard other songs. I wouldn’t expect legal cases concerning AI-generated works to be any simpler than legal cases concerning the difference between a songwriter violating the IP of another songwriter or simply being inspired by another song.

lelanthran · on Jan 22, 2023

Sometimes a songwriter violates copyright on their own song, without realising it.

yazzku · on Jan 22, 2023

I found all this to be very odd take coming from something entitled "copyleft currents". You summarized some of the points quite well.

> So, all neural network developers, get ready for the lawyers, because they are coming to get you.

This is just dumb.

mdp2021 · on Jan 22, 2023

> But they are compressed lossy copies of all that data

So are those you are storing in your mind...

(Edit: i.e., "learning is not a violation". Also see below.)

scarmig · on Jan 22, 2023

Well, if you copied a famous piece of art from memory... That'd be a copyright violation.

Groxx · on Jan 22, 2023

Yeah, I think this is the crux of the issue that people keep glossing over when claiming it's not a true reproduction.

Drawing Coca Cola's logo from memory by hand and slapping it on your product is still copyright infringement, even if it isn't an image Coca Cola has ever produced. In that sense it doesn't matter at all if it's AI or human - the production and subsequent distribution for profit of a copyrighted thing is not allowed, period.

The current set of AIs do exactly this all the time. That's a very clear legal problem.

sdenton4 · on Jan 22, 2023

well, you get five demerits as the coca cola logo is a trademark, covered by a whole other branch of IP law...

Groxx · on Jan 22, 2023

Eh, fair enough.

I think the point still stands though. Just mentally replace it with "a random DeviantArt" and it all still applies.

tikhonj · on Jan 22, 2023

it's covered by both

mdp2021 · on Jan 22, 2023

I meant that _learning_ is not a copyright violation.

I do not see much space for misunderstanding: which relevant box takes instances of input data to output one of them?

Edit:

Make your point explicit, sniper... You can hide in front of the ### consuetude of "silent disagreement", but it remains violently annoying. The article contains "nuances" like "derivative work", but the poster is objecting to the article calling «neural networks» not «copies», as «compressed lossy copies», and I retorted that so is everything you learn, and holding them in the "corpus" is not considered "retaining a copy". If you have objections to that, either you present them, or there is no contribution in shaking your invisible head.