The process is described above, but it’s very hard to “innocently” end up with one of those images that they are looking for from the database.
And the way it’s being done (hashes), a collision is highly unlikely. If it does occur, it doesn’t mean the two images are similar in nature (e.g. an innocent picture of your own child in the bath). The hash isn’t looking at the image content in the sense of “what’s in the picture”, just the bits of the file. So it’s highly, highly unlikely, even if a collision occurs, that the colliding image would happen to be another child innocently bathing.
> The process is described above, but it’s very hard to “innocently” end up with one of those images that they are looking for from the database.
Actually, it's very easy to end up with an image that has a similar perceptual hash to an illegal image.
They are not doing MD5 hashing; they're taking perceptual hashes and then using something like the Hamming distance or Levenshtein distance to make a fuzzy match against hashes of illegal images.
I've built products using these methods, and it is incredibly easy to make a fuzzy match based on perceptual hashes from two images that have nothing to do with each other.
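To make the fuzzy-matching idea concrete, here is a minimal sketch using a toy "average hash" and a Hamming-distance threshold. This stands in for production perceptual-hash algorithms (PhotoDNA, pHash, or whatever Apple actually uses, which is not public); the 8x8 pixel grids and the threshold of 10 bits are made up for illustration.

```python
# Toy perceptual hashing with a fuzzy Hamming-distance match.
# "average hash": downsample to 8x8 grayscale, then set each bit to 1
# if that pixel is brighter than the mean. Real systems use more robust
# transforms, but the matching logic is the same in spirit.

def average_hash(pixels):
    """Map an 8x8 grayscale grid to a 64-bit hash: 1 if pixel > mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming_distance(h1, h2):
    """Count the bits that differ between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def is_fuzzy_match(h1, h2, threshold=10):
    """Flag a match if the hashes differ in fewer than `threshold` bits."""
    return hamming_distance(h1, h2) < threshold

# Two slightly different versions of the "same" image: img_b is img_a
# with every pixel brightened by 3. Their cryptographic hashes would
# differ completely; their perceptual hashes are identical.
img_a = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
img_b = [[((r * 8 + c) % 256) + 3 for c in range(8)] for r in range(8)]

print(is_fuzzy_match(average_hash(img_a), average_hash(img_b)))  # True
```

The flip side of this robustness is exactly the false-positive problem above: any two images whose downsampled brightness patterns land within the threshold will match, whether or not they depict anything similar.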
> but it’s very hard to “innocently” end up with one of those images that they are looking for from the database.
Are you sure, given how many iPhone takeover/jailbreak bugs regularly exist, including the recent 0-click iMessage bug? Would you like to dare a hacking group, domestic or foreign, to land a single verboten image on your iPhone?
> And the way it’s being done (hashes), a collision is highly unlikely.
The limited details so far suggest that it's using perceptual hashes, which are more susceptible to collision engineering and false positives than cryptographically secure hashes.
What happens if someone spams CP via iMessage to their enemies?
Additionally, this isn't just a hash of the file but a perceptual hash on the image content. So e.g. changing a single bit in the image would create a different cryptographic hash, but generally not a different perceptual hash.
Presumably the verification that these false matches are not problematic is manual?
That's not good enough given how our media currently works. Imagine the articles published when word of such a check leaks: "celebrity X's phone checked by police for suspected CSAM." While that is technically true ("suspected"), no one cares about that nuance, and such a person would get cancelled very quickly, even if there was no evidence of wrongdoing.
Presuming that is what they're doing, it won't stay that way for long.
It's absurdly easy to change the hash of a file. In the case of something like JPEG, you don't even have to change the image data itself; you can just change the metadata. Apple could presumably hash only the image data, but again, all you have to do is make tiny changes to the image, imperceptible to humans, and the hash is totally different.
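This is easy to demonstrate: flipping a single bit of a file produces a completely unrelated cryptographic digest. The byte string below is a made-up stand-in for a JPEG's contents; the point is only the avalanche behavior of SHA-256.

```python
# Sketch: a one-bit change yields a totally different cryptographic hash,
# which is why exact-hash matching is trivially evaded by tiny edits.
import hashlib

original = b"\xff\xd8\xff\xe0" + b"image data" * 100  # pretend JPEG bytes
tweaked = bytearray(original)
tweaked[-1] ^= 0x01                                    # flip one low bit

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(bytes(tweaked)).hexdigest()
print(h1 == h2)  # False: the two digests share essentially no structure
```

That asymmetry is the whole argument: exact hashes are useless against an adversary who edits one bit, so a system that wants to catch edited copies has to move to perceptual hashing, with all the false-positive risk that entails.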
Long story short, this is either nearly pointless and privacy invasive, or it's about to get drastically more invasive to be effective.
Furthermore, "highly unlikely" is an understatement. Using SHA-256 there are 2^256 possible hashes. That's 115792089237316195423570985008687907853269984665640564039457584007913129639936. The probability of a collision (two different inputs producing the same hash) is infinitesimal.
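A quick birthday-bound estimate backs this up. The probability of *any* accidental collision among n random inputs is roughly n(n-1)/(2 * 2^256); the figure of 10^22 inputs below is a deliberately extreme hypothetical (on the order of a trillion images for each person on Earth).

```python
# Back-of-the-envelope birthday bound for accidental SHA-256 collisions.
SPACE = 2 ** 256
n = 10 ** 22  # hypothetical: ~a trillion images per person on Earth
p_collision = n * (n - 1) / (2 * SPACE)
print(p_collision)  # on the order of 1e-34: negligible
```

Of course, this only speaks to *accidental* collisions of a cryptographic hash; it says nothing about engineered matches against a perceptual hash, which is the actual concern raised above.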