Presentations like these make me realize how close we are to developing law enforcement (/police state) technology that will be very effective. I figure once the kinks are smoothed out, we could run this on a video feed and have crimes prevented right as they're about to happen. It's almost scary. Imagine 20 years from now: some guy pulls a gun on you, a video feed identifies his action, and immediately shoots a tranquilizer straight into his jugular with perfect aim.
Even if this doesn't work now, it can still be applied later to video shot today. So we are already en route to being judged by such an automated guardian.
I fear greatly for the false positives that will inevitably happen, and the mass oppression of society that could result from the natural instinct to avoid them. When every action even slightly against the norms of society at the time can be acted upon and countered, the chilling effect on freedom could be immense.
> You had to live -- did live, from habit that became instinct -- in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized.

-- George Orwell, 1984
I imagine more preventative solutions developing earlier that are more effective, e.g., "treating" citizens who are predicted to be highly likely to cause a dangerous outcome in the future.
Or, in the simpler context of computer vision and the camera, raising warnings as individuals on screen act increasingly suspiciously. I imagine future technology will at least include intelligent security cameras that trigger alarms based on the current video feed, not just when it's suddenly shut off, for example.
> some guy pulls a gun on you, a video feed identifies his action, and immediately shoots a tranquilizer straight into his jugular with perfect aim.
Eh, no. Safe preventive action against a gun (or any lethal weapon, really) needs to happen before the weapon is ready for use. What if the assumed offender was wearing protection? Perfect aim wouldn't help. Lethal force might be possible to deploy, but probably not in a safe manner.
I'm a bit of a skeptic. They have several very good examples, and those are really impressive. But everything about how often these striking results occur is 'coming soon'. And since it is neural network-based, getting lucky sometimes doesn't say anything about the statistical performance. What percentage of the test dataset was labelled correctly?
I'd like to see the worst-case behaviour too - e.g. hilariously wrong results. Seeing the failures (and their rate) makes it more believable and enables a more unbiased evaluation of the capabilities of the system.
It's like those "up to XX% better" claims - "up to", not "at least" being the key phrase here.
Fascinating. Also interesting to see the failure modes. Any human would quickly realize that the "boy doing backflip on wakeboard" is actually playing on a trampoline. Or the "two young girls playing with legos toy". Great stuff!
I think I'm a human, but I did not realize that boy was on a trampoline instead of a wakeboard until you pointed it out. I think the water in the background confused me. I was also probably biased by reading the network's label before reaching my own judgment.
Right - impressive though this is, the cat also isn't black, the woman with the bananas might well not be a woman, the girl isn't swinging on a swing, the construction worker is probably not in the road, and the guitarist is tuning the guitar, not playing it. It's pretty much wrong, in detail, on every count.
The weird thing is how at a glance, it seems pretty much correct. And how some people here are willing to look at that and think 'automated law enforcement is clearly imminently possible'.
Wow, talk about high expectations. This is really well done: identifying multiple concepts in an image and describing them to a good degree of accuracy. And this will only get better.
No no! Don't misunderstand me! I agree: It's phenomenally impressive! It's also still completely useless at this level of accuracy.
At this point, this software is as useful at describing photographs as a disinterested teenager who is busy trying to text. "This is a picture of my mom with some dude playing I dunno like tennis or something. Whatever."
Which is really impressive! Seriously!
But closing the gap to accuracy is really important, and it's a hard hard problem.
Neural networks are impressive only in that they are able to give any kind of meaningful results at all. In the end, they are only a poor mimicry of real machine intelligence, and not much better, conceptually, than plain old nonlinear regression.
Nobody has been able to determine what the structure of a neural network should look like for any given problem (network type, number of nodes, layers, activation functions), how many iterations of the parameter optimization algorithm are needed to achieve "optimal" results, and how "learning" is actually stored in the network.
Statistical learning methods are obviously still useful, but I think the field is still wide open for something to emerge that is closer to true machine intelligence.
Please, try implementing non-linear regression to understand images. Tell me how it goes.
Also, 'nobody knows how learning is stored'? You very clearly have never worked with neural nets before. Experience is stored in the form of weight values.
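To make that concrete, here's a minimal pure-Python sketch (the network and all its weight values are made up for illustration) showing that a trained net's "experience" is nothing but numbers: copy the weights and you've copied the behaviour.

```python
import copy
import math

def forward(weights, x):
    """Tiny one-hidden-layer network; everything it 'knows' lives in `weights`."""
    w1, b1, w2, b2 = weights
    # Hidden layer with tanh activation.
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    # Single logistic output unit.
    z = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Pretend these numbers came out of a training run.
weights = ([[0.5, -1.2], [0.8, 0.3]], [0.1, -0.4], [1.5, -0.7], 0.2)

x = [1.0, 2.0]
original = forward(weights, x)

# "Saving" and "loading" the model is just copying the numbers:
# the learned behaviour transfers entirely through the weight values.
restored_weights = copy.deepcopy(weights)
assert forward(restored_weights, x) == original
```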
Okay, you've got a neural net that does a really good job of identifying types of animals in pictures. Unfortunately, whenever you show it a picture of a horse, it says 'fish'. Everything else it's great at: marmosets, capybaras, dolphins, kangaroos. But it's got a complete blind spot for horses.
Where's the incorrect data stored? How can you fix it? It's in the weight values, somewhere, but you can't go and change the weight values to fix the horse/fish cascade without breaking everything else it knows.
Yes, we know 'where' the data is stored. But it's diffuse, not discrete, so we can't separate it from other data.
Umm, even so, you can train it on more horse photos to increase its performance specifically on horses. Furthermore, you can study neuron activation levels on that horse training data to reverse-engineer the neural "ravines" the activations settle into, and compare those ravines against the ravines for, say, zebras.
This is something actively being done by NN researchers. And it lets us do things like take the low-level audio processing part of a neural net trained on English voice data and use it to train smarter neural nets on Portuguese voice data than you could have without the English voice recordings.
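As a toy illustration of that transfer idea (everything here is synthetic: made-up "pre-trained" weights and a fake one-dimensional "Portuguese" task), you can freeze a low-level feature layer and train only a new top layer on the new data:

```python
import math
import random

random.seed(0)

# Frozen "low-level" layer: pretend these weights were learned on
# English audio. They are never updated below.
w_low = [0.9, -1.3, 0.4]

def features(x):
    return [math.tanh(w * x) for w in w_low]

# Tiny synthetic "Portuguese" task: the label is 1 when x > 0.
data = [(x, 1.0 if x > 0 else 0.0)
        for x in (random.uniform(-2, 2) for _ in range(200))]

# Train ONLY the new top layer (logistic regression over the frozen features).
w_top, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5

def predict(x):
    z = sum(wi * hi for wi, hi in zip(w_top, features(x))) + b
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(300):
    for x, y in data:
        h = features(x)
        g = predict(x) - y  # gradient of the log-loss w.r.t. z
        w_top = [wi - lr * g * hi for wi, hi in zip(w_top, h)]
        b -= lr * g

accuracy = sum((predict(x) > 0.5) == (y > 0.5) for x, y in data) / len(data)
```

In a real system the frozen part would be the early convolutional or audio-processing layers of a large pre-trained net, and the new top layer a full task-specific head, but the division of labour is the same.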
Something doesn't seem right. If we are this good at image recognition, how come we are still using CAPTCHAs?
Are we so good at image recognition that it can identify a young girl, a bunch of bananas, a guitar, etc. just from training on a set of only a few thousand images? What's the catch?
On one hand, Google says their self-driving car can't understand all the situations on the road, while this algorithm can identify as many details in an image as a strong AI would. Feels very strange.
As I understand it, reCAPTCHA now relies heavily on metadata to determine if someone is a bot. Probably stuff like their IP address, browser information, geographic location, internet speed. Then it predicts how likely they are to be a spam bot and sends them a much harder CAPTCHA if so. And they change the interface to throw off bots.
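The exact signals and weights reCAPTCHA uses are not public, so the following is purely a hypothetical sketch of the "risk score over metadata" idea; every signal name and threshold here is invented:

```python
def bot_risk_score(request):
    """Hypothetical heuristic: sum weighted suspicion signals into [0, 1].
    The real reCAPTCHA signals and weights are not public."""
    score = 0.0
    if request.get("ip_on_blocklist"):
        score += 0.5
    if request.get("headless_browser"):
        score += 0.3
    if request.get("requests_per_minute", 0) > 60:
        score += 0.3
    if not request.get("has_cookies", True):
        score += 0.1
    return min(score, 1.0)

def challenge_for(request, threshold=0.5):
    """Serve a harder CAPTCHA to likely bots, an easy one otherwise."""
    return "hard" if bot_risk_score(request) >= threshold else "easy"

# A blocklisted IP running a headless browser gets the hard challenge;
# an ordinary browsing session gets the easy one.
print(challenge_for({"ip_on_blocklist": True, "headless_browser": True}))
print(challenge_for({"has_cookies": True}))
```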
The average person who just wants to automate filling out your website form is still blocked, so it's not useless.
There is some recent research that suggests you can make images which are very hard for neural networks to identify, but still easy for humans.
Here is an example: http://i.imgur.com/K6AQRkV.png The digits on the right are just slightly changed to be harder for NNs to recognize.
For comparison, this is the amount of random noise needed to have the same effect as their method: http://i.imgur.com/Asnf2L8.png
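The mechanics behind such adversarial images can be sketched in miniature with a fast-gradient-sign-style perturbation (in the spirit of Goodfellow et al.); the single-unit "classifier" and its weights below are invented for illustration, and this is not the specific method behind the linked images:

```python
import math

# A tiny, hypothetical "classifier": one logistic unit over 3 pixel values.
# (Real attacks target deep nets; the mechanics are the same in miniature.)
w = [2.0, -1.5, 1.0]
b = -0.5

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

x = [0.8, 0.2, 0.6]  # original "image", confidently class 1
eps = 0.4            # small per-pixel perturbation budget

# Nudge each pixel one small step in the sign direction that lowers
# the class-1 score -- no pixel changes by more than eps.
x_adv = [xi - eps * (1.0 if wi > 0 else -1.0) for xi, wi in zip(x, w)]

clean_score = predict(x)    # ~0.80: class 1
adv_score = predict(x_adv)  # ~0.40: the decision flips
```

The same small-budget trick scaled up to thousands of pixels is why a perturbation that looks like faint noise to us can completely change a network's answer.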
Captchas largely fight the lowest common denominator: they deter those who don't care enough (or lack the knowledge) to work around them when they can just go elsewhere, so that you can invest your human resources fighting the more sophisticated attackers who actually target you.
They are reasonably successful because spammers have enough other targets that not many see it as worth the extra effort (and clock cycles) to break them, not because most of them are particularly hard to beat any more.
Breaking captchas also has an incentive that has been attractive enough to create three entire industries: spam, malware, and the security products that deal with spam and malware.
These images are probably cherry picked as the most successful recognitions. We need more data until we can say whether it can actually recognize young girls in images confidently, or just sometimes gets "lucky". That's why they say that not all traffic situations can be understood. The algorithms aren't perfect.
This isn't really a breakthrough in object identification, as much as it is a clever pairing of identification with (mostly existing) language systems, is that right?
Wondering whether there's any merit to the sibling comments speculating that this is the future of e.g. surveillance.
Well, this is doing more than just object recognition plus language labeling. Notice how it understands that the girl is 'in' the white dress. That is a lot more complicated than identifying a white dress and a girl.
This is one of the coolest things I've seen in a while. I'd guess this is super similar to what the folks on the DeepMind team at Google are working on now, with the overall vision being to classify images that have no metadata and add them to a dynamically learning knowledge graph.
Sharing pre-built models is so cool, and definitely important to advancing machine learning science. Especially when you consider how mixing weight layers lets you do things like understand Portuguese text better through English text.
Yup! That's exactly why there has been so much work on this from the likes of Google. Video and images represent a "dark media" in terms of indexing and search. It is very difficult without robust tagging (which is time-intensive, or relies on the articles and descriptions that surround the images) to search through images. This technology would make it much easier to perform robust querying of this data.
Yes, there is a recent trend of using Recurrent Neural Networks to model the structure and semantics of sentences. This is used in particular for Machine Translation research by people at Google and the University of Montreal: http://scholar.google.fr/scholar?q=rnn+lstm+machine+translat...
Idk if this is a daft question, but in the Visual-Semantic Alignment section, are those objects in the colored boxes actually being directly recognized by the software? Or are they inputted in some other way?
From the paper: Our core insight is that we can leverage these large image-sentence datasets by treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions.
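The alignment idea from that excerpt can be caricatured in a few lines: score each word fragment against each image region in a shared embedding space and pick the best region per word. The embedding vectors below are made up; in the paper they come from learned models (a CNN for regions, an RNN for words), and the actual alignment objective is more sophisticated than a per-word argmax.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings in a shared 3-d space.
regions = {
    "region_0": [0.9, 0.1, 0.0],  # e.g. a crop around the girl
    "region_1": [0.0, 0.8, 0.3],  # e.g. a crop around the dress
}
words = {
    "girl":  [1.0, 0.0, 0.1],
    "dress": [0.1, 0.9, 0.2],
}

# Align each word to the region it scores highest against.
alignment = {
    word: max(regions, key=lambda r: dot(regions[r], vec))
    for word, vec in words.items()
}
# alignment -> {"girl": "region_0", "dress": "region_1"}
```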
This gets a little more detailed into the work, but compared to other papers that have sprung up in this area recently, our paper slightly frowns on the idea of distilling a complex image into a single short sentence description. In that sense we are a little more ambitious and we're trying to produce snippets of text that cover the full image with descriptions on level of image regions. I would call our results encouraging, but there is certainly more work to be done here. And I think one of the limitations right now to do a good job is the amount of training data available to us.
So basically you have a database of pre-existing sentences to match what's roughly likely to be in the image, not actually seeing individual objects and generating a grammatically correct and accurate description based on those objects?
I'm impressed but skeptical, too. Laypeople might not get an accurate impression of what is possible.
Consider the image with the guy in a black shirt playing guitar. What if it were a picture of a naked man standing next to a black shirt hanging on a clothes hook, with a guitar mounted on a rack next to it? My guess is that the computer would still spit out "guy in black shirt playing guitar", just because that sentence is very plausible in the underlying language model.
I remember doing something like this (without the neural networks, that is), but my results were very, very bad. If I find that project on my old laptop, I'll post it to GitHub.
I am pretty sure some of those examples are correct only by chance. "little girl is eating piece of cake."? The only thing that marks her as a girl is the hair clip; did the software really see that?
"woman is holding bunch of bananas." Hell, I (hopefully a human) would recognize her as male at first glance.