Presentations like these make me realize how close we are to developing law enforcement (/police state) technology that will be very effective. I figure once the kinks are smoothed out, we could run this on a video feed and have crimes prevented right as they're about to happen. It's almost scary. Imagine 20 years from now: some guy pulls a gun on you, a video feed identifies his action, and immediately shoots a tranquilizer straight into his jugular with perfect aim.
Even if this doesn't work now, it can still be applied later to video shot today. So we are already en route to being judged by such an automated guardian.
I fear greatly for the false positives that will inevitably happen, and the mass oppression of society that could result from the natural instinct to avoid them. When every action even slightly against the norms of society at the time can be acted upon and countered, the chilling effect on freedom could be immense.
> You had to live -- did live, from habit that became instinct -- in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized.

-- George Orwell, 1984
I imagine more preventative solutions developing earlier that are more effective, e.g., "treating" citizens who are predicted to be highly likely to cause a dangerous outcome in the future.
Or, in the simpler context of computer vision and the camera, raising warnings as individuals on screen act increasingly suspiciously. I imagine future technology will at least include intelligent security cameras that trigger alarms based on the current video feed, not just when it's suddenly shut off, for example.
> some guy pulls a gun on you, a video feed identifies his action, and immediately shoots a tranquilizer straight into his jugular with perfect aim.
Eh, no. Safe preventive action against a gun (or any lethal weapon, really) needs to happen before the weapon is ready for use. What if the assumed offender was wearing protection? Perfect aim wouldn't help. Lethal force might be possible to deploy, but probably not in a safe manner.
I'm a bit of a skeptic. They have several very good examples, and those are really impressive. But everything about how often these striking results occur is 'coming soon'. And since it is neural network-based, getting lucky sometimes doesn't say anything about the statistical performance. What percentage of the test dataset was labelled correctly?
I'd like to see the worst-case behaviour too - e.g. hilariously wrong results. Seeing the failures (and their rate) makes it more believable and enables a more unbiased evaluation of the capabilities of the system.
It's like those "up to XX% better" claims - "up to", not "at least" being the key phrase here.
Fascinating. Also interesting to see the failure modes. Any human would quickly realize that the "boy doing backflip on wakeboard" is actually playing on a trampoline. Or the "two young girls playing with legos toy". Great stuff!
I think I'm a human, but I did not realize that boy was on a trampoline instead of a wakeboard until you pointed it out. I think the water in the background confused me. I was also probably biased by reading the network's label before reaching my own judgment.
Right - impressive though this is, the cat also isn't black, the woman with the bananas might well not be a woman, the girl isn't swinging on a swing, the construction worker is probably not in the road, and the guitarist is tuning the guitar, not playing it. It's pretty much wrong, in detail, on every count.
The weird thing is how at a glance, it seems pretty much correct. And how some people here are willing to look at that and think 'automated law enforcement is clearly imminently possible'.
Wow, talk about high expectations. This is really well done: identifying multiple concepts in an image and describing them to a good degree of accuracy. And this will only get better.
No no! Don't misunderstand me! I agree: It's phenomenally impressive! It's also still completely useless at this level of accuracy.
At this point, this software is as useful at describing photographs as a disinterested teenager who is busy trying to text. "This is a picture of my mom with some dude playing I dunno like tennis or something. Whatever."
Which is really impressive! Seriously!
But closing the gap to accuracy is really important, and it's a hard hard problem.
Neural networks are impressive only in that they are able to give any kind of meaningful results at all. In the end, they are only a poor mimicry of real machine intelligence, and not much better, conceptually, than plain old nonlinear regression.
Nobody has been able to determine what the structure of a neural network should look like for any given problem (network type, number of nodes, layers, activation functions), how many iterations of the parameter optimization algorithm are needed to achieve "optimal" results, and how "learning" is actually stored in the network.
Statistical learning methods are obviously still useful, but I think the field is still wide open for something to emerge that is closer to true machine intelligence.
Please, try implementing non-linear regression to understand images. Tell me how it goes.
Also, 'nobody knows how learning is stored'? You very clearly have never worked with neural nets before. Experience is stored in the form of weight values.
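To make that concrete, here's a minimal pure-Python sketch (the network and all its weight values are made up for illustration) showing that a trained net's "experience" is nothing but numbers: copy the weights and you've copied the behaviour.

```python
import copy
import math

def forward(weights, x):
    """Tiny one-hidden-layer network; everything it 'knows' lives in `weights`."""
    w1, b1, w2, b2 = weights
    # Hidden layer with tanh activation.
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    # Single logistic output unit.
    z = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Pretend these numbers came out of a training run.
weights = ([[0.5, -1.2], [0.8, 0.3]], [0.1, -0.4], [1.5, -0.7], 0.2)

x = [1.0, 2.0]
original = forward(weights, x)

# "Saving" and "loading" the model is just copying the numbers:
# the learned behaviour transfers entirely through the weight values.
restored_weights = copy.deepcopy(weights)
assert forward(restored_weights, x) == original
```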
Okay, you've got a neural net that does a really good job of identifying types of animals in pictures. Unfortunately, whenever you show it a picture of a horse, it says 'fish'. Everything else it's great at: marmosets, capybaras, dolphins, kangaroos. But it's got a complete blind spot for horses.
Where's the incorrect data stored? How can you fix it? It's in the weight values, somewhere, but you can't go and change the weight values to fix the horse/fish cascade without breaking everything else it knows.
Yes, we know 'where' the data is stored. But it's diffuse, not discrete, so we can't separate it from other data.
Umm, even so, you can train it on more horse photos to increase its performance specifically on horses. Furthermore, you can study neuron activation levels on that horse training data to reverse-engineer the neural "ravines" the activations settle into, and compare those ravines against the ravines for, say, zebras.
This is something actively being done by NN researchers. And it lets us do things like take the low-level audio processing part of a neural net trained on English voice data and use it to train smarter neural nets on Portuguese voice data than you could have without the English voice recordings.
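As a toy illustration of that transfer idea (everything here is synthetic: made-up "pre-trained" weights and a fake one-dimensional "Portuguese" task), you can freeze a low-level feature layer and train only a new top layer on the new data:

```python
import math
import random

random.seed(0)

# Frozen "low-level" layer: pretend these weights were learned on
# English audio. They are never updated below.
w_low = [0.9, -1.3, 0.4]

def features(x):
    return [math.tanh(w * x) for w in w_low]

# Tiny synthetic "Portuguese" task: the label is 1 when x > 0.
data = [(x, 1.0 if x > 0 else 0.0)
        for x in (random.uniform(-2, 2) for _ in range(200))]

# Train ONLY the new top layer (logistic regression over the frozen features).
w_top, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5

def predict(x):
    z = sum(wi * hi for wi, hi in zip(w_top, features(x))) + b
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(300):
    for x, y in data:
        h = features(x)
        g = predict(x) - y  # gradient of the log-loss w.r.t. z
        w_top = [wi - lr * g * hi for wi, hi in zip(w_top, h)]
        b -= lr * g

accuracy = sum((predict(x) > 0.5) == (y > 0.5) for x, y in data) / len(data)
```

In a real system the frozen part would be the early convolutional or audio-processing layers of a large pre-trained net, and the new top layer a full task-specific head, but the division of labour is the same.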
Something doesn't seem right. If we are this good at image recognition, how come we are still using CAPTCHAs?
Are we so good at image recognition that it can identify a young girl, a bunch of bananas, a guitar, etc. just from training on a set of only a few thousand images? What's the catch?
On one hand, Google says their self-driving car can't understand all the situations on the road, while this algorithm can identify as many details in an image as a strong AI would. Feels very strange.
As I understand it, reCAPTCHA now relies heavily on metadata to determine if someone is a bot. Probably stuff like their IP address, browser information, geographic location, internet speed. Then it predicts how likely they are to be a spam bot and sends them a much harder CAPTCHA if so. And they change the interface to throw off bots.
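The exact signals and weights reCAPTCHA uses are not public, so the following is purely a hypothetical sketch of the "risk score over metadata" idea; every signal name and threshold here is invented:

```python
def bot_risk_score(request):
    """Hypothetical heuristic: sum weighted suspicion signals into [0, 1].
    The real reCAPTCHA signals and weights are not public."""
    score = 0.0
    if request.get("ip_on_blocklist"):
        score += 0.5
    if request.get("headless_browser"):
        score += 0.3
    if request.get("requests_per_minute", 0) > 60:
        score += 0.3
    if not request.get("has_cookies", True):
        score += 0.1
    return min(score, 1.0)

def challenge_for(request, threshold=0.5):
    """Serve a harder CAPTCHA to likely bots, an easy one otherwise."""
    return "hard" if bot_risk_score(request) >= threshold else "easy"

# A blocklisted IP running a headless browser gets the hard challenge;
# an ordinary browsing session gets the easy one.
print(challenge_for({"ip_on_blocklist": True, "headless_browser": True}))
print(challenge_for({"has_cookies": True}))
```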
The average person who just wants to automate filling out your website form is still blocked, so it's not useless.
There is some recent research that suggests you can make images which are very hard for neural networks to identify, but still easy for humans.
Here is an example: http://i.imgur.com/K6AQRkV.png The digits on the right are just slightly changed to be harder for NNs to recognize.
For comparison, this is the amount of random noise needed to have the same effect as their method: http://i.imgur.com/Asnf2L8.png
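The mechanics behind such adversarial images can be sketched in miniature with a fast-gradient-sign-style perturbation (in the spirit of Goodfellow et al.); the single-unit "classifier" and its weights below are invented for illustration, and this is not the specific method behind the linked images:

```python
import math

# A tiny, hypothetical "classifier": one logistic unit over 3 pixel values.
# (Real attacks target deep nets; the mechanics are the same in miniature.)
w = [2.0, -1.5, 1.0]
b = -0.5

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

x = [0.8, 0.2, 0.6]  # original "image", confidently class 1
eps = 0.4            # small per-pixel perturbation budget

# Nudge each pixel one small step in the sign direction that lowers
# the class-1 score -- no pixel changes by more than eps.
x_adv = [xi - eps * (1.0 if wi > 0 else -1.0) for xi, wi in zip(x, w)]

clean_score = predict(x)    # ~0.80: class 1
adv_score = predict(x_adv)  # ~0.40: the decision flips
```

The same small-budget trick scaled up to thousands of pixels is why a perturbation that looks like faint noise to us can completely change a network's answer.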
Captchas largely fight the lowest common denominator: they deter those who don't care enough (or lack the knowledge) to work around them when they can just go elsewhere, so that you can invest your human resources fighting the more sophisticated attackers who actually target you.
They are reasonably successful because spammers have enough other targets that not many see it as worth the extra effort (and clock cycles) to break them, not because most of them are particularly hard to beat any more.
Breaking captchas also has an incentive that has been attractive enough to create three entire industries: spam, malware, and the security products that deal with spam and malware.
These images are probably cherry picked as the most successful recognitions. We need more data until we can say whether it can actually recognize young girls in images confidently, or just sometimes gets "lucky". That's why they say that not all traffic situations can be understood. The algorithms aren't perfect.
This isn't really a breakthrough in object identification, as much as it is a clever pairing of identification with (mostly existing) language systems, is that right?
Wondering whether there's any merit to the sibling comments speculating that this is the future of e.g. surveillance.
Well, this is doing more than just object recognition plus language labeling. Notice how it understands that the girl is 'in' the white dress. That is a lot more complicated than identifying a white dress and a girl.
This is one of the coolest things I've seen in a while. I'd guess this is super similar to what the folks on the DeepMind team at Google are working on now, with the overall vision being to classify images that have no metadata and add them to a dynamically learning knowledge graph.
Sharing pre-built models is so cool, and definitely important to advancing machine learning science. Especially when you consider how mixing weight layers lets you do things like understand Portuguese text better through English text.
Yup! That's exactly why there has been so much work on this from the likes of Google. Video and images represent a "dark media" in terms of indexing and search. It is very difficult without robust tagging (which is time-intensive, or relies on the articles and descriptions that surround the images) to search through images. This technology would make it much easier to perform robust querying of this data.
Yes, there is a recent trend of using Recurrent Neural Networks to model the structure and semantics of sentences. This is used in particular for Machine Translation research by people at Google and the University of Montreal: http://scholar.google.fr/scholar?q=rnn+lstm+machine+translat...
Idk if this is a daft question, but in the Visual-Semantic Alignment section, are those objects in the colored boxes actually being directly recognized by the software? Or are they inputted in some other way?
From the paper: Our core insight is that we can leverage these large image-sentence datasets by treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions.
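The alignment idea from that excerpt can be caricatured in a few lines: score each word fragment against each image region in a shared embedding space and pick the best region per word. The embedding vectors below are made up; in the paper they come from learned models (a CNN for regions, an RNN for words), and the actual alignment objective is more sophisticated than a per-word argmax.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings in a shared 3-d space.
regions = {
    "region_0": [0.9, 0.1, 0.0],  # e.g. a crop around the girl
    "region_1": [0.0, 0.8, 0.3],  # e.g. a crop around the dress
}
words = {
    "girl":  [1.0, 0.0, 0.1],
    "dress": [0.1, 0.9, 0.2],
}

# Align each word to the region it scores highest against.
alignment = {
    word: max(regions, key=lambda r: dot(regions[r], vec))
    for word, vec in words.items()
}
# alignment -> {"girl": "region_0", "dress": "region_1"}
```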
This gets a little more detailed into the work, but compared to other papers that have sprung up in this area recently, our paper slightly frowns on the idea of distilling a complex image into a single short sentence description. In that sense we are a little more ambitious and we're trying to produce snippets of text that cover the full image with descriptions on level of image regions. I would call our results encouraging, but there is certainly more work to be done here. And I think one of the limitations right now to do a good job is the amount of training data available to us.
So basically you have a database of pre-existing sentences to match what's roughly likely to be in the image, not actually seeing individual objects and generating a grammatically correct and accurate description based on those objects?
I'm impressed but skeptical, too. Laypeople might not get an accurate impression of what is possible.
Consider the image with the guy in a black shirt playing guitar. What if it were a picture of a naked man standing next to a black shirt hanging on a clothes hook, with a guitar mounted on a rack next to it? My guess is that the computer would still spit out "guy in black shirt playing guitar", just because that sentence is very plausible in the underlying language model.
I remember doing something like this (without the neural networks, that is), but my results were very, very bad. If I find that project on my old laptop, I'll post it to GitHub.
I am pretty sure some of those examples are correct only by chance. "little girl is eating piece of cake."? The only thing that marks her as a girl is the hair clip; did the software really see that?
"woman is holding bunch of bananas." Hell, I (hopefully a human) would recognize her as male at first glance.