It's not a fair comparison with AI. In this case, the vision network is more like a reflex, so what gets fooled is the equivalent of a human reflex. The vision neural net feeds into a "world model", where such inconsistencies are resolved at a more abstract level. The same world model is used to plan the car's path. Even if the vision net makes an error, the world model can detect that it is a perception error when the result doesn't make sense in context.
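
To make the idea concrete, here is a toy sketch of how a world model might reject a detection that doesn't fit the context. Everything in it is an assumption for illustration (the `Track` and `Detection` types, the constant-velocity prediction, the `max_jump_m` threshold); it is not how any particular car's software works.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One frame's output from the vision net (hypothetical format)."""
    label: str
    position_m: float  # distance ahead along the road, in metres


@dataclass
class Track:
    """What the world model currently believes about one object."""
    label: str
    position_m: float
    velocity_mps: float  # relative velocity, in metres per second


def predict_position(track: Track, dt_s: float) -> float:
    # Constant-velocity prediction: where the object should be after dt_s.
    return track.position_m + track.velocity_mps * dt_s


def is_plausible(track: Track, det: Detection, dt_s: float,
                 max_jump_m: float = 2.0) -> bool:
    # A detection fits the context if the label has not changed and the
    # position is close to where the world model expected the object.
    same_label = det.label == track.label
    near_prediction = abs(det.position_m - predict_position(track, dt_s)) <= max_jump_m
    return same_label and near_prediction


def update(track: Track, det: Detection, dt_s: float) -> tuple[Track, bool]:
    """Return the updated track and a flag marking a suspected perception error."""
    if is_plausible(track, det, dt_s):
        velocity = (det.position_m - track.position_m) / dt_s
        return Track(det.label, det.position_m, velocity), False
    # Implausible detection: keep the old belief and coast on the prediction.
    coasted = Track(track.label, predict_position(track, dt_s), track.velocity_mps)
    return coasted, True


# A stop sign 30 m ahead, approaching at 10 m/s.
track = Track("stop_sign", 30.0, -10.0)
# One frame later the vision net is fooled and reports a different sign.
fooled = Detection("speed_limit_45", 29.0)
track_after, suspected_error = update(track, fooled, dt_s=0.1)
```

Here the world model keeps believing in the stop sign and flags the frame, because a sign doesn't turn into a different sign between two frames. A real system would weigh many more cues, but the planner only ever sees the filtered belief, not the raw detection.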