Amazon eats its own dogfood with AWS services, while Google does not, and no amount of marketing is going to make up for that. Every time I give it a try, I find random bugs everywhere, and I know there are not enough internal engineers feeling the pain to get those bugs fixed.
AWS has had its own share of massive outages. I had a look at the AWS and GCE incident histories, and it looks like the frequency and magnitude of incidents are quite similar. Do you have any factual information that contradicts this?
Perhaps the fact that the GCE outage affected multiple regions around the world. EC2 outages were all localized to a single region (and in most cases limited to a single AZ within that region).
AWS asks you to design your architecture for failure and work across multiple availability zones. It's complicated and expensive, but you can avoid the outage. With Google, on the other hand, you can't, and that's the price you pay when you choose PaaS over IaaS.
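Designing for zone failure mostly comes down to spreading capacity so that losing any one zone only costs you a survivable fraction of it. A minimal sketch of that idea (zone names and instance counts here are purely illustrative, not tied to any real deployment):

```python
def spread_across_zones(instance_count, zones):
    """Assign instances round-robin across zones so capacity is spread
    as evenly as possible; losing one zone loses at most
    ceil(instance_count / len(zones)) instances."""
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement

# 10 instances over 3 hypothetical zones: worst-case loss is 4/10 capacity.
placement = spread_across_zones(10, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(placement)  # {'us-east-1a': 4, 'us-east-1b': 3, 'us-east-1c': 3}
```

The same round-robin principle applies whether you place instances by hand or let an autoscaling group balance across subnets in different zones for you.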
Google Compute Engine is also an IaaS. You can also design your architecture for failure and work across multiple availability zones by using GCE. GCE is not GAE.
The difference is really just in maturity. AWS started years earlier and has had time to work out the kinks. GCE will get there, but it'll need time and experience.
I have spoken to multiple Amazon engineers who build their production services on AWS. (Hint: I never claimed that all of Amazon retail already runs on AWS, nor is that required for my argument.) Zero Google services are built on Google Cloud Platform, and that is not likely to change. The difference shows up in the bugginess of the services. When you annoy internal customers, things get fixed. Meanwhile, there are known issues with Google Cloud Platform that are several years old with no workarounds.
The way Google deploys applications is fundamentally different than the model they expose to customers. Google's cloud platform runs on top of their real resource management system, not the other way around.
I worked there in 2011 and I'm pretty sure not everything was on AWS at that time.
I don't know how to square that with the tweet: my own understanding could be wrong, or it could be a misleading stat, like 100% of public traffic hitting a server on AWS infrastructure while not all of the other backend services were on it.
We found that we could still go into their console, SSH into the boxes (from the browser-based tool), and reboot them from there. When they came back up, they worked. May have just been good luck, happening to come back up on hardware that isn't behind the bad routers or whatever it is.
Edit: Also, we found that our instances were reachable (they mostly provide a JSON-based API over HTTP), in the sense that they were receiving the incoming HTTP/HTTPS traffic they normally get. But the responses were not getting back out to whoever requested them... lost on the way back out or something.
On post-incident reports: can we ever move past the most common and ambiguous bit of technology lingo, "network issue"? If you identified it as a network issue (route black hole, prefix hijacking, resource exhaustion, consistency problems, etc.), then you probably know enough to elaborate on the specifics.