Amazon eats its own dogfood with AWS services, while Google does not, and no amount of marketing is going to make up for that. Every time I give it a try, I find random bugs everywhere, and I know there are not enough internal engineers feeling the pain to get those bugs fixed.
AWS has had its own share of massive outages. I had a look at the AWS and GCE incident histories, and it looks like the frequency and magnitude of incidents are quite similar. Do you have any factual information that contradicts this?
Perhaps the fact that the GCE outage affected multiple regions around the world. EC2 outages were all localized to a single region (and in most cases limited to a single AZ within that region).
AWS asks you to design your architecture for failure and work across multiple availability zones. It's complicated and expensive, but you can avoid the outage. With Google, on the other hand, you can't, and that's the price you pay when you choose PaaS over IaaS.
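Designing for zone failure mostly comes down to spreading capacity so that losing any one zone only costs you a survivable fraction of it. A minimal sketch of that idea (zone names and instance counts here are purely illustrative, not tied to any real deployment):

```python
def spread_across_zones(instance_count, zones):
    """Assign instances round-robin across zones so capacity is spread
    as evenly as possible; losing one zone loses at most
    ceil(instance_count / len(zones)) instances."""
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement

# 10 instances over 3 hypothetical zones: worst-case loss is 4/10 capacity.
placement = spread_across_zones(10, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(placement)  # {'us-east-1a': 4, 'us-east-1b': 3, 'us-east-1c': 3}
```

The same round-robin principle applies whether you place instances by hand or let an autoscaling group balance across subnets in different zones for you.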
Google Compute Engine is also an IaaS. You can also design your architecture for failure and work across multiple availability zones by using GCE. GCE is not GAE.
The difference is really just in maturity. AWS started years earlier and has had time to work out the kinks. GCE will get there, but it'll need time and experience.
I have spoken to multiple Amazon engineers who build their production services on AWS. (Hint: I never claimed that all of Amazon retail already runs on AWS, nor is that required for my argument.) Zero Google services are built on Google Cloud Platform, and that is not likely to change. The difference shows up in the bugginess of the services. When you annoy internal customers, things get fixed. Meanwhile, there are known issues with Google Cloud Platform that are several years old with no workarounds.
The way Google deploys applications is fundamentally different than the model they expose to customers. Google's cloud platform runs on top of their real resource management system, not the other way around.
I worked there in 2011 and I'm pretty sure not everything was on AWS at that time.
I don't know how to square that with the tweet: my own understanding could be wrong, or it could be a misleading stat, like 100% of public traffic hitting a server on AWS infrastructure while not all of the other backend services were on it.
We found that we could still go into their console, SSH into the boxes (from the browser-based tool), and reboot them from there. When they came back up, they worked. May have just been good luck, happening to come back up on hardware that isn't behind the bad routers or whatever it is.
Edit: Also, we found that our instances were reachable (they mostly provide a JSON-based API over HTTP), in the sense that they were receiving the incoming HTTP/HTTPS traffic they normally get. But the responses were not getting back out to whoever requested them... lost on the way back out or something.
On post-incident reports: can we ever move past the most common and ambiguous bit of technology lingo, "network issue"? If you identified it as a network issue (route black hole, prefix hijacking, resource exhaustion, consistency problems, etc.), then you probably know enough to elaborate on the specifics.