Google Compute Engine Is Down (cloud.google.com)
113 points by jread on Feb 19, 2015 | hide | past | favorite | 33 comments


Google Cloud Platform is a cluster, in the http://www.urbandictionary.com/define.php?term=Cluster&defid... sense.

Amazon eats its own dogfood with AWS services, while Google does not, and no amount of marketing is going to make up for that. Every time I give it a try, I find random bugs everywhere, and I know there are not enough internal engineers feeling the pain to get those bugs fixed.


AWS has had its own share of massive outages. I had a look at the AWS and GCE incident histories, and it looks like the frequency and magnitude of incidents are quite similar. Do you have any factual information that contradicts this?


Perhaps the fact that the GCE outage affected multiple regions around the world. EC2 outages were all localized to a single region (and in most cases limited to a single AZ within that region).


AWS asks you to design your architecture for failure and work across multiple availability zones. It's complicated and expensive, but that way you can avoid this kind of outage. With Google, on the other hand, you can't, and that's the price you pay when you choose PaaS over IaaS.


Google Compute Engine is also an IaaS. You can design your architecture for failure and work across multiple availability zones with GCE too. GCE is not GAE.
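
For what it's worth, here is a minimal client-side sketch of that "design for failure" idea. The endpoint names are made up, and this isn't GCE- or AWS-specific; it's just the general pattern of keeping replicas in more than one zone and failing over between them:

    # Hypothetical zone-level failover sketch: try a replica in one zone,
    # fall back to a replica in another zone when the first one fails.
    import requests

    ZONE_ENDPOINTS = [
        "https://app-zone-a.example.com",  # hypothetical replica in zone a
        "https://app-zone-b.example.com",  # hypothetical replica in zone b
    ]

    def fetch_with_failover(path, timeout=5):
        last_error = None
        for base in ZONE_ENDPOINTS:
            try:
                resp = requests.get(base + path, timeout=timeout)
                resp.raise_for_status()
                return resp
            except requests.RequestException as err:
                last_error = err  # this replica failed; try the next zone
        raise RuntimeError("all zone replicas failed") from last_error

Cross-region failover works the same way, just with endpoints in different regions instead of different zones.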


GCE had a multi region outage so multi zone within a region wouldn't have helped in this case.


I agree. This is the reason why this outage is more serious than others.


My mistake, I mixed up GCE with GAE.


The difference is really just in maturity. AWS started years earlier and has had time to work out the kinks. GCE will get there, but it'll need time and experience.


Yep, exactly. This GCE outage comes at a great time to help me answer the question, should I use GCE over AWS? Probably not yet.


Whoa, what makes you think Amazon retail runs on AWS? That's a core founding myth...


I have spoken to multiple Amazon engineers who build their production services on AWS. (Hint: I never claimed that all of Amazon retail already runs on AWS, nor is that required for my argument.) Zero Google services are built on Google Cloud Platform, and that is not likely to change. The difference shows up in the bugginess of the services. When you annoy internal customers, things get fixed. Meanwhile, there are known issues with Google Cloud Platform that are several years old with no workarounds.


[deleted]


Do you have any facts that support your claim?


The way Google deploys applications is fundamentally different than the model they expose to customers. Google's cloud platform runs on top of their real resource management system, not the other way around.


Amazon didn't run entirely on AWS until four years after AWS was created: https://twitter.com/cvwadored/status/81400058624475136


I worked there in 2011 and I'm pretty sure not everything was on AWS at that time.

I don't know how to square that with the tweet: my own understanding could be wrong, or it could be a misleading stat, e.g. 100% of public traffic hits a server on AWS infrastructure, but not all of the backend services were on AWS.


The Q4 peak hardware purchase graphs that Amazon retail presents. Different stacks.

Have you ever seen an AWS outage line up with an Amazon retail outage?

Dogfooding is a myth - Amazon retail's not betting its business on AWS.


We found that we could still go into their console, SSH into the boxes (from the browser-based thing), and reboot the boxes from there. When they came back up, they worked. May have just been good luck, just happening to come back up on hardware that's not behind the bad routers or whatever it is.

Edit: Also, we found that our instances were reachable (they mostly provide a JSON-based API over HTTP), in the sense that they were getting the incoming HTTP/HTTPS traffic that they normally do. But the responses were not getting back out to whoever requested them... lost on the way back out or something.
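
To illustrate the distinction being described, here's a rough sketch (hypothetical URL, not the poster's actual service) of how a client can tell "couldn't connect at all" apart from "connected, but the response never came back", which is roughly what a broken return path looks like from the outside:

    # Distinguish a connect failure from a response that never arrives.
    import requests

    URL = "https://my-instance.example.com/api/health"  # hypothetical endpoint

    try:
        # timeout=(connect, read): fail fast on the handshake, wait longer for the body
        resp = requests.get(URL, timeout=(3, 15))
        print("OK:", resp.status_code)
    except requests.exceptions.ConnectTimeout:
        print("couldn't even open a TCP connection (instance unreachable)")
    except requests.exceptions.ReadTimeout:
        print("connection opened, but no response came back (return path may be broken)")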


This worked for me too, about an hour ago. So not just good luck it seems :-)


All our status VMs are unreachable: https://cloudharmony.com/status-for-google


I was going to ask them if they will give compensation, but I can't contact them as I don't have "silver support" :/


On post-incident reports: will we ever move past the most common and ambiguous piece of technology lingo, "network issue"? If it was identified as a network issue (route black hole, prefix hijacking, resource exhaustion, consistency issues, etc.), then you probably know enough to elaborate on the specifics.


Preliminary cause is described here: https://status.cloud.google.com/incident/compute/15045


"Incident began at 2015-02-19 15:59"

Is that in future or am I missing something?


I believe it's a typo as all other references seem to point to 22:59 Feb 18 2015.

It's 22:59 Feb 18 2015 also for Google Cloud SQL https://status.cloud.google.com/incident/cloud-sql/17006

Edit: It turns out this was the case. Updated now.


The times on the left, standalone, are Pacific. In the text on the right, UTC.

Edit: uh, wait, this is wrong. It's edited (possibly) or I'm just bleary-eyed (more likely).


That seems to be local time in Hong Kong.

Edit: Just noticed that it says all times are US Pacific. Hm, then it is in the future.


Same here... Let's see what happens.


SSH'ing from the GCE web console seemed to make my instances reachable. Afterward, SSH from the terminal and web access worked.


0:30 passed and no update?

Who would have guessed that? Anyone?


There are updates and clear times for when to expect the next update.


Right, when I posted it said the next update would occur at 0:30, but it was after 0:30 and there was no update.


The page says that all issues are resolved now.



