Hacker News | hurstdog's comments

Checking specifically for GCS at https://status.cloud.google.com/summary ...

There are two outages there: one that affected only a few projects, and one that affected service only in the central US (we have regions over much of the world).

Anecdotally, from watching and working with other services internally I don't think most of our outages affect all regions. We actually spend a significant amount of engineering effort ensuring that we're as decoupled as possible.

Disclaimer: I'm an Engineering Manager on Google Cloud Storage.


On the flip side, many outages lately have hit an entire region, making the availability zone bit not so useful. For example, the last US Central load balancer issue took out the entire central region for anyone using it.

Not sure what happened there, but I think a deploy should be rolled out to one availability zone at a time for hosted things (like the load balancer).


Is there documentation anywhere that talks about how Google Cloud (any product) creates isolation between various regions while at the same time exposing a simple "regionless" programming model?


Probably the best reference is the SRE book, specifically the chapters on load balancing and distributed consensus protocols.

Other than that, the general approach is to minimize global control planes and dependencies in our software stack. In the case of GCS, we do have a single namespace which means we need to look up the locations of data early in the request. Once we know the locations of data we can route the request to the right datacenter to serve it. That global location table is highly replicated and cached, of course.
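A minimal sketch of the lookup-then-route idea described above, in Python. Everything here is an assumption for illustration: the table contents, the `Location` and `LocationCache` names, and the routing string are invented and say nothing about how GCS is actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    region: str
    datacenter: str


# Hypothetical stand-in for the replicated global location table:
# object name -> where its data lives.
LOCATION_TABLE = {
    "photos/cat.jpg": Location("us-central", "dc-a"),
    "logs/2024.txt": Location("europe-west", "dc-f"),
}


class LocationCache:
    """Local cache in front of the global table."""

    def __init__(self, table):
        self._table = table
        self._cache = {}
        self.global_lookups = 0  # counts cache misses

    def lookup(self, name):
        if name not in self._cache:
            self.global_lookups += 1
            # Raises KeyError if the object is unknown.
            self._cache[name] = self._table[name]
        return self._cache[name]


def route(cache, name):
    """Resolve the object's location first, then pick the serving datacenter."""
    loc = cache.lookup(name)
    return f"{loc.datacenter}.{loc.region}"
```

The only global step is the location lookup; once it is cached, repeated requests for the same object are routed without touching the global table again.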

When outages happen, most are caused by changes to the stack, so we are also careful to roll out code or configuration slowly, gradually increasing the blast radius after each step has been proven safe. For example, we roll out new binaries first to a few canary instances in one zone, then to a few instances in many regions, then to a full region, then to the world, all spread over a few days.
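The staged rollout described above can be sketched roughly as follows. The stage names come from the comment; the `deploy` and `is_healthy` callbacks are hypothetical placeholders for whatever a real release system would do, and the multi-day soak time between stages is left out.

```python
STAGES = [
    "canaries in one zone",
    "a few instances in many regions",
    "one full region",
    "the world",
]


def staged_rollout(stages, deploy, is_healthy):
    """Deploy stage by stage, widening the blast radius only after a healthy check.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Stop here so later, larger stages never receive the bad change.
            return completed, stage
        completed.append(stage)
    return completed, None
```

A failure at "one full region" halts the rollout before "the world" is ever touched, which is the point of ordering stages from smallest to largest.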

Disclaimer: I'm an Engineering Manager on Google Cloud Storage.


Ok, I dug into this a bit. You got stuck in a trap in our current setup. We've improved the docs slightly to make this less common, we're going to fix up the error messages to be more clear, and we have a longer-term plan to make this much more push-button.

Lastly, I'm going to go chase people down about getting you a comp for this.

Apologies for the trouble. -Andrew


Thanks a lot! I appreciate it!


If it helps, I'll dig into this when I get into work. This is the first I've heard of this case.

-Andrew, GCS


I think you're joking, but just for those that don't think that, we don't actually do that.


The following is direct from Google's Security and Privacy:

"In order to provide some of the core features in Google Apps products, our automated systems will scan and index some user data. For example:

-Email is scanned so we can perform spam filtering and virus detection.

-Priority Inbox, a Gmail feature, scans email message to identify which messages are considered important and which are considered not important.

-If you are using Google Apps (free edition), email is scanned so we can display contextually relevant advertising in some circumstances.

-Some user data, such as documents and email messages, are scanned and indexed so your users can privately search for information in their own Google Apps accounts.

*Google Apps data is not part of the general google.com index, except when users choose to publish information publicly."


Then how can you scroogle and boobble people without that information?


Where is the stuff about the creepy invasion and abuse of our privacy?

I know, I know, you don't do that. Nope, no one does. Everything is fine and dandy. Smile, everyone, no problem here.


There definitely is a form of a search bubble though, right?


I don't remember if Hotmail used to run ads.


It did and they were display ads. Incredibly distracting.

At one point the "homepage" of Hotmail was a huge ad space, stories from MSN, and a tiny link to "Inbox."

The new Outlook is so much better. If Hotmail had evolved that way earlier, I would not have switched to Gmail.


So it is a bit hypocritical of MS to talk about ads in Gmail. But then again, were those ads contextual?


The "Scroogled" campaign has nothing to do with products or customers; the point is to broaden the PR base for Microsoft's ongoing campaign to convince the feds to initiate anti-trust proceedings against Google. That is why they hired a political PR executive to create the campaign.


The issue that you're complaining about was due to the incident linked at [1].

I can't go into the technical details as to why this happened, but I can roughly explain that it was due to the CAP theorem, essentially "Consistency, Availability, Partition Tolerance. Choose two." [2]

Furthermore, you have to choose partition tolerance [3]. The delivery delays that were seen yesterday happened because we choose consistency over availability in our systems.

In fact, most of the Gmail outages I see people complain about on Hacker News happen because we won't sacrifice consistency of user accounts. It's a different problem from serving web search indexes or Facebook timelines at huge scale, because in those cases most people won't notice or care if a few entries are missing. When you're searching for an email, you know what email you expect to find and you'll get angry if it isn't there.

Users won't stand for an email showing up one day, disappearing the next hour, and then coming back later (which is what could happen in some designs for eventual consistency when serving from different datacenters).
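To make the consistency-over-availability trade-off concrete, here is a textbook-style sketch of a store that refuses requests when it cannot reach a majority of replicas. This is an assumption-laden illustration, not a description of Gmail's design: `Replica`, `ConsistentStore`, `Unavailable`, and the caller-supplied version numbers are all invented for the example.

```python
class Unavailable(Exception):
    """Raised when a majority of replicas cannot be reached."""


class Replica:
    def __init__(self):
        self.data = {}         # key -> (version, value)
        self.reachable = True  # flipped to False to simulate a partition


class ConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas

    def _majority(self):
        live = [r for r in self.replicas if r.reachable]
        if len(live) <= len(self.replicas) // 2:
            # Prefer failing the request over serving possibly stale data.
            raise Unavailable("no majority of replicas reachable")
        return live

    def write(self, key, value, version):
        for replica in self._majority():
            replica.data[key] = (version, value)

    def read(self, key):
        # Any two majorities overlap, so the newest acknowledged write is visible.
        found = [r.data[key] for r in self._majority() if key in r.data]
        if not found:
            raise KeyError(key)
        return max(found, key=lambda pair: pair[0])[1]
```

With three replicas, losing one still allows reads and writes. Losing two makes the store raise `Unavailable` rather than let a message appear, vanish, and reappear depending on which replica answers.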

Thus, Gmail availability is lower sometimes because we make sure that all of your data is there all the time. We're insane about it, and we have huge jobs that run constantly on our systems to ensure that we're even resilient to bad hardware. With those we regularly find single bit errors and bad CPUs.

So, as a Gmail engineer, I'm sorry that there were delivery delays yesterday, and all I can say is that every time these happen we tweak and redesign our systems to make them rarer and to improve Gmail's uptime. We'll never have the snappy response and perfect uptime[4] of a computer under your desk. But at the same time a hurricane could take out one of our datacenters and we won't lose your data.

-Andrew, a Gmail Engineer.

1. http://www.google.com/appsstatus#hl=en&v=issue&ts=13...
2. http://en.wikipedia.org/wiki/CAP_theorem
3. http://codahale.com/you-cant-sacrifice-partition-tolerance/
4. For some definitions of perfect.


Thanks, Andrew. This is really helpful.


I'm not sure why you're getting those quota errors, but I can assure you that we're not trying to pressure people to stop using SMTP and IMAP.

If you post on the forums, some user support folks should be able to help you out and escalate to engineering if they find you're hitting a bug.

Sorry for the issues you're seeing.


As a counterexample, I have about 2.5 million emails in my work account and it still works really well. (I work at Google, on the Gmail backend.)

If you've debugged it yourself and can't get to a good resolution (which it looks like you have), please post in our help forums. There are a few people that are dedicated to reading the forum and helping users, and when they hit dead ends for various reasons they'll escalate to the backend team directly and we investigate and fix the account if there is a bug.

Gmail tends to work very well for heavy users, but as with all complex software there are edge cases. Sometimes people fall into those edges and if you escalate through the help forums we'll see a rise in specific types of user escalations and fix whatever bug is causing it.

Hope that helps, -Andrew


I was coming to this page just to post a comment like yours above. Thanks for doing that.

I work on the Gmail backend team and we regularly see blogs go by such as this one. The vast majority of the times that prominent bloggers report slowness it's because they do have tons of apps polling their account via IMAP or other sync methods. All of these apps end up competing for resources to your account with the web UI and thus you experience slowness.

Internally we have accounts with upwards of 100G of mail that are still very usable, so we know Gmail scales. We also have quite a few people internally focusing specifically on finding and fixing these types of problems so that eventually it won't matter how many clients you have or how large your mailbox is, but these things take time. Gmail is a huge ship; we can't turn on a dime.

So yeah, check that IssuedAuthSubTokens page and revoke access to any random services that you've tried out and forgot about.

Hope that helps. -Andrew


Sorry, but this isn't the case for me. IssuedAuthSubTokens shows a Google Talk authorization (bitlbee...), Google Calendar, and Chrome.

But if I try subscribing to LKML for a few weeks, Gmail becomes completely unusable.


I'm sorry that trick isn't working for you. Since you mention that LKML causes the slowdown, it's possible that you're getting an inordinate amount of mail. It's also possible you've got an aggressive IMAP client that isn't using one of those auth tokens (another common cause of slowdowns).

As I mentioned in another comment*, if this persists please post on the help forums. There are people dedicated to helping there, and if it's a legitimate bug the backend team investigates and we fix the issues.

Hope that helps,

-Andrew

* http://news.ycombinator.com/item?id=4478100


I'm the OP. Andrew, thanks for responding here. I have a ton of connected IssuedAuthSubTokens apps, but most of them are just accessing my Calendar, Contacts, etc. The only one that has access to my Gmail is Baydin (Boomerang) which, as mentioned in the post, I am loath to remove. But I wouldn't think that one connected Gmail app would be enough to cause slowness, would it?


It all depends. I don't know how Boomerang is implemented, but it's possible that with an account your size it's hitting some degenerate condition.

One of our support folks should be emailing you and they'll help debug where the slowness lies, on our side or through some other interaction. If it's a bug, we'll fix it up shortly.


A little late to the party here, but Boomerang doesn't do anything at all except for when you Boomerang something and when it's scheduled to be sent/returned. The rest of the time, it doesn't make any API calls.


The response is much appreciated!

Suggestion:

Why not give the UI priority over other polls (allocate more bandwidth to it) for 5-10 minute chunks when it's active?


There are people working on those sorts of improvements, but it's not that straightforward so it takes time to get it right.

The reason it's not that straightforward is that lots of operations change state in your account, so you need proper ordering of requests for everything that comes in if you want to ensure consistency. Furthermore, when you do a UI operation you want it to happen now, and anything longer than that is frustrating. But if there is a large IMAP operation in the background, we can't just pre-empt it and restart it, because it might have already modified some state, and not all changes are idempotent. So you have to wait, and you experience slowness.

Add in message delivery, various background operations for different features, Android sync, different third-party add-ons that hit via IMAP or other sync protocols, and a stack that is more than a few layers deep, and it gets very complex, very fast.
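A small sketch of why restarting a half-finished operation is risky when not every change is idempotent, and of serving an account's requests strictly in arrival order. The `Mailbox` fields, the two operations, and `AccountQueue` are hypothetical names chosen for illustration; they are not Gmail internals.

```python
from collections import deque


class Mailbox:
    def __init__(self):
        self.labels = set()
        self.unread = 0


def add_label(mailbox, label):
    """Idempotent: applying it twice leaves the same state as applying it once."""
    mailbox.labels.add(label)


def deliver_message(mailbox):
    """Not idempotent: every application changes the state again."""
    mailbox.unread += 1


class AccountQueue:
    """Applies one account's operations in arrival order, one at a time."""

    def __init__(self, mailbox):
        self.mailbox = mailbox
        self.pending = deque()

    def submit(self, op, *args):
        self.pending.append((op, args))

    def drain(self):
        while self.pending:
            op, args = self.pending.popleft()
            # Each operation runs to completion before the next one starts.
            op(self.mailbox, *args)
```

Re-running `add_label` after an interruption is harmless, but re-running `deliver_message` counts the same message twice. That is why a background operation that has already changed state cannot simply be aborted and retried to let a UI request jump ahead.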

Hope that helps,

-Andrew


Hi Andrew, sorry to hijack the thread. Just wanted to say you guys are doing an amazing job at full-text email searching. Do you have any paper on that? I know Google publishes research papers and technical reports but couldn't find one about that.

Any pointer would be much appreciated. Thanks!

