There are two outages there, one that only affected a few projects and one that affected only service in the central US (we have regions over much of the world).
Anecdotally, from watching and working with other services internally I don't think most of our outages affect all regions. We actually spend a significant amount of engineering effort ensuring that we're as decoupled as possible.
Disclaimer: I'm an Engineering Manager on Google Cloud Storage.
On the flip side, many recent outages have hit an entire region, which makes the availability-zone distinction not so useful. For example, the last US Central load balancer issue took out the entire central region for anyone using it.
Not sure what happened there, but I think a deploy should be rolled out one availability zone at a time for hosted things (like the load balancer).
Is there any documentation that talks about how Google Cloud (any product) creates isolation between regions while at the same time exposing a simple "regionless" programming model?
Probably the best reference is the SRE book, specifically the chapters on load balancing and distributed consensus protocols.
Other than that, the general approach is to minimize global control planes and dependencies in our software stack. In the case of GCS, we do have a single namespace which means we need to look up the locations of data early in the request. Once we know the locations of data we can route the request to the right datacenter to serve it. That global location table is highly replicated and cached, of course.
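As a toy illustration of the flow described above (all names here are made up for illustration, not actual GCS internals): a replicated global location table is consulted early in the request, a local cache keeps that lookup cheap, and the request is then forwarded to the data's home region.

```python
# Hypothetical sketch of early location lookup + routing.
# LOCATION_TABLE stands in for the highly replicated global table;
# LOCAL_CACHE stands in for per-frontend caching of its entries.

LOCATION_TABLE = {
    "photos-bucket": "us-central1",
    "logs-bucket": "europe-west1",
}

LOCAL_CACHE = {}

def lookup_region(bucket):
    """Resolve a bucket to its home region, checking the cache first."""
    if bucket not in LOCAL_CACHE:
        LOCAL_CACHE[bucket] = LOCATION_TABLE[bucket]
    return LOCAL_CACHE[bucket]

def route_request(bucket, payload):
    """Look up where the data lives early, then hand off to that region."""
    region = lookup_region(bucket)
    return f"forwarded to {region}: {payload}"
```

The point of the cache layer is that the global table is on the critical path of every request, so it has to be both heavily replicated and cached to avoid becoming a single global dependency.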
When outages happen, most are caused by changes to the stack, so we also are careful to roll out code or configuration slowly and carefully, slowly increasing the blast radius after it's been proven safe. For example rolling out new binaries first to a few canary instances in one zone, then to a few instances in many regions, then to a full region, then to the world, all spread over a few days.
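The staged rollout described above can be sketched as a sequence of waves, each widening the blast radius only after the previous one has soaked. This is an illustrative toy, not Google's actual release tooling; the wave names and soak times are assumptions.

```python
# Illustrative staged-rollout schedule: canary -> multi-region sample
# -> full region -> global, halting at the first unhealthy signal.

ROLLOUT_WAVES = [
    {"name": "canary",       "scope": "a few instances in one zone"},
    {"name": "multi-region", "scope": "a few instances in many regions"},
    {"name": "full-region",  "scope": "every instance in one region"},
    {"name": "global",       "scope": "all regions"},
]

def run_rollout(healthy):
    """Advance wave by wave; stop and roll back on the first bad signal.

    `healthy` is a callback that inspects monitoring after each wave.
    """
    completed = []
    for wave in ROLLOUT_WAVES:
        completed.append(wave["name"])
        if not healthy(wave):
            return completed, "rolled back"
    return completed, "done"
```

The key property is that a bad binary is caught while it affects a few canary instances, long before it can reach every region at once.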
Disclaimer: I'm an Engineering Manager on Google Cloud Storage.
Ok, I dug into this a bit. You got stuck in a trap in our current setup. We've improved the docs slightly to make this less common, we're going to fix up the error messages to be more clear, and we have a longer-term plan to make this much more push-button.
Lastly, I'm going to go chase people down about getting you a comp for this.
The following is direct from Google's Security and Privacy:
"In order to provide some of the core features in Google Apps products, our automated systems will scan and index some user data. For example:
-Email is scanned so we can perform spam filtering and virus detection.
-Priority Inbox, a Gmail feature, scans email messages to identify which messages are considered important and which are considered not important.
-If you are using Google Apps (free edition), email is scanned so we can display contextually relevant advertising in some circumstances.
-Some user data, such as documents and email messages, are scanned and indexed so your users can privately search for information in their own Google Apps accounts.
*Google Apps data is not part of the general google.com index, except when users choose to publish information publicly."
The "Scroogled" campaign has nothing to do with products or customers; the point is to broaden the PR base for Microsoft's ongoing campaign to convince the feds to initiate anti-trust proceedings against Google. That is why they hired a political PR executive to create the campaign.
The issue that you're complaining about was due to the incident linked at [1].
I can't go into the technical details as to why this happened, but I can roughly explain that it was due to the CAP theorem, essentially "Consistency, Availability, Partition tolerance: choose two." [2]
Furthermore, you have to choose partition tolerance [3]. The delivery delays seen yesterday happened because we choose consistency over availability in our systems.
In fact, most of the outages I see people complain about on Hacker News related to Gmail are because we won't sacrifice consistency of user accounts. It's a different problem from huge-scale serving of web search indexes or Facebook timelines, because in those cases, if you're missing a few entries, most people won't notice or care. When you're searching for an email, you know which email you expect to find, and you'll get angry if it isn't there.
Users won't stand for an email showing up one day, disappearing next hour, and then coming back later (which is what could happen in some designs for eventual consistency when serving from different datacenters).
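A textbook way to see the consistency-over-availability trade-off above is a majority-quorum rule: a replica group that can't reach a strict majority refuses (delays) the operation rather than serving possibly-stale or divergent data. This is a generic sketch, not Gmail's actual replication design.

```python
# Toy CP-style decision: commit only with a strict majority of replicas.
# During a partition, the minority side delays work instead of
# accepting writes that could later conflict or "disappear".

def quorum_write(replicas_reachable, replicas_total):
    """Accept the write only if a strict majority would acknowledge it."""
    if replicas_reachable * 2 > replicas_total:
        return "committed"
    # Sacrifice availability: the mail is delayed, never inconsistent.
    return "retry later"
```

Under this rule, mail delivery during a partition turns into a visible delay, which is exactly the failure mode users saw, rather than mail that appears, vanishes, and reappears.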
Thus, Gmail availability is sometimes lower because we make sure that all of your data is there all the time. We're insane about it, and we have huge jobs that run constantly on our systems to ensure that we're even resilient to bad hardware. With those we regularly find single-bit errors and bad CPUs.
So, as a Gmail engineer, I'm sorry that there were delivery delays yesterday, and all I can say is that every time these happen we tweak and redesign our systems to make them more rare and to improve Gmail's uptime. We'll never have the snappy response and perfect uptime[4] of a computer under your desk. But at the same time a hurricane could take out one of our datacenters and we won't lose your data.
I have about 2.5 million emails in my work account and it still works really well, as a counter example. (I work at Google, on Gmail backend).
If you've debugged it yourself and can't get to a good resolution (which it looks like you have), please post in our help forums. There are a few people that are dedicated to reading the forum and helping users, and when they hit dead ends for various reasons they'll escalate to the backend team directly and we investigate and fix the account if there is a bug.
Gmail tends to work very well for heavy users, but as with all complex software there are edge cases. Sometimes people fall into those edges and if you escalate through the help forums we'll see a rise in specific types of user escalations and fix whatever bug is causing it.
I was coming to this page just to post a comment like yours above. Thanks for doing that.
I work on the Gmail backend team and we regularly see blogs go by such as this one. The vast majority of the times that prominent bloggers report slowness it's because they do have tons of apps polling their account via IMAP or other sync methods. All of these apps end up competing for resources to your account with the web UI and thus you experience slowness.
Internally we have accounts with upwards of 100G of mail still being very usable, so we know Gmail scales. Also we have quite a few people internally focusing specifically on finding and fixing these types of problems so that eventually it won't matter how many clients you have or how large your mailbox is, but these things take time. Gmail is a huge ship, we can't turn on a dime.
So yeah, check that IssuedAuthSubTokens page and revoke access to any random services that you've tried out and forgot about.
I'm sorry that trick isn't working for you. As you mention LKML causes slowdown, it's possible that you're getting an inordinate amount of mail. It's also possible you've got an aggressive IMAP client that isn't using one of those auth tokens (another common slowdown cause).
As I mentioned in a comment below*, if this persists please post on the help forums. There are people dedicated to helping there, and if it's a legitimate bug the backend team investigates and we fix the issues.
I'm the OP. Andrew, thanks for responding here. I have a ton of connected IssuedAuthSubTokens apps, but most of them are just accessing my Calendar, Contacts, etc. The only one that has access to my Gmail is Baydin (Boomerang) which, as mentioned in the post, I am loath to remove. But I wouldn't think that one connected Gmail app would be enough to cause slowness, would it?
It all depends. I don't know how Boomerang is implemented, but it's possible that with an account your size it's hitting some degenerate condition.
One of our support folks should be emailing you and they'll help debug where the slowness lies, on our side or through some other interaction. If it's a bug, we'll fix it up shortly.
A little late to the party here, but Boomerang doesn't do anything at all except for when you Boomerang something and when it's scheduled to be sent/returned. The rest of the time, it doesn't make any API calls.
There are people working on those sorts of improvements, but it's not that straightforward so it takes time to get it right.
The reason it's not that straightforward is that lots of operations change state in your account, so you need proper ordering of every incoming request if you want to ensure consistency. Furthermore, when you do a UI operation you want it to happen now; anything longer than that is frustrating. But if there is a large IMAP operation running in the background, we can't just preempt it and restart it later, because it might have already modified some state and not all changes are idempotent. So you have to wait, and you experience slowness.
Add in message delivery, various background operations for different features, android sync, different third-party add-ons that hit via IMAP or other sync protocols, and a stack that is more than a few layers deep and it gets very complex, very fast.
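The ordering problem described above can be sketched as a per-account mutation queue: every state-changing operation for one account is serialized, so an interactive UI action queued behind a long non-idempotent bulk operation has to wait for it to finish. The class and operation names are illustrative, not Gmail internals.

```python
import queue

class AccountMutator:
    """Applies state-changing operations for one account in strict order."""

    def __init__(self):
        self.ops = queue.Queue()
        self.log = []

    def submit(self, name, steps):
        """Enqueue an operation; `steps` stands in for its amount of work."""
        self.ops.put((name, steps))

    def drain(self):
        # Each operation runs to completion. A non-idempotent bulk IMAP
        # operation can't safely be preempted and restarted, so the UI
        # operation queued behind it simply waits its turn.
        while not self.ops.empty():
            name, steps = self.ops.get()
            for _ in range(steps):
                pass  # stand-in for real mutation work
            self.log.append(name)
        return self.log
```

In this model the "slowness" a user feels is exactly the interactive operation sitting in the queue behind a large background sync.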
Hi Andrew, sorry to hijack the thread. Just wanted to say you guys are doing an amazing job at full-text email searching. Do you have any paper on that? I know Google publishes research papers and technical reports but couldn't find one about that.