If you go back in time and watch threads during almost ALL web hosting outages, the #1 COMPLAINT is always lack of communication with users.
PLEASE, web hosting companies, get the point, we want constant communications, even if you have nothing new to report. WE are smart too, let us know what you're trying, what's working, what's not, maybe your USERS can help you fix the problem.
I've been on the other side of this wall, though not for anything near this large of a service.
A lot of shops run with minimal staff, because -- let's face it -- cost is a very large factor of competition in the hosting industry.
Then a problem occurs, and if it were an easy problem to solve, they'd just solve it straight off and then do damage control in PR. Or, the problem is more like this one, and it requires some hard and fast thinking on the part of the staff that are available.
If you're one of those staff, the last thing you want to do is distract yourself by logging in to a forum every twenty minutes, or checking email, or writing status updates every thirty minutes. You instead feel incredible pressure to get the problem fixed immediately, meaning you don't even stop for a bathroom break if you can help it.
As a sysadmin, you'll also justify your decisions by saying that telling the users what's going on won't really change anything; it's not like they're assisting you in troubleshooting. Even if they're intelligent, they still aren't familiar with the specific systems and network topology and other good stuff involved in your operations, so you'll probably spend a lot more time answering their suggestions than you would if you just got in and figured it out yourself.
Though as a user, that kind of attitude is extremely frustrating.
You have an excellent point because I too host websites for hundreds of people and have been in the middle of a disaster. It's hard to keep updating a blog, but I got around to it about every hour regardless. Usually it just said what I was working on and that nothing had changed. Two sentences, but everyone knew what was going on. I also changed my outgoing voice mail recording which helped a ton. Most people didn't even leave messages.
It's REALLY hard, but well worth it in the end because you get praise for your communication instead of hate mail for abandoning your customers.
Things just really don't work on human time/relationship scales anymore, especially when it comes to internet services. Communications can move in milliseconds now, but diagnosing and fixing complex issues is still on the order of hours/years/lifetimes!
It doesn't help that your customers often know about issues you are having before you do. So say ~30min to get notified of an issue (which is not unrealistic for a non-obvious issue that isn't something you actively monitor), a quick 10min investigation to assess the scale of the issue, another 3-5min to post a couple of notifications, and that is the better part of an hour gone, and you haven't even got around to trying to come up with a fix.
This is why Slicehost rock, they always communicate brilliantly. What I find amazing here is that the Linode representative hasn't even said sorry in the first post they made and hasn't in the second post either, crazy.
EDIT: Wait? Does Slicehost not cost more? Please, correct me if I'm wrong. But from what I'm seeing you get less RAM, Bandwidth, and Space for the same price. Btw, I use Slicehost, and just recently they had problems with my server. Otherwise they're flawless though. Great support too.
Not sure why this is downmodded; but Slicehost is more expensive than Linode. Both have good service and web interfaces, but Linode gives you more RAM for less cash.
They are down today, I guess, but that is to be expected for something you pay $20 a month for.
"This is why Slicehost rock, they always communicate brilliantly."
Um, ok, here's some counter-anecdotal info: My company at the time had an account on Slicehost, and every so often our instance would just go offline.
No word from anyone, just dead. After some e-mail, we learned that there was some concern of malicious activity, so the image was shut down. Now, if Slicehost was getting complaints and was planning on shutting down the instance, why couldn't they give some warning so some action could be taken? Or even say something right after shutting down a suspicious instance? Instead, we were left wondering why our site was not up.
I never figured out just how the instance was supposedly compromised, and eventually left Slicehost for eApps.
Nowadays I use Linode, or The Planet.
I know a few people who swear by Slicehost, but not everyone has a rosy experience.
I've never tried Slicehost before, but I've actually had really good results with Rackspace's "fanatical" support. Slicehost's support could have only gotten better! ;-)
My ISP Internode always says when they expect the next status update to be. Usually there's a deadline of 2-3 hours between updates and recently there was a period of 72 hours with several cascading issues ( http://advisories.internode.on.net/item/6611/ ). There were status updates all the way through and staff communicating with users out-of-band to discover the extent of the issues so they could resolve them.
What is notable to me is the surprise received by those who purchased hosting in multiple geographies to survive a datacenter-wide issue. I hadn't thought much before about the effect of centralized IT management decisions on availability. Perhaps those who really need uptime will now consider using another hosting provider as fail-over. Not sure how the DNS issues would play out as network-level load balancing isn't something I know much about.
It's disconcerting to discover the cause of my server problems today via a thread on Hacker News instead of from an email or their RSS feed or something.
That being said, I think Linode is great and I've had a great experience with them so far.
Looks like they did an update on their host machines. The update unfortunately meant that many hosts marked linodes as being shut down, when they should have been marked up.
This meant many linodes were unreachable (No network).
linode.com was extremely slow also, don't know if it's related.
They're rebooting hosts to fix the issue, which is just horrible.
Here's a copy of the original post, since their own website and forum are slow/offline:
That's rough. I'm sure some heads are gonna roll come tomorrow morning.
EDIT: I just recently signed up for Linode and love their service. Mine is still up. The only thing I was a little freaked out by is that their main site uses ColdFusion. Poor judgement?
It doesn't look like it was a graceful shutdown either; I had to enter maintenance mode and fsck by hand. Lots of bad free block counts, but now that it's booted, I'm not seeing any corruption. I just use my linode for fun, so I don't really care too much, but it would be nice to get a better explanation than that initial explanation of hosts being erroneously marked as down.
Personally, I haven't noticed any downtime today on our linodes in Dallas and Atlanta. I don't know. I get the complaint about lack of communication; on the other hand, I'd rather have them working to solve the problem than sitting and updating a blog.
Was it a dumb move to push the update to multiple data centers in the middle of the day? Maybe, but then, without knowing all the details of how and why it was done, that's a big assumption to jump to.
Linode is still, hands-down, the best hosting provider I've ever used (and yes, I had a slicehost account for a long time), so I'll give them a pass on this one. Sometimes problems just happen.
It's not unreasonable to be shocked at an upgrade in the middle of the business day for the majority of their customers.
Every significant online business I've worked with does deployments and upgrades off-peak, without exception. In fact one prefers to do production changes on Friday evenings so that the technical staff has the entire weekend to come in and resolve any problems that arise before the following Monday morning peak.
Oh, no, I didn't mean to imply that. I don't think anyone here is being unreasonable. I just meant that I've had enough awesome from Linode that I'm willing to give them a pass on this one. Everyone screws up now and then. :)
I lucked out, too. The world's not fair: I'm hosting a bunch of stupid sites on my Linode in New Jersey (like http://histaniputyourpictureonthewebihopeyoudontmind.com/ ), whereas I bet some other people with Linode problems are losing money. I feel for them.
EDIT: I just checked my uptime and it looks like my linode was rebooted. But I never noticed (and I was actually logged in and working there a lot today).
Here's the initial note from the forum. I copied it before and can't reload, so I don't know the user that posted it.
"During a shared library update distributed to our hosts, a number of the hosts incorrectly have marked Linodes as being shut down. To recover from this we may be issuing host reboots to upgrade their software to our latest stack, and then bringing the Linodes to their last state. We're working on this now and expect to have additional updates shortly. We'll also be notifying those affected via our support ticket system. Please stand by."
For what it's worth, I also have a VPS in Dallas that is running fine right now, and has been up since my last manual reboot (more than 30 days ago).
I recently (in the last week) signed up for the most basic Linode package for a pet project. Many of the posts on the forums note Linode's stellar track record, which is a bit reassuring, but doesn't really offset the lack of communication Linode has had with this.