The last few weeks have taught me that all web hosts suck. Ordinarily, I don’t feel it necessary to write a blog post railing and ranting against a company that annoys me; I do my best to let these things go quickly. However, according to the “sucky research method”, WebFaction has thus far managed to escape any negative criticisms. This wouldn’t be such a problem except that they’re a tad smug about it. Further, I think I am having trouble achieving emotional closure and catharsis on this matter since I find no similarly suffering souls out there with whom to commiserate. So it is with a heavy heart that I feel compelled to type out a petty, petulant “WebFaction sucks” post.
Who knows? Maybe I’m just the unluckiest customer a web host has ever had. My last web host had a RAID system failure on the machine hosting multimedia.cx. You might be saying, as I did, “Wait, remind me again what a RAID system is good for?” When that host failed to restore service within 24 hours, I quickly started scouting for new web hosts. I found WebFaction and figured we could live happily together. Those happy thoughts came crashing down on Monday evening when the site became progressively less responsive until it stopped responding at all. Eventually, their status blog notified us that… well, could you guess? Another RAID failure! Just on the machine that hosts my site.
Now, I could suspect conspiracy, or divine intervention. Another theory is that “RAID failure” is now the catch-all explanation for technical failures (it used to be “technical difficulties” or a generic “glitch in the system”). At least software isn’t taking the fall. The status blog post indicated that they would be restoring from the previous day’s backup, which only made my blood boil: I had work that wasn’t backed up dated later than that, to say nothing of the FATE data that would be lost.
All told, the failure only lasted for about 12 hours before the IP address was ready to start serving HTTP data again. But my site still wasn’t up. I was notified that some httpd processes with my username were using inordinate amounts of memory and that I had to be suspended.
So let’s review the technical analysis:
- Data points: Customer’s scripts were working fine; our hardware crashes; we migrate possibly corrupt data to a new server; customer’s scripts are now misbehaving
- Analysis: Must be the customer’s fault!
- Best course of action: Administratively disable customer’s account
My best guess for what was going on was that perhaps some MySQL table was corrupted, causing a SELECT statement from a PHP script to read inordinate amounts of data into memory. After work, I was able to run CHECK and REPAIR on the tables manually but didn’t find anything too serious. I put in a request to have my websites re-activated and played the waiting game. And waited. Then waited some more. This is something my last web host never made me do.
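For the curious, the check itself was nothing exotic. Here is a minimal sketch of the kind of pass I ran, assuming the MyISAM-style tables MySQL defaulted to at the time; the host, credentials, database name, and the mysql-connector-python package are placeholders and assumptions, not the real setup:

```python
# A sketch only: credentials and database name are placeholders, and this
# assumes the mysql-connector-python package plus MyISAM-style tables
# (REPAIR TABLE is not useful for InnoDB).
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="dbuser",
                               password="********", database="fate_db")
cur = conn.cursor()

cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

for table in tables:
    cur.execute("CHECK TABLE `%s`" % table)
    # CHECK TABLE returns rows of (Table, Op, Msg_type, Msg_text)
    results = cur.fetchall()
    if all(row[3] == "OK" for row in results):
        print("%s: OK" % table)
    else:
        print("%s: problem reported, attempting repair" % table)
        cur.execute("REPAIR TABLE `%s`" % table)
        for row in cur.fetchall():
            print(row)

cur.close()
conn.close()
```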
The Resolution
Do I bail and begin the arduous search for another web host who will only suffer a RAID failure at my website’s hands? It has only been 2 weeks since I ditched my last host and… hey, wait… I still haven’t canceled that account. I’ve been meaning to, as soon as I do one more pass to make sure 4.5 years of data are migrated away. But they’re long past their RAID problems (knock on proverbial wood).
After waiting on my ticket for 2 hours, I decided to play a little game: If WebFaction could answer my tech support ticket in less time than it took to transfer my site back to the old host, they could keep my business. Mind you, both providers are well-connected, bandwidth-wise, and transferring the data between the 2 sites directly didn’t take long at all. WebFaction lost the game, responding to the tech support ticket nearly 4 hours after I filed it, a short while after the migration completed. For all of my previous provider’s failings, they always answered tech support tickets in 10 minutes or less.
I think the silver lining in all this is that I’m becoming so proficient at moving my operation between webhosts that I should soon be able to develop an automated process for doing so. And don’t think I’m happy about that. I lost 2 full evenings that I had slated for more important, FATE-related pursuits. This kind of thing really murders motivation.
Option 3. Retain both hosts, and RAID *them*.
Hey, I considered that, however briefly– what about retaining both hosts and aggressively, efficiently mirroring data between the 2? As long as they’re not in the same data center (with my luck, they both would be).
If you have the time and inclination, you can always buy a Linode: .
Now that you have two partially unreliable but complementary hosts, you have time to find the final solution.
Maybe a system that updates whichever host has the older copy of the data from the one with the newer copy (as suggested above), AND a script that connects to the closest host to spread the load? Or is that just overkill?
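Something along these lines, perhaps? A very rough sketch of the first half; the host names, data path, and passwordless-SSH setup are all made-up assumptions:

```python
#!/usr/bin/env python
# Rough sketch of the "refresh the stale host from the fresh one" idea.
# Host names, the data path, and passwordless SSH between the machines
# are assumptions for illustration, not a description of a real setup.
import subprocess

HOSTS = ["hostA.example.com", "hostB.example.com"]
DATA_DIR = "/home/mike/site-data/"

def newest_mtime(host):
    """Return the most recent file modification time under DATA_DIR on host."""
    out = subprocess.check_output(
        ["ssh", host, "find", DATA_DIR, "-type", "f",
         "-printf", r"'%T@\n'"]).decode()
    times = [float(t) for t in out.split()]
    return max(times) if times else 0.0

# Decide which host holds the freshest data...
fresh, stale = sorted(HOSTS, key=newest_mtime, reverse=True)

# ...then log into the fresh host and push its copy onto the stale one.
subprocess.check_call(
    ["ssh", fresh, "rsync", "-az", "--delete",
     DATA_DIR, "%s:%s" % (stale, DATA_DIR)])
```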
By the way, a RAID issue? RAID 0, 1, 0+1? Could you get more explicit details on the “RAID system failure”? It doesn’t sound right anyway.
Best of luck.
Tam
@tam:
A claimed RAID controller issue: http://statusblog.webfaction.com/
And I still think that if your RAID controller _ever_ fails, you bought the wrong one.
@Mike:
Your idroq FATE test is funny; as far as I can tell, all the output is as expected, but the test still fails…
@Reimar: I updated that test spec last night, so there was a mismatch between the time I updated the spec and the time the new code was checked in to be tested.
now the idea is to decentralize the fate status page across multiple hosts.
i wonder, is there any mechanism for automatically halting a process which has eaten too much memory? the times i’ve seen it happen it usually brings the box to a halt after filling the swap…
@compn: that’s what ulimit is for.
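For what it’s worth, the same kind of cap that ulimit sets from the shell can also be applied by a process on itself. A minimal sketch (Unix-only; the 512 MB figure is just an example value):

```python
# Minimal sketch: cap this process's own address space so a runaway
# allocation fails with MemoryError instead of dragging the box into swap.
# The 512 MB limit is an arbitrary example value.
import resource

LIMIT = 512 * 1024 * 1024  # bytes

# RLIMIT_AS bounds total virtual memory; the tuple is (soft, hard).
resource.setrlimit(resource.RLIMIT_AS, (LIMIT, LIMIT))

try:
    hog = bytearray(1024 * 1024 * 1024)  # try to grab 1 GB
except MemoryError:
    print("allocation refused, as intended")
```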
When WebFaction told me that they had to disable my account because an httpd process with my uid attached was using 1.1G of memory, I asked why there weren’t measures in place to protect against such renegade processes. They assured me that there were measures in place but that I somehow managed to circumvent them.
Apparently, I don’t even recognize how 133t I am.