Joined: 23 Aug 2003 Posts: 351 Location: San Carlos, CA, USA
Posted: Sun Mar 16, 2008 2:07 pm Post subject: Footbag.org major service problems, March 12-19
Folks,
Footbag.org had a major, catastrophic failure of its primary web server on Wednesday night (March 12, 2008) at around 10:22pm, PDT.
Of course, as luck would have it, I was only 1/4 of the way to Frankfurt, Germany, on my way to Prague for the Todexon event. (Which is great, by the way. Highly recommended.)
Unfortunately, that meant I wasn't home to just go downstairs (yes, footbag.org is in my house) and fix it. The problem was that the server's power supply failed. So the machine was completely dead, with no way for me to get into it remotely from Europe.
Of course, this ruined my 5-day vacation. I spent days 1-3 pretty much constantly on my Mac laptop in my hotel room in Prague, trying to restore service.
The short version is that we're back up (obviously); however, this is really not a great situation. What I managed to do was build a whole new server almost from scratch. Fortunately, thanks to the miracle of VNC, I was able to get into my backup server at home (a dedicated Mac running Retrospect backup software) and transfer the latest backed-up state of footbag.org onto that server. (It sounds easy, but the last sentence took 2 days of almost non-stop work.)
Of course, because the whole footbag.org server was down, there was no e-mail service, so all e-mail was being held in the "ether" waiting for the server to come back up. But the clock was ticking -- if we didn't get mail service restored within a couple of days, clients would start discarding the unsent mail. But, as luck would have it, I beat the clock and got mail working in under 24 hours. That means almost all mail made it through to @footbag.org aliases. Unfortunately, I had a bug in my hacked-up mail server, and mail to @ifpa.footbag.org started bouncing at that point. So, mail to/from @ifpa.footbag.org mailing lists was lost between Wednesday night, when the service crashed, and Thursday night, when I got it working again. Some mail may have sneaked through, but assume it didn't.
Unfortunately, the old server I hacked up to run the backup copy of footbag.org is running a 7-year-old operating system (Debian Linux "woody" distribution, for those geeks among you) and that means lots of things just aren't compatible with today's software on footbag.org (most of which I wrote myself, so that was easy to fix; but for the third-party software, it was a lot harder). So I gave up trying to get the Wiki to work (that's the Reference section for the rest of you) since upgrading php meant pulling in a lot of new libraries and I just didn't want to risk bringing down my only remaining server while still 7,000 miles away.
In parallel with all this, my dear friend Strick took time out of his ridiculously busy schedule to drive over to my house and go into my server room to start looking into why the server failed to begin with. Long story short, again, Strick did something very important and helpful -- he removed the (totally functional) disk drives from the now-dead machine, and put them in his own "chassis" which he'd created a long time ago for exactly this purpose (for other friends of his). He's the go-to man when something like this happens; we all need a Strick. You guys mostly have me; I have Strick. So, indirectly, you all owe Strick a huge debt of gratitude.
Anyway, what Strick did was two things. First, he made it so I could access the original hard drives (and latest data) from the instant footbag.org crashed on Wednesday night. Second, he set up a host environment whereby I could (if necessary) switch over to his machine and have footbag.org running again using the real hard disks and basically mask the fact that the other machine failed. Unfortunately, this latter path didn't materialize 'til I'd already done 97% (literally) of the hacking required to get my other machine working, so I opted to stick it out and get the old machine up. However, because of his first bit of work, I was able to copy the exact last state of footbag.org's databases over to the replacement machine and got it up and working exactly as before (with all the caveats listed above).
Unfortunately, this saga isn't over. I keep having problems because the server is so old and out of date. This morning, I woke up to find footbag.org down yet again due to a mysql issue. I troubleshot it and think I got it fixed, but I can't be sure it won't fail again. If I switch back to the real hard disks, this won't happen, because they run the latest versions of everything (relatively speaking). The issue now is that I've got this hybrid world where half the stuff is old (like mysql, apache, etc.) and half the stuff is new (like my software, mediawiki, etc.).
So, once I get home (Monday night, very late PDT) I will probably just go to sleep. But after that, sometime on Tuesday, I'll start working on repairing the original server. I think I can get things working as before by Wednesday, but no promises. It's been a long few sleepless days/nights for me and I think you can live w/o the wiki for a few more days, and with the relative instability.
OK, well, I thought I'd give everyone a bit of flavor for what's been going on. I also wanted to test posting to the forum.
Steve
P.S. For those geeks among you who are asking yourselves, "Why doesn't Steve have a better disaster-recovery plan?" or even the less daunting, "Why doesn't Steve have a hot-standby or even a cold-standby server running for this?" The answer is, time, money, help. I have money, but I don't have time or nearly as much help as I need. Really time is the issue, though. Either way, I do have a cold-standby, which is what I got us up on. It took longer than it could have because I was stupid and didn't just go and upgrade the cold standby machine when I bought the new server so they would have the exact same OS on them. I'll do that next time, you can rest assured. But I am working on a better reliability/recovery plan, had already been engaging in discussions with several folks about what's required, and was actually about to embark on earnest work in that area with Allan and others in the next few weeks. Who knew after 633 days of non-stop uptime that footbag.org would crash the day I went to Europe?
Posted: Sun Mar 16, 2008 7:02 pm Post subject: Re: Footbag.org major service problems, March 12-19
Steve Goldberg (brat) wrote:
Who knew after 633 days of non-stop uptime that footbag.org would crash the day I went to Europe?
murphy did...
but for christ's sake, i'm pretty sure we would have all survived without .org running until you're back from praha.
i'm pretty much the "strick-kind-of-guy" but i would not have worked for days when i'm actually on a tourney.
you are a crazy workaholic i guess. but thanks for the insights.
Joined: 01 Jun 2005 Posts: 187 Location: Ellenville, NY, USA
Posted: Sun Mar 16, 2008 7:23 pm Post subject:
Well, that all sounded like a lot of work (this coming from a guy, me, who still asks you how to use my personal e-mail). See, Steve, every time I want to say you are a pain in the A@#$%, it's stuff like this that I remember, and I say, "So what, Ted, he's great for footbag." Thanks for all the work over the years. I also would not have done this until I was back home, but then I would have allowed people who are not familiar with footbag to possibly label us unorganized and unprofessional. Stay healthy, dude. If something happens to you, we will all pay dearly!!!!
Joined: 23 Aug 2003 Posts: 351 Location: San Carlos, CA, USA
Posted: Sun Mar 16, 2008 11:16 pm Post subject:
UPDATE:
Crap. I realize now that the outage this morning (the second time around, because of the mysql problem) caused all e-mail sent to users via their footbag.org aliases to bounce.
This means all e-mail went back to sender, which people should not ignore! If you saw a message saying, "Returned mail: see transcript for details" and in the contents of the message you saw something like this:
Code:
----- The following addresses had permanent fatal errors -----
<random__user@footbag.org>
(reason: 550 5.1.1 <random__user@footbag.org>... User unknown)
----- Transcript of session follows -----
... while talking to llic.net.:
>>> DATA
<<< 550 5.1.1 <random__user@footbag.org>... User unknown
550 5.1.1 <random__user@footbag.org>... User unknown
<<< 503 5.0.0 Need RCPT (recipient)
It means you should re-send the message because it was lost.
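For anyone sorting through a pile of these bounces, here is a minimal Python sketch that pulls the rejected addresses out of a bounce message like the one above. The function name `rejected_addresses` and the regular expression are assumptions made for illustration; the pattern only matches the exact "550 5.1.1 ... User unknown" wording shown in this thread, and other mail servers phrase their rejections differently.

```python
import re

# Matches the rejection line from the sample bounce above, e.g.
# "550 5.1.1 <random__user@footbag.org>... User unknown"
# (illustrative pattern; it is tied to this specific wording)
REJECTED = re.compile(r"550 5\.1\.1 <([^<>\s]+@[^<>\s]+)>\.\.\. User unknown")


def rejected_addresses(bounce_text):
    """Return the unique addresses rejected as 'User unknown', in first-seen order."""
    seen = []
    for match in REJECTED.finditer(bounce_text):
        address = match.group(1).lower()
        if address not in seen:
            seen.append(address)
    return seen


# The sample bounce text quoted in this thread.
sample = """\
----- The following addresses had permanent fatal errors -----
<random__user@footbag.org>
(reason: 550 5.1.1 <random__user@footbag.org>... User unknown)
----- Transcript of session follows -----
... while talking to llic.net.:
>>> DATA
<<< 550 5.1.1 <random__user@footbag.org>... User unknown
550 5.1.1 <random__user@footbag.org>... User unknown
<<< 503 5.0.0 Need RCPT (recipient)
"""

for address in rejected_addresses(sample):
    print("re-send to:", address)
```

The same address shows up several times in one bounce, so the sketch reports each address only once.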
This happened sometime between early Sunday morning (PST) and Sunday afternoon (PST), so probably about 10 hours of exposure. Fortunately, Sunday is a slow time for e-mail. But if you sent an e-mail to any user on footbag.org during those hours, or expected to get one, I strongly recommend you re-send the original message, or contact your correspondents and ask them to do the same. It's all working fine now, finally.
I'm sorry. I lost mail then too, obviously. If you e-mailed me today or even late last night, please send it again just in case (unless you are sure I got it). I hate losing mail. It's the one thing I try the hardest to avoid. In fact, the mail wasn't so much lost as rejected. So, in theory, your correspondents will know you didn't get the mail because they'll get a bounce message. Most people just ignore those, though (which they shouldn't).
OK, that's the update for now. I'm going to bed so I can get up early tomorrow and start flying home. I'll check in one more time in about 7 hours or so.
Joined: 19 Sep 2003 Posts: 1508 Location: Hobart, Tasmania, Australia
Posted: Mon Mar 17, 2008 1:00 pm Post subject:
Yeah totally, it's incredible the amount of effort you've put into this, as well as putting up with my annoying SMSs. Thanks heaps. If there is anything I can do to help, which I guess is unlikely given my lack of IT knowledge and my location, I'm more than happy to do it.
Joined: 23 Aug 2003 Posts: 55 Location: Austin, TX, USA
Posted: Mon Mar 17, 2008 2:37 pm Post subject: Thanks Steve!
Thanks for all your hard work and dedication. You are amazing! Maybe this is something IFPA could do some fundraising around. Maybe a big fundraiser to buy footbag.org some new equipment. You are the best! Tina.
P.S. People won't die if we lose some e-mail so I don't sweat that too much.
Joined: 23 Sep 2003 Posts: 52 Location: Santa Clara, CA, USA
Posted: Mon Mar 17, 2008 8:43 pm Post subject:
Steve, thanks for your devotion to footbag.org and footbag. I work on software to keep applications running when hardware fails. So maybe I can help. Also, IFPA is willing to help any way it can to protect footbag.org (and you!) from problems like these.
Joined: 23 Aug 2003 Posts: 351 Location: San Carlos, CA, USA
Posted: Fri Mar 21, 2008 7:15 pm Post subject:
Just a quick update now that I'm home from Europe.
I ordered a replacement power supply for the "real" footbag.org (since we're still running on the backup right now, and several features are still disabled as a result).
It arrived this morning, and I'm going to swap it out tonight when I get home from work.
If all goes according to plan, the "real" footbag.org should be back online by tonight (Friday, March 21, PDT). Of course, I'll update this thread when I know whether that worked or not. It's only a guess that the power supply died; the other option is that the motherboard melted. I like to think that's a much lower-probability problem.
If the power supply wasn't the problem, I have to go to Plan B...