Topic: Site Outage

Offline Corrine

  • The Mystical Rose
  • Administrator
  • Hero Member
  • *****
  • Posts: 11542
  • "Stronger than the past, united in our goal."
    • Security Garden
Site Outage
« on: June 02, 2008, 09:12:59 PM »
Regular visitors to LandzDown Forum have been missing their LzD fix since Saturday, May 31st at 4:55 pm CDT.  There was a major problem at the data center that houses the server hosting LzD.  Electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding the data center's electrical equipment room.  Thankfully, no one was injured.

The Planet has been working around the clock to restore service to the approximately 9,000 servers and 7,500 customers.  You can get an idea of the magnitude of the situation from http://service-update.theplanet.com/.

Take a walk through the "Security Garden" -- Where Everything is Coming up Roses!

Remember - A day without laughter is a day wasted.
May the wind sing to you and the sun rise in your heart.

Offline Aaron Hulett [MSFT]

  • Web Server Manager
  • Administrator
  • Hero Member
  • *****
  • Posts: 1098
  • I take the bus!
    • Microsoft Corporation
Re: Site Outage
« Reply #1 on: June 03, 2008, 05:12:19 AM »
Hi everyone.  For those who may not know me, my name is Aaron Hulett, and I am an administrator here at The LandzDown Forum.  I personally rent and maintain the physical server that this site and some of my and my family's websites live on.  The server is hosted at The Planet, a hosting company specializing in providing dedicated servers.  Renting a dedicated server is different from simply running a website in that the entity renting the server is responsible for its maintenance, software upgrades, and troubleshooting.  In other words, if things stop working, I am deeply involved in investigating what happened and resolving the issue.

On 31 May 2008 at 2:55 PM Pacific Daylight Time (UTC-7; 21:55 UTC), an electrical explosion occurred in the underground electrical conduit that feeds one of The Planet's datacenters, located in Houston, Texas.  This explosion and the resulting fire completely destroyed the equipment connecting the building to the electric grid, the underground cabling powering the first floor, and the first floor's main distribution panel and transfer switch, and knocked down three walls surrounding the electrical equipment room.  Luckily, no one was injured in the explosion.

This resulted in the website becoming unavailable.  Because I was active in the LandzDown Forum's chat room at the time, I immediately noticed the server go down and began investigating the cause.  Once I determined that the downtime was a result of this explosion, and that The Planet was actively working to restore service within a reasonable timeframe, I decided to hold off on any further action and wait for the server to come back online, which it did approximately 46 hours later.

While this scenario was outside of my control, I still want to share with you my thoughts about what happened.  First and foremost, I'm thankful no one was hurt at the datacenter.  No amount of server downtime outweighs injury or death.  Second, I have no plans to change dedicated server providers.  I'm very thankful for The Planet's transparency regarding what happened and what they are doing to resolve things.  Last, I'll start a conversation with LandzDown Forum's key stakeholders about plans for potential future site downtime and see where we can make improvements.

If anyone has any questions for me surrounding what happened, please don't hesitate to ask by posting in the LandzDown Lounge.

Thanks,
Aaron
Aaron Hulett | Malware Protection Center | Microsoft Corporation
This post is provided "AS IS" without warranty, and confers no rights.

Offline Aaron Hulett [MSFT]

Re: Site Outage
« Reply #2 on: June 03, 2008, 10:58:47 PM »
And then Round 2 started...

At about 12:45 am PDT on 03 June, the backup generator powering this server shut down.  Faulty current sensors detected a nonexistent out-of-balance current condition.  Things came back online literally minutes ago.  I'll bring the IRC server (the chat room) back online in a few hours.  If for some reason the site goes down again, you can monitor status updates from The Planet directly at:

http://service-update.theplanet.com/

The server is located in the H1 facility, under Phase 1 (the first floor).

Aaron

Offline Aaron Hulett [MSFT]

Re: Site Outage
« Reply #3 on: June 30, 2008, 06:37:03 AM »
And then Round 3.

The SQL service failed at 1:24 AM Pacific Daylight Time.  The automatic restart attempt did not succeed, and the server did not email me a failure notification.  The failure took this site, along with my blog and several other sites hosted on this server, offline until I issued a manual service restart request this evening.  I'm not sure what caused the failure (my main goal was to get things back online first and ask questions later).  I do know this is unrelated to the power problems that occurred at The Planet.  I also made some quick changes so that I'm made aware of site downtime sooner and can respond faster in the future.

Hopefully, given that things tend to come in threes, this is the last downtime for a while.

Aaron