There's been a lot of rubbish spread around this week regarding the status of this forum, and how it has been off the air for quite a large chunk of time.
Since I'm the bloke who set the forum up in the first place on my own servers, I feel I must let you know exactly what is going on.
As many of you know, I run a web hosting business on my own servers here in Australia.
About two weeks ago I got a bit of a surprise: the business that I buy my co-location rack space from has decided to change its business plan, and will no longer be providing me with rack space for my servers in the Equinix data centre in Sydney. This means that I, along with 4 or 5 other small businesses who were using that space, had to rapidly find a place closer to home to put our servers into, which means a Brisbane data centre.
So what's wrong with that, you ask? It means that my servers have to be pulled out of the racks in Sydney and then sent up here, which means this forum would be off the air for at least 30 straight hours, if not a bit longer, due to the transit time on the road.
On top of that, the time and date when we could arrange for someone to go and remove all the different businesses' servers has been up in the air for the last week or more now.
I could not simply sit back and hope for the best when this big move could've happened this weekend, when there's a Grand Prix on. I wasn't going to take the risk of the forum being offline either leading up to the GP weekend (with the tipping comp closing and people missing out on getting their tips in) or on the actual weekend itself.
The long and short of it is that I chose to go with a Virtual Private Server from a company in the UK, and set up the forum on that server last weekend, when being down wasn't going to be as much of an issue as it would be this coming weekend.
The delay last weekend was a DNS issue: most DNS servers were still pointing at my own servers rather than the new server, because I didn't reduce the TTL the day before. It took until late Saturday night before people could finally see the forum at its new temporary location. Temporary, because once I get my servers back to Brisbane, the forum will be back at its home of the last 2 years.
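For anyone curious why the TTL matters so much, here's a rough sketch in Python of the arithmetic. It is only an illustration: the TTL values and the switch-over time below are made-up examples, not the actual settings on my DNS records. The idea is that a DNS server which cached the old address just before the change can keep handing it out for one full TTL.

```python
from datetime import datetime, timedelta


def worst_case_stale_until(change_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a caching DNS server may still answer with the old address.

    Worst case: the server cached the old record an instant before the
    change, so it keeps serving it for one full TTL after change_time.
    """
    return change_time + timedelta(seconds=ttl_seconds)


# Hypothetical values for illustration only (not the real record settings).
OLD_TTL = 86400  # assumed 24-hour TTL that was never lowered
LOW_TTL = 300    # assumed 5-minute TTL set a day ahead of a planned move
switch_over = datetime(2009, 9, 19, 0, 0)  # assumed time the record was changed

stale_with_old_ttl = worst_case_stale_until(switch_over, OLD_TTL)
stale_with_low_ttl = worst_case_stale_until(switch_over, LOW_TTL)

print("TTL left high: old address may be served until", stale_with_old_ttl)
print("TTL lowered first: old address may be served until", stale_with_low_ttl)
```

With a long TTL left in place, the old address can hang around for up to a day after the change. Lowering the TTL at least one old-TTL period beforehand shrinks that window to a few minutes.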
I'm going to skip why the forum was down for most of Wednesday, but it had nothing to do with the hosting.
Today, the company that I got the VPS from had an emergency outage, which they emailed all their clients about at around 3am our time.
Here's their full notice, and reasons behind the extended downtime :-
Emergency Network Maintenance
September 24th, 2009
We need to perform some emergency network maintenance as soon as possible. This has been scheduled for tonight, which we appreciate is very short notice. We need to reboot a router to install new software, and this reboot will take up to 45 minutes. We will do everything we can to speed up the process as much as we can and reduce the maintenance time.
Date: 24/09/2009
Window: 23:00 for 2 hours
Duration: < 45 minutes.
The maintenance is to perform an emergency upgrade of Cisco software. We are using a Cisco VSS-1440 as part of our network core, and we have been experiencing some reduced performance with it today. There is no cause within our network configuration and set up of this, and it started to have a detrimental effect on some clients today. We escalated this to the Cisco TAC team, who have diagnosed a fault with the software on the router in the form of a memory leak. Cisco has supplied us with a new version of the software for the router which will fix the memory leak and slow performance.
The nature of this problem is that it will escalate as time goes on, which is why we want to apply the fix as soon as core business hours finish today. Please accept our apologies for the short notice, we hope your clients appreciate this problem was out of our control caused by Cisco software, and we are working as best we can to resolve it quickly.
We apologise for any inconvenience this may cause, please do not hesitate to contact us if you have any queries or questions regarding this maintenance window.
UPDATE 01:56 - Servers are starting to come back online now, we have requested a full reason for outage from the data centre as to why this took much longer than expected and we will be contacting all customers once we have the facts.
UPDATE 05:38 -
The following update has been issued by our data centre:
This is a further update to our earlier message regarding the problems with the scheduled maintenance.
As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.
Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we'll do our best to resolve them for you.
Date: 24/09/09
Time: 23:00
Duration: <4.5hrs
The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below. A full reason for outage (RFO) will then follow tomorrow.
Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem to the network and subsequently this evening's outage was extended, rather than revert to a flawed firmware version.
1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco it is agreed to boot back into the old firmware to try and restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00. Once these final configuration changes are applied we will send a further update.
Obviously we would like to apologise to any of our clients who were affected by this work. Due to unforeseen circumstances described in detail above, work was severely delayed for not just our clients but other companies located in the same data centre.
We are working closely with the data centre to identify any further weak points and to avoid additional disruption.
Quite a lengthy report, but clearly beyond my control.
Later today there was another hiccup where the site was offline for another short while.
Like everyone here, I'm hoping they've got their issues sorted out now.
In a week or two, I will have my own servers back in a Brisbane data centre, and at that time I will need to take the forum offline for a couple of hours so that it can come back 'home', where it has reliably run for the last couple of years.
If you have any questions, please ask me.