Author Topic: Forum Online Status - why you can't connect to the forum  (Read 949 times)

Offline bpratt

  • Administrator
  • Legendary Racer
  • ****
  • Posts: 5540
Forum Online Status - why you can't connect to the forum
« on: September 25, 2009, 08:39:43 pm »
There's been a lot of rubbish spread around this week regarding the status of this forum, and how it has been off the air for quite a large chunk of time.

Since I'm the bloke who set the forum up in the first place on my own servers, I feel I must let you know exactly what is going on.

As many of you know, I run a web hosting business on my own servers here in Australia.

About 2 weeks ago I got a bit of a surprise to hear that the business I buy my co-location rack space from has decided to change their business plan, and will no longer be providing me with rack space for my servers in the Equinix data centre in Sydney. This means not only myself, but 4 or 5 other small businesses who were using that space, had to rapidly find a place closer to home to put our servers into, which means a Brisbane data centre.

So what's wrong with that, you ask? It means that my servers have to be pulled out of the racks in Sydney and then sent up here, which means this forum would be off the air for at least 30 straight hours, if not a bit longer due to the transit time on the road.

On top of that, the time and date when we could arrange for someone to go and remove all the different businesses' servers has been up in the air for the last week or more.

I could not simply sit back and hope for the best, when this big move could've happened this weekend when there's a Grand Prix on. I wasn't going to take the risk of the forum being offline either leading up to the GP weekend (with the tipping comp closing and people missing out on getting their tips in), or on the actual weekend itself.

The long and short of it is that I chose to go with a Virtual Private Server from a company in the UK, and set up the forum on that server last weekend, when being down wasn't going to be as much of an issue as it would be this coming weekend.

The delay last weekend was a DNS issue, where most DNS servers were still pointing at my own servers rather than the newer server, because I didn't reduce the TTL the day before. It took until late Saturday night before people could finally see the forum at its new temporary location. Temporary because once I get my servers back to Brisbane, the forum will be back at its home of the last 2 years.
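For anyone wondering why the TTL matters so much, here is a rough sketch of the arithmetic. The 24-hour TTL, the 5-minute TTL and the change time are made-up numbers for illustration only (the actual values on the forum's DNS records aren't stated above), and real resolvers don't always honour TTLs exactly, so treat the result as an estimate rather than a guarantee.

```python
from datetime import datetime, timedelta


def worst_case_cutover(change_time, old_ttl_seconds):
    """Latest time a resolver that cached the old record just before
    the change could still be handing out the old address."""
    return change_time + timedelta(seconds=old_ttl_seconds)


def latest_time_to_lower_ttl(change_time, old_ttl_seconds):
    """The TTL has to be lowered at least one full old-TTL before the
    change, so every cache has picked up the short TTL by then."""
    return change_time - timedelta(seconds=old_ttl_seconds)


# Hypothetical values: a Saturday 9am change and a 24-hour TTL.
change = datetime(2009, 9, 19, 9, 0)

# With a 24-hour TTL left in place, stale answers can last a full day.
print(worst_case_cutover(change, 86400))        # 2009-09-20 09:00:00

# With the TTL lowered to 5 minutes beforehand, only five minutes.
print(worst_case_cutover(change, 300))          # 2009-09-19 09:05:00

# When the TTL would have needed lowering for that to work.
print(latest_time_to_lower_ttl(change, 86400))  # 2009-09-18 09:00:00
```

In other words, under these assumed numbers, lowering the TTL a day ahead would have cut the worst-case wait from about a day to a few minutes.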

I'm going to skip why the forum was down for most of Wednesday, but it had nothing to do with the hosting.

Today, the company that I got the VPS from had an emergency outage, which they emailed all their clients about at around 3am our time.

Here's their full notice, and reasons behind the extended downtime :-

Quote
Emergency Network Maintenance
September 24th, 2009

We need to perform some emergency network maintenance as soon as possible. This has been scheduled for tonight, which we appreciate is very short notice. We need to reboot a router to install new software, and this reboot will take up to 45 minutes. We will do everything we can to speed up the process as much as we can and reduce the maintenance time.

Date: 24/09/2009
Window: 23:00 for 2 hours
Duration: < 45 minutes.

The maintenance is to perform an emergency upgrade of Cisco software. We are using a Cisco VSS-1440 as part of our network core, and we have been experiencing some reduced performance with it today. There is no cause within our network configuration and set up of this, and it started to have a detrimental effect on some clients today. We escalated this to the Cisco TAC team, who have diagnosed a fault with the software on the router in the form of a memory leak. Cisco has supplied us with a new version of the software for the router which will fix the memory leak and slow performance.

The nature of this problem is that it will escalate as time goes on, which is why we want to apply the fix as soon as core business hours finish today. Please accept our apologies for the short notice, we hope your clients appreciate this problem was out of our control caused by Cisco software, and we are working as best we can to resolve it quickly.

We apologise for any inconvenience this may cause, please do not hesitate to contact us if you have any queries or questions regarding this maintenance window.

UPDATE 01:56 - Servers are starting to come back online now, we have requested a full reason for outage from the data centre as to why this took much longer than expected and we will be contacting all customers once we have the facts.

UPDATE 05:38 -

The following update has been issued by our data centre:

    This is a further update to our earlier message regarding the problems with the scheduled maintenance.

    As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.

    Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we'll do our best to resolve them for you.

    Date: 24/09/09
    Time: 23:00
    Duration: <4.5hrs

    The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below. A full reason for outage (RFO) will then follow tomorrow.

    Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem to the network and subsequently this evening's outage was extended, rather than revert to a flawed firmware version.

    1. New firmware image is loaded and prepared for use on reboot.
    2. First router is rebooted.
    3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
    4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
    5. The router hangs during the boot process, shortly after decompressing the image.
    6. On consultation with Cisco it is agreed to boot back into the old firmware to try and restore a solid boot. This works.
    7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
    8. This time both the image and the minor temporary configuration hold.
    9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
    10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
    11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
    12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time consuming process.
    13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00. Once these final configuration changes are applied we will send a further update.

Obviously we would like to apologise to any of our clients who were affected by this work. Due to unforeseen circumstances described in detail above, work was severely delayed for not just our clients but other companies located in the same data centre.

We are working closely with the data centre to identify any further weak points and to avoid additional disruption.


Quite a lengthy report, but clearly beyond my control.

Later today there was another hiccup where the site was offline for another short while.

Like everyone here, I'm hoping they've got their issues sorted out now.


In a week or two, I will have my own servers back in a Brisbane data centre, and at that time I will need to take the forum offline for a couple of hours, so that it can come back 'home' where it has run reliably for the last couple of years.


If you have any questions, please ask me.
2008+2009 V8 tipping comp Champion


Offline bpratt

Re: Forum Online Status - why you can't connect to the forum
« Reply #1 on: September 26, 2009, 08:58:51 am »
More information on why the forum was unreachable on Friday night (25/9/09), from the UK hosting company we are temporarily located at :-

Quote

Network Issues Update
September 25th, 2009

The data centre are continuing to have severe network issues that will be affecting all our servers.

We are still waiting for the all clear from the data centre but in the mean time users may experience a slowness in connection, timeouts or lag when browsing sites or using services.

Please bear with us during this period as unfortunately the issue is out of our control but we are working closely with the data centre to get this resolved asap.

Update 18:22
The following has now been received from the data centre:

    Below is a summary update of the issues experienced by clients in RSH-North today and yesterday, 25th and 24th September 2009.

    Fundamentally we have experienced serious issues that have affected all clients in RSH-North. This has been due to the Cisco equipment at the core of our Spectrum House services not responding as per specification and documentation.

    As has already been explained, we have been in direct contact with Cisco, working on a Level 1 priority request to solve the issues that have affected our clients today. We take full responsibility for our vendor selection and do not wish to appear to be passing blame “conveniently”. We pride ourselves in the level of service that we provide and also in the quality of communication that we send to clients. We accept that neither have been anywhere close to our usual standard during this prolonged incident. However, we would like to take this opportunity to clarify a few points that we are aware have been questioned and discussed by our clients and competitors:

    - Spectrum House routers were configured in a redundant VSS cluster
    - CPU usage was recorded as being very high for normal usage
    - Cisco have offered two possible solutions to the issue, including a firmware update provided to us yesterday. These solutions have failed to resolve the issues experienced
    - Our priority has, at all times, been to provide stable service for as much time as possible, to as many clients as possible. This has been the endeavour and sometimes this has not been possible. We do not shirk the responsibility for this matter and recognise the impact that it has on our clients’ businesses and both our and their reputations.

    Our network team are continuing to investigate why some of our clients are still experiencing outages, interrupted service and packet-loss and are doing so with Cisco.

As soon as we have further updates we will be posting them directly here.



Offline bpratt

Re: Forum Online Status - why you can't connect to the forum
« Reply #2 on: September 26, 2009, 01:55:59 pm »
There were further issues at the openmindhosting.co.uk data centre putting the forum offline for most of the day up till now.

No further details from their support site.

I'm really looking forward to going back to my own Overflow Servers where we never had any troubles before.


Offline bpratt

Re: Forum Online Status - why you can't connect to the forum
« Reply #3 on: September 26, 2009, 09:40:50 pm »
Updated details from today's outage :-

Quote

Network Outage
September 25th, 2009

Currently all servers are offline as the data centre are once again attempting to resolve the issues with the core server.

Further updates will be posted as we have them

UPDATE: 21:37 – Approximately 40% of our servers are now coming back online…

UPDATE – 01:33 – All servers up and running, the network is now stable. A full RFO will be issued shortly…

UPDATE – 02:11 – After removing the VSS cluster technology from the RSH North zone routers and restoring the routers to a similar configuration to the ones which served RHC stably for over two years, considerable progress was made to restore connectivity for 90% of clients. However, service is not stable and clients will see differing levels of connectivity ranging from complete loss of service to severely degraded.

Obviously we are continuing to work on this with Cisco.

Please be assured that further updates will be provided as and when we have them.

Of course this continued long after their 02:11 update. :(


Offline bpratt

Re: Forum Online Status - why you can't connect to the forum
« Reply #4 on: October 07, 2009, 07:20:02 pm »
Just to keep everyone up to date with what's going on here. :)

The VPS that all my sites are currently located on has a memory leak, which is causing it to lock up and require a reboot several times a day!!

This is totally unacceptable, and if I were going to continue running my business on servers located overseas, I would've left these guys days ago!

If I were to move to another VPS host, it would take me more than a day to set it up, and another day or more to move all my other customers' sites across to it. Considering my servers are that close to being recommissioned in a Brisbane data centre, I'm sure you can understand why the plan is to leave things as they stand, particularly as Bathurst is on this weekend, and I have other business commitments over the next 2 days.


I have the guys from OGN.com.au going in to the data centre on Saturday to remove my servers while they move their own out and into another data centre in Sydney; however, they won't be returning to Brisbane with my servers until Wednesday.


The good news is that the forum will then be back on a stable server, and you won't have any more trouble accessing your favourite forums.



Offline bpratt

Re: Forum Online Status - why you can't connect to the forum
« Reply #5 on: October 16, 2009, 11:13:59 pm »
FINALLY !!!

We've got the forum back on the Overflow Servers, which means we'll be back somewhere that is reliable and stable.

It just goes to prove a long-held opinion that whilst hosting is cheaper overseas than it is in Australia, you really are taking pot luck on how reliable it turns out to be.

After this bad experience I know I won't be likely to take that chance again !


Offline Forum Admin

  • Administrator
  • Debuting
  • ****
  • Posts: 12
Re: Forum Online Status - why you can't connect to the forum
« Reply #6 on: May 19, 2010, 12:04:55 pm »
Our hosting provider has just added a new server and has advised us that we need to move the forum on to this new server.


This means that the forum will come up as being in maintenance mode for a couple of hours whilst we move the forum's database across to this new server.


So if you come online in the next few days and see the forum in maintenance mode, don't panic. The plan is to keep every single post intact when moving across to the new server, and to do that we need to prevent new posts from appearing on this server.
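For the curious, the idea behind maintenance mode can be sketched roughly like this. This is a toy illustration only, not how the forum software actually implements it; the `Forum` class and its method names are made up for the example.

```python
class Forum:
    """Toy in-memory forum, used only to illustrate a maintenance-mode gate."""

    def __init__(self):
        self.posts = []
        self.maintenance_mode = False

    def add_post(self, author, text):
        # New posts are refused while the database is being copied.
        if self.maintenance_mode:
            raise RuntimeError("Forum is in maintenance mode; please try again later.")
        self.posts.append((author, text))

    def export_posts(self):
        # Only export once posting is frozen, so the copy cannot go stale.
        if not self.maintenance_mode:
            raise RuntimeError("Enable maintenance mode before exporting.")
        return list(self.posts)


old_server = Forum()
old_server.add_post("member", "Tips are in!")

# Freeze posting, take the copy, and load it on the new server.
old_server.maintenance_mode = True
new_server = Forum()
new_server.posts = old_server.export_posts()

# Nothing was lost, because nothing could be added during the copy.
assert new_server.posts == old_server.posts
```

Freezing posting first is what guarantees the copy on the new server matches the old one exactly.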


As the new server is in a different data centre to the old one, there could be a delay of a couple of hours before the forum reappears after we move the forum's posts across.


The plan is to do this during a quiet period. The planned time is Friday, unless our uplink gets the IPs routed sooner than we expect.


 
