Garage Gazette outage

Papaw · June 10, 2018, 01:49:47 PM

It seems our sister site, The Garage Gazette, is having server problems again. Hope it isn't a major problem!
We know all too well that sites can crash and be lost easily, as this one did back in 2011 or so.

I just checked Tools and Garages, Rusty's forum, and it is down also.

Spartan-C · June 10, 2018, 06:18:15 PM

Yeah, so is the Machinists Gazette, where I'm a member, too.

Papaw · June 10, 2018, 06:33:24 PM

As it has always been, Gazette and Tools & Garages members are welcome here on Tool Talk if they need to connect with others.

bonneyman · June 10, 2018, 06:55:51 PM

Thanks for the heads-up, Papaw.

Papaw · June 10, 2018, 07:08:22 PM

I also can't get Rusty by email. His address in my records gets sent back as "Undeliverable".

Papaw · June 10, 2018, 08:14:59 PM

Got an email from Uncle Buck-

"Hey Noel, thanks for the message. I am sure a few of the guys are a bit restless with the site down. Rusty and I have never maintained any contact off of the site (we should, my fault). I know he will get it squared away; he always does. Like as not, by tomorrow."

slip knot · June 10, 2018, 09:52:38 PM

I just figured it was something Bruce did

Uncle Buck · June 11, 2018, 09:37:40 AM

Quote from: slip knot on June 10, 2018, 09:52:38 PM
I just figured it was something Bruce did

Probably stuck one of his flip flops in the server and fried the bloody thing!

bonneyman · June 11, 2018, 09:47:30 AM

Sounds like it's not too bad.

Perhaps we could get a relay notification system going? For when either site goes down, emergencies, internet service interrupted, etc. Being I'm home a lot now and have a landline, I could be a part of the system/chain.

goodfellow · June 11, 2018, 12:17:47 PM

Just got this from Rusty --

Quote:

"They've advised us they have about 107 servers (down from 140 earlier), this is just under 1% of the systems there. These blades went down after a firmware upgrade to fix the TLS problems in the ILO. This server is one of the affected units. Most units upgraded without incident, but on some the power management controller failed to come back online, taking the server offline. They are discussing with HPE whether a downgrade will help or we need to replace hardware. It's a mess but we'll work through it until it's resolved."

Papaw · June 11, 2018, 01:24:45 PM

Sure hope they do get it resolved. When Tool Talk went down, everything was lost, but with help and the great membership, it came alive again.

jabberwoki · June 11, 2018, 03:03:16 PM

It's so Bruce's fault, you're right on the money.

slip knot · June 11, 2018, 05:24:53 PM

Quote from: Uncle Buck on June 11, 2018, 09:37:40 AM
Quote from: slip knot on June 10, 2018, 09:52:38 PM
I just figured it was something Bruce did

Probably stuck one of his flip flops in the server and fried the bloody thing!

Not just flip flops...Server crashing flip flops!!!!!

MAD · June 11, 2018, 11:05:00 PM

Quote from: Papaw on June 10, 2018, 06:33:24 PM
As it has always been, Gazette and Tools & Garages members are welcome here on Tool Talk if they need to connect with others.

Thanks for the info and hospitality!

goodfellow · June 13, 2018, 08:23:06 AM

This is the message that Rusty received from the hosting company. It's a major hardware firmware failure -- and it may not be resolved quickly because they don't have enough hardware to replace the failed components (the blade servers). We may have to accept the fact that when we do get up and running, there may not be a proper backup to restore the site to a recently archived steady state -- bottom line: a lot of recently posted content may be lost.

_______________________________________________________________________________________________________________________________

"News: Atlanta - Blades in A01-05, B01-05, C01-05, J11-J15
Published: 11/06/2018

We've had a detailed update from our DC about the status of around 120 blade servers that have been affected by a recent firmware update.

At the beginning of June, we began upgrading our servers with the new ILO code to address both the TLS vulnerability and the new Java security requirements. During last week we had 1-2 servers that failed their upgrade but completed over 1500 upgrades successfully. On Friday we started upgrading the next batch of systems in A01-05, B01-05, C01-05, J11-J15. There were no incidents detected until Saturday, when a total of 174 servers dropped offline over the course of 3 hours. We've had a few more drop out on Sunday and Monday as well.

We notified the DC to check through the first systems and they found the server refused to power up; it just sat with a flashing red health error light. They contacted HPE for assistance and carried out their recommended downgrade procedure on a couple of the blades. It made no difference.

HP's tech investigated a few of the blades and discovered that the power management controller had failed to upgrade properly and was preventing the blade from powering up. We have tried a variety of operations with HPE to resurrect these servers; some have come back to life, some are still offline. The DC will continue to work with HPE to find a solution to these issues.

We have swapped hardware where we have available stock but have now depleted our stock of spare compatible systems so we are reliant on HPE resolving the firmware issue. In the meantime we're shipping a couple of pallets of servers from New York and Denver to Atlanta so we can replace hardware where necessary.

The DC is working round the clock with HPE to find a resolution. If we can't get a firmware resolution, then we'll replace the hardware as soon as additional kit arrives on site.

It's an unfortunate and messy situation all round and we apologise for the downtime."