PC Pro

STEVE CASSIDY

Steve contends with a hard disk that’s “just become a bubbling, evil swamp of despair”, containing all a company’s VMs and with no backup in sight

Steve is a consultant who specialises in networks, cloud, HR and upsetting the corporate apple cart @stardotpro


It’s one thing to talk about preventing disaster; real experience is both rare and valuable. What happens when you’re actually sitting down to work out what’s left of a business’ data, with the owner breathing down your neck and the clock ticking on both your diagnosis and the likely action plan? Like almost everyone I talked to during this month’s crisis, I have a full repertoire of ways of keeping restorable backups; not a full repertoire of tools to try to drag data out of a hard disk that’s just become a bubbling, evil swamp of despair.

So, feel free to do what all those colleagues and contacts did, taking up too much of my crisis-management time: ask a series of questions that might, in a different situation, have served up a restorable backup.

No, there wasn’t a cloud backup product in place (in this case, the database supplier had elected to help out with that at one branch, but not the other). Yes, there was a RAID – of that rather irritating kind that VMware specialises in constructing, where it’s up to the guest VMs to lay out a fault-tolerant disk architecture.

No, the available mirrors didn’t seem to be usable. No, there was no hard deadline for recovering the database, but there also wasn’t a month available for long scans of the misbehaving 2TB media in pursuit of potentially useful, but initially corrupt, mirror partitions.

No, the backups that had been taken off the database by the software supplier weren’t deposited on a different server – indeed, not even on a different drive letter. Usefully, that particular discovery helped to fix a looming “too many cooks” problem, with all these questions – none of which speed up the data recovery, mind you – slowly drip-feeding in from said supplier and then suddenly falling silent when the boss-man realised why I was being so critical.

The downside to that being, of course, that I was on my own. I’d arrived with the assumption that the problem was the drive controller circuitry, and I knew I still had an exact match for the drive, so I pulled the board off my drive and put it in my pack for the trip.

This is one of those bits of voodoo work that aren’t widely understood outside of the server-fixing, data-recovering professions. In theory, every controller board on a traditional hard disk can be swapped for one from a known working identical device. But in practice, it isn’t always the board that’s the seat of the drive problem. In this case, it took only a few minutes to work out that the 2TB WD Red drive with the dead VMFS volume on it actually had a better board on it than the one in my pack.

WD Reds have easily demounted controller boards, with all the contacts between drive and board made by spring-loaded fingers and neat, flat pads – but mine was heavily marked with dirt so fine that it resembled toner. The one on the troublesom­e drive was as clean as a whistle: I quietly hid mine away and went back to the original search, which was for data-recovery tools that recognise VMware’s own VMFS disk and partition format.

This was one of the most frustrating searches I’ve ever undertaken. First, you have to penetrate the vast numbers of ignorant but opinionated helpers who lurk in internet product and support forums: not just this year’s edition, but those of every year. Perhaps the most useful filter for their contribution is to put “-should -ought” into your Google search string. I don’t want to think about the number of threads I worked through, where one or both of those terms reveal that a “helpful” answer is being delivered by someone with absolutely no clue about how to achieve what they’re talking about.

No, dear internet, it isn’t true that VMware volumes of the modern kind are plug-mountable to a Linux system by USB adapter. No, VMFS does nothing wonderfully magical with a single SATA volume to be inherently fault-tolerant. No, inquiring about snapshots is a dead end – the volume won’t mount, never mind what’s on it, on other, not-really-similar systems with not-really-related disk failures.

After an intense run of research late into the night, I felt I had two options the following morning. One was a partition-rebuilding process, documented in superb detail on a chap’s blog; the other was a complete VMFS volume-recovery tool. Both use the VMware command line connection. Since this would be a long and involved install on a machine set up to manage the VM host, I felt a morning start on these tools was in order.

Bright and fresh the next day, with the command line on Windows nicely hooked up to the VMware hypervisor on the broken server, I dived into the instructions for repairing the apparently non-existent partition table on this horrible little specimen of a disk. The first few stages went well, bringing up all the data that the rest of the recovery process wanted. I even learned the interesting snippet that VMware only writes back to the partition table (when needed) at system shutdown, which appears to justify the care taken to make a safe, clean shutdown a convoluted process.

But I digress: the first dead-stop moment came at the point where I had all the information to overwrite whatever was being used as the GUID Partition Table – and then I ran into the follow-on comments to the blog entry, which came from people who had wanted to do a recovery “on their lab server”. This is nerd code for “the autograph is for my son/daughter”.

It seems that just corrupting that one bit may be common enough, but some forms of disk damage can cause the numbers used to recreate the table to be mis-read. Typing them back in can make subsequent attempts at recovery – by this or any other method – less likely to proceed. Therefore, it isn’t recommended as a fix for disks that don’t have backups of their contents held somewhere safe.

My finger literally hovered over the Enter key on that one, then shifted to the Delete key. This wasn’t a lab; that was, after exhaustive searching, definitely the only usable copy of the data.
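For anyone wondering what “no usable partition table” actually looks like at the byte level, here’s a minimal read-only sketch – my own illustration, not the blog author’s procedure, and the image path is a placeholder – that checks a disk image for the protective-MBR boot signature in sector 0 and the “EFI PART” GPT header signature in sector 1. It only ever reads, which is rather the point when you’re holding the sole surviving copy of the data.

import struct
import sys

SECTOR = 512  # assuming 512-byte logical sectors

def check_gpt(path):
    # Strictly read-only: we only look at the first two sectors
    with open(path, "rb") as disk:
        lba0 = disk.read(SECTOR)   # protective MBR lives in sector 0
        lba1 = disk.read(SECTOR)   # GPT header lives in sector 1

    mbr_ok = len(lba0) == SECTOR and lba0[510:512] == b"\x55\xaa"
    gpt_ok = len(lba1) == SECTOR and lba1[:8] == b"EFI PART"
    print("protective MBR signature present:", mbr_ok)
    print("GPT header signature present:    ", gpt_ok)

    if gpt_ok:
        # Per the UEFI spec: partition-entry start LBA at offset 72, entry count at offset 80
        entries_lba, n_entries = struct.unpack_from("<QI", lba1, 72)
        print(f"partition entries start at LBA {entries_lba}, {n_entries} slots")

if __name__ == "__main__":
    # "disk.img" is a hypothetical placeholder for an image taken from the drive
    check_gpt(sys.argv[1] if len(sys.argv) > 1 else "disk.img")

If that second check fails while the data further along the disk is still intact, you’re in precisely the territory the blog post describes – and you can see why typing reconstructed numbers back into a fresh table is such a one-way door when no backup exists.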

Plan B – the actual tangible bit of software – was looking a lot more attractive. “VMFS Recovery”, it said, among many other kinds: able to sit on a Windows PC and remotely interrogate the disks of a VMware server, reporting back what it finds after one of several scan types, and configured to do all of that with the free download. Writing back or downloading found data requires a licence, which costs $800.

The initial download and data-analysis run are best described as easy but long. If you’ve ever wondered how long it takes to read every bit on a single 2TB disk over Gigabit Ethernet, the answer is around 15 hours. This doesn’t mean that much of value came up in the scan. While the faster short scan revealed a list of typical VMware files, the longer one just seemed to say “1 file found”. That was it. Not much for 15 hours. And the short scan report gave just generic file names that could have been any VMware volume.
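The arithmetic behind that figure is worth a moment, because 15 hours is far slower than the wire allows: a back-of-envelope sketch (the throughput values are assumptions, not measurements) shows a full pass over 2TB would take under five hours at anything near Gigabit line rate, so the deep scan itself was the bottleneck.

TB = 1_000_000_000_000  # decimal terabyte, as drive vendors count it

def hours_to_read(size_bytes, mb_per_sec):
    return size_bytes / (mb_per_sec * 1_000_000) / 3600

# MB/sec, from near wire-speed Gigabit Ethernet down to a struggling deep scan
for rate in (118, 60, 37, 25):
    print(f"{rate:>4} MB/s -> {hours_to_read(2 * TB, rate):5.1f} hours")

# ~118 MB/s (GbE minus framing overhead) comes out at roughly 4.7 hours;
# a 15-hour pass over 2TB works out to about 37 MB/s.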

My reluctance re-emerged at this point, because I’d been unbelievably bored by the bits of the 15-hour scan that weren’t in the middle of the night, and during that time I couldn’t find a satisfactory information trail for this product. Sure, it had a Facebook page, with a few sparse posts from apparently happy customers – but none from the developers. Some loose links to a community of Russian or Ukrainian IT types, with a history of similar interests around VMware and storage in general, but nothing like enough proof that these identities were active or even correct. No visible comments or contributions from them in that vast labyrinth of forum postings on this subject – and suspiciously dead-end designs in their contact pages (which is one of several reasons why I’m not putting a link to any of these sites, or even a screenie, in this piece). It just didn’t add up.

In fairness, though, its software did use the VMware CLI to make a remote smart link to the VMware server; it did tell me exactly what type of drive it was; and it did show a few bits of disk info that countered my suspicion that this was scamware. But $800 is just too much to spend speculatively, especially for something this vital. I really wanted these guys to be on the phone, to talk me through what the screens were saying and explain why some of the features seemed to be completing while others weren’t so happy. And I couldn’t.

So I bit the bullet on the silent, invisible pressure coming at me from the business owner, and said that in my view the lack of social or professional footprint was enough to put me off the Ukrainian option. Especially given that the local office of Kroll Ontrack was only a couple of hours away via courier. Even though its suggested pricing for a VMware system recovery topped out in five figures, well over ten times what the Ukrainian software men were asking, Kroll’s approach gave far more cause for comfort.

Several phone calls in rapid succession followed to establish the scale of the problem, the scale of the charges, and the timescale, and for there to be a bit of gentle probing as to whether or not I could be trusted not to make their job harder. Collectively, we agreed that it wasn’t complex in terms of the amount of actual kit involved; it was quite likely (but no promises) that recovery was possible; but paying for the super-duper express option was almost completely pointless, since this would be a series of long-scan and recovery processes. I began to feel as if my job here was almost done and could hear the departures lounge beckon.

But first, Kroll’s own remote diagnostic software had to be let loose on the server, in a setup remarkably similar to the anonymous Ukrainians’ – although Kroll didn’t need the command line tools. It seems perverse to trust a company to run installs on a machine inside your LAN, and run another long bit-scan of the drive with far less visible, interactive utility programs, even before you’ve signed a contract or made a payment. But the results were reassuring and offered an explanation of why all the other approaches had failed. After Kroll’s run finished, we had a curt voicemail: the drive media was physically damaged. Could we pack it up and send it to them for the next phases of analysis and, possibly, recovery?

Since Kroll’s offices were very close to the airport, I volunteered to take the drive with me: much easier than arranging a courier, I thought. Kroll Ontrack disagreed, pointing out that almost all its jobs moved around on courier services, which showed a much lower failure-to-deliver rate than a lone driver in a strange country in bad weather. I was mortally offended – right up to the point where I couldn’t drive on to the autobahn because there was a huge air ambulance helicopter parked right on top of the access ramp. I had plenty of time to look at the 42-tonne truck lying on its side in the field by the carriageway in the ensuing diversion traffic jam, while pondering, in a little fit of schadenfreude, whether it was the groupage delivery for Kroll’s preferred courier.

As I flew home, after a total of 30 hours of deep bit-scanning and data-copying, Kroll Ontrack got everything back. Only the partition table had been “physically damaged”, although neither I nor the client really wanted to spend any more money on validating what had actually happened, given that Kroll used the same courier to send back a single, tiny disk in a Jiffy bag. Formatted for a Windows machine, this 2.5in 1TB USB 3 disk contained all the missing VMs. It was, in my client’s opinion, a $10,000 external disk.

We had already copied over the backed-up VMs that had been completed from the separate NAS box; it seemed almost an anti-climax to plug in that little drive and get the missing data up and running.

Hyperconvergence for the rest of us

It’s remarkable how slowly the whole industry updates its network infrastructure. Gigabit is the standard I encounter most often, even though most small networks don’t achieve the practical maxima possible using that standard. Part of this is because the experiment to see if higher speeds are possible can be amazingly disruptive, even in a network with no other lurking configuration issues. But also because there’s an incredible hike in cost between little plastic-case switches with “gigabit” written on the front and a full-scale, web-configured, intelligent routing switch with customisable module ports, multiple ports on the front and back, and service access controls for the intensely paranoid network manager.

But time has been passing, and affordable 10GbE hardware is now creeping into the second-hand, ex-corporate market. I’m aware that lots of readers think this is one of my obsessions, but it seems to follow on that if you have scars from any kind of network tweaking, then your natural response should be to move forward with some low-cost experiments that don’t put your head on the block.

This goes double for 10GbE networks, because the main prospect for kudos out of such an upgrade is to make use of hyperconvergence. That state where a server just has a single lead going into it, which can easily and fluidly balance all the different traffic types passing through it. Not something we should be finding difficult in 2017, surely?

Unfortunately, it is difficult. Especially with virtualisation leading to much more varied traffic types, all trying to live together. I’m pretty sure that a lot of the people who have a hard time “going virtual” are reporting failures based on single servers, connected by a single cable to a single switch, as a consequence of which they rarely see network speeds of as much as 25MB/sec.
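To put that 25MB/sec in perspective, here’s a rough headroom calculation – the overhead figure is a ballpark assumption, not a measurement – comparing what a single Gigabit link can carry with what a congested, fully converged one often delivers.

LINE_RATE_MBPS = 1000  # Gigabit Ethernet, in megabits per second

def payload_mb_per_sec(line_mbps, protocol_overhead=0.06):
    # Approximate usable payload after Ethernet/IP/TCP framing
    return line_mbps * (1 - protocol_overhead) / 8

usable = payload_mb_per_sec(LINE_RATE_MBPS)  # roughly 117 MB/s in the best case
observed = 25                                # the sort of figure people actually report

print(f"theoretical payload on one GbE link: ~{usable:.0f} MB/s")
print(f"typical converged single-link figure: ~{observed} MB/s")
print(f"that's only {observed / usable:.0%} of what the wire can carry")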

And some of those converged services – such as iSCSI – don’t like to cooperate or, indeed, even recover from a traffic jam on a crowded Gigabit connection. Trying a completely converged server can be disappointing, even on a fully tuned Gigabit LAN: I tend to use the relatively plentiful multi-port Gigabit Ethernet cards, but that may be because I love the cat’s cradle complexity that they permit.

If you want the icing on the cake, then Microsoft says it won’t support hyperconvergence on anything less than a 10GbE LAN. Here, right now, I have two 10GbE-capable devices: a Netgear XS716E for copper connections, and an SMC 48-port, which only presents 10GbE using a fibre port on the back of the switch. Just having them stacked up together in the computer room means that I can select a 10GbE card and pick my way through the traffic management setup, rather than the virtual port setup, in either VMware or Hyper-V: and my users have no idea it’s happening.

That’s the best way to dip a toe in the subject.


BELOW WD Red drives have easily dismounted controller boards – useful in a crisis, if not this one

ABOVE Contrary to what the internet says, VMware doesn’t do anything magical to make a single disk fault-tolerant

ABOVE As prices drop, isn’t it time you experimented with some 10GbE hardware?
