About

I'm an Aussie who moved to Ottawa, Canada in 2008. I'm always having a moan about something. This is where I moan and whinge about things. Enjoy.

Saturday 7 November 2020

VMware Shite

Recently this came up in my Facebag feed: 

Link: https://www.facebook.com/ubuntulinux/posts/10158514493598592?comment_id=10158578071988592

A comment, and my reply:

Days later, someone replied and asked "Why is that?". This led me to write the following tirade, which I feel shouldn't be lost in the noise of Facebag's huge amount of crap..

A couple of years ago I had a job where I had to implement Site Recovery Manager, to manage a couple of redundant VMs which ran billing software, for an airport.

I nearly lost my mind trying to set it all up, it was so completely broken.

It was about a 60 page document of instructions I had to write on how to do it. Ask me, and I'll send it to you. (I lied, I just found it, it's only 40 pages long).

It needs like 5 "appliance" VMs, which VMware provide, deployed before it can do anything.

Things like vCenter Server, a replication appliance/manager, some other thing with provided automation ("vRealize Operations Manager (VROM)") which seemed to actually do nothing.. And some other ones where I can't remember what they were for..

So now our environment, of 2 redundant VMs, had turned into like 7.

I started trying to do it with VMware 6.1 I believe, and reached a point where the appliances were incompatible with each other, and had to start all over again with 6.5.

Half the appliances need to be deployed using the new HTML5 UI, or they won't work. The other half need to be deployed using the legacy Flash UI, or they won't work.

Someone wrote a 10 part series (!) on how to install it all: https://univirt.wordpress.com/.../a-step-by-step-guide.../

It even includes "One of the quirks I discovered when experimenting in my lab was that deploying the replication appliance using the HTML5 client will seem to work fine until you try to boot it at which stage you get an error (see KB55027). The only workaround is to use the older vSphere Web Client.", which is exactly what I discovered, after a huge waste of my time.

KB article confirming that this is just broken: https://kb.vmware.com/s/article/55027

There are lots of these articles.

One reference I just found in my document:

This install MUST be done via vCenter Server, using the “FLEX/web client”. Otherwise it will fail with an error about “Line 321 Unsupported Section” if you try to deploy the OVF via the VMware Hypervisor Web UI eg .10. See https://kb.vmware.com/s/article/2041653 or using the HTML5 UI it will deploy, but not power on due to a missing binding, which is not created unless you deploy using the Flash/FLEX/web client. Ridiculous.

If anything goes wrong, you have to wipe everything and start again from scratch, installing ESXi on the hardware. I ended up scripting it since I had to do that about 30 times.

If you need to make any changes to those appliances, like the IP address, then you need to redeploy them. Which probably won't work, so you'll need to start again from scratch.

From my doc: When the status is complete, leave the VM powered off. If you power on the vSphere Replication appliance too soon, it will not have received the config file containing the networking and password settings, and you will not be able to access the appliance to configure it. You will power on the machine later when you configure it. If you see errors about a missing file in the console, you powered it on too soon.

These appliances would randomly get corrupted, and you'd need to redeploy them. Which usually didn't work, so you'd need to start again from scratch.

SRM only ran on Windows, so now you needed another pair of VMs, and Windows licences. (Apparently it's now available as YAA, Yet Another Appliance).

From my doc: In Windows, use Internet Explorer to go to https://172.2X.0.21/folder to access the VMware datastore. This will require dealing with a bunch of security errors, clicking “continue to this website (not recommended)” about 12 times, adding exceptions or modifying zones and other nonsense.

Windows will break the file by renaming it and losing the .exe off the end so you can’t run it from the download dialogue. Find the file in the Downloads directory (“open folder”), fix the filename by renaming/re-adding .exe to the end, and then run it.

Once everything was "working".. SRM doesn't handle VMs with multiple network interfaces. Inside our VMs we use heartbeat and Corosync, so each VM had 3 (?) interfaces to talk to each other, stay in sync, and the floating virtual IP etc.

SRM would just pick a random interface each time you performed a failover, and reconfigure that random interface, so it pretty much never worked, as it would pick the wrong interface, and then corosync and heartbeat would fail, and the backup VM would steal the virtual IP address. And probably still not work, since VMware had misconfigured the routable interface.

I ended up having to disable all this functionality, and implement it via a shell script that ran on startup, so that the interfaces would be configured and reconfigured correctly.

From my doc: Don’t bother setting up any IP customisation, as it is pointless and only handles 1 interface/IP, and will lose all the other IPs, eg the DB IP, and for the HA/cluster, and if there is a bond it will probably modify the wrong interface.. The changeover of network settings is done within the VM, using a script which detects the network has changed, and reconfigures things as necessary.
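For anyone wondering what "detects the network has changed, and reconfigures things" looks like, here is a minimal sketch of the decision logic. It is not the script from that job (that was a shell script, and I'm not reproducing it here), just an illustration in Python. Every subnet, address and interface name below is made up, and how you find the reachable gateway is left out. The point is only that the VM works out which site it woke up in, and then sets every interface itself, not just one.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical site profiles. All subnets, addresses and interface
# names are invented for illustration; substitute your own.
SITE_PROFILES = {
    "primary": {
        "network": ipaddress.ip_network("10.1.0.0/24"),
        "interfaces": {
            "eth0": "10.1.0.21/24",     # routable address
            "eth1": "192.168.50.1/24",  # cluster heartbeat link
            "eth2": "192.168.60.1/24",  # database link
        },
    },
    "recovery": {
        "network": ipaddress.ip_network("10.2.0.0/24"),
        "interfaces": {
            "eth0": "10.2.0.21/24",
            "eth1": "192.168.51.1/24",
            "eth2": "192.168.61.1/24",
        },
    },
}


def detect_site(gateway: str) -> Optional[str]:
    """Return the name of the site whose network contains the gateway."""
    addr = ipaddress.ip_address(gateway)
    for name, profile in SITE_PROFILES.items():
        if addr in profile["network"]:
            return name
    return None


def plan_reconfiguration(gateway: str, current: Dict[str, str]) -> Dict[str, str]:
    """Return {interface: address} for each interface that needs changing.

    `gateway` is whichever gateway the VM can actually reach (discovering
    that is outside this sketch). `current` maps interface name to its
    currently configured address. An unknown site yields an empty plan,
    so nothing gets touched.
    """
    site = detect_site(gateway)
    if site is None:
        return {}
    wanted = SITE_PROFILES[site]["interfaces"]
    return {
        iface: addr
        for iface, addr in wanted.items()
        if current.get(iface) != addr
    }
```

Applying the plan (actually setting the addresses, restarting the cluster services) depends on the distro, so it is deliberately not shown.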

I could have implemented the whole thing using KVM and rsync for free, in about 1/10th the time it took to get VMware "working", vs the almost $100k in licences it cost with VMware.

It doesn't support hot failover, like KVM does, so you had to (re)boot your VMs to move them.

Even when it was "working", when you tried to perform a (cold) failover, it would usually fail, and leave your VMs corrupted on both ends, so now you couldn't start them in either location.

From my doc: If the test fails for some reason, then it will error out, and you will have to “cleanup” the failed test, before you can try again.

If the attempted run (rather than test) of the failover fails, then you will end up in a state that may be difficult to recover from. The source host will be blocked from powering on the VM, and the destination may not exist, or not be able to be powered on.

To fix this situation you may need to try different things, including removing the protection from the VMs, and then attempting to restart them on the source host, and then reconfiguring the replication on them, and adding them back into the protection group(s)/recovery plan(s) again. It also may be necessary to forcefully delete the replication configured in SRM as the 2 hosts get out of sync with each other, and then you may need to go into the [ into the what?! Evidently I lost my mind/train of thought.. ].

As a result, you would need to take a backup of the VMs before you tried failing them over, so that when it would almost certainly fail, you could delete the VMs from both ends, restore the backup on the failed end, resync it, which would take hours, or days, and try again..

This backup meant using a third party tool, like Nakivo, as Veeam didn't support 6.5 yet. More licences, more costs..

If you could, by some miracle, actually get the failover to work, it invalidated the copy of the vdisks at the source site, which possibly makes sense, however you couldn't just reverse the replication and have it sync the changes back, oh no, it had to perform the entire replication from scratch from the backup/recovery site, back to the original site, which would take hours, or days, since it is across a WAN link.. Unlike using rsync, which would have taken a few minutes.
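To put rough numbers on "hours, or days" versus "a few minutes", here is a back-of-the-envelope calculation. The disk size, link speed and amount of changed data are all made-up example figures, not measurements from that project, and it ignores protocol overhead, compression, and the time a delta sync spends checksumming on both ends.

```python
def transfer_seconds(size_bytes: int, link_mbit_per_s: float) -> float:
    """Ideal transfer time for size_bytes over a link of the given speed."""
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


# Hypothetical figures, for illustration only.
FULL_DISK_BYTES = 500 * 10**9   # resend the whole 500 GB vdisk
CHANGED_BYTES = 2 * 10**9       # send only 2 GB of changed data
WAN_MBIT_PER_S = 50

full_hours = transfer_seconds(FULL_DISK_BYTES, WAN_MBIT_PER_S) / 3600
delta_minutes = transfer_seconds(CHANGED_BYTES, WAN_MBIT_PER_S) / 60

print(f"full resync: {full_hours:.1f} hours")      # prints "full resync: 22.2 hours"
print(f"delta sync:  {delta_minutes:.1f} minutes")  # prints "delta sync:  5.3 minutes"
```

Even with generous assumptions, resending everything is most of a day, and sending only what changed is a coffee break.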

If your failback fails, then you need to restore your third party backups, and try again.. but not until you have waited hours, or days, for the replication of your restored VM(s) to complete.

I left this job after less than a year. This project was a major contributing factor in that decision.

Before I left that job, I had flown halfway around the world to the airport and the backup site to perform the installations of all the VMware shite. A year later, the project still hadn't been put into production because it was so broken and unreliable.

No one else at the company could work out how I had even made it "work" as well as it did (read: hardly at all), and they actually contracted me back to show them how I had installed it, and got it working as barely as it did. (This is where the smashed TV came from.. Which I thought I had a post about, but cannot find..)

It's probably still not in production. It shouldn't be. It was a complete steaming pile of shit.

You know what I do now? At every opportunity I migrate VMs off VMware, onto KVM, or any other virtualisation technology.

Utter garbage.

Bonus content

There were other reasons I left that job, which for the work, other than the software/technologies I had to use, was a dream job, with travel, to places like Nairobi and Cairo, and I probably would have got to visit most of the main airports in Egypt..

One of my co-workers was a useless old fart, abusive, and a bully. He hauled me into his office at one point, and spent about 10 minutes yelling at me for having "wasted my time" writing scripts to automate tedious and error prone manual installation procedures, and that none of it would be used.

One of these reduced a 50 page install document into about 3 pages.

If I hadn't still been on my probationary period, I probably would have punched him in his stupid face. Or at least, I thought I still was, because a few minutes after I came out of his office from being yelled at, I was told that I had made it through my probationary period, and had health coverage again..

Another one of the things I got abused for was using Google Docs to write my documentation, since I would go work in the lab doing stuff, and didn't want to always have to take my laptop with a Windows VM and Word on it, so I would write my documentation on any random machine logged into Google, and then when I got back to my laptop I could copy/paste it into Word..

One time this somehow changed the formatting, actually not even the formatting, just the display settings, eg it went into "read mode", or something other than the default "print layout" mode, so word wrap got disabled, and the hidden characters showing line feeds and carriage returns got turned on. So the old fart came out of his office with his laptop and came and stood behind me at my desk going "What did you do to this document?!" pointing at the screen.

"Uh, click View, Print Layout. And click that backwards P thing to turn off the hidden characters".

I explained that it probably happened because I used Google Docs instead of Word. "Well don't be doing that then! You can only use Word!".

I just found this in my doc: don’t copy and paste directly from this document, it has some weird characters in the front (the indents/spaces) which muck up the cfg file and syslinux will not work, so copy and paste the following into and out of a plain text editor first.

(There was some white space/tabs for indenting, and somehow Word managed to screw up blank space! It was like the em vs en dash shit which broke a bunch of stuff, all over again).
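If you're stuck pasting config out of Word, a small filter like the following sketch is one way to catch this before syslinux does. I'm assuming the "weird characters" were non-breaking spaces and the usual "smart" punctuation; I never confirmed exactly which characters Word had inserted, so treat the replacement table as a guess to adjust.

```python
from typing import List, Tuple

# Assumed culprits: characters a word processor commonly substitutes.
# Note Word may turn a double hyphen into a single dash character, so
# mapping a dash back to "-" can still be wrong for things like
# command-line flags. Check the report from find_non_ascii() by eye.
REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
}
TABLE = str.maketrans(REPLACEMENTS)


def sanitize(text: str) -> str:
    """Replace the known word-processor characters with plain ASCII."""
    return text.translate(TABLE)


def find_non_ascii(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for every non-ASCII character.

    Lines and columns are 1-based, to match what a text editor shows.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits
```

Run `find_non_ascii()` on the pasted text first to see what is actually in there, then `sanitize()` it, and check that a second `find_non_ascii()` comes back empty before you write the cfg file.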

One other time he couldn't do something, and hauled me into his office and made me sit down at his laptop to do it for him/show him how to do it. I saw that he was using Internet Explorer. That's pretty much when I knew I was done at this job.

I think I had already contacted the recruiter who got me the job, by this point, when I started to have problems with this dude. The recruiter met with me for a coffee, and we had a chat about it.

He told me that the company had assured him that "this wouldn't happen again".

Things got worse.

At one point, I was testing the startup script for the VMs to make sure that they would failover and failback properly, on the machines which I'd installed in Nairobi. Evidently the old fart, who was then in Nairobi, got pissed off because I kept rebooting the VMs, which I erroneously thought would be fine, because I thought it was the middle of the night there when I was doing that testing.

He sent me an abusive email, claiming that I was "going rogue", and telling me that I wasn't allowed to do anything unless he told me to do it. He wasn't even my manager.

I ended up putting in my notice. Here's the letter I handed in:

Hmm, apparently I used the wrong word, lead, instead of led.

While I was in Cairo, where I spent my 2 weeks' notice, a co-worker who had travelled there with me asked me about it.

"So you couldn't handle [name]?".
"No. I don't put up with abuse and bullying".
"Oh, well you're the third person to quit for that reason".

LOL.

Also, while I was in Cairo, I was sent some ~70 page install document to update, including the process of installing using a script I'd written. That added about half a page, since the script does a whole bunch of tedious things for you, including stuff like setting the timezone on the server (we put everything in UTC for simplicity, since we had machines all over the world in every timezone..).

I was then asked to include details of everything that the script does, and how to do it all without using the script.. Uhh, just use it, that's why I wrote it?

So that added about another 20 pages I had to write, explaining how to do all the things the script did, manually, but I made sure to include this image way into the document at the start of that section:

What are they going to do, fire me?

When we landed back in Ottawa, I got a "it was nice working with you", and my job was done.

I went into the office a couple of days later, to drop off my laptop and phone, pick up a wifi router I had left behind, and say goodbye to my co-workers. Most of them were very nice people, and I miss working with a few of them.
