Trouble is, on a busy server, once processes start stacking up and taking even more resources, it's on a path to nowhere.

Indeed, Rods2, once a Linux server dives into swap under disk i/o load, the system will die fairly quickly, and Linux has a god-awful tendency to swap before releasing buffers.
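In case it helps anyone watching for the same thing, here's a rough sketch (plain Python, standard /proc paths, nothing exotic assumed) that shows whether a box is eating into swap while still sitting on a big page cache, which is exactly the behaviour I mean:

#!/usr/bin/env python3
# Quick check: is the box swapping while still holding a large page cache?
# Only reads standard procfs files; paths assume a stock Linux /proc layout.

def meminfo():
    vals = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            vals[key] = int(rest.strip().split()[0])  # values are in kB
    return vals

m = meminfo()
with open("/proc/sys/vm/swappiness") as f:
    swappiness = int(f.read().strip())

swap_used = m["SwapTotal"] - m["SwapFree"]
print(f"vm.swappiness : {swappiness}")
print(f"Swap in use   : {swap_used} kB")
print(f"Buffers       : {m['Buffers']} kB")
print(f"Cached        : {m['Cached']} kB")

If swap use keeps climbing while Buffers/Cached stay large, lowering vm.swappiness is the usual knob to bias the kernel toward dropping cache before swapping anonymous pages.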
Trouble is, I'm not sure what is causing the i/o latency to spike for reasonably extended periods. Disk throughput seems to be about normal just as it starts (but drops off, obviously, during the problem period).
Across the hypervisor, our baseline latency on the HP Smart Array is in the single-digit milliseconds, but it will spike to over 100ms for 15 minutes or more, which certainly kills all the Linux VMs, as they swap rather than release disk buffers, compounding it (though the OOF main server swaps to SSD, so has less impact on the hypervisor). The Windows and Solaris VMs cope better.
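To pin down when the spikes actually start from inside a guest, a crude sampler over /proc/diskstats is enough; this is just an illustration, and the device name "sda" is an assumption you'd swap for whatever disk the VM actually sees:

#!/usr/bin/env python3
# Rough per-I/O latency sampler from /proc/diskstats.
# DEVICE is an assumption -- change it to the disk you care about (e.g. "sda").
import time

DEVICE = "sda"

def snapshot(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise SystemExit(f"device {dev} not found in /proc/diskstats")

while True:
    ios1, ms1 = snapshot(DEVICE)
    time.sleep(10)
    ios2, ms2 = snapshot(DEVICE)
    d_ios, d_ms = ios2 - ios1, ms2 - ms1
    latency = d_ms / d_ios if d_ios else 0.0
    print(f"{time.strftime('%H:%M:%S')}  {d_ios:6d} IOs  avg {latency:6.1f} ms/IO")

Logging that every ten seconds alongside the array's own counters should show whether the guests see the same 100ms+ spikes at the same times as the hypervisor does.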
I think the problem lies at the hypervisor level: I've halved the amount of RAM the OOF server is allocated, and it seems far more stable, with only one brief alert (a slow response) overnight. But nothing has changed at that level. The hardware has been up since the fire, and no firmware or hypervisor updates have taken place....