Microsoft Helps AMD with GPU Hotswap on Linux

You’ve all heard of Microsoft: one of the biggest players in the cloud computing game, and the maker of the Office 365 suite and the Windows operating system. Well, they’ve recently given AMD a bit of a hand with something that’s pretty important in the enterprise and datacenter space.

Microsoft, being a leader in the area, uses AMD’s datacenter GPUs and runs Linux for maximum stability and uptime. This approach comes with a few issues, however. Occasionally, those GPUs have to be replaced due to failure or fault, and that requires shutting down the entire system to swap the component. There is another option, though: hot-plug, or hot-swap.

It’s exactly what it sounds like: remove and replace the GPU while the computer is on. Simple, right? As it turns out, no. Not really. To enable seamless hot-swap, the team at Microsoft Research had to pull their collective grey matter together and develop a driver patch that enables hot-plug of AMD GPUs on Linux servers.
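For a rough feel of what "remove and replace while the computer is on" looks like from the software side, here is a minimal Python sketch. It uses the generic `remove` and `rescan` files that Linux exposes for PCI devices under sysfs. To be clear, this is not Microsoft's patch: it only asks the kernel to detach a device and later look for new ones, and whether the GPU driver copes gracefully is exactly the part the patch is meant to address. The helper names and the example address `0000:03:00.0` are made up for illustration, and writing to these files on a real machine needs root.

```python
from pathlib import Path

# Standard sysfs mount point on Linux. It is a parameter everywhere below
# so the helpers can be pointed at a scratch directory for a dry run.
SYSFS_ROOT = Path("/sys")


def remove_path(pci_address: str, root: Path = SYSFS_ROOT) -> Path:
    """Path of the per-device 'remove' file for a PCI address."""
    return root / "bus" / "pci" / "devices" / pci_address / "remove"


def rescan_path(root: Path = SYSFS_ROOT) -> Path:
    """Path of the bus-wide 'rescan' file."""
    return root / "bus" / "pci" / "rescan"


def soft_remove(pci_address: str, root: Path = SYSFS_ROOT) -> None:
    """Ask the kernel to detach the device before it is physically pulled."""
    remove_path(pci_address, root).write_text("1")


def rescan_bus(root: Path = SYSFS_ROOT) -> None:
    """Ask the kernel to rediscover devices after a replacement is seated."""
    rescan_path(root).write_text("1")


# Hypothetical usage on a real server (as root); the address is an example:
#   soft_remove("0000:03:00.0")
#   ... swap the card ...
#   rescan_bus()
```

Keeping the sysfs root as a parameter is just a convenience so the sequence can be tried against a temporary directory instead of live hardware.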

Microsoft’s hot-swap hotfix for Linux has been posted on the mailing list and GitHub for review and testing, and is aimed at Microsoft’s Azure instances, where the ability to rapidly move resources to and from a particular machine is of some benefit.

“We are from Microsoft Research and are working on GPU disaggregation technology,” a code review request reads. “We have created a patch […], which will enable PCIe hot-plug support for AMD GPU. […] We believe the support of hot-plug of GPU devices can open doors for many advanced applications in data center in the next few years, and we would like to have some reviewers on this PR so we can continue further technical discussions around this feature.” 

While Microsoft didn’t release any further details about its GPU disaggregation technology, it seems to be some proprietary code that will allow Azure instances to dynamically add GPU resources to servers that do not physically house the cards. Because the computers that do have the cards installed work in extremely tough conditions (having multiple 500W+ space heaters running at full clip 24/7 will break things), hot-swap support is a particularly useful feature.

Hot-plugging a GPU or expansion card isn’t new, but doing it via PCIe (successfully) is. In the past, AMD developed a driver that did functionally the same thing through a Thunderbolt 3 port and an eGPU box, but that was intended for laptops and lower-power devices. It looks like AMD isn’t supporting this for datacenter usage just yet.

Give it six months.

About the author...
Ross Evans

Put together from bits of scrap electronics sourced from various junk yards, Ross is the Tech House Business Consultant and blog post writer for all things regarding tech. Avid consumer of caffeine-based products. Hates trains. Is an actual wizard.