Live migration deep dive: how Netframe moves VMs without downtime

Live migration looks like magic the first time you see it: a VM moves from one physical host to another without dropping a packet or pausing a query. Inside, it is anything but magic. It is careful coordination between the hypervisor, the storage layer, the network fabric, and the management plane.

This post walks through how Netframe does it.

Phase 1, Pre-copy memory transfer. The Manager identifies the source and destination hosts, validates that the destination has sufficient memory and CPU resources, and begins copying the VM's memory pages over the cluster network. The source VM continues to run during this phase.
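To make the validation step concrete, here is a minimal sketch of a destination admission check. The `Host` and `VM` types, the `can_accept` function, and the 10% memory headroom are all hypothetical names and values chosen for illustration; they are not Netframe's actual API or policy.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_memory_mib: int
    free_vcpus: int


@dataclass
class VM:
    name: str
    memory_mib: int
    vcpus: int


def can_accept(dest: Host, vm: VM, headroom: float = 0.10) -> bool:
    """Return True if dest has enough free memory and vCPUs for vm.

    The memory requirement is the VM's size plus a headroom fraction
    (an assumed safety margin, not a documented Netframe setting).
    """
    required_memory_mib = vm.memory_mib * (1 + headroom)
    return (
        dest.free_memory_mib >= required_memory_mib
        and dest.free_vcpus >= vm.vcpus
    )


vm = VM(name="db-01", memory_mib=8192, vcpus=4)
print(can_accept(Host("host-b", free_memory_mib=16384, free_vcpus=8), vm))
print(can_accept(Host("host-c", free_memory_mib=9000, free_vcpus=8), vm))
```

The second host is rejected because 9000 MiB falls just short of the VM's 8192 MiB plus the assumed headroom.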

Phase 2, Dirty-page tracking. As memory pages are copied, the source VM continues to write to memory. The hypervisor tracks which pages have been modified ("dirtied") since the start of the copy. After the initial copy completes, the dirtied pages are copied again. This copy-and-recheck cycle repeats until the set of dirtied pages is small enough to copy in a sub-second window.
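The iteration described above can be sketched as a small simulation. This is a simplified model, not Netframe's implementation: it assumes a constant dirty rate and a constant link rate (both in pages per second), and the function name, parameters, and round cap are hypothetical.

```python
def precopy_rounds(total_pages, dirty_rate_pps, link_rate_pps,
                   downtime_budget_s, max_rounds=30):
    """Simulate iterative pre-copy under constant rates.

    Returns (pages_sent_per_round, final_dirty_pages, converged).
    converged is True once the remaining dirty set can be copied
    within downtime_budget_s.
    """
    budget_pages = downtime_budget_s * link_rate_pps
    to_send = total_pages  # the first round copies all of memory
    rounds = []
    for _ in range(max_rounds):
        if to_send <= budget_pages:
            return rounds, to_send, True
        rounds.append(to_send)
        # Pages dirtied while this round was in flight. The dirty set
        # can never exceed the VM's total page count.
        dirtied = dirty_rate_pps * to_send // link_rate_pps
        to_send = min(total_pages, dirtied)
    return rounds, to_send, False


# 1,000,000 pages, 25,000 pages/s dirtied, 250,000 pages/s link, 0.3 s budget
print(precopy_rounds(1_000_000, 25_000, 250_000, 0.3))
# → ([1000000, 100000], 10000, True)
```

In this model each round shrinks the dirty set by the ratio of dirty rate to link rate, so the loop only converges when the VM dirties memory more slowly than the link can carry it. When it does not, the simulation gives up after `max_rounds` and reports `converged` as `False`.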

Phase 3, Brief freeze and final sync. The source VM is paused for the time it takes to copy the final dirty page set plus the CPU register state. This pause typically lasts 100–500 milliseconds.
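A back-of-the-envelope estimate shows where a figure in that range can come from. The sketch below assumes 4 KiB pages, a 10 Gbit/s link (1.25e9 bytes per second), and a fixed 20 ms allowance for CPU and device state. All three values are illustrative assumptions, not measured Netframe numbers.

```python
def estimated_pause_ms(final_dirty_pages,
                       page_size_bytes=4096,
                       link_bytes_per_s=1.25e9,
                       state_overhead_ms=20.0):
    """Estimate the pause as final-copy transfer time plus a fixed overhead."""
    transfer_s = final_dirty_pages * page_size_bytes / link_bytes_per_s
    return transfer_s * 1000 + state_overhead_ms


# 50,000 dirty pages is roughly 205 MB, about 164 ms on the wire
print(round(estimated_pause_ms(50_000), 2))
# → 183.84
```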

Phase 4, Network state handoff. The destination VM is unpaused. The cluster network layer signals upstream switches that the VM's MAC address has moved (typically via gratuitous ARP), and traffic reroutes within milliseconds.
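For readers who have not seen one, the sketch below builds the bytes of a gratuitous ARP frame: a broadcast ARP request in which the sender and target IP addresses are both the VM's own address. It only illustrates the frame layout. The function name and the example MAC and IP are hypothetical, and how Netframe actually emits the announcement is not covered here.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth_header = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4,
    # operation 1 (request)
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeroed) and target IP.
    # Sender IP equal to target IP is what makes the ARP gratuitous.
    arp_body = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_header + arp_body


frame = build_gratuitous_arp("52:54:00:12:34:56", "10.0.0.42")
print(len(frame))
# → 42
```

Switches that see this frame arrive on a new port update their forwarding tables for the source MAC, which is what redirects traffic to the destination host.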

Phase 5, Source cleanup. The source-side VM resources are released.

There are subtleties on top of all of this: bandwidth shaping so that live migration does not starve production traffic, opportunistic page compression on slow links, careful handling of pinned memory regions, and orchestrated handling of attached storage. We will go deeper on each of these in subsequent posts.
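As a taste of the first of these, bandwidth shaping is commonly done with a token bucket. The sketch below is a generic token bucket, not Netframe's shaper. The class name, method names, and rates are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow sends at a sustained rate with a bounded burst."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0

    def try_send(self, nbytes: int, now: float) -> bool:
        """Return True and spend tokens if nbytes may be sent at time now."""
        # Refill for the time elapsed since the last call, up to capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=500)
print(bucket.try_send(400, now=0.0))  # within the initial burst
print(bucket.try_send(400, now=0.0))  # bucket nearly empty
print(bucket.try_send(400, now=0.5))  # refilled after half a second
```

A migration sender that calls `try_send` before each chunk and backs off on `False` is held to the configured rate, leaving the rest of the link for production traffic.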
