Rebuilding the Plane We're Flying In

Three years ago, we were partway through moving all of our servers and infrastructure from a single colocation facility to a multi-availability-zone Amazon Virtual Private Cloud. Things were going well and the business was booming, but we had a major problem. The old Storage Area Network (the disk system shared by many of our servers) was going to run out of capacity in a matter of months.


This problem didn't appear suddenly. It was something we'd discussed for years. We had outlined plans for how to clean up disk space and update applications so they no longer needed the old disk system. Unfortunately, that sort of technical cleanup had always lost out to other work that actually drove the business forward instead of ensuring that it wouldn't stop.

Just so we're clear: If we'd hit capacity on this system, the business would have stopped. We would not have been able to either take or fulfill any more orders. That would have put the brakes on over 500 people getting their jobs done and would have made for a really horrible customer experience (something we care a lot about).

Brainstorming a Solution

Within the Web Operations team, we started a series of informal discussions about how we could solve the problem without forcing the company to stop other work and without spending massive amounts of money.

We actually had a lot of data that we knew could be deleted from the system to free up space. The problem was that deleting the data too quickly saturated the throughput of the system, which caused all the applications that depended on it to slow to a crawl. Our site became unusable.


Still, it made sense to make cleaning up the drive space part of the solution. Joe Ritchey set to work writing scripts that throttled the cleanup process. Even with the scripts running 24x7, we weren't buying ourselves much time.
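To give a rough sense of what that throttling looked like, here's a minimal Python sketch. It is not Joe's actual script; the directory, batch size, and pacing numbers are made up for illustration. The idea is simply to delete old files in small batches and pause between batches so the SAN has headroom for everything else.

```python
#!/usr/bin/env python3
"""Minimal sketch of a throttled cleanup: delete old files in small
batches with a pause between batches so shared storage isn't saturated.
The path, age cutoff, batch size, and pause are illustrative only."""

import os
import time

CLEANUP_ROOT = "/mnt/san/old-exports"   # hypothetical directory of deletable data
MAX_AGE_DAYS = 365                      # anything older than this is fair game
BATCH_SIZE = 100                        # files deleted per batch
PAUSE_SECONDS = 30                      # breathing room for the SAN between batches


def old_files(root, max_age_days):
    """Yield paths of files older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it


def main():
    batch = []
    for path in old_files(CLEANUP_ROOT, MAX_AGE_DAYS):
        batch.append(path)
        if len(batch) >= BATCH_SIZE:
            for p in batch:
                os.remove(p)
            batch.clear()
            time.sleep(PAUSE_SECONDS)   # the throttle: let other I/O catch up
    for p in batch:                     # delete whatever is left over
        os.remove(p)


if __name__ == "__main__":
    main()
```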

We knew that ideally we'd switch to Amazon's Simple Storage Service (S3), but that would have required significant code changes to most of our applications.

The Gluster File System (GlusterFS) became our technology of choice. Because Gluster presents itself as an ordinary mounted filesystem, applications wouldn't need to know whether the servers were using Gluster or our SAN. Since we had already moved some of our infrastructure to AWS, we had several unused servers sitting around. They didn't have much disk space, but for less than the cost of a new server we could buy enough new drives to equal the capacity of our old SAN.

Gluster also has the ability to do geo-replication. That meant that moving files to Gluster inside our colocation facility would also give us a way to replicate them into the cloud. After a few days of informal discussion we had a plan.

On my way out of the office that evening, Doug Wilson mentioned that he was really interested in learning Gluster and working on the data migration. That sounded like volunteering to me, so the project landed in his lap (the whole team would be working on it, of course, but Doug took the lead on researching and implementing Gluster).

Swapping in New Parts

I put the order for the drives on my company card, and by the next week we had a pair of Gluster servers up and running. Shortly after that we ranked the applications based on the risk to the business if we lost their data and started migrating them to Gluster.

It was still a balancing act. The existing data had to be copied from the SAN to Gluster, and we ran into the same throughput-saturation issue we'd had with the cleanup scripts. Nevertheless, we gradually got to the point where we were freeing up disk space faster than we were using it.
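The migration copies were throttled in the same spirit. Here's a sketch of what a rate-limited copy between two mounts could look like; again, the source and destination paths and the rate cap are hypothetical, not the tooling we actually ran.

```python
#!/usr/bin/env python3
"""Sketch of a rate-limited copy from a SAN mount to a Gluster mount.
Capping the byte rate keeps the copy from saturating shared storage.
Source, destination, chunk size, and the rate limit are illustrative only."""

import os
import shutil
import time

SOURCE_ROOT = "/mnt/san/app-data"        # hypothetical SAN mount
DEST_ROOT = "/mnt/gluster/app-data"      # hypothetical Gluster mount
CHUNK_BYTES = 1024 * 1024                # copy 1 MiB at a time
MAX_BYTES_PER_SEC = 20 * 1024 * 1024     # cap the copy at ~20 MiB/s


def copy_file_throttled(src, dst):
    """Copy one file in chunks, sleeping so the rate stays under the cap."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            start = time.monotonic()
            chunk = fin.read(CHUNK_BYTES)
            if not chunk:
                break
            fout.write(chunk)
            elapsed = time.monotonic() - start
            min_duration = len(chunk) / MAX_BYTES_PER_SEC
            if elapsed < min_duration:
                time.sleep(min_duration - elapsed)  # the throttle
    shutil.copystat(src, dst)            # preserve timestamps and permissions


def main():
    for dirpath, _dirnames, filenames in os.walk(SOURCE_ROOT):
        rel = os.path.relpath(dirpath, SOURCE_ROOT)
        target_dir = os.path.join(DEST_ROOT, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            copy_file_throttled(os.path.join(dirpath, name),
                                os.path.join(target_dir, name))


if __name__ == "__main__":
    main()
```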


In the end we made it through the busiest period the company had ever seen in its 12 years of existence while completely recreating the shared storage solution it depended on.

We did go on to work closely with our development teams to move completely out of the colocation facility, and at this point we are close to having removed the need for even the Gluster filesystem. We've largely managed to do it without any negative impact to the company, which now has over 1,500 employees. I love working with a team that is creative enough to find ways to pull off massive changes like this.

by Jake Vanderdray