If you had asked me 6+ weeks ago how far along my replication project would be on this date, I would have thought I’d be basking in the glory of success, and admiring my accomplishments.
…I should have known better.
Nothing like several IT emergencies unrelated to this project to turn one’s itinerary into garbage. A failed server (an old physical storage server that I don’t have room on my SAN for), a tape backup autoloader that tanked, some Exchange Server and Domain Controller problems, and a host of other odd things that I don’t even want to think about. It’s easy to overlook how much work it takes just to keep an IT infrastructure from losing ground from the day before. At times, it can make you wonder how any progress is made on anything.
Enough complaining for now. Let’s get back to it.
For my testing, all of my replication is set to occur just once a day. This is to keep it simple, and to help me understand what needs to be adjusted when my offsite replication is finally turned up at the remote site.
I’m not overly anxious to turn up the frequency even if the situation allows. Some pretty strong opinions exist on how best to configure the frequency of the replicas. Do a little bit with a high frequency, or a lot with a low frequency. What I do know is this. It is a terrible feeling to lose data, and one of the more overlooked ways to lose data is for bad data to overwrite the good data on your backups before you catch it. Tapes, disk, simple file cloning, or fancy replication; the principle is the same, and so is the result. Since the big variable is retention period, I want to see how much room I have to play with before I decide on frequency. My purpose for offsite replication is disaster recovery. …not to make a disaster bigger.
The million dollar question has always been how much changed data, as perceived by the SAN, will occur over a given period of time on typical production servers. It is nearly impossible to know this until one is actually able to run real replication tests. I certainly had no idea. This would be a great feature for Dell/EqualLogic to add to their solution suite. Have a way for a storage group to run a simulated replication, where it simply collects statistics that accurately reflect the amount of data that would be replicated during the test period. What a great feature for those looking into SAN to SAN replication.
Below are my replication statistics for a 30 day period, where the replicas were created once per day, after the initial seed replica was created.
Average data per day per VM
- 2 GB for general servers (service based)
- 3 GB for servers with guest iSCSI attached volumes.
- 5.2 GB for code compiling machines
Average data per day for guest iSCSI attached data volumes
- 11.2 GB for Exchange DB and Transaction logs (for a 50GB database)
- 200 MB for a SQL Server DB and Transaction logs
- 2 GB for SharePoint DB and Transaction logs
The replica sizes for the VMs were surprisingly consistent. Our code compiling machines had larger replica sizes, as they write some data temporarily to the VMs during their build processes.
The guest iSCSI attached data volumes naturally varied more from day-to-day activities. Weekdays had larger amounts of replicated data than weekends. This was expected.
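To turn the averages above into something I can plan with, here is a rough sketch in Python. The per-day figures come straight from the lists above. Everything else (the function names, the example inventory counts, the 8 hour replication window) is a made-up assumption for illustration, and the math ignores protocol overhead, compression, and day-to-day variation.

```python
# Average changed data per day (GB), taken from the 30 day figures above.
DAILY_CHANGE_GB = {
    "general_vm": 2.0,
    "vm_with_guest_iscsi": 3.0,
    "build_vm": 5.2,
    "exchange_volume": 11.2,
    "sql_volume": 0.2,
    "sharepoint_volume": 2.0,
}


def total_daily_gb(counts):
    """Sum the expected changed data per day for an inventory.

    counts maps a key of DAILY_CHANGE_GB to how many of that item exist.
    """
    return sum(DAILY_CHANGE_GB[kind] * n for kind, n in counts.items())


def required_mbps(gb_per_day, window_hours):
    """Sustained link speed (megabits per second) needed to ship
    gb_per_day inside a replication window of window_hours.

    Uses decimal units (1 GB = 8000 megabits) and no overhead allowance.
    """
    return gb_per_day * 8000 / (window_hours * 3600)


if __name__ == "__main__":
    # Hypothetical inventory, not my actual environment.
    example = {
        "general_vm": 10,
        "vm_with_guest_iscsi": 3,
        "build_vm": 2,
        "exchange_volume": 1,
        "sql_volume": 1,
        "sharepoint_volume": 1,
    }
    daily = total_daily_gb(example)
    print(f"Changed data per day: {daily:.1f} GB")
    print(f"Link needed for an 8 hour window: {required_mbps(daily, 8):.1f} Mbps")
```

Treat the result as a floor rather than a target. Weekday volumes ran higher than the average, so a real link would need headroom above whatever this prints.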
Some servers, because of how they generate data, may stick out like sore thumbs. For instance, our source code control server uses a crude (but important) form of application layer backup. The result is that for 75 GB worth of repositories, it would generate 100+ GB of changed data that it would want to replicate. If the backup mechanism (which is a glorified file copy and package dump) is turned off, the amount of changed data is down to a very reasonable 200 MB per day. This is a good example of how we will have to change our practices to accommodate replication.
Decreasing the amount of replicated data
Up to this point, the only step taken to reduce the amount of replicated data is the adjustment made in vCenter to move the VMs’ swap files onto another VMFS volume that will not be replicated. That of course only affects the swap files that the hypervisor manages for each VM, not the paging files inside the guest that are controlled by the OS. I suspect that a healthy amount of the changed data on the VMs comes from the guest OS paging files. The amount of changed data on those VMs looked suspiciously similar to the amount of RAM assigned to each VM. There typically is some correlation between how much RAM an OS has to run with and the size of its page file. This is pure speculation at this point, but certainly worth looking into.
The next logical step would be to figure out what could be done to reconfigure VMs to perhaps place their paging/swap files in a different, non-replicated location. Two issues come to mind when I think about this step.
1.) This adds an unknown amount of complexity (for deploying, and restoring) to the systems running. You’d have to be confident in the behavior of each OS type when it comes to restoring from a replica where it expects to see a page file in a certain location, but does not. How scalable this approach is would also need to be asked. It might be okay for a few machines, but how about a few hundred? I don’t know.
2.) It is unknown as to how much of a payoff there will be. If the amount of data per VM gets reduced by say, 80%, then that would be pretty good incentive. If it’s more like 10%, then not so much. It’s disappointing that there seems to be only marginal documentation on making such changes. I will look to test this when I have some time, and report anything interesting that I find along the way.
The fires… unrelated, and related
One of the first problems to surface recently involved my 6224 switches. These were the switches that I put in place of our 5424 switches to provide better expandability. Well, something wasn’t configured correctly, because the retransmit ratio was high enough that SANHQ actually notified me of the issue. I wasn’t about to overlook this, and reported it to the EqualLogic Support Team immediately.
I was able to get these numbers under control by reconfiguring the NICs on my ESX hosts to talk to the SAN with standard frames. It’s not a long term fix, but for the sake of the stability of the network, it was the most prudent step for now.
After working with the 6224’s, they do seem to behave noticeably differently from the 5424’s. They are more difficult to configure, and the suggested configurations from the Dell documentation seem more convoluted and contradictory. Multiple documents and deployment guides had inconsistent information. Technical Support from Dell/EqualLogic has been great in helping me determine what the issue is. Unfortunately some of the potential fixes can be very difficult to execute. Firmware updates on a stacked set of 6224’s will result in the ENTIRE stack rebooting, so you have to shut down virtually everything if you want to update the firmware. The ultimate fix for this would be a revamp of the deployment guides (or let’s try just one deployment guide) for the 6224’s that nullifies any previous documentation. By way of comparison, the 5424 switches were, and are, very easy to deploy.
The other issue that came up was some unexpected behavior regarding replication, and its use of free pool space. I don’t have any empirical evidence to tie these two together, but this is what I had observed.
During this past month in which I had an old physical storage server fail on me, there was a moment where I had to provision what was going to be a replacement for this box, as I wasn’t even sure if the old physical server was going to be recoverable. Unfortunately, I didn’t have a whole lot of free pool space on my array, so I had to trim things up a bit, to get it to squeeze on there. Once I did, I noticed all sorts of weird behavior.
1. Since my replication jobs (with ASM/ME and ASM/VE) leverage the free pool space for the temporary replica/snap that is created on the source array, this caused problems. The biggest one was that my Exchange server would completely freeze during its ASM/ME snapshot process. Perhaps I had this coming to me, because I deliberately configured it to use free pool space (as opposed to a replica reserve) for its replication. How it behaved caught me off guard, and made it interesting enough for me to never want to cut it close on free pool space again.
2. ASM/VE replica jobs also seem to behave oddly with very little free pool space. Again, this was self-inflicted because of my configuration settings. It left me desiring a feature that would allow you to set a threshold so that in the event of x amount of free pool space remaining, replication jobs would simply not run. This goes for ASM/VE and ASM/ME.
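The threshold idea in item 2 is simple enough to sketch in Python. To be clear, this is not an EqualLogic, ASM/VE, or ASM/ME feature or API. `PoolStatus` and `should_run_replication` are hypothetical names, and how the free space numbers would actually be collected from the array is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolStatus:
    """Hypothetical snapshot of a storage pool, in GB."""
    total_gb: float
    free_gb: float


def should_run_replication(pool: PoolStatus,
                           min_free_gb: Optional[float] = None,
                           min_free_pct: Optional[float] = None) -> bool:
    """Return True only if free pool space clears every configured threshold.

    A threshold left as None is not checked.
    """
    if min_free_gb is not None and pool.free_gb < min_free_gb:
        return False
    if min_free_pct is not None:
        free_pct = 100.0 * pool.free_gb / pool.total_gb
        if free_pct < min_free_pct:
            return False
    return True


if __name__ == "__main__":
    # Example numbers are invented for illustration.
    pool = PoolStatus(total_gb=10000, free_gb=200)
    if should_run_replication(pool, min_free_gb=500):
        print("Free pool space is fine, run the replica job.")
    else:
        print("Free pool space is too low, skip this replica job.")
```

A check like this would sit in front of whatever kicks off the job. Skipping a replica is not free either, since it stretches the gap between recovery points, so a skipped job should raise an alert rather than fail silently.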
Once I recovered that failed physical system, I was able to remove that VM I set aside for emergency turn up. That increased my free pool space back up over 1TB, and all worked well from that point on.
Lastly, one subject came up that doesn’t show up in any deployment guide I’ve seen. The timing of all this protection shouldn’t be overlooked. One wouldn’t want to stack several replication jobs that use the same free pool space on top of each other before the earlier ones have had time to finish replicating. Other snapshot jobs, replicas, consistency checks, traditional backups, etc. should be well coordinated to keep overlap to a minimum. If you are limited on resources, you may also be able to use timing to your advantage. For instance, set your daily replica of your Exchange database to occur at 5:00am, and your daily snapshot to occur at 5:00pm. That way, you have reduced your maximum loss period from 24 hours to 12 hours, just by offsetting the times.
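The arithmetic behind that offset can be checked with a few lines of Python. This is only a sketch: `max_loss_hours` is a name I made up, it takes 24 hour "HH:MM" strings, and it assumes every job runs once a day at the same time and that any recovery point (replica or snapshot) counts equally. That last assumption is generous, since a local snapshot won’t help if the whole array is lost.

```python
def max_loss_hours(times):
    """Largest gap, in hours, between consecutive daily recovery points.

    times is a list of "HH:MM" strings in 24 hour format. The gap that
    wraps around midnight is included.
    """
    if not times:
        raise ValueError("at least one recovery point time is required")
    minutes = []
    for t in times:
        hours, mins = t.split(":")
        minutes.append(int(hours) * 60 + int(mins))
    minutes.sort()
    # Gaps between neighbours within the same day.
    gaps = [later - earlier for earlier, later in zip(minutes, minutes[1:])]
    # Gap from the last point of the day to the first point of the next day.
    gaps.append(24 * 60 - minutes[-1] + minutes[0])
    return max(gaps) / 60


if __name__ == "__main__":
    print(max_loss_hours(["05:00"]))            # replica only
    print(max_loss_hours(["05:00", "17:00"]))   # replica plus offset snapshot
```

It also shows why the offset matters more than the count. Two jobs scheduled close together, say 5:00am and 9:00am, still leave a 20 hour gap.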