Replication with an EqualLogic SAN; Part 2

 

In part 1 of this series, I outlined the decisions made in order to build a replicated environment.  On to the next step: racking up the equipment, migrating my data, and laying some groundwork for testing replication.

While waiting for the new equipment to arrive, I wanted to take care of a few things first:

1.  Update my existing PS5000E array to the latest firmware.  This has never been a problem, other than the times that I’ve forgotten to log in as the default ‘grpadmin’ account (the only account allowed to do firmware updates).  The process is slick, with no perceived interruption.

2.  Map out how my connections should be hooked up on the switches.  Redundant switches can only be redundant if you plug everything in the correct way.

3.  IP addressing.  It’s all too easy just to randomly assign IP addresses to a SAN.  It may be its own isolated network, but in the spirit of “design as if you know it’s going to change,” it might just be worth observing good addressing practices.  My SAN is on a /24 net block, but I configure my IP addresses to respect potential address boundaries within that range.  This is so that I can subnet or VLAN them down (e.g. to a /28) later on, as well as to help simplify rule sets on my ISA server that are based on address boundaries, and not a scattering of addresses.
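
To illustrate the idea, here is a minimal sketch using Python’s standard ipaddress module.  The 10.10.50.0/24 network and the role names are made-up examples rather than my actual SAN addressing; the point is simply that addresses get assigned inside /28-aligned boundaries, so a future subnet, VLAN, or firewall rule can follow those boundaries cleanly.

```python
# Hypothetical example: carve a /24 SAN block into /28-aligned ranges so that
# addresses assigned today can be split into subnets or VLANs later without
# renumbering, and so firewall rules can match clean boundaries.
import ipaddress

san_net = ipaddress.ip_network("10.10.50.0/24")   # made-up SAN network

# One /28 boundary reserved per role; the role names are illustrative only.
roles = ["group-and-array-ports", "esx-host-initiators", "replication-partners"]

for role, block in zip(roles, san_net.subnets(new_prefix=28)):
    print(f"{role:24s} -> {block}  ({block.num_addresses - 2} usable addresses)")
```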

Preparing the new array

Once the equipment arrived, it made most sense to get the latest firmware on the new array.  The quickest way is to set it up temporarily using the “initialize PS series array” feature in the “Remote Setup Wizard” of the EqualLogic HIT Kit on a machine that can access the array.  Make it its own group, update the firmware, then reset the array to the factory defaults.  After completing the update and typing “reset,” up comes the most interesting confirmation prompt you’ll ever see.  Instead of “Reset this array to factory defaults? [Y/N]” where a “Y” or “N” is required, the prompt is “Reset this array to factory defaults? [n/DeleteAllMyDataNow]”  You can’t say that isn’t clear, and I applaud EqualLogic for it.  Wiping a SAN array clean is serious stuff, and definitely should be harder than typing a “Y” after the word “reset.”

After the unit was reset, I was ready to join it to the existing group temporarily so that I could evacuate all of the data from the old array and have it placed on the new array.  I plugged all of the array ports into the SAN switches, and turned it on.  Using the Remote Setup Wizard, I initialized the array, joined it to the group, then assigned and activated the rest of the NICs.  To migrate all of the data from one array to another, highlight the member with the data on it, then click on “Delete Member.”  Perhaps EqualLogic will revisit this term.  “Delete” just implies way too many things that don’t relate to this task.

The process of migrating data chugs along nicely.  VMs and end users are none the wiser.  Once it is complete, the old array will remove itself from the group, and reset itself to the factory defaults.  It’s really impressive.  Actually, the speed and simplicity of the process gave me confidence for when we need to add additional storage.

When the old array was back to its factory defaults, I went back to initialize the array, and set it up as a new member in a new group.  This new group would be used for some preliminary replication testing, and will eventually live at the offsite location.

As for how this process compares with competing products, I’m the wrong guy to ask.  I’ve had zero experience with Fibre Channel SANs or with iSCSI SANs from other vendors.  But what I can say is that it was easy and fast.

After configuring replication between the two groups, which consisted of setting a few shared passwords between them and enabling replication on each volume, I was ready to try it out.  …Almost.

 

Snapshots and replication

It’s worth taking a step back to review a few things about snapshots and how the EqualLogic handles them.  Replicas appear to work in a similar (but not identical) manner to snapshots, so many of the same principles apply.  Remember that snapshots can be made in several ways.

1.  The most basic are snapshots created in the EqualLogic Group Manager.  These do exactly as they say, making a snapshot of the volume.  The problem is that they are not file-system consistent for VM datastores, and would only be suitable for datastores in which all of the VMs were turned off at the time the snapshot was made.

2.  To protect VMs, “Autosnapshot Manager VMware Edition” (ASM/VE) provides the ability to create a point-in-time snapshot, leveraging vCenter through VMware’s API, then does some nice tricks to make this an independent snapshot (well, of the datastore anyway) that you see in the EqualLogic Group Manager, under each respective volume.

3.  For VMs with guest iSCSI attached drives, there is “Autosnapshot Manager Microsoft Edition” (ASM/ME).  This great tool is installed with the Host Integration Toolkit (HIT Kit).  It makes application-aware snapshots by taking advantage of the Microsoft Volume Shadow Copy Service (VSS) provider.  This is key for protecting SQL databases, Exchange databases, and even flat-file storage residing on guest attached drives.  It ensures that all I/O is flushed when the snapshot is created.  I’ve grown quite partial to this type of snapshot, as it’s nearly instant, causes no interruption to end users or services, and provides easy recoverability.  The downside is that it can only protect data on drives attached within the VM’s guest iSCSI initiator, and it must have a VSS writer specific to an application (e.g. Exchange, SQL) in order for it to talk correctly.  You cannot protect the VM itself with this type of snapshot.  Also, vCenter is generally unaware of these types of guest attached drives, so VCB backups and other apps that rely on vCenter won’t include these types of volumes.

So just as I use ASM/ME for smartcopy snapshots of my guest attached drives, and ASM/VE for my VM snapshots, I will use these tools in a similar way to create VM and application-aware replicas of the VMs and the data.

ASM/VE tip:  Smartcopy snapshots using ASM/VE give the option to “Include PS series volumes accessed by guest iSCSI initiators.”  I do not use this option for a few very good reasons, and rely completely on ASM/ME for properly capturing guest attached volumes. 

Default replication settings in EqualLogic Group Manager

When one first configures a volume for replication, some of the EqualLogic defaults are set very generously.  The two settings to look out for are the “Total replica reserve” and the “Local replication reserve.”  The result is that these very conservative settings can chew up a lot of the free space on your SAN.  Assuming you have a decent amount of free space in your storage pool, and you choose to stagger some of your replication to occur at various times of the day, you can reduce the “Local replication reserve” down to its minimum, then click the checkbox for “allow temporary use of free pool space.”  This will minimize the impact of enabling replication on your array.
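
As a rough illustration of why this matters, here is a back-of-the-envelope sketch in Python.  The percentages are the defaults as I recall them (100% local replication reserve, 200% total replica reserve) compared against a trimmed configuration that uses the minimum local reserve; the 500 GB volume is a made-up example, and you should check Group Manager on your own firmware for the actual values before trusting these numbers.

```python
# Hypothetical 500 GB volume: compare the space claimed by the (recalled)
# default reserves with a trimmed configuration that uses the minimum local
# replication reserve plus "allow temporary use of free pool space."
volume_gb = 500

configs = {
    "defaults": {"local_reserve_pct": 100, "total_replica_reserve_pct": 200},
    "trimmed":  {"local_reserve_pct": 5,   "total_replica_reserve_pct": 200},
}

for name, cfg in configs.items():
    local_gb  = volume_gb * cfg["local_reserve_pct"] / 100          # reserved on the source group
    remote_gb = volume_gb * cfg["total_replica_reserve_pct"] / 100  # reserved on the replication partner
    print(f"{name:8s}: {local_gb:.0f} GB local reserve, {remote_gb:.0f} GB on the partner group")
```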

 

Preparing VMs for replication

There were a few things I needed to do to prepare my VMs to be replicated.  I wasn’t going to tackle all optimization techniques at this time, but thought it best to get some of the easy things out of the way first.

1.  Reconfigure VMs so that the swap file is NOT in the same directory as the other VM files.  (This is the swap file for the VM at the hypervisor level; not to be confused with the guest OS swap file.)  First I created a volume in the EqualLogic Group Manager that would be dedicated to VM swap files, then made sure it was visible to each ESX host.  Then, simply configure the swap location at the cluster level in vCenter, followed by changing the setting on each ESX host.  The final step will be to power off and power on each VM.  (A restart/reboot will not work for this step.)  Once this is completed, you’ve eliminated a sizeable amount of data that doesn’t need to be replicated (a rough sketch of that saving appears at the end of this section).

2.  Revamp datastores to reflect good practices with ASM/VE.  (I’d say “best practices,” but I’m not sure if they exist, or if these qualify as such.)  This is a step that takes into consideration how ASM/VE works, and how I use ASM/VE.  I’ve chosen to make my datastores reflect how my VMs are arranged in vCenter.  Below is a screenshot in vCenter of the folders that contain all of my VMs.

[Screenshot: the vCenter folder arrangement containing the VMs]

Each folder has VMs in it that reside in just one particular datastore.  So for instance, the “Prodsystems-Dev” folder has a half dozen VMs exclusively for our Development team.  These all reside in one datastore called VMFS05DS.  When a scheduled snapshot of a vCenter folder (e.g. “Prodsystems-Dev”) is made using ASM/VE, it will only hit those VMs in that vCenter folder, and the single datastore that they reside on.  If it is not done this way, an ASM/VE snapshot of a folder containing VMs that reside in different datastores will generate snapshots in each datastore.  This becomes terribly confusing to administer, especially when trying to recover a VM.

Since I recreated many of my volumes and datastores, I also jumped on the opportunity to make these new datastores with a 4MB block size instead of the default 1MB block size.  Not really necessary in my situation, but based on the link here, it seems like a good idea.
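
For context, the main effect of the block size on a VMFS-3 datastore is the maximum size of any single file (such as a VMDK) it can hold.  The figures below are the commonly published VMFS-3 limits as I understand them; verify against VMware’s documentation for the vSphere version in use.

```python
# Commonly published VMFS-3 limits: the block size fixes the largest single
# file (e.g. a VMDK) a datastore can hold. Verify against VMware documentation
# for the vSphere version in use.
vmfs3_max_file_gb = {1: 256, 2: 512, 4: 1024, 8: 2048}  # block size (MB) -> approx. max file size (GB)

for block_mb, max_gb in sorted(vmfs3_max_file_gb.items()):
    print(f"{block_mb} MB block size -> largest file roughly {max_gb} GB")
```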

Once the volumes and the datastores were created and sized the way I desired, I used the Storage vMotion function in vCenter to move each VM into the appropriate datastore to mimic my arrangement of folders in vCenter.  Because I’m sizing my datastores for a functional purpose, I have a mix of large and small datastores.  I probably would have made these the same size if it weren’t for how ASM/VE works.
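
Before moving on, here is the rough sketch of the saving from step 1 above.  A VM’s hypervisor-level swap file is sized at its configured memory minus its memory reservation, so summing that across the VMs in a replicated datastore estimates how much data stops landing in the replica set once the swap files move to their own non-replicated volume.  The inventory below is entirely hypothetical; in practice the numbers would come from vCenter.

```python
# Hypothetical VM inventory: (name, configured memory GB, memory reservation GB).
# The hypervisor swap file (.vswp) is sized at configured memory minus the
# memory reservation, so this sum approximates the data that no longer needs
# to be replicated once swap files live on a dedicated, non-replicated volume.
vms = [
    ("file01", 4, 0),
    ("sql01",  8, 4),
    ("exch01", 16, 0),
]

saved_gb = sum(mem_gb - reservation_gb for _, mem_gb, reservation_gb in vms)
print(f"Roughly {saved_gb} GB of swap no longer needs to be replicated")
```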

The datastores are in place, and now mimic the arrangement of folders of VMs in vCenter.  Now I’m ready to do a little test replication.  I’ll save that for the next post.

Suggested reading

Michael Ellerbeck has some great posts on his experiences with EqualLogic, replication, Dell switches, and optimization.    A lot of good links within the posts.
http://michaelellerbeck.com/

The Dell/EqualLogic Document Center has some good overview documents on how these components work together.  Lots of pretty pictures. 
http://www.equallogic.com/resourcecenter/documentcenter.aspx

20 thoughts on “Replication with an EqualLogic SAN; Part 2”

  1. Found this post through a Google search I did on something else, but it made me think about something (I’m fairly new to virtualization/SAN/EqualLogic, so pardon my ignorance). It was about your section concerning ASM/VE and ASM/ME. I’m currently snapping my virtual machines with ASM/VE (smart copy actually). I do have a couple of virtual machines where I have MS iSCSI mounts from my EqualLogic SAN to the virtual machines (data/file storage machines). I’m currently backing those mounts/volumes up with Group Manager. Are you saying that Group Manager will cause issues should I need to restore one of these iSCSI mounts?

    Thanks for your time and your blog has been bookmarked!

    1. Hello Christian,

      Thanks for reading, and the reply. When snapshotting volumes that are connected using the guest iSCSI initiator, if you snapshot them using the Group Manager while the VM is on, it will make a “crash consistent” snapshot of the data. The application never had a chance to flush its I/O to put the data in a recoverable state (similar to pulling the plug on a server). If you use ASM/ME to snapshot the guest attached VM, it will leverage VSS to make an “application consistent” snapshot of the data. ASM/ME does all of the work for you! This can be important even for flat-file storage. It’s critical for Exchange and SQL. You might be able to get away with a crash consistent copy, but it’s a risk that you can definitely avoid by simply using ASM/ME. The only way the Group Manager snapshots are valid is if the systems accessing the volume are off while the snapshot is made.

  2. Thank you for your writeup, I’m finding it very useful. Could you give a little detail on the following statement: “Smartcopy snapshots using ASM/VE give the option to “Include PS series volumes accessed by guest iSCSI initiators.” I do not use this option for a few very good reasons, and rely completely on ASM/ME for properly capturing guest attached volumes.”

    Why would we not want to use the ‘include PS series volumes….’?

    1. Hello Fred,

      There are several good reasons that you wouldn’t want to use the “Include PS series volumes accessed by guest iSCSI initiators” option. ASM/ME will make application consistent snapshots of data by leveraging VSS. It does all of the coordination and flushing of I/O necessary to give these snapshots good recoverable data. If you use ASM/VE and choose the “Include PS series volumes…” option, it will indeed make a snapshot of those guest attached volumes, but they will NOT be coordinated with the application (whether it be Exchange, SQL, or file access). Early on, when I did use this option, I had a file server doing some extremely intensive I/O when the ASM/VE snap was made. The guest attached volume demonstrated some very suspicious behavior from that point on, and so I switched exclusively to relying on ASM/ME to protect my guest attached volumes. ASM/ME has worked absolutely flawlessly for me. It’s truly an impressive “free” application.

      I’m also betting that you only have so much room when it comes to snapshot reserve of those guest attached volumes. Well, if you choose that option above, along with the traditional (recommended) ASM/ME snapshots, you will quickly run out of room. Good snaps will be mixed in with bad snaps, and it will be confusing when you try to recover.

      This is why I recommend that you do not use the “Include PS series volumes accessed by guest iSCSI initiators” option, and simply leverage ASM/ME and ASM/VE for their specific purposes.

  3. I have a question about the part on changing the location of the swap file: How many VM swapfiles will you store on a volume?

    1. My short answer is, “all of them.” It shouldn’t be a concern. Most of the time there isn’t much activity on the VM swap file (not talking about the OS swap file) if things are not overcommitted. I just checked SANHQ on the activity of that volume. Over the past 30 days, it’s averaged under 2 IOPS.

  4. Another question about swap files. You exclude discussing the OS swap file, but I’m replicating over a slow link and need to reduce unnecessary replicated data as much as possible. The OS swap file seems like a good target for moving off the SAN or to an unreplicated volume to reduce replication traffic. I realize this could potentially affect vMotion and similar, but I’m not really concerned about using vMotion, vStorage or HA in my particular setup. Thank you.

    1. Hello and thanks for reading. Yes, I deliberately excluded discussing tweaking the VMs to not replicate a volume that contains an OS swap file, for a few different reasons. It is a significant enough topic that it really deserves its own post. And many VMs tweaked for this behave differently depending on the version of the OS. The key to them all is not preventing them from replicating, but getting them into a predictable, recoverable state at the site that does not have that particular volume replicated. It definitely adds a significant amount of complexity, but it can be done. Also, I think it distracts from the very first step that many should take after turning up replication, which is seeing whether there are other things or data compromising the attempt at creating a minimal footprint. A great example of this is someone who, say, has SQL backup jobs that do midday backups or transaction log dumps, and they go to the same volume where the DBs or the transaction logs live. By modifying just that, you can greatly reduce the replica footprint without adding any real complexity to the VM.

  5. Is there a reason to keep a file-server volume as a direct attached LUN as opposed to migrating it into a datastore as a VD? It would seem that if I were to migrate it to a VD then I could utilize the ASM/VE and avoid the added complication of ASM/ME for a single volume.

    Additionally is there an advantage to using ASM as opposed to VDR?

    1. Yes, there are several reasons. I have found that it is simply more flexible, efficient, and easy to protect. If you had a large VMDK, you might have to dedicate a separate VMFS volume for it, then tack on another 20% or so for snapshot overhead. With guest attached volumes you don’t run into some of the standard volume limits.

      Remember that ASM/VE leverages the vCenter API and makes a journaled, hypervisor-consistent snap, and through a little magic, converts it into its own SAN-based snapshot. But in order to do that, it has to do some special things. The nature of how it does it is that it also makes a snap of all other VMs in that same VMFS volume, so if you aren’t clear on how to restore things, it can get confusing.

      The only way to get fully application-aware snapshots is to leverage the Volume Shadow Copy Service (VSS) provider inside the OS. ASM/ME does this, and has writers for SQL, Exchange, and NTFS. Now, the downside is that because vCenter is not aware of those volumes, if you are using a commercial backup system such as Veeam, it would not be able to see them. But I’ve found backing up guest attached volumes to external media pretty easy, as you just mount the snapshot to the media server, and it will see it as a volume.

      Comparing the two, you’ll find ASM/ME snapshots almost imperceptible in interruption. ASM/VE takes longer, and you will see some small levels of interruption while it runs through its process. Also, protecting the guest volumes more frequently is actually the desired result. After all, for a flat-file storage server, you really don’t need to protect the OS every 20 minutes, do you? …But you may have those requirements on the guest attached volumes, and with this arrangement that is easy to do.

      As for ASM vs. VDR, they are really two different animals. Both versions of ASM simply give more intelligence to a SAN block-based snapshot (crash consistent). So the reality is that whether it is 5MB or 5TB, it will be nearly instant in its protection. It is best to incorporate ASM for near-term snapshots of systems, with longer term replicas, while still providing alternate methods of long-term archiving. VDR, while valuable, is really intended for different purposes.

      I encourage you to finish off my Replication series, as well as some of my more recent posts on ASM, guest attached volumes, etc. to give you better insight as to the pros and cons of each, as well as what should be considered in an effective strategy of managing and protecting the data, and the systems that serve up the data.

  6. Very interesting post!! Thank you.
    I have two questions:
    1.- If I have a VM with an SQL database (or Exchange or FS) on guest attached drives, when I take a snapshot of my VM (for replication purposes, for example) from ASM/VE, I have problems quiescing the VM (because, I think, SQL is managed by ASM/ME). The only way to take the snapshot is not quiescing the VM. Is this the way of doing it?
    2.- If this VM has a big Windows paging file and I want to put it on another volume (non-replicated), I can’t do a replica from ASM/ME because my VM belongs to two volumes. Is it possible to put the paging file in a guest attached drive? Any issue?

    Thank you in advance
    Javier

    1. Thanks for reading, Javier. Here are some answers:

      1. When you take a “smartcopy” snapshot of your VM using ASM/VE, it will make a hypervisor-aware snapshot, quiescing the VM in the same manner as performing a simple snapshot in vCenter. What it won’t do is coordinate with the application inside the guest using VSS to make sure that the application and its data are left in a consistent state (thus the term “application consistent” snap). So the way to do this is to schedule ASM/VE for the systems that serve up the data, then leverage ASM/ME to protect the guest attached volumes in a fully “application consistent” manner. This also works well in that your business requirements for frequency of protection may be different for the data versus the system that serves up the data. ASM/VE and ASM/ME can reflect your scheduling needs. A good example of this may be that you only want to make a replica of your Exchange server every few days, but make a replica of the Exchange DB and logs every 6 hours. Using ASM/VE and ASM/ME together allows you to do this.

      2. Actually, the limitation regarding ASM/VE’s support of making a smartcopy snapshot or replica of a VM with VMDKs in different datastores has been lifted (supported in v3 and later). So that limitation no longer exists. As far as splitting off the guest paging file to a volume of its own (or a guest attached volume), the bigger concern is the recoverability of the system, and how much complexity is added for the benefit you end up getting. While I have wanted to do similar things, I would recommend exercising caution to fully understand the implications. I have also learned that with regards to minimizing changed block data, there are other matters that end up being lower hanging fruit: things like legacy tools and procedures (defrags, etc.) that undermine one’s efforts to minimize changed block data.

      1. Thank you for the quick answer. Some comments:
        1.- I have a VM with W2008R2 and SQL Server 2008 R2 with a database on an attached drive. ASM/ME replication works fine. When I schedule an ASM/VE replication (of the VM datastore), the snapshot fails quiescing the VM. Only if I stop SQL Server or if I detach the database is the snapshot done. I guess that I have to deactivate VSS when installing VMware Tools in this VM. What do you think about it?
        2.- You talk about a VM replica, but when I schedule a replica in ASM/VE it is over a datastore, not over a VM. When ASM/VE looks for dependencies, it finds my VM, which has a second VMDK disk on another datastore, and the snapshot fails too.
        Your comments are very valuable to me, but please feel free to answer my doubts, and sorry for my primitive English.

        Javier

      2. 1. How are you determining that the snapshot is failing to quiesce the VM? Is there an error message that states that? What you are describing is not typical behavior. You may need to investigate further, or work with support to resolve.
        2. Which version of ASM/VE are you using?

  7. 1. The error is: “[2012-05-11 18:00:35.494 ‘vcbMounter’ 2100 error] Error: Other error encountered: Snapshot creation failed: Could not quiesce file system.” (from VCB). A similar error occurs when taking the snapshot from vCenter. But: a) I have discovered another VM with the same problem and no ASM/ME, and b) I have set up a similar configuration (ASM/VE and ASM/ME) with an Exchange Server and it is working fine. So, I have to investigate more. Now I think the problem is the VM and not the coexistence of ASM/ME and ASM/VE. When I discover something I will post it.
    2. I’m running HIT/VE 3.1.1.37.

    Thanks
