Ever since my series of posts on replication with a Dell EqualLogic SAN, I’ve had a lot of interest from other users wondering how I actually use the built-in tools provided by Dell EqualLogic to protect my environment. This is one of the reasons why I’ve written so much about ASM/ME, ASM/LE, and SANHQ. Well, it’s been a while since I’ve touched on any information about ASM/VE, and since I’ve updated my infrastructure to vSphere 5.0 and the HIT/VE 3.1, I thought I’d share a few pointers that have helped me work with this tool in my environment.
The first generation of HIT/VE was really nothing more than a single tool referred to as “Auto-Snapshot Manager / VMware Edition” or ASM/VE. A lot has changed, as it is now part of a larger suite of VMware-centric tools from EqualLogic called the Host Integration Tools / VMware Edition or HIT/VE. This consists of the following: EqualLogic Auto-Snapshot Manager, EqualLogic Datastore Manager, and the EqualLogic Virtual Desktop Deployment Utility. HIT/VE is one of three Host Integration toolsets, the others being HIT/ME and HIT/LE for Microsoft and Linux respectively.
Ever since HIT/VE 3.0, Dell EqualLogic thankfully transitioned toward an appliance/plug-in model. This reduced overhead and complexity, and removed some of the quirks of the older implementations. Because I had been lagging behind in updating vSphere, I was still using 2.x up until recently, and skipped right over 3.0 to 3.1. Surprisingly, many of the same practices that have served me well with the older version adapt quite well to the new version.
Let me preface this by saying that these are just my suggestions based on personal use with all versions of the HIT over the past 3 years. Just as with any solution, there are a number of different ways to achieve the same result. The information provided may or may not align with best practices from Dell, or your own practices. But the tips I provide have stood up to the rigors of a production environment, and have actually worked in real recovery scenarios. Whatever decisions you make should complement your larger protection strategies, as this is just one piece of the puzzle.
Tips for Configuring and working with the HIT/VE appliance
1. The initial configuration will ask for registration in vCenter (configuration item #8 on the appliance). You may only register one HIT/VE appliance in vCenter.
2. The HIT/VE appliance was designed to integrate with vCenter. But it also offers the flexibility of access. After the initial configuration, you can verify and modify settings in the respective ASM appliances by browsing directly to their IP address, FQDN, or DNS alias name. You may type in: http://[applianceFQDN] or for the Auto-Snapshot Manager, type in http://[applianceFQDN]/vmsnaptool.html
3. Configuration of the storage management network on the appliance is optional, and depending on your topology, may not be needed.
4. When setting up replication partners, ASM will ask for a “Server URL.” This implies you should enter an “http://” or “https://” address. Instead, just enter the IP address or FQDN without the http:// prefix. A true URL, as the label implies, will not work.
5. After you have configured your HIT/VE appliances, run back through and double check the settings. I had two of them mysteriously reset some DNS configuration during the initial deployment. It’s been fine since that time. It might have been my mistake (twice), but it might not.
6. For just regular (local) SmartCopies, create one HIT/VE appliance. Have the appliance sit in its own small datastore. Make sure you do not protect this volume via ASM. Dell warns you about this. For environments where replication needs to occur, set up a second HIT/VE appliance at the remote site. The same rules apply there.
7. Log files on the appliance are accessible via Samba. I didn’t discover this until I was working through the configuration and some issues I was running into. What a pleasant way to pull the log data off of the appliance. Nice work!
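Tip 4 above comes down to stripping the scheme from whatever address you were about to paste into the “Server URL” field. As a minimal sketch, assuming you keep your partner addresses in a script or runbook, the helper below reduces a URL-style string to the bare IP or FQDN. The function name and the example hostnames are hypothetical, and nothing here talks to HIT/VE itself.

```python
from urllib.parse import urlsplit


def to_server_address(value: str) -> str:
    """Return the bare host (IP or FQDN) from whatever was typed in.

    Hypothetical helper for illustration only; it is not part of HIT/VE.
    """
    text = value.strip()
    # urlsplit only recognizes the host when a scheme separator is present,
    # so prefix bare input such as "10.0.0.25" with "//".
    if "://" not in text:
        text = "//" + text
    host = urlsplit(text).hostname
    if not host:
        raise ValueError(f"no host found in {value!r}")
    return host


if __name__ == "__main__":
    # Example addresses are made up.
    for raw in ("https://hitve-dr.example.com/", "10.0.0.25", "hitve-dr.example.com"):
        print(raw, "->", to_server_address(raw))
```

Whatever comes out of the helper is what goes into the “Server URL” field, with no http:// or https:// in front of it.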
Tips for ASM/VE
8. Just as I learned and recommended in 2.x, the most important suggestion I have for successfully utilizing ASM/VE in your environment is to arrange vCenter folders to represent the contents of your datastores. Include in the name some indication of the referencing volume/datastore (seen in the screen capture below, where “103” refers to a datastore called VMFS103). The reason for this is so that you can keep your SmartCopy snapshots straight during creation. If you don’t do this, when you make a SmartCopy of a folder containing VM’s that reside in multiple datastores, you will see SAN snapshots in each one of those volumes, but they didn’t necessarily capture all of the data correctly. You will get really confused, and confusion is not what you need when understanding the what and how of recovering systems or data.
9. Schedule or manually create SmartCopy Snapshots by Folder. Schedule or manually create SmartCopy Replicas by datastore. Replicas cannot be created by vCenter Folder. This strategy has been the most effective for me, but if you didn’t feel like re-arranging your folders in vCenter, you could schedule or manually create SmartCopy Snapshots by datastore as well.
10. Do not schedule or create SmartCopies by individual machine. This will get confusing (see above), and may interfere with your planning of snapshot retention periods. If you want to protect a system against some short term step (e.g. installing a service pack, etc.), just use a hypervisor snapshot, and remove it when complete.
11. ASM/VE 2.x was limited to making SmartCopies of VM’s that had vmdk files all in the same location. 3.x does not have this limitation. This offers up quite a bit of flexibility if you have VM’s with independent vmdks in other datastores.
12. Test, and document. Create a couple of small volumes, large enough to hold 2 test VM’s in each. Make a SmartCopy of the vCenter folder where those VM’s reside. Do a few more SmartCopies, then attempt a restore. Test. Add a vmdk in another datastore to one of the VM’s then test again. This is the best way to not only understand what is going on, but to have no surprises or trepidation when you have to do it for real. It is especially important to understand how the other VM’s in the same datastore will behave, and how VM’s with multiple vmdks in different datastores will act, as well as what a “restore by rollback” is. And while you’re at it, make a OneNote or Word document outlining the exact steps for recovery, and what to expect. Create one for local SmartCopies, and another for remote replicas. This keeps you from having to think it all through in the heat of the moment. Your goal is to make things better by a restore, not worse. Oh, and if you can’t find the time to document the process, don’t worry, I’m sure the next guy who replaces you will find the time.
13. Set snapshot and replication retention numbers in ASM/VE. This much needed feature was added to the 3.0 version. Tweak each snapshot reserve to a level that you feel comfortable with, and that also matches up against your overall protection policies. There will be some tuning for each volume so that you can offer the protection times needed, without allocating too much space to snapshot reserves. ASM will only be able to manage the snapshots that it creates, so if you have some older snaps of a particular datastore, you may need to do a little cleanup work.
14. Watch the frequency!!! The only thing worse than not having a backup of a system or data is to have several bad copies of it, and to realize that the last good one just purged itself out. A great example of this is something going wrong on a Friday night. You may not notice it until mid-day on Monday. But your high frequency SmartCopies only had room for two days’ worth of changed data. With ASM/VE, I tend to prefer very modest frequencies. Once a day is fine with me on many of my systems. Most of the others that I like to have more frequent SmartCopies of have the actual data on guest attached volumes. Early on in my use, I had a series of events that were almost disastrous, all because I was overzealous on the frequency, but not mindful enough of the retention. Don’t be a victim of the ease of cranking up the frequency at the expense of retention. This is something you’ll never find in a deployment or operations guide, and applies to all aspects of data protection.
15. If you are creating SmartCopy snapshots and SmartCopy replicas, use your scheduling as an opportunity to shrink the window of vulnerability. Instead of running a replica right after a snapshot, each once a day and right after each other, split the difference so that the replica runs in between the last SmartCopy snapshot and the next one.
16. Keep your SmartCopy and replica frequencies and scheduling as simple as possible. If you can’t understand it, who will? Perhaps start with a frequency rate of just once a day for all of your datastores, then go from there. You might find a frequency such as once a day might work for 99% of your systems. I’ve found that for most of my data that I need to protect at more frequent intervals, those are on guest attached volumes anyway, and I schedule those up via ASM/ME to meet my needs.
17. For SmartCopy snapshots, I tend to schedule them so that there is only one job on one datastore at a time, with the next one scheduled say 5 minutes afterward. For SmartCopy replicas, if you choose to use free pool space instead of replica reserve (as I do), you might want to offset those more, so that the replica has time to fully complete and the space held by the invisible local replica can be reclaimed for the next job. Generally this isn’t too much of an issue, unless you are really tight on space.
18. The SmartCopy defaults have been changed a bit since ASM/VE 2.x. No need to tick any of the checkboxes such as “Perform virtual machine memory dump” and “Set created PS Series snapshots online.” In fact, I would untick the “Included PS Series volumes access by guest iSCSI initiators” option. More info on why below.
19. ASM/VE still gives you the option to snapshot volumes that are attached to that VM via guest iSCSI initiators. In general, don’t do it. Why? If you chose to use this option for Microsoft based VM’s, it would indeed make a snapshot, giving you the impression that all is well, but these would not be coordinated with the internal VSS writer inside the VM, so they are not truly application consistent snapshots of the guest volumes. Sure, they might work, but they might not. They may also interfere with your retention policies in ASM/ME. Do you really want to take that chance with your Exchange or SQL databases, or flat file storage? If you think flat file storage isn’t important to quiesce, remember that source code control systems like Subversion typically use file systems, and not a database. It is my position that the only time you should use this option is if you are protecting a Linux VM with guest attached volumes. Linux has no equivalent to VSS, so you get a free pass on using this option. However, because this option is a per-job definition, you’ll want to separate Windows based VM’s with guest volumes from Linux based VM’s with guest volumes. If you wanted to avoid that, you could just rely on a crash consistent copy of that Linux guest attached volume via a scheduled snapshot in the Group Manager GUI. So the moral of the story is this: to protect your guest attached volumes in VM’s running Windows, rely entirely on ASM/ME to create a SmartCopy SAN snapshot of your guest attached volumes.
20. If you need to cherry-pick a file off of a snapshot, or look at an old registry setting, consider restoring or cloning to another volume, and make sure that the restored VM does not have any direct access to the same network that the primary system is running on. Having a special portgroup in vCenter that is just for this purpose works nicely. Many times this method can be the least disruptive to your environment.
21. I still like to have my DC’s in individual datastores, on their own, and create SmartCopy schedules that do not occur simultaneously. I found that in practice, our very sensitive automated code compiling system which has dozens (if not hundreds) of active ssh sessions ran into less interference this way compared to when I initially had them in one datastore, or intertwined in datastores with other VMs. Depending on the number of DCs you have, you might be able to group a couple together, with perhaps splitting off the DC running the PDC emulator role into a separate datastore. Beware that the SmartCopy for your DC should just be considered as a way to protect the system, not AD. More info on my post about protecting Active Directory here.
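The scheduling advice in tips 14, 15, and 17 is mostly simple arithmetic, so it can help to work it through before committing to a schedule. The sketch below is only an illustration under stated assumptions: retention is modeled as a fixed count of kept SmartCopies (as in tip 13), the dates, datastore names, and function names are made up, and none of this calls into ASM/VE.

```python
from datetime import datetime, timedelta


def guaranteed_lookback(copies_kept: int, interval: timedelta) -> timedelta:
    """How far back you can always reach, right after a new copy is taken.

    Assumes retention is a simple count of kept copies.
    """
    return (copies_kept - 1) * interval


def good_copy_survives(fault: datetime, noticed: datetime,
                       copies_kept: int, interval: timedelta) -> bool:
    """True if a copy older than the fault is certain to exist when noticed."""
    return guaranteed_lookback(copies_kept, interval) >= noticed - fault


def replica_start(snapshot_start: datetime, interval: timedelta) -> datetime:
    """Tip 15: run the replica halfway between two snapshots."""
    return snapshot_start + interval / 2


def staggered_starts(datastores, first_start: datetime, gap: timedelta) -> dict:
    """Tip 17: one job per datastore, each starting `gap` after the previous."""
    return {name: first_start + i * gap for i, name in enumerate(datastores)}


if __name__ == "__main__":
    fault = datetime(2012, 1, 6, 22, 0)    # something breaks on a Friday night
    noticed = datetime(2012, 1, 9, 12, 0)  # spotted mid-day on Monday

    # Hourly copies with 48 kept versus daily copies with 7 kept.
    print(good_copy_survives(fault, noticed, 48, timedelta(hours=1)))
    print(good_copy_survives(fault, noticed, 7, timedelta(days=1)))

    first = datetime(2012, 1, 6, 1, 0)
    print(replica_start(first, timedelta(days=1)))
    print(staggered_starts(["VMFS101", "VMFS102", "VMFS103"], first, timedelta(minutes=5)))
```

The point of the first two functions is the trade-off in tip 14: raising the frequency without raising the number of kept copies shortens how far back you can reach. When retention is really bounded by snapshot reserve space rather than a count, the same reasoning applies, but the number of copies that fit depends on your change rate.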
Tips for DataStore Manager
22. The Datastore Manager in vCenter is one of my favorite new ways to view my PS Group. Not only do you get a quick check on how your datastores look (limiting the view to just VMFS volumes), but it also shows which volumes have replicas in flight. It has quickly become one of my most used items in vCenter.
23. Use the ACL policies feature in Datastore Manager. With the new integration between vCenter and the Group Manager, you can easily create volumes. The ACL policies feature in HIT/VE is a way for you to save a predetermined set of ACL’s for your hosts (CHAP, IP, or IQN). While I personally prefer using IQN’s, any combination of the three will work. Having an ACL policy is a great way to provision the access to a volume quickly. If you are using manually configured multi-pathing, it is important to note that datastores created this way will use a default pathing of “VMWare fixed.” You will need to manually change that to “VMWare Round Robin.” I am told that if you are using the EqualLogic Multi-pathing Extension Module (MEM), this will be set to the proper setting. I don’t know that for sure because MEM hasn’t been released for vSphere 5.0 as of this writing.
24. VMFS5 offers some really great features, but many of them are only available if the datastores were natively created (not upgraded from VMFS3). If you choose to recreate them by doing a little juggling with Storage vMotion (as I am), remember that this might wreak havoc on your replication, as you will need to re-seed the volumes. But if you can, you gain access to many great features of VMFS5. You might also use this as an opportunity to revisit your datastores and re-arrange if necessary.
25. If you are going to redo some of your volumes from scratch (to take full advantage of VMFS5), if they are replicated, redo the volumes with the highest change rate first. They’re already pushing a lot of data through your pipe, so you might as well get them taken care of first. And who knows, your replicas might be improved with the new volume.
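For tip 25, the ordering itself is easy to script once you have change figures per volume (from SANHQ or your own notes). This is a rough sketch, assuming you track the daily changed data per volume in GB. The volume names, the numbers, and the link speed are invented for illustration, and the re-seed estimate ignores protocol overhead and any competing traffic.

```python
from typing import NamedTuple


class Volume(NamedTuple):
    name: str
    size_gb: float          # data that has to be re-seeded
    daily_change_gb: float  # average changed data per day (assumed known)


def rebuild_order(volumes):
    """Highest absolute change rate first, as suggested in tip 25."""
    return sorted(volumes, key=lambda v: v.daily_change_gb, reverse=True)


def reseed_hours(size_gb: float, link_mbps: float) -> float:
    """Rough time to push a full copy over the replication link.

    Uses decimal units (1 GB = 8000 megabits) and assumes the whole
    link is available, which it rarely is.
    """
    megabits = size_gb * 8 * 1000
    return megabits / link_mbps / 3600


if __name__ == "__main__":
    # Made-up figures for three datastores.
    volumes = [
        Volume("VMFS101", 500, 4),
        Volume("VMFS102", 750, 30),
        Volume("VMFS103", 250, 12),
    ]
    for v in rebuild_order(volumes):
        print(f"{v.name}: about {reseed_hours(v.size_gb, 100):.1f} h to re-seed at 100 Mbps")
```

Sorting by absolute change rather than by percentage matches the reasoning in the tip: the volumes already pushing the most data through the pipe go first.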
Hopefully this gives you a few good ideas for your own environment. Thanks for reading.
15 thoughts on “Tips for using Dell’s updated EqualLogic Host Integration Tools – VMware Edition (HIT/VE)”
Great post like always! Hopefully our new ESX servers will get purchased soon so we can get up on vSphere 5.0!
Thanks Michael. I’m finally not feeling so far behind the times now.
This is a great series of articles. What I am curious about is an application-consistent replication and how it is accomplished here. Perhaps I missed something, but I would think you’d need your guest volumes to replicate using ASM/ME on the same schedule as your O/S volumes, just in case there is some interactivity between the two that requires that sort of synchronization. Can you elaborate on that?
Really good question. Your concern about timing of protection (whether it be a snap, or a replica) of the underlying OS, versus the timing of protection for the guest attached volumes may be a concern, but it is not as bad as you think. It’s really about how the servers are set up. For SQL servers, it is important to have all of your databases (and transaction logs) on the guest attached volumes, otherwise you can’t protect them in the same way, or at the same time (including master, model, etc.). Same thing goes for Exchange servers. But here are some examples that might require different approaches. Note that I’ve experienced all three of these examples in the past.
Example #1: Some DLL on a VM running SQL server became corrupt, and the system was caught in an endless loop of rebooting.
Action: The guest attached volumes that contain the databases and transaction logs are untouched by such an issue. Pull up an older SmartCopy via ASM/VE, and it will be fine.
Example #2: Some work against the SQL databases didn’t go well, and it’s throwing errors as soon as you mount the database
Action: Don’t do anything with ASM/VE or the OS volume, and recover the databases using ASM/ME on the guest attached volumes.
Example #3: You plan to do a big Service Pack upgrade of an Exchange Server
Action: This is a great example of something that would touch both the data, and the OS. In this particular case, I would suggest shutting the VM down, then making a manual snapshot of the OS volume, and a manual snapshot of the guest attached volumes. That way, in the event of a recovery, the snaps made while it was off are fully coordinated, so to speak.
For me, this has made it possible to increase the SmartCopy frequency on my guest attached volumes, improving my application data protection, while keeping my application services (the OS) at a modest frequency. This approach is actually quite similar to traditional backup tools with agents. If you ever had say, BackupExec back up a SQL database, it simply used VSS to quiesce the application, then protected it from there.
If you are uncertain, I really urge you to practice through a couple of these steps. It will give you the confidence to know that the procedures will hold up to real world scenarios when they arise. I actually would love it if Dell/EqualLogic produced some sort of interference checker, so that one could identify concerns like yours.
Thanks for reading!
Great information! You mention taking a snapshot of Exchange and SQL. Is this possible with HIT /VE or do you need the HIT /ME ?
Thanks for reading! As for your question, it depends if your Exchange or SQL DB and Transaction Log files live on guest attached volumes or not. If they do not (e.g. they are just vmdks), then HIT/ME is of no use, and you can rely on HIT/VE (Its ability to protect is much more limited however). If they are on guest attached volumes, yes, you should really use HIT/ME exclusively to protect those guest attached volumes. Tip #19 addresses this issue specifically. I also touch on it at: https://itforme.wordpress.com/2010/05/13/replication-with-an-equallogic-san-part-2/
Let me know if you have any other questions.
My application files (Exchange, SQL) are in vmdk files so I would need to move them out. Is the snapshot worthless then of these application servers? I know you mentioned that individual files can be recovered but what would happen if I try to power on the smart copy that was taken by HIT /VE. Thanks again for the information.
They aren’t necessarily worthless, but relying on hypervisor aware snapshots introduces a level of uncertainty on what matters the most: the data. There are too many variables here (OS version, transaction type, etc.) to give a definitive answer on whether something is protected or not. The real challenge of course is being able to protect the data in a state that is recoverable. Microsoft’s Volume Shadow Copy Services (VSS) framework does a great job at facilitating communication between applications, the OS, and storage subsystems. But there is a bit of a sliding scale as to what is really VSS aware. The best way to think of this is that if it is initiated inside the guest, then it is more likely to be using all of the components of VSS to coordinate a state for protection.
A lot of commercial backup solutions built for virtualized environments rely on the ability for vCenter to see all of the volumes, so products like Veeam require that those databases and logs live as vmdks. They are very good applications, but have additional tools to help the coordination and the integrity checking of the data. If you use a product like Veeam, then it may be best that you stick with the arrangement that you have. If you do not, you may want to consider these other ways to protect your data.
Here are a few resources that you might find helpful.
http://www.veeam.com/blog/the-great-vss-debate.html (older post, but still informative)
http://en.community.dell.com/techcenter/storage/w/wiki/data-drives-in-vmware.aspx (A great post by Will Urban that is applicable to all EqualLogic users)
Sorry I think I inadvertently posted this comment elsewhere under the wrong entry!
You’ve got a lot of great info here and on the Dell Community site. Thanks for all your work! I’m hoping you can answer a question we’ve had a really hard time finding information on. In several of your posts about EqualLogic snapshots and replication you mention using data volumes attached via the guest software iSCSI initiator. We have used this strategy for a long time, but it’s always been very worrisome because when you upgrade VMWare Tools you replace the VMNIC driver, which is tantamount to disconnecting a SCSI cable while the machine is on. The logs reflect this with warnings about possible corruption of the MFT, delayed write failures, SQL errors, etc.
Is there some obvious preventative step that everyone else takes when upgrading VMWare Tools that we have overlooked? Thanks for any light you can shed on this.
Thanks for the kind words. As for your concern and question, it is a very good one, and a topic that I don’t think I’ve ever expressed in any of my posts (thanks for the idea!) Yes, during any VMware Tools update or reinstallation, you might see some interference of I/O on any of your vNICs (not just the ones dedicated for guest volumes). I definitely take a pretty conservative approach in VMware Tools installations/updates, and Virtual Hardware updates (the latter I feel is the greater risk). I don’t update either one unless I have to. And when I do, I know that there will be no data corruption with respect to this step if the guest volume isn’t attached in the first place. So for Windows based systems running HIT/ME, I’ve temporarily stopped relevant services (e.g. SQL or Exchange related services, etc.), and detached the guest volumes, and for Linux systems running HIT/LE, I’ve done a similar step. It’s a pretty quick step actually. It works in smaller environments, but I will admit that if updates were being pushed out to a large environment via VUM, then one might want to reconsider the approach. In reality, if the VMware Tools updates are done during quieter times, one probably wouldn’t see any issues, but that does not negate the concern, or the preventative steps that one should take to reduce potential issues.
Hopefully HIT ASM 4.0 gets a built-in FLR (great for file server VM’s) , a RUN NOW option (using existing Schedule), email notification and a bit more options on setting up a backup window scheduling, EX. run between 6pm to 7am every so often…etc…
Oh and allow an “ignore incompatible datastores” option on scheduled tasks, not just on the one-time “Create Smart Copy” run.
Forgot…a must fix is instant logon to ASM via vCenter, it times out, true plug-ins should single sign-on every time.
Great points Ry. A file level restore would be nice. I’d also like to see better handling of restoring single VMs from a datastore containing multiple VMs. And finally, one of the biggest annoyances has been the timing out of the plug-in. From what I understand, they are definitely aware of this issue.
Do you know of a way to limit # of concurrent snapshots on vCenter? ASM/VE runs job and tries to snap all VM’s in folder at same time. Creates havoc for HA. I’d like to set MAX=4 as to queue other VM snaps so to speak. We don’t need them to run at same time, we want stable system, EqualLogic needs to add a throttle setting. It is trying to do too much. We can move VM’s into folders and create separate jobs but that gets messy.
Have you checked with Dell EqualLogic on the matter? I have not had any problem with respect to HA, but definitely have observed the backlog of snaps during a job.
Sorry for delay, there is a vcenter setting that we tried and seems to work, TaskMax=4.
For the cloning, if you are restoring smart copy to a clone, for just 1 VM lets say, it requires space of entire volume, does not make sense to me, tech support said maybe fixed by 4.5, it prompts same thing on restore a smart copy, but it at least offers an alternative restore option, whereas clone does not and just fails saying not enough space. We use four 2TB volumes, with replication and snapshot space usage, who has free exact TB’s for each volume? Not us.