January 28, 2011
One of the benefits of investing in Dell/EqualLogic’s SAN solutions is the number of great tools included with the product at no extra charge. I’ve written in the past about leveraging their AutoSnapshot Manager for VM and application consistent snapshots and replicas. Another tool that deserves a few words is SAN HeadQuarters (SANHQ).
SANHQ allows for real-time and historical analysis of your EqualLogic arrays. Many EqualLogic users are well versed in this tool, and may not find anything here that they didn’t already know. But I’m surprised to hear that many are not. So, what better way to help those unfamiliar with SANHQ than to describe how it helps me in my environment?
While the tool itself is “optional” in the sense that you don’t need to deploy it to use the EqualLogic arrays, it is an easy (and free) way to expose the powers of your storage infrastructure. If you want to see what your storage infrastructure is doing, do yourself a favor and run SANHQ.
Starting up the application, you might find something like this:
You’ll find an interesting assortment of graphs and charts that help you decipher what is going on with your storage. Take a few minutes and do a little digging. There are various ways that it can help you do your job better.
Sometimes good monitoring is downright annoying. It’s like the alarm clock next to your bed; it’s difficult to overlook, but that’s the point. SANHQ has proven to be an effective tool for proactive monitoring and alerting of my arrays. While some of these warnings are never fun, its biggest value is that it can help prevent those larger, much more serious problems, which always seem to be a series of small issues thrown together. Here are some examples of how it has acted as the canary in the coal mine for me in my environment.
- When I had a high number of TCP retransmits after changing out my SAN Switchgear, it was SANHQ that told me something was wrong. EqualLogic Support helped me determine that my new switchgear wasn’t handling jumbo frames correctly.
- When I had a network port go down on the SAN, it was SANHQ that alerted me via email. A replacement network cable fixed the problem, and the alarm went away.
- If replication across groups is unable to occur, you’ll get notified right away that replication isn’t running. The reasons for this can be many, but SANHQ usually gives you the first sign that something is up. This works across physical topologies where your target may be at another site.
- Under maintenance scenarios, you might find the need to pause replication on a volume, or on the entire group. SANHQ will do a nice job of reminding you, at a regular interval, that replication is still not running.
Analysis and Planning
SANHQ will allow you to see performance data at the group level, by storage pools, volumes, or volume collections. One of the first things I do when spinning up a VM that uses guest attached volumes is to jump into SANHQ and see how those guest attached volumes are running. How are the average IOPS? What about latencies and queue depth? All of those can be found easily in SANHQ, and can help put your mind at ease if you are concerned about your new virtualized Exchange or SQL servers. Here is a screenshot of a 7 day history for a SQL server with guest attached volumes, driving our SharePoint backend services.
The same can be done of course for VMFS volumes. This information will complement existing data one gathers from vCenter to understand if there are performance issues with a particular VMFS volume.
Oftentimes monitoring and analysis aren’t about absolute numbers, but rather about allowing the user to see changes relative to previous conditions. This is especially important for the IT generalist who doesn’t have the time or the know-how for deep dive storage analysis, or a dedicated Storage Administrator to analyze the data. This is where the tool really shines. For whatever type of data you are looking at, you can easily choose a timeline by the last hour, 8 hours, 1 day, 7 days, 30 days, etc. The anomalies, if there are any, will stand out.
Simply click on the timeline that you want, and the historical data of the group, member, volume, etc. will show up below.
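If you want to put a number on "changes relative to previous conditions", here is a rough Python sketch of the idea. It is not something SANHQ does for you; the function name, the window size, the threshold, and the sample IOPS values are all made up for illustration, and I'm assuming you have already copied the samples out of one of the exported reports by hand.

```python
from statistics import mean, stdev

def flag_anomalies(samples, window=6, threshold=3.0):
    """Return indices of samples that deviate sharply from the preceding window.

    Each sample is compared against the mean and standard deviation of the
    `window` samples before it, so the test is relative, not absolute.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical hourly average IOPS for one volume (not real SANHQ output).
iops = [410, 395, 402, 420, 398, 405, 412, 1250, 408, 399]
print(flag_anomalies(iops))  # prints [7]
```

The absolute value of 1250 IOPS means little on its own; it stands out only because the previous six hours hovered around 400. That is the same judgment your eye makes when you look at the timeline graphs.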
I find analyzing individual volumes (when they are VMFS volumes) and volume collections (when they are guest attached volumes) the most helpful in making sure that there are not any hotspots in I/O. It can help you determine if a VM might be better served in a VMFS volume that hasn’t been demanding as much I/O as the one it’s currently in.
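As a minimal sketch of that hotspot check, assuming you have jotted down a few average IOPS readings per volume from SANHQ, the comparison could look like this in Python. The volume names and numbers are hypothetical, and the helper is my own illustration rather than anything built into the tool.

```python
from statistics import mean

def rank_volumes_by_load(iops_by_volume):
    """Return (volume, average IOPS) pairs sorted busiest first."""
    averages = {name: mean(samples) for name, samples in iops_by_volume.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical IOPS readings per VMFS volume (not real SANHQ output).
iops_by_volume = {
    "vmfs-01": [900, 1100, 1000],
    "vmfs-02": [200, 250, 150],
    "vmfs-03": [500, 450, 550],
}

ranking = rank_volumes_by_load(iops_by_volume)
busiest, quietest = ranking[0], ranking[-1]
print(f"Busiest: {busiest[0]}, quietest: {quietest[0]}")
```

A demanding VM sitting on the busiest volume would be the first candidate to move to the quietest one, though average IOPS alone won't tell you about latency or capacity, so treat the ranking as a starting point.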
It can also play a role in future procurement. Those 15k SAS drives may sound like a neat idea, but does your environment really need that when you decide to add storage? Thinking about VDI? It can be used to help determine I/O requirements. Recently, I was on the phone with a friend of mine, Tim Antonowicz. Tim is a Senior Solutions Architect from Mosaic Technology who has done a number of successful VDI deployments (and who recently started a new blog). We were discussing the possibility of VDI in my environment, and one of the first things he asked of me was to pull various reports from SANHQ so that he could understand our existing I/O patterns. It wasn’t until then that I noticed all of the great storage analysis offerings that any geek would love. There are a number of canned reports that can be saved out as a PDF, HTML, CSV, or other format to your liking.
The value of SANHQ went way up for me when I started replication. It will give you summaries of each volume replicated.
If you click on an individual volume, it will help you see transfer sizes and replication times of the most recent replicas. It also separates inbound replica data from outbound replica data.
While the times and the transfer rates will be skewed somewhat if you have multiple replicas running (as I do), it is a great example of how you can understand patterns in changed data on a specific volume. The volume captured above represents where one of my Domain Controllers lives. As you can see, it’s pretty consistent, and doesn’t change much, as one would expect (probably not much more than the swap file inside the VM, but that’s another story). Other kinds of data replicated will fluctuate more widely. This is your way to see it.
SANHQ will live happily on a standalone VM. It doesn’t require much, but does need direct access to your SAN, and uses SNMP. Once installed, SANHQ can be run directly on that VM, or the client-only application can be installed on your workstation for a little more convenience. If you are replicating data, you will want SANHQ to connect to both the source site and the target site for the most effective use of the tool.
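One way to describe "consistent" versus "fluctuates widely" with a number is the coefficient of variation of the transfer sizes (standard deviation divided by the mean). The Python sketch below is only an illustration: the transfer sizes are invented, not taken from my arrays, and the helper name is my own.

```python
from statistics import mean, pstdev

def change_profile(transfer_mb):
    """Return (average transfer size, coefficient of variation) for a volume."""
    avg = mean(transfer_mb)
    # A coefficient near 0 means every replica moves about the same amount.
    cv = pstdev(transfer_mb) / avg if avg else 0.0
    return avg, cv

# Hypothetical per-replica transfer sizes in MB (not real SANHQ output).
domain_controller = [120, 118, 125, 122, 119]
file_server = [300, 2400, 150, 5200, 900]

for name, sizes in [("domain controller", domain_controller),
                    ("file server", file_server)]:
    avg, cv = change_profile(sizes)
    print(f"{name}: average {avg:.0f} MB, variation {cv:.2f}")
```

A steady volume like the domain controller comes out close to zero, while a volume whose changed data swings from replica to replica comes out far higher, which is useful to know when sizing replica reserves.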
Improvements? Sure, there are a number of things that I’d love to see. Setting alarms for performance thresholds. Threshold templates that you could apply to a volume (VMFS or native) that would help you understand the numbers (green = good, red = bad). The ability to schedule reports, and define how and where they are posted. Free pool space activity warnings (important if you choose to keep replica reserves low and leverage free pool space). Array diagnostics dumps directly from SANHQ. Programmatic access for scripting. Improvements like these could make a useful product become indispensable in a production environment.
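To make the threshold template idea concrete, here is a small Python sketch of what I have in mind. To be clear, this is a wish, not a SANHQ feature: the template format, the metric names, and the limits are all hypothetical and would need tuning for a real environment.

```python
def classify(metrics, template):
    """Return 'green' if every metric is within its template limit, else 'red'."""
    for name, limit in template.items():
        # A KeyError here means the template names a metric we did not collect.
        if metrics[name] > limit:
            return "red"
    return "green"

# Hypothetical template for a VMFS volume; the limits are illustrative only.
vmfs_template = {"latency_ms": 20.0, "queue_depth": 32.0}

healthy = {"latency_ms": 8.5, "queue_depth": 4.0}
struggling = {"latency_ms": 45.0, "queue_depth": 4.0}

print(classify(healthy, vmfs_template))     # prints green
print(classify(struggling, vmfs_template))  # prints red
```

The same template could be applied to every volume of a given type, so the IT generalist sees a color instead of having to remember what a reasonable latency looks like.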