Upgrading to vSphere 5.0 by starting from scratch. …well, sort of.

It is never any fun getting left behind in IT.  Major upgrades every year or two might not be a big deal if you only had to deal with one piece of software, but take a look at most software inventories, and you’ll see possibly dozens of enterprise-level applications and supporting services that all contribute to the chaos.  It can be overwhelming for just one person to handle.  While you may be perfectly justified in holding off on specific upgrades, there still seems to be a bit of guilt around doing so.  You might have ample business and technical factors to support such decisions, and a well-crafted message providing clear reasons to stakeholders.  The business and political pressures ultimately win out, and you find yourself addressing the customer- and user-facing application upgrades before the behind-the-scenes tools that power it all.

That is pretty much where I stood with my virtualized infrastructure.  My last major upgrade was to vSphere 4.0.  Sure, I had visions of keeping up with every update and patch, but a little time passed, and several hundred distractions later, I found myself left behind.  When vSphere 4.1 came out, I had every intention of upgrading.  However, I was one of the legions of users with a vCenter server running on a 32-bit OS, and that complicated matters a little bit.  I looked at the various publications and posts on the upgrade paths and experiences.  Nothing seemed quite as easy as I was hoping for, so I did what came easiest to my already packed schedule: nothing.  I wondered just how many administrators found themselves in the same predicament: not touching an aging, albeit perfectly fine running system.

My ESX 4.0 cluster served my organization well, but times change, and so do needs.  A few things came up to kick-start the desire to upgrade.

  • I needed to deploy a pilot VDI project, fast.  (more about this in later posts)
  • We were a victim of our own success with virtualization, and I needed to squeeze even more power and efficiency out of our investment in our infrastructure.

Both are pretty good reasons to upgrade, and while I would have loved to do my typical due diligence on every possible option, I needed a fast track.  My move to vSphere 5.0 was really just a prerequisite of sorts to my work with VDI. 

But how should I go about an upgrade?

Do I update my 4.0 hosts to the latest update that would be eligible for an upgrade path to 5.0, and if so, how much work would that be?  Should I transition to a new vCenter server, migrating the database, then run a mixed environment of ESX hosts running with different versions?  What sort of problems would that introduce?  After conferring with a trusted colleague of mine who always seems to have pragmatic sensibilities when it comes to virtualization, I decided which option was going to be the best for me.  I opted not to do any upgrade, and simply transition to a pristine new cluster.  It looked something like this:

  • Take a host (either new, or by removing an existing one from the cluster), and build it up with ESXi 5.0.
  • Build up a new 64bit VM for running a brand new vCenter, and configure as needed.
  • Remove one VM at a time from the old cluster by powering it down, removing it from inventory, and adding it to the new cluster.
  • Once enough VMs have been removed, take another host, remove it from the old cluster, rebuild it as ESXi 5.0, and add it to the new cluster.
  • Repeat until finished.

For me, the decision to start from scratch won out.  Why?

  • I could build up a pristine vCenter server, with a database that wasn’t going to carry over any unwanted artifacts of my previous installation.
  • I could easily set up the new vCenter to emulate my old settings.  Folders, EVC settings, resource pools, etc.
  • I could transition or build up my supporting VM’s or appliances to my new infrastructure to make sure they worked before committing to the transition.
  • I could afford a simple restart of each VM as I transitioned it to a new cluster.  I used this as an opportunity to update the VMware Tools when added to the new inventory.
  • I was willing to give up historical data in my old vSphere 4.0 cluster for the sake of simplicity of the plan and cleanliness of the configuration.
  • Predictability.  I didn’t have to read a single white paper or discussion thread on database migrations or troubles with DSNs.
  • I have a well documented ESX host configuration that is not terribly complex, and easy to recreate across 6 hosts.
  • I just happened to have purchased an additional blade and license of ESX, so it was an ideal time to introduce it to my environment.
  • I could get my entire setup working, then get my licensing figured out after it’s all complete.

You’ll notice that one option similar to this approach would have been to simply move a host full of running VMs out of the existing cluster and add it to the new cluster.  That may have been just as good a plan, as it would have avoided the need to manually shut down and remove each VM one at a time during the transition.  However, it would have meant running a mix of ESX 4.0 and ESXi 5.0 hosts in the new cluster, and I didn’t want to carry anything over from the old setup.  I would have needed to upgrade or rebuild the host anyway, and I had to restart each VM to make sure it was running the latest tools.  If for nothing other than clarity of mind, my approach seemed best for me.

Prior to beginning the transition, I needed to update my Dell EqualLogic firmware to 5.1.2.  A collection of very nice improvements made this a pleasant upgrade, but it was also a requirement for what I wanted to do.  While the upgrade itself went smoothly, it did re-introduce an issue or two.  The folks at Dell EqualLogic are aware of this, and are working to address it, hopefully in their next release.  The combination of the firmware upgrade and vSphere 5 allowed me to use the latest and greatest tools from EqualLogic, primarily the Host Integration Tools VMware Edition (HIT/VE) and the storage integration in vSphere thanks to VASA.  As of this writing, however, EqualLogic does not have a full production release of their Multipathing Extension Module (MEM) for vSphere 5.0.  The EPA version was just released, but I’ll probably wait for the full release of MEM to come out before I apply it to the hosts in the cluster.

While I was eager to finish the transition, I didn’t want to prematurely create any problems.  I took a page from my own lessons learned during my upgrade to ESX 4.0, and exercised some restraint when it came to updating the Virtual Hardware of each VM to version 8.  My last update of Virtual Hardware levels caused some unexpected results, as I shared in “Side effects of upgrading VM’s to Virtual Hardware 7 in vSphere.”  Apparently, I wasn’t the only one who ran into issues, because that post has statistically been my all-time most popular post.  The abilities of Virtual Hardware 8 powered VMs are pretty neat, but I’m in no rush to make any virtual hardware changes to some of my key production systems, especially those noted.

So, how did it work out?  The actual process completed without a single major hang-up, and I am thrilled with the result.  The irony here is that even though vSphere provides most of the intelligence behind my entire infrastructure, and does things that are mind-bogglingly cool, it was so much easier to upgrade than, say, SharePoint, AD, Exchange, or some other enterprise software.  Great technologies are great because they work like you think they should.  No exception here.  If you are considering a move to vSphere 5.0, and are a little behind on your old infrastructure, this upgrade approach might be worth considering.

Now, onto that little VDI project…

Resources

A great resource on setting up SQL 2008 R2 for vCenter
How to Install Microsoft SQL Server 2008 R2 for VMware vCenter 5

Installing vCenter 5 Best Practices
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2003790

A little VMFS 5.0 info
http://www.yellow-bricks.com/2011/07/13/vsphere-5-0-what-has-changed-for-vmfs/

Information on the EqualLogic Multipathing Extension Module (MEM), and if you are an EqualLogic customer, why you should care.
https://whiteboardninja.wordpress.com/2011/02/01/equallogic-mem-and-vstorage-apis/

Using the Dell EqualLogic HIT for Linux

 

I’ve been a big fan of Dell EqualLogic Host Integration Tools for Microsoft (HIT/ME), so I was looking forward to seeing how the newly released HIT for Linux (HIT/LE) was going to pan out.  The HIT/ME and HIT/LE offer unique features when using guest attached volumes in your VM’s.  What’s the big deal about guest attached volumes?  Well, here is why I like them.

  • It keeps the footprint of the VM really small.  The VM can easily fit in your standard VMFS volumes.
  • Portable/replaceable.  Oftentimes, systems serving up large volumes of unstructured data are hard to update.  Having the data as guest attached means that you can easily prepare a new VM presenting the data (via NFS, Samba, etc.), and cut it over without anyone knowing – especially when you are using DNS aliasing.
  • Easy and fast data recovery.  My “in the trenches” experience with the guest attached volumes in VM’s running Microsoft OS’s (and EqualLogic’s HIT/ME) have proven that recovering data off of guest attached volumes is just easier – whether you recover it from snapshot or replica, clone it for analysis, etc. 
  • Better visibility of performance. Thanks to the independent volume(s), one can easily see with SANHQ what the requirements of that data volume is. 
  • More flexible protection.  With guest attached volumes, it’s easy to crank up the frequency of snapshot and replica protection on just the data, without interfering with the VM that is serving up the data.
  • Efficient, tunable MPIO. 
  • Better utilization of space.  If you wanted to serve up a 2TB volume of storage using a VMDK, more than likely you’d have a 2TB VMFS volume, and something like a 1.6TB VMDK file to accommodate hypervisor snapshots.  With a native volume, you would be able to use the entire 2TB of space. 

The one “gotcha” about guest attached volumes is that they aren’t visible to the vCenter API, so commercial backup applications that rely on the visibility of these volumes via vCenter won’t be able to back them up.  If you use these commercial applications for protection, you may want to determine if guest attached volumes are a good fit, and if so, find alternate ways of protecting the volumes.  Others might contend that because the volumes aren’t seen by vCenter, one is making things more complex, not less.  I understand the reason for thinking this way, but my experience with them has proven quite the contrary.

Motive
I wasn’t trying out the HIT/LE because I ran out of things to do.  I needed it to solve a problem.  I had to serve up a large amount (several terabytes) of flat file storage for our Software Development Team.  In fact, this was just the first of several large pools of storage that I need to serve up.  It would have been simple enough to deploy a typical VM with a second large VMDK, but managing such an arrangement would be more difficult.  If you are ever contemplating deployment decisions, remember that simplicity and flexibility of management should trump simplicity of deployment if it’s a close call.  Guest attached volumes align well with the “design as if you know it’s going to change” concept.  I knew from my experience working with guest attached volumes on Windows VMs that they were very agile, and offered a tremendous amount of flexibility.

But wait… you might be asking, “If I’m doing nothing but presenting large amounts of raw storage, why not skip all of this and use Dell’s new EqualLogic FS7500 Multi-Protocol NAS solution?”  Great question!  I had the opportunity to see the FS7500 NAS head unit at this year’s Dell Storage Forum.  The FS7500 turns the EqualLogic block based storage accessible only on your SAN network into CIFS/NFS storage presentable to your LAN.  It is impressive.  It is also expensive.  Right now, using VM’s to present storage data is the solution that fits within my budget.  There are some downfalls (Samba not supporting SMB2), but for the most part, it falls in the “good enough” category.

I had visions of this post focusing on the performance tweaks and the unique abilities of the HIT/LE.  After implementing it, I was reminded that it was indeed a 1.0 product.  There were enough gaps in deployment information that I felt it necessary to provide information on exactly how I made the HIT for Linux work.  IT generalists, who I suspect make up a significant portion of the Dell EqualLogic customer base, have learned to appreciate their philosophy of “if you can’t make it easy, don’t add the feature.”  Not everything can be made intuitive, however, especially the first time around.

Deployment Assumptions 
The scenario and instructions are for a single VM that will be used to serve up a single large volume for storage. It could serve up many guest attached volumes, but for the sake of simplicity, we’ll just be connecting to a single volume.

  • VM with 3 total vNICs.  One used for LAN traffic, and the other two used exclusively for SAN traffic.  The vNICs for the SAN will be assigned to the proper vSwitch and port group, and will have static IP addresses.  The VM name in this example is “testvm”.
  • A single data volume in your EqualLogic PS group, with an ACL that allows for the guest VM to connect to the volume using CHAP, IQN, or IP addresses.  (It may be easiest to first restrict it by IP address, as you won’t be able to determine your IQN until the HIT is installed).  The native volume name in this example is “nfs001” and the group IP address is 10.1.0.10
  • Guest attached volume will be automatically connected at boot, and will be accessible via NFS export.  In this example I will be configuring the system so that the volume is available via the “/data1” directory.
  • OS used will be RedHat Enterprise Linux (RHEL) 5.5. 
  • EqualLogic’s HIT 1.0

Each step below that starts with the word “VERIFICATION” is not a necessary step, but it helps you understand the process, and will validate your findings.  For brevity, I’ve omitted some of the output of these commands.

Deploying and configuring the HIT for Linux
Here we go…

Prepping for Installation

1.     Verify installation of EqualLogic prerequisites (via rpm -q [pkgname]).  If not installed, run yum install [pkgname]

openssl                    (0.9.8e for RHEL 5.5)
libpcap                    (0.9.4 for RHEL 5.5)
iscsi-initiator-utils      (6.2.0.871 for RHEL 5.5)
device-mapper-multipath    (0.4.7 for RHEL 5.5)
python                     (2.4 for RHEL 5.5)
dkms                       (1.9.5 for RHEL 5.5)

(dkms is not part of the RedHat repo.  Download it from http://linux.dell.com/dkms/ or via the “Extra Packages for Enterprise Linux” EPEL repository.  I chose the Dell website location because it was a newer version.  Simply download and execute the RPM.)
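Step 1 lends itself to a small script.  Here is a minimal sketch (assuming a yum-based RHEL 5.5 system with repositories reachable; `ensure_prereqs` is my own helper name, not part of any Dell tooling):

```shell
# Check each prerequisite with rpm and install anything missing via yum.
# Package names are taken from the list above; dkms may require the Dell
# download or the EPEL repository to be configured first.
PKGS="openssl libpcap iscsi-initiator-utils device-mapper-multipath python dkms"

ensure_prereqs() {
    for p in $PKGS; do
        if rpm -q "$p" >/dev/null 2>&1; then
            echo "$p already installed"
        else
            yum -y install "$p"
        fi
    done
}
# ensure_prereqs
```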

 

2.     Snapshot Linux machine so that if things go terribly wrong, it can be reversed

 

3.     Shut down the VM, and add the NICs for guest access

Make sure to choose the iSCSI network when adding them to the VM configuration

After startup, manually specify static IP addresses and subnet masks for both.  (No default gateway!)

Activate the NICs, and reboot

 

4.     Power up, then add the following lines to /etc/sysctl.conf  (for RHEL 5.5)

net.ipv4.conf.all.arp_ignore = 1

net.ipv4.conf.all.arp_announce = 2
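These ARP settings can also be appended with a small helper and then loaded immediately with sysctl, rather than waiting for a reboot.  A sketch (the helper name and file parameter are mine, so it can be tried against a scratch copy before touching /etc/sysctl.conf):

```shell
# Append the two ARP settings from step 4 to a sysctl configuration file.
add_arp_settings() {
    printf '%s\n' \
        'net.ipv4.conf.all.arp_ignore = 1' \
        'net.ipv4.conf.all.arp_announce = 2' >> "$1"
}
# add_arp_settings /etc/sysctl.conf && sysctl -p
```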

 

5.     Establish NFS and related daemons to automatically boot

chkconfig portmap on

chkconfig nfs on

chkconfig nfslock on

 

6.     Create the directory which will ultimately be exported for mounting.  In this example, the iSCSI device will mount to a directory called “eql2tbnfs001” in the /mnt directory.

mkdir /mnt/eql2tbnfs001

 

7.     Make a symbolic link called “data1” in the root of the file system.

ln -s /mnt/eql2tbnfs001 /data1 

 

Installation and configuration of the HIT

8.     Verify that the latest HIT Kit for Linux is being used for installation.  (V1.0.0 as of 9/2011)

 

9.     Import the public key

      Download the public key from the EqualLogic support site under HIT for Linux, and place it in /tmp/

Add the key:

rpm --import RPM-GPG-KEY-DELLEQL (docs show lower case, but the file name is upper case)

 

10.  Run installation

yum localinstall equallogic-host-tools-1.0.0-1.el5.x86_64.rpm

 

Note:  After the HIT is installed, you may get the IQN (for use in restricting volume access in the EqualLogic group manager) by typing the following:

cat /etc/iscsi/initiatorname.iscsi

 

11.  Run eqltune (verbose).  (Tip: you may want to capture the results to a file for future reference and analysis.)

            eqltune -v

 

12.  Make adjustments based on eqltune results.  (Items listed below were mine.  Yours may be different)

 

            NIC Settings

   Flow Control. 

ethtool -A eth1 autoneg off rx on tx on

ethtool -A eth2 autoneg off rx on tx on

 

(add the above lines to /etc/rc.d/rc.local to make persistent)

 

There may be a suggestion to use jumbo frames by increasing the MTU size from 1500 to 9000.  This has been omitted from the instructions, as it requires proper configuration of jumbos from end to end.  If you are uncertain, keep standard frames for the initial deployment.

 

   iSCSI Settings

   (make a backup of /etc/iscsi/iscsid.conf before changes)

 

      Change node.startup to manual.

   node.startup = manual

 

      Change FastAbort to the following:

   node.session.iscsi.FastAbort = No

 

      Change initial_login_retry to the following:

   node.session.initial_login_retry_max = 12

 

      Change number of queued iSCSI commands per session

   node.session.cmds_max = 1024

 

      Change device queue depth

   node.session.queue_depth = 128
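The iscsid.conf changes above lend themselves to scripting.  Here is a sketch that applies the eqltune-recommended values with sed (`apply_iscsi_tuning` is my own helper, not part of the HIT; it only rewrites uncommented lines, so verify the result against your file):

```shell
# Apply the recommended iSCSI settings to the given file, keeping a .bak
# backup of the original first.
apply_iscsi_tuning() {
    cp "$1" "$1.bak"
    sed -i \
        -e 's/^node\.startup = .*/node.startup = manual/' \
        -e 's/^node\.session\.iscsi\.FastAbort = .*/node.session.iscsi.FastAbort = No/' \
        -e 's/^node\.session\.initial_login_retry_max = .*/node.session.initial_login_retry_max = 12/' \
        -e 's/^node\.session\.cmds_max = .*/node.session.cmds_max = 1024/' \
        -e 's/^node\.session\.queue_depth = .*/node.session.queue_depth = 128/' \
        "$1"
}
# apply_iscsi_tuning /etc/iscsi/iscsid.conf
```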

 

13.  Re-run eqltune -v to see if the changes took effect

All changes took effect, minus the NIC settings added to the rc.local file.  That looked to be a syntax error in the EqualLogic documentation provided.  It has been corrected in the instructions above.

 

14.  Run command to view and modify MPIO settings

rswcli -mpio-parameters

 

This returns the results of:  (seems to be good for now)

Processing mpio-parameters command…

MPIO Parameters:

Max sessions per volume slice:: 2

Max sessions per entire volume:: 6

Minimum adapter speed:: 1000

Default load balancing policy configuration: Round Robin (RR)

IOs Per Path: 16

Use MPIO for snapshots: Yes

Internet Protocol: IPv4

The mpio-parameters command succeeded.

 

15.  Restrict MPIO to just the SAN interfaces

Exclude LAN traffic

            rswcli -E -network 192.168.0.0 -mask 255.255.255.0

 

VERIFICATION:  List status of includes/excludes to verify changes

            rswcli -L

 

VERIFICATION:  Verify Host connection Mgr is managing just two interfaces

      ehcmcli -d

 

16.  Discover targets

iscsiadm -m discovery -t st -p 10.1.0.10

(Make sure no unexpected volumes connect.  But note the IQN name presented.  You’ll need it for later.)

 

VERIFICATION:  shows iface

[root@testvm ~]# iscsiadm -m iface | sort

default tcp,<empty>,<empty>,<empty>,<empty>

eql.eth1_0 tcp,00:50:56:8B:1F:71,<empty>,<empty>,<empty>

eql.eth1_1 tcp,00:50:56:8B:1F:71,<empty>,<empty>,<empty>

eql.eth2_0 tcp,00:50:56:8B:57:97,<empty>,<empty>,<empty>

eql.eth2_1 tcp,00:50:56:8B:57:97,<empty>,<empty>,<empty>

iser iser,<empty>,<empty>,<empty>,<empty>

 

VERIFICATION:  Check connection sessions via iscsiadm -m session to show that no connections exist

[root@testvm ~]# iscsiadm -m session

iscsiadm: No active sessions.

 

VERIFICATION:  Check connection sessions via /dev/mapper to show that no connections exist

[root@testvm ~]# ls -la /dev/mapper

total 0

drwxr-xr-x  2 root root     60 Aug 26 09:59 .

drwxr-xr-x 10 root root   3740 Aug 26 10:01 ..

crw-------  1 root root 10, 63 Aug 26 09:59 control

 

VERIFICATION:  Check connection sessions via ehcmcli -d to show that no connections exist

[root@testvm ~]# ehcmcli -d

 

17.  Log in to just one of the iface paths of your liking (eql.eth1_0 in this example).  Replace the IQN shown with yours.  The HIT will take care of the rest.

iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -l

 

This returned:

[root@testvm ~]# iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -l

Logging in to [iface: eql.eth1_0, target: iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001, portal: 10.1.0.10,3260]

Login to [iface: eql.eth1_0, target: iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001, portal: 10.1.0.10,3260] successful.

 

VERIFICATION:  Check connection sessions via iscsiadm -m session

[root@testvm ~]# iscsiadm -m session

tcp: [1] 10.1.0.10:3260,1 iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001

tcp: [2] 10.1.0.10:3260,1 iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001

 

VERIFICATION:  Check connection sessions via /dev/mapper.  This is going to give you the string you will need for making and mounting the filesystem.

[root@testvm ~]# ls -la /dev/mapper

 

 

VERIFICATION:  Check connection sessions via ehcmcli -d

[root@testvm ~]# ehcmcli -d

 

18.  Make a new file system using the device-mapper name found above.  Replace the IQN portion with yours.  If this is an existing volume that has been used before (from a snapshot, or on another machine), there is no need to perform this step.  Documentation will show this step without the “-j” switch, which would format it as a non-journaled ext2 file system.  The -j switch formats it as an ext3 file system.

mke2fs -j -v /dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001
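Judging from the session output above, the /dev/mapper name appears to be “eql-” followed by everything after the colon in the IQN.  A tiny helper to derive it (this is my own observation from the verification output, not documented behavior):

```shell
# Derive the device-mapper path the HIT creates for a volume from its IQN
# by stripping everything up to the colon and prefixing "eql-".
iqn_to_dm() {
    echo "/dev/mapper/eql-${1#*:}"
}
```

For example, `iqn_to_dm iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001` returns the same path used in the mke2fs command above.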

 

19.  Mount the device to a directory

[root@testvm mnt]# mount /dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001 /mnt/eql2tbnfs001

 

20.  Establish iSCSI connection automatically

[root@testvm ~]# iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -o update -n node.startup -v automatic

 

21.  Mount volume automatically

Change /etc/fstab, adding the following:

/dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001 /mnt/eql2tbnfs001 ext3 _netdev  0 0

Restart system to verify automatic connection and mounting.
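One piece the steps above don’t show is the NFS export itself, which the deployment assumptions call for.  A sketch of how I would add it (`add_nfs_export` is my own helper; the client range mirrors the 192.168.0.0 LAN excluded from MPIO earlier, and you may prefer to export the real mount point rather than the /data1 symlink):

```shell
# Append an export entry for the given directory to the given exports file.
add_nfs_export() {
    echo "$2 192.168.0.0/255.255.255.0(rw,sync,no_root_squash)" >> "$1"
}
# add_nfs_export /etc/exports /data1 && exportfs -ra
```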

 

Working with guest attached volumes
After you have things configured and operational, you’ll see how flexible guest iSCSI volumes are to work with.

  • Do you want to temporarily mount a snapshot to this same VM or another VM? Just turn the snapshot online, and make a connection inside the VM.
  • Do you need to archive your data volume to tape, but do not want to interfere with your production system? Mount a recent snapshot of the volume to another system, and perform the backup there.
  • Do you want to do a major update to that front end server presenting the data? Just build up a new VM, connect the new VM to that existing data volume, change your DNS aliasing (which you really should be using), and you’re done.
  • Do you need to analyze the I/O of the guest attached volumes? Just use SANHQ. You can easily see if that data should be living on some super fast pool of SAS drives, or a pool of PS4000e arrays.  You’ll be able to make better purchasing decisions because of this.

So, how did it measure up?

The good…
Right out of the gate, I noticed a few really great things about the HIT for Linux.

  • The prerequisites and installation.  No compiling or other unnecessary steps.  The installation package installed clean with no fuss.  That doesn’t happen every day.
  • Eqltune.  This little utility is magic.  Talk about reducing overhead in preparing a system for MPIO and all things related to guest based iSCSI volumes.  It gave me a complete set of adjustments to make, divided into 3 simple categories.  After I made the adjustments, I re-ran the utility, and everything checked out okay.  Actually, all of the command line tools were extremely helpful.  Bravo!
  • One really impressive trait of the HIT/LE is how it handles the iSCSI sessions for you. Session build up and teardown is all taken care of by the HIT for Linux.

The not so good…
Almost as fast as the good shows up, you’ll notice a few limitations.

  • Version 1.0 is only officially supported on RedHat Enterprise Linux (RHEL) 5.5 and 6.0 (no 6.1 as of this writing).  This might be news to Dell, but Debian based systems like Ubuntu are running in enterprises everywhere for their cost, solid behavior, and minimalist approach.  RedHat clones dominate much of the market; some commercial, and some free.  Personally, I find upstream distributions such as Fedora sketchy, and prone to breakage with each release (Note to Dell: I don’t blame you for not supporting these.  I wouldn’t either).  Other distributions are quirky for their own reasons of “improvement,” and I can understand why these weren’t initially supported either.  A safer approach for Dell (and the more flexible approach for the customer) would be to 1.) get out a version for Ubuntu as fast as possible, and 2.) extend the support of this version to RedHat’s downstream, 100% binary-compatible, very conservative distribution, CentOS.  For you Linux newbies, think of CentOS as the RedHat installation with the proprietary components stripped out, and nothing else added.  While my first production Linux server running the HIT is RedHat 5.5, all of my testing and early deployment occurred on a CentOS 5.5 distribution, and it worked perfectly.
  • No AutoSnapshot Manager (ASM) or equivalent.  I rely on ASM/ME on my Windows VMs with guest attached volumes to provide me with a few key capabilities: 1.) a mechanism to protect the volumes via snapshots and replicas, and 2.) coordination of applications and I/O so that I/O is flushed properly.  Now, Linux does not have any built-in facility like Microsoft’s Volume Shadow Copy Services (VSS), so Dell can’t do much about that.  But perhaps some simple script templates might give users ideas on how to flush and pause I/O of the guest attached volumes for snapshots.  Just having a utility to create Smart Copies or mount them would be pretty nice.
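In the spirit of that last bullet, here is a rough script template of what an ASM-like quiesce might look like on Linux.  This is entirely hypothetical: `create_san_snapshot` is a placeholder for whatever triggers the snapshot on your PS group, and fsfreeze is not available on RHEL 5.5, so treat this as an idea rather than a recipe:

```shell
# Flush and pause I/O on a guest attached volume, snapshot it, then resume.
MOUNTPOINT=/mnt/eql2tbnfs001

quiesce_and_snapshot() {
    sync                       # flush dirty pages down to the array
    fsfreeze -f "$MOUNTPOINT"  # block new writes while the snapshot fires
    create_san_snapshot        # placeholder: trigger the PS group snapshot
    fsfreeze -u "$MOUNTPOINT"  # resume I/O
}
# quiesce_and_snapshot
```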

The forgotten…
A few things overlooked?  Yep.

  • I was initially encouraged by the looks of the documentation.  However, in order to come up with the above, I had to piece together information from a number of different resources.  Syntax and capitalization errors will kill you in a Linux shell environment, and some of those inconsistencies and omissions showed up.  With a little triangulation, I was able to get things running correctly, but it quickly became a frustrating, time consuming exercise that felt like something I’d been through before.  Hopefully the information provided here will help.
  • Somewhat related to the documentation issue is something that has come up with a few of the other EqualLogic tools: customers often don’t understand WHY one might want to use the tool.  The same thing goes for the HIT for Linux.  Nobody even gets to the “how” if they don’t understand the “why.”  But I’m encouraged by the great work the Dell TechCenter has been doing with their white papers and videos.  It has become a great source for current information, and they are moving in the right direction on customer education.

Summary
I’m generally encouraged by what I see, and am hoping that Dell EqualLogic takes on the design cues of the HIT/ME to employ features like AutoSnapshot Manager, and an equivalent to eqlxcp (EqualLogic’s offloaded file copy command in Windows).  The HIT for Linux helped me achieve exactly what I was trying to accomplish.  The foundation for another easy to use tool in the EqualLogic lineup is certainly there, and I’m looking forward to seeing how this can improve.

Helpful resources
Configuring and Deploying the Dell EqualLogic Host Integration Toolkit for Linux
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19861419/download.aspx

Host Integration Tools for Linux – Installation and User Guide
https://www.equallogic.com/support/download_file.aspx?id=1046 (login required)

Getting more IOPS on workloads running RHEL and EQL HIT for Linux
http://en.community.dell.com/dell-blogs/enterprise/b/tech-center/archive/2011/08/17/getting-more-iops-on-your-oracle-workloads-running-on-red-hat-enterprise-linux-and-dell-equallogic-with-eql-hitkit.aspx 

RHEL5.x iSCSI configuration (Not originally authored by Dell, nor specific to EqualLogic)
http://www.equallogic.com/resourcecenter/assetview.aspx?id=8727 

User’s experience trying to use the HIT on RHEL 6.1, along with some other follies
http://www.linux.com/community/blogs/configuring-dell-equallogic-ps6500-array-to-work-with-redhat-linux-6-el.html 

Dell TechCenter website
http://DellTechCenter.com/ 

Dell TechCenter twitter handle
@DellTechCenter

Reworking my PowerConnect 6200 switches for my iSCSI SAN

It sure is easy these days to get spoiled with the flexibility of virtualization and shared storage.  Optimization, maintenance, fail-over, and other adjustments are so much easier than they used to be.  However, there is an occasional reminder that some things are still difficult to change.  For me, that reminder was the switches I use for my SAN.

One of the many themes I kept hearing at this year’s Dell Storage Forum (a great experience, I must say) throughout several of the breakout sessions I went to was “get your SAN switches configured correctly.”  A nice reminder of something I was all too aware of already: my Dell PowerConnect 6224 switches had not been configured correctly since the day they replaced my slightly less capable (but rock solid) PowerConnect 5424’s.  I returned from the forum committed to getting my switchgear updated and configured the correct way.  Now for the tough parts…  What does “correct” really mean when it comes to the 6200 series switches?  And why didn’t I take care of this a long time ago?  Here are just a few excuses, er, reasons.

  • At the time of initial deployment, I had difficulty tracking down documentation written specifically for configuring the 6224’s for iSCSI.  Eventually, I did my best to interpret the configuration settings of the 5424’s, and apply the same principles to the 6224’s.  Unfortunately, the 6224’s are a different animal than the 5424’s, and that showed up after I placed them into production – a task that I regretfully rushed.
  • When I deployed them into production, the current firmware was the 2.x generation.  It was my understanding after the deployment that the 2.x firmware on the 6200 series definitely had growing pains.  I also had the unfortunate timing that the next major revision came out shortly after I put them into production.
  • I had two stacked 6224 switches running my production SAN environment (a setup that was quite common for those I asked at the Dell Storage Forum). While experimenting with settings might be fun in a lab, it is no fun, and serious business when they are running a production environment. I wanted to make adjustments just once, but had difficulty confirming settings.
  • When firmware needs to be updated (a conclusion to an issue I was reporting to Technical Support), it is going to take down the entire stack.  This means that you’d better have everything that uses the SAN off unless you like living dangerously.  Major firmware updates will also require the boot code in each switch to be updated.  This makes it a true “lights out” maintenance window requiring everything to be shut down.  The humble little 5424’s, LAG’d together, didn’t have that problem.
  • The 2.x to 3.x firmware update also required the boot code to be updated.  However, you simply couldn’t run an “update bootcode” command.  The documentation made this very clear.  The PowerConnect Technical Support Team indicated that the two versions ran different algorithms to unpack the contents, which was the reason for yet another exception to the upgrade process. 

One of the many best practices recommended at the Forum was to stack the switches instead of LAGing them.  Stack, stack, stack was drilled into everyone’s head.  The reasons are very good, and make a lot of sense.

  • A stacking module in many ways extends the circuitry of a single switch, so the stack interconnect doesn’t have to honor, or be limited by, traditional Ethernet.
  • Managing one switch manages them all.
  • Better, more scalable bandwidth between switches
  • No messing around with LAG’s

But here lies the conundrum for many Administrators responsible for production environments.  While stacked 6224’s offer redundancy against hardware failure, they offer no redundancy when it comes to maintenance.  Stacked switches are seen as one logical unit, and may be your weakest link when it comes to maintenance of your virtualized infrastructure.  Interestingly enough, when I inquired further about effective strategies for updating under this topology, I found many other users stuck with this very same dilemma, and the answers provided weren’t too exciting.  There were generally three answers to this design decision:

  • Plan for a “lights out” maintenance window.
  • Buy another set of two switches, stack those, then trunk the two stacks together via 10GbE.
  • Buy better switches. 

See why I wasn’t too excited about my options?

Decision time.  I knew I’d suffer a bit of downtime updating the firmware and revamping the configuration no matter what I did.  Do I stack them as recommended, only to be faced with the same dilemma on the next firmware upgrade?  Or do I LAG the switches together so that I avoid this upgrade fiasco in the future?  LAG’ing is not perfect either; the more arrays I add (and the more inter-array traffic increases with new array features), the more the limitations of LAGs may be compounded. 

What option won out?  I decided to give stacking ONE more try.  I had to keep my eye on the primary objective: correcting my configuration by way of a firmware upgrade, and building up a simple, pristine configuration from scratch.  The idea was that the configuration would initially contain the minimum set of modifications to get the switches working according to best practices.  Then, I could build on that configuration in the future.  Also influencing my decision was finding out that recommended settings for LAGs apparently change frequently.  For instance, just recently, the recommended flow control setting for the port channel in a LAG was changed.  These are the types of things I wanted to stay away from.  That said, I will continue to keep the option of LAG’ing them open, for the sole reason that it offers the flexibility for maintenance without shutting down your entire cluster.

So here were my minimum desired results for the switch stack after the upgrade and reconfiguration.  Pretty straightforward. 

  • Management traffic on its own VLAN (VLAN 10) on port 1 (for uplinking) and port 2 (for local access).
  • iSCSI traffic on its own VLAN (VLAN 100), on all ports excluding the management ports.
  • Essentially no traffic on the Default VLAN
  • Recommended global and port specific settings (flow control, spanning tree, jumbo frames, etc.) for iSCSI traffic endpoint connections
  • iSCSI traffic that was available to be routed through my firewall (for replication).

My configuration rework assumed a successful boot code and firmware upgrade to version 3.2.1.3.  I pondered a few different ways to speed the process up, but ultimately just followed the very good steps provided in the firmware documentation.  They were clear and accurate.

By the way, on June 20th, 2011, Dell released their very latest firmware update (thank you RSS feed) to 3.2.1.3 A23.  This now includes “Auto Detection” of ports for iSCSI traffic.  Even though the name implies a feature that might be helpful, the documentation did not provide enough information, so I decided to configure manually as originally planned.

For those who might be in the same boat as I was, here are the exact steps I took for building up a pristine configuration after updating the firmware and boot code.  The configuration below was definitely a combined effort by the folks from the EqualLogic and PowerConnect Teams, and me poring over a healthy amount of documentation.  It was my hope that this combined effort would eliminate some of the contradictory information I found in previous best practices articles, forum threads, and KB articles that assumed earlier firmware.  I’d like to thank them for being tolerant of my attention to detail, and of my desire to get this right the first time.  You’ll see that the rebuild steps are very simple.  Getting confirmation on this was not.

Step 1:  Reset the switch to defaults (make a backup of your old config, just in case)
enable
delete startup-config
reload

 
Step 2:  When prompted, follow the setup wizard in order to establish your management IP, etc. 
 
Step 3:  Put the switch into admin and configuration mode.
enable
configure

 
Step 4:  Establish Management Settings
hostname [yourstackhostname]
enable password [yourenablepassword]
spanning-tree mode rstp
flowcontrol

 
Step 5: Add the appropriate VLAN IDs to the database and setup interfaces.
vlan database
vlan 10
vlan 100
exit
interface vlan 1
exit
interface vlan 10
name Management
exit
interface vlan 100
name iSCSI
exit
ip address vlan 10
 
Step 6: Create an Etherchannel Group for Management Uplink
interface port-channel 1
switchport mode access
switchport access vlan 10
exit
NOTE: Because the switches are stacked, port one on each switch is configured in this channel-group, which can then be connected to your core switch or intermediate switch for management access. Port two on each switch can be used if you need to plug a laptop into the management VLAN, etc.
 
Step 7: Configure/assign Port 1 as part of the management channel-group:
interface ethernet 1/g1
switchport access vlan 10
channel-group 1 mode auto
exit
interface ethernet 2/g1
switchport access vlan 10
channel-group 1 mode auto
exit
 
Step 8: Configure Port 2 as Management Access Switchports (not part of the channel-group):
interface ethernet 1/g2
switchport access vlan 10
exit
interface ethernet 2/g2
switchport access vlan 10
exit
 
Step 9: Configure Ports 3-24 as iSCSI access Switchports
interface range ethernet 1/g3-1/g24
switchport access vlan 100
no storm-control unicast
spanning-tree portfast
mtu 9216
exit
interface range ethernet 2/g3-2/g24
switchport access vlan 100
no storm-control unicast
spanning-tree portfast
mtu 9216
exit
NOTE:  Binding the xg1 and xg2 interfaces into a port-channel is not required for stacking. 
 
Step 10: Exit from Configuration Mode
exit
 
Step 11: Save the configuration!
copy running-config startup-config

Step 12: Back up the configuration
copy startup-config tftp://[yourTFTPip]/conf.cfg

In hindsight, the most time consuming aspect of all of this was trying to confirm the exact settings for the 6224’s in an iSCSI SAN.  Running a close second was shutting down all of my VMs, ESX hosts, and anything else that connected to the SAN switchgear.  The upgrade and the rebuild were relatively quick and trouble-free.  I’m thrilled to have this behind me now, and I hope that by passing this information along, you too will have a very simple working example to build your configuration from.  As for the 6224’s, they are working fine now.  I will continue to keep my fingers crossed that Dell will eventually provide a way to update firmware on a stacked set of 6200 series switches without a lights out maintenance window.

Software that helps make life in IT a little easier

 

In IT, rarely is one truly developing something from the ground up.  In many ways, IT is about making solutions work, disjointed as they may be.  Large enterprise class solutions such as email and messaging platforms, Content Management Systems, CRMs, Directory Services, and Security Solutions are all massively complex, even if they are well designed.  Those of us who are faced with the responsibility to “make it work” must possess the knack to be a deep-dive expert on any number of subjects, while having the big picture perspective of the IT Generalist.  It can be a complex mix of factors that determines how well solutions end up working out.  It’s usually an assorted mix of experience, technical and organizational skillsets, ingenuity, a lot of hard work, and a little bit of luck.  This is how the seasoned IT veteran separates themselves from those less experienced. 

Then, every once in a while a piece of software comes along to make your life in IT easier.  Software that helps bridge the gaps that may exist in cross platform integration, connectivity, management, monitoring, or procedural tasks.  These are applications that don’t make deploying or managing complex systems easy.  They just make it a little easier.  Sometimes you stumble upon helpful applications like these almost by accident, as I have.  Others you knew of, but just never got around to trying out.  So I thought I’d take a brief time-out from my recent focus on all things related to Virtualization, and take a moment to share a few of those applications that are currently making my life in IT a little easier.  Some of those listed below are worthy of their own posts, which I hope to get around to.  It is a list that is neither complete, nor appropriate for every environment, and their importance really depends on how much you need them.  Only time will tell which solutions become obsolete, and which ones stand up over time.

Scribe Insight
This may be the best product you’ve never heard of.  If you ever need to transform, manipulate, or convert data from disparate systems, this is the product for you.  No, it’s not a “utility” but an enterprise class solution that demands a commitment in time to learn.  The results are stunning.  Data sources that had no earthly intention of being able to talk to another system can share the same data.   Example:  Your Sales Department uses a CRM running on SQL, but an ERP or Finance system runs on Oracle, and you need those records to interact on a transaction by transaction basis.  Scribe can do that, and much more.  Are those systems running on separate networks?  No problem.  Scribe simplifies the communication channels between autonomous systems.  It can insulate you from the complexity of convoluted database tables, and in some cases will completely eliminate the need for you to use an application’s SDK for data integration.  Database Administrators would love this tool, but its power extends well beyond just database integration.  It’s a true gem.

Tree Size Pro
You have a choice. Spend weeks and weeks trying to get PowerShell or VB scripts to analyze and manipulate your large flat-file storage contents, or spend a few bucks for Tree Size Pro.  This product delivers.  I’ve used it to generate reports on storage usage, and to automate flat file storage cleanup tasks.  When I think about what it would have taken to do it programmatically, I’d still be working on it.

OneNote
I’ve written about OneNote before, and how it can be utilized in IT.  Since that time, I’ve learned how to exploit it even more, and it goes with me everywhere.  It could be 10 times the price it is, and I’d still pay for it myself if needed.  It’s the pocket knife that should be in every Administrator’s tool chest.  The larger your team, the better it works.  Design documentation, troubleshooting active issues, project planning, research, etc.  It will help you become a better Administrator. 

Likewise 
This software allows Unix, Linux, and Mac systems to authenticate against Active Directory.  It allows for centralized management of these systems using Group Policy Objects, in the same way you manage your Windows machines.  I was one of their first customers, and have been thrilled to see it mature over the years.  Their Open Source edition is OEM integrated into Linux distributions such as Ubuntu and SUSE, and other products like VMware vSphere.  The free/Open Source edition allows you to join these systems to AD, while the commercial edition allows for centralized management.

PuTTY
If you need a solid Windows-based SSH client to connect to your Linux systems, this is it.  One version (.56b) also supports the “Generic Security Services API” or GSSAPI.  This means that if your Linux machines are domain joined using Likewise, you can leverage Active Directory to log in to that Linux system, inheriting your credentials so that it is all passwordless.  Included with it is “plink,” which gives you the ability to run a *nix command remotely from the Windows system.  Great for routines initiated from a Windows workstation.  “Pscp” is the PuTTY SCP client for getting files to and from that connected *nix system.

CionSystems AD Change Notifier
One of the interesting aspects of Active Directory is that there are object changes all the time, but as an Administrator, you have no way of knowing it. AD Change Notifier helps with that.  Simple, yet effective.  It sends you an email notification of object changes in AD.  You can select whether you want all types of changes (modifies, creates, deletes), as well as particular object types (users, machines, OUs, GPOs, etc.). You learn a little about how objects change in AD, and if you delegate AD responsibility, how and what is being changed in AD.

Wyse Pocket Cloud for the iPhone and iPad
Not unique in its purpose, but this RDP (and optionally PCoIP) client for the iPhone and iPad does what it’s supposed to do flawlessly.  Any app that lets you reboot a critical server from the golf course is good in my book.  Any app that lets you do that on the golf course, in front of the VP of the company, is even better. (True story)

Acronis
Long before the wonders of virtualization, there were byte-level disk imaging solutions to help with your system protection and recovery needs.  This was like magic at the time, especially as it was becoming obvious that file based backups of system partitions were never any good in the first place.  While it may not be needed in the Enterprise like it once was, there are still a few good use cases for it.  It’s also pretty handy to have on your home system, and every one of your neighbors’ home systems.  …Or at least the neighbors who know you’re in IT, and think you are their personalized technical support. 

CionSystems AD Self Service
Yet another tool from CionSystems.  It takes the burden off of IT for user account related activities.  Does the user need to change their cell phone number or their home address?  Does a Department Manager need to change the Title of someone’s position?  AD Self Service can do this, without ever giving these end users privileges.  Updating AD related attributes is especially important if you use other solutions that leverage AD information (Exchange, SharePoint, CRM, etc.).  AD Self Service also allows for a secure way for the user to unlock their locked out account.  The more users you manage, the more this product will help take the burden off of IT.

SolarWinds Subnet Calculator
Some networking purists would flog me upside the head for recommending such a cheater app.  But the fact is, I need a quick and easy way to review subnetting options in order to make the right decision.  I can subnet manually, much like I can do arithmetic manually.  I just choose not to.  I have other projects to allocate my time to, and I need the speed of a calculator to help me visit those options more quickly.  Subnet calculators like SolarWinds’ offer one other ability often overlooked: the ability to visualize the sizing of your subnets.  You can create problems by making subnets too small, or too large.  Tools like this give a great visual representation of how you want to split networks.  None of this excuses the requirement that every Administrator fully understand how subnetting works.  (I still marvel at how brilliant IP subnetting is.)  It’s just that once they do, an Administrator should be able to use a tool to make it easier and faster to make the correct decision.
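For the same kind of quick what-if subnet math, Python’s standard ipaddress module works in a pinch too.  A minimal sketch, using example network values, showing how a /24 divides into /26’s:

```python
import ipaddress

# An example block to split; any CIDR works here.
net = ipaddress.ip_network("192.168.100.0/24")

print(net.num_addresses)   # → 256 (total addresses, incl. network/broadcast)
print(net.netmask)         # → 255.255.255.0

# Visualize how the block divides into smaller subnets.
for subnet in net.subnets(new_prefix=26):
    print(subnet, "-", subnet.num_addresses - 2, "usable hosts")
```

It won’t draw you a picture like a dedicated calculator, but it answers the sizing questions quickly.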

FileZilla
For as long as FTP has been around, and as ubiquitous as it may seem, one might conclude that it all works the same.  Not true.  FTP servers have their own unique behaviors, just as FTP clients have their own quirks.  The firewalls that FTP traffic passes through add another variable that can frustrate end users and Administrators alike.  FileZilla seems to offer the most flexibility when working with remote FTP servers, and is what I use to handle a variety of different FTP needs.  FileZilla won’t eliminate the inherent complexities of the FTP protocol as it traverses multiple networks; it just makes them easier to negotiate.

Enjoy!

Zero to 32 Terabytes in 30 minutes. My new EqualLogic PS4000e

Rack it up, plug it in, and away you go.  Those are basically the steps needed to expand a storage pool by adding another PS array using the Dell/EqualLogic architecture.  A few weeks ago I took delivery of a new PS4000e to complement my PS6000e at my primary site.  The purpose of this additional array was really simple: we needed raw storage capacity.  My initial proposal and deployment of my virtualized infrastructure a few years ago was a good one, but I deliberately did not include our big flat-file storage servers in the initial scope of storage space requirements.  There was plenty to keep me occupied between the initial deployment and now.  It allowed me to get most of my infrastructure virtualized, and gave a chance for buy-in to the skeptics who thought all of this new-fangled technology was too good to be true.  Since that time, storage prices have fallen, and larger drive sizes have become available.  Delaying the purchase aligned well with “just-in-time” purchasing principles, and also gave me an opportunity to address the storage issue in the correct way.   At first, I thought all of this was subject matter not worthy of writing about.  After all, EqualLogic makes it easy to add storage.  But that only addresses part of the problem.  Many of you face the same dilemma regardless of what your storage solution is: user facing storage growth.

Before I start rambling about my dilemma, let me clarify what I mean by a few terms I’ll be using: “user facing storage” and “non user facing storage.” 

  • User Facing Storage is simply the storage that is presented to end users via file shares (in Windows) and NFS mounts (in Linux).  User facing storage is waiting there, ready to be sucked up by an overzealous end user. 
  • Non User Facing Storage is the storage occupied by the servers themselves, and the services they provide.  Most end users generally have no idea how much space a server reserves for, say, SQL databases or transaction logs (nor should they!).  Needs for non user facing storage are easier to anticipate and manage because the storage is only exposed to system administrators. 

Which array…

I decided to go with the PS4000e because of the value it returns, and how it addresses my specific need.  If I had targeted VDI or some storage for other I/O intensive services, I would have opted for one of the other offerings in the EqualLogic lineup.  I virtualized the majority of my infrastructure on one PS6000e with 16, 1TB drives in it, but it wasn’t capable of the raw capacity that we now needed to virtualize our flat-file storage.  While the effective number of 1GB ports is cut in half on the PS4000e as compared to the PS6000e, I have not been able to gather any usage statistics against my traditional storage servers that suggest the throughput of the PS4000e will not be sufficient.  The PS4000e allowed me to trim a few dollars off of my budget line estimates, and may work well at our CoLo facility if we ever need to demote it.

I chose to create a storage pool so that I could keep my volumes that require higher performance on the PS6000, and have the dedicated storage volumes on the PS4000.  I will do the same for when I eventually add other array types geared for specific roles, such as VDI.

Truth be told, we all know that 16, 2 terabyte drives do not equal 32 Terabytes of real world space.  The RAID50 penalty knocks that down to about 21TB.  Cut that by about half for average snapshot reserves, and it’s more like 11TB.  Keeping a little bit of free pool space available is always a good idea, so let’s just say it effectively adds 10TB of full fledged enterprise class storage.  This adds to my effective storage space of 5TB on my PS6000.  Fantastic.  …but wait, one problem.  No, several problems.
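The napkin math above can be written out explicitly.  A minimal sketch, where the reduction factors are my own rough estimates rather than published EqualLogic figures:

```python
# Back-of-the-napkin effective capacity for 16 x 2TB drives in RAID50.
# The reduction factors are rough estimates, not published figures.
raw_tb = 16 * 2                          # 32 TB raw

after_raid50 = raw_tb * 0.66             # RAID50 penalty: ~21 TB
after_snapshots = after_raid50 * 0.5     # snapshot reserves: ~10.5 TB
effective_tb = after_snapshots - 0.5     # keep a little free pool space

print(round(after_raid50))               # → 21
print(round(effective_tb))               # → 10
```

Adjust the factors for your own RAID policy and snapshot reserve settings; the point is only that raw capacity and usable capacity are very different numbers.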

The Dilemma

Turning up the new array was the easy part.  In less than 30 minutes, I had it mounted, turned on, and configured to work with my existing storage group.  Now for the hard part; figuring out how to utilize the space in the most efficient way.  User facing storage is a wildcard; do it wrong and you’ll pay for it later.  While I didn’t know the answer, I did know some things that would help me come to an educated decision.

  • If I migrate all of the data on my remaining physical storage servers (two of them, one Linux, and one Windows) over to my SAN, it will consume virtually all of my newly acquired storage space.
  • If I add a large amount of user-facing storage, and present that to end users, it will get sucked up like a vacuum.
  • If I blindly add large amounts of great storage at the primary site without careful thought, I will not have enough storage at the offsite facility to replicate to.
  • Large volumes (2TB or larger) not only run into technical limitations, but are difficult to manage.  At that size, there may also be a co-mingling of data that is not necessarily business critical.  Doling out user facing storage in large volumes is easy to do.  It will come back to bite you later on.
  • Manipulating the old data in the same volume as new data does not bode well for replication and snapshots, which look at block changes.  Breaking them into separate volumes is more effective.
  • Users will not take the time or the effort to clean up old data.
  • If data retention policies are in place, users will generally be okay with them after a substantial amount of complaining. It’s not too different from the complaining you might hear when there are no data retention policies, but you have no space.  Pick your poison.
  • Your users will not understand data retention policies if you do not understand them.  Time for a plan.

I needed a way to compartmentalize some of the data so that it could be identified as “less important” and then perhaps live on less important storage.  By “less important storage,” I mean a part of the SAN that is not replicated, or in a worst case scenario, even some old decommissioned physical servers, where the data resides for a defined amount of time before it is permanently archived and removed from the probationary location.

The Solution (for now)

Data Lifecycle management.  For many, this means some really expensive commercial package, and that might be the way to go for you too.  To me, it is really nothing more than determining what data is important and what isn’t, and having a plan to help automate the demotion or retirement of that data.  However, there is a fundamental problem with this approach.  Who decides what’s important?  What are the thresholds?  Last accessed time?  Last modified time?  What are the ramifications of cherry-picking files from a directory structure because they exceed policy thresholds?  What is this going to break?  How easy is it to recover data that has been demoted?  There are a few steps I needed to take to accomplish this. 

1.  Poor man’s storage tiering.  If you are out of SAN space, re-provision an old server.  Its purpose will be to serve up volumes that are linked to the primary storage location through symbolic links.  These volumes can then be backed up at a less frequent interval, as the data would be considered less important.  If you eventually have enough SAN storage space, they could easily be moved onto the SAN in a less critical role, or onto a SAN array with larger, slower disks.
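The symbolic link trick can be sketched in a few lines of Python.  The helper name and paths below are hypothetical, and on Windows, creating symlinks requires appropriate privileges:

```python
import os
import shutil

def demote_directory(primary_path, demoted_path):
    """Move a directory to less important storage, leaving a
    symbolic link behind so the original path keeps working."""
    shutil.move(primary_path, demoted_path)
    os.symlink(demoted_path, primary_path, target_is_directory=True)

# Hypothetical usage; substitute your own share and archive server.
# demote_directory("/srv/shares/projects/old_project",
#                  "/mnt/oldserver/archive/old_project")
```

Users keep browsing the same path; the bytes now live on the re-provisioned box.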

2.  Breaking up large volumes.  I’m convinced that giant volumes do nothing for you when it comes to understanding and managing the contents.  Turning larger blobs into smaller blobs also serves another very important role.  It allows the intelligence of the EqualLogic solution to do its work in deciding where the data should live in a collection of arrays.  A storage Group that consists of, say, an SSD based array, a PS6000, and a PS4000 can effectively store each volume on the array that best suits the demand.

3.  Automating the process.  This will come in two parts: a.) deciding on structure, policies, etc., and b.) making or using tools to move the files from one location to another.  On the Linux side, this could mean anything from a bash script to something written in Python, with cron to schedule the occurrence.  In Windows, you could leverage PowerShell, VBScript, or batch files.  This can be as simple, or as complex, as your needs require.  However, if you are like me, you have limited time to tinker with scripting.  If there is something turn-key that does the job, go for it.  For me, that is an affordable little utility called “TreeSize Pro.”  It gives you not only the ability to analyze the contents of NTFS volumes, but also the ability to easily automate the pruning of this data to another location.
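If you do decide to script the file moves yourself, the pruning step might look like the following minimal sketch.  The function name, paths, and four-year threshold are illustrative assumptions, not a hardened tool:

```python
import os
import shutil
import time

def prune_stale_files(source, archive, max_age_days):
    """Move files not modified within max_age_days from 'source'
    to the same relative location under 'archive'."""
    cutoff = time.time() - max_age_days * 86400
    for root, _dirs, files in os.walk(source):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                dest = os.path.join(archive, os.path.relpath(path, source))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(path, dest)

# Hypothetical usage: demote anything untouched for four years.
# prune_stale_files("/srv/shares/projects", "/mnt/archive/projects", 4 * 365)
```

Scheduled from cron or Task Scheduler, this covers the same ground as the turn-key tools, minus the reporting.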

4.  Monitoring the result.  This one is easy to overlook, but you will need to monitor the fruits of your labor, and make sure it is doing what it should be doing: maintaining available storage space on critical storage devices.  There are a handful of nice scripts written for both platforms that help you monitor free storage space at the server level.
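For the monitoring piece, a tiny free-space check can be scheduled the same way.  A sketch, where the 15% threshold is an arbitrary example and in practice you would email the alert rather than print it:

```python
import shutil

def percent_free(path):
    """Return the percentage of the volume holding 'path' that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100

# Example threshold check against the root volume.
if percent_free("/") < 15:
    print("WARNING: low free space on /")
```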

The result

The illustration below helps demonstrate how this would work. 

image

As seen below, once a system is established to automatically move and house demoted data, you can more effectively use storage on the SAN.

image

Separation anxiety…

In order to make this work, you will have to work hard to make sure that all of this is fairly transparent to the end user.  If you have data that has complex external references, you will want to preserve the integrity of the data that relies on those dependent files.  Hey, I never said this was going to be easy. 

A few things worth remembering…

If 17 years in IT, and a little observation of human nature, have taught me one thing, it is that we all undervalue our current data, and overvalue our old data.  You see it time and time again.  Storage runs out, and there are cries for running down to the local box store and picking up some $99 hard drives.  What needs to reside on there is mission critical (hence the undervaluing of the new data).  Conversely, efforts to have users clean up data from 10+ years ago had users hiding files in special locations, even though it was recorded that the files had not been modified, or even accessed, in 4+ years.  All of this of course lives on enterprise class storage.  An all too common example of overvaluing old data.

Tip.  Remember your Service Level Agreements.  It is common in IT to have SLAs not only for systems and data, but for one’s position.  These are without doubt tied to one another; make sure that one doesn’t compromise the other.  Stop gap measures to accommodate more storage will trigger desperate, “affordable” solutions (e.g. adding cheap non-redundant drives to an old server somewhere).  Don’t do it!  All of those arm-chair administrators in your organization will be nowhere to be found when those drives fail, and you are left to clean up the mess.

Tip.  Don’t ever thin provision user facing storage.  Fortunately, I was lucky to be clued into this early on, but I could only imagine the well intentioned administrator who wanted to present a nice amount of storage space to the user, only to find it sucked up a few days later.  Save the thin provisioning for non user facing storage (servers with SQL databases and transaction logs, etc.)

Tip.  If you are presenting proposals to management, or general information updates to users, I would suggest quoting only the amount of effective, usable space that will be added.  In other words, don’t say you are adding 32TB to your storage infrastructure when, in fact, it is closer to 10TB.  Say that it is 10TB of extremely sophisticated, redundant enterprise class storage that you can “bet the business” on.  Its scalability, flexibility and robustness are needed for the 24/7 environments we insist upon.  It will just make things easier that way.

Tip.  It may seem unnecessary to you, but continue to show off snapshots, replication, and other unique aspects of SAN storage if you still have those who doubt the power of this kind of technology – especially when they see the cost per TB.  Repeat to them how long it would take (if it is even possible) to protect that same data on traditional storage.  Do everything you can to help those who approve these purchases.  More than likely, they won’t be as impressed by, say, how quick a snapshot is; rather, they’ll be shocked that traditional storage can’t be protected very well.

You may have noticed I do not have any rock-solid answers for managing the growth and sustainability of user facing data.  Situations vary, but the factors that help determine that path for a solution are quite similar.  Whether you decide on a turn-key solution, or choose to demonstrate a little ingenuity in times of tight budgets, the topic is one that you will probably have to face at some point.

 

How I use Dell/EqualLogic’s SANHQ in my environment

 

One of the benefits of investing in Dell/EqualLogic’s SAN solutions is the number of great tools included with the product at no extra charge.  I’ve written in the past about leveraging their AutoSnapshot Manager for VM and application consistent snapshots and replicas.  Another tool that deserves a few words is SAN HeadQuarters (SANHQ). 

SANHQ allows for real-time and historical analysis of your EqualLogic arrays.  Many EqualLogic users are well versed with this tool, and may not find anything here that they didn’t already know.  But I’m surprised to hear that many are not.  So, what better way to help those unfamiliar with SANHQ than to describe how it helps me with my environment.

While the tool itself is “optional” in the sense that you don’t need to deploy it to use the EqualLogic arrays, it is an easy (and free) way to expose the powers of your storage infrastructure.  If you want to see what your storage infrastructure is doing, do yourself a favor and run SANHQ.   

Starting up the application, you might find something like this:

image

You’ll find an interesting assortment of graphs, and charts that help you decipher what is going on with your storage.  Take a few minutes and do a little digging.  There are various ways that it can help you do your job better.

 

Monitoring

Sometimes good monitoring is downright annoying.  It’s like the alarm clock next to your bed; it’s difficult to overlook, but that’s the point.  SANHQ has proven to be an effective tool for proactive monitoring and alerting on my arrays.  While some of its warnings are never fun, its biggest value is that it can help prevent those larger, much more serious problems, which always seem to be a series of small issues thrown together.  Here are some examples of how it has acted as the canary in the coal mine in my environment.

  • When I had a high number of TCP retransmits after changing out my SAN Switchgear, it was SANHQ that told me something was wrong.  EqualLogic Support helped me determine that my new switchgear wasn’t handling jumbo frames correctly. 
  • When I had a network port go down on the SAN, it was SANHQ that alerted me via email.  A replacement network cable fixed the problem, and the alarm went away.
  • If replication across groups is unable to occur, you’ll be notified right away that replication isn’t running.  The reasons for this can be many, but SANHQ usually gives you the first sign that something is up.  This works across physical topologies where your target may be at another site.
  • Under maintenance scenarios, you might find the need to pause replication on a volume, or on the entire group.  SANHQ will do a nice job of reminding you that it’s still not replicating, bugging you at a regular interval until it’s running again.

 

Analysis and Planning

SANHQ will allow you to see performance data at the group level, by storage pools, volumes, or volume collections.  One of the first things I do when spinning up a VM that uses guest attached volumes is to jump into SANHQ and see how those guest attached volumes are running.  How are the average IOPS?  What about latencies and queue depth?  All of those can be found easily in SANHQ, and can help put your mind at ease if you are concerned about your new virtualized Exchange or SQL servers.  Here is a screenshot of a 7 day history for a SQL server with guest attached volumes, driving our SharePoint backend services.

image

The same can be done of course for VMFS volumes.  This information will complement existing data one gathers from vCenter to understand if there are performance issues with a particular VMFS volume.

Oftentimes monitoring and analysis isn’t about absolute numbers, but rather, allowing the user to see changes relative to previous conditions.  This is especially important for the IT generalist who doesn’t have the time or the know-how for deep-dive storage analysis, or a dedicated Storage Administrator to analyze the data.  This is where the tool really shines.  For whatever type of data you are looking at, you can easily choose a timeline of the last hour, 8 hours, 1 day, 7 days, 30 days, etc.  The anomalies, if there are any, will stand out. 

image

Simply click on the Timeline that you want, and the historical data of the Group, member, volume, etc will show up below.

image

I find analyzing individual volumes (when they are VMFS volumes) and volume collections (when they are guest attached volumes) the most helpful in making sure that there are not any hotspots in I/O.  It can help you determine if a VM might be better served in a VMFS volume that hasn’t been demanding as much I/O as the one it’s currently in.
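That relative-to-baseline idea is easy to picture in code.  Here is a minimal Python sketch of flagging a sample that stands out against its recent history – purely illustrative, since SANHQ presents this visually rather than through any scripting API, and the latency figures below are invented:

```python
# Toy illustration of spotting anomalies relative to a baseline.
# SANHQ does this visually; the latency samples here are made up.
from statistics import mean, stdev

def find_anomalies(samples_ms, sigma=2.0):
    """Return indices of samples more than `sigma` std devs above the mean."""
    avg = mean(samples_ms)
    sd = stdev(samples_ms)
    return [i for i, s in enumerate(samples_ms) if s > avg + sigma * sd]

latency_ms = [4, 5, 4, 6, 5, 4, 5, 48, 5, 4]  # one obvious spike
print(find_anomalies(latency_ms))  # the spike at index 7 stands out
```

The exact threshold matters far less than the habit of comparing today’s numbers against last week’s – which is exactly what the timeline view gives you for free.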

It can also play a role in future procurement.  Those 15k SAS drives may sound like a neat idea, but does your environment really need them when you decide to add storage?  Thinking about VDI?  It can be used to help determine I/O requirements.  Recently, I was on the phone with a friend of mine, Tim Antonowicz.  Tim is a Senior Solutions Architect from Mosaic Technology who has done a number of successful VDI deployments (and who recently started a new blog).  We were discussing the possibility of VDI in my environment, and one of the first things he asked of me was to pull various reports from SANHQ so that he could understand our existing I/O patterns.  It wasn’t until then that I noticed all of the great storage analysis offerings that any geek would love.  There are a number of canned reports that can be saved out as a PDF, HTML, CSV, or other format to your liking.
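To give a sense of the sizing math those reports feed into, here is a hedged back-of-envelope sketch in Python.  Every figure in it (IOPS per desktop, write percentage, RAID write penalty) is an assumption for illustration – the whole point of pulling real SANHQ reports is to replace numbers like these with your own:

```python
# Back-of-envelope VDI sizing math of the kind SANHQ reports feed into.
# All figures (IOPS per desktop, read/write mix, RAID penalty) are
# illustrative assumptions -- substitute the real numbers from your reports.

def backend_iops(desktops, iops_per_desktop, write_pct, raid_write_penalty):
    """Translate front-end workload into back-end disk IOPS."""
    frontend = desktops * iops_per_desktop
    reads = frontend * (1 - write_pct)
    writes = frontend * write_pct * raid_write_penalty
    return reads + writes

# 100 desktops, 10 steady-state IOPS each, 70% writes, RAID-5 write penalty of 4
print(backend_iops(100, 10, 0.70, 4))  # 3100.0
```

Notice how the write-heavy mix typical of VDI, multiplied by the RAID write penalty, triples the raw front-end number – which is why guessing instead of measuring gets expensive.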

image

Replication Monitoring

The value of SANHQ went way up for me when I started replication.  It will give you summaries of each replicated volume.

image

If you click on an individual volume, it will help you see transfer sizes and replication times of the most recent replicas.  It also separates inbound replica data from outbound replica data.

image

While the times and the transfer rates will be skewed somewhat if you have multiple replicas running (as I do), it is a great example of how you can understand patterns in changed data on a specific volume.  The volume captured above represents where one of my Domain Controllers lives.  As you can see, it’s pretty consistent, and doesn’t change much, as one would expect (probably not much more than the swap file inside the VM, but that’s another story).  Other kinds of replicated data will fluctuate more widely.  This is your way to see it.
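If you want to turn those observed change rates into a planning number, a rough estimate is simple arithmetic.  This Python sketch (with made-up volume and link figures) converts the changed-data size SANHQ reports into an approximate replication window:

```python
# Rough estimate of how long a replication cycle needs, given the changed-data
# size SANHQ reports for a volume and the WAN bandwidth between sites.
# The numbers are placeholders; substitute your own from the replication view.

def replication_minutes(changed_mb, link_mbps, efficiency=0.7):
    """Minutes to ship `changed_mb` over a `link_mbps` WAN link.

    `efficiency` hedges for protocol overhead and link contention.
    """
    effective_mb_per_sec = (link_mbps / 8) * efficiency
    return changed_mb / effective_mb_per_sec / 60

# 1.5 GB of changed data over a 20 Mbps link
print(round(replication_minutes(1536, 20), 1))  # about 14.6 minutes
```

It’s deliberately crude, but even a crude number tells you whether nightly replication of a chatty volume will finish before morning.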

 

Running SANHQ

SANHQ will live happily on a standalone VM.  It doesn’t require much, but does need direct access to your SAN, and uses SNMP.  Once installed, SANHQ can be run directly on that VM, or the client-only application can be installed on your workstation for a little more convenience.  If you are replicating data, you will want SANHQ to connect to both the source site and the target site for most effective use of the tool.

Improvements?  Sure, there are a number of things that I’d love to see:

  • Alarms for performance thresholds.
  • Threshold templates that could be applied to a volume (VMFS or native) to help you understand the numbers (green = good, red = bad).
  • The ability to schedule reports, and define how and where they are posted.
  • Free pool space activity warnings (important if you choose to keep replica reserves low and leverage free pool space).
  • Array diagnostic dumps directly from SANHQ.
  • Programmatic access for scripting.

Improvements like these could make a useful product indispensable in a production environment.

Finally. A practical solution to protecting Active Directory

 

Active Directory.  It is the brains of most modern-day IT infrastructures, providing just about every conceivable control over how users, computers, and information interact with each other.  Authentication and user, group, and computer access control all help provide logical barriers that allow for secure access, while still giving a seamless user experience with single sign-on access to resources.  While it has the ability to improve and integrate critical services such as DNS, DHCP, and NTP, in many ways those services become dependent on Active Directory.  These days, Active Directory controls more than just pure Windows environments.  Integration with non-Microsoft operating systems like Ubuntu, SUSE, and VMware’s vSphere is becoming more common thanks to products such as Likewise.  The environment that I manage has Windows servers and clients, most distributions of Linux, Macs, a few flavors of Unix, VMware, and iPhones.  All of them rely on Active Directory.  You quickly learn that if Active Directory goes down, so does your job security.

Active Directory will run happily even under less than ideal circumstances.  It is incredibly resilient, and somehow can put up with server crashes, power outages, and all sorts of debauchery.  But neglect is not a required ingredient for things to go wrong.  When it does, the results can be devastating.  AD problems can be difficult to track down, and its tentacles will affect services you never considered.  A corrupt Active Directory, or the Domain Controllers it runs on, can make your Exchange and SQL servers crumble around you.  I lived through this experience (barely) a while back, and even though my preparation for such scenarios looked very good on paper, I spent a healthy amount of time licking my wounds and reassessing my backup strategy for Active Directory.  I never want to put myself in that position again.

As important as Active Directory is, it can be quite challenging to protect.  Why?  I believe the answer boils down to two main factors: it’s distributed, and it’s transaction based.  In other words, the two traits that make it robust also make it difficult to protect.  Large enterprises usually have a well-architected AD infrastructure, and at least understand the complexities of protecting their AD environment.  Many others are left pondering the various ways to protect it.

  • File-based backups using traditional backup methods.  This has never been enough, but my bet is that you’d find a number of smaller environments doing this – if they do anything at all.  It has worked for them only because they’ve never had a failure of any sort.
  • AD backup agents that are part of a commercial backup application.  Some applications like Symantec Backup Exec (what I previously relied on) seem like a good idea, but show their true colors when you actually try to use them for recovery.  While the agents should extend the functionality of the backup software, they just add to an already complex solution that feels like a monstrosity geared for other purposes.
  • Exporting AD on Windows 2008-based Domain Controllers using NTDSUTIL and the like.  This is difficult at best, arguably incomplete, and won’t work if you have a mix of Windows 2008 and Windows 2003 DCs.
  • Those who have virtualized their domain controllers often think that a well-timed independent snapshot or VCB backup will protect them.  This is not true either.  You will have a VM-consistent backup of the VM itself, but it does nothing to coordinate the application with the other Domain Controllers and the integrity of its contents.  In theory, they could be backed up properly if every single DC was shut down at the same time, but most of us know that would not be a solution at all.
  • Dedicated solutions exist to protect Active Directory, but can be overly complex, and outrageously expensive.  I’m sure they do their job well, but I couldn’t get the line item past our budget line owner to find out.

The result is often a desire to protect AD, but uncertainty about what “protect” really means.  Is protecting the server good enough?  Is protecting AD itself enough?  Does one need both, and if so, how does one go about doing that?  Without fully understanding the answers to those questions, something inevitably goes wrong, and the Administrator is frantically flipping through the latest TechNet article on Authoritative Restores, while attempting to figure out their backup software.  It’s particularly painful to the Administrator, who had the impression that they were protecting their Organization (and themselves) when in fact, they were not. 

In my opinion, protecting the domain should occur at two different levels.

  • Application layer.  This is critical.  Among other things, the backup will coordinate Active Directory so that all of its Update Sequence Numbers (USNs) are at an agreed-upon state.  This avoids USNs that are out of sync, which are at the root of so many AD-related problems.  Application layer protection should also honor these AD-specific attributes so that granular recovery of individual objects is possible.  Good backup software should leverage APIs that take advantage of the Volume Shadow Copy Service (VSS).
  • Physical layer.  This protects the system that the services may be running on.  If it’s a physical server, it could be using some disk imaging software such as Acronis, or Backup Exec System Recovery.  If it’s virtualized, an independent backup of the VM will do.  Some might suggest that protecting the actual machine isn’t technically required.  The idea behind that reasoning is that if there is a problem with the physical machine, or the OS, one can quickly decommission and commission another DC with “dcpromo.”  While protecting the system that AD runs on may not be required, it may help speed up your ability (in conjunction with Application layer protection) to correct issues from a previously known working state.
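To make the USN coordination mentioned above a little more concrete, here is a toy Python model of the bookkeeping involved: each DC tracks the highest update it has seen from every partner, and a DC whose view of a partner lags far behind that partner’s own USN is out of sync.  Real diagnosis would use a tool like `repadmin /showutdvec`; the DC names and USN values below are invented for illustration:

```python
# Toy model of USN bookkeeping between Domain Controllers.
# Each DC tracks the highest USN it has replicated from each partner;
# real tooling for this is `repadmin /showutdvec`. All data here is invented.

def lagging_partners(local_vector, partner_usns, max_lag=100):
    """Return partners whose own USN is more than `max_lag` ahead of
    what the local DC has seen from them, with the size of the gap."""
    return {
        dc: usn - local_vector.get(dc, 0)
        for dc, usn in partner_usns.items()
        if usn - local_vector.get(dc, 0) > max_lag
    }

seen_by_dc1 = {"DC2": 5200, "DC3": 8100}   # what DC1 has replicated so far
actual_usns = {"DC2": 5250, "DC3": 9400}   # each partner's own highest USN
print(lagging_partners(seen_by_dc1, actual_usns))  # {'DC3': 1300}
```

A coordinated application-layer backup captures all of this state together, which is exactly what an independent VM snapshot of a single DC cannot do.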

I was introduced to CionSystems by a colleague of mine who suggested their “Active Directory Self-Service” product to help us with another need of ours.  Along the way, I couldn’t help but notice their AD backup offering.  Aptly named, “Active Directory Recovery” is a complete application layer solution.  I tried it out, and was sold.  It allows for a simple, coordinated backup and recovery of Active Directory.  A recovery can be either a complete point-in-time restore, or a granular restore of an object.  It is agentless, meaning that you don’t have to install software on the DCs.  The first impression after working with it is that it was designed for one purpose: to back up Active Directory.  It does it, and does it well.

The solution will run on any spare machine running IIS and SQL.  Once installed, configuring it is just a matter of pointing it to your Domain Controller that runs the PDC Emulator role.  After a few configuration entries are made, the Administration console can be accessed with your web browser from anywhere on your network.

image

The next step is to set up a backup job, and let it run.  That’s it.  Fast, simple, and complete.  From the home page, there are a few different ways you can look at objects that you want to recover.

If it’s a deleted object, you can click on the “Deleted Objects” section.  Objects with a backup to restore from will show up in green, with the available backups presented below each object.  Below you will see a deleted computer object, and the backups that it can be restored from.

image

The “List Backups” view simply shows the backups created, in chronological order.  From there you can do full restores, or restore an individual object that still exists in AD.  Unlike authoritative restores, no system restarts are required.

image

During the restore process, “Active Directory Recovery” will expose individual attributes of the object that you want to restore – if you wish for the restore to be that granular.  If an attribute is restorable, there is a checkbox next to it; non-modifiable attributes will not have one.

image

One of my favorite features is that it provides a way for a true, portable backup.  One can export the backup to a single file (a proprietary .bin file) that is your entire AD backup, and save it onto a CD, or to a remote location.  This is a wish list item I’ve had for about as long as AD has been around.    There are many other nice features, such as email notifications, filtering and comparison tools, as well as backup retention settings. 

I use this product to complement my existing strategy for protecting my AD infrastructure.  While my virtualized Domain Controllers are replicated to a remote site (the physical protection, so to speak), I protect my AD environment at the application level with this product.  The server that “Active Directory Recovery” runs on is also replicated, but to be extra safe, I create a portable/exported backup that is also shipped off to the offsite location.  This way I have a fully independent backup of AD.  If I’m doing critical updates to my Domain Controllers, I first make a backup using Active Directory Recovery, then make my snapshots on my virtualized DCs.  That way, I have a way to roll back changes that is truly application consistent.

After using the product for a while, I can appreciate that I don’t have to invest much time to keep my backups up and running.  I previously used Symantec’s Backup Exec to protect AD, but grew tired of agent issues, licensing problems, and the endless backup failure messages.  I lost confidence in its ability to protect AD, and am not interested in going back. 

Hopefully this gives you a little food for thought on how you are protecting your Active Directory environment.  Good luck!