This concept is something I’ve carried around with me for my last 3 jobs, and since I’m writing it up for my current employer, I figured I should document it here as well. I’ve mainly worked in Linux/Windows environments, so you may sense a bit of bias away from older systems. It’s not intentional, just a result of my experience. Thanks to Mick for introducing me to this schema.
The purpose of this documentation is to provide a clean-cut and straight-forward convention for naming servers. The goal is to look towards the future and not be shackled by our past. This document should be read through once for a basic understanding, and used as a reference when new hosts and services are set up.It should provide a general convention, and may be modified in the future for clarity. All changes should be made by the owner of this document to ensure consistency.
Basic Domain Structure
The domain example.corp we be our designated placeholder for this discussion.
Host names are broken into two classifications: Physical Host Name and Service Host Name.
Physical host names can be compared to the DNA of a server and are by necessity somewhat cryptic.
Used when managing the physical server inventory
Used exclusively by the operations team
Used for resource monitors such as IO, CPU and Memory.
Service host names are more dynamic, user friendly, and may jump between physical servers as applications are migrated.
Used exclusively when discussing specific functionality of the server.
Used for managing cluster or application functionality e.g. “pushing out a new attributes.xml file to all shopcart jboss app servers”
Used for functional service monitoring.
Physical Host Name
Physical host names represent a unique way to refer to each and every physical server, however unlike the machine’s serial number, it has the flexibility to change. Each character of it’s name must have meaning or provide clarity. If the purpose of a host is changed, the host name may be changed to suit the new purpose, although this will happen infrequently (and make sure to define a procedure for changing a Physical Host Name). Names are broken into 6 key pieces of information:
[Location] [Service Level] [OS] [OS Major Version] [Purpose] [Identification Number]
Location refers to the datacenter in which the hardware is physically present. Each entry will consist of a unique 2 character code representing the city where the datacenter is located. The following is a definitive list of locations currently allowed. If a new entry is needed, please contact the owner of this document.
Grand Rapids, MI
Madison Hills, MI
Service level is loosely defined by the importance of the applications running on it, and what our perceived response should be. We currently have 3 designations:
This machine should be treated as a customer facing production host.
Downtime on this server has a significant impact on productivity.
Downtime on this server has minimal impact.
Note that no server should be completely ignored if it is in distress, this just provides a general guideline as to the urgency of the problem.
Operating System / OS Major Version
Automating system updates and configuration roll-outs is a key responsibility of the operations team. Embedding OS identification into the host name not only allows convention-based automation, but provides context for alerts during a production event. OS Identification is broken into two categories; Operating System and the Major Version number of that OS.
Proprietary One-Off/ appliance
Major OS Version
1 or 11
2 or 12
Purpose refers to the overall usage of the server; whether it’s a database server, application server, etc. definitions are purposefully loose to allow for similar services to share the same host. This list will grow as we better define our environment. Purpose designations should refer to generic uses rather than implementation specific (db rather than Oracle or Sybase).
Servers that run Web Applications; JBoss Application Server, Tomcat, IIS
Servers that host static or semi-static content; Apache, Nginx
Servers that host Databases; Oracle, Sybase, MySQL
Servers that host continuous integration; Hudson, TeamCity, CruiseControl
Servers used for by Operations; bind, ldap, pdsh, ssl cert creation
Terminal Server for Serial Access (not to be confused with MS Terminal Services)
Virtual Host Server
Virtual Machine Host Server
The last segment of the physical host name is a simple three digit numeric ID. The Numeric ID can be used in several contexts:
The numeric ID can increment when the rest of the host name is the same (mh1c6ci001, mh1c6ci002, etc).
The numeric ID can be used to designate clustered servers (gr2c6as013,gr2c6as023, gr2c6as033, gr2c6as043).
The follow is a list of sample host names:
mh2c6as011 – Madison Heights, Second Priority CentOS 6 Application Server 11 (App server hosting foo.com qa running on tomcat)
gr1r5db002 – Grand Rapids, Top Priority Red Hat 5 Database Server 2 (Production Oracle RAC server)
ax1c5ci001 – Alexandria, Top Priority CentOS 5 Continuous Integration Server 1 (Master node of Hudson)
Service Host Name
Service Host Names are convenient alias that is associated with functionality rather than physical servers. Names are segmented into the following format:
Application represents the specific functionality associated with a service.
oracle, mysql, sybase
The purpose designation will usually align with the purpose of the physical host name. See the reference list above for details.
Numeric ID is a two-digit incrementing ID based on the uniqueness of the rest of the hostname. e.g. register-as01.dev.example.corp,register-as02.dev.example.corp,register-as01.qa.example.corp
Subdomain refers to either an environment, or a specialized infrastructure subdomain.
dev.example.corp – Environment used for Active Development
qa.example.corp – Environment used for Quality Assurance
stage.example.corp – Environment used for loadtesting and User Acceptance Testing
prod.example.corp – Environment used for Production
sn.example.corp – Storage network used for Backups and NAS
mgt.example.corp – Management network, used for ILO.
There are some exceptions to the Service Host Name conventions. The following is a list of examples.
Some Infrastructure Services are not tied to a given subdomain.
No Purpose and ID
Load-balanced Service Hostnames point directly to the Load Balancer and do not require these fields
Common Service Host Names
Load Balances Service Host Names
Utility Service Host Names
Why are we limiting Physical Host Names to only 10 characters?
Because it’s about as long as it can be and still be phonetically memorable: “mh1 c5 as 001”
rfc 1178 suggests keeping hostnames short.
I’m not sure what to call the new host I’m building?
Speak with the owner of this document if you have any doubts. Everyone on your team should be confident and competent enough to select a new name, but at the same time a centralized person will help ensure consistency.
I’m building a ______ and it doesn’t fit into any of the purposes listed- what do I call it?
Are you sure it doesn’t fit? We don’t want you arbitrarily shoe-horning servers into bad names, but we need to balance that with preventing an over-abundance of entries. If it truly doesn’t fit, work with the team to designate a new entry on this page.
I’m building a ______ and it has multiple purposes, what do I call it?
First ask yourself why it has multiple purposes; would one of those purposes be better suited elsewhere? Provided one does not have a better home elsewhere, use the one that is most appropriate; if a machine is running mysql as a backend for a JBoss application, it’s the JBoss application that people will be most interested in, so you’d label it as rather than db. It’s really a judgment call, and by all means, ask the owner of the doc if you need guidance.
What about VMS/HP-UX/whatever that only allows 6 character names?
You can’t win them all. At some point you have to draw a line on how much legacy you need to support. In this case, it doesn’t make sense to shackle systems made in the last 15 years because incredibly ancient machines have limitations.
You’re missing ____ on your list.
If there is a heinous omission of a location, purpose, application or operating system, by all means let us know. This isn’t set in set in stone, and can be modified if needed.
I’m always looking for feedback and new ideas; if you have any suggestions, I’d love to hear them.
While system administrators often have many different goals, here are two that seem fairly universal:
Automate the redundant tasks
Hand off the simple tasks
I’ve recently found that the build utility Jenkins can be a major boon for an Operations team, and wanted to share my findings with others.
What is Jenkins?
So, what exactly is Jenkins? It’s a popular fork of the open source continuous integration tool, Hudson. While it is normally used for building and deploying software, it can easily be used for more interesting purposes. Here are some of the more useful features:
Console output for each run
Manual and/or cron-based runs
master/slave node configurations
Email, XMPP, IRC notification integration
integration with Bash, Ant, and Maven
integration with cvs, subversion and other version control systems
Dozens of useful plugins.
Due to it’s modular and flexible nature, not only can you pick and choose the plugins you want, but you can write your own as well.
What is Jenkins normally used for?
Jenkins’ intended use is as a Continuous Integration server- It downloads code from a repository, resolves dependencies, builds the code, and then deploys it. As mentioned before, it tightly integrates with both Ant and Maven, making it a boon for java developers. While Ant and Maven integration is incredibly useful, it’s the combination of subversion and bash integration that sysadmins will find useful.
What can Jenkins be used for?
Don’t think of Jenkins as a continuous integration server, think of it as…
Cron with a Gui Interface and nice bells and whistles
Jenkins has the advantage of centralization, visibility (a pretty gui goes a long way), per-run logging, IM and/or email notification, time trending, and a self-contained workspace. I usually draw the line by seeing if one or more of the following statements apply:
Use Jenkins if it involves multiple hosts
Use Jenkins if you wish to capture output on each run
Use Jenkins if you wish to trend disk usage or time of the task
Use Jenkins if you are in need of ACLs
Use Jenkins if it allows a non-technical person to perform a task without the help of a sysadmin
Use Jenkins if you want to run something manually as well as on a schedule.
What shouldn’t Jenkins be used for?
Once you recognize the power and flexibility at your disposal, you may be tempted to use Jenkins as a
Cron with a Gui Interface and nice bells and whistles
See what I did there? My point is that there’s a fine line between what belongs in Jenkins and what does not. Discovering it for yourself is half the battle. Here are some sample tasks as I would categorize them.
Log rotation on a single server
Login user access to restart an init script
Multi-user access to restart an stop an init script, clear logs and restart an init script
Manager access to restart a service
Deploying code across several production nodes
Update Documentation on a Given Project
Run a command nightly
Run a command nightly that emails it’s output
Run a command manually that IM’s you when complete and logs STDOUT
Run a command through PDSH across several servers and update a wiki page
Why not Mcollective or RunDeck?
To be honest I’d never heard of them until after I began writing this article. I’d already been using Jenkins/Hudson for a year and a half. There may be better tools for the job, but Jenkins has some distinct advantages.
You may already have it
Getting buy-in from developers to use it is much easier.
In either case many of the tasks below may be usable under work-alike services.
Sample uses of Jenkins
So lets get down to business- how can Jenkins make your life easier? Your first task is to identify things that can easily be scripted, are incredibly repetitive, and produce data that is outdated quickly. For me, that first task was to generate documentation.
No documentation is better than scores of bad documentation- It’s better to know you don’t know, than to know the wrong thing. One of the many problems we suffered from was simple IP Address allocation. Since we didn’t own the DNS server, it was hard for us to track IP address allocation- i.e. we didn’t know what IP addresses were in use, or what they were used for, or who was using them. Our best attempt to track them was manual modification of our team’s 47 internal subnets. The problems with this method were threefold- 1) it involved a human performing an error-prone routine, 2) There was no simple way to audit the entire system, and 3). There was no central authority.
Needless to say, things would drop through the cracks. On more than one occasion, we had applications go down because their IP address was double-allocated because half of the team didn’t know the IP allocation pages existed.
Example 1: Reverse DNS Entry Cleanup
So how did we solve this? The first step was to audit the current status of our reverses. I wrote a perl script that takes IP ranges as an input, scans the network for active IPs, does reverse lookups on all IP addresses, then documents oddities and writes them to the wiki.
This may seem simple, but it was the first time we saw a glimpse at how ugly our reverses were- multiple IP addresses resolving to the same host name (which shouldn’t happen in our configuration), missing reverses, IPs that didn’t respond to ping, but had reverses designated (usually outdated entries). While running this task was simple enough to do manually, it took 30 minutes to run. Who wants to run a 30 minute CLI job? While I could have put it in cron, I chose to put it in Jenkins for the following reason:
The status was visible; anyone logged into Jenkins could see the process was running, and check the STDOUT to make sure there were no unexpected errors.
When it completed successfully, the status bulb would turn green- A quick visual sweep of 30 jobs can be done in seconds if you’re just looking for a red bulb.
Logs were kept with each run, so I could compare and contrast output
An IM could be sent when the job completed, passing or failing
the latest version of my perl script was always checked out.
I could trend run times and estimate how long the next run would take
I could set it to run automatically once a night at 3am
I could run it manually via a Jenkins “push button”
anyone on my team could run it manually via a Jenkins “push button”
With this output, we were able to drastically reduce the volume of bad reverses. With the help of the Confluence.pm module, we could write the results to the wiki for the whole world to see. This however only showed us bad entries- what about the good ones?
Example 2: X.X.X.X IP Range Pages
Now that we had reasonably good reverses, we can use that information dynamically generate the pages my coworkers were previously maintaining manually.
We started yet another perl script that took a set of ranges as inputs- each range would generate a different page in the wiki under the name “X.X.X.X IP Range”. The script would then plow through each IP address in the range and
Do a reverse lookup to find out what host name resolved,
Fping it to see if it responded,
Check our LDAP inventory repository (also automatically updated) and report the physical host where the IP was bound,
Report the primary IP of that physical host,
The “description” field of the host entry,
Hardware detail of the host entry.
Once the script was written, it could take any number up IP ranges (even 47 subnets) and rip through them, updating a page for each. Jenkins made sure that the job ran nightly, again pulled the latest version of my script from subversion, and kept track of any oddities in it’s output. Suddenly we had IP allocation tables that were guaranteed to be up to date.
Example 3: Hardware Breakdown
Inventory management was a weak point for us when I started- we simply didn’t know how many servers we had. Previously this had been managed by a spreadsheet on the hardware admin’s laptop, but much like IP address allocation, it was error-prone and often out of date. We moved to an LDAP-based inventory management setup (How the inventory script works is outside of the scope of this article) that was dynamically updated, however an LDAP interface is less approachable than a spreadsheet for management.
One of the things we track is hardware makes and model numbers. One of the managers wanted to know how many P-class blade we had left- a quick LDAP search and poof, we had his answer. Then he wanted to know how many C-class blades we had, so I showed him how to do searches. While that sufficed for a while, it soon became apparent that we needed a prettier hardware breakdown for management.
One Perl script later, we were able to dynamically generate wiki pages from this data. You may be asking “why not do this with cron?” The answer is simple- dependencies. the Hardware breakdown page is downstream of the Inventory Update script- After all, if it’s pulling the entirety of it’s data from LDAP, it only makes sense to update when that inventory is updated.
Final note on Generating Documentation
It’s not only the content of the documentation that’s important- my generation scripts also include a warning that the page is auto-generated, the repository location of the script that generated it, and the time it took to generate that specific page.
I’ll be the first to admit that this is a special-case usage, and may not be useful to many, but there are occasions where you’ll want to generate configurations- Not multi-server configurations where Puppet could be useful, but single, complex, repetitive configurations like Nagios.
Generating Nagios Hosts
We have over 800 servers. Manually configuring Nagios was labor-prohibitive, so an automated approach is warranted. Using our continuously updated inventory, we pull pertinent information from LDAP and generate individual host configuration files for each host. At this stage, there are no checks associated with the hosts.
From there, we loop through our newly created host files and scan the network using a predefined set of “features”- Some hosts run MySQL on port 3306, some run JBoss on 8080, etc. Hosts that match a feature are added to a dynamically generated host group with predefined service checks.
For example, only 2 services should be running on TCP port 8080, Tomcat and JBoss. if that port is active, we check 1161, the SNMP port to determine which service it is, then add it to the proper service group. If it reports as JBoss, the host is added to the JBoss-hosts host group, which has checks for heap usage, thread usage, etc.
So if it’s all automated, why bother with Jenkins? Well, there’s a lot of complexity, and even with the best documentation, it’s a chore for even one person to keep track of how it all works. If something happens to the infrastructure while I’m out (e.g. several servers fail and are replaced), we want to implement those and update monitoring quickly. Having a push button “Regenerate Nagios hosts” makes it simple enough for anyone to do it. Having it email me when it happens, create runtime logs, and pull or write to subversion is icing on the cake. Jenkins helps us ensure that each run is handled identically, and ensures consistency when used by various administrators.
Sysadmins usually have an abundant backlog of tasks, be it system audits, upgrade, research, etc. Quite often system administrators get drawn in to user tasks because the task requires elevated privileges. Jenkins can help you hand that off to nontechnical users in a safe and simple manner.
GUI Interface for Non-Technical Users
No matter how technical your coworkers are, eventually you’ll run into a non-technical user that needs to perform some random technical task; perhaps it’s loading content into a custom system or indexing content. The process may have been designed by a developer using a command line program or script; it may require sudo. In either case, the task is technical enough to make the non-techie shy away from doing it because they “don’t want to use the dos window.”
Jenkins to the rescue! Asking that user to log into a web GUI and press a button lowers the barrier for them. In addition, they can watch the progress, see how long previous runs took, see when the job was last run, and who ran it. If the initial task was in any way complicated, that can all be hidden away (yet still clearly visible in the console logs).
As you can see, Jenkins can be re-purposed for a plethora of different uses. With a little bit of creativity and ingenuity, you can greatly improve your productivity. If you’re already doing something like this, please leave a comment below describing your own implementations.
As a system administrator, monitoring is a key job responsibility, yet arguments seem to arise on how to implement it (usually with people who won’t be paged at 3am). Before writing this, I looked around for an article on the goals and philosophy of system monitoring, but found very little that really applied to this topic. Hopefully this will help set some expectations for admins, managers and stakeholders on what you should monitor, and why it should be monitored.
Why you Monitor
Before you set up a single monitor, you have to ask yourself, “what is the goal?” After all, why are you even setting something up? Here are a few common reasons for configuring monitors:
Notification: Warning of an issue that requires intervention. What most people think of when you say “Monitoring”.
Reactionary: Automatic actions are taken when certain criteria are met. If common countermeasures are automated, you’ll have less to handle manually.
Informational: System status and historical trending allows you to show business customers that production “isn’t always down.” In reality, you may have 99% uptime, and often downtime is due to requested deployments. Statistical information can also be used for capacity planning.
Mentally dividing your monitors into groups will help you calculate which monitors require involvement. It’s not uncommon to have several thousand monitors at any given time, so it’s important not to assign critical importance to all of them. A wise man once said “When all alerts are critical, none of them are.”
When you should NOT Notify
Some monitors may have thresholds set which check for certain conditions; when those conditions are met, you may want to send some type of alert to an administrator. There are two types of notifications – Active and Passive:
Active Notification: Immediate Action is Required: “Site is Down!” A phone call, page, or IM may be used to contact someone. Direct action expected.
Passive Notification: Informational Purposes only: “JVM Memory usage is high.” Information is logged, and perhaps an email is sent. No direct action is expected, since there’s usually nothing you can do about it.
It’s easy to become addicted to passive notifications – but remember, data overload can mask important information. It becomes habit to ignore notifications if they are unimportant. The question then is not so much “when should you notify,” but “when shouldn’t you?” What it really boils down to is “Can/should I do anything about it right now?”
Non-critical (disk space creeps above 90% on /var on a dev server at 2am on a Saturday after several months of growth).
Nothing Systemic is wrong (admins can’t fix “low sales”).
3rd party system, such as a geocoding webservice, is down.
Will resolve shortly, such as a backup server pegging the CPU during midnight backups.
Some of these alerts can be avoided by setting a correct monitoring window (ignore CPU during the backup window, or set a blackout window for a deployment). Others simply can’t be addressed by administrators, although you may want to send informational emails to other members of the company (those managing 3rd party SLAs or responsible for tracking online sales)Â The next step after getting an alert is figure out what to do about it.
When a notification is sent out, there should be a definitive action that you can take. Think about why you were notified. There are a few rules to keep in mind when something goes wrong.
Don’t Panic. When 700 alarms go off, your first instinct is to panic. Before you act, take a breath. Spend a moment to get your bearings, and calm yourself. The worst possible thing you can do is flail. Randomly making changes without rhyme or reason and restarting services can do more harm than good and may make the situation worse. Take note of which alarms go off, and in the post mortem look for ways to get the same information with less noise.
Identify Obvious Patterns. What is the commonality? If a central system goes down, you may see many similar alerts. Dependencies can help immensely, masking redundant alerts. A single database failure could take down a dozen sites. Which is better: getting a single alert that the database is down, or 250 alerts that various sites are down and one database notification in the middle? While 250 alerts may impress the gravity of the situation upon you, it may instil panic and anxiety, which leads to flailing.
Get things up and running as quickly as possible. Root-cause analysis can be tedious, time consuming, and occasionally inconclusive. If you have a major system outage, don’t worry about doing root-cause analysis on the spot. Do what you need to in order to get things up and running – you can search the logs later. If the problem is recurring, you’ll get another chance to investigate later.
Communicate with Stakeholders. The business units don’t need to know the details, but they do need to know that there is an outage and that it’s being addressed. If the situation is not quickly resolved, give them status reports. Be warned – any details you reveal will be warped and held against you. I’ve learned this one many times. People have a tendency to blame what they don’t understand. “Site is down? It must be a witch!” At a previous job we had a “jump to conclusions” board which had our favorite scapegoats – load balancer, connection pool, Endeca, etc. Everyone is guilty of it – Business, devs, sysops, QA, etc. Even a one-time problem that has been resolved will be brought back up, even if it’s only tangentially related. Communicating too much information creates future scapegoats.
Contact Domain Experts. If your java site is crashing and you’re not a java developer, get a java developer involved. If your DNS server falls down and the fix isn’t obvious, contact your DNS administrator. Expert eyes on the problem may resolve the issue quicker. Group chat is crucial for sharing information and talking out theories. Someone familiar with the code will know what the error messages mean.
Fix the Problem. It should go without saying that if you find the problem, you should make every effort to resolve it. Workarounds are fine, just don’t let that band-aid become permanent. What often happens is a workaround is put in place; the alert clears and management no longer feels the pain, so they ignore the problem without putting forth the effort to fix the issue. When the next issue appears, a new fix is layered on the old. Band-aid is layered on band-aid. Eventually you’ll need to pull those band-aids off; and the more there are, the more painful it will be.
How Much is Too Much?
Most administrators prefer to be proactive rather than reactive, resolving issues before they become a problem. Proper monitoring can be a great asset, but if you’re not careful it can cause problems. For example, at a previous job we had a load balancer, apache instances and tomcat instances set up for each site. Each site had the following:
In (Sitescope) legacy monitoring system:
Health check on load balancer URL
Health check on Apache instances
Health check on Tomcat instances
Health check on Load balancer URL
Health check on tomcat instances
In Load balancer:
Health check on Apache instances
Health check on Tomcat instances
Individually, these don’t seem that bad. If an apache instance goes down *of course* the load balancer needs to know so it won’t send traffic to that instance. The same with Apache watching Tomcat. The problem was the frequency of the checks; the load balancer was checking each monitor every five seconds. When a poorly load-tested site update was released, certain pages took 7 seconds to load. Things quickly went downhill as threads and processes backed up, crashing the site.
Balancing responsiveness with common sense is essential. Having a monitor check every minute won’t change the fact that it will take an admin 20 minutes to get to a computer, boot up, log into the VPN, and identify the issue. Don’t add to the problem by DOS’ing your applications.
One mistake I’ve seen is using email as a reliable and immediate method of contact, often expecting a quick response. My favorite is when someone sends you and email, then walks down to your desk immediately after and asks “did you see my email?” You check and see it was sent literally less than two minutes ago. You can’t rely on people to continually check their email. Admins especially don’t due to the sheer volume we receive.
Email has it’s uses, but active contact in an emergency situation is not one of them. Personally, I only check my email when I think about it, which may mean large delays between when the message is sent and received. Couple that with spam filters, firewalls, solar flares and the 500 other unread messages and email becomes a less-than-reliable medium for emergency notifications (even during business hours).
Paging (or SMS) is preferable if you expect a quick response, although it is far from perfect. Just like email, SMS messages can be lost in the ether, however recipients usually have their phone alert them when a message comes in since it happens far less often than an email drops into the inbox. That said, every alert should not be sent as a page, or apathy will quickly sink in. The escalation path should look something like this (although all steps are not needed):
Front-end web interface alert: User would have to actively be browsing to see the status change. Usually the first clue something is wrong and shows the most recent status changes on a dashboard.
Email Alert: User would have to be actively checking their email. Usually sent when something is first confirmed down.
Instant Message: User would have to be at a computer and logged into IM to receive the alert. Rarely used, but an option during business hours.
Page/SMS: Reserved for emergencies. This means there is trouble.
Phonecall: Only used if Admin does not respond to the previous contact attempts. Usually performed by an irate manager or director.
If you’re lucky enough to have a 24×7 call center / help desk, they can also be leveraged to resolve issues before a system administrator is needed. If recurring patterns start to emerge, automation can be used to deal with the problem (or better yet you can fix the underlying issue). Sadly, many issues can’t be automated away or solved by a call-center staffer pressing a button. A real admin will eventually need to be contacted.
I don’t want to dig too deeply into on-call rotations, but an effort should be made to balance off-hours support with a personal life. Being on-call means no theaters, fancy dinners, or quality time with the family. Without balance, burn out will ensue.
System monitoring often brings out odd behavior in even the most steadfast of administrators. Some behaviors are relatively benign, while others can cause severe problems down the road. Identifying these behaviors before they cause a problem is just as important as having good monitors.
Data Addiction: Knowledge is power, but do not mistake information with knowledge. It’s possible to have 700 alerts, and not one of them identify the underlying issue. One of my least favorite phrases is “Can we put a monitor on that?” It’s often uttered right after a one-off failure; the type of thing that fails once, and once fixed will never cause a problem again. An example of this is a new server, where apache was not configured to restart after a reboot. When the server is restarted, you quickly find apache is down, start it, configure it to auto-start, and move on. There is already a monitor on the websites hosted by that apache instance as well as a monitor on how many apache threads are currently running; What purpose would another monitor serve? How often would it run? This is a prime example of how a data addict can spin out of control – too many useless monitors will mask a more important issue.
Over Automation: Automation is a wonderful thing, however, it’s possible to have too much of a good thing. In one instance, there was a coldfusion server which would crash often. Rather than trace out the root cause, restarts were automated, then forgotten about. A few years later, it was found that the coldfusion servers were restarting every twenty minutes, and no one knew about it – no one except the users. If it takes 20 seconds to restart, and that’s 26280 twenty-second interrupts over the course of a year – that can translate into a bad user experience and loss of sales. Make sure that automation is audited and verifiable, and doesn’t cause more trouble than it prevents.
Over Communication: While it is important to communicate with stakeholders, it is possible to over communicate. Stakeholders don’t need to know that there are 130 defunct apache processes caused by a combination of a bug in mod_jk and the threading configuration in JBoss – all they need is “Site availability is intermittent – we’ve located the root cause and are working on a solution. More information to follow.” Details aren’t needed. Likewise, not every single person should be notified when an alert goes off – does your backup administrator need to know when a web server goes down? No. Does the DBA need to know when an SSL cert is about to expire? No. Tailor the messages to the correct audience. Most monitoring systems allow you to configure contact groups – use them.
Complexification: There are dozens of relationships between services, hosts, hostgroups, contacts, servicegroups, notification windows, dependencies, parents, etc. Try as you might, it’s usually impossible to perfectly model every relationship. Don’t become distracted by perfecting the configuration – focus on maintainability, scalability and accuracy. If you can’t add new systems and monitors, your configuration is too complex.
Reporting vs Monitoring: Reports are the more successful cousin of Alerts. They may superficially appear similar, but serve entirely different purposes. Monitors should only be used to track and trend data and to notify if there is a problem, whereas reports take the collected data and massage it into an aggregated format. Monitors shouldn’t send out scheduled alerts. They can collect data, but they shouldn’t be used to present it to users. You’d be surprised how often someone asks for a monitor to send a nightly report. That slippery slope will turn your monitoring system into crystal reports.
False Positives: False positives are the scourge of the monitoring world. There are many causes, but the reaction is always the same – start to investigate, realize that it’s a false positive, and lose interest, knowing that nothing is broken. The problem is that a false positive leads to lazy behavior – if you’re pretty sure it’s a false positive, you don’t bother looking into it, figuring it will clear on it’s own. This trains people to have a “wait and see” mentality when alerts go off, causing unneeded delays when a major issue appears.
Apathy: It’s 2am on a Saturday, and you get paged that the CPU on a utility server is pegged. Without looking, you know that it’s the backup process copying the home directories, so you ignore it. The following Monday at 10am the QA JBoss instance stops responding. You know that it will clear within minutes because the QA team always rebuilds the QA instance Monday morning. When you get monitors constantly failing and recovering on their own, you start to ignore the pages that come in because you know they’re unimportant. It’s only a matter of time before you miss something important. If you have a situation that promotes apathy towards alerts, resolve it before something important is missed.
Don’t be [A]pathetic
I mentioned apathy above, but there’s a bit more to it – it’s not just admins that become apathetic. If an issue is identified, action must be taken to correct it. The coldfusion example mentioned above is a great example of company apathy – failure of the business unit to prioritize it and failure of IT to push back hard enough. A former manager once had someone laugh because his team had ignored my manager’s bug report for a full year. That’s not funny; it’s pathetic.
When management fails to address an issue; be it a known system problem or something as simple as morale from a lost team member, it shows the team that they don’t care. It soon becomes a vicious cycle of uncaring when managers no longer care that the site is down, which in turn causes developer apathy. Developers then don’t care about code quality, leading to buggy code. Sysops stop caring that alerts are going off, leading to downtime. By the time the cycle is broken, it’s far too late – you’ve established a bad reputation with your customers.
Often times this will start with unreasonable development expectations, causing devs to cut corners, QA to be rushed, and monitors to be forgotten. There is a balance that must be maintained between getting code out the door and making sure that the code can stand up to the abuse it will receive when it goes live. It’s a team effort, and everyone must care (and keep caring) to keep the systems running.
Wow. Well, that’s a lot more than I intended on writing. I should state that I am guilty of 75% or more of the bad behaviors listed here. I hope that this will help start discussion on how to better improve monitoring systems.
If you have feedback, suggestions or enhancements, please leave them in the comments.
(Thanks to jdrost, jslauter, keith4, pakrat, romaink, and my wife Jackie for their peer review/editing.)
Over the past few years I’ve been doing a lot of work with JBoss and tomcat. One issue we’ve always had is bringing some level of sanity to ports that are in use. My current situation is somewhat abnormal; we have 12 applications, each with two instances, each with 7 ports- that’s a total of 168 ports that we need to keep track of. Now, multiply that by tomcat, apache and nagios’ configurations. Now multiply that by Dev, QA, ITL, PRJ, Preprod and Prod environments on segregated hardware. Even though a LOT of these ports can overlap, that’s a LOT to keep track of in a flat file or spreadsheet. We also require a certain level of flexibility:
Port Expansion: I mentioned there were 7 ports we use; originally there were only 6, but I enabled SNMP to better trend JVM usage (24 JVMs use a lot of RAM). As we add more features, more ports could be needed.
Application Expansion: We originally started out with 9 instances, and after 5 months added another three. More will be possible in the future.
The required flexibility makes numbering conventions difficult. The solution we came up with is to key the information and use it to build the port number. In order to do that, we had to make a few assumptions:
Environments are sequestered to their own hardware, however they’re on the same networks- i.e. dev and qa instances will never be on the same physical server, but will share multicast ranges
There will not be more than 5 instances in a cluster- any more than that would put theoretical ports out the 65536 range; besides, replication shouldn’t be past 4 nodes, right?
Ports below 1024 are off limits
We are constrained by IP addresses; one IP per server.
From here we could allow the following constraints:
Environments can use the same ports, EXCEPT for multicast- we don’t want QA and prod replicating data!
All multicast should use 1 for their Instance ID
With that in mind, port numbers will be laid out with the following structure: XYYZZ
X = Instance ID (e.g. 1-4)
YY = application ID (Helium,Hydrogen,Lithium, etc )
ZZ = Port Type (jmx, ajp, snmp, etc)
Sodium Multicast for Prod 3 is 11111 (1+11+11) (remember, all multicast use 1 for their instance)
Other than the Multicast port kludge, this way seems to work pretty well. Numbers are easy to parse visually and can be maintained with a centralized key on the wiki. Anyone have any suggestions or examples of how they’ve solved this problem?