Work

Hostname Conventions

This concept is something I’ve carried around with me for my last 3 jobs, and since I’m writing it up for my current employer, I figured I should document it here as well. I’ve mainly worked in Linux/Windows environments, so you may sense a bit of bias away from older systems. It’s not intentional, just a result of my experience. Thanks to Mick for introducing me to this schema.

Purpose

The purpose of this documentation is to provide a clean-cut and straightforward convention for naming servers. The goal is to look towards the future and not be shackled by our past. This document should be read through once for a basic understanding, and used as a reference when new hosts and services are set up. It should provide a general convention, and may be modified in the future for clarity. All changes should be made by the owner of this document to ensure consistency.

Basic Domain Structure

The domain example.corp will be our designated placeholder for this discussion.

Naming Conventions

Host names are broken into two classifications: Physical Host Name and Service Host Name.

  • Physical host names can be compared to the DNA of a server and are by necessity somewhat cryptic.
    • Used when managing the physical server inventory
    • Used exclusively by the operations team
    • Used for resource monitors such as IO, CPU and Memory.
  • Service host names are more dynamic, user friendly, and may jump between physical servers as applications are migrated.
    • Used exclusively when discussing specific functionality of the server.
    • Used for managing cluster or application functionality e.g. “pushing out a new attributes.xml file to all shopcart jboss app servers”
    • Used for functional service monitoring.

Physical Host Name

Physical host names represent a unique way to refer to each and every physical server; however, unlike the machine’s serial number, they have the flexibility to change. Each character of the name must have meaning or provide clarity. If the purpose of a host changes, the host name may be changed to suit the new purpose, although this will happen infrequently (and make sure to define a procedure for changing a Physical Host Name). Names are broken into 6 key pieces of information:

[Location] [Service Level] [OS] [OS Major Version] [Purpose] [Identification Number]

Location

Location refers to the datacenter in which the hardware is physically present. Each entry will consist of a unique 2 character code representing the city where the datacenter is located. The following is a definitive list of locations currently allowed. If a new entry is needed, please contact the owner of this document.

Designation City
ax Alexandria, VA
gr Grand Rapids, MI
mh Madison Heights, MI

Service Level

Service level is loosely defined by the importance of the applications running on the host and how urgently we should respond to problems. We currently have 3 designations:

Designation Urgency Purpose
1 Top Priority This machine should be treated as a customer facing production host.
2 Medium Priority Downtime on this server has a significant impact on productivity.
3 Low Priority Downtime on this server has minimal impact.

Note that no server should be completely ignored if it is in distress, this just provides a general guideline as to the urgency of the problem.

Operating System / OS Major Version

Automating system updates and configuration roll-outs is a key responsibility of the operations team. Embedding OS identification into the host name not only allows convention-based automation, but provides context for alerts during a production event. OS identification is broken into two categories: Operating System and the Major Version number of that OS.

Designation Operating System
C CentOS
O Oracle Solaris
P Proprietary One-Off/ appliance
R RedHat
S Suse
V VMware ESXi
W Windows
Z Solaris Zone

Designation Major OS Version
0 10
1 1 or 11
2 2 or 12
etc.

Purpose

Purpose refers to the overall usage of the server: whether it’s a database server, application server, etc. Definitions are purposefully loose to allow similar services to share the same host. This list will grow as we better define our environment. Purpose designations should refer to generic uses rather than specific implementations (db rather than Oracle or Sybase).

Designation Purpose Example
as Application Server Servers that run Web Applications; JBoss Application Server, Tomcat, IIS
ws Web Server Servers that host static or semi-static content; Apache, Nginx
db Database Servers that host Databases; Oracle, Sybase, MySQL
ci Continuous Integration Servers that host continuous integration; Hudson, TeamCity, CruiseControl
ut Utility server Servers used by Operations; bind, ldap, pdsh, ssl cert creation
ts Terminal Server Terminal Server for Serial Access (not to be confused with MS Terminal Services)
vh Virtual Host Server Virtual Machine Host Server

Numeric ID

The last segment of the physical host name is a simple three digit numeric ID. The Numeric ID can be used in several contexts:

  • The numeric ID can increment when the rest of the host name is the same (mh1c6ci001, mh1c6ci002, etc).
  • The numeric ID can be used to designate clustered servers (gr2c6as013, gr2c6as023, gr2c6as033, gr2c6as043).

Samples

The following is a list of sample host names:

  • mh2c6as011 – Madison Heights, Second Priority CentOS 6 Application Server 11 (App server hosting foo.com qa running on tomcat)
  • gr1r5db002 – Grand Rapids, Top Priority Red Hat 5 Database Server 2 (Production Oracle RAC server)
  • ax1c5ci001 – Alexandria, Top Priority CentOS 5 Continuous Integration Server 1 (Master node of Hudson)
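
If you want to sanity-check a proposed name mechanically, a quick bash sketch against the tables above might look like this (it assumes lowercase names and only the location, OS, and purpose codes currently listed):

#!/bin/bash
# Validate a proposed physical host name against the convention:
# location(2) + service level(1) + OS(1) + OS version(1) + purpose(2) + ID(3) = 10 characters
host="${1:?usage: $0 <hostname>}"
re='^(ax|gr|mh)[1-3][coprsvwz][0-9](as|ws|db|ci|ut|ts|vh)[0-9]{3}$'

if [[ "$host" =~ $re ]]; then
    echo "$host matches the physical host name convention"
else
    echo "$host does not match the convention" >&2
    exit 1
fi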

Service Host Name

A Service Host Name is a convenient alias associated with functionality rather than a physical server. Names are segmented into the following format:

(application)-(purpose)(id).(subdomain).example.corp

Application

Application represents the specific functionality associated with a service.

Type Example
In-House Application register, webservices, shopcart
Application Server oracle, mysql, sybase

Purpose

The purpose designation will usually align with the purpose of the physical host name. See the reference list above for details.

Numeric ID

Numeric ID is a two-digit incrementing ID based on the uniqueness of the rest of the hostname, e.g. register-as01.dev.example.corp, register-as02.dev.example.corp, register-as01.qa.example.corp

Subdomain

Subdomain refers to either an environment, or a specialized infrastructure subdomain.

  • dev.example.corp – Environment used for Active Development
  • qa.example.corp – Environment used for Quality Assurance
  • stage.example.corp – Environment used for loadtesting and User Acceptance Testing
  • prod.example.corp – Environment used for Production
  • sn.example.corp – Storage network used for Backups and NAS
  • mgt.example.corp – Management network, used for ILO.

Caveats

There are some exceptions to the Service Host Name conventions. The following is a list of examples.

Exception Example Reason
No Subdomain nagios.example.corp Some Infrastructure Services are not tied to a given subdomain.
No Purpose and ID register.stage.example.corp Load-balanced Service Hostnames point directly to the Load Balancer and do not require these fields

Samples

Common Service Host Names

  • register-as01.dev.example.corp
  • shopcart-as06.prod.example.corp
  • mysql-db01.qa.example.corp
  • sybase-db02.stage.example.corp

Load Balanced Service Host Names

  • webservices.stage.example.corp
  • register.dev.example.corp
  • shopcart.qa.example.corp

Utility Service Host Names

  • nagios.example.corp
  • svn.example.corp
  • hudson.example.corp
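
For completeness, here’s a rough bash check that covers the normal form plus the two caveats above (a sketch, not an exhaustive rule set):

#!/bin/bash
# Validate a service host name: application-purposeNN.subdomain.example.corp,
# with allowances for the load-balanced and subdomain-less exceptions.
name="${1:?usage: $0 <service-hostname>}"

full='^[a-z0-9]+-(as|ws|db|ci|ut|ts|vh)[0-9]{2}\.(dev|qa|stage|prod|sn|mgt)\.example\.corp$'
lb='^[a-z0-9]+\.(dev|qa|stage|prod)\.example\.corp$'   # load-balanced: no purpose/ID
util='^[a-z0-9]+\.example\.corp$'                      # infrastructure: no subdomain

if [[ "$name" =~ $full || "$name" =~ $lb || "$name" =~ $util ]]; then
    echo "$name follows the service host name convention"
else
    echo "$name does not follow the convention" >&2
    exit 1
fi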

FAQ

  1. Why are we limiting Physical Host Names to only 10 characters?
    • Because it’s about as long as it can be and still be phonetically memorable: “mh1 c5 as 001”
    • RFC 1178 suggests keeping hostnames short.
  2. I’m not sure what to call the new host I’m building?
    • Speak with the owner of this document if you have any doubts. Everyone on your team should be confident and competent enough to select a new name, but at the same time a centralized person will help ensure consistency.
  3. I’m building a ______ and it doesn’t fit into any of the purposes listed- what do I call it?
    • Are you sure it doesn’t fit? We don’t want you arbitrarily shoe-horning servers into bad names, but we need to balance that with preventing an over-abundance of entries. If it truly doesn’t fit, work with the team to designate a new entry on this page.
  4. I’m building a ______ and it has multiple purposes, what do I call it?
    • First ask yourself why it has multiple purposes; would one of those purposes be better suited elsewhere? Provided one does not have a better home elsewhere, use the one that is most appropriate; if a machine is running mysql as a backend for a JBoss application, it’s the JBoss application that people will be most interested in, so you’d label it “as” rather than “db”. It’s really a judgment call, and by all means, ask the owner of the doc if you need guidance.
  5. What about VMS/HP-UX/whatever that only allows 6 character names?
    • You can’t win them all. At some point you have to draw a line on how much legacy you need to support. In this case, it doesn’t make sense to shackle systems made in the last 15 years because incredibly ancient machines have limitations.
  6. You’re missing ____ on your list.
    • If there is a heinous omission of a location, purpose, application or operating system, by all means let us know. This isn’t set in stone, and can be modified if needed.

I’m always looking for feedback and new ideas; if you have any suggestions, I’d love to hear them.

Application Server Troubleshooting tip

We recently ran across a problem in production that we could not replicate in lower environments. Since this is not only a high-use application, but an exceptionally “chatty” app, searching the logs was an exercise in futility (*one* of yesterday’s production logs was 6,975,291 lines long, with multiple logfiles per app, multiple apps and multiple servers).

So how do you find a needle in the haystack? Get a smaller haystack. In the quickest window possible, perform the following three steps:

  1. tail -f log1 log2 log3 log4 > combined.log
  2. reproduce error
  3. ctrl-c tail process as quickly as possible

Doing so reduced our 11,480,799 lines (with 780,527 errors) to 1200 lines.
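
For reference, roughly what that capture window looks like as a shell session (the log paths are placeholders for however many logs your app produces):

# start capturing everything the app writes from this point forward
tail -f /path/to/app1.log /path/to/app2.log /path/to/app3.log /path/to/app4.log > combined.log &
TAIL_PID=$!

# ...reproduce the error in another window/browser...

kill "$TAIL_PID"       # stop capturing as soon as the error has fired
wc -l combined.log     # the much smaller haystack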

 

We’re Hiring…

My employer is currently looking for a sysadmin. If you’re interested, contact me for details.

 

SR SYSTEMS ENGINEER ROLE IN FARMINGTON HILLS, MI

Summary:

We are looking for someone who will administer web hosting Linux systems infrastructure, including server hardware, operating system, enabling software, and application software/data for Internet-facing application systems. Direct other departments’ work on dependent systems such as network, firewall, load balancer, and external storage systems. Provide consultative expertise for our businesses to provide technical guidance, standards, knowledge and understanding of business and technology processes, and integration of technologies to deliver Internet-facing learning products and services.

The position will be responsible for systems configuration, implementation, administration, maintenance, and support, along with  application integration, and troubleshooting for our eLearning systems.

The role encompasses daily operational systems support in development, QA, and production tiers. It also encompasses project work with business units, developers, test labs, end users and other groups involved in the planning, development, integration, testing, and problem solving for applications, content, and data.

Essential Duties/Responsibilities:

  • Ensure maximum uptime of hosted environments, including production, staging, testing, authoring, and development environments. This includes, but is not limited to ensuring the HW is configured properly; is secure; is networked properly; is backed up per company standards; is monitored accordingly; is tested to ensure operability; and is built to company standards.
  • Act as a consultative resource for our businesses to provide technical guidance, standards, knowledge and understanding of  business and technology processes, and integration of technologies for content management and delivery.
  • Assist with integration efforts, including planning and coding where necessary in Apache, Tomcat Java, and MySQL database technologies, and scripting languages.
  • Assume lead role in complex problem solving in hosted environments, offering meaningful solutions and implementation strategies. Engage other departments and direct their work on supporting systems such as network, firewall, load balancers, and external storage. Engage application teams with analysis from logs and data on the servers, and provide recommendations for problem resolution.
  • Be part of an on-call rotation schedule that includes carrying a pager/email device 24/7. Respond to all alerts immediately and inform management of issues and work being performed to remedy the problem. Direct escalation to engage additional resources if required to troubleshoot and resolve a problem.
  • Monitor, analyze, and report performance statistics for web hosting environments. Troubleshoot hosting environment failures and manage / assist in the development of solutions to these problems. This includes not only overall environment / platform problems, but also includes problems affecting individual client accounts (i.e. data integrity, reporting, security, etc.).
  • Analyze web hosting environment averages and peak workloads / throughput compared to existing capacities and plan required accommodations to address environmental growth. Take necessary corrective actions (both scheduled and unscheduled) to proactively address potential problems before they become operational / environmental problems. Notify Manager of projected needs and actions taken.
  • Ensure security of systems, including standard server build and lock-down procedures, and monitoring security access to systems.
  • Review system logs regularly, report and research warnings and errors. Review system logs for backup completions and report any discrepancies.
  • Execute implementation/migration of new software and application versions across the development and staging and production environments and prepare back out plans on all platforms to be updated. Ensure adherence to established Change Management and QA procedures. Verify results with appropriate parties.
  • Work with peers and other departments to analyze ongoing processes and procedures. Where relevant, propose / design improvements to operational processes.
  • Keep up to date with developments in the e-Learning / web-based information technology field through educational and other information resources and make management aware of possible applications for new technologies.
  • For new web hosting infrastructure projects, act as technical lead for planning and implementation.   Mentor and train junior team members in all areas of IT expertise.

Skills/Knowledge/Experience:

Basic (Required)

  • Bachelor’s degree in Information Systems, Computer Science, Business or Engineering or equivalent job related experience.
  • Must have an excellent command of:
    1. Red Hat Linux Operating System
    2. Apache Web Servers
    3. Tomcat application environment running Java
    4. MySQL Database Server
    5. MarkLogic Content Management Systems
  • Must possess experience designing, building, maintaining, migrating, tuning, administering, and supporting three-tiered web/application/database server environments
  • Experience with Internet access and security for servers residing within a DMZ
  • Must have excellent written and oral communications, including technical documents, and process documents.
  • Must possess excellent problem-solving and analytical skills and be able to translate business requirements into information systems solutions.
  • Able to translate business requirements into technical recommendations for information systems solutions.
  • Must possess excellent problem-solving and analytical skills; ability to assist with network, system, and application troubleshooting required.

Preferred

  • This position demands a well-organized, action-oriented team player with the ability to prioritize daily work, change directions quickly, coordinate geographically dispersed team members and work on multiple projects simultaneously.
  • Comprehensive knowledge of problem analysis, structured analysis and design, and programming techniques.
  • Coding and scripting skills for a RedHat/Apache/Tomcat/MySQL environment, clustering and other high-availability architectures, TCP/IP, along with various server management and administrative tools.
  • Ability to work with minimal supervision, engaging peers and other departments to accomplish assigned goals and effectively manage projects in a cross-functional environment.

Administer web hosting infrastructure, including server hardware, operating system, enabling software, and application software/data for content management systems. Direct other departments’ work on dependent systems such as network, firewall, load balancer, and external storage systems. Provide consultative expertise for our businesses to provide technical guidance, standards, knowledge and understanding of business and technology processes, and integration of technologies to deliver Internet-facing learning products and services.

The position will be responsible for systems configuration, implementation, management and support, along with  application integration, and troubleshooting for our MarkLogic-based Content Management Systems. The role includes installation, configuration, administration and maintenance of the content management environment and integrating new systems and products into the platform.

The role encompasses daily operational support of the content management systems and application environment in development, QA, and production tiers. It also encompasses project work with business units, developers, test labs, end users and other groups involved in the planning, development, and testing of products, content, and workflows in the content management systems.

 

Raw WinXP Virtualbox Partitions on a Thinkpad

New job, new laptop. Many utilities here are Windows-only, so it takes a bit of… effort… to get myself up and running efficiently. The solution to the Windows problem is VirtualBox. I had set this up on my last laptop with little trouble, but this time around took a bit more work. Hopefully the instructions below will help others get up and running quickly.

Disclaimer– your laptop may catch on fire and explode (or worse) if you attempt this… or something.

We’ll be presuming that you’ve already resized your windows partition and have both a working Windows and Linux partition.

In Windows

Log into XP, grab MergeIDE.zip from Virtualbox’s site, extract and run it. It should be a quick flash and be done. (Note: I am not 100% sure this step is needed)

Create a new hardware profile and name it virtualbox. Make sure to set it as a choice during boot. Try rebooting into native windows once to ensure that it does offer you profile options.

In Linux

You’ll need the following packages installed (names may differ for non-Ubuntu systems):
mbr, virtualbox-ose, virtualbox-ose-qt
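
On Ubuntu that’s a one-liner:

sudo apt-get install mbr virtualbox-ose virtualbox-ose-qt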

Create a stand-alone mbr file to use for booting (yes, you need the force flag):

install-mbr ~/.VirtualBox/WindowsXP.mbr --force

We’re presuming that your windows partition is /dev/sda1. In the below command, we are defining

  • a vmdk file (WindowsXP.vmdk)
  • which raw disk to read (/dev/sda)
  • which partition (1)
  • the new MBR file we just created

VBoxManage internalcommands createrawvmdk -filename ~/.VirtualBox/WindowsXP.vmdk -rawdisk /dev/sda -partitions 1 -mbr ~/.VirtualBox/WindowsXP.mbr -relative -register

Note that you’ll need read/write access to that drive as your user, so you may want to figure out a cleaner/more secure way to implement this than adding your user to the disk group (which is very dumb and insecure). I would, but it’s working and I have more important things to do at the moment.

Another issue- apparently Thinkpads report the drive heads and cylinders oddly (T410 for me, T60p in the article), so we have to add some geometry settings to the vmdk before VirtualBox writes them incorrectly. Open ~/.VirtualBox/WindowsXP.vmdk and add the following at the bottom:

ddb.geometry.biosCylinders="1024"
ddb.geometry.biosHeads="240"
ddb.geometry.biosSectors="63"

The biosHeads appears to be the magic value- it seems to work if it’s set to 240, but the default is 255 (which fails).
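
If you’d rather script it, appending the lines with a heredoc works just as well (this assumes the vmdk lives in the default ~/.VirtualBox location used above):

# append the geometry overrides to the raw-disk vmdk
cat >> ~/.VirtualBox/WindowsXP.vmdk <<'EOF'
ddb.geometry.biosCylinders="1024"
ddb.geometry.biosHeads="240"
ddb.geometry.biosSectors="63"
EOF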

Once you add those, start up virtualbox and check the virtual media manager; your new vmdk should be listed there. Once it’s confirmed, create a new virtual machine. Rather than creating a disk, select your vmdk as an existing disk.

After you finish, go to the VM settings->system and make sure the motherboard tab has io-apic enabled (I also had PAE/NX enabled under processor and VT-x enabled under Acceleration).

Start the VM

There are several errors that could pop up. I’m sure there are plenty more that I stumbled across, but these were the two big ones:

  • a disk read error occurred, press ctrl+alt+del to restart – Caused by incorrect biosHeads- check and make sure it’s set to 240 (this was the fix for me, results may vary).
  • Complaint about kvm/vmx – Virtualbox does not like kvm. Uninstall qemu-kvm.

If things go well, it should flicker mbr in the corner, then go to the hardware profile selection screen. Select the virtualbox profile, and continue, then log in.

What follows is a half-hour of installing generic drivers and dealing with hardware specific auto start apps complaining that they won’t work on this installation. Windows will warn that the new drivers are not blessed, so be forewarned.

Once completed, at the top of the VM window select Devices-> Install Guest Additions. This will download and mount an ISO, and windows will pop open a folder with the Guest Additions executables. Select the one best for you and run the installer. It will prompt you for video and mouse drivers (and trust me, you want them).

The final step is to shut down the windows VM, then reboot into the native windows partition to make sure it still works.  I did receive a few blue-screens before logging in at the beginning, but they appeared random and haven’t happened since.

And that’s all there is to it- simple, eh? Your windows partition should now run in native mode and vm mode.

The Philosophy of Monitoring

As a system administrator, monitoring is a key job responsibility, yet arguments seem to arise on how to implement it (usually with people who won’t be paged at 3am). Before writing this, I looked around for an article on the goals and philosophy of system monitoring, but found very little that really applied to this topic. Hopefully this will help set some expectations for admins, managers and stakeholders on what you should monitor, and why it should be monitored.

Why you Monitor

Before you set up a single monitor, you have to ask yourself, “what is the goal?” After all, why are you even setting something up? Here are a few common reasons for configuring monitors:

  1. Notification: Warning of an issue that requires intervention. What most people think of when you say “Monitoring”.
  2. Reactionary: Automatic actions are taken when certain criteria are met. If common countermeasures are automated, you’ll have less to handle manually.
  3. Informational: System status and historical trending allows you to show business customers that production “isn’t always down.” In reality,  you may have 99% uptime, and often downtime is due to requested deployments. Statistical information can also be used for capacity planning.

Mentally dividing your monitors into groups will help you calculate which monitors require involvement. It’s not uncommon to have several thousand monitors at any given time, so it’s important not to assign critical importance to all of them. A wise man once said “When all alerts are critical, none of them are.”

When you should NOT Notify

Some monitors may have thresholds set which check for certain conditions; when those conditions are met, you may want to send some type of alert to an administrator. There are two types of notifications – Active and Passive:

  • Active Notification: Immediate Action is Required: “Site is Down!” A phone call, page, or IM may be used to contact someone. Direct action expected.
  • Passive Notification: Informational Purposes only:  “JVM Memory usage is high.” Information is logged, and perhaps an email is sent. No direct action is expected, since there’s usually nothing you can do about it.

It’s easy to become addicted to passive notifications – but remember, data overload can mask important information. It becomes habit to ignore notifications if they are unimportant. The question then is not so much “when should you notify,” but “when shouldn’t you?” What it really boils down to is “Can/should I do anything about it right now?”

  • Non-critical (disk space creeps above  90%  on /var on a dev server at 2am on a Saturday after several months of growth).
  • Nothing Systemic is wrong (admins can’t fix “low sales”).
  • 3rd party system, such as a geocoding webservice, is down.
  • Will resolve shortly, such as a backup server pegging the CPU during midnight backups.

Some of these alerts can be avoided by setting a correct monitoring window (ignore CPU during the backup window, or set a blackout window for a deployment). Others simply can’t be addressed by administrators, although you may want to send informational emails to other members of the company (those managing 3rd party SLAs or responsible for tracking online sales). The next step after getting an alert is figuring out what to do about it.
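
Most monitoring systems let you script those windows. In Nagios, for example, a deployment blackout can be pushed through the external command file; a minimal sketch (the command-file path, host name, author, and comment below are placeholders, check command_file in your nagios.cfg):

#!/bin/bash
# Schedule two hours of fixed downtime for one host during a deployment.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd   # assumed path; see nagios.cfg
HOST=mh2c6as011                               # placeholder host name
NOW=$(date +%s)
END=$((NOW + 2*60*60))

printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;deploybot;shopcart release\n' \
    "$NOW" "$HOST" "$NOW" "$END" > "$CMDFILE"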

Reacting Properly

When a notification is sent out, there should be a definitive action that you can take. Think about why you were notified. There are a few rules to keep in mind when something goes wrong.

  1. Don’t Panic. When 700 alarms go off, your first instinct is to panic. Before you act, take a breath. Spend a moment to get your bearings, and calm yourself. The worst possible thing you can do is flail. Randomly making changes without rhyme or reason and restarting services can do more harm than good and may make the situation worse. Take note of which alarms go off, and in the post mortem look for ways to get the same information with less noise.
  2. Identify Obvious Patterns. What is the commonality? If a central system goes down, you may see many similar alerts. Dependencies can help immensely, masking redundant alerts. A single database failure could take down a dozen sites. Which is better: getting a single alert that the database is down, or 250 alerts that various sites are down and one database notification in the middle? While 250 alerts may impress the gravity of the situation upon you, it may instil panic and anxiety, which leads to flailing.
  3. Get things up and running as quickly as possible. Root-cause analysis can be tedious, time consuming, and occasionally inconclusive. If you have a major system outage, don’t worry about doing root-cause analysis on the spot.  Do what you need to in order to get things up and running – you can search the logs later. If the problem is recurring, you’ll get another chance to investigate later.
  4. Communicate with Stakeholders. The business units don’t need to know the details, but they do need to know that there is an outage and that it’s being addressed. If the situation is not quickly resolved, give them status reports. Be warned – any details you reveal will be warped and held against you. I’ve learned this one many times. People have a tendency to blame what they don’t understand. “Site is down? It must be a witch!” At a previous job we had a “jump to conclusions” board which had our favorite scapegoats – load balancer, connection pool, Endeca, etc. Everyone is guilty of it – Business, devs, sysops, QA, etc. Even a one-time problem that has been resolved will be brought back up, even if it’s only tangentially related. Communicating too much information creates future scapegoats.
  5. Contact Domain Experts. If your java site is crashing and you’re not a java developer, get a java developer involved. If your DNS server falls down and the fix isn’t obvious, contact your DNS administrator. Expert eyes on the problem may resolve the issue quicker. Group chat is crucial for sharing information and talking out theories. Someone familiar with the code will know what the error messages mean.
  6. Fix the Problem. It should go without saying that if you find the problem, you should make every effort to resolve it. Workarounds are fine, just don’t let that band-aid become permanent. What often happens is a workaround is put in place; the alert clears and management no longer feels the pain, so they ignore the problem without putting forth the effort to fix the issue. When the next issue appears, a new fix is layered on the old. Band-aid is layered on band-aid. Eventually you’ll need to pull those band-aids off; and the more there are, the more painful it will be.

How Much is Too Much?

Most administrators prefer to be proactive rather than reactive, resolving issues before they become a problem. Proper monitoring can be a great asset, but if you’re not careful it can cause problems. For example, at a previous job we had a load balancer, apache instances and tomcat instances set up for each site. Each site had the following:

In (Sitescope) legacy monitoring system:

  • Health check on load balancer URL

In Nagios:

  • Health check on Apache instances
  • Health check on Tomcat instances
  • Health check on Load balancer URL

In Apache:

  • Health check on tomcat instances

In Load balancer:

  • Health check on Apache instances
  • Health check on Tomcat instances

Individually, these don’t seem that bad. If an apache instance goes down *of course* the load balancer needs to know so it won’t send traffic to that instance. The same with Apache watching Tomcat. The problem was the frequency of the checks; the load balancer was checking each monitor every five seconds. When a poorly load-tested site update was released, certain pages took 7 seconds to load. Things quickly went downhill as threads and processes backed up, crashing the site.

Balancing responsiveness with common sense is essential. Having a monitor check every minute won’t change the fact that it will take an admin 20 minutes to get to a computer, boot up, log into the VPN, and identify the issue. Don’t add to the problem by DOS’ing your applications.

Making Contact

One mistake I’ve seen is using email as a reliable and immediate method of contact, often expecting a quick response. My favorite is when someone sends you an email, then walks down to your desk immediately after and asks “did you see my email?” You check and see it was sent literally less than two minutes ago. You can’t rely on people to continually check their email. Admins especially don’t, due to the sheer volume we receive.

Email has its uses, but active contact in an emergency situation is not one of them. Personally, I only check my email when I think about it, which may mean large delays between when the message is sent and when it’s read. Couple that with spam filters, firewalls, solar flares and the 500 other unread messages, and email becomes a less-than-reliable medium for emergency notifications (even during business hours).

Paging (or SMS) is preferable if you expect a quick response, although it is far from perfect. Just like email, SMS messages can be lost in the ether; however, recipients usually have their phone alert them when a message comes in, since it happens far less often than an email dropping into the inbox. That said, not every alert should be sent as a page, or apathy will quickly set in. The escalation path should look something like this (although not all steps are needed):

  • Front-end web interface alert: User would have to actively be browsing to see the status change. Usually the first clue something is wrong and shows the most recent status changes on a dashboard.
  • Email Alert: User would have to be actively checking their email. Usually sent when something is first confirmed down.
  • Instant Message: User would have to be at a computer and logged into IM to receive the alert. Rarely used, but an option during business hours.
  • Page/SMS: Reserved for emergencies. This means there is trouble.
  • Phonecall: Only used if Admin does not respond to the previous contact attempts. Usually performed by an irate manager or director.

If you’re lucky enough to have a 24×7 call center / help desk, they can also be leveraged to resolve issues before a system administrator is needed. If recurring patterns start to emerge,  automation can be used to deal with the problem (or better yet you can fix the underlying issue). Sadly, many issues can’t be automated away or solved by a call-center staffer pressing a button. A real admin will eventually need to be contacted.

I don’t want to dig too deeply into on-call rotations, but an effort should be made to balance off-hours support with a personal life.  Being on-call means no theaters, fancy dinners, or quality time with the family. Without balance, burn out will ensue.

Afflictions

System monitoring often brings out odd behavior in even the most steadfast of administrators. Some behaviors are relatively benign, while others can cause severe problems down the road. Identifying these behaviors before they cause a problem is just as important as having good monitors.

  • Data Addiction: Knowledge is power, but do not mistake information for knowledge. It’s possible to have 700 alerts, and not one of them identifies the underlying issue. One of my least favorite phrases is “Can we put a monitor on that?” It’s often uttered right after a one-off failure; the type of thing that fails once, and once fixed will never cause a problem again. An example of this is a new server, where apache was not configured to restart after a reboot. When the server is restarted, you quickly find apache is down, start it, configure it to auto-start, and move on. There is already a monitor on the websites hosted by that apache instance as well as a monitor on how many apache threads are currently running; what purpose would another monitor serve? How often would it run? This is a prime example of how a data addict can spin out of control – too many useless monitors will mask a more important issue.
  • Over Automation: Automation is a wonderful thing; however, it’s possible to have too much of a good thing. In one instance, there was a coldfusion server which would crash often. Rather than trace out the root cause, restarts were automated, then forgotten about. A few years later, it was found that the coldfusion servers were restarting every twenty minutes, and no one knew about it – no one except the users. If each restart takes 20 seconds, that’s 26,280 twenty-second interruptions over the course of a year, which can translate into a bad user experience and lost sales. Make sure that automation is audited and verifiable, and doesn’t cause more trouble than it prevents.
  • Over Communication: While it is important to communicate with stakeholders, it is possible to over communicate. Stakeholders don’t need to know that there are 130 defunct apache processes caused by a combination of a bug in mod_jk and the threading configuration in JBoss – all they need is “Site availability is intermittent – we’ve located the root cause and are working on a solution. More information to follow.” Details aren’t needed. Likewise, not every single person should be notified when an alert goes off – does your backup administrator need to know when a web server goes down? No. Does the DBA need to know when an SSL cert is about to expire? No. Tailor the messages to the correct audience. Most monitoring systems allow you to configure contact groups – use them.
  • Complexification: There are dozens of relationships between services, hosts, hostgroups, contacts, servicegroups, notification windows, dependencies, parents, etc. Try as you might, it’s usually impossible to perfectly model every relationship. Don’t become distracted by perfecting the configuration – focus on maintainability, scalability and accuracy. If you can’t add new systems and monitors, your configuration is too complex.
  • Reporting vs Monitoring: Reports are the more successful cousin of Alerts. They may superficially appear similar, but serve entirely different purposes. Monitors should only be used to track and trend data and to notify if there is a problem, whereas reports take the collected data and massage it into an aggregated format. Monitors shouldn’t send out scheduled alerts. They can collect data, but they shouldn’t be used to present it to users. You’d be surprised how often someone asks for a monitor to send a nightly report. That slippery slope will turn your monitoring system into crystal reports.
  • False Positives: False positives are the scourge of the monitoring world. There are many causes, but the reaction is always the same – start to investigate, realize that it’s a false positive, and lose interest, knowing that nothing is broken. The problem is that a false positive leads to lazy behavior – if you’re pretty sure it’s a false positive, you don’t bother looking into it, figuring it will clear on its own. This trains people to have a “wait and see” mentality when alerts go off, causing unneeded delays when a major issue appears.
  • Apathy: It’s 2am on a Saturday, and you get paged that the CPU on a utility server is pegged. Without looking, you know that it’s the backup process copying the home directories, so you ignore it. The following Monday at 10am the QA JBoss instance stops responding. You know that it will clear within minutes because the QA team always rebuilds the QA instance Monday morning. When you get monitors constantly failing and recovering on their own, you start to ignore the pages that come in because you know they’re unimportant. It’s only a matter of time before you miss something important. If you have a situation that promotes apathy towards alerts, resolve it before something important is missed.

Don’t be [A]pathetic

I mentioned apathy above, but there’s a bit more to it – it’s not just admins that become apathetic. If an issue is identified, action must be taken to correct it. The coldfusion example mentioned above is a great example of company apathy – failure of the business unit to prioritize it and failure of IT to push back hard enough. A former manager of mine once had someone laugh at the fact that their team had ignored his bug report for a full year. That’s not funny; it’s pathetic.

When management fails to address an issue, be it a known system problem or something as simple as morale after losing a team member, it shows the team that they don’t care. It soon becomes a vicious cycle of uncaring: managers no longer care that the site is down, which in turn causes developer apathy. Developers then don’t care about code quality, leading to buggy code. Sysops stop caring that alerts are going off, leading to downtime. By the time the cycle is broken, it’s far too late – you’ve established a bad reputation with your customers.

Oftentimes this will start with unreasonable development expectations, causing devs to cut corners, QA to be rushed, and monitors to be forgotten. There is a balance that must be maintained between getting code out the door and making sure that the code can stand up to the abuse it will receive when it goes live. It’s a team effort, and everyone must care (and keep caring) to keep the systems running.

Wow. Well, that’s a lot more than I intended on writing. I should state that I am guilty of 75% or more of the bad behaviors listed here. I hope that this will help start discussion on how to better improve monitoring systems.

If you have feedback, suggestions or enhancements, please leave them in the comments.

(Thanks to jdrost, jslauter, keith4, pakrat, romaink, and my wife Jackie for their peer review/editing.)

home sick

So, I’m home sick again. Fourth time this year I’ve been sick. Cold, flu, cold, Bronchitis. Awesome. Will I get to rest today? No, of course not.

My server (Unicron) has been up and running for 2 years now- I got the parts right after Ian was born. I set up a nice software raid array at the time that’s served me well. I’d never set up a raid array like this before, so I wasn’t really sure how to monitor it. The raid array has been running fine for 2 years, so I just sorta let it slide.

Over this past weekend, I did some work resizing the lvm partitions and had to poke around with the raid stuff. I found not one, but two ways to monitor it: one was to set up a monitor with the mdadm tools and have it email me if there was a problem, and the other led me to create a simple nagios monitor. I set both up Sunday night.
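
For the curious, the two pieces look roughly like this (a sketch; the mdadm.conf path is the Debian/Ubuntu one, and the email address is a placeholder):

# 1) Have mdadm email on failures: run the monitor itself...
sudo mdadm --monitor --scan --daemonise --mail=you@example.com
# ...or set MAILADDR in /etc/mdadm/mdadm.conf and let the distro's
#    mdadm init script run the monitor for you.

# 2) A dead-simple nagios-style check: an underscore inside the [UU_U]
#    status blocks in /proc/mdstat means a degraded array.
if grep -q '\[[U_]*_[U_]*\]' /proc/mdstat; then
    echo "CRITICAL: degraded md array"; exit 2
else
    echo "OK: all md arrays healthy"; exit 0
fi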

Flash forward to this morning- Jackie wakes me up, asks me if I’m going to work (I’d come home sick the day before and was still out of it). My intention was to wake up long enough to IM my manager and supervisor and let them know I was gonna be sick. I do so, then minimize the IM windows. Staring me in the face was the following:
[screenshot: RAID failure alert]

Wait, wait, wait- my script must be crappy, there’s no way the raid array choked right after setting up the monitoring. I sorta go into denial and check my email:

This is an automatically generated mail message from mdadm
running on unicron

A Fail event had been detected on md device /dev/md2.

It could be related to component device /dev/sdc2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid5 sda3[0] sdd2[3] sdc2[4](F) sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]

md3 : active raid1 sdc1[0] sdd1[1]
979840 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
979840 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
192640 blocks [2/2] [UU]

unused devices:

Frak.

So here I am, praying it was a hiccup while I reboot and rebuild my raid array. It looks like sdc2 went nutty about 45 minutes after I went to bed. I restarted the server and sdc2 reappeared, and I’m rebuilding now to see what happens:

md2 : active raid5 sdc2[4] sda3[0] sdd2[3] sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
[=================>…] recovery = 85.0% (414436248/487211200) finish=26.8min speed=45182K/sec

One thing is for sure, I need to get an auxiliary drive in case this one goes kaput for real. I said I’d buy one in June… 2007. Suppose I better get on that, huh?

My apologies if this was nonsensical, I’m really tired.

new plugin test.

So I’m testing a nifty new wordpress plugin… check this out:

I wonder if it’ll work?

update: no, no it will not.

What’s up?

0

So I’ve been pretty quiet since I hit 100k words- what’s been going on?

  • Round of layoffs at work
  • Friend diagnosed with cancer
  • Another round of layoffs at work.
  • Jackie became a Pampered Chef consultant
  • Finances have been wiped out from Christmas and getting her PC stuff off the ground.
  • 10% paycut at work
  • Guitar lessons are now done because no one can afford them.
  • Have been reading Manuscript Makeover for ways to improve my book
  • Decided to do an initial cleanup of the first draft of my script, then rewrite the outline before starting draft #2
  • Started yet another open-source project- this time it’s a collection of Nagios Plugins.

So I’ve been pretty busy. I’ve finished the cleanup of the first two chapters of book 1; hopefully I’ll finish the rest shortly, but it’s very slow going. We’ll see where things head in the next few months- I expect more crappiness.

Free Jabber / XMPP clients for a Blackberry?

Anyone know of any good jabber clients for the BlackBerry? I’ve tried a couple with little luck, and most of them cost more than I can afford for this test. Features required:

  • Must run on BlackBerry 8703e v4.1.0x
  • Connection server can be configured differently from the JID address (e.g. you@morgajel.net for the JID, jabber.morgajel.net for the connection server). This rules out Mobber as far as I can tell
  • Requires SSL/TLS
  • Non-strict cert checking

Let me know if you have any suggestions.

What’s blue and white and still not working?

My internet connection.

So here’s the scoop:

5 days until cutover:
I call AT&T, tell them I’m moving and need to transfer my Static IP DSL service on the 30th (Monday). Tech says no problem, it’s all set. I am pleasantly surprised at how little of a hassle it was and that it was way smoother than any other interaction I’ve had with them.

Saturday, 2 days until cutover:
We’re planning on doing the actual moving Sunday morning and plan to spend Saturday packing and planning. However at 3am Saturday morning, the internet connection drops, leaving me unable to contact many of the people who may be able to help us move. It sucks, but ok, we can work around it. I still have enough people to get by with and have ways to contact most of them. Since we asked to be connected on Monday, maybe they had to disconnect the old line the day before in order to get their stuff in place. Maybe they cut it on Saturday rather than Sunday because nobody works on Sunday. I get that, I can understand it. While annoying, it’s still better than my previous interactions with them.

Sunday,move day, day before cutover:
We move on Sunday and realize that we never actually checked to see if the house had any phone cables in it. It didn’t. Fortunately my father-in-law knows a bit about phone installation and was able to help me wire up a stub for the AT&T guy to connect to.

Monday, 1 day after move:
AT&T shows up, runs cable, says service will be enabled within X hours. Yippie.

Tuesday, 2 days after move:
Connection is there, but my IP address has changed. “Crap,” I think, “now I gotta update DNS entries for our sites.” But I can understand this; perhaps my old static IP was tied to the network near my old apartment and didn’t reach this area. I can buy that. So I change my DNS entries… and they don’t work. I look again and I apparently mistyped the IP, because the new DNS entry doesn’t match the external IP on the router. So I change it again. And 20 minutes later the external IP has changed again.

They had me on a fricking dynamic IP. For the non-techies out there, large ISPs only have a limited number of IP addresses, and more often than not don’t have one for every customer. Since few customers stay online 100% of the time, they can take addresses away from people not using them and redistribute them as needed. This is called a Dynamic IP account. For people who run servers from their homes, keeping the same IP is important, so when your computer goes to connect to morgajel.com, it needs to be able to find the right IP address. That’s why I pay extra for AT&T to guarantee me the same IP address. That is why I am pissed. While there are ways to get around this (dyndns), they’re a pain in the ass and not an option for me since I run an IRC server as well.

So I tinker around, thinking maybe *I* did something wrong- maybe my router was reset and it cleared the static info. I dig around with Jackie’s help, find the original documentation, and try to set up the networking listed manually. No dice. Then I remember that yes, they did manage the info via the PPPoE settings, and that just required a user name and password, which is what I was originally using. I switch it back and get yet another dynamic IP. I should point out that my static IP range was 75.x.x.x, while the dynamic stayed in the 66.x.x.x range- this made it easy to keep track of what was going on.

So I call them up and surprise surprise, they screwed up. See, they don’t really transfer accounts so much as shut off the old one and create a new one. The tech didn’t bother to notice I had a static account and replaced it with a dynamic account. I’m livid at this point, and tell them that it needs to be switched back. “Ok, I’ll put in the order. It’ll be ready in 10 days.” Now, this should NOT take 10 days from a technical point of view, this is all red tape causing the delay. But WTF can I do, so I say hell with it and go along with it.

At some point my father-in-law comes back over to help with the baby gate and notes that the technician illegally ran the line through the neighbor’s yard. While I’m half tempted to yell at them to fix it, I just wanna get a connection up and running again so I can actually write about the house.

Saturday, 5 days after cutover (timeline gets a little fuzzy here)
Connection is still flaky, but generally working. I call to check on the status of the static IP order, and find out it was never placed. They’ll get right on that.

Sunday, 6 days after cutover
Connection goes down at 7:37am. Completely. It does not come back. Jackie calls tech support this time. Flames, brimstone, and cries of the undead ensue. Eventually I take the phone and find out there’s still no mention of a static order of any sort for our account. Guess what? They can’t do anything about it because “orders” isn’t open on weekends. They agree to send out a tech to look at the line since they can’t see the modem from their end. He should be out between 8am and noon on Monday.

Monday, 7 days after cutover:
Connection begins working again around 7am- I think to myself “great, maybe they just took it down to switch over to the static IP- finally I can get my stuff up.” Nope, still a dynamic IP address. I call AT&T to get the static IP address set up and let them know the connection is up. They say hold off until the technician confirms it’s not an issue. ok. I’ll call back later. I spend my time waiting for the technician looking for any other ISPs in the area on dslreports.com

Technician comes out, nice guy, doesn’t see anything wrong, says he’s seen this behavior before when switching from dynamic to static, but the business won’t fess up to it. Whatever. At least the wiring was good, presuming that the installer and the inspecting tech were both competent. While he was tooling around, I found out that Cyberonic, my ISP from DC, covers this area (they didn’t in Grand Rapids or Rochester Hills). They resell business-class Covad lines to residential customers. I contemplate switching over to them, but figure it would be too much effort since I’ve gotten this far. I’m not even sure they’d have a decent plan in this area.

So he leaves and I call AT&T back and get the static all set up. She also said the static IP would be in place tomorrow. Just as we’re finishing she informs me that since I don’t have a contract, my payment will go up to $70 a month from $55. “WTF, this isn’t my screwup- you guys said you could transfer service, then you pooch it, then you want to charge me for it??”

“Oh, no,” she says, “When we transfer service, we don’t transfer contracts. If you want the original rate, you’ll have to sign up for another year of service.”

This is where Jesse snaps.

“You know what? Fine, make it the month-to-month price, because it’ll take me about 3 weeks to get Covad in here.” She was a bit shocked by that statement, and the conversation ended awkwardly. I think she was supposed to ask if I was pleased with my experience, but she knew the answer.

I then spent 10 minutes looking through DSL Reports for ISPs in the area and narrowing down their plans- turns out that Cyberonic offers the same plan I had in DC for $60. Let’s compare the plans side by side:

Feature AT&T Cyberonic
Download speed 3meg 6meg
Upload speed 386k 768k
IP addresses 5 static 5 static
Stability False True
Cost $55/mo $60/mo

I call up Cyberonic; the phone is picked up on the 3rd ring. I tell the technician that I’m interested in their plan, I get signed up, CC info is taken, etc. The entire call lasted 22 minutes and 28 seconds. I was never transferred once, my call was never dropped, the technician never once said “I don’t know,” and they were going to do a hotswap on the line and cancel the AT&T DSL for us since we obviously can’t have 2 DSL services on the same line. The transfer should take place in the next 7-14 business days.

I’d like to point out that AT&T still hasn’t got their act together as of this morning (Thursday), and dropped my connection while I was beginning a deployment for work. That was real awesome btw. Thankfully my neighbor is allowing us to use his wireless connection until we get it straightened out. If the issues aren’t resolved by switching to cyberonic, I’ll have the neighbor report the cable crossing his yard and they’ll have to come out and redo it (this is my backup plan).

The good news is we’ve moved our blogs to gopedro.net. I’m still in the process of converting them, but expect to be done by next Monday. The only site that will still point to my static IP is morgajel.com, for my streaming music, IRC server, etc. We’ve also decided to move all of our pictures to flickr, so expect to see broken images for a while.

I really want to thank gopedro.net for their help in all of this. I highly recommend them for any domain name purchases or hosting. They’ve been handling our domain names for years now, and their service is outstanding. I’d also like to thank our new neighbor Bobby for being one hell of a cool guy.

I’ll keep you updated on how things go. Hopefully I’ll start writing about the house soon.

*UPDATE 2008-07-14*
Cyberonic called and told me they’d be sending a technician out tomorrow to verify the lines. Hopefully I should have a working connection soon.

*UPDATE 2008-07-16*
My bad, it was wednesday. Connection is up now and I’m back online with a static IP!
