Work

Unfinished Drafts: The Importance of Documentation

This article was originally written on July 19th, 2010, but never published.

Documentation is another topic where there appears to be disagreement in the sysadmin world. When to document, what to document, who to document for, and where to store that documentation always seem to be subjects of contention. Everyone likes documentation, but no one has the time to document, and the rules for documentation often feel arbitrary. I’d like to open this up for discussion and figure out some baselines.

Should I Document?

If you have to ask then probably; but it’s much more complex than that. Documentation is time-consuming and rarely of value at first, so few want to invest the effort into it unless it’s needed. There are several questions here that need to be answered:

  • Why should I Document? What is the purpose of the documentation? Are you documenting a one-off process that you’ll have to do 10 months from now? Are you providing instructions for non-technical users? Perhaps you’re defining procedures for your team to follow. Whatever the reason, focus on it, and state it up front. There are few things worse than reading pages of documentation only to find out that it’s useless. Documentation for the sake of documentation is a waste of time.
  • What should I Document? It’s very easy to ramble when writing documentation (as many of my articles prove). Step back and review what you’ve written, then remove any unneeded content. Find your focus and document only what needs to be explained; leave the rest for footnotes and hyperlinks.
  • When should I Document? As soon as possible. Ideally you’d document as you worked, creating a perfect step-by-step record. Realistically, pressure to move quickly causes procrastination, but the truth of the matter is that the longer you wait, the less detail you’ll remember. Write down copious notes as you go, and massage them into a coherent document after the fact.
  • Who should I Document for? Write for your audience- a non-technical customer requires a much lighter touch than a seasoned techie. The boss may need things simplified that a coworker would instinctively understand. Pick your target audience and stick to it. Anything that falls outside of the audience’s interests should be flagged as “[Group B] should take note that…” Also remember that the person who requests the documentation may not be the target audience.
  • Where should I Document? Where you keep documentation is often more important than the quality of your document. You can write the most compelling documentation in the company, but if it’s stored in a PowerPoint slide on a shared drive, it’s of no use to someone searching a corporate wiki. Whatever your documentation repository may be, be it Alfresco, SharePoint, Confluence or even MediaWiki, everyone has to be in agreement on a definitive source. The format should be searchable, track revisions, prevent unwanted access, and be inter-linkable.

Now that we’ve set some boundaries, let’s delve a little bit deeper into the types of documentation.

Types of Documentation

Documentation can take many forms. Over the course of any given day, you’ll see proposals, overviews, tutorials, standards, even in-depth topical arguments. Each type of documentation has its own rules and conventions- what’s required for one set may not be needed for another. That said, here are a few general rules to follow.

  • Be Concise
    • NO: thoughtfully contemplate the reduction of flowery adjectives and adverbs for clarification;
    • YES: remove unneeded words. Over-explaining will confuse the reader.
  • Be Clear – Make sure your subject is obvious in each sentence. Ambiguity will destroy reader comprehension.
  • Be Accurate – Incorrect documentation is worse than no documentation.
  • Keep it Bite-sized – Large chunks of data are hard to process, so keep the content in small, digestible chunks that can be processed one at a time.
  • Stay Focused – Keep a TODO list. Whenever you think of an improvement, make a note of it and move on.
  • Refactor – The original structure may not make sense after a few revisions, so don’t be afraid to reorganize.
  • Edit for Content – Make sure your topics are factually correct and the content flows properly.
  • Edit for Grammar – Make sure your punctuation is correct and your structure is technically sound.
  • Edit for Language – Make sure the text is actually interesting to read.
  • Link to Further Information – If someone else has explained it well, link to it rather than rewrite it.
  • Get Feedback – Feedback finds mistakes and adds value. The more trusted sources, the better off you are.

Proposal/RFCs

Proposals can be immensely rewarding (or mind-numbingly frustrating), depending on if they’re accepted or not. That’s not to say you shouldn’t write them; even a failed proposal has value. The point of a proposal is to communicate an idea, a way to tell your team or supervisor “this is what I think we should do.” If you’re successful, the idea will be implemented. If you’re unsuccessful, you may find out a better way to do it. The overall goal should be to improve team performance. Here’s what a proposal should include:

  • The Problem – What problem are you trying to solve? Why is it a problem?
  • The Solution – A simple overview of the solution
  • The Benefits – What benefits will it provide?
  • The Implementation – How to implement it.
  • The Results – Explain the intended results
  • The Flaws – What issues are expected, and whether there is currently a solution for them.
  • The Timeframe – When should this project be started and completed? How long and how much effort will it take?

Let’s presume you write a knockout proposal. Everything is perfect, and with 2 days of effort you’ll reduce a 2 hour daily task to a 15 minute weekly task. Regardless of the benefits, the response will be one of these:

  • Complete Apathy – the worst response, because it shows how little you are valued. No response, approval, or denial. If this happens, run your idea past an uninvested third party. Perhaps a critical set of eyes will reveal the problem.
  • Denied – perhaps the benefit isn’t worth the cost, the risk is too high, there aren’t enough resources, or there’s some other issue you haven’t addressed. Try to get specific reasoning as to why it won’t work, and rework your proposal taking that into account.
  • Feigned Interest, no Support – Be it plausible deniability or lack of interest, the response is weak. Push for a yes or no answer, and ask what the concerns are.
  • Delay – It’s a good idea, but not right now. There might be hesitance due to a minor issue. Find a way to calm their fears, push for an implementation date, and create a checklist of conditions that need to be met.
  • Conditional Agreement – It is a good idea, but conditions must be met first. Create a checklist and verify that it’s complete.
  • Full Agreement – This should be your end goal. Full agreement means support from the boss and the team on implementation. Without support, your efforts may be wasted.

You can’t account for everything in your proposal, so make sure not to paint yourself into a corner. A method for dealing with problems is more valuable than individual solutions. It doesn’t need to be perfect, but does need to be flexible.

The most important thing a proposal needs is buy-in. If your team and management aren’t behind an idea, implementation will be a struggle. The final thing to keep in mind is that not all proposals are good. If there is universal apathy for your idea, it might just be bad and you’re oblivious to it.

Introductions and Overviews

Introductions are the first exposure someone may have to whatever you’ve been working on, be it a JBoss implementation, Apache configuration, or new software package. A clear understanding of what “it” is can help with acceptance. A bad introduction can taint the experience and prevent adoption. So, how can you ensure a good introduction to a technology?

  • Explain the Purpose – Why is the user reading this introduction? A new Authentication system? Messaging system? Explain why the reader should care.
  • Define your Terms – Include a glossary of any new terms that the user needs to understand. Remember, this may be their first exposure to the topic. Don’t overwhelm them, but at the same time don’t leave them in the dark.
  • Don’t Drown in Detail – An introduction should not cover everything in perfect detail, but it should give the reader references to follow up on.

The tone should be conversational- you need to draw the reader in, befriend them, and convince them that this new thing is not scary. This can be a tough task if the subject is replacing something that the reader is already comfortable with.

Document a Process (Installation, Upgrade, Tutorials, How-to, Walk Through)

Documenting a process serves three purposes- it trains new employees in proper technique, ensures consistency, and covers your rear should something go wrong. That last point may sound a bit cynical, but you never know when you’ll need it.  The process itself should be clear enough that any qualified user can follow it. Process documentation should have the following traits:

  • Steps – Well defined tasks that need to be performed.
  • Subtasks – Any moderately complex task should be divided up.
  • Document Common Problems – Surprises can derail a new user. Acknowledgement and fixes for issues can help ease new users into the process.

Dry runs are essential in documenting a process- test the process yourself and have others test it as well. Continual runs will expose flaws and allow you to address deficiencies. Keep testing and refining the process until a sample user can follow it without issue.

Topical guide (Feature-based)

Topical guides are both the most useful and yet the hardest documentation to write. They need to be thorough, fully covering the material without burying the user in frivolous details. So what should you cover in a topical guide?

  • Be specific on the topic – Document a feature and all related material. If it’s not related, don’t include it.
  • Cover Relevant Tangents – If a related topic is needed to understand the feature, touch on it briefly and link to a fuller explanation elsewhere.
  • Be comprehensive – Cover everything a user needs to know, but remember it’s not intended to be a reference book.

Document a Standard (How Something Should be Done)

Inconsistency is the bane of system administration, and consistency can only be had when everyone is in agreement on how things should be done. There must be agreement not only on theory, but also in practice. As such, standards should be documented. What should a standard entail?

  • Dynamic – Not the first word you think of when it comes to standards, but something you have to face; your standard will become out of date quickly. Document it and give it a revision number. Soon enough you’ll realize it needs another revision.
  • Audit – It’s not enough to document a standard, you also need to enforce it. Periodic verification can spot issues before they become problems. If configuration files are supposed to be identical across hosts, md5sums can be used to find inconsistencies, as in the sketch below.
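
A rough sketch of that md5sum audit (the hostnames and file path are placeholders, not from the original draft):

for host in web01 web02 web03; do
    ssh "$host" md5sum /etc/ntp.conf
done | awk '{print $1}' | sort | uniq -c
# one line of output means every host agrees; more than one line means something drifted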

Annotation (Config Commenting)

One of the most common types of documentation is never published, yet often the most crucial in day-to-day operations. Comments within configuration files can explain what steps were taken and why.

  • Explain Why – When you make changes, explain why you made the change.
  • Keep it Simple – Comments should not overshadow the configuration. Leave over-documentation to sample configs.
  • Consider Versioning – The best configuration documentation is a history of changes. Configurations that are both critical and fluid (for example, Bind zone files) are perfect candidates for versioning.
  • Sign and Date Changes – When you make a change, leave your name and a datestamp. While versioning comments may be more permanent, inline comments provide instant context. This is important when the change is revisited and no one remembers making it.
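
To illustrate the kind of comment I mean (the name, date, value, and ticket number here are all invented):

# 2010-07-19 jdoe: raised MaxClients from 150 to 256 after the marketing launch
#                  exhausted the prefork pool (ticket #1234 has the gory details)
MaxClients 256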

Hostname Conventions

This concept is something I’ve carried around with me for my last 3 jobs, and since I’m writing it up for my current employer, I figured I should document it here as well. I’ve mainly worked in Linux/Windows environments, so you may sense a bit of bias away from older systems. It’s not intentional, just a result of my experience. Thanks to Mick for introducing me to this schema.

Purpose

The purpose of this documentation is to provide a clean-cut and straight-forward convention for naming servers. The goal is to look towards the future and not be shackled by our past. This document should be read through once for a basic understanding, and used as a reference when new hosts and services are set up. It should provide a general convention, and may be modified in the future for clarity. All changes should be made by the owner of this document to ensure consistency.

Basic Domain Structure

The domain example.corp will be our designated placeholder for this discussion.

Naming Conventions

Host names are broken into two classifications: Physical Host Name and Service Host Name.

  • Physical host names can be compared to the DNA of a server and are by necessity somewhat cryptic.
    • Used when managing the physical server inventory
    • Used exclusively by the operations team
    • Used for resource monitors such as IO, CPU and Memory.
  • Service host names are more dynamic, user friendly, and may jump between physical servers as applications are migrated.
    • Used exclusively when discussing specific functionality of the server.
    • Used for managing cluster or application functionality e.g. “pushing out a new attributes.xml file to all shopcart jboss app servers”
    • Used for functional service monitoring.

Physical Host Name

Physical host names represent a unique way to refer to each and every physical server; however, unlike the machine’s serial number, they have the flexibility to change. Each character of the name must have meaning or provide clarity. If the purpose of a host is changed, the host name may be changed to suit the new purpose, although this will happen infrequently (and make sure to define a procedure for changing a Physical Host Name). Names are broken into 6 key pieces of information:

[Location] [Service Level] [OS] [OS Major Version] [Purpose] [Identification Number]

Location

Location refers to the datacenter in which the hardware is physically present. Each entry will consist of a unique 2 character code representing the city where the datacenter is located. The following is a definitive list of locations currently allowed. If a new entry is needed, please contact the owner of this document.

Designation City
ax Alexandria, VA
gr Grand Rapids, MI
mh Madison Heights, MI

Service Level

Service level is loosely defined by the importance of the applications running on the server, and how urgent our response should be. We currently have 3 designations:

Designation Urgency Purpose
1 Top Priority This machine should be treated as a customer facing production host.
2 Medium Priority Downtime on this server has a significant impact on productivity.
3 Low Priority Downtime on this server has minimal impact.

Note that no server should be completely ignored if it is in distress, this just provides a general guideline as to the urgency of the problem.

Operating System / OS Major Version

Automating system updates and configuration roll-outs is a key responsibility of the operations team. Embedding OS identification into the host name not only allows convention-based automation, but also provides context for alerts during a production event. OS identification is broken into two categories: Operating System and the Major Version number of that OS.

Designation Operating System
C CentOS
O Oracle Solaris
P Proprietary One-Off/ appliance
R RedHat
S Suse
V Vmware ESXi
W Windows
Z Solaris Zone
Designation Major OS Version
0 10
1 1 or 11
2 2 or 12
etc.

Purpose

Purpose refers to the overall usage of the server; whether it’s a database server, application server, etc. Definitions are purposefully loose to allow similar services to share the same host. This list will grow as we better define our environment. Purpose designations should refer to generic uses rather than implementation-specific ones (db rather than Oracle or Sybase).

Designation Purpose Example
as Application Server Servers that run Web Applications; JBoss Application Server, Tomcat, IIS
ws Web Server Servers that host static or semi-static content; Apache, Nginx
db Database Servers that host Databases; Oracle, Sybase, MySQL
ci Continuous Integration Servers that host continuous integration; Hudson, TeamCity, CruiseControl
ut Utility server Servers used by Operations; bind, ldap, pdsh, ssl cert creation
ts Terminal Server Terminal Server for Serial Access (not to be confused with MS Terminal Services)
vh Virtual Host Server Virtual Machine Host Server

Numeric ID

The last segment of the physical host name is a simple three digit numeric ID. The Numeric ID can be used in several contexts:

  • The numeric ID can increment when the rest of the host name is the same (mh1c6ci001, mh1c6ci002, etc).
  • The numeric ID can be used to designate clustered servers (gr2c6as013, gr2c6as023, gr2c6as033, gr2c6as043).

Samples

The following is a list of sample host names:

  • mh2c6as011 – Madison Heights, Second Priority CentOS 6 Application Server 11 (App server hosting foo.com qa running on tomcat)
  • gr1r5db002 – Grand Rapids, Top Priority Red Hat 5 Database Server 2 (Production Oracle RAC server)
  • ax1c5ci001 – Alexandria, Top Priority CentOS 5 Continuous Integration Server 1 (Master node of Hudson)
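
As a side note (my own sketch, not part of the convention itself), the fixed field widths make these names easy to pick apart or match against in scripts:

parse_hostname() {
    # assumes the 2+1+1+1+2+3 character layout described above
    local h=$1
    echo "location=${h:0:2} level=${h:2:1} os=${h:3:1} osver=${h:4:1} purpose=${h:5:2} id=${h:7:3}"
}
parse_hostname mh2c6as011

# or pull every Grand Rapids CentOS 5 app server out of a hypothetical inventory file:
grep '^gr.c5as' hostlist.txt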

Service Host Name

Service Host Names are convenient aliases associated with functionality rather than physical servers. Names are segmented into the following format:

(application)-(purpose)(id).(subdomain).example.corp

Application

Application represents the specific functionality associated with a service.

Type Example
In-House Application register, webservices, shopcart
Application Server oracle, mysql, sybase

Purpose

The purpose designation will usually align with the purpose of the physical host name. See the reference list above for details.

Numeric ID

Numeric ID is a two-digit incrementing ID based on the uniqueness of the rest of the hostname, e.g. register-as01.dev.example.corp, register-as02.dev.example.corp, register-as01.qa.example.corp

Subdomain

Subdomain refers to either an environment, or a specialized infrastructure subdomain.

  • dev.example.corp – Environment used for Active Development
  • qa.example.corp – Environment used for Quality Assurance
  • stage.example.corp – Environment used for loadtesting and User Acceptance Testing
  • prod.example.corp – Environment used for Production
  • sn.example.corp – Storage network used for Backups and NAS
  • mgt.example.corp – Management network, used for ILO.

Caveats

There are some exceptions to the Service Host Name conventions. The following is a list of examples.

Exception Example Reason
No Subdomain nagios.example.corp Some Infrastructure Services are not tied to a given subdomain.
No Purpose and ID register.stage.example.corp Load-balanced Service Hostnames point directly to the Load Balancer and do not require these fields

Samples

Common Service Host Names

  • register-as01.dev.example.corp
  • shopcart-as06.prod.example.corp
  • mysql-db01.qa.example.corp
  • sybase-db02.stage.example.corp

Load-Balanced Service Host Names

  • webservices.stage.example.corp
  • register.dev.example.corp
  • shopcart.qa.example.corp

Utility Service Host Names

  • nagios.example.corp
  • svn.example.corp
  • hudson.example.corp

FAQ

  1. Why are we limiting Physical Host Names to only 10 characters?
    • Because it’s about as long as it can be and still be phonetically memorable: “mh1 c5 as 001”
    • RFC 1178 suggests keeping hostnames short.
  2. I’m not sure what to call the new host I’m building?
    • Speak with the owner of this document if you have any doubts. Everyone on your team should be confident and competent enough to select a new name, but at the same time a centralized person will help ensure consistency.
  3. I’m building a ______ and it doesn’t fit into any of the purposes listed- what do I call it?
    • Are you sure it doesn’t fit? We don’t want you arbitrarily shoe-horning servers into bad names, but we need to balance that with preventing an over-abundance of entries. If it truly doesn’t fit, work with the team to designate a new entry on this page.
  4. I’m building a ______ and it has multiple purposes, what do I call it?
    • First ask yourself why it has multiple purposes; would one of those purposes be better suited elsewhere? Provided one does not have a better home elsewhere, use the one that is most appropriate; if a machine is running mysql as a backend for a JBoss application, it’s the JBoss application that people will be most interested in, so you’d label it “as” rather than “db”. It’s really a judgment call, and by all means, ask the owner of the doc if you need guidance.
  5. What about VMS/HP-UX/whatever that only allows 6 character names?
    • You can’t win them all. At some point you have to draw a line on how much legacy you need to support. In this case, it doesn’t make sense to shackle systems made in the last 15 years because incredibly ancient machines have limitations.
  6. You’re missing ____ on your list.
    • If there is a heinous omission of a location, purpose, application or operating system, by all means let us know. This isn’t set in stone, and can be modified if needed.

I’m always looking for feedback and new ideas; if you have any suggestions, I’d love to hear them.

Application Server Troubleshooting tip

We recently ran across a problem in production that we could not replicate in lower environments. Since this is not only a high use application, but an exceptionally “chatty” app, searching the logs was an exercise in futility (*one* of yesterday’s production logs was 6,975,291 lines long, with multiple logfiles per app, multiple apps and multiple servers).

So how do you find a needle in the haystack? Get a smaller haystack. In the quickest window possible, perform the following three steps:

  1. tail -f log1 log2 log3 log4 > combined.log
  2. reproduce error
  3. ctrl-c tail process as quickly as possible

Doing so reduced our 11,480,799 lines (with 780,527 errors) to 1200 lines.
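
Fleshed out a bit (the paths here are made up; adjust for your environment), the same trick looks like this:

tail -f /opt/app/logs/app1/server.log /opt/app/logs/app2/server.log > /tmp/combined.log &
TAILPID=$!
# reproduce the error in another window, then stop the capture immediately
kill $TAILPID
grep -ci exception /tmp/combined.log    # sanity-check the size of the new haystack
less /tmp/combined.log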

 

We’re Hiring…

My employer is currently looking for a sysadmin. If you’re interested, contact me for details.

 

SR SYSTEMS ENGINEER ROLE IN FARMINGTON HILLS, MI

Summary:

We are looking for someone who will administer web hosting Linux systems infrastructure, including server hardware, operating system, enabling software, and application software/data for Internet-facing application systems. Direct other departments’ work on dependent systems such as network, firewall, load balancer, and external storage systems. Provide consultative expertise for our businesses to provide technical guidance, standards, knowledge and understanding of business and technology processes, and integration of technologies to deliver Internet-facing learning products and services.

The position will be responsible for systems configuration, implementation, administration, maintenance, and support, along with  application integration, and troubleshooting for our eLearning systems.

The role encompasses daily operational systems support in development, QA, and production tiers. It also encompasses project work with business units, developers, test labs, end users and other groups involved in the planning, development, integration, testing, and problem solving for applications, content, and data.

Essential Duties/Responsibilities:

  • Ensure maximum uptime of hosted environments, including production, staging, testing, authoring, and development environments. This includes, but is not limited to ensuring the HW is configured properly; is secure; is networked properly; is backed up per company standards; is monitored accordingly; is tested to ensure operability; and is built to company standards.
  • Act as a consultative resource for our businesses to provide technical guidance, standards, knowledge and understanding of  business and technology processes, and integration of technologies for content management and delivery.
  • Assist with integration efforts, including planning and coding where necessary in Apache, Tomcat Java, and MySQL database technologies, and scripting languages.
  • Assume lead role in complex problem solving in hosted environments, offering meaningful solutions and implementation strategies. Engage other departments and direct their work on supporting systems such as network, firewall, load balancers, and external storage. Engage application teams with analysis from logs and data on the servers, and provide recommendations for problem resolution.
  • Be part of an on-call rotation schedule that includes carrying a pager/email device 24/7. Respond to all alerts immediately and inform management of issues and work being performed to remedy the problem. Direct escalation to engage additional resources if required to troubleshoot and resolve a problem.
  • Monitor, analyze, and report performance statistics for web hosting environments. Troubleshoot hosting environment failures and manage / assist in the development of solutions to these problems. This includes not only overall environment / platform problems, but also includes problems affecting individual client accounts (i.e. data integrity, reporting, security, etc.).
  • Analyze web hosting environment averages and peak workloads / throughput compared to existing capacities and plan required accommodations to address environmental growth. Take necessary corrective actions (both scheduled and unscheduled) to proactively address potential problems before they become operational / environmental problems. Notify Manager of projected needs and actions taken.
  • Ensure security of systems, including standard server build and lock-down procedures, and monitoring security access to systems.
  • Review system logs regularly, report and research warnings and errors. Review system logs for backup completions and report any discrepancies.
  • Execute implementation/migration of new software and application versions across the development and staging and production environments and prepare back out plans on all platforms to be updated. Ensure adherence to established Change Management and QA procedures. Verify results with appropriate parties.
  • Work with peers and other departments to analyze ongoing processes and procedures. Where relevant, propose / design improvements to operational processes.
  • Keep up to date with developments in the e-Learning / web-based information technology field through educational and other information resources and make management aware of possible applications for new technologies.
  • For new web hosting infrastructure projects, act as technical lead for planning and implementation.   Mentor and train junior team members in all areas of IT expertise.

Skills/Knowledge/Experience:

Basic (Required)

  • Bachelor’s degree in Information Systems, Computer Science, Business or Engineering or equivalent job related experience.
  • Must have an excellent command of:
    1. Red Hat Linux Operating System
    2. Apache Web Servers
    3. Tomcat application environment running Java
    4. MySQL Database Server
    5. MarkLogic Content Management Systems
  • Must possess experience designing, building, maintaining, migrating, tuning, administering, and supporting three-tiered web/application/database server environments
  • Experience with Internet access and security for servers residing within a DMZ
  • Must have excellent written and oral communications, including technical documents, and process documents.
  • Must possess excellent problem-solving and analytical skills and be able to translate business requirements into information systems solutions.
  • Able to translate business requirements into technical recommendations for information systems solutions.
  • Must possess excellent problem-solving and analytical skills; ability to assist with network, system, and application troubleshooting required.

Preferred

  • This position demands a well-organized, action-oriented team player with the ability to prioritize daily work, change directions quickly, coordinate geographically dispersed team members and work on multiple projects simultaneously.
  • Comprehensive knowledge of problem analysis, structured analysis and design, and programming techniques.
  • Coding and scripting skills for a RedHat/Apache/Tomcat/MySQL environment, clustering and other high-availability architectures, TCP/IP, along with various server management and administrative tools.
  • Ability to work with minimal supervision, engaging peers and other departments to accomplish assigned goals and effectively manage projects in a cross-functional environment.

Administer web hosting infrastructure, including server hardware, operating system, enabling software, and application software/data for content management systems. Direct other departments’ work on dependent systems such as network, firewall, load balancer, and external storage systems. Provide consultative expertise for our businesses to provide technical guidance, standards, knowledge and understanding of business and technology processes, and integration of technologies to deliver Internet-facing learning products and services.

The position will be responsible for systems configuration, implementation, management and support, along with  application integration, and troubleshooting for our MarkLogic-based Content Management Systems. The role includes installation, configuration, administration and maintenance of the content management environment and integrating new systems and products into the platform.

The role encompasses daily operational support of the content management systems and application environment in development, QA, and production tiers. It also encompasses project work with business units, developers, test labs, end users and other groups involved in the planning, development, and testing of products, content, and workflows in the content management systems.

 

Raw WinXP Virtualbox Partitions on a Thinkpad

New job, new laptop. Many utilities here are windows only, so it requires a bit of… effort… to get myself up and running efficiently. The solution to the windows problem is VirtualBox. I had set this up on my last laptop with little effort, but this time around required a bit more effort. Hopefully the instructions below will help others get up and running quickly.

Disclaimer– your laptop may catch on fire and explode (or worse) if you attempt this… or something.

We’ll be presuming that you’ve already resized your windows partition and have both a working Windows and Linux partition.

In Windows

Log into XP, grab MergeIDE.zip from Virtualbox’s site, extract and run it. It should be a quick flash and be done. (Note: I am not 100% sure this step is needed)

Create a new hardware profile and name it virtualbox. Make sure to set it as a choice during boot. Try rebooting into native windows once to ensure that it does offer you profile options.

In Linux

You’ll need the following packages installed (May differ for non-ubuntu systems):
mbr, virtualbox-ose, virtualbox-ose-qt
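
On Ubuntu, something along these lines should pull them in (package names may vary by release):

sudo apt-get install mbr virtualbox-ose virtualbox-ose-qt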

Create a stand-alone mbr file to use for booting (yes, you need the force flag):

install-mbr ~/.VirtualBox/WindowsXP.mbr --force

We’re presuming that your windows partition is /dev/sda1. In the below command, we are defining

  • a vmdk file (WindowsXP.vmdk)
  • which raw disk to read (/dev/sda)
  • which partition (1)
  • the new MBR file we just created

VBoxManage internalcommands createrawvmdk -filename ~/.VirtualBox/WindowsXP.vmdk -rawdisk /dev/sda -partitions 1 -mbr ~/.VirtualBox/WindowsXP.mbr -relative -register

Note that you’ll need read/write access to that drive as your user, so you may want to figure out a cleaner/more secure way to implement this rather than adding your user to the disk group (which is very dumb and insecure). I would, but it’s working and I have more important things to do at the moment.
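
If you want something slightly less terrible than the disk group, one option (a sketch I haven’t battle-tested here) is a one-off ACL on the device nodes; it limits access to your user and gets reset when udev recreates the nodes:

sudo setfacl -m u:$(whoami):rw /dev/sda /dev/sda1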

Another issue- apparently thinkpads report the drive heads and cylinders oddly (a T410 for me, a T60p in the article I followed), so we have to add some geometry settings to the vmdk before VirtualBox fills them in incorrectly. Open ~/.VirtualBox/WindowsXP.vmdk and add the following at the bottom:

ddb.geometry.biosCylinders="1024"
ddb.geometry.biosHeads="240"
ddb.geometry.biosSectors="63"

The biosHeads appears to be the magic value- it seems to work if it’s set to 240, but the default is 255 (which fails).

Once you add those, start up VirtualBox and check the Virtual Media Manager; your new vmdk should be listed there. Once it’s confirmed, create a new virtual machine. Rather than creating a disk, select your vmdk as an existing disk.

After you finish, go to the VM Settings -> System and make sure the Motherboard tab has IO-APIC enabled (I also had PAE/NX enabled under Processor and VT-x enabled under Acceleration).

Start the VM

There are several errors that could pop up. I’m sure there are plenty more that I stumbled across, but these were the two big ones:

  • a disk read error occurred, press ctrl+alt+del to restart – Caused by incorrect biosHeads- check and make sure it’s set to 240 (this was the fix for me, results may vary).
  • Complaint about kvm/vmx – Virtualbox does not like kvm. Uninstall qemu-kvm.

If things go well, it should flicker mbr in the corner, then go to the hardware profile selection screen. Select the virtualbox profile, and continue, then log in.

What follows is a half-hour of installing generic drivers and dealing with hardware-specific auto-start apps complaining that they won’t work on this installation. Windows will warn that the new drivers are not blessed, so be forewarned.

Once completed, at the top of the VM window select Devices -> Install Guest Additions. This will download and mount an ISO, and Windows will pop open a folder with the Guest Additions executables. Select the one best for you and run the installer. It will prompt you for video and mouse drivers (and trust me, you want them).

The final step is to shut down the windows VM, then reboot into the native windows partition to make sure it still works.  I did receive a few blue-screens before logging in at the beginning, but they appeared random and haven’t happened since.

And that’s all there is to it- simple, eh? Your windows partition should now run in native mode and vm mode.

The Philosophy of Monitoring

As a system administrator, monitoring is a key job responsibility, yet arguments seem to arise on how to implement it (usually with people who won’t be paged at 3am). Before writing this, I looked around for an article on the goals and philosophy of system monitoring, but found very little that really applied to this topic. Hopefully this will help set some expectations for admins, managers and stakeholders on what you should monitor, and why it should be monitored.

Why you Monitor

Before you set up a single monitor, you have to ask yourself, “what is the goal?” After all, why are you even setting something up? Here are a few common reasons for configuring monitors:

  1. Notification: Warning of an issue that requires intervention. What most people think of when you say “Monitoring”.
  2. Reactionary: Automatic actions are taken when certain criteria are met. If common countermeasures are automated, you’ll have less to handle manually (see the sketch after this list).
  3. Informational: System status and historical trending allows you to show business customers that production “isn’t always down.” In reality,  you may have 99% uptime, and often downtime is due to requested deployments. Statistical information can also be used for capacity planning.
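
A rough sketch of the reactionary style - a handler script like this (paths and names are hypothetical) is the sort of thing most monitoring systems, Nagios included, let you hook in as an event handler:

#!/bin/bash
# naive example: if the check reports CRITICAL, try bouncing apache once
STATE=$1
if [ "$STATE" = "CRITICAL" ]; then
    logger "http check critical on $(hostname) - attempting apache restart"
    /etc/init.d/httpd restart
fi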

Mentally dividing your monitors into groups will help you calculate which monitors require involvement. It’s not uncommon to have several thousand monitors at any given time, so it’s important not to assign critical importance to all of them. A wise man once said “When all alerts are critical, none of them are.”

When you should NOT Notify

Some monitors may have thresholds set which check for certain conditions; when those conditions are met, you may want to send some type of alert to an administrator. There are two types of notifications – Active and Passive:

  • Active Notification: Immediate Action is Required: “Site is Down!” A phone call, page, or IM may be used to contact someone. Direct action expected.
  • Passive Notification: Informational Purposes only:  “JVM Memory usage is high.” Information is logged, and perhaps an email is sent. No direct action is expected, since there’s usually nothing you can do about it.

It’s easy to become addicted to passive notifications – but remember, data overload can mask important information. It becomes habit to ignore notifications if they are unimportant. The question then is not so much “when should you notify,” but “when shouldn’t you?” What it really boils down to is “Can/should I do anything about it right now?”

  • Non-critical (disk space creeps above  90%  on /var on a dev server at 2am on a Saturday after several months of growth).
  • Nothing Systemic is wrong (admins can’t fix “low sales”).
  • 3rd party system, such as a geocoding webservice, is down.
  • Will resolve shortly, such as a backup server pegging the CPU during midnight backups.

Some of these alerts can be avoided by setting a correct monitoring window (ignore CPU during the backup window, or set a blackout window for a deployment). Others simply can’t be addressed by administrators, although you may want to send informational emails to other members of the company (those managing 3rd party SLAs or responsible for tracking online sales). The next step after getting an alert is to figure out what to do about it.

Reacting Properly

When a notification is sent out, there should be a definitive action that you can take. Think about why you were notified. There are a few rules to keep in mind when something goes wrong.

  1. Don’t Panic. When 700 alarms go off, your first instinct is to panic. Before you act, take a breath. Spend a moment to get your bearings, and calm yourself. The worst possible thing you can do is flail. Randomly making changes without rhyme or reason and restarting services can do more harm than good and may make the situation worse. Take note of which alarms go off, and in the post mortem look for ways to get the same information with less noise.
  2. Identify Obvious Patterns. What is the commonality? If a central system goes down, you may see many similar alerts. Dependencies can help immensely, masking redundant alerts. A single database failure could take down a dozen sites. Which is better: getting a single alert that the database is down, or 250 alerts that various sites are down and one database notification in the middle? While 250 alerts may impress the gravity of the situation upon you, it may instil panic and anxiety, which leads to flailing.
  3. Get things up and running as quickly as possible. Root-cause analysis can be tedious, time consuming, and occasionally inconclusive. If you have a major system outage, don’t worry about doing root-cause analysis on the spot.  Do what you need to in order to get things up and running – you can search the logs later. If the problem is recurring, you’ll get another chance to investigate later.
  4. Communicate with Stakeholders. The business units don’t need to know the details, but they do need to know that there is an outage and that it’s being addressed. If the situation is not quickly resolved, give them status reports. Be warned – any details you reveal will be warped and held against you. I’ve learned this one many times. People have a tendency to blame what they don’t understand. “Site is down? It must be a witch!” At a previous job we had a “jump to conclusions” board which had our favorite scapegoats – load balancer, connection pool, Endeca, etc. Everyone is guilty of it – Business, devs, sysops, QA, etc. Even a one-time problem that has been resolved will be brought back up, even if it’s only tangentially related. Communicating too much information creates future scapegoats.
  5. Contact Domain Experts. If your java site is crashing and you’re not a java developer, get a java developer involved. If your DNS server falls down and the fix isn’t obvious, contact your DNS administrator. Expert eyes on the problem may resolve the issue quicker. Group chat is crucial for sharing information and talking out theories. Someone familiar with the code will know what the error messages mean.
  6. Fix the Problem. It should go without saying that if you find the problem, you should make every effort to resolve it. Workarounds are fine, just don’t let that band-aid become permanent. What often happens is a workaround is put in place; the alert clears and management no longer feels the pain, so they ignore the problem without putting forth the effort to fix the issue. When the next issue appears, a new fix is layered on the old. Band-aid is layered on band-aid. Eventually you’ll need to pull those band-aids off; and the more there are, the more painful it will be.

How Much is Too Much?

Most administrators prefer to be proactive rather than reactive, resolving issues before they become a problem. Proper monitoring can be a great asset, but if you’re not careful it can cause problems. For example, at a previous job we had a load balancer, apache instances and tomcat instances set up for each site. Each site had the following:

In the legacy monitoring system (SiteScope):

  • Health check on load balancer URL

In Nagios:

  • Health check on Apache instances
  • Health check on Tomcat instances
  • Health check on Load balancer URL

In Apache:

  • Health check on tomcat instances

In Load balancer:

  • Health check on Apache instances
  • Health check on Tomcat instances

Individually, these don’t seem that bad. If an apache instance goes down *of course* the load balancer needs to know so it won’t send traffic to that instance. The same with Apache watching Tomcat. The problem was the frequency of the checks; the load balancer was checking each monitor every five seconds. When a poorly load-tested site update was released, certain pages took 7 seconds to load. Things quickly went downhill as threads and processes backed up, crashing the site.

Balancing responsiveness with common sense is essential. Having a monitor check every minute won’t change the fact that it will take an admin 20 minutes to get to a computer, boot up, log into the VPN, and identify the issue. Don’t add to the problem by DOS’ing your applications.

Making Contact

One mistake I’ve seen is using email as a reliable and immediate method of contact, often expecting a quick response. My favorite is when someone sends you an email, then walks down to your desk immediately after and asks “did you see my email?” You check and see it was sent literally less than two minutes ago. You can’t rely on people to continually check their email. Admins especially don’t, due to the sheer volume we receive.

Email has its uses, but active contact in an emergency situation is not one of them. Personally, I only check my email when I think about it, which may mean large delays between when the message is sent and received. Couple that with spam filters, firewalls, solar flares and the 500 other unread messages, and email becomes a less-than-reliable medium for emergency notifications (even during business hours).

Paging (or SMS) is preferable if you expect a quick response, although it is far from perfect. Just like email, SMS messages can be lost in the ether; however, recipients usually have their phone alert them when a message comes in, since it happens far less often than an email dropping into the inbox. That said, not every alert should be sent as a page, or apathy will quickly set in. The escalation path should look something like this (although not all steps are needed):

  • Front-end web interface alert: User would have to actively be browsing to see the status change. Usually the first clue something is wrong and shows the most recent status changes on a dashboard.
  • Email Alert: User would have to be actively checking their email. Usually sent when something is first confirmed down.
  • Instant Message: User would have to be at a computer and logged into IM to receive the alert. Rarely used, but an option during business hours.
  • Page/SMS: Reserved for emergencies. This means there is trouble.
  • Phonecall: Only used if Admin does not respond to the previous contact attempts. Usually performed by an irate manager or director.

If you’re lucky enough to have a 24×7 call center / help desk, they can also be leveraged to resolve issues before a system administrator is needed. If recurring patterns start to emerge,  automation can be used to deal with the problem (or better yet you can fix the underlying issue). Sadly, many issues can’t be automated away or solved by a call-center staffer pressing a button. A real admin will eventually need to be contacted.

I don’t want to dig too deeply into on-call rotations, but an effort should be made to balance off-hours support with a personal life.  Being on-call means no theaters, fancy dinners, or quality time with the family. Without balance, burn out will ensue.

Afflictions

System monitoring often brings out odd behavior in even the most steadfast of administrators. Some behaviors are relatively benign, while others can cause severe problems down the road. Identifying these behaviors before they cause a problem is just as important as having good monitors.

  • Data Addiction: Knowledge is power, but do not mistake information for knowledge. It’s possible to have 700 alerts, and not one of them identify the underlying issue. One of my least favorite phrases is “Can we put a monitor on that?” It’s often uttered right after a one-off failure; the type of thing that fails once, and once fixed will never cause a problem again. An example of this is a new server, where apache was not configured to restart after a reboot. When the server is restarted, you quickly find apache is down, start it, configure it to auto-start, and move on. There is already a monitor on the websites hosted by that apache instance as well as a monitor on how many apache threads are currently running; what purpose would another monitor serve? How often would it run? This is a prime example of how a data addict can spin out of control – too many useless monitors will mask a more important issue.
  • Over Automation: Automation is a wonderful thing, however, it’s possible to have too much of a good thing. In one instance, there was a coldfusion server which would crash often. Rather than trace out the root cause, restarts were automated, then forgotten about. A few years later, it was found that the coldfusion servers were restarting every twenty minutes, and no one knew about it – no one except the users. If each restart takes 20 seconds, that’s 26,280 twenty-second interruptions over the course of a year – which translates into a bad user experience and lost sales. Make sure that automation is audited and verifiable, and doesn’t cause more trouble than it prevents.
  • Over Communication: While it is important to communicate with stakeholders, it is possible to over communicate. Stakeholders don’t need to know that there are 130 defunct apache processes caused by a combination of a bug in mod_jk and the threading configuration in JBoss – all they need is “Site availability is intermittent – we’ve located the root cause and are working on a solution. More information to follow.” Details aren’t needed. Likewise, not every single person should be notified when an alert goes off – does your backup administrator need to know when a web server goes down? No. Does the DBA need to know when an SSL cert is about to expire? No. Tailor the messages to the correct audience. Most monitoring systems allow you to configure contact groups – use them.
  • Complexification: There are dozens of relationships between services, hosts, hostgroups, contacts, servicegroups, notification windows, dependencies, parents, etc. Try as you might, it’s usually impossible to perfectly model every relationship. Don’t become distracted by perfecting the configuration – focus on maintainability, scalability and accuracy. If you can’t add new systems and monitors, your configuration is too complex.
  • Reporting vs Monitoring: Reports are the more successful cousin of Alerts. They may superficially appear similar, but serve entirely different purposes. Monitors should only be used to track and trend data and to notify if there is a problem, whereas reports take the collected data and massage it into an aggregated format. Monitors shouldn’t send out scheduled alerts. They can collect data, but they shouldn’t be used to present it to users. You’d be surprised how often someone asks for a monitor to send a nightly report. That slippery slope will turn your monitoring system into crystal reports.
  • False Positives: False positives are the scourge of the monitoring world. There are many causes, but the reaction is always the same – start to investigate, realize that it’s a false positive, and lose interest, knowing that nothing is broken. The problem is that a false positive leads to lazy behavior – if you’re pretty sure it’s a false positive, you don’t bother looking into it, figuring it will clear on its own. This trains people to have a “wait and see” mentality when alerts go off, causing unneeded delays when a major issue appears.
  • Apathy: It’s 2am on a Saturday, and you get paged that the CPU on a utility server is pegged. Without looking, you know that it’s the backup process copying the home directories, so you ignore it. The following Monday at 10am the QA JBoss instance stops responding. You know that it will clear within minutes because the QA team always rebuilds the QA instance Monday morning. When you get monitors constantly failing and recovering on their own, you start to ignore the pages that come in because you know they’re unimportant. It’s only a matter of time before you miss something important. If you have a situation that promotes apathy towards alerts, resolve it before something important is missed.

Don’t be [A]pathetic

I mentioned apathy above, but there’s a bit more to it – it’s not just admins that become apathetic. If an issue is identified, action must be taken to correct it. The coldfusion situation mentioned above is a great example of company apathy – failure of the business unit to prioritize it and failure of IT to push back hard enough. A former manager of mine once had someone laugh at him because that person’s team had ignored his bug report for a full year. That’s not funny; it’s pathetic.

When management fails to address an issue; be it a known system problem or something as simple as morale from a lost team member, it shows the team that they don’t care. It soon becomes a vicious cycle of uncaring when managers no longer care that the site is down, which in turn causes developer apathy.  Developers then don’t care about code quality, leading to buggy code. Sysops stop caring that alerts are going off, leading to downtime. By the time the cycle is broken, it’s far too late – you’ve established a bad reputation with your customers.

Often times this will start with unreasonable development expectations, causing devs to cut corners, QA to be rushed, and monitors to be forgotten. There is a balance that must be maintained between getting code out the door and making sure that the code can stand up to the abuse it will receive when it goes live.  It’s a team effort, and everyone must care (and keep caring) to keep the systems running.

Wow. Well, that’s a lot more than I intended on writing. I should state that I am guilty of 75% or more of the bad behaviors listed here. I hope that this will help start discussion on how to better improve monitoring systems.

If you have feedback, suggestions or enhancements, please leave them in the comments.

(Thanks to jdrost, jslauter, keith4, pakrat, romaink, and my wife Jackie for their peer review/editing.)

home sick

So, I’m home sick again. Fourth time this year I’ve been sick. Cold, flu, cold, Bronchitis. Awesome. Will I get to rest today? No, of course not.

My server (Unicron) has been up and running for 2 years now- I got the parts right after Ian was born. I set up a nice software raid array at the time that’s served me well. I’d never set up a raid array like this before, so I wasn’t really sure how to monitor it. The raid array has been running fine for 2 years, so I just sorta let it slide.

Over this past weekend, I did some work resizing the lvm partitions and had to poke around with the raid stuff. I found not one, but two ways to monitor it: one was to set up a monitor with the mdadm tools and have it email me if there was a problem, and the other led me to create a simple nagios monitor. I set both up Sunday night.
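
For the curious, the mdadm half is roughly this (the address is a placeholder; most distros also ship an init script that keeps the monitor running for you):

# have mdadm watch the arrays and mail when a Fail event is detected
mdadm --monitor --scan --daemonise --mail=you@example.com
# or make it persistent by adding a MAILADDR line to /etc/mdadm/mdadm.conf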

Flash forward to this morning- Jackie wakes me up and asks me if I’m going to work (I’d come home sick the day before and was still out of it). My intention was to wake up long enough to IM my manager and supervisor and let them know I was gonna be sick. I do so, then minimize the IM windows. Staring me in the face was the following:
[raidfailure screenshot]

Wait, wait, wait- my script must be crappy, there’s no way the raid array choked right after setting up the monitoring. I sorta go into denial and check my email:

This is an automatically generated mail message from mdadm
running on unicron

A Fail event had been detected on md device /dev/md2.

It could be related to component device /dev/sdc2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid5 sda3[0] sdd2[3] sdc2[4](F) sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]

md3 : active raid1 sdc1[0] sdd1[1]
979840 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
979840 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
192640 blocks [2/2] [UU]

unused devices:

Frak.

So here I am, praying it was a hiccup while I reboot and rebuild my raid array. It looks like sdc2 went nutty about 45 minutes after I went to bed. I restarted the server and sdc2 reappeared, and I’m rebuilding now to see what happens:

md2 : active raid5 sdc2[4] sda3[0] sdd2[3] sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
[=================>…] recovery = 85.0% (414436248/487211200) finish=26.8min speed=45182K/sec
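
(For anyone hitting the same thing: I went the lazy reboot route, but the usual way to kick a failed member back into an array without rebooting is something like the following, assuming the drive itself is actually healthy.)

mdadm /dev/md2 --remove /dev/sdc2
mdadm /dev/md2 --add /dev/sdc2
watch cat /proc/mdstat    # keep an eye on the rebuild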

One thing is for sure, I need to get an auxiliary drive in case this one goes kaput for real. I said I’d buy one in June… 2007. Suppose I better get on that, huh?

My apologies if this was nonsensical, I’m really tired.

new plugin test.

So I’m testing a nifty new wordpress plugin… check this out:

I wonder if it’ll work?

update: no, no it will not.

What’s up?

So I’ve been pretty quiet since I hit 100k words- what’s been going on?

  • Round of layoffs at work
  • Friend diagnosed with cancer
  • Another round of layoffs at work.
  • Jackie became a Pampered Chef consultant
  • Finances have been wiped out from Christmas and getting her PC stuff off the ground.
  • 10% paycut at work
  • Guitar lessons are now done because no one can afford them.
  • Have been reading Manuscript Makeover for ways to improve my book
  • Decided to do an initial cleanup of the first draft of my script, then rewrite the outline before starting draft #2
  • Started yet another open-source project- this time it’s a collection of Nagios plugins.

So I’ve been pretty busy. I’ve finished the cleanup of the first two chapters of book 1; hopefully I’ll finish the rest shortly, but it’s very slow going. We’ll see where things head in the next few months- I expect more crappiness.

Free Jabber / XMPP clients for a Blackberry?

Anyone know of any good Jabber clients for the BlackBerry? I’ve tried a couple with little luck, and most of them cost more than I can afford for this test. Features required:

  • Must run on BlackBerry 8703e v4.1.0x
  • Connection server can be configured differently than jid address (i.e. you@morgajel.net for jid, jabber.morgajel.net for connection server.) This rules out Mobber as far as I can tell
  • Requires SSL/TLS
  • Non-strict cert checking

Let me know if you have any suggestions.
