Hobbies

This is a generic catchall for all of my hobbies.

Frack it, I’ll Build My Own.


So long story short, I’ve decided to attempt building a guitar by hand, mainly because I’m an idiot. It doesn’t have to be beautiful or well formed or even sound good; I just want to make my own. So… I went to Lowe’s, bought some wood, bought some tools, and got to work.

Supplies:

  • 1″x3″x4′ Poplar board.
  • 1″x4″x2′ Poplar board.
  • 1″x6″x2′ Poplar board.
  • 1″x12″x2′ Poplar board.

Tools:

  • Coping saw
  • 1/2″ chisel
  • Hammer
  • Miter box
  • Surform shaver

Note that the design I’m working on is lopsided: the bottom half is larger than the top half. The main reason is simply that I wanted to keep costs down and go with the 12″ wide board for the back.

Here’s the template I came up with. Remember that I’m a lefty, so the horn will be on top and the ball will be on the bottom (and the strings will be facing us). Note that the big board here will be on the back, and the section of neck seen here will taper in like a heel cut.

The straight-through neck idea: I plan on gluing and clamping these three boards together, then gluing and clamping them on top of the larger board’s cutout, then sanding the edges down. Note the “teeth” at the bottom of the lower board; that’s because I’m using a coping saw rather than a band saw, so I have to take small pieces off at a time.

The length you see here is the same as my Fender Strat, so the scale should be correct. The small section of board above the neck was extra length from the neck. I’m debating using that as the headstock and attaching it at an angle like a Gibson.

Not pictured is the 1/4″ red oak laminate that I’m going to use for the fingerboard. I’m not sure what to do for the bridge, nut and probable pickups; I might just go buy a broken used guitar and strip out the electronics.

I’ll post more updates as I have them. So far I’ve put about 4 hours of work into it and <$100 in wood.

Using Jenkins for System Administration

Preface

While system administrators often have many different goals, here are two that seem fairly universal:

  • Automate the redundant tasks
  • Hand off the simple tasks

I’ve recently found that the build utility Jenkins can be a major boon for an Operations team, and wanted to share my findings with others.

What is Jenkins?

So, what exactly is Jenkins? It’s a popular fork of the open source continuous integration tool, Hudson. While it is normally used for building and deploying software, it can easily be used for more interesting purposes. Here are some of the more useful features:

  • LDAP authentication
  • Group/Matrix-style permissions
  • User-definable views
  • Console output for each run
  • Manual and/or cron-based runs
  • master/slave node configurations
  • Email, XMPP, IRC notification integration
  • integration with Bash, Ant, and Maven
  • integration with cvs, subversion and other version control systems
  • Dozens of useful plugins.

Due to its modular and flexible nature, not only can you pick and choose the plugins you want, but you can write your own as well.

What is Jenkins normally used for?

Jenkins’ intended use is as a continuous integration server: it downloads code from a repository, resolves dependencies, builds the code, and then deploys it. As mentioned before, it tightly integrates with both Ant and Maven, making it a boon for Java developers. While Ant and Maven integration is incredibly useful, it’s the combination of Subversion and Bash integration that sysadmins will find most valuable.

What can Jenkins be used for?

Don’t think of Jenkins as a continuous integration server, think of it as…

  • Cron with a GUI and nice bells and whistles
  • Sudo replacement
  • Documentation updater

Jenkins has the advantage of centralization, visibility (a pretty gui goes a long way), per-run logging, IM and/or email notification, time trending, and a self-contained workspace. I usually draw the line by seeing if one or more of the following statements apply:

  • Use Jenkins if it involves multiple hosts
  • Use Jenkins if you wish to capture output on each run
  • Use Jenkins if you wish to trend disk usage or time of the task
  • Use Jenkins if you are in need of ACLs
  • Use Jenkins if it allows a non-technical person to perform a task without the help of a sysadmin
  • Use Jenkins if you want to run something manually as well as on a schedule.

What shouldn’t Jenkins be used for?

Once you recognize the power and flexibility at your disposal, you may be tempted to use Jenkins as a

  • Cron with a GUI and nice bells and whistles
  • Sudo replacement
  • Documentation updater

See what I did there? My point is that there’s a fine line between what belongs in Jenkins and what does not. Discovering it for yourself is half the battle. Here are some sample tasks as I would categorize them.

Task Sudo Cron Jenkins
Log rotation on a single server X
Login user access to restart an init script X
Multi-user access to stop an init script, clear logs and restart an init script X
Manager access to restart a service X
Deploying code across several production nodes X
Update Documentation on a Given Project X
Run a command nightly X
Run a command nightly that emails its output X
Run a command manually that IM’s you when complete and logs STDOUT X
Run a command through PDSH across several servers and update a wiki page X X

Why not Mcollective or RunDeck?

To be honest I’d never heard of them until after I began writing this article. I’d already been using Jenkins/Hudson for a year and a half. There may be better tools for the job, but Jenkins has some distinct advantages.

  1. You may already have it
  2. Getting buy-in from developers to use it is much easier.

In either case many of the tasks below may be usable under work-alike services.

Sample uses of Jenkins

So let’s get down to business: how can Jenkins make your life easier? Your first task is to identify things that can easily be scripted, are incredibly repetitive, and produce data that is outdated quickly. For me, that first task was to generate documentation.

Generating Documentation

No documentation is better than scores of bad documentation: it’s better to know you don’t know than to know the wrong thing. One of the many problems we suffered from was simple IP address allocation. Since we didn’t own the DNS server, it was hard for us to track allocations; we didn’t know what IP addresses were in use, what they were used for, or who was using them. Our best attempt to track them was manually maintaining a page for each of our team’s 47 internal subnets. The problems with this method were threefold: 1) it involved a human performing an error-prone routine, 2) there was no simple way to audit the entire system, and 3) there was no central authority.

Needless to say, things would drop through the cracks. On more than one occasion, we had applications go down because their IP address was double-allocated because half of the team didn’t know the IP allocation pages existed.

Example 1: Reverse DNS Entry Cleanup

So how did we solve this? The first step was to audit the current status of our reverses. I wrote a Perl script that takes IP ranges as an input, scans the network for active IPs, does reverse lookups on all IP addresses, then documents oddities and writes them to the wiki.
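The original script was Perl and isn't reproduced here; what follows is only a rough Python sketch of the same kind of audit. The function name, the oddity labels and the sample data are all made up for illustration, and the `resolve` and `is_alive` callables are injected so the logic can be exercised without touching a network (a real run might wrap `socket.gethostbyaddr` and a ping sweep).

```python
import ipaddress
from collections import defaultdict


def audit_reverses(cidr, resolve, is_alive):
    """Return a dict of oddity label -> findings for one IP range.

    resolve(ip) returns a host name or None; is_alive(ip) returns a bool.
    Both are hypothetical hooks supplied by the caller.
    """
    by_name = defaultdict(list)
    oddities = {"missing_reverse": [], "stale_reverse": [], "shared_name": []}

    for addr in ipaddress.ip_network(cidr).hosts():
        ip = str(addr)
        name = resolve(ip)
        alive = is_alive(ip)
        if name:
            by_name[name].append(ip)
        if alive and not name:
            # Host answers but has no reverse entry.
            oddities["missing_reverse"].append(ip)
        elif name and not alive:
            # Reverse entry exists but nothing answers (often outdated).
            oddities["stale_reverse"].append(ip)

    # Several IPs resolving to the same host name.
    for name, ips in sorted(by_name.items()):
        if len(ips) > 1:
            oddities["shared_name"].append((name, ips))
    return oddities


# Made-up sample data standing in for DNS and ping results.
reverses = {"10.0.0.1": "web01", "10.0.0.2": "web01", "10.0.0.3": "db01"}
alive = {"10.0.0.1", "10.0.0.2", "10.0.0.4"}
report = audit_reverses("10.0.0.0/29", reverses.get, alive.__contains__)
```

Keeping the lookups pluggable also means the slow part (the network scan) can be swapped or cached without touching the reporting logic.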

This may seem simple, but it was the first time we got a glimpse of how ugly our reverses were: multiple IP addresses resolving to the same host name (which shouldn’t happen in our configuration), missing reverses, and IPs that didn’t respond to ping but had reverses designated (usually outdated entries). While running this task was simple enough to do manually, it took 30 minutes to run. Who wants to run a 30 minute CLI job? While I could have put it in cron, I chose to put it in Jenkins for the following reasons:

  • The status was visible; anyone logged into Jenkins could see the process was running, and check the STDOUT to make sure there were no unexpected errors.
  • When it completed successfully, the status bulb would turn green- A quick visual sweep of 30 jobs can be done in seconds if you’re just looking for a red bulb.
  • Logs were kept with each run, so I could compare and contrast output
  • An IM could be sent when the job completed, passing or failing
  • The latest version of my Perl script was always checked out.
  • I could trend run times and estimate how long the next run would take
  • I could set it to run automatically once a night at 3am
  • I could run it manually via a Jenkins “push button”
  • Anyone on my team could run it manually via a Jenkins “push button”

With this output, we were able to drastically reduce the volume of bad reverses. With the help of the Confluence.pm module, we could write the results to the wiki for the whole world to see. This however only showed us bad entries- what about the good ones?

Example 2: X.X.X.X IP Range Pages

Now that we had reasonably good reverses, we could use that information to dynamically generate the pages my coworkers were previously maintaining manually.

We started yet another Perl script that took a set of ranges as inputs; each range would generate a different page in the wiki under the name “X.X.X.X IP Range”. The script would then plow through each IP address in the range and:

  • Do a reverse lookup to find out what host name resolved,
  • Fping it to see if it responded,
  • Check our LDAP inventory repository (also automatically updated) and report the physical host where the IP was bound,
  • Report the primary IP of that physical host,
  • Report the “description” field of the host entry,
  • Report the hardware detail of the host entry.

Once the script was written, it could take any number of IP ranges (even 47 subnets) and rip through them, updating a page for each. Jenkins made sure that the job ran nightly, again pulled the latest version of my script from Subversion, and kept track of any oddities in its output. Suddenly we had IP allocation tables that were guaranteed to be up to date.
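To give a feel for the per-IP loop, here is a small Python sketch (the real script was Perl, and this is not it). The `reverse_lookup` and `responds` callables and the `inventory` dictionary are hypothetical stand-ins for DNS, fping and the LDAP inventory; the field names are invented for the example.

```python
import ipaddress


def build_range_page(cidr, reverse_lookup, responds, inventory):
    """Build the title and table rows for one "X.X.X.X IP Range" page."""
    network = ipaddress.ip_network(cidr)
    title = f"{network.network_address} IP Range"
    rows = []
    for addr in network.hosts():
        ip = str(addr)
        host = inventory.get(ip, {})  # empty when the IP is not in inventory
        rows.append({
            "ip": ip,
            "reverse": reverse_lookup(ip) or "",
            "pings": responds(ip),
            "physical_host": host.get("physical_host", ""),
            "primary_ip": host.get("primary_ip", ""),
            "description": host.get("description", ""),
            "hardware": host.get("hardware", ""),
        })
    return title, rows


# Made-up sample data using documentation address space.
names = {"192.0.2.1": "app01.example.com"}
inventory = {
    "192.0.2.1": {
        "physical_host": "blade07",
        "primary_ip": "192.0.2.1",
        "description": "application server",
        "hardware": "C-class blade",
    }
}
title, rows = build_range_page(
    "192.0.2.0/30", names.get, lambda ip: ip in names, inventory
)
```

Rendering `rows` into wiki markup and pushing it to the wiki is left out, since that part depends entirely on the wiki in use.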

Example 3: Hardware Breakdown

Inventory management was a weak point for us when I started; we simply didn’t know how many servers we had. Previously this had been managed by a spreadsheet on the hardware admin’s laptop, but much like IP address allocation, it was error-prone and often out of date. We moved to an LDAP-based inventory management setup that was dynamically updated (how the inventory script works is outside the scope of this article). However, an LDAP interface is less approachable than a spreadsheet for management.

One of the things we track is hardware makes and model numbers. One of the managers wanted to know how many P-class blades we had left; a quick LDAP search and poof, we had his answer. Then he wanted to know how many C-class blades we had, so I showed him how to do searches. While that sufficed for a while, it soon became apparent that we needed a prettier hardware breakdown for management.

One Perl script later, we were able to dynamically generate wiki pages from this data. You may be asking “why not do this with cron?” The answer is simple: dependencies. The Hardware Breakdown page is downstream of the Inventory Update script. After all, if it’s pulling the entirety of its data from LDAP, it only makes sense to update when that inventory is updated.

Final note on Generating Documentation

It’s not only the content of the documentation that’s important- my generation scripts also include a warning that the page is auto-generated, the repository location of the script that generated it, and the time it took to generate that specific page.
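A notice like that is cheap to produce. The sketch below is a hypothetical Python helper, not taken from the actual scripts; the wording of the warning and the repository URL are placeholders.

```python
import time


def generation_notice(script_url, started, finished):
    """Return the notice appended to every generated page."""
    elapsed = finished - started
    return (
        "WARNING: this page is auto-generated; manual edits will be overwritten.\n"
        f"Generated by: {script_url}\n"
        f"Generation time: {elapsed:.1f} seconds"
    )


start = time.monotonic()
# ... build the page body here ...
notice = generation_notice(
    "svn://svn.example.com/ops/ip_range_pages",  # placeholder location
    start,
    time.monotonic(),
)
```

Using a monotonic clock for the timing avoids odd numbers if the system clock is adjusted mid-run.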

Generating Configurations

I’ll be the first to admit that this is a special-case usage, and may not be useful to many, but there are occasions where you’ll want to generate configurations. I don’t mean multi-server configurations where Puppet could be useful, but single, complex, repetitive configurations like Nagios.

Generating Nagios Hosts

We have over 800 servers. Manually configuring Nagios was labor-prohibitive, so an automated approach was warranted. Using our continuously updated inventory, we pull pertinent information from LDAP and generate individual host configuration files for each host. At this stage, there are no checks associated with the hosts.
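As a rough illustration of this first stage, the Python sketch below renders one Nagios host object per inventory record. The record layout (`name`, `ip`, `description`), the `generic-host` template name and the `hosts/` directory are assumptions for the example; the LDAP query itself is left out.

```python
def render_host(entry, template="generic-host"):
    """Render one Nagios host object from an inventory record (a dict)."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {entry['name']}\n"
        f"    alias      {entry.get('description', entry['name'])}\n"
        f"    address    {entry['ip']}\n"
        "}\n"
    )


def render_all(entries):
    """Map a config file path to its contents, one file per host."""
    return {f"hosts/{e['name']}.cfg": render_host(e) for e in entries}


# Made-up records standing in for the LDAP inventory.
inventory = [
    {"name": "web01", "ip": "10.0.0.1", "description": "front-end web"},
    {"name": "db01", "ip": "10.0.0.3"},
]
files = render_all(inventory)
```

Returning the files as a dictionary rather than writing them straight to disk makes it easy to diff a new run against the previous one before committing anything.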

From there, we loop through our newly created host files and scan the network using a predefined set of “features”: some hosts run MySQL on port 3306, some run JBoss on 8080, etc. Hosts that match a feature are added to a dynamically generated host group with predefined service checks.

For example, only two services should be running on TCP port 8080: Tomcat and JBoss. If that port is active, we check 1161, the SNMP port, to determine which service it is, then add the host to the proper host group. If it reports as JBoss, the host is added to the JBoss-hosts host group, which has checks for heap usage, thread usage, etc.
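The feature scan can be sketched in Python along these lines. Only the ports and the JBoss-hosts group name come from the description above; the other group names, the function names and the `identify_8080` hook (which stands in for the SNMP query on port 1161) are hypothetical.

```python
import socket

# Port -> host group for features identified by port alone.
FEATURES = {3306: "MySQL-hosts"}

# Port 8080 is ambiguous, so a second lookup decides the group.
GROUPS_8080 = {"jboss": "JBoss-hosts", "tomcat": "Tomcat-hosts"}


def port_open(ip, port, timeout=1.0):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(ip, is_open=port_open, identify_8080=None):
    """Return the host groups a host belongs to, based on open ports."""
    groups = [group for port, group in FEATURES.items() if is_open(ip, port)]
    if identify_8080 is not None and is_open(ip, 8080):
        service = identify_8080(ip)  # expected to return "jboss" or "tomcat"
        if service in GROUPS_8080:
            groups.append(GROUPS_8080[service])
    return groups


def render_hostgroup(name, members):
    """Render a Nagios hostgroup object with a sorted member list."""
    return (
        "define hostgroup {\n"
        f"    hostgroup_name  {name}\n"
        f"    members         {','.join(sorted(members))}\n"
        "}\n"
    )
```

The service checks themselves would then be attached to the host group once, rather than to each host individually.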

So if it’s all automated, why bother with Jenkins? Well, there’s a lot of complexity, and even with the best documentation, it’s a chore for even one person to keep track of how it all works. If something happens to the infrastructure while I’m out (e.g. several servers fail and are replaced), we want to implement those and update monitoring quickly. Having a push button “Regenerate Nagios hosts” makes it simple enough for anyone to do it. Having it email me when it happens, create runtime logs, and pull or write to subversion is icing on the cake. Jenkins helps us ensure that each run is handled identically, and ensures consistency when used by various administrators.

Empowering Users

Sysadmins usually have an abundant backlog of tasks, be it system audits, upgrades, research, etc. Quite often system administrators get drawn into user tasks because the task requires elevated privileges. Jenkins can help you hand that off to nontechnical users in a safe and simple manner.

GUI Interface for Non-Technical Users

No matter how technical your coworkers are, eventually you’ll run into a non-technical user that needs to perform some random technical task; perhaps it’s loading content into a custom system or indexing content. The process may have been designed by a developer using a command line program or script; it may require sudo. In either case, the task is technical enough to make the non-techie shy away from doing it because they “don’t want to use the dos window.”

Jenkins to the rescue! Asking that user to log into a web GUI and press a button lowers the barrier for them. In addition, they can watch the progress, see how long previous runs took, see when the job was last run, and who ran it. If the initial task was in any way complicated, that can all be hidden away (yet still clearly visible in the console logs).

Final Thoughts

As you can see, Jenkins can be re-purposed for a plethora of different uses. With a little bit of creativity and ingenuity, you can greatly improve your productivity. If you’re already doing something like this, please leave a comment below describing your own implementations.

Suggestions for Rocksmith

After trying to contact Ubisoft to provide them with feedback on Rocksmith (and receiving little more than an automated response), I figured I might as well put my feedback here, for all the good it’ll do. So here’s a list of things that I think they could improve.

  1. Why do I have to press 3 buttons to get into the game? Press A, Press Start, Press A… come on, I got 10 minutes to get my fix, and it takes 2 minutes to load up and get to a song.
  2. Consistent UI- press start, press A, make some noise- choose one and stick with it.
  3. The ability to navigate the menu with the guitar: pluck red open to select, purple open to go back. slide to scroll left or right (remember us lefties though).
  4. After finishing a song, the ability to go back to the library, centered on the song you just played.
  5. After finishing a song, riff repeating.
  6. More lives/ easier method to reselect that riff.
  7. You say intonation is important, but provide no way to check intonation. Give us a way to check it.
  8. A mode where you can use the controller to rewind or fast forward, even use the left and right bumpers to skip sections of the song.
  9. The ability to string a couple of sections together for riff repeater.
  10. More games- “Name that Tone” might be a good one for teaching note recognition.
  11. Remember, most users are playing with their guitars, not their controllers. A good portion of them are using their toes to operate the controller. Plan accordingly.
  12. $3/song? That’s a bit steep. Give me a discount for buying in bulk at least. How about a discount if you buy all new songs at once?
  13. “Favorite songs” option in the main menu, as well as a “recently played”.
  14. When doing technique challenges and such, after I see the video the first time, I don’t care to see it a second time, ESPECIALLY if it means another load screen.
  15. Speaking of loading, restarting in the middle or even near the end of a song is instantaneous, but after finishing a song and hitting “play again” it loads… that seems silly. I don’t give a crap about your menu, keep that song in memory if it means no load time.

 

I’m sure I’ll have more as I continue to play the game.

 

Update (20111201)

I have a couple more I’d like to add to the list:

  1. Ghost mode- Show the full note patterns, but at 30% opacity. If you play a section correctly, your mastery is adjusted accordingly.
  2. Assessment mode- If I know a song 90% of the way already, it’s infuriating waiting for Rocksmith to “catch up” to my knowledge. This was a major turnoff for an excellent guitar player that I showed the game to. One note every three seconds for someone who knows the song by heart? (I hadn’t played the song yet myself.)
  3. Riff repeater, Riff repeater, play full song, continue journey, songs, song x, riff repeater. The bolded sections shouldn’t be required- I should be able to go from the finished song back to the riff repeater without all that other crap. Actually, Riff repeater after any finished rehearsal should be available.
  4. Saved playlists- I’d like to be able to select several songs and queue them up so I can play them back to back without it being an event. I’d also like to save that queue for use later.
  5. Rehearse any song for an event- I see 4 songs for an event, 1 is qualified, 3 are not. I’d like to rehearse those three. It should be trivial to navigate and select them- get a cursor on that list and let me select any of them right from the main menu.
  6. Rehearsal Reversal – ever screw up part of a song and be upset about it? Imagine being able to use the controller to reverse the music like a tape player.
  7. Better communication- I just downloaded a 4 meg update and have no idea why. Did it fix something? Communicate that info- I’m a big boy, I can take it. Gimme technical details.
  8. Open Tuner mode – Sometimes I just want to mess with my tuning. The current setup is very restrictive.

Again, great game, but I’d like to see some of these warts removed and make it an exceptional game.

Update (20111203)

Add to the list:

  1. When you finish a song, show your score, your last score, and highscore. Gimme some stats man, show me that I improved!

Someone Stole My Monkey

So for whatever reason, I was googling around today and stumbled across this and saw that, damnit, someone stole my monkey. Not only that, but they’re linking directly to my server, and have been since 2006. I google more, and find someone using my security monkey to demonstrate an XSS attack.

Now, the reason I created this image was for my security monkey shirt.

I ask that everyone show their solidarity by spreading the word of the security monkey shirt to people most likely to buy one.

SPoE: Slow but steady.

So I have a few user stories; time to start putting the infrastructure together. So what have I decided on so far?

Language: Java
Framework: Spring*
Repository: Subversion
IDE: Eclipse
Continuous Integration/ Deployer: Hudson
Build Automation: Maven2

I’m in the process of getting all my pieces together and in place. I’ve set up a Subversion repository and Eclipse. I have a very basic .war file setup committed and a Maven script to build the WAR file. Every five minutes, a Hudson job polls the repository and rebuilds the WAR file, then deploys it to JBoss. It’s a pretty sweet setup despite it only deploying a “Hello World”.

Now I need to refresh myself on Java and learn Spring.  This leads me to my first task:

Card SPoE-0001
User Creation

As a User, I want to be able to provide a preferred username and email address to create an account.

Status: New
Version: 0.0.1
Component: User account
Original Estimate: ?
Time spent: 0
Time Needed: ?

Going into this, I’m familiar with the MVC framework in Rails, so I’m guessing the concepts aren’t much different for Spring MVC. I’m still new to this process (Java webapp layout, Agile and Spring), so if you see any mistakes or have suggestions, please let me know. If you’re interested in helping me with development,

*Despite suggestions of both Struts and Spring, I went with Spring after scientific investigation.

The Philosophy of Monitoring

As a system administrator, monitoring is a key job responsibility, yet arguments seem to arise on how to implement it (usually with people who won’t be paged at 3am). Before writing this, I looked around for an article on the goals and philosophy of system monitoring, but found very little that really applied to this topic. Hopefully this will help set some expectations for admins, managers and stakeholders on what you should monitor, and why it should be monitored.

Why you Monitor

Before you set up a single monitor, you have to ask yourself, “what is the goal?” After all, why are you even setting something up? Here are a few common reasons for configuring monitors:

  1. Notification: Warning of an issue that requires intervention. What most people think of when you say “Monitoring”.
  2. Reactionary: Automatic actions are taken when certain criteria are met. If common countermeasures are automated, you’ll have less to handle manually.
  3. Informational: System status and historical trending allow you to show business customers that production “isn’t always down.” In reality, you may have 99% uptime, and often downtime is due to requested deployments. Statistical information can also be used for capacity planning.

Mentally dividing your monitors into groups will help you calculate which monitors require involvement. It’s not uncommon to have several thousand monitors at any given time, so it’s important not to assign critical importance to all of them. A wise man once said “When all alerts are critical, none of them are.”

When you should NOT Notify

Some monitors may have thresholds set which check for certain conditions; when those conditions are met, you may want to send some type of alert to an administrator. There are two types of notifications – Active and Passive:

  • Active Notification: Immediate Action is Required: “Site is Down!” A phone call, page, or IM may be used to contact someone. Direct action expected.
  • Passive Notification: Informational Purposes only:  “JVM Memory usage is high.” Information is logged, and perhaps an email is sent. No direct action is expected, since there’s usually nothing you can do about it.

It’s easy to become addicted to passive notifications – but remember, data overload can mask important information. It becomes habit to ignore notifications if they are unimportant. The question then is not so much “when should you notify,” but “when shouldn’t you?” What it really boils down to is “Can/should I do anything about it right now?”

  • Non-critical (disk space creeps above  90%  on /var on a dev server at 2am on a Saturday after several months of growth).
  • Nothing Systemic is wrong (admins can’t fix “low sales”).
  • 3rd party system, such as a geocoding webservice, is down.
  • Will resolve shortly, such as a backup server pegging the CPU during midnight backups.

Some of these alerts can be avoided by setting a correct monitoring window (ignore CPU during the backup window, or set a blackout window for a deployment). Others simply can’t be addressed by administrators, although you may want to send informational emails to other members of the company (those managing 3rd party SLAs or responsible for tracking online sales). The next step after getting an alert is to figure out what to do about it.
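The “can/should I do anything about it right now?” test, combined with blackout windows, might be sketched like this in Python. The `Alert` fields, the three outcomes and the sample window are assumptions for illustration, not the behavior of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    check: str
    critical: bool        # does it need attention right now?
    actionable: bool      # can an admin actually fix it?
    blackout: tuple = ()  # ((start, end), ...) pairs of datetime.time


def in_window(now, start, end):
    """True if now is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def route(alert, when):
    """Return "page", "log" or "suppress" for an alert raised at `when`."""
    if any(in_window(when.time(), s, e) for s, e in alert.blackout):
        return "suppress"
    if alert.critical and alert.actionable:
        return "page"  # active notification
    return "log"       # passive notification


backup_cpu = Alert("backup server CPU", True, True, ((time(23, 0), time(2, 0)),))
geocoder = Alert("3rd party geocoder", True, False)
site = Alert("site health check", True, True)
```

With these samples, a CPU alert at 00:30 falls inside the backup window and is suppressed, the geocoder outage is only logged because nobody on the team can fix it, and the site check pages someone.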

Reacting Properly

When a notification is sent out, there should be a definitive action that you can take. Think about why you were notified. There are a few rules to keep in mind when something goes wrong.

  1. Don’t Panic. When 700 alarms go off, your first instinct is to panic. Before you act, take a breath. Spend a moment to get your bearings, and calm yourself. The worst possible thing you can do is flail. Randomly making changes without rhyme or reason and restarting services can do more harm than good and may make the situation worse. Take note of which alarms go off, and in the post mortem look for ways to get the same information with less noise.
  2. Identify Obvious Patterns. What is the commonality? If a central system goes down, you may see many similar alerts. Dependencies can help immensely, masking redundant alerts. A single database failure could take down a dozen sites. Which is better: getting a single alert that the database is down, or 250 alerts that various sites are down and one database notification in the middle? While 250 alerts may impress the gravity of the situation upon you, it may instil panic and anxiety, which leads to flailing.
  3. Get things up and running as quickly as possible. Root-cause analysis can be tedious, time consuming, and occasionally inconclusive. If you have a major system outage, don’t worry about doing root-cause analysis on the spot.  Do what you need to in order to get things up and running – you can search the logs later. If the problem is recurring, you’ll get another chance to investigate later.
  4. Communicate with Stakeholders. The business units don’t need to know the details, but they do need to know that there is an outage and that it’s being addressed. If the situation is not quickly resolved, give them status reports. Be warned – any details you reveal will be warped and held against you. I’ve learned this one many times. People have a tendency to blame what they don’t understand. “Site is down? It must be a witch!” At a previous job we had a “jump to conclusions” board which had our favorite scapegoats – load balancer, connection pool, Endeca, etc. Everyone is guilty of it – Business, devs, sysops, QA, etc. Even a one-time problem that has been resolved will be brought back up, even if it’s only tangentially related. Communicating too much information creates future scapegoats.
  5. Contact Domain Experts. If your java site is crashing and you’re not a java developer, get a java developer involved. If your DNS server falls down and the fix isn’t obvious, contact your DNS administrator. Expert eyes on the problem may resolve the issue quicker. Group chat is crucial for sharing information and talking out theories. Someone familiar with the code will know what the error messages mean.
  6. Fix the Problem. It should go without saying that if you find the problem, you should make every effort to resolve it. Workarounds are fine, just don’t let that band-aid become permanent. What often happens is a workaround is put in place; the alert clears and management no longer feels the pain, so they ignore the problem without putting forth the effort to fix the issue. When the next issue appears, a new fix is layered on the old. Band-aid is layered on band-aid. Eventually you’ll need to pull those band-aids off; and the more there are, the more painful it will be.
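The dependency idea from rule 2 can be sketched in a few lines of Python. The check names and the `depends_on` mapping are invented for the example; monitoring systems that support dependencies have their own configuration for this.

```python
def collapse_alerts(failing, depends_on):
    """Keep only failures whose own dependencies are all healthy.

    failing is an iterable of check names; depends_on maps a check name
    to the checks it relies on. What remains are the likely root causes.
    """
    failing = set(failing)
    roots = []
    for check in sorted(failing):
        parents = depends_on.get(check, ())
        if not any(parent in failing for parent in parents):
            roots.append(check)
    return roots


# 250 hypothetical sites that all depend on one database.
depends_on = {f"site{i:03d}": ["orders-db"] for i in range(250)}
failing = list(depends_on) + ["orders-db"]
roots = collapse_alerts(failing, depends_on)
```

Here 251 failures collapse to a single database alert, while a lone failing site with a healthy database would still be reported on its own.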

How Much is Too Much?

Most administrators prefer to be proactive rather than reactive, resolving issues before they become a problem. Proper monitoring can be a great asset, but if you’re not careful it can cause problems. For example, at a previous job we had a load balancer, apache instances and tomcat instances set up for each site. Each site had the following:

In the legacy monitoring system (SiteScope):

  • Health check on load balancer URL

In Nagios:

  • Health check on Apache instances
  • Health check on Tomcat instances
  • Health check on Load balancer URL

In Apache:

  • Health check on tomcat instances

In Load balancer:

  • Health check on Apache instances
  • Health check on Tomcat instances

Individually, these don’t seem that bad. If an apache instance goes down *of course* the load balancer needs to know so it won’t send traffic to that instance. The same with Apache watching Tomcat. The problem was the frequency of the checks; the load balancer was checking each monitor every five seconds. When a poorly load-tested site update was released, certain pages took 7 seconds to load. Things quickly went downhill as threads and processes backed up, crashing the site.

Balancing responsiveness with common sense is essential. Having a monitor check every minute won’t change the fact that it will take an admin 20 minutes to get to a computer, boot up, log into the VPN, and identify the issue. Don’t add to the problem by DOS’ing your applications.
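A back-of-the-envelope way to see the effect is Little's law: the average number of requests in flight equals the arrival rate times the time each one takes. The Python sketch below applies it to health checks. The five-second interval and seven-second response come from the story above; the number of checkers and the healthy response time are made-up values, and it assumes checks have no timeout and don't wait for the previous one to finish.

```python
def checks_in_flight(checkers, interval_s, response_s):
    """Average number of health checks running at once (rate x latency)."""
    rate = checkers / interval_s  # checks started per second
    return rate * response_s


# Hypothetical: 4 independent checkers polling one instance every 5 seconds.
healthy = round(checks_in_flight(4, 5, 0.2), 2)  # fast page
slow = round(checks_in_flight(4, 5, 7), 2)       # 7 second page
```

Under these assumptions the monitors alone go from a fraction of one busy thread to several, before a single real user is served.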

Making Contact

One mistake I’ve seen is using email as a reliable and immediate method of contact, often expecting a quick response. My favorite is when someone sends you an email, then walks down to your desk immediately after and asks “did you see my email?” You check and see it was sent literally less than two minutes ago. You can’t rely on people to continually check their email. Admins especially don’t, due to the sheer volume we receive.

Email has its uses, but active contact in an emergency situation is not one of them. Personally, I only check my email when I think about it, which may mean large delays between when the message is sent and received. Couple that with spam filters, firewalls, solar flares and the 500 other unread messages and email becomes a less-than-reliable medium for emergency notifications (even during business hours).

Paging (or SMS) is preferable if you expect a quick response, although it is far from perfect. Just like email, SMS messages can be lost in the ether; however, recipients usually have their phone alert them when a message comes in since it happens far less often than an email drops into the inbox. That said, not every alert should be sent as a page, or apathy will quickly sink in. The escalation path should look something like this (although not all steps are needed):

  • Front-end web interface alert: User would have to actively be browsing to see the status change. Usually the first clue something is wrong and shows the most recent status changes on a dashboard.
  • Email Alert: User would have to be actively checking their email. Usually sent when something is first confirmed down.
  • Instant Message: User would have to be at a computer and logged into IM to receive the alert. Rarely used, but an option during business hours.
  • Page/SMS: Reserved for emergencies. This means there is trouble.
  • Phonecall: Only used if Admin does not respond to the previous contact attempts. Usually performed by an irate manager or director.
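As a sketch only, the escalation path above can be modeled as an ordered list that is walked while an alert stays unacknowledged. The channel names mirror the list; the ten-minute step and the `skip` parameter are assumptions, not a recommendation.

```python
ESCALATION = ["dashboard", "email", "im", "sms", "phone"]


def escalate(minutes_unacknowledged, step_minutes=10, skip=()):
    """Return the channels that should have been tried so far, in order."""
    path = [channel for channel in ESCALATION if channel not in skip]
    steps = min(len(path), 1 + minutes_unacknowledged // step_minutes)
    return path[:steps]
```

For example, `escalate(25, skip=("im",))` yields the dashboard, email and SMS steps, reflecting that not every step has to be used.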

If you’re lucky enough to have a 24×7 call center / help desk, they can also be leveraged to resolve issues before a system administrator is needed. If recurring patterns start to emerge,  automation can be used to deal with the problem (or better yet you can fix the underlying issue). Sadly, many issues can’t be automated away or solved by a call-center staffer pressing a button. A real admin will eventually need to be contacted.

I don’t want to dig too deeply into on-call rotations, but an effort should be made to balance off-hours support with a personal life. Being on-call means no theaters, fancy dinners, or quality time with the family. Without balance, burnout will ensue.

Afflictions

System monitoring often brings out odd behavior in even the most steadfast of administrators. Some behaviors are relatively benign, while others can cause severe problems down the road. Identifying these behaviors before they cause a problem is just as important as having good monitors.

  • Data Addiction: Knowledge is power, but do not mistake information for knowledge. It’s possible to have 700 alerts, and not one of them identifies the underlying issue. One of my least favorite phrases is “Can we put a monitor on that?” It’s often uttered right after a one-off failure; the type of thing that fails once, and once fixed will never cause a problem again. An example of this is a new server, where apache was not configured to restart after a reboot. When the server is restarted, you quickly find apache is down, start it, configure it to auto-start, and move on. There is already a monitor on the websites hosted by that apache instance as well as a monitor on how many apache threads are currently running. What purpose would another monitor serve? How often would it run? This is a prime example of how a data addict can spin out of control – too many useless monitors will mask a more important issue.
  • Over Automation: Automation is a wonderful thing; however, it’s possible to have too much of a good thing. In one instance, there was a coldfusion server which would crash often. Rather than trace out the root cause, restarts were automated, then forgotten about. A few years later, it was found that the coldfusion servers were restarting every twenty minutes, and no one knew about it – no one except the users. If it takes 20 seconds to restart, that’s 26,280 twenty-second interruptions over the course of a year, which can translate into a bad user experience and loss of sales. Make sure that automation is audited and verifiable, and doesn’t cause more trouble than it prevents.
  • Over Communication: While it is important to communicate with stakeholders, it is possible to over communicate. Stakeholders don’t need to know that there are 130 defunct apache processes caused by a combination of a bug in mod_jk and the threading configuration in JBoss – all they need is “Site availability is intermittent – we’ve located the root cause and are working on a solution. More information to follow.” Details aren’t needed. Likewise, not every single person should be notified when an alert goes off – does your backup administrator need to know when a web server goes down? No. Does the DBA need to know when an SSL cert is about to expire? No. Tailor the messages to the correct audience. Most monitoring systems allow you to configure contact groups – use them.
  • Complexification: There are dozens of relationships between services, hosts, hostgroups, contacts, servicegroups, notification windows, dependencies, parents, etc. Try as you might, it’s usually impossible to perfectly model every relationship. Don’t become distracted by perfecting the configuration – focus on maintainability, scalability and accuracy. If you can’t add new systems and monitors, your configuration is too complex.
  • Reporting vs Monitoring: Reports are the more successful cousin of Alerts. They may superficially appear similar, but serve entirely different purposes. Monitors should only be used to track and trend data and to notify if there is a problem, whereas reports take the collected data and massage it into an aggregated format. Monitors shouldn’t send out scheduled alerts. They can collect data, but they shouldn’t be used to present it to users. You’d be surprised how often someone asks for a monitor to send a nightly report. That slippery slope will turn your monitoring system into crystal reports.
  • False Positives: False positives are the scourge of the monitoring world. There are many causes, but the reaction is always the same – start to investigate, realize that it’s a false positive, and lose interest, knowing that nothing is broken. The problem is that a false positive leads to lazy behavior – if you’re pretty sure it’s a false positive, you don’t bother looking into it, figuring it will clear on its own. This trains people to have a “wait and see” mentality when alerts go off, causing unneeded delays when a major issue appears.
  • Apathy: It’s 2am on a Saturday, and you get paged that the CPU on a utility server is pegged. Without looking, you know that it’s the backup process copying the home directories, so you ignore it. The following Monday at 10am the QA JBoss instance stops responding. You know that it will clear within minutes because the QA team always rebuilds the QA instance Monday morning. When you get monitors constantly failing and recovering on their own, you start to ignore the pages that come in because you know they’re unimportant. It’s only a matter of time before you miss something important. If you have a situation that promotes apathy towards alerts, resolve it before something important is missed.
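
The restart numbers in the Over Automation item are easy to sanity-check. The snippet below just redoes that arithmetic, assuming a restart every 20 minutes, 20 seconds per restart, and a 365-day year; the variable names are mine.

```python
# Figures from the coldfusion story: a restart every 20 minutes,
# each one taking roughly 20 seconds. Assumes a 365-day year.
RESTART_INTERVAL_MIN = 20
RESTART_DURATION_SEC = 20

restarts_per_year = (60 // RESTART_INTERVAL_MIN) * 24 * 365
downtime_hours = restarts_per_year * RESTART_DURATION_SEC / 3600
availability = 1 - RESTART_DURATION_SEC / (RESTART_INTERVAL_MIN * 60)

print(restarts_per_year)       # 26280
print(downtime_hours)          # 146.0
print(f"{availability:.2%}")   # 98.33%
```

In other words, those "harmless" restarts add up to about six days of downtime a year.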

Don’t be [A]pathetic

I mentioned apathy above, but there’s a bit more to it – it’s not just admins that become apathetic. If an issue is identified, action must be taken to correct it. The coldfusion story mentioned above is a great example of company apathy – failure of the business unit to prioritize it and failure of IT to push back hard enough. A former manager once had someone laugh because his team had ignored my manager’s bug report for a full year. That’s not funny; it’s pathetic.

When management fails to address an issue, be it a known system problem or something as simple as morale from a lost team member, it shows the team that they don’t care. It soon becomes a vicious cycle of uncaring when managers no longer care that the site is down, which in turn causes developer apathy. Developers then don’t care about code quality, leading to buggy code. Sysops stop caring that alerts are going off, leading to downtime. By the time the cycle is broken, it’s far too late – you’ve established a bad reputation with your customers.

Oftentimes this will start with unreasonable development expectations, causing devs to cut corners, QA to be rushed, and monitors to be forgotten. There is a balance that must be maintained between getting code out the door and making sure that the code can stand up to the abuse it will receive when it goes live. It’s a team effort, and everyone must care (and keep caring) to keep the systems running.

Wow. Well, that’s a lot more than I intended to write. I should state that I am guilty of 75% or more of the bad behaviors listed here. I hope that this will help start discussion on how to improve monitoring systems.

If you have feedback, suggestions or enhancements, please leave them in the comments.

(Thanks to jdrost, jslauter, keith4, pakrat, romaink, and my wife Jackie for their peer review/editing.)

What’s up?

So I’ve been pretty quiet since I hit 100k words- what’s been going on?

  • Round of layoffs at work
  • Friend diagnosed with cancer
  • Another round of layoffs at work
  • Jackie became a Pampered Chef consultant
  • Finances have been wiped out from Christmas and getting her PC stuff off the ground
  • 10% paycut at work
  • Guitar lessons are now done because no one can afford them.
  • Have been reading Manuscript Makeover for ways to improve my book
  • Decided to do an initial cleanup of the first draft of my script, then rewrite the outline before starting draft #2
  • Started yet another open source project- this time it’s a collection of Nagios plugins

So I’ve been pretty busy. I’ve finished the cleanup of the first two chapters of book 1; hopefully I’ll finish the rest shortly, but it’s very slow going. We’ll see where things head in the next few months- I expect more crappiness.

Guitar Lessons

So I’ve started teaching guitar again- this time the cash will go directly towards the G-400. As you can see on the sidebar, I’m now a hair closer. My new student is a coworker who is very excited to learn, so that makes things easy on several fronts (schedules, payment, attendance, etc). At this rate, I should have the guitar by next fall- sooner if I pick up another student (which is a possibility).

That said, I’m still accepting donations 😀

Reading Guitar Tab

So some of you may know that I’ve been working on a second book- this one is music-based. Anyways, I have a few friends who are new to guitar and my book is more or less aimed at them. However, some of them don’t know how to read tab- hence this post. So here’s the rundown:


e|-------------------------------------------
B|--------x----------------------------------
G|--------x-------5---5-7b9r7-5--------------
D|--------x--5h7---7--------------7~~~~------
A|---5/7--x----------------------------------
E|-------------------------------------------

The above is a sample of some tablature. Each of the 6 lines represents a string on the guitar, and each number represents a fret on that string. The lower case (small) e represents the “high e” string on the guitar (the thinnest one), and the rest fall into place from there. In the example above, to play the first note you’d place a finger right behind the 5th fret on the A string (second-fattest string), and pluck the string with a finger or a pick.

Above I’ve also laid out some basic notation, as listed below:

  • /: Indicates a slide between two or more frets, e.g. 5/7 says start on the 5th fret and slide to the 7th. A forward slash usually indicates sliding up, while a backslash indicates sliding down (e.g. 5/7\5\3).
  • x: Indicates a muted string. This is usually done with the fleshy edge of your palm on the pinkie side. In the instance above it’s used to set rhythm.
  • h: Indicates a “hammer-on”, where a note is struck and you hammer a finger on the next fret without actually striking the string a second time. By quickly pressing the following fret you retain the vibration from the previous note. This is often paired with p, pull-offs (e.g. 5h7p5 is 5, hammer on 7, release back to 5).
  • b: Bend a note. By stretching the string slightly sideways on the fretboard you can change the pitch of the note. Notes are usually only bent one or two frets (a half or whole step), and are occasionally bent back, which is signified by an r (e.g. 7b9r7 means bend the 7th fret to sound like the 9th, then back down to the 7th).
  • ~: Vibrato. There are two ways to do this- slightly vary the pressure on the string of a struck note so it wobbles back and forth, or bend it back and forth using the technique above very slightly, like a quarter of a step. It produces an effect similar to a whammy bar on an electric guitar. The more of these in a row, the longer you do it.
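
If it helps to see the notation above spelled out mechanically, here’s a small sketch that splits a run of tab from a single string into frets and techniques. The function name and the wording of the descriptions are my own, and it only knows the symbols listed above.

```python
import re

# Symbols from the list above, mapped to plain-English names.
TECHNIQUES = {
    "/": "slide up",
    "\\": "slide down",
    "h": "hammer-on",
    "p": "pull-off",
    "b": "bend",
    "r": "release",
}


def describe(run):
    """Split a run like '5h7p5' into fret numbers and technique names."""
    tokens = re.findall(r"\d+|[/\\hpbr]|~+|x", run)
    steps = []
    for tok in tokens:
        if tok.isdigit():
            steps.append(f"fret {tok}")
        elif tok == "x":
            steps.append("muted")
        elif tok.startswith("~"):
            steps.append("vibrato")
        else:
            steps.append(TECHNIQUES[tok])
    return steps


print(describe("7b9r7"))
# ['fret 7', 'bend', 'fret 9', 'release', 'fret 7']
```

It ignores the dashes between notes, so it says nothing about timing, which is about right for tab anyway.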

You’ll read through guitar tablature like the old pianos with the punch-card sheet music on a reel, playing each note as you go. Tab is meant to be a rough guide, so don’t expect exquisite timing details. Generally speaking, the farther apart the notes, the longer the pause; the closer the notes, the quicker the interval. Notes that appear in the same column are usually chords, and should be played in a single strum. Some tablature will define a set of used chords at the top, and simply refer to their name later on.

So that’s a quick intro to guitar tab. Let me know if I missed anything.

An Epiphany.

I’m the first to admit I’ve been slacking on my scales practice, mainly sticking to pentatonic (because I’m lazy). So while reading through my Scales and Modes book, I stumbled across something obvious that I’d never recognized. Each scale has a mode for each note in the scale- Major scale having 7, pentatonic scale having 5, etc. That I was remotely aware of, but didn’t think much of it.

I never really bothered with the major scale since it’s sorta boring, and felt overwhelmed by all of the basic scales (ionian, dorian, phrygian, etc) knowing that I’d have to learn their modes as well. Then the book pointed out that the first mode of the major scale was called the Ionian scale- wait, what? It turns out that all of those scales I feared learning didn’t have modes- they were modes- of the major scale!

So here’s how it lays out:


Ionian In C:   C D E F G A B 
Dorian In D:     D E F G A B C
Phrygian In E:     E F G A B C D
etc...

This means, rather than learning 7 scales with 7 modes each, I just have to get down the 7 modes of the major scale. So simple, yet I never put it together.
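
Put another way, each mode is just the major scale rotated to start on a different note. Here’s a quick sketch of that idea; the function and variable names are mine, and the four mode names the listing above hides behind “etc” are the standard ones (Lydian, Mixolydian, Aeolian, Locrian).

```python
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]
MODE_NAMES = ["Ionian", "Dorian", "Phrygian", "Lydian",
              "Mixolydian", "Aeolian", "Locrian"]


def modes(scale, names):
    """Rotate the scale once per note; each rotation is one mode."""
    return {name: scale[i:] + scale[:i] for i, name in enumerate(names)}


for name, notes in modes(C_MAJOR, MODE_NAMES).items():
    print(f"{name:<11}{' '.join(notes)}")
```

Same seven notes every time, only the starting point moves.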
