Archive for the ‘Work’ Category

Raw WinXP Virtualbox Partitions on a Thinkpad

Tuesday, July 13th, 2010

New job, new laptop. Many of the utilities here are Windows-only, so getting myself up and running efficiently takes a bit of… effort. My solution to the Windows problem is VirtualBox. Setting it up on my last laptop was easy, but this time around required more work. Hopefully the instructions below will help others get up and running quickly.

Disclaimer- your laptop may catch on fire and explode (or worse) if you attempt this… or something.

We’ll presume that you’ve already resized your Windows partition and have working Windows and Linux partitions.

In Windows

Log into XP, grab MergeIDE.zip from the VirtualBox site, extract it, and run it. It should flash briefly and be done. (Note: I am not 100% sure this step is needed.)

Create a new hardware profile named virtualbox, and make sure Windows offers it as a choice during boot. Reboot into native Windows once to confirm that it actually prompts you to choose a profile.

In Linux

You’ll need the following packages installed (names may differ on non-Ubuntu systems):
mbr, virtualbox-ose, virtualbox-ose-qt

Create a stand-alone mbr file to use for booting (yes, you need the force flag):

install-mbr ~/.VirtualBox/WindowsXP.mbr --force

We’re presuming that your Windows partition is /dev/sda1. The command below defines:

  • a vmdk file (WindowsXP.vmdk)
  • which raw disk to read (/dev/sda)
  • which partition (1)
  • the new MBR file we just created

VBoxManage internalcommands createrawvmdk -filename ~/.VirtualBox/WindowsXP.vmdk -rawdisk /dev/sda -partitions 1 -mbr ~/.VirtualBox/WindowsXP.mbr -relative -register

Note that your user needs read/write access to that drive. Adding your user to the disk group works, but it’s very dumb and insecure, so you may want to figure out a cleaner, more secure way to do this. I would, but it’s working and I have more important things to do at the moment.

Another issue: apparently ThinkPads report the drive heads and cylinders oddly (a T410 for me, a T60p in the original article), so we have to add some geometry settings to the vmdk before VirtualBox fills them in incorrectly. Open ~/.VirtualBox/WindowsXP.vmdk and add the following at the bottom:

ddb.geometry.biosCylinders="1024"
ddb.geometry.biosHeads="240"
ddb.geometry.biosSectors="63"

biosHeads appears to be the magic value: it works when set to 240, but the default of 255 fails.
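If you regenerate the vmdk more than once, re-adding those three lines by hand gets old. The Python sketch below sets them for you: any existing line for one of the keys is rewritten in place (so a key never appears twice), and missing keys are appended at the bottom, as in the manual edit. It assumes the file you pass in is the small plain-text descriptor that createrawvmdk writes, not a binary disk image; the function names are made up for illustration, and you should back the file up first.

```python
from pathlib import Path

# Values from the post; biosHeads=240 was the one that mattered on the T410.
GEOMETRY = {
    "ddb.geometry.biosCylinders": "1024",
    "ddb.geometry.biosHeads": "240",
    "ddb.geometry.biosSectors": "63",
}


def set_bios_geometry(descriptor: str, geometry: dict[str, str] = GEOMETRY) -> str:
    """Return descriptor text with each geometry key set to the given value.

    Existing key="..." lines are rewritten in place so a key never appears twice;
    keys that are missing are appended at the bottom, as in the manual edit.
    """
    lines = descriptor.splitlines()
    remaining = dict(geometry)  # copy, so the default mapping is never mutated
    for i, line in enumerate(lines):
        key = line.split("=", 1)[0].strip()
        if key in remaining:
            lines[i] = f'{key}="{remaining.pop(key)}"'
    for key, value in remaining.items():
        lines.append(f'{key}="{value}"')
    return "\n".join(lines) + "\n"


def patch_vmdk(path: str) -> None:
    """Rewrite the (hypothetical) descriptor path in place; back it up first."""
    p = Path(path).expanduser()
    p.write_text(set_bios_geometry(p.read_text()))
```

Usage would be something like `patch_vmdk("~/.VirtualBox/WindowsXP.vmdk")`, run before starting VirtualBox.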

Once you’ve added those, start VirtualBox and open the Virtual Media Manager; your new vmdk should be listed there. Once that’s confirmed, create a new virtual machine, and rather than creating a new disk, select your vmdk as an existing disk.

When you’re done, go to the VM’s Settings -> System and make sure the Motherboard tab has IO-APIC enabled (I also had PAE/NX enabled under Processor and VT-x enabled under Acceleration).

Start the VM

Several errors could pop up. I’m sure there are plenty more than the ones I stumbled across, but these were the two big ones:

  • “A disk read error occurred. Press Ctrl+Alt+Del to restart”: caused by an incorrect biosHeads value. Make sure it’s set to 240 (this was the fix for me; results may vary).
  • A complaint about KVM/VMX: VirtualBox does not like KVM. Uninstall qemu-kvm.

If things go well, you should see MBR flicker in the corner, followed by the hardware profile selection screen. Select the virtualbox profile, continue, and log in.

What follows is a half-hour of installing generic drivers and dealing with hardware-specific auto-start apps complaining that they won’t work on this installation. Windows will complain that the new drivers are not blessed, so be forewarned.

Once that’s done, select Devices -> Install Guest Additions from the menu at the top of the VM window. This will download and mount an ISO, and Windows will pop open a folder with the Guest Additions executables. Pick the one that matches your system and run the installer. It will prompt you for video and mouse drivers (and trust me, you want them).

The final step is to shut down the Windows VM, then reboot into the native Windows partition to make sure it still works. I did get a few blue screens before the login screen at first, but they seemed random and haven’t happened since.

And that’s all there is to it- simple, eh? Your Windows partition should now boot both natively and inside the VM.

The Philosophy of Monitoring

Wednesday, June 30th, 2010

For a system administrator, monitoring is a key job responsibility, yet arguments keep arising over how to implement it (usually with people who won’t be paged at 3am). Before writing this, I looked around for an article on the goals and philosophy of system monitoring, but found very little that really applied. Hopefully this will help set some expectations for admins, managers and stakeholders on what should be monitored, and why.

Why you Monitor

Before you set up a single monitor, you have to ask yourself, “what is the goal?” After all, why are you even setting something up? Here are a few common reasons for configuring monitors:

  1. Notification: Warning of an issue that requires intervention. What most people think of when you say “Monitoring”.
  2. Reactionary: Automatic actions are taken when certain criteria are met. If common countermeasures are automated, you’ll have less to handle manually.
  3. Informational: System status and historical trending allow you to show business customers that production “isn’t always down.” In reality, you may have 99% uptime, and downtime is often due to requested deployments. Statistical information can also be used for capacity planning.

Mentally dividing your monitors into groups will help you decide which ones require human involvement. It’s not uncommon to have several thousand monitors at any given time, so it’s important not to assign critical importance to all of them. A wise man once said “When all alerts are critical, none of them are.”

When you should NOT Notify

Some monitors have thresholds that check for certain conditions; when those conditions are met, you may want to send some type of alert to an administrator. There are two types of notifications – Active and Passive:

  • Active Notification: Immediate action is required: “Site is Down!” A phone call, page, or IM may be used to contact someone. Direct action is expected.
  • Passive Notification: Informational purposes only: “JVM Memory usage is high.” Information is logged, and perhaps an email is sent. No direct action is expected.

It’s easy to become addicted to passive notifications – but remember, data overload can mask important information. People quickly get in the habit of ignoring unimportant notifications. The question then is not so much “when should you notify,” but “when shouldn’t you?” What it really boils down to is “Can/should I do anything about it right now?” If the answer is no, an active notification probably isn’t warranted. Typical cases:

  • The issue is non-critical (disk space creeps above 90% on /var on a dev server at 2am on a Saturday, after several months of growth).
  • Nothing systemic is wrong (admins can’t fix “low sales”).
  • A 3rd-party system, such as a geocoding web service, is down.
  • The issue will resolve shortly, such as a backup server pegging the CPU during midnight backups.

Some of these alerts can be avoided by setting a correct monitoring window (e.g. ignore CPU during the backup window). Others simply can’t be addressed by administrators, although you may want to send informational emails to other members of the company (those managing 3rd-party SLAs or responsible for tracking online sales). The next step after getting an alert is to figure out what to do about it.
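The rules in this section boil down to a small decision: suppress the alert during a known quiet window, send it passively if nobody can or should act on it right now, and only page someone when it is both critical and actionable. The Python sketch below is one hypothetical way to encode that; the Alert fields, the "cpu" check name and the midnight-to-2am backup window are made up for illustration, and real monitoring systems (Nagios included) express these ideas through their own configuration rather than code like this.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Alert:
    check: str
    critical: bool    # would users or revenue be hurt right now?
    actionable: bool  # can an admin actually do anything about it?


# Hypothetical quiet window: ignore CPU alerts during nightly backups (00:00-02:00).
BACKUP_WINDOW = (time(0, 0), time(2, 0))


def in_window(now: time, window: tuple[time, time]) -> bool:
    start, end = window
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def route(alert: Alert, now: time) -> str:
    """Return 'suppress', 'passive' (log/email) or 'active' (page someone)."""
    if alert.check == "cpu" and in_window(now, BACKUP_WINDOW):
        return "suppress"  # expected load that will resolve shortly
    if not alert.actionable:
        return "passive"   # e.g. 3rd-party geocoder down: tell whoever owns the SLA
    if not alert.critical:
        return "passive"   # e.g. /var at 90% on a dev server
    return "active"        # site is down: page someone
```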

Reacting Properly

When a notification is sent out, there should be a definitive action that you can take. Think about why you were notified. There are a few rules to keep in mind when something goes wrong.

  1. Don’t Panic. When 100 alarms go off, your first instinct is to panic. Before you act, take a breath. Spend a moment getting your bearings and calming yourself. The worst possible thing you can do is flail: randomly making changes and restarting services without rhyme or reason can easily make the situation worse.
  2. Identify Obvious Patterns. What is the commonality? If a central system goes down, you may see many similar alerts. Dependencies help immensely by masking redundant alerts. A single database failure could take down a dozen sites. Which is better: getting a single alert that the database is down, or 250 alerts that various sites are down with one database notification somewhere in the middle? While 250 alerts may impress the gravity of the situation upon you, they may also instill panic and anxiety, which leads to flailing.
  3. Get things up and running as quickly as possible. Root-cause analysis can be tedious, time-consuming, and occasionally inconclusive. If you have a major system outage, don’t worry about doing root-cause analysis on the spot. Do what you need to in order to get things up and running – you can search the logs later. If the problem recurs, you’ll get another chance to investigate.
  4. Communicate with Stakeholders. The business units don’t need to know the details, but they do need to know that there is an outage and that it’s being addressed. If the situation is not quickly resolved, give them status reports. Be warned – any details you reveal will be warped and held against you. I’ve learned this one many times. People have a tendency to blame what they don’t understand. “Site is down? It must be a witch!” At my last job we had a “jump to conclusions” board with our favorite scapegoats – load balancer, connection pool, Endeca, etc. Everyone is guilty of it – business, devs, sysops, QA, etc. Even a one-time problem that has been resolved will be brought back up, even if it’s only tangentially related. Communicating too much information creates future scapegoats.
  5. Contact Domain Experts. If your java site is crashing and you’re not a java developer, get a java developer involved. If your DNS server falls down and the fix isn’t obvious, contact your DNS administrator. Expert eyes on the problem may resolve it quicker. Group chat is crucial for sharing information and talking out theories. Someone familiar with the code will know what the error messages mean.
  6. Fix the Problem. It should go without saying that if you find the problem, you should make every effort to resolve it. Workarounds are fine, just don’t let that band-aid become permanent. What often happens is that a workaround is put in place, the alert clears, and management, no longer feeling the pain, ignores the problem instead of putting forth the effort to fix it. When the next issue appears, a new fix is layered on the old. Band-aid is layered on band-aid. Eventually you’ll need to pull those band-aids off, and the more there are, the more painful it will be.
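The dependency masking described in step 2 can be sketched as a graph walk: of all the failing checks, report only the ones with no failing upstream dependency, direct or indirect. This is a minimal Python illustration with a made-up dependency map (two sites behind one Apache instance behind one database), not how any particular monitoring tool implements dependencies.

```python
# Hypothetical dependency map: each check -> the checks it depends on.
DEPENDS_ON = {
    "site-a": ["apache-1"],
    "site-b": ["apache-1"],
    "apache-1": ["db-1"],
    "db-1": [],
}


def root_causes(failing: set[str], depends_on: dict[str, list[str]]) -> set[str]:
    """Return the failing checks that have no failing dependency, direct or indirect."""

    def has_failing_ancestor(check: str, seen: set[str]) -> bool:
        for parent in depends_on.get(check, []):
            if parent in seen:
                continue  # guard against accidental cycles in the map
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return {c for c in failing if not has_failing_ancestor(c, set())}
```

With everything down, only the database is reported; the site and Apache alerts are masked because something they depend on is already known to be broken.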

How Much is Too Much?

Most administrators prefer to be proactive rather than reactive, resolving issues before they become a problem. Proper monitoring can be a great asset, but if you’re not careful it can cause problems of its own. For example, at a previous job we had a load balancer, Apache instances and Tomcat instances set up for each site. Each site had the following:

In our legacy monitoring system (SiteScope):

  • Health check on load balancer URL

In Nagios:

  • Health check on Apache instances
  • Health check on Tomcat instances
  • Health check on Load balancer URL

In Apache:

  • Health check on tomcat instances

In Load balancer:

  • Health check on Apache instances
  • Health check on Tomcat instances

Individually, these don’t seem that bad. If an Apache instance goes down, *of course* the load balancer needs to know so it won’t send traffic to that instance. The same goes for Apache watching Tomcat. The problem was the frequency of the checks: the load balancer was checking each monitor every five seconds. When a poorly load-tested site update was released, certain pages took 7 seconds to load. Things quickly went downhill as threads and processes backed up, crashing the site.

Balancing responsiveness with common sense is essential. Having a monitor check every minute won’t change the fact that it will take an admin 20 minutes to get to a computer, boot up, log into the VPN, and identify the issue. Don’t add to the problem by DoS’ing your applications.
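One rough way to see why five-second checks hurt once pages took seven seconds is Little's law: the average number of requests in flight equals the rate at which they start times how long each one takes. The Python sketch below applies it to health checks alone; the instance count (4 Apache + 4 Tomcat) and the 0.2-second healthy response time are hypothetical, and it ignores timeouts, retries, the Nagios checks and real user traffic, all of which add load on top.

```python
def checks_in_flight(targets: int, interval_s: float, response_s: float) -> float:
    """Average number of open health-check requests (Little's law: L = lambda * W).

    targets    -- instances being polled (hypothetical count)
    interval_s -- seconds between checks of each target
    response_s -- seconds each check takes to come back
    """
    arrival_rate = targets / interval_s  # checks started per second
    return arrival_rate * response_s


# Hypothetical: one load balancer polling 4 Apache + 4 Tomcat instances every 5 s.
normal = checks_in_flight(8, 5, 0.2)  # healthy pages come back quickly
slow = checks_in_flight(8, 5, 7)      # after the release: 7-second pages
print(f"{normal:.2f} -> {slow:.2f} checks open at once")
```

Once the response time exceeds the check interval, every target always has more than one check outstanding, so the monitoring itself holds threads open that real users need.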

Making Contact

One mistake I’ve seen is treating email as a reliable and immediate method of contact and expecting a quick response. My favorite is when someone sends you an email, then walks down to your desk immediately after and asks “did you see my email?” You check and see it was sent literally less than two minutes ago. People don’t reliably check their email. Admins especially don’t, due to the sheer volume we receive.

Email has its uses, but active contact in an emergency situation is not one of them. Personally, I only check my email when I think about it, which may mean large delays between when the message is sent and when it’s read. Couple that with spam filters, firewalls, solar flares and 500 other unread messages, and email becomes a less-than-reliable medium for emergency notifications (even during business hours).

Paging (or SMS) is preferable if you expect a quick response, although it is far from perfect. Just like email, SMS messages can be lost in the ether; however, recipients usually have their phones alert them when a message comes in, since that happens far less often than an email dropping into the inbox. That said, not every alert should be sent as a page, or apathy will quickly set in. The escalation path should look something like this (although not every step is needed):

  • Front-end web interface alert: The user would have to be actively browsing to see the status change. Usually the first clue that something is wrong; the dashboard shows the most recent status changes.
  • Email Alert: The user would have to be actively checking their email. Usually sent when something is first confirmed down.
  • Instant Message: The user would have to be at a computer and logged into IM to receive the alert. Rarely used, but an option during business hours.
  • Page/SMS: Reserved for emergencies. This means there is trouble.
  • Phone Call: Only used if the admin does not respond to the previous contact attempts. Usually performed by an irate manager.
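The escalation path above maps naturally onto a table of delays: the longer an alert goes unacknowledged, the more intrusive the contact method. In this Python sketch the minute thresholds, the business-hours rule for IM and the rule that only critical alerts get past email are hypothetical choices for illustration; most monitoring tools have their own escalation settings you would use instead.

```python
# Hypothetical escalation ladder, sorted by minutes unacknowledged.
ESCALATION = [
    (0, "dashboard"),
    (0, "email"),
    (10, "im"),     # business hours only, in this sketch
    (15, "page"),
    (45, "phone"),  # the irate manager
]


def channels(minutes_unacked: int, critical: bool, business_hours: bool) -> list[str]:
    """Return every channel that should have fired by now for an unacknowledged alert."""
    fired = []
    for threshold, method in ESCALATION:
        if minutes_unacked < threshold:
            break  # ladder is sorted, so nothing further applies yet
        if method == "im" and not business_hours:
            continue
        if method in ("page", "phone") and not critical:
            continue  # only emergencies get past email and IM
        fired.append(method)
    return fired
```

For example, a critical alert that sits unacknowledged for 20 minutes overnight has produced a dashboard entry, an email and a page, but no IM and no phone call yet.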

If you’re lucky enough to have a 24×7 call center/help desk, they can also be leveraged to resolve issues before a system administrator is needed. If recurring patterns start to emerge, automation can be used to deal with the problem (or better yet, you can fix the underlying issue). Sadly, many issues can’t be automated away or solved by a call-center staffer pressing a button. A real admin will eventually need to be contacted.

I don’t want to dig too deeply into on-call rotations, but an effort should be made to balance off-hours support with a personal life. Being on-call means no theaters, fancy dinners, or quality time with the family. Without balance, burnout will ensue.

Afflictions

System monitoring often brings out odd behavior in even the most steadfast of administrators. Some behaviors are relatively benign, while others can cause severe problems down the road. Identifying these behaviors before they cause a problem is just as important as having good monitors.

  • Data Addiction: Knowledge is power, but do not mistake information for knowledge. It’s possible to have 700 alerts without a single one identifying the underlying issue. One of my least favorite phrases is “Can we put a monitor on that?” It’s often uttered right after a one-off failure: the type of thing that fails once and, once fixed, will never cause a problem again. For example, take a new server where Apache was not configured to start after a reboot. When the server is restarted, you quickly find Apache is down, start it, configure it to auto-start, and move on. There is already a monitor on the websites hosted by that Apache instance, as well as a monitor on how many Apache threads are currently running; what purpose would another monitor serve? How often would it run? This is a prime example of how a data addict can spin out of control – too many useless monitors will mask a more important issue.
  • Over Automation: Automation is a wonderful thing; however, it’s possible to have too much of a good thing. In one instance, there was a ColdFusion server which would crash often. Rather than trace out the root cause, restarts were automated, then forgotten about. A few years later, it was found that the ColdFusion servers were restarting every twenty minutes, and no one knew about it – no one except the users. If each restart takes 20 seconds, that’s 26,280 twenty-second interruptions over the course of a year, which can translate into a bad user experience and lost sales. Make sure that automation is audited and verifiable, and doesn’t cause more trouble than it prevents.
  • Over Communication: While it is important to communicate with stakeholders, it is possible to overcommunicate. Stakeholders don’t need to know that there are 130 defunct Apache processes caused by a combination of a bug in mod_jk and the threading configuration in JBoss – all they need is “Site availability is intermittent – we’ve located the root cause and are working on a solution. More information to follow.” Details aren’t needed. Likewise, not every single person should be notified when an alert goes off – does your backup administrator need to know when a web server goes down? No. Does the DBA need to know when an SSL cert is about to expire? No. Tailor the messages to the correct audience. Most monitoring systems allow you to configure contact groups – use them.
  • Complexification: There are dozens of relationships between services, hosts, hostgroups, contacts, servicegroups, notification windows, dependencies, parents, etc. Try as you might, it’s usually impossible to perfectly model every relationship. Don’t become distracted by perfecting the configuration – focus on maintainability, scalability and accuracy. If you can’t add new systems and monitors, your configuration is too complex.
  • Reporting vs Monitoring: Reports are the more successful cousin of Alerts. They may superficially appear similar, but they serve entirely different purposes. Monitors should only be used to track and trend data and to notify if there is a problem, whereas reports take the collected data and massage it into an aggregated format. Monitors shouldn’t send out scheduled alerts. They can collect data, but they shouldn’t be used to present it to users. You’d be surprised how often someone asks for a monitor to send a nightly report. That slippery slope will turn your monitoring system into Crystal Reports.
  • False Positives: False positives are the scourge of the monitoring world. There are many causes, but the reaction is always the same – start to investigate, realize that it’s a false positive, and lose interest, knowing that nothing is broken. The problem is that a false positive leads to lazy behavior – if you’re pretty sure it’s a false positive, you don’t bother looking into it, figuring it will clear on its own. This trains people to have a “wait and see” mentality when alerts go off, causing unneeded delays when a major issue appears.
  • Apathy: It’s 2am on a Saturday, and you get paged that the CPU on a utility server is pegged. Without looking, you know that it’s the backup process copying the home directories, so you ignore it. The following Monday at 10am the QA JBoss instance stops responding. You know that it will clear within minutes because the QA team always rebuilds the QA instance Monday morning. When you get monitors constantly failing and recovering on their own, you start to ignore the pages that come in because you know they’re unimportant. It’s only a matter of time before you miss something important. If you have a situation that promotes apathy towards alerts, resolve it before something important is missed.
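The numbers in the ColdFusion example are worth working through, because twenty seconds at a time hides how much it adds up to. This quick Python check uses only the figures from the text (a restart every 20 minutes, 20 seconds each) and treats every restart as a full outage; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def restart_cost(restart_every_min: float, restart_takes_s: float) -> tuple[int, float, float]:
    """Return (restarts per year, hours of downtime per year, availability %).

    Treats each restart cycle as including the outage itself, which is close
    enough when the outage is tiny compared with the interval.
    """
    restarts = round(HOURS_PER_YEAR * 60 / restart_every_min)
    downtime_h = restarts * restart_takes_s / 3600
    availability = 100 * (1 - restart_takes_s / (restart_every_min * 60))
    return restarts, downtime_h, availability


restarts, downtime_h, availability = restart_cost(20, 20)
print(restarts, downtime_h, round(availability, 2))  # prints: 26280 146.0 98.33
```

That’s 146 hours, or about six days, of interruptions a year, and under 99% availability from that one silent cause alone.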

Don’t be [A]pathetic

I mentioned apathy above, but there’s a bit more to it – it’s not just admins that become apathetic. If an issue is identified, action must be taken to correct it. The ColdFusion case mentioned above is a great example of company apathy – failure of the business unit to prioritize it and failure of IT to push back hard enough. Someone once laughed to a former manager of mine about how their team had ignored his bug report for a full year. That’s not funny; it’s pathetic.

When management fails to address an issue, be it a known system problem or something as simple as morale after losing a team member, it shows the team that management doesn’t care. It soon becomes a vicious cycle: managers no longer care that the site is down, which in turn causes developer apathy. Developers then don’t care about code quality, leading to buggy code. Sysops stop caring that alerts are going off, leading to downtime. By the time the cycle is broken, it’s far too late – you’ve established a bad reputation with your customers.

Often this will start with unreasonable development expectations, causing devs to cut corners, QA to be rushed, and monitors to be forgotten. A balance must be maintained between getting code out the door and making sure that the code can stand up to the abuse it will receive when it goes live. It’s a team effort, and everyone must care (and keep caring) to keep the systems running.

Wow. Well, that’s a lot more than I intended to write. I should state that I am guilty of 75% or more of the bad behaviors listed here. I hope that this will help start a discussion on how to improve monitoring systems.

If you have feedback, suggestions or enhancements, please leave them in the comments.

(Thanks to jdrost, jslauter, keith4, pakrat, romaink, and my wife Jackie for their peer review/editing.)

home sick

Tuesday, March 24th, 2009

So, I’m home sick again. Fourth time this year I’ve been sick. Cold, flu, cold, bronchitis. Awesome. Will I get to rest today? No, of course not.

My server (Unicron) has been up and running for 2 years now- I got the parts right after Ian was born. At the time I set up a nice software RAID array that’s served me well. I’d never set up a RAID array like this before, so I wasn’t really sure how to monitor it, and since it had been running fine for 2 years, I just sorta let it slide.

Over this past weekend, I did some work resizing the LVM partitions and had to poke around with the RAID stuff. I found not one, but two ways to monitor it: one was to set up a monitor with the mdadm tools and have it email me if there was a problem, and the other led me to create a simple Nagios monitor. I set both up Sunday night.
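The post doesn’t include that Nagios monitor, so here is a minimal Python sketch of the same idea (not the script used here): read /proc/mdstat, flag any member marked (F), flag any array whose status string (like [UU_U]) contains an underscore, and return the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL, 3 for UNKNOWN). The parsing is based only on the mdstat layout quoted in the mdadm email later in this post; other layouts (bitmaps, resync lines, unusual states) may need more work. Note that an array that is still rebuilding also shows an underscore, so it stays critical until the rebuild finishes.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Array header, e.g. "md2 : active raid5 sda3[0] sdd2[3] sdc2[4](F) sdb3[1]"
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$")
# Member status on the following line, e.g. "... algorithm 2 [4/3] [UU_U]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def check_mdstat(text: str) -> tuple[int, str]:
    problems = []
    current = None  # the md device whose lines we are reading
    for line in text.splitlines():
        header = ARRAY_RE.match(line.strip())
        if header:
            current = header.group(1)
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(4))
            if failed:
                problems.append(f"{current} failed: {','.join(failed)}")
            continue
        status = STATUS_RE.search(line)
        if status and current and "_" in status.group(3):
            problems.append(f"{current} degraded {status.group(3)}")
    if problems:
        return CRITICAL, "RAID CRITICAL - " + "; ".join(problems)
    return OK, "RAID OK"


def main(path: str = "/proc/mdstat") -> int:
    try:
        with open(path) as f:
            code, message = check_mdstat(f.read())
    except OSError as exc:
        code, message = UNKNOWN, f"RAID UNKNOWN - cannot read {path}: {exc}"
    print(message)  # Nagios shows the first line of output
    return code
```

Hooked up as a plugin, the script would end with `sys.exit(main())` (after `import sys`) so Nagios sees the exit code.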

Flash forward to this morning- Jackie wakes me up and asks me if I’m going to work (I’d come home sick the day before and was still out of it). My intention was to wake up long enough to IM my manager and supervisor and let them know I was gonna be out sick. I do so, then minimize the IM window. Staring me in the face was the following:
[screenshot: RAID failure alert]

Wait, wait, wait- my script must be crappy, there’s no way the RAID array choked right after I set up the monitoring. I sorta go into denial and check my email:

This is an automatically generated mail message from mdadm
running on unicron

A Fail event had been detected on md device /dev/md2.

It could be related to component device /dev/sdc2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid5 sda3[0] sdd2[3] sdc2[4](F) sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]

md3 : active raid1 sdc1[0] sdd1[1]
979840 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
979840 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
192640 blocks [2/2] [UU]

unused devices: <none>

Frak.

So here I am, praying it was a hiccup while I reboot and rebuild my RAID array. It looks like sdc2 went nutty about 45 minutes after I went to bed. I restarted the server, sdc2 reappeared, and I’m rebuilding now to see what happens.

md2 : active raid5 sdc2[4] sda3[0] sdd2[3] sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
[=================>...] recovery = 85.0% (414436248/487211200) finish=26.8min speed=45182K/sec

One thing is for sure: I need to get an auxiliary drive in case this one goes kaput for real. I said I’d buy one in June… 2007. Suppose I’d better get on that, huh?

My apologies if this was nonsensical; I’m really tired.

new plugin test.

Thursday, March 12th, 2009

So I’m testing a nifty new WordPress plugin… check this out:

I wonder if it’ll work?

update: no, no it will not.

What’s up?

Thursday, January 29th, 2009

So I’ve been pretty quiet since I hit 100k words- what’s been going on?

  • Round of layoffs at work
  • Friend diagnosed with cancer
  • Another round of layoffs at work.
  • Jackie became a Pampered Chef consultant
  • Finances have been wiped out by Christmas and getting her Pampered Chef business off the ground.
  • 10% paycut at work
  • Guitar lessons are now done because no one can afford them.
  • Have been reading Manuscript Makeover for ways to improve my book
  • Decided to do an initial cleanup of the first draft of my script, then rewrite the outline before starting draft #2
  • Started yet another open-source project- this time it’s a collection of Nagios plugins.

So I’ve been pretty busy. I’ve finished the cleanup of the first two chapters of book 1; hopefully I’ll finish the rest shortly, but it’s very slow going. We’ll see where things head in the next few months- I expect more crappiness.

Free Jabber / XMPP clients for a Blackberry?

Wednesday, August 6th, 2008

Anyone know of any good Jabber clients for the BlackBerry? I’ve tried a couple with little luck, and most of them cost more than I can afford for this test. Required features:

  • Must run on BlackBerry 8703e v4.1.0x
  • Connection server can be configured differently than the JID address (i.e. you@morgajel.net for the JID, jabber.morgajel.net for the connection server). This rules out Mobber as far as I can tell.
  • Requires SSL/TLS
  • Non-strict cert checking

Let me know if you have any suggestions.

What’s blue and white and still not working?

Thursday, July 10th, 2008

My internet connection.

So here’s the scoop:

5 days until cutover:
I call AT&T and tell them I’m moving and need to transfer my static IP DSL service on the 30th (Monday). The tech says no problem, it’s all set. I am pleasantly surprised at how little hassle it was; it was way smoother than any other interaction I’ve had with them.

Saturday, 2 days until cutover:
We’re planning on doing the actual moving Sunday morning and plan to spend Saturday packing and planning. However at 3am Saturday morning, the internet connection drops, leaving me unable to contact many of the people who may be able to help us move. It sucks, but ok, we can work around it. I still have enough people to get by with and have ways to contact most of them. Since we asked to be connected on Monday, maybe they had to disconnect the old line the day before in order to get their stuff in place. Maybe they cut it on Saturday rather than Sunday because nobody works on Sunday. I get that, I can understand it. While annoying, it’s still better than my previous interactions with them.

Sunday, move day, day before cutover:
We move on Sunday and realize that we never actually checked to see if the house had any phone cables in it. It didn’t. Fortunately my father-in-law knows a bit about phone installation and was able to help me wire up a stub for the AT&T guy to connect to.

Monday, 1 day after move:
AT&T shows up, runs cable, says service will be enabled within X hours. Yippee.

Tuesday, 2 days after move:
Connection is there, but my IP address has changed. “Crap,” I think, “now I gotta update the DNS entries for our sites.” But I can understand this; perhaps my old static IP was tied to the network near my old apartment and didn’t reach this area. I can buy that. So I change my DNS entries… and they don’t work. I look again, and apparently I mistyped the IP, because the new DNS entry doesn’t match the external IP on the router. So I change it again, and 20 minutes later the external IP has changed again.

They had me on a fricking dynamic IP. For the non-techies out there: large ISPs only have a limited number of IP addresses, and more often than not don’t have one for every customer. Since few customers stay online 100% of the time, the ISP can take addresses away from people not using them and redistribute them as needed. This is called a dynamic IP account. For people who run servers from their homes, keeping the same IP is important: when your computer goes to connect to morgajel.com, it needs to be able to find the right IP address. That’s why I pay extra for AT&T to guarantee me the same IP address. That is why I am pissed. There are ways to get around this (dyndns), but they’re a pain in the ass and not an option for me since I run an IRC server as well.

So I tinker around, thinking maybe *I* did something wrong; maybe my router was reset and it cleared the static info. I dig around with Jackie’s help, find the original documentation, and try to manually configure the network settings it lists. No dice. Then I remember that yes, they did manage the info via the PPPoE settings, and that just required a user name and password, which is what I was originally using. I switch it back and get yet another dynamic IP. I should point out that my static IP range was 75.x.x.x, while the dynamic addresses stayed in the 66.x.x.x range; this made it easy to keep track of what was going on.
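Keeping track of which range the router landed in is easy to script. The Python sketch below checks whether the router’s external address falls inside the assigned static block, and optionally whether a hostname’s DNS record points at it. The block shown is a placeholder from the RFC 5737 documentation range standing in for the real 75.x.x.x addresses (which stay elided here), and how you obtain the router’s external IP is left out as an assumption.

```python
import ipaddress
import socket

# Placeholder static block (RFC 5737 documentation range), not the real 75.x.x.x one.
STATIC_BLOCK = ipaddress.ip_network("203.0.113.8/29")


def on_static_ip(external_ip: str, block: ipaddress.IPv4Network = STATIC_BLOCK) -> bool:
    """True if the router's external address falls inside the assigned static block."""
    return ipaddress.ip_address(external_ip) in block


def dns_matches(hostname: str, external_ip: str) -> bool:
    """True if the hostname currently resolves to the router's external address."""
    return socket.gethostbyname(hostname) == external_ip
```

An address outside the block means the ISP has handed out a dynamic IP again, and any DNS entries pointing at it will go stale.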

So I call them up and, surprise, surprise, they screwed up. See, they don’t really transfer accounts so much as shut off the old one and create a new one. The tech didn’t bother to notice I had a static account and replaced it with a dynamic account. I’m livid at this point, and tell them that it needs to be switched back. “Ok, I’ll put in the order. It’ll be ready in 10 days.” Now, this should NOT take 10 days from a technical point of view; this is all red tape causing the delay. But WTF can I do? So I say to hell with it and go along with it.

At some point my father-in-law comes back over to help with the baby gate and notes that the technician illegally ran the line through the neighbor’s yard. While I’m half tempted to yell at them to fix it, I just wanna get a connection up and running again so I can actually write about the house.

Saturday, 5 days after cutover (timeline gets a little fuzzy here)
Connection is still flaky, but generally working. I call to check on the status of the static IP order, and find out it was never placed. They’ll get right on that.

Sunday, 6 days after cutover
Connection goes down at 7:37am. Completely. It does not come back. Jackie calls tech support this time. Flames, brimstone, and cries of the undead ensue. Eventually I take the phone and find out there’s still no mention of a static order of any sort on our account. Guess what? They can’t do anything about it because “orders” isn’t open on weekends. They agree to send out a tech to look at the line since they can’t see the modem from their end. He should be out between 8am and noon on Monday.

Monday, 7 days after cutover:
Connection begins working again around 7am. I think to myself, “Great, maybe they just took it down to switch over to the static IP- finally I can get my stuff up.” Nope, still a dynamic IP address. I call AT&T to get the static IP address set up and let them know the connection is up. They say to hold off until the technician confirms it’s not an issue. OK, I’ll call back later. I spend my time waiting for the technician looking for other ISPs in the area on dslreports.com.

Technician comes out, nice guy, doesn’t see anything wrong, says he’s seen this behavior before when switching from dynamic to static, but the business won’t fess up to it. Whatever. At least the wiring was good, presuming that the installer and the inspecting tech were both competent. While he was tooling around, I found out that Cyberonic, my ISP from DC, covers this area (they didn’t in Grand Rapids or Rochester Hills). They resell business-class Covad lines to residential customers. I contemplate switching over to them, but figure it would be too much effort since I’ve gotten this far. I’m not even sure they’d have a decent plan in this area.

So he leaves and I call AT&T back and get the static all set up. She also said the static IP would be in place tomorrow. Just as we’re finishing she informs me that since I don’t have a contract, my payment will go up to $70 a month from $55. “WTF, this isn’t my screwup- you guys said you could transfer service, then you pooch it, then you want to charge me for it??”

“Oh, no,” she says, “When we transfer service, we don’t transfer contracts. If you want the original rate, you’ll have to sign up for another year of service.”

This is where Jesse snaps.

“You know what? Fine, make it the month-to-month price, because it’ll take me about 3 weeks to get Covad in here.” She was a bit shocked by that statement, and the conversation ended awkwardly. I think she was supposed to ask if I was pleased with my experience, but she knew the answer.

I then spent 10 minutes looking through DSL Reports for ISPs in the area and narrowing down their plans. Turns out that Cyberonic offers the same plan I had in DC for $60. Let’s compare the plans side by side:

                  AT&T        Cyberonic
Download speed    3 Mbps      6 Mbps
Upload speed      386 kbps    768 kbps
IP addresses      5 static    5 static
Stability         False       True
Cost              $55/mo      $60/mo

I call up Cyberonic; the phone is picked up on the 3rd ring. I tell the technician that I’m interested in their plan, get signed up, give my credit card info, etc. The entire call lasted 22 minutes and 28 seconds. I was never transferred once, my call was never dropped, the technician never once said “I don’t know,” and they were going to do a hot swap on the line and cancel the AT&T DSL for us, since we obviously can’t have 2 DSL services on the same line. The transfer should take place in the next 7-14 business days.

I’d like to point out that AT&T still hasn’t gotten their act together as of this morning (Thursday), and dropped my connection while I was beginning a deployment for work. That was real awesome, btw. Thankfully my neighbor is letting us use his wireless connection until we get it straightened out. If the issues aren’t resolved by switching to Cyberonic, I’ll have the neighbor report the cable crossing his yard so they’ll have to come out and redo it (this is my backup plan).

The good news is we’ve moved our blogs to gopedro.net. I’m still in the process of converting them, but expect to be done by next Monday. The only site that will still point to my static IP is morgajel.com, for my streaming music, IRC server, etc. We’ve also decided to move all of our pictures to flickr, so expect to see broken images for a while.

I really want to thank gopedro.net for their help in all of this. I highly recommend them for any domain name purchases or hosting. They’ve been handling our domain names for years now, and their service is outstanding. I’d also like to thank our new neighbor Bobby for being one hell of a cool guy.

I’ll keep you updated on how things go. Hopefully I’ll start writing about the house soon.

*UPDATE 2008-07-14*
Cyberonic called and told me they’d be sending a technician out tomorrow to verify the lines. Hopefully I should have a working connection soon.

*UPDATE 2008-07-16*
My bad, it was Wednesday. Connection is up now and I’m back online with a static IP!

What’s on QA…

Friday, May 16th, 2008

Had an amusing conversation with a developer at work that felt like an Abbott and Costello bit. I’m leaving his name out of this to protect him, but he’s read this site before and will know instantly that it’s him. The conversation revolves around our new continuous integration system and how the terminology has changed.

BTW: QA = Quality Assurance, UAT = User Acceptance Testing (staging)

(10:50:50 AM) morgajel: ok, so after talking to mick, it looks like my suspicions were correct
(10:50:54 AM) morgajel: there is no QA
(10:51:04 AM) morgajel: the name QA is misleading.
(10:51:16 AM) DeveloperX: hmmm
(10:51:25 AM) morgajel: trunk is for development, release is for qa, uat and prod
(10:51:26 AM) DeveloperX: what do u mean
(10:52:47 AM) DeveloperX: i have a release branch…
(10:52:50 AM) morgajel: think about it- anything going to QA is being marked as a release.
(10:53:12 AM) morgajel: so so release gets deployed to QA, then if it passes it goes on up the chain to production
(10:53:52 AM) DeveloperX: this is how the system currentlyis
(10:54:23 AM) DeveloperX: it seems to make snese to me? assueming nothing makes it ti UAT unlees it goes through QA
(10:54:36 AM) morgajel: that’s the way it should be.
(10:54:49 AM) DeveloperX: i think thats how it is
(10:54:57 AM) morgajel: the difference between this and the old system is, rather than labelling it projectX[QA], they’re now calling projectX_release_
(10:55:07 AM) DeveloperX: no? i only have a trunk and a release
(10:55:11 AM) morgajel: that’s the main difference for you
(10:55:20 AM) DeveloperX: which means the only thing goingto UAT will be from the release
(10:55:51 AM) morgajel: correct- the only thing going to qa or uat or prod will be from the release.
(10:56:12 AM) DeveloperX: so why can’t we still call it QA?
(10:56:42 AM) morgajel: because it’s *more* that QA- QA was a bad name for it from a organizational perspective
(10:57:32 AM) DeveloperX: is his only going to affect me, or is this the new approach for everyone?
(10:57:40 AM) morgajel: for everyone
(10:58:27 AM) DeveloperX: so your saying there is no pointin multiple srevers?
(10:58:33 AM) morgajel: no
(10:59:04 AM) morgajel: I’m saying whatever we put on UAT should be the EXACT same thing as QA
(10:59:13 AM) DeveloperX: so we will still push from trunkto release to UAT?
(10:59:20 AM) morgajel: and whatever we put in prod should be the EXACT same thing as both of them.
(10:59:24 AM) DeveloperX: it is
(10:59:28 AM) DeveloperX: currenlty
(10:59:29 AM) morgajel: well, from your side, here’s what happens
(10:59:37 AM) DeveloperX: at least for anything i;ve worked on
(10:59:44 AM) morgajel: you guys develop in trunk, and when you’re ready to release, you mark it as release.
(10:59:53 AM) DeveloperX: k
(10:59:55 AM) morgajel: then the QA people will check out release to the QA servers
(11:00:04 AM) DeveloperX: k
(11:00:10 AM) morgajel: once they give it a go, we will check it out to UAT,
(11:00:22 AM) DeveloperX: k
(11:00:24 AM) morgajel: once that’s good, it will be checked out from release toprod,
(11:00:54 AM) DeveloperX: put the code that is working it’s way up is exacly the same in each instance
(11:01:12 AM) morgajel: correct- as it should be.
(11:01:18 AM) DeveloperX: it is jsut being push multiple times, to make sure everything will work when we finally hit orid
(11:01:20 AM) DeveloperX: prod
(11:01:27 AM) morgajel: correct
(11:02:06 AM) morgajel: now, I’m not sure what the exact mechanism will be for pushing to uat and prod- we may check it out directly from svn, but more likely than not, we won’t.
(11:02:33 AM) DeveloperX: ya, i would guess not
(11:02:37 AM) morgajel: the important thing is we can say “release 11234234 of projectX” and can match it up with a specific commit in subversion.
(11:03:50 AM) DeveloperX: i gotcha there…we used to havesomethiing like that, but it completely maintianed by me
(11:04:16 AM) DeveloperX: i have a new release branch for every deployment i make to projectX
(11:04:34 AM) DeveloperX: so I code in theory go back to 3deployments ago
(11:04:46 AM) DeveloperX: but you guys should have access to this…not just me
(11:05:35 AM) DeveloperX: make sense?
(11:08:09 AM) morgajel: you’re close
(11:08:23 AM) morgajel: I’m presuming you’re talking about http://buildserver/svn/projectX/branches/, correct?
(11:08:32 AM) morgajel: each of those listed is a different release branch, correct?
(11:09:01 AM) morgajel: what they’re doing is instead of creating a folder for each release, they’re creating a single folder
(11:09:17 AM) morgajel: and making note of the commit version when they commit acode to it
(11:10:14 AM) morgajel: hence when subversion says “committed version #1123223″,you’d call it “release 1123223″ in your notes
(11:10:30 AM) morgajel: I think that’s what brandon is trying to aim for, but we’ll have to have him verify.
(11:11:00 AM) DeveloperX: ok, i think i gocha
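The release-numbering idea at the end of that chat (“committed revision 1123223” becomes “release 1123223”) can be sketched in a few lines of shell. This is a minimal sketch, not our actual build tooling: the `release_label` function name and the `projectX` argument are hypothetical. The only thing it relies on is that a successful `svn commit` ends with a line like `Committed revision 1123223.`

```shell
#!/bin/sh
# Sketch: turn `svn commit` output into a release label of the form
# "release <revision> of <project>". Function name is hypothetical.
release_label() {
    project=$1
    # Read the commit output on stdin and keep only the revision number
    # from the "Committed revision N." line (last one wins).
    rev=$(sed -n 's/^Committed revision \([0-9][0-9]*\)\.$/\1/p' | tail -n 1)
    if [ -z "$rev" ]; then
        echo "no committed revision found" >&2
        return 1
    fi
    echo "release $rev of $project"
}
```

Committing the merged code in the release folder with something like `svn commit -m "Merge trunk into release" | release_label projectX` would then print a label such as `release 1123223 of projectX`, which maps straight back to that commit in subversion.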

Request Tracker 3.6.5 broken after updating CentOS

Monday, April 7th, 2008

Can’t locate object method “seek” via package “File::Temp” at /usr/lib/perl5/site_perl/5.8.8/MIME/Parser.pm line 816.

The underlying problem is that a perl update overwrote the “correct” version of File::Temp that you probably installed when setting up RT and forgot about. To fix this issue, reinstall File::Temp and restart Apache:


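cpan -i File::Temp           # reinstall File::Temp from CPAN over the version the perl update put back
/etc/init.d/httpd restart    # restart Apache so RT loads the reinstalled module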

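MAKE SURE TO RESTART APACHE! I didn’t, and it cost me probably 2 hours of screwing around with it. If RT runs under mod_perl, Apache keeps the old File::Temp loaded in memory until it restarts, so the fix doesn’t take effect until then.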

I’m posting this because the thread at http://www.nabble.com/RT-3.6.5-and-Sendmail-error-and-looks-like-perl-error-td15989015.html didn’t really mention what the final working solution was.

LDAP + Sudo + TLS fix

Tuesday, October 9th, 2007

For those of you who can’t get those three to work together, make sure you specify both TLS_CACERT and tls_cacertfile. I didn’t, and it caused me grief.
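A quick way to check a config file for both directives, as a minimal shell sketch. Assumptions: `check_ldap_tls_conf` is a made-up helper name, the default path `/etc/ldap.conf` is a guess at where your sudo/LDAP settings live (pass your own file otherwise), and the CA path in the comments is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether an LDAP client config sets both CA directives.
# Default path /etc/ldap.conf is an assumption; pass your own file.
# Example lines it looks for (CA path is hypothetical):
#   TLS_CACERT     /etc/openldap/cacerts/ca.pem
#   tls_cacertfile /etc/openldap/cacerts/ca.pem
check_ldap_tls_conf() {
    conf=${1:-/etc/ldap.conf}
    missing=""
    # Require whitespace after the keyword so TLS_CACERT does not
    # also match a TLS_CACERTFILE line.
    grep -Eq '^[[:space:]]*TLS_CACERT[[:space:]]' "$conf" || missing="$missing TLS_CACERT"
    grep -Eq '^[[:space:]]*tls_cacertfile[[:space:]]' "$conf" || missing="$missing tls_cacertfile"
    if [ -z "$missing" ]; then
        echo "ok: both CA directives set in $conf"
        return 0
    fi
    echo "missing in $conf:$missing"
    return 1
}
```

Run it as `check_ldap_tls_conf /etc/ldap.conf` (or whatever file your setup reads); anything listed as missing is a likely source of the grief.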