Unfinished Drafts: Proposal for New Server Implementation

Jesse Morgan

September 27, 2019

Builds have been a sore point for our team for some time. Common complaints include:

Reliance on a proprietary tool (HP RDP), which is Windows-based and owned by another team
Reliance on DNS entries for the build process, which may take days to go through
Limited knowledge of the build process across the team (only two team members fully understand it)
Lack of visibility and documentation of the process and details
Lack of centralized account management ownership
Build issues are slow to resolve (for example, no default JDK install, ulimit settings)
Newly built servers are not fully patched
Aged distributions (SLES 9, SLES 10) require hardware-specific drivers on newer hardware.

Beyond our build problems, we have further issues:

While we have done our best to address some of these non-build issues, only a full revamp of the build process will address the underlying problems.

Our build issues carry both obvious and indirect costs.

Builds require DNS Changes: RDP requires DNS entries, which in turn require Change Request windows. This can block a project for up to two days.
Inconsistency: Tracking down even simple production issues requires intimate domain knowledge due to the sheer number of one-offs.
Lack of Visibility: Without domain knowledge, tracking down an issue requires extensive sleuthing to find the right servers, pools, projects, iRules, etc.
Lack of Auditing: With no mechanism within the team to “circle back” and clean up after ourselves, unresolved issues sit for months, resulting in confusion later.
Lack of up-to-date Documentation: Much of our documentation is woefully out of date, leading to poor decisions based on bad intel.
Lack of Instrumentation: Applications consist of multiple layers, but due to firewall, code, authentication, and DNS constraints, they cannot easily be tested at every layer.
High Ramp-up Time for New Employees: Both the new employee and the trainer lose time while the new employee learns all of the nuances.
Context Thrashing: Humans aren’t nearly as good at multitasking as they think. The constant thrash of interruptions reduces efficiency.

Licensing: Only a small minority of our servers have valid SLES licenses, making update costs somewhat dubious. Updating via openSUSE/CentOS is a viable option, but it would place us in a hybrid environment.
- SUSE quoted around $260k to fully license and support
- Red Hat quoted significantly more to fully license and support
Support: Hardware support, software support, and offshore support are not cheap.

Suggested Solution