Archive for March, 2012
For grins, I went through Tom Limoncelli’s sysadmin questionnaire to see how “a team I have worked with previously” fares:
- A. Public facing practices:
- *1. Are user requests tracked via a ticket system? No, a high estimate would be 1/3rd of their requests are tracked.
- *2. Are “the 3 empowering policies” defined and published? No.
- 3. Does the team record monthly metrics? No. Outages are tracked by management, but that’s it. No alert stats, system-usage stats, etc.
- *4. Do you have a “policy and procedure” wiki? Yes, although admittedly it is missing quite a bit.
- 5. Do you have a password safe? No.
- 6. Is your team’s code kept in a source code control system? Most if not all, yes.
- 7. Does your team use a bug-tracking system for their own code? No.
- 8. In your bugs/tickets, does stability have a higher priority than new features? N/A
- 9. Does your team write “design docs”? No. We have a few, but it’s not S.O.P.
- 10. Do you have a “post-mortem” process? Yes, each week we do one for the previous week’s oncall.
- *11. Does each service have an OpsDoc? No.
- *12. Does each service have appropriate monitoring? No, probably only 60-70% coverage.
- 13. Do you have a pager rotation schedule? Yes. Each of us is on call one week out of every 9.
- 14. Do you have separate development, QA, and production systems? Yes, we have dev, qa, stage and prod.
- 15. Do roll-outs to many machines have a “canary process”? No.
- 16. Do you use configuration management tools like cfengine/puppet/chef? No, but I am working on implementing Puppet for our new builds.
- 17. Do automated administration tasks run under role accounts? No.
- 18. Do automated processes that generate email only do so when they have something to say? No, but this has greatly improved.
- *19. Is there a database of all machines? Yes, LDAP Inventory
- 20. Is OS installation automated? Yes, and we are improving it.
- *21. Can you automatically patch software across your entire fleet? No.
- 22. Do you have a PC refresh policy? No, presuming we’re talking about servers.
- *23. Can your servers keep operating even if 1 disk dies? Yes (as far as I know).
- 24. Is the network core N+1? Unknown.
- *25. Are your backups automated? Unknown.
- *26. Are your disaster recovery plans tested periodically? Never to my knowledge.
- 27. Do machines in your data center have remote power / console access? Yes, HP ILO.
- *28. Do desktops/laptops/servers run self-updating, silent, anti-malware software? No.
- *29. Do you have a written security policy? No.
- 30. Do you submit to periodic security audits? Yes, but they are very rudimentary.
- 31. Can a user’s account be disabled on all systems in 1 hour? No, too many one-off systems.
- 32. Can you change all privileged (root) passwords in 1 hour? No. We can change 90%, but not a handful of one-offs, which are difficult to identify.
Wow… that was… depressing. 10/32 = 31% that I could answer “yes” to with some degree of confidence.
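For anyone who wants to double-check the arithmetic, here is a minimal sketch of the tally. The question numbers are transcribed from the answers above; treating “Unknown”, “N/A” and “Never” as not-yes is my assumption, and the `score` helper is just a name made up for this example. It also splits out the questions marked with an asterisk in the list.

```python
# Question numbers answered "Yes" in the list above.
# Assumption: "Unknown", "N/A" and "Never" all count as not-yes.
YES = {4, 6, 10, 13, 14, 19, 20, 23, 27, 30}

# Question numbers prefixed with an asterisk in the list above.
STARRED = {1, 2, 4, 11, 12, 19, 21, 23, 25, 26, 28, 29}

TOTAL = 32


def score(yes, total):
    """Return (number of yes answers, percentage rounded to a whole number)."""
    return len(yes), round(100 * len(yes) / total)


overall = score(YES, TOTAL)
starred = score(YES & STARRED, len(STARRED))

print("overall: %d/%d = %d%%" % (overall[0], TOTAL, overall[1]))
print("starred: %d/%d = %d%%" % (starred[0], len(STARRED), starred[1]))
```

The starred subset comes out lower than the overall score, which is not any more cheerful.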
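Setting up a new NTP client and can’t tell if it’s syncing properly? Run
ntpq -c lpeers
and look for a peer whose line starts with an asterisk (*). That is the peer the client is currently synced to; if no line has one, the client has not selected a time source yet.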
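If you would rather script that check than eyeball it, the following is a rough sketch. It relies on the fact that ntpq prefixes the currently selected peer with `*` in its peer listing. The function names and the `SAMPLE` text are made up for illustration (the addresses are documentation placeholders, not real servers), and `check_local` assumes `ntpq` is installed and on the PATH.

```python
import subprocess

# Made-up sample in the column layout ntpq prints for a peer listing.
# The addresses are documentation placeholders, not real servers.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.512    0.021   0.012
+192.0.2.11      192.0.2.10       2 u   40   64  377    0.731   -0.105   0.034
"""


def find_sync_peer(peers_text):
    """Return the remote on the line marked '*', or None if there is none."""
    for line in peers_text.splitlines():
        if line.startswith("*"):
            fields = line[1:].split()
            if fields:
                return fields[0]
    return None


def check_local():
    """Run ntpq on this host and return the current sync peer, if any.

    Assumes ntpq is installed; raises CalledProcessError if it exits non-zero.
    """
    result = subprocess.run(
        ["ntpq", "-c", "lpeers"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_sync_peer(result.stdout)
```

Calling `find_sync_peer(SAMPLE)` returns the remote from the starred line. On a real host, `check_local()` returning `None` means the client has not settled on a source yet, which is normal for the first few minutes after ntpd starts.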