
I am a Sr. Systems Engineer for Compendium Blogware. I maintain the production systems as well as some of the support systems for the engineering group. When Compendium's blog software goes down, I am the one whose phone rings off the hook.
Some of the technologies that we use, and that I will be posting about, are Apache httpd, PHP, memcache, IPVS, Amazon's EC2, SQS, S3 and CloudFront, MySQL, CentOS, trac and bacula.
Here at Compendium in the product group we take the opportunity on a weekly basis to improve the base level of knowledge of our employees by getting together and discussing the world of computer software. We alternate weeks between code reviews of internally produced work and a reading group.
Last week we ran a reading group. This quarter our reading groups are structured such that we select a topic and everyone in the group presents some relevant information on a sub-topic. This gives everyone the chance to delve a little deeper into something that they are interested in while improving the knowledge of everyone.
Although the information we discuss doesn't always immediately apply to the task at hand with respect to blogging software, we do sometimes learn something that comes in handy later.
In a past reading group about algorithms, I found this post:
http://brainz.org/15-real-world-applications-genetic-algorithms; and although blogging software isn't on the list of applications for genetic algorithms, marketing techniques are. Marketing techniques are at the core of what business blogging is all about. This demonstrates effectively the breadth of reach a seemingly irrelevant topic might have and why it's important to continue to think about ideas that might not necessarily seem relevant at first glance.
This might be the most pertinent reason for blogging. A blog simply is the best way to bring traffic to your Web site. This is especially the case if you are rolling out a new Web site that needs to establish its presence.
Blog administrators who consistently add valuable content over time will find their stock rising (and their Google PageRank heading north). Search engines like Google consider three major factors when returning search results:
- Authority: Do other people consider your Web site to be important?
- Longevity: How long has your Web site been around?
- Relevance: How relevant is your Web site to the search terms entered?
Once you have a general understanding of the process that goes into search results, you can more efficiently dissect each point. Authority is primarily based on links: linking to other important sites and having important sites link to you. The higher the PageRank of the linking site, the more valuable the link will prove. (Domain names ending in .edu and .gov typically provide stronger results.) Unfortunately, you can’t do anything about longevity, unless you’re buying a blog that has been around for a while – this element requires patience and a consistent effort. Relevance is based on the keywords and links used in your blog. Researching and implementing a keyword strategy is an essential step in beginning any blog. The average blogger can improve both authority and relevance, thereby improving his or her organic search ranking on certain keywords.
- MT
Having recently moved our blog hosting infrastructure onto Amazon's EC2 cloud system, I have been debating whether to review our monitoring solution. We have been using Zenoss for about a year, both as a graphical system for identifying potential problems and as an alerting mechanism.
When I last looked into potential solutions, I was most familiar with nagios, having set it up a couple of times in the past, but I was lured to Zenoss by the built-in graphical interface and the promise of a web API that I could use to automate the addition and removal of nodes. As it turns out, the API is not particularly easy to use, and Zenoss has had several bugs over the past year, some of which have cost me a significant amount of time.
Now that I've moved from a traditional co-lo to EC2, I am intrigued by CloudWatch, but not enough to switch. The reasons for this are primarily cost and flexibility. Running CloudWatch at ~$10/server/month quickly becomes a large expense when compared to $74/month for a single m1.small instance that can be used to monitor many servers at a fixed cost. Further, with that small instance running Zenoss, I can trigger alerts on anything that I like. I am not limited to the datapoints that CloudWatch monitors.
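To make the cost argument concrete, here is a rough sketch of the arithmetic in Python. The two prices are the figures quoted above; the function names are made up for illustration, and the sketch assumes a single monitoring instance really can watch the whole fleet at no extra cost.

```python
import math

CLOUDWATCH_PER_SERVER = 10.0  # ~$10/server/month, the figure quoted above
MONITOR_INSTANCE = 74.0       # $74/month for one m1.small, the figure quoted above


def cloudwatch_cost(servers: int) -> float:
    """Monthly CloudWatch cost: grows linearly with the number of servers."""
    return servers * CLOUDWATCH_PER_SERVER


def self_hosted_cost(servers: int) -> float:
    """Monthly cost of one monitoring instance.

    Assumption: a single instance can monitor every server, so the
    cost is flat no matter how many servers there are.
    """
    return MONITOR_INSTANCE


def break_even_servers() -> int:
    """Smallest fleet size at which CloudWatch costs more than the instance."""
    return math.floor(MONITOR_INSTANCE / CLOUDWATCH_PER_SERVER) + 1


if __name__ == "__main__":
    for n in (2, 7, 8, 20):
        print(n, cloudwatch_cost(n), self_hosted_cost(n))
```

With these numbers the crossover lands at 8 servers; below that, CloudWatch is the cheaper of the two on price alone.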
In conclusion, if I were running only a couple of instances, or I felt access to EC2 auto-scaling was a requirement, it might be worth the cost to run CloudWatch. But if you're running a large number of servers and are willing to give up auto-scaling (or build out a solution yourself, or use a 3rd party tool like rightscale), then CloudWatch just doesn't make sense right now.
I'm considering taking a new approach to blogging. I recently discovered that I am effective at obtaining unique information. So, my plan is to attempt to answer questions for you.
If there is anything at all that you would like to know about blogging or systems engineering or systems administration, leave a comment on one of my posts and I'll do my best to apply the scientific process.
The question that spawned this: I was asked to determine how many lights are used in the signature sign at the Wynn hotel in Las Vegas. The answer: 2500.
I recently implemented an email check on our servers in order to ensure consistency in email delivery. The check sends an email through the local SMTP instance, which then passes that message off to our production mail service. The same machine then polls a remote IMAP mailbox to ensure that the message arrives at its destination within an acceptable (configurable) period of time.
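For anyone who wants to try something similar, here is a minimal sketch of such a round-trip check in Python using only the standard library. It is not the check I actually run: the function names, the "mailcheck" subject tag, and the default timeout and interval are all assumptions chosen for illustration.

```python
import imaplib
import smtplib
import time
import uuid
from email.message import EmailMessage


def send_probe(smtp_host, sender, recipient, port=25):
    """Send a uniquely tagged message through the local SMTP instance."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"mailcheck {token}"  # hypothetical subject format
    msg.set_content("delivery probe")
    with smtplib.SMTP(smtp_host, port) as smtp:
        smtp.send_message(msg)
    return token


def imap_has_token(imap_host, user, password, token):
    """Return True if a message whose subject contains the token is in INBOX."""
    conn = imaplib.IMAP4_SSL(imap_host)
    try:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)
        status, data = conn.search(None, "SUBJECT", f'"{token}"')
        return status == "OK" and bool(data[0].split())
    finally:
        conn.logout()


def wait_for(found, timeout=300.0, interval=15.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll found() until it returns True or the timeout would be exceeded.

    Returns the elapsed seconds on success, or None if the message did not
    show up in time. clock and sleep are injectable so the loop can be
    exercised without real waiting.
    """
    start = clock()
    while True:
        if found():
            return clock() - start
        if clock() - start + interval > timeout:
            return None
        sleep(interval)


# Usage (hosts and credentials are placeholders):
#   token = send_probe("localhost", "check@example.com", "probe@example.com")
#   elapsed = wait_for(lambda: imap_has_token("imap.example.com",
#                                             "probe@example.com", "secret", token))
```

Note that every poll in this sketch opens a fresh IMAP connection, so the polling interval directly determines how many connections each machine makes.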
The problem that inevitably arose from this was that the remote IMAP service we were using began blocking connections. First, we tried gmail, or rather google apps, to handle this, but I frequently received the error message "Web login required." After some research, I tried a couple of suggestions, including selecting the option to always use SSL through the web interface and turning off captchas for the account. This did not solve the problem. Presumably, gmail also rate-limits connections, but does not publish the rate limit that causes this message to be generated for an account.
I then searched for a free or cheap IMAP service to use. I found fastmail.fm. They seemed to have good reviews and they publish a lot of information about their infrastructure, which is always a good thing. Three minutes into testing their service, I got a rate-limit message from them as well. There is a maximum of 100 connections to the account in 600 seconds. A single account shared by lots of machines is clearly not a scalable solution.
After several days of testing this off and on, the new plan is to host IMAP ourselves. As a general rule, I'd prefer to spend my time improving the hosting environment for our blogging application, but when you can't find someone willing to host the tool that you need, sometimes you end up having to roll your own.
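A quick back-of-the-envelope sketch shows how fast a shared account hits that ceiling. The 100-connections-per-600-seconds limit is the one quoted above; the check interval and the number of IMAP connections per check are hypothetical parameters, not measurements from my setup.

```python
LIMIT_CONNECTIONS = 100  # limit quoted above
WINDOW_SECONDS = 600     # window quoted above


def connections_per_window(machines, check_interval, polls_per_check):
    """Total IMAP connections that all machines make within one window."""
    checks = WINDOW_SECONDS / check_interval
    return machines * checks * polls_per_check


def max_machines(check_interval, polls_per_check):
    """Largest number of machines that stays within the limit."""
    per_machine = (WINDOW_SECONDS / check_interval) * polls_per_check
    return int(LIMIT_CONNECTIONS // per_machine)


if __name__ == "__main__":
    # Assumed: one check every 5 minutes, 2 IMAP connections per check.
    print(max_machines(check_interval=300, polls_per_check=2))
```

Under those assumed numbers a single account tops out at 25 machines, and tightening the check interval to one minute with a single poll drops that to 10.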
Over the next several days I will be creating a series of posts that describe how I am using Puppet. I wrote previously about why puppet sucks and about how I'd rather be using chef, but alas, I am using puppet and it does get the job done.
The reason I am posting now is that I finally feel that I have a handle on how to effectively lay out configuration files and manage machines with puppet. I hope that this information is useful to someone.
The first thing that you need to keep in mind is to separate your configs as much as makes sense. For example, every service should have its own config file. Here is my /etc/puppet hierarchy:
/etc/puppet/
/etc/puppet/puppet.conf
/etc/puppet/fileserver.conf
/etc/puppet/autosign.conf
/etc/puppet/manifests
/etc/puppet/manifests/users
/etc/puppet/manifests/users/virt_systems_users.pp
/etc/puppet/manifests/users/jlitton.pp
/etc/puppet/manifests/users/dmartin.pp
/etc/puppet/manifests/users/virt_dev_users.pp
/etc/puppet/manifests/users/developer.pp
/etc/puppet/manifests/users/virt_developer_user.pp
/etc/puppet/manifests/users/bmatheny.pp
/etc/puppet/manifests/groups
/etc/puppet/manifests/groups/develper.pp
/etc/puppet/manifests/groups/virt_wheel_group.pp
/etc/puppet/manifests/nodes.pp
/etc/puppet/manifests/services
/etc/puppet/manifests/services/varnish.pp
/etc/puppet/manifests/services/blog.pp
/etc/puppet/manifests/services/lvs.pp
/etc/puppet/manifests/services/sql.pp
/etc/puppet/manifests/services/presentation.pp
/etc/puppet/manifests/services/services.pp
/etc/puppet/manifests/services/xen.pp
/etc/puppet/manifests/services/daemon
/etc/puppet/manifests/services/daemon/postfix.pp
/etc/puppet/manifests/services/daemon/bacula-server.pp
/etc/puppet/manifests/services/daemon/apache.pp
/etc/puppet/manifests/services/daemon/mon.pp
/etc/puppet/manifests/services/daemon/nfs.pp
/etc/puppet/manifests/services/daemon/logrotate.pp
/etc/puppet/manifests/services/daemon/ssh.pp
/etc/puppet/manifests/services/daemon/bacula-client.pp
/etc/puppet/manifests/services/daemon/ntp.pp
/etc/puppet/manifests/services/daemon/stunnel.pp
/etc/puppet/manifests/services/daemon/syslog-ng.pp
/etc/puppet/manifests/services/daemon/ampstack.pp
/etc/puppet/manifests/services/daemon/yum.pp
/etc/puppet/manifests/services/daemon/snmp.pp
/etc/puppet/manifests/services/daemon/named.pp
/etc/puppet/manifests/services/daemon/memcache.pp
/etc/puppet/manifests/services/daemon/tftpd.pp
/etc/puppet/manifests/services/daemon/heartbeat.pp
/etc/puppet/manifests/services/daemon/dhcp.pp
/etc/puppet/manifests/services/daemon/mysql.pp
/etc/puppet/manifests/site.pp
/etc/puppet/manifests/os
/etc/puppet/manifests/os/redhat.pp
/etc/puppet/manifests/environments
/etc/puppet/manifests/environments/development.pp
/etc/puppet/manifests/environments/test.pp
/etc/puppet/manifests/environments/production.pp
/etc/puppet/manifests/templates
/etc/puppet/manifests/templates/base.pp
/etc/puppet/manifests/templates/xenguest.pp
Admittedly this could use some cleanup, but the point is if you are going to manage any significant number of machines, you need to start defining services from the beginning.
This may seem a bit chicken-before-the-egg, as it doesn't describe how to effectively use any of these files, but I'll post again shortly with an example nodes.pp and base.pp to help define what a base class looks like.
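If you want to start from the same skeleton, the layout listed above can be stubbed out with a few lines of Python. This is only a sketch: it creates the directories and empty top-level files from my listing under whatever root you give it, and the scaffold function is a name I made up, not part of puppet.

```python
import tempfile
from pathlib import Path

# Directories under manifests/, taken from the listing above.
MANIFEST_DIRS = [
    "users",
    "groups",
    "services",
    "services/daemon",
    "os",
    "environments",
    "templates",
]

# Top-level files from the listing above, created empty.
TOP_LEVEL_FILES = [
    "puppet.conf",
    "fileserver.conf",
    "autosign.conf",
    "manifests/site.pp",
    "manifests/nodes.pp",
]


def scaffold(root):
    """Create the directory skeleton under root and return what exists there."""
    root = Path(root)
    for rel in MANIFEST_DIRS:
        (root / "manifests" / rel).mkdir(parents=True, exist_ok=True)
    for rel in TOP_LEVEL_FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    # Try it in a scratch directory rather than the real /etc/puppet.
    for entry in scaffold(tempfile.mkdtemp()):
        print(entry)
```

Running it twice is harmless, since existing directories and files are left alone.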
Last week's reading group was about algorithms and Aaron presented about Genetic algorithms because he did his doctoral thesis on a related subject.
Today I found this post: http://brainz.org/15-real-world-applications-genetic-algorithms/ and although blogging software isn't on the list of applications for genetic algorithms, marketing techniques are. Marketing techniques are at the core of what business blogging is all about. This demonstrates effectively the breadth of reach a seemingly irrelevant topic might have and why it's important to continue to think about ideas that might not necessarily seem relevant at first glance.
I read an article yesterday about
Today I spent time dealing with an issue that could have been avoided by sticking to best practices. Don't cut corners. It's just not worth it.
I had locked myself out of an EC2 instance. I ran an iptables rule on the machine last night and checked that everything was working as expected. I knew that iptables was not the right tool for the job, because I've used security groups, but I was "just testing something" and didn't think that the rule was appropriate for the group, so I went ahead. Everything seemed fine.
This morning I discovered that I could no longer access that instance. In fact, I could not access a single open port. This was not OK. I did not need to deal with data loss here (the image was brought up before EBS existed and it wasn't originally being used for much). Anyhow, I rebooted the box and everything was accessible again. I'm not convinced that the iptables rule was what broke access to the machine, but there was no reason to even be considering it. It's easy enough to add another security group and bring up another instance that there was no reason to even have a question about this.
In summation, just stick to best practices and spend the 15 minutes up front doing it right; don't spend half an hour in a panic fixing it tomorrow. At least tools like EC2 exist that make the right thing easy and cheap by design.
I just stumbled across this article:
http://agiletesting.blogspot.com/2009/02/load-balancing-in-amazon-ec2-with.html
The suggestion made is that, since EC2 does not yet have a built-in load-balancing feature, you must use a software load-balancer; haproxy is specifically mentioned.
I've been using a software load-balancer to host Compendium's web services for nearly a year now, but it's coming time to re-evaluate. I've been using ipvsadm to manage the ipvs linux kernel module, with NAT for load-balancing. I decided to use this instead of something like haproxy primarily because it is not limited to proxying http requests. When the first generation hardware for our application was being deployed, we needed to conserve resources while still maintaining redundancy to whatever extent possible.
A year in, the primary issue that I have yet to solve with ipvsadm is dynamically updating the configuration. Disappointingly, haproxy does not seem to solve this either. What I would like to see is a load-balancer that allows me to make changes to the configuration programmatically, either through a configuration maintained in memcache or, even better, via API calls to a web-service endpoint.
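To illustrate the kind of programmatic update I have in mind, here is a small Python sketch that works out which ipvsadm commands would bring a virtual service's real servers in line with a desired list. This is not something I have running: the reconcile function is hypothetical, the desired list is assumed to come from somewhere like memcache or an API call, and the sketch only builds the commands rather than executing them. The -a, -d, -t, -r and -m (masquerading, i.e. NAT) flags are standard ipvsadm options.

```python
def reconcile(vip, current, desired):
    """Return the ipvsadm commands that turn `current` into `desired`.

    vip      -- virtual service address as "ip:port"
    current  -- real servers configured right now, each as "ip:port"
    desired  -- real servers that should be configured, each as "ip:port"
    """
    commands = []
    # Add real servers that are wanted but missing (NAT forwarding via -m).
    for rip in sorted(set(desired) - set(current)):
        commands.append(["ipvsadm", "-a", "-t", vip, "-r", rip, "-m"])
    # Remove real servers that are configured but no longer wanted.
    for rip in sorted(set(current) - set(desired)):
        commands.append(["ipvsadm", "-d", "-t", vip, "-r", rip])
    return commands


if __name__ == "__main__":
    # Addresses are placeholders.
    for cmd in reconcile(
        "10.0.0.1:80",
        current=["192.168.0.2:80", "192.168.0.3:80"],
        desired=["192.168.0.3:80", "192.168.0.4:80"],
    ):
        print(" ".join(cmd))
```

Additions are emitted before removals so that capacity is added before any is taken away.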
I have spent the past couple of weeks doing a review of the state of our production environment. Out of that review, it was determined that one of the more pressing issues was a revamp of our backup procedures. I wanted to implement a solution that could be centrally managed and could maintain flexible policies.
Bacula seems to fit the bill. It is an open source backup solution that uses a pull mechanism to grab the data off of servers. I have used this solution for backups before due to encryption requirements for another project, but I have not previously set up bacula to manage a significant number of machines.
Once I have a solid configuration together and have done some testing, I will create another post regarding manageability and resource utilization.
I was recently introduced to a new configuration management tool that has been developed to solve the problems presented by puppet, called Chef.
I have not had an opportunity to play with it yet, but its primary advantage seems to be that your configuration management is done by defining scripts to run, as opposed to being limited to an overly simplified configuration language.
I hope that Chef will prove to be an effective tool as the community around it grows. I'll be keeping my eye on it.
Today I set up pxe reinstalls at our co-lo provider. This was a straight-forward process based on the blog post http://technomojo-hmb.blogspot.com/2008/03/installing-linux-over-network.html.
The only bit o' honey that I would like to add to the information that is pretty clearly laid out in that post is that when handling this over KVM-over-IP, you must be aware that the visual indicators will not be available. This can and should be handled by using pre and/or post execution scripts within the kickstart configuration. At minimum, you should ensure that the post execution script reboots the machine.
I do not have extensive experience with Kickstart, but it does seem worthwhile to suggest doing some testing of your own with respect to the pre and post execution scripts. Moving forward I would like to attempt using the pre execution script to start up the SSH server on the host in order to monitor progress as the system is being installed. Additionally, a post-execution script that handles the next step of the setup process I am using for hosts would be ideal. I would like to have it download a script from a webserver, change permissions and then execute as root to install puppet which I am using to go from OS installed to configured.
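The download, change permissions, execute step could look something like the following Python sketch. It is an illustration only: I have not wired this into a kickstart post section, the function names are made up, and the URL and destination path are placeholders you would supply yourself.

```python
import os
import stat
import subprocess
import urllib.request


def fetch_script(url, dest):
    """Download a script from a web server and make it executable by its owner."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        out.write(response.read())
    os.chmod(dest, stat.S_IRWXU)  # rwx for the owner only (0o700)
    return dest


def run_script(path):
    """Execute the downloaded script and fail loudly on a non-zero exit."""
    return subprocess.run([path], check=True).returncode


# Usage (placeholder URL and path; run as root on the freshly installed host):
#   run_script(fetch_script("http://webserver.example.com/bootstrap.sh",
#                           "/root/bootstrap.sh"))
```

Keeping the fetch and the execution as separate steps makes it easy to inspect what was downloaded before anything runs as root.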
Combining pxebooting, kickstart and puppet I am very close to being able to create a bare-metal install/recovery process of our application. Assuming that I can bridge the gap between install and configuration with this scripting, I would also like to attempt using this methodology to install on EC2. If that can be done, it would provide a truly effective last-ditch failover plan for maintaining uptime after a catastrophic failure.
I've been setting up puppet for over a month now and I have yet to take advantage of all of the features. This is likely not truly a short-coming of puppet, but rather an issue with my vision of what puppet should be.
I decided that I would use puppet for server configuration management without any experience and having done what in retrospect seems like only a cursory level of research into the potential solutions for this. Clearly the best tools available are puppet and CFEngine and I found the complaints about the CFEngine development process too overwhelming to ignore.
After choosing and implementing a system, reading the definitive guide, Pulling Strings with Puppet, and deciding to use it in a manner it was not designed for, I've decided that it just plain sucks.
My complaint revolves around 2 issues. The first is that what I want is a tool to make "bare-metal" restores from a base image easy. This is not what puppet is designed for. The second, and more pressing, issue is that the default types do not always do as expected.
First, there's just no good tool for this. It works OK for this really. Better than you'd expect at any rate.
Faulty types. The File type, when you use ensure => true, doesn't create a file, or maybe it does... I don't remember, but when you do that with a directory (note there is no directory type) it fails. You must create an Exec, which I have begun to realize is the default type to get anything done, in order to create a directory.
And the types that suck. Here they are:
http://reductivelabs.com/trac/puppet/wiki/TypeReference