How To Get Alerts When a Sidekiq Instance Goes Down

https://www.flickr.com/photos/emmandevin/6089454195/

For those of us who use the popular distributed job queueing system Sidekiq, it’s a common problem: a Sidekiq instance containing a pool of workers dies, and the only way you find out about the problem is by checking the Sidekiq dashboard and seeing that you’ve got a ton of jobs backed up and fewer busy workers than expected.

Luckily, there’s an easy way to get email alerts when one of your instances goes down, using Dead Man’s Snitch and a little bit of code inspired by the sidekiq_snitch gem.

The general approach we’ll take will be to create a special SnitchWorker class that pings a snitch URL and then continually re-enqueues itself so it keeps pinging that URL every 30 minutes as long as the instance is alive. We’ll also do a little extra work to ensure that the queue we’re using is specific to the snitch URL and, presumably, to the instance, as well.

Before we proceed any further, a word on why I personally don’t use the `sidekiq_snitch` gem. I run multiple Sidekiq instances on multiple hosts, and I want to create a different snitch for each instance. With the approach outlined here, I get an alert if any one of the instances goes down. The `sidekiq_snitch` gem, on the other hand, only really makes sense when used with one instance.

Step 1: Configure Dead Man's Snitch

The first step is to sign up for Dead Man’s Snitch and create your first free snitch. When you create the snitch, keep in mind that it will only apply to a specific Sidekiq instance, and configure it accordingly (we’ll talk about naming it in a moment). If you have multiple instances, then you’ll need multiple snitches. But for now, let’s just set up one.

Configure the snitch for hourly check-in, at first. We’ll try to have our workers check in every 30 minutes, but we have to make allowances for when instances get busy and backed up with work. Only you know how busy your workers are on average, so it’ll be up to you to set a combination of a snitch interval, snitch worker respawn period, and snitch queue priority level that will keep you from getting false positives.
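To get a rough feel for how much slack a given combination leaves, here is a small Ruby sketch. It assumes a simplified model in which an alert fires whenever the gap between two check-ins exceeds the snitch interval; that model is an assumption for illustration, not a description of how Dead Man’s Snitch actually evaluates check-ins, and the method name `tolerable_delay_minutes` is made up.

```ruby
# Simplified model (an assumption): the snitch alerts when the gap between
# two consecutive check-ins exceeds the snitch interval.
# Returns how many minutes a ping can sit in a backed-up queue before that
# happens. A zero or negative result means the settings leave no slack.
def tolerable_delay_minutes(snitch_interval_minutes, respawn_period_minutes)
  snitch_interval_minutes - respawn_period_minutes
end

# Hourly snitch, snitch worker re-enqueued every 30 minutes
puts tolerable_delay_minutes(60, 30)
```

If your queues regularly back up for longer than the number this prints, a longer snitch interval or a shorter respawn period is probably worth considering.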

Step 2: Create a Snitch Worker

The next step is to create a sidekiq directory in your app/workers directory and add the file app/workers/sidekiq/snitch_worker.rb, as follows:

require 'net/http'

module Sidekiq
  class SnitchWorker
    include Sidekiq::Worker

    # The snitch url for this instance, or nil if the instance is unmonitored
    def self.snitch_url
      ENV['SIDEKIQ_SNITCH_URL']
    end

    def self.queue_name
      # Extract the snitch token from the snitch url
      # and use it to name the queue
      token = snitch_url ? snitch_url.split("/").last : 'q'
      ['snitch', token].join('_')
    end

    sidekiq_options queue: queue_name

    def perform
      # Do nothing (and don't respawn) when no snitch url is configured
      url = self.class.snitch_url
      return unless url

      Net::HTTP.get(URI(url))

      # groundhog day!
      # (30.minutes comes from ActiveSupport, which Rails loads for you)
      SnitchWorker.perform_in(30.minutes)
    end
  end
end

Let’s take a look at the worker above and see what it does. The first method just makes it easier to grab the ENV variable SIDEKIQ_SNITCH_URL, which you’ll set for each instance to the snitch url that you created in the first step above. I put this convenience method in there because we’ll be using it in more than one spot. If you’re using the figaro gem, you’d just call Figaro.env.sidekiq_snitch_url instead; in that case the method won’t add much readability or convenience, and you can skip it.

The next method assigns the SnitchWorker to a Sidekiq queue whose name is based on the SIDEKIQ_SNITCH_URL environment variable that you set. This queue name will be shared by any instance that you boot with this same snitch url. I presume that you’re only booting one instance per snitch url, but if you wanted to have more than one instance pinging the same URL, then you could.

There is a little bit of guard logic in this method so that you can easily boot a Sidekiq instance that’s unmonitored by simply omitting the snitch url ENV variable.
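The naming and guard logic can be tried on its own, outside of Rails and Sidekiq. The sketch below copies the body of `queue_name` into a standalone method; the name `queue_name_for` is a hypothetical helper used only here, and the URL is the example one from the boot command later in this post.

```ruby
# Standalone copy of the queue-naming logic, runnable without Rails or Sidekiq.
# `queue_name_for` is a hypothetical helper name used only in this sketch.
def queue_name_for(snitch_url)
  # The token is the last path segment of the snitch url; fall back to 'q'
  # when no url is configured (the unmonitored case).
  token = snitch_url ? snitch_url.split('/').last : 'q'
  ['snitch', token].join('_')
end

puts queue_name_for('https://nosnch.in/c2354d53dd') # prints "snitch_c2354d53dd"
puts queue_name_for(nil)                            # prints "snitch_q"
```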

The perform method is straightforward: it does an HTTP GET to the snitch url and then enqueues another SnitchWorker to repeat the process in 30 minutes.
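One thing the worker leaves implicit is what happens when the GET itself fails. As written, an exception skips the re-enqueue and leaves recovery to Sidekiq’s normal retry behavior. If you’d rather handle that yourself, something along these lines might work. This is only a sketch: `ping_snitch` is a hypothetical helper, the list of rescued errors is not exhaustive, and the `http_get` parameter exists purely so the example can be run without a network connection.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: pings the snitch url and reports success as a boolean
# instead of raising, so a caller could still re-enqueue after a failed ping.
# `http_get` is injectable only to make the sketch easy to try offline.
def ping_snitch(url, http_get: ->(uri) { Net::HTTP.get_response(uri) })
  return false if url.nil? || url.empty?

  response = http_get.call(URI(url))
  response.is_a?(Net::HTTPSuccess)
rescue SocketError, SystemCallError, Timeout::Error, IOError => e
  # Covers DNS failures, refused connections, and timeouts (not exhaustive)
  warn "Snitch ping failed: #{e.class}: #{e.message}"
  false
end

# Offline usage with a stubbed getter that simulates a network failure
puts ping_snitch('https://nosnch.in/c2354d53dd',
                 http_get: ->(_uri) { raise SocketError, 'offline' })
```

Whether swallowing the error is the right call depends on whether you want failed pings to show up in Sidekiq’s retry queue.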

Step 3: Configure Sidekiq

The final step in giving each instance a Dead Man’s Snitch-based heartbeat monitor is to properly configure Sidekiq. You’ve already got a config/initializers/sidekiq.rb file that sets up Redis and so on for you, so we’ll add the following lines to that file at the very end, after you’ve configured the Sidekiq client and server:

Sidekiq.options[:queues].insert 0, Sidekiq::SnitchWorker.queue_name
Sidekiq::SnitchWorker.perform_async

The above code first adds the queue name that we generated above (based on the snitch url) to the front of the list of queues that this Sidekiq instance listens on. Then, it enqueues the first SnitchWorker job, which will then continually run every 30 minutes to ping the snitch URL.
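The `insert 0` call is ordinary `Array#insert`, so its effect on the queue list can be checked in plain Ruby. The sketch below uses a stand-in array rather than the real Sidekiq options, with the queue names from the boot example in Step 4.

```ruby
# Plain-Ruby illustration of the Array#insert call used in the initializer.
# `queues` is a stand-in for the instance's queue list, not the real
# Sidekiq options hash.
queues = %w[foo bar]
queues.insert(0, 'snitch_c2354d53dd')

p queues # prints ["snitch_c2354d53dd", "foo", "bar"]
```

Putting the snitch queue first matters because, as far as I understand Sidekiq’s strict queue ordering, queues earlier in the list are checked before later ones, so the heartbeat job shouldn’t get stuck behind a backlog in foo or bar.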

Again, if you don’t set the snitch url environment variable, then that first SnitchWorker will just return without doing anything and won’t respawn.

Step 4: Boot Sidekiq

Now you can boot Sidekiq with the ENV variable set to the snitch url that you generated in step 1, and your instance will automatically extract the token from the url, listen on a queue with a name based on that token, and run the snitch worker every 30 minutes in that queue.

$> SIDEKIQ_SNITCH_URL=https://nosnch.in/c2354d53dd sidekiq -v -q foo -q bar
$> 2015-05-22T18:27:08.686Z 54099 TID-4rw DEBUG: {:queues=>["snitch_c2354d53dd", "foo", "bar"], :labels=>[], :concurrency=>25, :require=>".", :environment=>nil, :timeout=>8, :error_handlers=>[#, #, #], :lifecycle_events=>{:startup=>[], :quiet=>[], :shutdown=>[]}, :dead_max_jobs=>10000, :dead_timeout_in_seconds=>15552000, :verbose=>true, :strict=>true, :backup_limit=>1000, :tag=>"rails-app-name"

Conclusion

Setting up Dead Man’s Snitch-based uptime monitoring on your Sidekiq instances is super easy, and it’s also extremely useful. I don’t know about you, but my Sidekiq instances do go down and/or stop responding periodically, which means that I need to go in and reboot them. In fact, I’d like to cook up some scheme for rebooting the instances based on the receipt of an email alert from the service. Perhaps that could be a topic for a future post.

If you haven’t tried Dead Man’s Snitch, be sure to sign up for the free tier and try it out. You’ll be glad you did.


Warning Light by Britt Reints is licensed under CC BY 2.0

Jon is a founder of Ars Technica, and a former Wired editor. When he’s not developing code for Collective Idea clients, he still keeps his foot in the content world via freelancing and the occasional op-ed.
