Playing with Go: Embarrassingly Parallel Scripts

I recently needed to take a list of domains and find which ones point to a specific IP address. For a small list, say fewer than 10, manually running dig in the console would work great, but this list had almost 800 domains, so I needed a script. As a domain lookup is a network request and thus very slow, running the lookups in parallel made sense. I could easily have done this in Ruby, my language du jour, but I’ve done this type of thread work before and frankly it can be tedious to set up and fragile, and it still won’t have access to all of my system’s resources due to the GVL [1]. I’ve been keeping an eye on Google’s Go for some time now and decided to see how it handled this problem.

I’ve been intrigued by Go since it was originally announced about three years ago. Here was a compiled, fast, light-weight, low level language with many of the features we take for granted these days, such as garbage collection, while also adding on a very sophisticated concurrency model similar to what’s found in Erlang: very lightweight internal processes managed by the runtime. Sounds like a perfect fit for my requirements.

The code I ended up with is here: https://gist.github.com/4170926. For the sake of comparisons I built a sequential version of the script as well as the parallel version and added timings for running both scripts against the full list of domains.

Running these scripts for yourself is a one-liner: go run [script.go]. The input file domains.txt needs to be a newline-delimited list of domains. I’ll go over the more confusing parts of the two scripts to help with understanding what’s really going on here.

Objects?

Go’s object model is very close to C’s: structs with data and methods that operate on said structs. Both scripts only use a small, two-field struct, DomainMap, to keep track of the IP address found for a given domain. I use the short form to initialize new instances of the DomainMap structure. The order of values maps directly to the order of the fields defined at the top of the scripts.

type DomainMap struct {
  Domain string
  IpMapping string
}

object := DomainMap{domain, ipAddress}

object.Domain == domain
object.IpMapping == ipAddress
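
To make the short form concrete, here is a minimal sketch comparing it with the named-field form, which does not depend on field order. The helper names newPositional and newNamed and the sample values are made up for illustration and are not part of the original scripts.

```go
package main

import "fmt"

// DomainMap mirrors the two-field struct used in the scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}

// newPositional uses the short form: values are matched to fields by position.
// (Hypothetical helper, for illustration only.)
func newPositional(domain, ipAddress string) DomainMap {
	return DomainMap{domain, ipAddress}
}

// newNamed builds the same value by naming each field explicitly.
// (Hypothetical helper, for illustration only.)
func newNamed(domain, ipAddress string) DomainMap {
	return DomainMap{Domain: domain, IpMapping: ipAddress}
}

func main() {
	a := newPositional("example.com", "192.0.2.10")
	b := newNamed("example.com", "192.0.2.10")

	// Structs whose fields are all comparable can be compared with ==.
	fmt.Println(a == b)
	fmt.Println(a.Domain, a.IpMapping)
}
```

The named form is a bit more typing, but it keeps working if fields are later added to or reordered in the struct.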

Error handling

Go does error handling by returning multiple values from a function, where by convention the last return value is of type error. You can discard it by assigning it to the blank identifier _.

rawIpAddresses, _ := net.LookupIP(domain)
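
For anything longer-lived than a one-off script, you would probably want to check the error instead of discarding it. The following is only a sketch of what that could look like: the resolver type, firstIP, and fakeLookup are hypothetical names of my own, and fakeLookup returns made-up data so the example runs without touching the network. net.LookupIP has the same shape as resolver, so it could be passed in its place.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// resolver has the same signature as net.LookupIP, so a fake can be
// swapped in for the real lookup. (Hypothetical type, for illustration.)
type resolver func(host string) ([]net.IP, error)

// firstIP returns the first address found for domain, or an error
// describing why none could be returned.
func firstIP(lookup resolver, domain string) (string, error) {
	ips, err := lookup(domain)
	if err != nil {
		// Wrap the error so the caller knows which domain failed.
		return "", fmt.Errorf("lookup %s: %w", domain, err)
	}
	if len(ips) == 0 {
		return "", errors.New("lookup " + domain + ": no addresses returned")
	}
	return ips[0].String(), nil
}

// fakeLookup is a stand-in for net.LookupIP that returns made-up data.
func fakeLookup(host string) ([]net.IP, error) {
	if host == "example.com" {
		return []net.IP{net.ParseIP("192.0.2.10")}, nil
	}
	return nil, errors.New("no such host")
}

func main() {
	for _, domain := range []string{"example.com", "missing.invalid"} {
		ip, err := firstIP(fakeLookup, domain)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(domain, "->", ip)
	}
}
```

Calling firstIP(net.LookupIP, domain) would run the same logic against real DNS.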

Parallelism

The parallel version of the script has some new concepts that need explaining, particularly goroutines, channels, and channel communication.

A goroutine is a very lightweight process, sort of like a Ruby Fiber. Creating one is simple:

go domainLookup(responseChannel, domain)

Go will take the function call after the go keyword and execute it concurrently with the rest of the program. However, given that we’re no longer in the main goroutine, we can’t just return values from the function. We now need a different way to get the return value. This is where channels come in.

responseChannel := make(chan DomainMap)

As Go is a statically typed language, we need to define the type of channel being created. Channels can only accept data of the same type as the channel. Communication through channels is done with the reverse-stabby operator <-, which should be read as “the data on the right side is flowing to the left side”:

// Write into a channel
returnChannel <- DomainMap{domain, ipAddress}
// Read from the channel
domainMap := <- responseChannel
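
Put together, the write and the read happen on opposite ends of the same channel: the goroutine sees it as a parameter (named returnChannel here) while the caller holds it as responseChannel. A minimal runnable sketch, assuming a hypothetical lookup function that sends a hard-coded address instead of doing a real DNS query:

```go
package main

import "fmt"

// DomainMap mirrors the two-field struct used in the scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}

// lookup is a hypothetical stand-in for domainLookup. It sends a single
// result with a made-up address. The send blocks until someone receives,
// because the channel is unbuffered.
func lookup(returnChannel chan<- DomainMap, domain string) {
	returnChannel <- DomainMap{domain, "192.0.2.10"}
}

// receiveOne starts one goroutine and waits for its result.
func receiveOne(domain string) DomainMap {
	responseChannel := make(chan DomainMap)
	go lookup(responseChannel, domain)

	// The receive blocks until the goroutine has sent its value.
	return <-responseChannel
}

func main() {
	domainMap := receiveOne("example.com")
	fmt.Println(domainMap.Domain, domainMap.IpMapping)
}
```

Declaring the parameter as chan<- DomainMap is optional, but it lets the compiler enforce that the goroutine only ever writes to the channel.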

And that’s all the special syntax. The only real difference between the parallel and sequential scripts is the map-reduce-esque setup to wait for all the goroutines to finish. I didn’t need to worry about thread pooling, system capabilities, or thread safety. Go makes it so easy to write truly parallel code that there’s no excuse not to anymore. I was able to run almost 800 goroutines (one per domain) all throwing out DNS queries and coming back in less than 10 seconds, in a script that doesn’t even look like it’s running in parallel.
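
The setup for waiting on all the goroutines can be sketched roughly as follows. This is not the code from the gist: fakeTable and domainsPointingAt are names I made up, and the lookup reads from a hard-coded table rather than DNS so the sketch runs anywhere. The idea is to start one goroutine per domain, then read exactly as many results from the channel as goroutines were started.

```go
package main

import (
	"fmt"
	"sort"
)

// DomainMap mirrors the two-field struct used in the scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}

// fakeTable is made-up data standing in for real DNS answers.
var fakeTable = map[string]string{
	"a.example": "192.0.2.10",
	"b.example": "192.0.2.20",
	"c.example": "192.0.2.10",
}

// domainLookup sends one result per call. Here it only reads fakeTable;
// a real version would call net.LookupIP instead.
func domainLookup(returnChannel chan<- DomainMap, domain string) {
	returnChannel <- DomainMap{domain, fakeTable[domain]}
}

// domainsPointingAt fans out one goroutine per domain, then collects
// len(domains) results and keeps the ones that match target.
func domainsPointingAt(domains []string, target string) []string {
	responseChannel := make(chan DomainMap)

	// Fan out: one goroutine per domain.
	for _, domain := range domains {
		go domainLookup(responseChannel, domain)
	}

	// Fan in: each receive blocks until some goroutine has a result ready.
	var matches []string
	for range domains {
		result := <-responseChannel
		if result.IpMapping == target {
			matches = append(matches, result.Domain)
		}
	}

	// Results arrive in whatever order the goroutines finish, so sort
	// them to get stable output.
	sort.Strings(matches)
	return matches
}

func main() {
	domains := []string{"a.example", "b.example", "c.example"}
	fmt.Println(domainsPointingAt(domains, "192.0.2.10")) // prints [a.example c.example]
}
```

Counting receives works because every goroutine sends exactly one value. If a lookup could return without sending, the collecting loop would block forever, so a sync.WaitGroup or a timeout would be the safer choice in that case.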

Now that Go 1.0 stable is out, it’s a great time to get familiar with this language. I highly recommend checking out the Tour of Go for a basic introduction to every major feature of the language, and there’s a ton of documentation on the main website, golang.org. From the little bit of time I’ve spent playing with Go so far, I see a very bright future for this language.


[1] Global VM Lock. More about Ruby’s concurrency here: http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/

Photo of Jason Roelofs

Jason is an expert coder and smart technologist who’s worked for two local startups before joining our team. His deep technical understanding of all things code manifests itself in his plentiful open source contributions and his ability to answer most questions with an experienced answer.

Having started coding as a teenager, Jason lives for writing code and loves experimenting with different languages and new technologies. Jason is also an avid gamer when he isn’t working on code, writing, or practicing martial arts (currently Kung Fu and Tai Chi).

A Calvin College grad, Jason now lives in Holland with his wife Martha.

Comments:


  1. Can you post the full domains.txt file on your gist as well?

    Sir
    December 03, 2012 at 11:02 AM
  2. Celluloid

    Grogenaut
    December 03, 2012 at 11:14 AM
  3. Ignoring errors is not the greatest thing to do…

    Carlos
    December 03, 2012 at 11:46 AM
  4. @Sir: I’d rather not as it contains customer data.

    @Carlos: As this is a one-off script, re-running the script is good enough error handling for me. This would definitely be far different if it was a module run inside of a bigger application.

    December 03, 2012 at 12:07 PM
  5. I know you wanted to use GO for this article, but you know you could have considered JRuby for this work, right? 

    Marcalc
    December 03, 2012 at 12:12 PM
  6. Please use gofmt whenever publishing Go source code.

    jnml
    December 03, 2012 at 12:59 PM
  7. Thanks for your blog post, it was very instructive.

    One question: what does this mean?

        domainMapping = append(domainMapping, on that line.

    kikito
    December 03, 2012 at 13:05 PM
  8. It seems that the code was mangled by an HTML strip-tags filter, but never mind, I think I figured it out. When you have a LEFT ARROW channel in a param, you are just using whatever that channel returns next as the param. I assume that this is a blocking call.

    kikito
    December 03, 2012 at 13:17 PM
  9. @kikito: http://golang.org/pkg/builtin/#append

    Append is a built-in function to work on the slice data type, and it always returns the modified slice because this call might resize the one you passed in or a new slice might be allocated depending on the capacity of said slice.

    Also yes [left arrow] is a blocking call.

    December 03, 2012 at 13:23 PM
  10. Another nitpick: it’s not really parallel – it’s concurrent. For now, Go is single-core unless you explicitly tell it to use multiple cores [10]. As far as I can tell, your code is running single-core. Which actually makes this a nice example of how concurrency can be faster regardless!

    [10] http://golang.org/pkg/runtime/#GOMAXPROCS

    Job van der Zwan
    December 03, 2012 at 16:02 PM
  11. Here’s a list of domains I found https://raw.github.com/tarr11/Webmail-Domains/master/domains.txt

    Phil
    December 03, 2012 at 18:30 PM
  12. I tried this under Windows 7 and both scripts run the same… I have a Core 2 Duo. I also: set GOMAXPROCS=2

    thx!

    john
    December 03, 2012 at 19:54 PM
  13. Go’s approach to parallelism reminds me of something…  ah!  Unix and its shells.  It’s very easy to parallelize shell scripts too…  And go channels look remarkably like pipes.  Of course, the shells are kinda sucky and outdated, so yes, Go is better.

    Nico
    December 04, 2012 at 1:27 AM
  14. Echoing a previous comment – If you were itching to give Go a try, that’s one thing, but saying that you couldn’t do it in Ruby because of the GVL is fallacious and misleading.
    You could easily have used JRuby and get an industrial-strength Ruby implementation without a GVL.

    Anthony
    December 04, 2012 at 7:53 AM
  15. @Anthony: I never said I couldn’t do it in Ruby. What I said was that Ruby’s GVL ensures that you won’t get full use of your system when trying to build concurrent systems. Yes you can switch to JRuby but then you’re not using Ruby, you’re using JRuby, and I wanted to branch out and try something completely outside of the Ruby ecosystem.

    @Job van der Zwan: Right, thanks for pointing that out! Had only glanced at some of that previously, I’ll be sure to remember that setting in the future.

    December 05, 2012 at 10:00 AM
  16. @Jason: In MRI 1.9, threads blocked by IO will run in parallel. You don’t need JRuby to run a bunch of network requests on all your cores. I understand you just wanted to use Go, but please understand that the GVL doesn’t necessarily block ruby threads from running in parallel. 

    See Aaron Patterson’s MagmaRails 2012 talk for a simple demonstration http://www.youtube.com/watch?v=vERwKWqDC0c#t=11m00s

    December 05, 2012 at 11:17 AM
  17. I wrote an example showing Ruby 1.9.3 on a Macbook Pro with a Core i7 resolving 800 random-ish hostnames: https://gist.github.com/ec353d84522531fe2bfa

    As you can see, it takes about 16 seconds, but the point is that the requests run in parallel on MRI with nothing but Thread.new.

    Don’t get me wrong, I think it’s great that you found a simple but practical example to introduce Go’s concurrency primitives, and I appreciate the time you took to write up this blog post. Kudos! I just find that there is a lot of confusion about concurrency when it comes to MRI’s thread implementation, and I think it’s a shame that Rubyists don’t realize they can parallelize IO-bound tasks.

    December 05, 2012 at 17:37 PM
  18. @benolee: If anything I shouldn’t have mentioned Ruby’s GVL at all, as that ended up distracting from the point I was trying to make, which was to show my playing with concurrency in Go. Doing anything IO bound is of course a very easily parallelizable task for any language, which puts us back in the realm of how hard it is to put together a good example. I never meant to say “Ruby sucks. What does this better?”, but “I’ve done this in Ruby and I want to try another language now!” and to talk about my experiments.

    December 05, 2012 at 18:08 PM