Optimizing Rails for Memory Usage Part 3: Pluck and Database Laziness

This is part three in a four-part series on optimizing a potentially memory-heavy Rails action without resorting to pagination. The posts in the series are:

Part 1: Before You Optimize
Part 2: Tuning the GC
Part 3: Pluck and Database Laziness
Part 4: Lazy JSON Generation and Final Thoughts

Pluck the Chicken

If changing Ruby’s GC parameters doesn’t lower your memory usage enough, it may be time to change your Ruby code. Figuring out where your code is allocating memory can be tricky. Sam Saffron has a memory_profiler gem that will tell you where your code is allocating objects, or if you prefer, Aman Gupta’s stackprof has a mode that can count where your object allocations are happening. The number of allocations is not the same as the amount of memory allocated, but memory_profiler or stackprof can help identify some hot spots.
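For example, a minimal memory_profiler run looks like this (User.limit(300).to_a is just stand-in work; substitute the code you suspect is allocation-heavy):

require "memory_profiler"

report = MemoryProfiler.report do
  # Run the code you want to profile here.
  User.limit(300).to_a
end

# Prints allocated and retained memory and objects, grouped by gem, file, and location.
report.pretty_print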

I’ll give you a hint, though. If you’re trying to optimize the memory usage of a big index view in Rails, your hotspot is almost surely ActiveRecord instantiation. ActiveRecord objects are memory hogs. If you only have to load 30 AR objects from the database, fine. However, 300 or 3000 instantiations will really eat memory.

Helpfully, if you just need a few values, you can use ActiveRecord#pluck to avoid instantiating ActiveRecord objects.

# This uses WAY more memory (and time)...
User.select(:id).map(&:id)

# ...than this!
User.pluck(:id)

If you just need a few columns off the table (e.g. for an admin view), you can pluck multiple columns in Rails 4. You will get back an array of arrays.

User.pluck(:id, :name, :email)
# [
#   [123, "Alice",   "[email protected]"],
#   [124, "Brian",   "[email protected]"],
#   [125, "Cynthia", "[email protected]"]
# ]

If you are generating JSON views and you are using Postgres, you can cache an object’s JSON representation in the database.[1] Generating your index view could become as simple as:

render json: my_scope.pluck(:json)
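One way such a cache might be maintained is a save callback. This is a hypothetical sketch (the json column and UserSerializer are assumptions for illustration), not the code we shipped:

class User < ActiveRecord::Base
  before_save :cache_json_representation

  private

  # Assumes a json (or text) column named "json" and an
  # ActiveModel::Serializers serializer; both are illustrative.
  def cache_json_representation
    self.json = UserSerializer.new(self).to_json
  end
end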

Since cache invalidation is hard, we introduced a bug into our app when we added JSON caching. I recommend you consider other strategies before you adopt this approach.

Be Lazy

Even though pluck uses much less memory than loading full ActiveRecord objects, if your collection is large you may want to avoid loading data from all the records at once. Traditionally, the way to walk over a collection without keeping the whole result set in memory is by using ActiveRecord#find_each:

User.find_each.lazy.map(&:some_calculation_in_ruby).reduce(:+)

The find_each method loads records in batches of 1000. After each batch of 1000 records is processed, those records can be reclaimed by the garbage collector. The lazy call ensures that map does not generate a huge array.

Unfortunately, if you need the results in a specific order you cannot use find_each because ActiveRecord sorts by the id field to build its batches. Have no fear, there is a solution! If you are using Postgres you can batch load and sort by anything you want with the postgresql_cursor gem. Using postgresql_cursor is dead simple. Just add it to your Gemfile and then use the each_instance or each_hash methods on your scope:
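# In your Gemfile
gem "postgresql_cursor"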

# Fetch full ActiveRecord objects
User.order(created_at: :desc).each_instance.lazy.map { |user| do_something(user) }

# Fetch just a hash of each row (faster and uses much less memory)
User.order(created_at: :desc).each_hash.lazy.map { |row| do_something(row) }

# A lazy pluck. Native ActiveRecord#pluck cannot be made lazy.
User.select(:id, :name, :email).each_hash.lazy.map(&:values)

Voilà!

If you need to eager load an association to avoid the n+1 queries problem, ActiveRecord’s find_each supports eager loading just fine:

User.includes(:posts).find_each.lazy
# Each user will have user.posts preloaded efficiently.

However, postgresql_cursor does not natively support eager loading. If you need to efficiently preload an association, you can trigger it manually:

batch_size = 1000

User.
  each_instance(block_size: batch_size).
  lazy.
  each_slice(batch_size).
  flat_map do |batch_of_users|
    # Preloads the :posts association for the whole batch in one query.
    ActiveRecord::Associations::Preloader.new.preload(batch_of_users, :posts)
    batch_of_users
  end
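Because everything in the chain is lazy, each batch is loaded, preloaded, and handed downstream before the next batch is fetched from the cursor. As a hypothetical illustration, assuming the chain above is assigned to lazy_users:

# Each user arrives with user.posts already loaded, so this triggers no n+1 queries.
lazy_users.map { |user| user.posts.length }.reduce(0, :+)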

When you have completed your lazy record loading setup, you can further optimize memory usage (or runtime) by trying different batch sizes:

PROCESSING_BATCH_SIZE = Integer(ENV["PROCESSING_BATCH_SIZE"].presence || 1000)

# ActiveRecord’s find_each
User.find_each(batch_size: PROCESSING_BATCH_SIZE).lazy

# postgresql_cursor’s methods
User.each_instance(block_size: PROCESSING_BATCH_SIZE).lazy
User.each_hash(block_size: PROCESSING_BATCH_SIZE).lazy

The default size of 1000 is usually pretty good, but if you do eager loading you may find some memory improvement by shrinking the batch size. Similarly, if you use each_hash to avoid instantiating ActiveRecord objects, you may get both memory and runtime improvements from a larger batch size. Modify your test script to automatically test a bunch of different batch sizes and have it run while you eat lunch.
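Such a script might look like the following sketch. It uses Benchmark for timing and GC.stat’s total_allocated_objects counter as a rough proxy for allocation pressure; substitute your real pipeline, and measure RSS if you want true memory numbers:

require "benchmark"

[250, 500, 1000, 2000, 4000].each do |size|
  GC.start
  allocated_before = GC.stat(:total_allocated_objects)
  seconds = Benchmark.realtime do
    # Substitute your actual processing pipeline here.
    User.each_hash(block_size: size).lazy.each { |row| row }
  end
  allocated = GC.stat(:total_allocated_objects) - allocated_before
  puts format("block_size=%-5d time=%.2fs allocated_objects=%d", size, seconds, allocated)
end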

If the end of your processing pipeline renders out a long JSON list and, like us, you are using ActiveModel::Serializers, you have a problem: this last step of the process is not lazy. A large array needs to exist in Ruby in order to render it to JSON, and lazy enumerators are not natively set up to serialize to JSON. In the next post, we will discuss how to serialize an enumerator and close with a few final thoughts.

On to Part 4: Lazy JSON Generation and Final Thoughts →

[1] In other databases, you could store a JSON string. Note, however, that because the field will come back from the database as a string, you may have to parse the JSON before you re-serialize it into a response.


Brian worked with us long before he came on full-time, and had we seen the baby face lurking beneath his programmer beard, we probably wouldn’t have assumed he was as smart. He proved quickly that he has earned the beard, both as a graduate of Michigan Tech in Bioinformatics and Biochemistry/Molecular Biology, and as an experienced coder who picks up new tools quickly.

An occasional violinist and lover of birds, Brian is a cheerful addition to the office.
