The N=1000 Optimization Trap
A few dozen times, I’ve run into problems with what I’m calling “The N=1000 Optimization Trap”.
If it has another name, I’d love to hear it. It’s a premature optimization variant with a twist, in that it actually breaks stuff.
Here’s a contrived ActiveRecord example that’d live in a background job:
# We start off with plain textbook code.
User.find_each do |user|
  coordinates = user.generate_current_coordinates
  CurrentLocation.create(user: user, coordinates: coordinates)
end
After a while, we’ve got a few hundred or maybe a thousand users, and it seems unsettling that we’re inserting N records with N SQL INSERTs, so we optimize for queries with something like:
# We keep a list of records we can insert in bulk later
records = []
User.find_each do |user|
  coordinates = user.generate_current_coordinates
  # insert_all takes raw column values, not model objects
  records.push({ user_id: user.id, coordinates: coordinates })
end

# Now SQL insert them
CurrentLocation.insert_all(records)
This is the trap.
Below a few thousand records, this optimization doesn’t matter, because N is so small. When N is in “adolescent ranges”, this optimization seems like it helps, and it probably actually does.
But then N starts getting large (let’s say hundreds of thousands or more), and we run into problems:
- Do we have enough memory to hold records? If it’s 1GB of record data on a 512MB instance, then no (there’s a rough back-of-envelope after this list).
- Does generate_current_coordinates have side effects that’d matter if the job hits an exception before we batch-commit all the records?
- How big is the SQL string generated by insert_all? It turns out Postgres will accept really long strings, but imagine the stress this puts on the query parser, and then on constraint checking. This is exacerbated by the next point:
- Does that insert_all imply taking an exclusive lock on a heavily used table? Inside or outside a transaction, this can stall other readers.
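To make the first and third points concrete, here’s a rough back-of-envelope in plain Ruby. The per-record sizes are made-up, illustrative numbers, not measurements from any real schema:

# Hypothetical sizes, purely for illustration.
row_count        = 500_000
bytes_per_record = 200   # assumed Ruby-object overhead per hash in records
chars_per_row    = 120   # assumed characters each row adds to the VALUES list

records_memory_mb = (row_count * bytes_per_record) / 1024.0 / 1024.0
sql_string_mb     = (row_count * chars_per_row) / 1024.0 / 1024.0

puts "records array: roughly #{records_memory_mb.round} MB held in the job's memory"
puts "insert_all SQL: one statement of roughly #{sql_string_mb.round} MB for Postgres to parse"

With those made-up numbers, 500,000 rows is already ~95MB of Ruby objects plus a ~57MB SQL statement; push N toward a few million and you’re in the 1GB-of-record-data-on-a-512MB-instance territory from the first point.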
You normally notice this when production starts getting read stalls, and everything locks up.
In a recent situation, we knew we had an area like this that could blow up; after some tests we decided “it’ll probably still be fine”, and it still blew up production.
The Fix
So we roll up our sleeves and write the fix:
# We end up with plain textbook code again.
User.find_each do |user|
  coordinates = user.generate_current_coordinates
  CurrentLocation.create(user: user, coordinates: coordinates)
end
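Part of why the plain version holds up: find_each already pages through users in batches (1,000 rows per query by default), so memory stays flat, and each create commits on its own (assuming the job isn’t wrapping the loop in one big transaction), so an exception partway through doesn’t throw away the rows already written. If anything is worth tuning at larger N, it’s usually the read-side batch size; a minimal sketch, with 500 as an arbitrary example value:

# Same textbook loop, just reading users in smaller batches.
User.find_each(batch_size: 500) do |user|
  coordinates = user.generate_current_coordinates
  CurrentLocation.create(user: user, coordinates: coordinates)
end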
I’ve used similar examples to argue the difference between speed and scalability. Here are the rules I try to go by:
- As is common knowledge, avoid premature optimizations.
- If you’ve decided something could use an optimization now, consider how it’d still apply at every point along the trip to thousands or millions, if applicable.
- If it involves ultra-tricky caching, or table joins that’d be hard to explain without a diagram, think about it a few more times.
The End.