An adventure in serialization

Rahim Packir Saibo Software Architect, Applications

DALL·E's interpretation of blue green rails cerealization

A cautionary tale debugging the strange behaviour of a Rails upgrade during a blue-green deployment.

Hybrid canary blue/green Rails upgrades

For major updates of the Litmus Rails monolith we choose to provision a whole new parallel production application server tier (our green env) alongside our existing one (our blue env), deploy our new application version there and then incrementally transition production workloads across to the new environment.

This allows us to reduce risk:

  • by gaining additional information (performance, stability, correctness) beyond what our test suite and manual QA have already shown – nothing is more reflective than observing our real traffic, using the real database and real external services.

  • by ensuring we have an escape hatch, if a problem is observed we can quickly drop the percentage of work being handled by the green environment to 0.

To allow for that, the blue and green environments share everything that’s not contained within the tier, most significantly that means they share the same database. Any HTTP request or any background job might be assigned to either blue or green environment.

As it stands we have no concept of stickiness for sessions or for jobs being run in the environment that they were scheduled in. That means old and new must be ready to handle data written by the other, eg sessions, queued jobs, caches.

This forward and backwards compatibility is not guaranteed by Rails, nor does it seem to be an explicit design goal. This means there’s onus on us to test safety of this ahead of time.

Earlier this year we completed an upgrade from Rails 5.2 to Rails 6.1.

At the beginning of the project we tested and confirmed session forwards and backwards compatibility.

Late in the project we were bitten by subtle problems where that back/forwards requirement had been broken, for:

  • a small fraction of our caching calls
  • a small fraction of our background job invocations

We’ll cover the first of those here and the latter in a follow up post.

Caching

After extensive testing in our staging environment and with our complete test suite passing, we moved on to QA within our green environment for additional confidence before any customer traffic was sent there.

We encountered our first surprising problem at this stage, with a key action failing due to a stack overflow.

Our investigation isolated the problem to an area of code where we recursively traverse a tree of ActiveRecord objects. Critically, this traversal lent on the Rails cache as a performance optimization.

In Rails 5, the expected usage of the cache had us making cache.fetch calls with a cache key of something like "parent-for-folder-123" for a parent folder with id 123.

In Rails 6, we were seeing the cache.fetch passed the string "parent-for-folder-" – ie no id in some cases. More worryingly this was yielding an actual cache hit.

The instance returned was found to respond nil to #id leading to endless recursion.

We reasoned something had poisoned the cache (populating parent-for-folder-), but still didn’t have an explanation as to how #id could ever have been nil. Guards we had in place should have ensured we were always dealing with a persisted record.

We opened a Rails 5 and Rails 6 console side by and started comparing cache queries.

After a lot of head scratching we skewered our cause – an earlier cache.fetch had yielded a deserialized Folder object that looked and behaved correctly, including showing a valid id in its attributes, yet indeed responded nil to #id 🙀

Comparing a fetch of that entry that shouldn’t exist we found that the weird #id behaviour seemed specific to Rails 6.

# Rails 5

irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-"
=> #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...>
irb(main):002:0> cached_folder.id
=> 30228

vs

# Rails 6

irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-"
=> #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...>
            ^^^^^^^^^
irb(main):002:0> cached_folder.id
=> nil      # 🤯

Note, that unlike almost all our other cache usage, we’re not storing primitive data here, but instead directly passing the Rails cache an ActiveRecord instance.

That instance has to be marshalled when placed in the cache and unmarshalled on the way out.

Marshal#dump serializes an object instance (and its descendants) to a string when we persist to the cache.

For a toy class, we can see that string describes our object’s instance variables, but not class variables or methods.

class Foo
  @@bar = :bar
  def initialize
    @baz = :baz
  end
  def qux
    :qux
  end
end
irb(main):001:0> Marshal.dump(Foo.new)
=> "\x04\bo:\bFoo\x06:\t@baz:\bbaz"

Instead class variables and methods are defined by the class’s definition in the runtime at the point we come to unmarshal.

If we redefine our class above

class Foo
  @@bar = :shenanigans
  def initialize
    @baz = :shenanigans
  end
  def qux
    :shenanigans
  end
end

And then deserialize that earlier string

irb(main):001:0> deserialized = Marshal.load("\x04\bo:\bFoo\x06:\t@baz:\bbaz")
=> #<Foo:0x00007ff70989ef10 @baz=:baz>

irb(main):002:0> deserialized.qux
=> :shenanigans

We see see that we get the behaviour of the new class definition, hydrated with the instance variables we persisted with an earlier one.

Problems arise when a later class definition embeds new assumptions about the shape of those instance variables, that aren’t matched by already serialized objects.

That’s exactly what occurred in this case – ActiveRecord internals had changed between Rails 5 and 6, moving the primary_key lookup (used by #id) from a class variable to an instance variable.

In Rails 6, if we pull a record fresh from the database we can see this new instance variable is set

irb(main):024:0> f = Folder.find 16949
=> #<Folder id: 16949,  ...

irb(main):026:0> f.instance_variable_get :@primary_key
=> 
"id"

but not if we pull the same record from the cache, written by Rails 5

irb(main):027:0> cf = Rails.cache.read 'parent-for-folder-18395'
=> #<Folder id: 16949, ...

irb(main):028:0> cf.instance_variable_get :@primary_key
=> nil

It’s possible to manually patch over this at runtime

irb(main):038:0> cf = Rails.cache.read 'parent-for-folder-18395'
=> #<Folder id: 16949, ...

irb(main):039:0> cf.id
=> nil

irb(main):040:0> cf.instance_variable_set(:@primary_key, "id")
=> "id"

irb(main):041:0> cf.id
=> 16949

Perhaps that might have have been used to address the issue had we been making a hard cutover to Rails 6, however given we wanted to gradually transition traffic between the pools, we’d be writing and reading entries from both Rails 5 and 6 concurrently.

Backwards compatibility of ActiveRecord instances written to the cache by Rails 6 and read with Rails 5 turns out to behave worse, raising an exception because the graph includes child classes that don’t exist in Rails 5:

# Write in Rails 6

irb(main):002:0> Rails.cache.write "foo", Folder.find(42)
=> #<Folder id: 42, title: "Foo", ...>
# Attempt to read in Rails 5

irb(main):001:0> Rails.cache.read("foo")
Traceback (most recent call last):
        1: from (irb):1
NameError (uninitialized constant ActiveModel::LazyAttributeSet)
Did you mean?  ActiveModel::AttributeSet
               ActiveModel::LazyAttributeHash

So how did we solve the problem?

We were lucky. 😅

We audited the codebase for instances where we were storing ActiveRecord instances and found only two including the case above. We were able to show the remote cache wasn’t as helpful as simply memoizing for one and for the other we replaced with caching primitives.

The takeaway here – it’s generally a bad idea to store anything other than primitives in the Rails cache.