An adventure in serialization
A cautionary tale debugging the strange behaviour of a Rails upgrade during a blue-green deployment.
Hybrid canary blue/green Rails upgrades
For major updates of the Litmus Rails monolith we choose to provision a whole new parallel production application server tier (our green env) alongside our existing one (our blue env), deploy our new application version there and then incrementally transition production workloads across to the new environment.
This allows us to reduce risk:
-
by gaining additional information (performance, stability, correctness) beyond what our test suite and manual QA have already shown – nothing is more reflective than observing our real traffic, using the real database and real external services.
-
by ensuring we have an escape hatch, if a problem is observed we can quickly drop the percentage of work being handled by the green environment to 0.
To allow for that, the blue and green environments share everything that’s not contained within the tier, most significantly that means they share the same database. Any HTTP request or any background job might be assigned to either blue or green environment.
As it stands we have no concept of stickiness for sessions or for jobs being run in the environment that they were scheduled in. That means old and new must be ready to handle data written by the other, eg sessions, queued jobs, caches.
This forward and backwards compatibility is not guaranteed by Rails, nor does it seem to be an explicit design goal. This means there’s onus on us to test safety of this ahead of time.
Earlier this year we completed an upgrade from Rails 5.2 to Rails 6.1.
At the beginning of the project we tested and confirmed session forwards and backwards compatibility.
Late in the project we were bitten by subtle problems where that back/forwards requirement had been broken, for:
- a small fraction of our caching calls
- a small fraction of our background job invocations
We’ll cover the first of those here and the latter in a follow up post.
Caching
After extensive testing in our staging environment and with our complete test suite passing, we moved on to QA within our green environment for additional confidence before any customer traffic was sent there.
We encountered our first surprising problem at this stage, with a key action failing due to a stack overflow.
Our investigation isolated the problem to an area of code where we recursively traverse a tree of ActiveRecord objects. Critically, this traversal lent on the Rails cache as a performance optimization.
In Rails 5, the expected usage of the cache had us making cache.fetch
calls with a cache key of something like "parent-for-folder-123"
for a parent folder with id 123
.
In Rails 6, we were seeing the cache.fetch
passed the string "parent-for-folder-"
– ie no id in some cases. More worryingly this was yielding an actual cache hit.
The instance returned was found to respond nil
to #id
leading to endless recursion.
We reasoned something had poisoned the cache (populating parent-for-folder-
), but still didn’t have an explanation as to how #id
could ever have been nil
. Guards we had in place should have ensured we were always dealing with a persisted record.
We opened a Rails 5 and Rails 6 console side by and started comparing cache queries.
After a lot of head scratching we skewered our cause – an earlier cache.fetch
had yielded a deserialized Folder
object that looked and behaved correctly, including showing a valid id
in its attributes, yet indeed responded nil
to #id
🙀
Comparing a fetch of that entry that shouldn’t exist we found that the weird #id
behaviour seemed specific to Rails 6.
# Rails 5
irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-"
=> #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...>
irb(main):002:0> cached_folder.id
=> 30228
vs
# Rails 6
irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-"
=> #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...>
^^^^^^^^^
irb(main):002:0> cached_folder.id
=> nil # 🤯
Note, that unlike almost all our other cache usage, we’re not storing primitive data here, but instead directly passing the Rails cache an ActiveRecord instance.
That instance has to be marshalled when placed in the cache and unmarshalled on the way out.
Marshal#dump
serializes an object instance (and its descendants) to a string when we persist to the cache.
For a toy class, we can see that string describes our object’s instance variables, but not class variables or methods.
class Foo
@@bar = :bar
def initialize
@baz = :baz
end
def qux
:qux
end
end
irb(main):001:0> Marshal.dump(Foo.new)
=> "\x04\bo:\bFoo\x06:\t@baz:\bbaz"
Instead class variables and methods are defined by the class’s definition in the runtime at the point we come to unmarshal.
If we redefine our class above
class Foo
@@bar = :shenanigans
def initialize
@baz = :shenanigans
end
def qux
:shenanigans
end
end
And then deserialize that earlier string
irb(main):001:0> deserialized = Marshal.load("\x04\bo:\bFoo\x06:\t@baz:\bbaz")
=> #<Foo:0x00007ff70989ef10 @baz=:baz>
irb(main):002:0> deserialized.qux
=> :shenanigans
We see see that we get the behaviour of the new class definition, hydrated with the instance variables we persisted with an earlier one.
Problems arise when a later class definition embeds new assumptions about the shape of those instance variables, that aren’t matched by already serialized objects.
That’s exactly what occurred in this case – ActiveRecord internals had changed between Rails 5 and 6, moving the primary_key
lookup (used by #id
) from a class variable to an instance variable.
In Rails 6, if we pull a record fresh from the database we can see this new instance variable is set
irb(main):024:0> f = Folder.find 16949
=> #<Folder id: 16949, ...
irb(main):026:0> f.instance_variable_get :@primary_key
=>
"id"
but not if we pull the same record from the cache, written by Rails 5
irb(main):027:0> cf = Rails.cache.read 'parent-for-folder-18395'
=> #<Folder id: 16949, ...
irb(main):028:0> cf.instance_variable_get :@primary_key
=> nil
It’s possible to manually patch over this at runtime
irb(main):038:0> cf = Rails.cache.read 'parent-for-folder-18395'
=> #<Folder id: 16949, ...
irb(main):039:0> cf.id
=> nil
irb(main):040:0> cf.instance_variable_set(:@primary_key, "id")
=> "id"
irb(main):041:0> cf.id
=> 16949
Perhaps that might have have been used to address the issue had we been making a hard cutover to Rails 6, however given we wanted to gradually transition traffic between the pools, we’d be writing and reading entries from both Rails 5 and 6 concurrently.
Backwards compatibility of ActiveRecord instances written to the cache by Rails 6 and read with Rails 5 turns out to behave worse, raising an exception because the graph includes child classes that don’t exist in Rails 5:
# Write in Rails 6
irb(main):002:0> Rails.cache.write "foo", Folder.find(42)
=> #<Folder id: 42, title: "Foo", ...>
# Attempt to read in Rails 5
irb(main):001:0> Rails.cache.read("foo")
Traceback (most recent call last):
1: from (irb):1
NameError (uninitialized constant ActiveModel::LazyAttributeSet)
Did you mean? ActiveModel::AttributeSet
ActiveModel::LazyAttributeHash
So how did we solve the problem?
We were lucky. 😅
We audited the codebase for instances where we were storing ActiveRecord instances and found only two including the case above. We were able to show the remote cache wasn’t as helpful as simply memoizing for one and for the other we replaced with caching primitives.
The takeaway here – it’s generally a bad idea to store anything other than primitives in the Rails cache.