A cautionary tale debugging the strange behaviour of a Rails upgrade during a blue-green deployment.
Hybrid canary blue/green Rails upgrades
For major updates of the Litmus Rails monolith we choose to provision a whole new parallel production application server tier (our green env) alongside our existing one (our blue env), deploy our new application version there and then incrementally transition production workloads across to the new environment.
This allows us to reduce risk:
by gaining additional information (performance, stability, correctness) beyond what our test suite and manual QA have already shown – nothing is more reflective than observing our real traffic, using the real database and real external services.
by ensuring we have an escape hatch, if a problem is observed we can quickly drop the percentage of work being handled by the green environment to 0.
To allow for that, the blue and green environments share everything that’s not contained within the tier, most significantly that means they share the same database. Any HTTP request or any background job might be assigned to either blue or green environment.
As it stands we have no concept of stickiness for sessions or for jobs being run in the environment that they were scheduled in. That means old and new must be ready to handle data written by the other, eg sessions, queued jobs, caches.
This forward and backwards compatibility is not guaranteed by Rails, nor does it seem to be an explicit design goal. This means there’s onus on us to test safety of this ahead of time.
Earlier this year we completed an upgrade from Rails 5.2 to Rails 6.1.
At the beginning of the project we tested and confirmed session forwards and backwards compatibility.
Late in the project we were bitten by subtle problems where that back/forwards requirement had been broken, for:
- a small fraction of our caching calls
- a small fraction of our background job invocations
We’ll cover the first of those here and the latter in a follow up post.
After extensive testing in our staging environment and with our complete test suite passing, we moved on to QA within our green environment for additional confidence before any customer traffic was sent there.
We encountered our first surprising problem at this stage, with a key action failing due to a stack overflow.
Our investigation isolated the problem to an area of code where we recursively traverse a tree of ActiveRecord objects. Critically, this traversal lent on the Rails cache as a performance optimization.
In Rails 5, the expected usage of the cache had us making
cache.fetch calls with a cache key of something like
"parent-for-folder-123" for a parent folder with id
In Rails 6, we were seeing the
cache.fetch passed the string
"parent-for-folder-" – ie no id in some cases. More worryingly this was yielding an actual cache hit.
The instance returned was found to respond
#id leading to endless recursion.
We reasoned something had poisoned the cache (populating
parent-for-folder-), but still didn’t have an explanation as to how
#id could ever have been
nil. Guards we had in place should have ensured we were always dealing with a persisted record.
We opened a Rails 5 and Rails 6 console side by and started comparing cache queries.
After a lot of head scratching we skewered our cause – an earlier
cache.fetch had yielded a deserialized
Folder object that looked and behaved correctly, including showing a valid
id in its attributes, yet indeed responded
Comparing a fetch of that entry that shouldn’t exist we found that the weird
#id behaviour seemed specific to Rails 6.
# Rails 5 irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-" => #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...> irb(main):002:0> cached_folder.id => 30228
# Rails 6 irb(main):001:0> cached_folder = Rails.cache.read "parent-for-folder-" => #<Folder id: 30228, title: "foo", parent_folder_id: 14069, ...> ^^^^^^^^^ irb(main):002:0> cached_folder.id => nil # 🤯
Note, that unlike almost all our other cache usage, we’re not storing primitive data here, but instead directly passing the Rails cache an ActiveRecord instance.
That instance has to be marshalled when placed in the cache and unmarshalled on the way out.
Marshal#dump serializes an object instance (and its descendants) to a string when we persist to the cache.
For a toy class, we can see that string describes our object’s instance variables, but not class variables or methods.
class Foo @@bar = :bar def initialize @baz = :baz end def qux :qux end end
irb(main):001:0> Marshal.dump(Foo.new) => "\x04\bo:\bFoo\x06:\t@baz:\bbaz"
Instead class variables and methods are defined by the class’s definition in the runtime at the point we come to unmarshal.
If we redefine our class above
class Foo @@bar = :shenanigans def initialize @baz = :shenanigans end def qux :shenanigans end end
And then deserialize that earlier string
irb(main):001:0> deserialized = Marshal.load("\x04\bo:\bFoo\x06:\t@baz:\bbaz") => #<Foo:0x00007ff70989ef10 @baz=:baz> irb(main):002:0> deserialized.qux => :shenanigans
We see see that we get the behaviour of the new class definition, hydrated with the instance variables we persisted with an earlier one.
Problems arise when a later class definition embeds new assumptions about the shape of those instance variables, that aren’t matched by already serialized objects.
That’s exactly what occurred in this case – ActiveRecord internals had changed between Rails 5 and 6, moving the
primary_key lookup (used by
#id) from a class variable to an instance variable.
In Rails 6, if we pull a record fresh from the database we can see this new instance variable is set
irb(main):024:0> f = Folder.find 16949 => #<Folder id: 16949, ... irb(main):026:0> f.instance_variable_get :@primary_key => "id"
but not if we pull the same record from the cache, written by Rails 5
irb(main):027:0> cf = Rails.cache.read 'parent-for-folder-18395' => #<Folder id: 16949, ... irb(main):028:0> cf.instance_variable_get :@primary_key => nil
It’s possible to manually patch over this at runtime
irb(main):038:0> cf = Rails.cache.read 'parent-for-folder-18395' => #<Folder id: 16949, ... irb(main):039:0> cf.id => nil irb(main):040:0> cf.instance_variable_set(:@primary_key, "id") => "id" irb(main):041:0> cf.id => 16949
Perhaps that might have have been used to address the issue had we been making a hard cutover to Rails 6, however given we wanted to gradually transition traffic between the pools, we’d be writing and reading entries from both Rails 5 and 6 concurrently.
Backwards compatibility of ActiveRecord instances written to the cache by Rails 6 and read with Rails 5 turns out to behave worse, raising an exception because the graph includes child classes that don’t exist in Rails 5:
# Write in Rails 6 irb(main):002:0> Rails.cache.write "foo", Folder.find(42) => #<Folder id: 42, title: "Foo", ...>
# Attempt to read in Rails 5 irb(main):001:0> Rails.cache.read("foo") Traceback (most recent call last): 1: from (irb):1 NameError (uninitialized constant ActiveModel::LazyAttributeSet) Did you mean? ActiveModel::AttributeSet ActiveModel::LazyAttributeHash
So how did we solve the problem?
We were lucky. 😅
We audited the codebase for instances where we were storing ActiveRecord instances and found only two including the case above. We were able to show the remote cache wasn’t as helpful as simply memoizing for one and for the other we replaced with caching primitives.
The takeaway here – it’s generally a bad idea to store anything other than primitives in the Rails cache.