Deep copying objects in Ruby

Time and time again I forget that Object#clone in Ruby is a shallow clone and end up biting myself and spending 30 seconds looking at myself asking what the hell happened. The only difference today is that I decided to finally post about it in my blog – let’s hope this time is the last.

Well, what is a deep copy?

In C++ there is the concept of a copy constructor, which is used when an object is initialized as a copy of an existing object. In many situations this can be deduced by a compiler and you don’t have to worry. If your object contains pointers to things that can’t be shared, however, you have to provide what is called a user-defined copy constructor:

A user-defined copy constructor is generally needed when an object owns pointers or non-shareable references, such as to a file [...] — Wikipedia on Copy Constructor

In Ruby, variables have references to objects. If you want a clone of that variable (e.g. an Array), you can simply do:

x = [1, 2, 3]
y = x.clone

This works because numbers are singletons, so you’re not passing references around (when working in 64 bits, the Ruby interpreter inlines Numeric objects for most operations as well). But if you have other arrays or hashes instead of numbers, things start to break. You don’t have the object, but a reference to it, thus when you do:

x = [{ rank: 1 }, { rank: 11 }]
y = x.clone
y[0][:rank] = 5 # x is altered as well.

This kind of bug can be hard to understand when first encountered, so it’s definitely a good thing to have in mind.

The answer

I was in the middle of implementing what I just explained when I noticed I was reinventing the wheel. Turning to Stackoverflow, I found an answer similar to what I was doing and another, simpler, more interesting and applicable to my specific situation:

def tentative_deep_copy(obj)
  Marshal.load(Marshal.dump(obj))
end

Duh. The Marshal library is a tool for storing objects as bytes outside of the program for later use. I’ve never had to use it before, as the only apparent use case I have (storing trained statistical classifiers) can be achieved more robustly by saving parameters in a JSON.

But I digress. By storing the object’s data as a byte stream and reconstructing the same object afterwards, you create new copies of each of the constituent objects.

However, there are two problems with this approach:

The second point could be solved by thinking the problem through, but I had 30 minutes to come up with an argument for a point I was about to make in a meeting. I sure hope I never have to do this again (famous last words…)