The Problem
I noticed some unusual behavior with respect to YAML String serialization between my Linux production system and my Mac OSX development system.
After dumping the production database via pg_dump -O --no-acl mydb | gzip > ~/mydb.sql.gz and then restoring the unzipped dump on my development system via rake db:drop; rake db:create; psql mydb < mydb.sql, I noticed that a particular serialized field in my Rails app that should always be an Array of String objects occasionally contained Integers.
After a little research and experimentation, I discovered that the production Linux system would occasionally omit the quotation marks around Strings containing only numeric digits. I haven’t analyzed the pattern fully, but here are some examples where the YAML serialization did or did not use quotes:
- "90103"
- 000080
- "000071"
- "000124"
- "000003"
- 008397
- 000408
- 000009
- 000188
- "000021"
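One pattern that does fit these ten samples: every quoted value reads as a valid integer (plain decimal, or octal with a leading zero), while every unquoted value has a leading zero and contains an 8 or 9, so it is not valid octal. The sketch below only checks that observation against the samples. The two regexes are an assumption about what "looks like an integer" means here; they are not taken from Syck's source, and ten samples do not prove the rule.

```ruby
# Hypothetical helper: does this text read as a decimal or octal integer?
# The regexes are an illustrative guess, not Syck's actual rules.
def integer_like?(text)
  decimal = /\A(?:0|[1-9][0-9]*)\z/ # plain base-10 integer, no leading zeros
  octal   = /\A0[0-7]+\z/           # leading zero, digits 0-7 only
  !!(text =~ decimal || text =~ octal)
end

# The sample values from the list above, grouped by how they were serialized.
def quoted_samples
  %w[90103 000071 000124 000003 000021]
end

def unquoted_samples
  %w[000080 008397 000408 000009 000188]
end

(quoted_samples + unquoted_samples).each do |value|
  puts "#{value}: integer-like? #{integer_like?(value)}"
end
```

If that guess is right, the serializer on the Linux system left a String unquoted whenever it judged that the text could not be mistaken for an integer, and the two libraries simply disagree about values such as 000080.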
Further investigation revealed that the Linux production system was using Syck (a “dated C implementation of YAML 1.0”) and my Mac OSX development system was using psych (a “libyaml wrapper (in Ruby core for 1.9.2)”). libyaml is a “fast C implementation of YAML 1.1”. So, either the quoting rules changed between YAML 1.0 and YAML 1.1, or there is a bug in one of the implementations (likely Syck).
The Solution
The solution for proper “future” behavior is pretty simple. Install libyaml on the Linux system as follows:
wget http://pyyaml.org/download/libyaml/yaml-0.1.4.tar.gz
tar xzf yaml-0.1.4.tar.gz
cd yaml-0.1.4
./configure
make
make install
I think that’s enough, but I went ahead and rebuilt my Ruby 1.9.2 just in case it needed to know about the existence of libyaml at build time.
The solution for converting my database with YAML 1.0 serialization to YAML 1.1 serialization is a bit trickier. Since the “dump” and “load” operations are matched for a particular version of YAML, it seems difficult to load the data using YAML 1.0 (thereby retaining the String type when reading an unquoted 000088) and then dump the data using YAML 1.1 (to get proper quoting of '000088'). Further complicating this is the fact that Rails handles the serialization operations automatically.
It does appear possible to dynamically switch between syck and psych by using the following:
YAML::ENGINE.yamler = 'syck'
YAML::ENGINE.yamler = 'psych'
So, one option is to repeatedly switch to syck, read in data, switch to psych, and then write the data. <sigh>
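A minimal sketch of that switching approach for a single piece of raw YAML text, assuming a Ruby 1.9.x build with both syck and psych available. The helper names with_yamler and reserialize are hypothetical. YAML::ENGINE does not exist on later Ruby versions, so the helper raises there instead of switching.

```ruby
require 'yaml'

# Hypothetical helper: run the block with the given YAML engine selected,
# then restore whichever engine was active before.
def with_yamler(name)
  unless defined?(YAML::ENGINE)
    raise NotImplementedError, "YAML::ENGINE is only available on Ruby 1.9.x"
  end
  previous = YAML::ENGINE.yamler
  YAML::ENGINE.yamler = name
  yield
ensure
  # previous is nil if we raised before switching, so nothing to restore.
  YAML::ENGINE.yamler = previous if previous
end

# Hypothetical helper: load with syck (keeping unquoted 000088 a String,
# per the behavior described above), then dump with psych.
def reserialize(yaml_text)
  data = with_yamler('syck') { YAML.load(yaml_text) }
  with_yamler('psych') { YAML.dump(data) }
end
```

Wrapping the switch in a block keeps the engine change from leaking into the rest of the process if the load or dump raises.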
Update:
It appears that, due to the semantics of the Rails serialize function, it’s not enough to just read the model object using syck and then immediately write it with psych, because that alone doesn’t seem to cause the field to be deserialized. I had to refer to the field on each object. This is a pain because it prevents me from writing a generic loop that handles all model objects easily w/o reference to their specific fields.
I’ll withhold judgment for a while, but my first inclination is to consider abandoning YAML serialization for something a little more robust and portable.
Update 2:
It appears my welcome from psych is a serious memory leak. I’ve been running long-running Ruby/Rails processes for years, and this is the first time I’ve experienced a failure due to an out-of-memory condition. There are a number of Google hits regarding the issue. After I fix the leak, I’ll begin researching alternatives to YAML serialization in Rails.
Update 3:
The number of bug reports on psych and rubygems I’ve had to wade through recently is amazing. My current solution is to remove the psych system gem and install Ruby 1.9.3p0, which required upgrading Passenger to the latest version from source to get Ruby 1.9.3 compatibility. I still had to track down a few odd errors such as “undefined method `yaml' for #<Psych::Nodes::Stream:…>” and “invalid date format in specification: "2011-10-02 00:00:00.000000000Z"”, all because I chose to use the default Rails serialization assuming there would be no issues. Lesson learned.