The greatnesses and gotchas of YAML
By Sidney Liebrand on Dec 23, 2017•13 min readUpdate 08–11–2018: Thank you Anatoli Babenia
for pointing to the base 60 parsing 'feature' in the docker-compose
documentation.
It led to me finding another great resource and added it along with some new content
to this post.
In this post I want to talk about YAML. Like the very popular JSON format, it is a file format that allows you to store data in a structured way. Last week I had a discussion with a colleague about an unexpected output value when parsing YAML to a Ruby hash. The YAML data looks like this:
---some_key: some_other_key: nil
When parsed in Ruby, it looks like this:
{'some_key' => {'some_other_key' => 'nil'}}
And the equivalent Python output:
{'some_key': {'some_other_key': 'nil'}}
The confusion was about the value of some_other_key
which we
both thought would become nil
instead of 'nil'
. I mentioned to my
colleague that if he wanted to get a nil value, he might as well
leave it completely empty:
---some_key: some_other_key:
Which indeed, leads to the expected result in Ruby:
{'some_key' => {'some_other_key' => nil}}
And of course, in Python too:
{'some_key': {'some_other_key': None}}
At this point we became curious, I mean, there must be some kind of nil
value,
right? So we ventured to Google and well, found an answer in no time at all :)
There is a nil
value in YAML, it's called null
!
---some_key: some_other_key: null
Also yields the expected result for both Ruby and Python.
And this was only the start...
Since that moment I've been wondering what more is there to YAML. I've written literally thousands of lines of YAML test data for one of my gems but I've never really wondered what the language could really do.
What I also noticed is that there aren't all that many YAML posts out there, some resources I used while gathering information for this post:
So I would like to share some of the features of YAML that you might not know about and also, share some differences between YAML parsers (the Ruby and Python parsers).
Inheritance
One cool feature, which I first saw when bootstrapping a sample Rails application was that you
can define "defaults" using anchors. In Rails, the config/database.yml
file contains the
following content by default:
default: &default adapter: sqlite3 pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %> timeout: 5000 development: <<: *default database: db/development.sqlite3 test: <<: *default database: db/test.sqlite3 production: <<: *default database: db/production.sqlite3
As you can see, there is a default
key followed by &default
. The &default
keyword here represents
the anchor. Then, in another YAML node, you can inherit properties from that anchor by adding a
special key <<
followed by *default
in this case. To overwrite a default value, simply add the key
you want to overwrite with its new value below the <<:* default
line.
Write JSON in your YAML
Another handy thing to know is that you can write JSON inside YAML, this is pretty neat and to be expected as YAML is a superset of JSON (or well, since version 1.2 it is at least).
The following YAML:
---key: {"some": "json"}another: [1, 2, 3]
Parsed in Ruby this results in:
{"key"=>{"some"=>"json"}, "another"=>[1, 2, 3]}
YAML keys as Ruby symbols
This one I looked for specifically when I started a major rewrite of one of my gems and decided to migrate test data out of Ruby into YAML. I was curious to see if YAML could actually store Ruby Symbols instead of Strings. While I didn't have thousands of tests written in YAML at the time, I thought "Why not?". The answer was that indeed, the Ruby parser understands symbols written in YAML, and treats them as such when parsing in Ruby.
---:my_symbol_key: :or_value
In Ruby, evaluates to the following:
{:my_symbol_key=>:or_value}
Whereas the same YAML parsed in Python outputs:
{':my_symbol_key': ':or_value'}
I only recently gave this some thought, if I were to port my gem to Python for whatever reason,
I couldn't "conveniently" use this YAML anymore and for anyone wanting to use the gem's YAML
outside of Ruby, it would contain useless :
characters at the start of every "symbol".
So yeah, while awesome, use with caution! I'm considering rewriting my gem's YAML to just
use strings instead of symbols because of this "exclusive" Ruby feature :)
Multiline strings? YAML's got your back!
Another topic often discussed in programming languages in general is how to handle multiline
strings, various languages have different solutions to the same problem. YAML has it's own
two solutions. The pipe (|
) character and the greater than (>
) sign.
The pipe notation, also referred to as "literal block":
literal: | This block of text will be the value of the 'literal' key, with line breaks being preserved. It continues until de-dented, leading indentation is stripped. Any lines that are 'more-indented' keep the rest of their indentation - these lines will be indented by 4 spaces.
The greater than sign notation, also referred to as "folded block":
folded: > This block of text will be the value of 'folded', but this time, all newlines will be replaced with a single space. Blank lines, like above, are converted to a newline character. 'More-indented' lines keep their newlines, too - this text will appear over two lines.
Both snippets came from here. This post also contains a lot of other great YAML examples you should definitely check out!
Quoted strings, begone!
Unlike its friend JSON, YAML doesn't mind if you don't put your strings between quotes. The following will output exactly what you would expect:
some_key: with a string value
In Ruby and Python, the results are the same (output in Ruby):
{"some_key"=>"with a string value"}
Keys don't have to be quoted either, so removing the _
from some_key
results
in the following in both Ruby and Python (output in Ruby):
{"some key"=>"with a string value"}
While this makes copying certain values easier YAML tries to be smart about some
(more than you might think) of them. When a key with a value of either yes
, Yes
,
YES
, on
, On
or ON
is present, the resulting value when parsing this YAML will be
a boolean. The same is true for values no
, No
, NO
, off
, Off
and OFF
.
The following example shows Ruby syntax but Python 3.6 parsed it exactly the same.
# All the following equal trueYAML.load("key: Yes")YAML.load("key: yes")YAML.load("key: YES")YAML.load("key: on")YAML.load("key: On")YAML.load("key: ON")# => {"key"=>true} # All the following equal falseYAML.load("key: no")YAML.load("key: No")YAML.load("key: NO")YAML.load("key: off")YAML.load("key: Off")YAML.load("key: OFF")# => {"key"=>false}
If you expect your program to see these values as strings, the solution is to quote the string or to cast the value as we'll see in the next section.
Casting values
If you want to ensure that a key has a value of a specific type, you can cast values
explicitly: key: !!str 0.5
=> {"key" => "0.5"}
in both Ruby and Python. Likewise
key: !!float '0.5'
=> {"key" => 0.5}
as well.
Some parsers actually implement language specific tags. These can be used to create specific data structures for that given language:
---key: !!python/tuple [1, 2]
Results in the following in Python:
{'key': (1, 2)}
What REALLY surprised me here was that the Ruby parser turned it into an Array instead:
{"key" => [1, 2]}
So I thought to myself, "What if I change !!python/tuple
to !!ruby/array?
".
So I went on ahead and updated the snippet:
---key: !!ruby/array [1, 2]
And as expected, Ruby returns the correct result:
{"key" => [1, 2]}
Our friend Python on the other hand, has some issues here:
...snipped...yaml.constructor.ConstructorError: could not determine aconstructor for the tag 'tag:yaml.org,2002:ruby/array' in "<unicode string>", line 1, column 6: key: !!ruby/array [1, 2]
In the above example we see that the Python parser throws an error because it can't find the correct constructor for the tag. When Ruby finds a language specific tag that it doesn't know how to use, it is simply ignored. I think both languages have a different point of view where Python is more "demanding" about what kind of YAML you feed it and Ruby tries to "cushion" your experience whenever it can.
So thank you Ruby (at least MRI Ruby) for supporting and treating these Pythonic types as if they were your own ♥️
Integer notation
This is a small one, and part of multiple programming languages to improve readability of
large integers / binary numbers. YAML allows the usage of _ characters to "group" numbers,
e.g. 1000000000
vs 1_000_000_000
. I think the latter is many more times more readable and
therefore think that YAML deserves a honorable mention for including this awesome feat! 👍
Sexagesimal numbers?
We've already seen some weird behavior with some unquoted string values magically turning
into booleans but there is more! YAML parses numbers in ii:jj
format in base 60! For example, in Ruby:
YAML.load("key: 12:30:00")# => {"key"=>45000}
While the result is following the spec, it is more often than
not undesired. It becomes more interesting when the digit starts with a leading 0
. In Ruby:
YAML.load("key: 01:30:00")# => {"key"=>5400}
Whereas in Python:
yaml.safe_load("key: 01:30:00") # => {'key': '01:30:00'}
Ruby seems to be trying to "fix" this by trimming the leading 0
and parsing the rest in base 60 whereas
Python sees that this value is not valid ii:jj
format. I am not sure why this is but my guess is what
we're going to talk about next.
Octal numbers
If your YAML contains integer values that start with a 0
and do not contain digits greater than 7
,
they will be parsed as octal values. In Ruby:
# parsed as octalYAML.load("key: 0123")# => {"key": 83} # parsed 'normally'YAML.load("key: 01238")# => {"key": "01238"}
Python does exactly the same thing in this case. To get back to the previous example, I think Python
sees the value 01:30:00
as an invalid octal number and therefore chooses to parse it as a string.
Complex keys
Aside from string keys, YAML won't complain if you want to use floats:
1.1: hello there
=> {1.1 => "hello there"}
but this is still a simple key.
It will complain about using a list or hash as key: [1, 2, 3]: hello there
=> error
.
Both the Ruby and Python parsers give an error when trying either.
The solution is to use a language specific tag. This can be used to create keys that are complex data types such as a Ruby Array or Python Tuple.
A complex key is created by first inserting a question mark followed by a space, followed by the language specific tag and the final value of the key. Then, on a new line, the value is added as usual, starting with a colon followed by a space character and the value of the key:
--- ? !!python/tuple [1, 2] : hello
In Python, this will result in:
{(1, 2): 'hello'}
Ruby on the other hand, has no "Tuple" type (nor did I expect it to understand the python tags) and uses the thing that most closely resembles it, an Array:
{[1, 2] => "hello"}
So while it is a bit awkward and not very portable, still something useful to know just in case :)
Comments
We've already seen what kind of beast YAML actually is under the hood, I actually learned new things
myself while writing this post since I ran every example through both the Python and Ruby REPL at the
same time (Thank you tmux pane-synchronization ♥️) and it doesn't stop there! Another
seemingly-trivial-yet-missing-from-JSON feature would be the fact that you can add # comments
.
In JSON, comments aren't supported but of course, YAML has our back and lets us do pretty much whatever we want, a comment starts with a # sign:
---some: yaml# oh noes! A commentno: problem
Both Ruby and Python simply ignore the comment:
{"key"=>[1, 2], "key2"=>"no problem"}
Summary
In short, this post described the following features:
-
Inheritance / defaults
-
Write JSON within YAML
-
Ruby Symbols as keys
-
Multiline strings
-
Quoted strings
-
Casting values
-
Integer notation
-
Sexagesimal numbers?
-
Octal numbers
-
Complex keys
-
Comments
YAML is certainly a versatile marku...lang... yeah never mind that :) But seriously though, YAML is indeed very versatile, it can do lots of things as you have hopefully seen in the examples.
The REPLS used for testing were pry for Ruby and Python's builtin REPL. The Ruby parser used was Yaml on Ruby (MRI) 2.4.1 and for Python, pyyaml was used on Python 3.6.2.
Post update: During the process of updating this post, I used pry for Ruby (MRI) 2.5.1 and Python's (3.6.7) builtin REPL. The same libraries were used for testing.
Conclusion
I think YAML is great! Every experience I've had so far with YAML has been a positive one, whether it includes writing thousands of lines or debugging an issue. Even writing this post was a pleasure, I just took my time, opened my favorite REPL's with pane-sync on to reduce typing and started compiling information and examples, sometimes with side-effects I didn't even anticipate which led to interesting results.
I'm pretty sure I've missed some things considering what we've just witnessed earlier in the Casting Values section, there are probably lots more of these nuances between various other parsers.
From this point on, I hope that your YAML experience will also be great, it is a powerful tool to be able to wield, and I also hope you learned something new.
Cheers!
👋