Parsing text in Ruby, part 1
Parslet transforming my world
March 15, 2023 · Felipe Vogel
- Regular expressions: not the problem
- Discovering Parslet
- The part where I wanted to run crying back to regular expressions, until I learned to implement my parser incrementally
- Takeaways
- Next steps
I made a new Ruby gem: Litter. It's for avid trash-picker-uppers like me who keep a log of the interesting litter they collect.
(Surely there are other people who do that... *does some googling* See? I'm not alone!)
But I had another reason to make this odd gem: I wanted to explore a more structured approach to text parsing than what I've come up with on my own so far. So I used the Parslet gem, and I thought it would be worth writing this post on how it went.
Regular expressions: not the problem
Here's the backstory. I'm building another gem called Reading that parses my CSV reading log. It has a custom-built parser that is quite messy, for reasons that I couldn't articulate until just now when I tried Parslet.
At first I thought the messiness came from the numerous regular expressions in my homespun parser. The well-known saying comes to mind:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski
It's true that some of my regular expressions are long and hard to read, but I don't think regular expressions are the problem. Even if I broke them up and made them easier to read, that parser code would still be a mess.
The problem with my Reading parser, I now realize, is this: it mixes up parsing and transformation rather than separating them into two steps.
Hold up, what is "transformation" and how is it different from parsing? I didn't know either, before I tried Parslet. So let's take a look at Parslet to find out!
Discovering Parslet
In the book Text Processing with Ruby I found the structured parsing tool that I was looking for: Parslet. Below is an example similar to the one in the book.
A lot of the syntax is self-explanatory, and for the rest you can refer to Parslet's Get Started, Parser, and Transformation guides.
require "parslet"

# An example config file to be parsed.
INPUT = <<~EOM.freeze
  name = Felipe's website
  url = http://fpsvogel.com/
  cool = true
  post count = 37
EOM

# The output after parsing and transforming.
OUTPUT = {
  name: "Felipe's website",
  url: "http://fpsvogel.com/",
  cool: true,
  post_count: 37,
}

# Parses a string into a tree structure, which we'll then transform
# into the above output. (See the code at the bottom.)
class MyParser < Parslet::Parser
  rule(:whitespace) { match('[\s\t]').repeat(1) }
  rule(:newline) { str("\r\n") | str("\n") }

  # e.g. "name" or "url" in the example above.
  rule(:key) { match('[\w\s]').repeat(1).as(:key) }
  rule(:assignment) { str('=') >> whitespace.maybe }

  # e.g. "Felipe's website" in the example.
  # All characters until the end of the line.
  rule(:value) { (newline.absent? >> any).repeat.as(:string) }

  rule(:item) { key >> assignment >> value.as(:value) >> newline }
  rule(:document) { (item.repeat).as(:document) >> newline.repeat }

  root :document
end

# Transforms a parsed tree into a hash like OUTPUT above.
class MyTransform < Parslet::Transform
  rule(string: simple(:s)) {
    case s
    when "true"
      true
    when "false"
      false
    when /\A[0-9]+\z/
      s.to_i
    else
      s.to_s
    end
  }

  rule(key: simple(:k), value: simple(:v)) {
    [k.to_s.strip.gsub(" ", "_").to_sym,
     v]
  }

  rule(document: subtree(:i)) { i.to_h }
end

parsed = MyParser.new.parse(INPUT)
hash = MyTransform.new.apply(parsed)

puts hash == OUTPUT
# => true
To summarize, a two-step process happens here: first the INPUT string is parsed into an intermediate tree structure, and then the tree is transformed into the simpler hash as in OUTPUT.
Neat! This looks a lot cleaner than if I had mixed parsing and transformation together like I did in my Reading parser.
Impressed by this tidiness, I proceeded to build my litter log parser with Parslet.
The part where I wanted to run crying back to regular expressions, until I learned to implement my parser incrementally
My enthusiasm was curbed as soon as I wrote my first attempt at a parser aaaand... I got a Parslet::ParseFailed error. It did tell me the input line where the problem occurred, but that doesn't help when there are many rules at play in one line and I don't know which of them needs to be adjusted. I was stumped.
This happened several times until I realized that instead of writing a bunch of rules and then testing them out together, I have to write one rule at a time, or even one bit of a rule at a time, and examine the output at each step. That way, if I get an error then I know it's because of the one change that I just made.
Takeaways
In the end, my parser and transformation are definitely easier to understand than if I'd winged it and built an ad-hoc parser that (as before) doesn't split transformation out into its own step.
In the tests you can see example input and example output. Considering how different the input and output are, the amount of code that I had to write is fairly small.
You may have noticed that my transformation class doesn't use Parslet rules. That's because Parslet::Transform works best when a parsed tree is very predictable in its structure, and when the basic structure of the tree doesn't need to change. To quote Parslet's "Transformation" doc:
Transformations are there for one thing: Getting out of the hash/array/slice mess parslet creates (on purpose) into the realm of your own beautifully crafted AST classes. Such AST nodes will generally correspond 1:1 to hashes inside your intermediary tree.
In my case, I needed to radically change the structure of the output (grouping item occurrences by item instead of by date of occurrence), so it made sense to iterate over the parsed output in my own way.
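As a rough illustration of that kind of regrouping, here is a plain-Ruby sketch. The shape of the parsed data, the item names, and the counts are all invented for the example. The real Litter gem's tree and output may look quite different.

```ruby
# Hypothetical parser output: one entry per date, each listing what was
# found that day. (Invented for illustration only.)
parsed = [
  { date: "2023-03-01", items: [{ name: "bottle", count: 2 }, { name: "sock", count: 1 }] },
  { date: "2023-03-04", items: [{ name: "bottle", count: 5 }] },
]

# Regroup by item instead of by date: { item name => { date => count } }.
by_item = Hash.new { |hash, name| hash[name] = {} }

parsed.each do |entry|
  entry[:items].each do |item|
    by_item[item[:name]][entry[:date]] = item[:count]
  end
end

pp by_item
```

The output has a different shape from the input rather than a node-for-node cleanup of it, which is why a hand-written loop can be a more natural fit here than a set of Parslet::Transform rules.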
Next steps
After this trial run with Parslet, I'm considering using it in my larger project Reading, where it could replace my ad-hoc parsing code. It would take a lot of work to replace what amounts to the majority of the code in that gem, but it might be worthwhile for a couple of reasons:
- I have a hard time understanding the parsing code in my Reading gem because, as I've mentioned, parsing and transformation are all mixed up. It's only after playing around with Parslet, which separates the two, that I'm finally able to perceive this problem.
- The column that I still need to implement is the most complex of all (the History column for fine-grained tracking of reading/watching), and I've been putting off implementing it because of how messy I imagine it will be with the old hodgepodge approach.
And, well, doing more Parslet (or taking a similar parse-and-transform approach) is the thing I'm most interested in right now, and I think that counts for something in my open-source project that no one uses besides me.
Before you go, here's a particularly trashy spot in the park where I've begun cleaning up. (That is a mattress in the upper-right corner. It will definitely find its way into my litter log.)