Laying around in my computer, I have this file:
-rw-r--r-- 1 joseph joseph 165K Feb 5 1995 COOKFINL.XLS
Every time I see that timestamp, I am amazed. I can’t believe that on this very computer, I have a long sequence of 1’ and 0’s that has remained unchanged since they were first created nineteen years ago. I was not even a year old when this file was made, and yet here it is just hanging out on the cloud in my Dropbox folder. It’s seriously awesome.
The file itself is a cookbook that my mom made one year as a Christmas gift to our extended family. She compiled around 150 of the most popular recipes from both sides of our family, typed them into Excel, and then had spiral-bound books printed with each recipe indexed and grouped into categories like appetizers, entrees, and desserts. Over winter break, she expressed some interest in updating the cookbook to fix some errors she’s found over the years. She also wanted to be able to send a copy of the book to the members of my generation of the family, as we are now away off at school and wanting to access some of the old, familiar recipes of home. Instead of printing out physical copies again, we decided that it would be best to send the digital copies out to all the kids. Cookbooks are nice, but the convenience of having the entire thing just on your phone wherever you are (from the couch, to the grocery, to the kitchen) overshadows any advantages of an actual dead tree book.
The problem is that this file is an Excel file. My mom complained about how terrible it was to make this 140-some page beast cell-by-cell in a spreadsheet document, but she did it because it was what the techie guy who guided her through this project told her it was the best tool for the job. Really, with the consumer-friendly programs available to my mom at the time (e.g. Microsoft Office) it probably was the best. While she always said she should have just done it in Word, I can’t imagine trying to format each recipe with that would be anything less than horrendous.
But I knew there had to be a better format. I didn’t know what I wanted to actually do with all the data, but I knew that it would be much easier to work with the recipes if they weren’t stuck in an old Excel file from ‘95. I decided I would try and parse it into a JSON file, and I would use Perl to do so. I chose Perl because I will soon be using it in my student sysadmin job and I wanted to get some experience with it.
I started by saving the file as a CSV from LibreOffice, and then got to work parsing the thing with Perl. I quickly realized I couldn’t just parse this as a normal CSV, as it was nowhere near formatted like a traditional CSV. Instead of neat rows of comma-separated columns of fields, it was hundreds of entries like this:
,,, ,,, ,,, ,BLUEBERRY MUFFINS,, ,,, ,NUMBER OF SERVINGS:,12 muffins,
,AUTHOR:,Joan Graham, ,,, ,2 cups flour,1 C. sugar, ,2 tsp. baking powder,1/2
tsp. salt, ,1/2 C. butter,1/2 C. milk, ,2 eggs,1 tsp. vanilla extract, ,2-1/2 C.
blueberries,sugar for sprinkling tops of muffins, ,,, ,,, ,,, ,,, ,Preheat oven
to 375°F. Grease and flour a 12-cup muffin pan. In medium bowl,, ,""combine dry
ingredients. With pastry blender or 2 knives, cut in butter until mixture"",,
,""resembles coarse crumbs. In bowl combine milk, eggs and vanilla. Add to
dry"",, ,""ingredients all at once, stirring just until moistened. Do not
overmix. Crush 1/2 cup"",, ,berries and add to batter. Fold in remaining
berries and spoon into muffin pan. ,, ,Sprinkle with sugar. Bake 30 minutes or
until toothpick inserted in center comes out,, ,clean. Cool in pan 5 minutes;
turn onto wire rack. (250 calories each),, ,,, ,,,
My plan of attack was as follows: parse for the recipe name by looking for an
all-caps line, parse for number of servings line and the author, and finally
parse the ingredients list and instructions as normal CSV’s. The way I went
about doing this was, admittedly, an enormous hack. My first effort of looking
for all caps lines with something like if ($line eq uc($line))
worked alright,
but I had some instances of the NUMBER OF SERVINGS line giving me issues if it
just included a number. So I ended up with an ugly hack where I first looked
for the author line, since I knew every recipe in the book had an author, and
then worked backwards to get the servings and recipe name data. I say work
backwards loosely, as I was doing this all line-by-line reading the file,
but instead I threw the information into different spots in appropriate
arrays based on whether or not I had hit an author line. It was ugly. I
also realized pretty late on that I didn’t want all these arrays, and
instead I went for a nested hash where each recipe name pointed to hashes
for author, servings, ingredients, and instructions.
I also got a chance to learn a little more about regular expressions. I knew a
bit about regexes mostly from vim find-and-replace commands, but with this
project I actually read up more on them. Admittedly, I read up on them only a
little, and my lack of understanding definitely shows in the code. I’m sure
there is some fancy way to address nearly all of my parsing problems with a
well-crafted regex, but instead I broke them into little bits and pieces doing
one thing at a time (like removing the NUMBER OF SERVINGS:,
bit, and leading
and trailing commas, and other things). Also, there were probably lots of
simple cases where I didn’t need the regexes and instead could have written a
little function to do what I needed more simply and cleanly, but as the saying
goes, when all you have is a hammer, everything looks like a nail.
In the end, I was successful (mostly) in producing a JSON file of all the recipes. Here’s what I learned while doing so:
-
Standardizing your data to fit nicely into formats is hard, and not always the right way to go. My mom and I ended up not using my JSON for our original goal. We ended up just fixing the typos, adding in a few things, and sharing the Excel file anyway. The main problem with trying to fit everything into the JSON was that not everything had the same format. This was a compilation cookbook from dozens of different authors, each with their own style. Some included ingredients in-line with the instructions, while most put them all at the start. Some provided multiple sets of ingredients for various serving amounts. There were some ““recipes”” that weren’t even recipes at all, like ““A Recipe for a Happy Marriage”” and some other small poems. In the end, the trade off was to standardize everything into a clean, uniform format, or to save the personalities and unique traits of each individual recipe the way that the original author wrote it. For this project, a family compilation cookbook, the trade off leans heavily towards preserving the original intent of the authors to give each entry a more personal feel.
-
Keep future proofing in mind when saving your data. For a document that you could realistically see yourself using decades from now (like a cookbook), it is worth it to take a moment and think if the way you are saving your data now will be problematic in the future. As with anything tech-related, this is hard to predict. However, it is reasonable to think that saving in an open standard format will stand the test of time more robustly than a proprietary Microsoft office format. If you save in a format only readable by one program, what happens when that one program no longer runs on your machine 20 years from now?
-
It is nice to finish a project out to completion. Pretty early on in this endeavor I realized that it was not going in the direction we needed. However, I had plenty of time, and still needed to learn some Perl anyways, so I stuck with it. And I’m glad I did. Running the program for the final time and seeing the nicely-formatted JSON left me with a sense of accomplishment that I haven’t really felt in a while. These days I’ve really been putting all of my productive efforts into schoolwork, and it’s been a while since I focused on a personal project like this. Hopefully blogging like this will give me an incentive to follow things through to completion so I can do a final write up on them. However, I also realize that these things are never really complete. While the JSON passed a quick skim-though test for any glaring errors, I’m sure there are little bits here and there that weren’t parsed completely right or were spit out in a strange way. Also, my code is a disaster. That will be another goal for future projects: write code that I am willing so share with the world on Github or Bitbucket.