Adventures in Ruby: File Extensions and Frozen Strings 
3rd-Oct-2014 01:38 pm
epic, long
I'm doing a project at work where I've got a bunch of CSV files with several thousands of lines of data. I need to slice and dice this data in various ways, mostly by pulling out subsets of lines with certain strings occurring in them, counting the number of times certain values occur, and so on.

Looking at this data, it became clear that I could either a) become a serious Microsoft Excel power user, or b) put my slowly growing Ruby scripting skills to work. That wasn't much of a contest.

I actually managed to knock together the skeleton of a useful script pretty quickly. Now I'm polishing it up to make it useable and adding a bit of basic error-checking. I encountered two little issues that strike me as the kind of thing that I'm likely to forget about and then encounter again at some point in the future. So, blogging for my own reference, and because it might possibly be useful to some other Ruby newbie.

First up, file extensions. Basically, I wanted to be able to specify an input file at the command line, and have the script write the output to a file that had the same name as the input file, but with "_counts" appended.

So, my first thought was to read the first command line argument into a variable, and stick the string "_counts" on the end:

input_file = ARGV[0]
output_file = input_file + "_counts"

Well, I didn't even finish typing that into my text editor before I realized the problem. My data files are all comma-separated value files, which mostly end with the file extension .csv. So an input file named "myfile.csv" would produce the output file "myfile.csv_counts", which is not what I want.

Well, okay, it would be easy enough to locate the .csv at the end of the file name, strip it off, append "_counts", and add the .csv extension back on. But, what if somebody decides to feed the script a comma-separated value file with a different extension? The script ought to be able to deal gracefully with any extension.

What's the most general way of defining what a file extension is? Well, I decided that it is a period at the end of a filename, followed by any number of characters. So, I headed over to Rubular.com and played around until I could turn this definition into a regular expression.

Here's what I came up with: \.\w+\z. That translates to a period character (\.) followed by one or more letters, numbers, or underscores (\w+), at the end of the string (\z). That \z prevents a stupid result if someone passes a filename like "my.stupid.csv.file.csv" to the script; without it, the match would land on the first period instead of the last.
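For anyone who wants to try the pattern outside Rubular, here is a minimal sketch. The constant name EXTENSION_PATTERN and the sample filenames are made up for illustration; only the pattern itself comes from the post.

```ruby
# Hypothetical constant name; the pattern is the one described above.
EXTENSION_PATTERN = /\.\w+\z/

# Sample filenames, invented for illustration.
samples = ["myfile.csv", "my.stupid.csv.file.csv", "noextension"]

samples.each do |name|
  # String#[] with a regex returns the matched text, or nil if nothing matches.
  puts "#{name.inspect} -> #{name[EXTENSION_PATTERN].inspect}"
end
```

Both names with an extension should report ".csv", and the name without one should report nil.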

After a bit of tinkering, I built that regular expression into a little piece of code.

m = /\.\w+\z/ =~ input_file
if m
  output_file = input_file.insert m, "_counts"
else
  output_file = input_file + "_counts"
end

That first line uses the match operator =~. If the regular expression on the left of the operator matches any part of the string on the right, it returns the position where the match starts. If there's no match, it returns nil.
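Here is a small sketch of what the match operator hands back. The helper name extension_index and the filenames are hypothetical, chosen only for this example.

```ruby
# Hypothetical helper: returns the index where the extension starts, or nil.
def extension_index(filename)
  /\.\w+\z/ =~ filename
end

puts extension_index("myfile.csv").inspect   # 6, the position of the period
puts extension_index("myfile").inspect       # nil, because nothing matched

# For comparison: without the \z anchor, the first period wins.
puts (/\.\w+/ =~ "my.stupid.csv.file.csv").inspect   # 2
```

Since nil is falsy in Ruby and every integer (including 0) is truthy, the result can go straight into an if.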

The if/else statement was my attempt to write "If there is a file extension, insert the string '_counts' into the input file name just before the extension and make that the output file name. Otherwise, just stick '_counts' on the end of the input file name and make that the output file name."

It ran fine for the case without the file extension. But when I gave it a filename with an extension, I got this error: can't modify frozen String (RuntimeError).

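I did a bit of searching, and what I think happened there is that Ruby freezes command-line arguments so that they don't get changed in the midst of program execution, because Bad Things could happen as a result of that. The insert method changes the string it is called on, so when the script tried to run output_file = input_file.insert m, "_counts", it failed because input_file couldn't be modified. The case without an extension got away with it because + builds a brand new string and leaves input_file alone.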
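The difference can be reproduced without any command-line arguments. This sketch freezes a string by hand as a stand-in for ARGV[0]; the helper name insert_error is hypothetical.

```ruby
# Hypothetical helper: attempts the in-place insert and returns the error
# message if Ruby refuses, or nil if the insert went through.
def insert_error(name, position)
  name.insert(position, "_counts")
  nil
rescue RuntimeError => e
  # FrozenError (Ruby 2.5 and later) is a subclass of RuntimeError,
  # so this rescue covers both old and new versions.
  e.message
end

# Frozen by hand to mimic a command-line argument.
frozen_name = "myfile.csv".freeze

puts frozen_name.frozen?                   # true
puts frozen_name + "_counts"               # + returns a new string, so this works
puts insert_error(frozen_name, 6).inspect  # the "can't modify frozen String" message
```

The exact wording of the message varies a little between Ruby versions, so the sketch prints it instead of assuming it.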

I solved this by using the dup method, which makes an unfrozen copy of an object. So my final if/else statement looked like this:

m = /\.\w+\z/ =~ input_file
if m
  output_file = input_file.dup
  output_file = output_file.insert m, "_counts"
else
  output_file = input_file + "_counts"
end

I have a feeling there's probably a more elegant way to handle that, but that works.
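One way to check that it works is to wrap the same logic in a method and feed it frozen strings. This is only a sketch: the method name counts_filename and the sample names are hypothetical, and the strings are frozen by hand to stand in for command-line arguments.

```ruby
# Hypothetical method wrapping the dup-based logic above.
def counts_filename(input_file)
  m = /\.\w+\z/ =~ input_file
  if m
    # dup returns an unfrozen copy, so insert can modify it safely.
    input_file.dup.insert(m, "_counts")
  else
    # + already returns a new string, so no copy is needed here.
    input_file + "_counts"
  end
end

puts counts_filename("myfile.csv".freeze)              # myfile_counts.csv
puts counts_filename("my.stupid.csv.file.csv".freeze)  # my.stupid.csv.file_counts.csv
puts counts_filename("myfile".freeze)                  # myfile_counts
```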

Actually, as I was checking in my most recent changes, it occurred to me that Ruby probably has a class with built-in methods for doing things like handling file name extensions. But reinventing the occasional wheel is educational.
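For comparison, here is a rough sketch of the same job done with File.extname from Ruby's standard File class. The method name counts_name_with_extname is hypothetical, and this is one possible arrangement, not what the script actually uses.

```ruby
# Hypothetical method name; File.extname is part of Ruby's core File class.
def counts_name_with_extname(input_file)
  # extname returns the extension including the period (".csv"),
  # or an empty string when there is none.
  ext = File.extname(input_file)

  # Everything before the extension. Slicing returns a new string,
  # so a frozen input is never modified.
  stem = input_file[0, input_file.length - ext.length]

  "#{stem}_counts#{ext}"
end

puts counts_name_with_extname("myfile.csv")   # myfile_counts.csv
puts counts_name_with_extname("myfile")       # myfile_counts
puts counts_name_with_extname(".bashrc")      # .bashrc_counts
```

One difference from the regular expression: File.extname treats a leading period as part of the name, so a dotfile like ".bashrc" counts as having no extension.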
3rd-Oct-2014 08:50 pm (UTC)
Yeah, I just totally reinvented File.extname. I'm such a dweeb.
3rd-Oct-2014 09:59 pm (UTC)
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --jwz

But seriously, it was a valuable exercise both to learn how to do this from scratch and to discover that there was a better way. Good on you.
3rd-Oct-2014 10:59 pm (UTC)
You know, I've heard that "two problems" quote any number of times, but this is the first time that I've connected it with "The guy who emails me a 'guy walks into a bar' joke every Tuesday because I'm on the DNA Lounge mailing list." It's a small world!