I'm working on a project where we're trying to get a bunch of data that has been kept on internal wiki pages into a database, so that it can be searched, duplicates can be detected automatically, and various other stuff.
Part of my contribution to this effort is to get the data off these wiki pages and into CSV files that can be imported into the database. It's a pretty trivial effort if you've got the Ruby gem Nokogiri (which parses HTML and XML files).
Well, it's sort of trivial. So far, about 20% of my time has been spent writing the part of the script that does the real work, and 80% has been spent dealing with oddities caused by unexpected white space, white space that Ruby does not recognize as white space by default (the non-breaking space, `&nbsp;`), and quirks of people's wiki markup.
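For example, `String#strip` only removes ASCII whitespace, so the non-breaking spaces (U+00A0) that HTML pages are full of sail right through it. One way to deal with that is to normalize all Unicode whitespace before stripping:

```ruby
# A cell value padded with non-breaking spaces, as scraped HTML
# often contains.
s = "\u00A0 padded value \u00A0"

# strip doesn't touch the U+00A0 characters at the ends.
s.strip  # => "\u00A0 padded value \u00A0"

# The POSIX class [[:space:]] is Unicode-aware in Ruby regexes,
# so it catches U+00A0 too: collapse all whitespace runs to a
# single ASCII space, then strip.
clean = s.gsub(/[[:space:]]+/, ' ').strip
# clean == "padded value"
```

This won't fix everyone's creative wiki markup, but it knocks out the whitespace class of oddities in one line.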
My guess is that this is probably par for the course when web scraping.
Also, I wrote documentation for my homebrew hacky script that probably only two other people besides me are ever gonna use, because that's how I roll.
When I'm done with this project, I'm considering switching from Ruby to Python. I like working in Ruby, but Python is quite literally what all the cool kids are using, since it seems to be the current language of choice for teaching children to program.