Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go, so I’m all ears for more efficient solutions. Please let me know in the comments!
I’ve been fortunate to have spent the last year and a half with classmates who have engineering and programming backgrounds. They taught me to persist when programming. If something doesn’t work, try again. Using console.log, Googling, searching Stack Overflow, and talking through the syntax out loud are all things I do on a daily basis these days.
But when do you stop?
Scraping One Year of the Pictures of the Year Website
Looks simple enough, right?
I needed the following from the 62nd POYi competition (2004) to fill in a massive hole in my data:
- Photographer names
- Photograph titles
I have experience with CSS and HTML so I was thinking it couldn’t be too hard. After digging around for a way to use R, I landed on rvest and was thrilled. This should be easy and fast.
Boy was I wrong.
Time to Stop and Go Manual
Ta-da! Nested tables. Remember when we used <table> for layout back in the early aughts? 😱 In this case, the entire website was structured with table after table. In addition, many of the tables and rows were given the same class names! So much for targeting selectors.
Of course, this didn’t dawn on me right away. I stumbled for quite a while trying to make various scraping methods work. Surely there was a way to work around the nested tables? Well, after several tries over too many hours, I ended up with itty bitty bits of partial data, duplicate data, or both. It hit me that manually copy-pasting the information I needed was going to be faster than merging, deleting and joining partial bits of potentially hundreds of CSV files that were as messy as the nested tables.
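For anyone facing the same kind of markup, one possible workaround is to ignore class names entirely and keep only rows from the innermost tables, meaning tables that contain no other table. The sketch below is in Python with the standard library's html.parser rather than R and rvest, it was not run against the real POYi pages, and the sample markup, class names, photographer names and titles are all made up for illustration. It also assumes the tags are properly closed, which early-2000s HTML often does not guarantee.

```python
from html.parser import HTMLParser


class LeafTableParser(HTMLParser):
    """Collect rows only from tables that contain no nested <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one state dict per currently open <table>
        self.rows = []   # rows gathered from innermost ("leaf") tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                # The enclosing table is a layout wrapper, not a data table.
                self.stack[-1]["has_child"] = True
            self.stack.append(
                {"has_child": False, "rows": [], "row": None, "cell": None}
            )
        elif self.stack:
            table = self.stack[-1]
            if tag == "tr":
                table["row"] = []
            elif tag in ("td", "th") and table["row"] is not None:
                table["cell"] = []

    def handle_data(self, data):
        # Text belongs to the open cell of the innermost open table.
        if self.stack and self.stack[-1]["cell"] is not None:
            self.stack[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self.stack:
            return
        table = self.stack[-1]
        if tag in ("td", "th") and table["cell"] is not None:
            # Collapse runs of whitespace left over from the page layout.
            table["row"].append(" ".join("".join(table["cell"]).split()))
            table["cell"] = None
        elif tag == "tr" and table["row"] is not None:
            if any(table["row"]):  # skip rows whose cells are all empty
                table["rows"].append(table["row"])
            table["row"] = None
        elif tag == "table":
            self.stack.pop()
            if not table["has_child"]:
                self.rows.extend(table["rows"])


# Hypothetical markup: every table and cell shares the same class name.
sample = """
<table class="layout"><tr><td class="cell">
  <table class="layout">
    <tr><td class="cell">Jane Doe</td><td class="cell">Morning Light</td></tr>
    <tr><td class="cell">John Roe</td><td class="cell">After the Storm</td></tr>
  </table>
</td></tr></table>
"""

parser = LeafTableParser()
parser.feed(sample)
for photographer, title in parser.rows:
    print(photographer, "|", title)
```

Keying on nesting depth instead of selectors sidesteps the duplicate class names, though it would still need adjusting if the real pages put the names and titles at different depths.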
So, in between a re-write of a conference paper (in less than a day), tweaking prototypes for an empirical study, classes, and other design work, I set about going through every link on this page to capture the data I was missing in my master sheet. I think the picture stories were the most tedious since the pagination dots were so small. Even though I had created a pretty sweet copy-paste flow to minimize mistakes, they still happened.
Plus, I ran into the same issue I had when migrating the image file paths to the master sheet. The rows in the master were in a different sort order than the order of the winners’ list and the galleries contained within. So, what’s a girl to do?
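A sort-order mismatch like this is usually easier to handle by matching rows on a shared key than by position. Here is a minimal Python sketch of that idea. The column names (category, award, photographer, title) and the sample rows are assumptions for illustration, not the actual layout of the master sheet.

```python
def normalize(text):
    """Lowercase and collapse whitespace so small copy-paste differences still match."""
    return " ".join(text.lower().split())


def fill_by_key(master_rows, scraped_rows):
    """Fill photographer and title in master rows by key, ignoring row order.

    Returns the filled rows (in master order) plus the master rows that
    had no match, so those can be checked by hand.
    """
    lookup = {}
    for row in scraped_rows:
        key = (normalize(row["category"]), normalize(row["award"]))
        lookup[key] = row

    filled, unmatched = [], []
    for row in master_rows:
        key = (normalize(row["category"]), normalize(row["award"]))
        match = lookup.get(key)
        if match is None:
            unmatched.append(row)
            filled.append(dict(row))
        else:
            filled.append(
                {**row, "photographer": match["photographer"], "title": match["title"]}
            )
    return filled, unmatched


# Hypothetical rows: the master and the collected data are sorted differently.
master = [
    {"category": "Spot News", "award": "First Place", "photographer": "", "title": ""},
    {"category": "Portrait", "award": "First Place", "photographer": "", "title": ""},
    {"category": "Sports Action", "award": "Second Place", "photographer": "", "title": ""},
]
scraped = [
    {"category": "portrait ", "award": "First  Place",
     "photographer": "Jane Doe", "title": "Morning Light"},
    {"category": "Spot News", "award": "First Place",
     "photographer": "John Roe", "title": "After the Storm"},
]

filled, unmatched = fill_by_key(master, scraped)
for row in filled:
    print(row)
print("needs a manual look:", unmatched)
```

The unmatched list doubles as a cross-check: anything in it either has a typo in its key or is genuinely missing from the collected data.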
Triple-check, Double-check, Cross-check!
The downside of the manual approach is there is plenty of room for error. I pasted into the wrong row or the wrong column a few times. Plus, the master was not labeled as clearly as the sheets of data I had been copying and pasting into. Some sections were easier than others but wow, it was a maze. The bright side? I only had to do this for one year. Rather, I gave myself that constraint since I’m feeling incredibly behind on my schedule.
Getting to this point was a struggle. My first lesson, only a little over a year ago, was that 80% of the work is in finding, cleaning and analyzing the data. I keep reminding myself of this when I’m feeling like I haven’t moved the dial forward.
So, one big gaping hole in my data is no more. BUT, there are still holes, so I need to figure out a way to manage them. I think the next best thing is to create some pivot tables to get a better look. The goal is to see if I can scale down the scope as a way to deal with the gaps. 🤔
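The same kind of look can be sketched outside a spreadsheet. The Python snippet below counts blank values per field, grouped by year, which is roughly what a pivot table of gaps would show. The field names and sample rows are hypothetical and are not taken from the master sheet.

```python
from collections import defaultdict


def missing_counts(rows, group_field, fields):
    """Count blank values for each field, grouped by group_field."""
    counts = defaultdict(lambda: {field: 0 for field in fields})
    for row in rows:
        bucket = counts[row[group_field]]  # groups with no gaps still appear
        for field in fields:
            if not (row.get(field) or "").strip():
                bucket[field] += 1
    return {group: dict(per_field) for group, per_field in counts.items()}


# Hypothetical rows with a few gaps.
rows = [
    {"year": "2003", "photographer": "Jane Doe", "title": "Morning Light"},
    {"year": "2004", "photographer": "", "title": "After the Storm"},
    {"year": "2004", "photographer": "John Roe", "title": " "},
    {"year": "2004", "photographer": "", "title": ""},
]

gaps = missing_counts(rows, "year", ["photographer", "title"])
for year in sorted(gaps):
    print(year, gaps[year])
```

Years or fields with large counts are the natural candidates to drop when scaling down the scope.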