Learning When to Stop

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

I've been fortunate to have spent the last year and a half with classmates who have engineering and programming backgrounds. They taught me to persist when programming: if something doesn't work, try again. Console.log, Google, Stack Overflow, and talking the syntax out loud are all tactics I practice on a daily basis these days.

But when do you stop?

Scraping One Year of the Pictures of the Year Website

Looks simple enough, right?

I needed the following from the 62nd POYi competition (2004) to fill in a massive hole in my data:

  • Division
  • Category
  • Award
  • Photographer names
  • Publication
  • Title of the Photograph
  • Caption

I have experience with CSS and HTML, so I figured it couldn't be too hard. After digging around for a way to use R, I landed on rvest and was thrilled. This should be easy and fast.

Boy was I wrong.

Time to Stop and Go Manual

Ta-da! Nested tables. Remember when we used <table> for layout back in the early aughts? 😱 In this case, the entire website was structured with table after table. On top of that, many of the tables and rows were given the same class names! So much for targeting selectors.

Of course, this didn't dawn on me right away. I stumbled for quite a while trying to make various scraping methods work. Surely there was a way to work around the nested tables? Well, after several tries over too many hours, I ended up with itty bitty bits of partial and/or duplicate data. It hit me that manually copy-pasting the information I needed would be more expedient than merging, deleting, and joining partial bits of potentially hundreds of CSV files that were as messy as the nested tables.
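
For context, my attempts looked roughly like this (a minimal rvest sketch; the URL and selectors are placeholders, not the site's actual structure):

library(rvest)

#hypothetical winners page, not the real POYi URL
page <- read_html("https://example.com/poyi/62/winners.html")

#with layout tables nested inside layout tables, and rows sharing the
#same class names, a selector like this matches far more rows than
#intended, so the results come back partial and duplicated
rows <- html_elements(page, "table table tr")
html_text2(html_elements(rows, "td"))

With unique classes or IDs this approach works beautifully; here, every selector I tried pulled in the layout tables right along with the data.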

So, in between rewriting a conference paper (in less than a day), tweaking prototypes for an empirical study, classes, and other design work, I set about going through every link on this page to capture the data I was missing in my master sheet. I think the picture stories were the most tedious since the pagination dots were so small, and even though I had created a pretty sweet copy-paste flow to minimize mistakes, they still happened.

Plus, I ran into the same issue I had when migrating the image file paths to the master sheet: the rows in the master were sorted differently from the winners' list and the galleries within it. So, what's a girl to do?

Matching the unique ID of images with the actual image, photographer, award, and caption.

Triple-check, Double-check, Cross-check!

The downside of the manual approach is that there is plenty of room for error. I pasted into the wrong row or the wrong column a few times. Plus, the master was not labeled as clearly as the sheets I had built while copying and pasting my data. Some sections were easier than others but wow, it was a maze. The bright side? I only had to do this for one year. Rather, I gave myself that constraint since I'm feeling incredibly behind on my schedule.

Getting to this point was a struggle. My first lesson, only a little over a year ago, was that 80% of the work is in finding, cleaning, and analyzing the data. I keep reminding myself of this when I feel like I haven't moved the dial forward.

So, the big gaping hole in my data is no longer. BUT, there are still smaller holes, so I need to figure out a way to manage them. I think the next best step is to create some pivot tables to get a better look. The goal is to see if I can scale down the scope as a way to deal with the gaps. 🤔

Match Image File Paths to Image IDs Across Excel Sheets

Image paths – Check.

The realization that my master Excel workbook (merged from multiple workbooks with multiple sheets) had a different sort order from the workbook with my image paths? Roadblock.

What I thought would be a simple copy of entire column A from workbook 1 (all image file paths) into column B of workbook 2 (the empty column where the paths should go) soon became a hot mess.

Because the order of the image IDs in my master was not consistent with the order of the image paths I had extracted, I realized I had to manually match each path to the row with its image ID in the master, then copy-paste. What makes this dataset so dirty is that the file naming conventions changed periodically, so image names might be consistent for a year or a few and then suddenly be modified.

After several hours, it hit me that I should try doing this with a function, either in R or Excel. I consulted The Google for ways to match an ID with a file path across multiple workbooks, tables, or sheets. I wasn't sure what I was looking for but finally came upon two options that seemed worth the effort to learn.

=VLOOKUP or =INDEX + MATCH?

Testing a small sample set with some of the IDs and file paths was the only way to go: just. try. something. I think R would have been the best route IF the dataframes had already existed in R, but I wasn't ready to commit because I still had loads of data to merge.
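
(For reference, had the data already lived in R, the matching might have looked something like the sketch below. The names are made up: a master data frame with an image_id column and a paths data frame with a file_path column.)

#minimal sketch, assuming master$image_id and paths$file_path exist
match_path <- function(id, file_paths) {
  hits <- grep(id, file_paths, fixed = TRUE, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]  #NA when no path contains the ID
}
master$file_path <- vapply(master$image_id, match_path, character(1),
                           file_paths = paths$file_path)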

So I went with Excel functions and decided to try INDEX with MATCH or VLOOKUP. This website, ExcelJet, came to my rescue. Finally.

After several tries with INDEX and MATCH, I kept getting #NAME? or #N/A. The #NAME? error is easy to fix since it's an alert that your syntax needs help, but the #N/A was surprising since I knew there were matches. I believe the issue was that MATCH was looking for an exact string? Possibly. 🤷🏻‍♀️

Not having any luck with INDEX + MATCH, I decided to try VLOOKUP. Someone once told me that if you spend more than 20 minutes on a problem and get nowhere, you should a) take a break and come back to it, b) ask for help, or c) move on to something else. Twenty minutes seems too short, so I give myself an hour (ok, much more sometimes because I can be stubborn) and then switch gears.

After several attempts with the same results as INDEX + MATCH, it occurred to me that once again an exact match was required. Enter the * wildcard. It seems obvious to me now, but if you are like me and don't use wildcards in your everyday activities, it is hard to remember they exist. Am I right?

Enter ExcelJet’s Partial Match with VLOOKUP article.

VLOOKUP Across Multiple Sheets

Using a sample of my image file paths and the image IDs from the master workbook, I ran the function below and boom: happy dance.

=VLOOKUP("*" & A33681 & "*",Sheet2!A47:A62035,1,FALSE)

The formula wraps the image ID in wildcards and scans Sheet2's column A for any path containing that ID, returning the matching path (column 1); FALSE keeps VLOOKUP in exact-match mode, which is what allows the wildcards to work. I took that function to my master and moved slowly, because the wildcards might be a bit too loose and insert an incorrect path. Thankfully those instances were few, and by moving slowly down my sheet instead of filling the entire column with the function, I saved myself a future headache.

I wish I could say this was the last step in cleaning my data but alas, there is so much more to do. In fact, I’m missing quite a bit. Mild panic.

Use R to Extract Thousands of Image File Paths and Write to CSV

After spending way too much time going through the process I outlined in my previous post about adding links and image paths to an Excel file, I decided to figure out another, possibly more efficient solution with either R or Python. It was worth the effort, and it's amazing how far down the rabbit hole you can go testing and trying out different methods. I had to cut myself off – lol.

Every project means learning something new, even if it seems basic, but since I've been learning R entirely on my own using the big bad Google and a few books, I'm trying not to beat myself up for essentially learning to ride a bike for the first time. So, after a few attempts to figure out a Python solution for extracting links to thousands of images in folders and subfolders, I decided to try R and was finally successful.

After some digging, I stumbled upon this post about working with files and folders in R. It was exactly what I needed to understand how to eventually extract the file paths to every single image on my external hard drive.

#list the files in the current working directory
list.files()

#get the full name (path and file name) of each file
list.files(full.names = TRUE)

#list subfolders and the files within the subfolders
list.files(recursive = TRUE)

I could finally make a visual connection between what was printed in R and what was on my hard drive – yay.

Now, before I go any further I should explain why I need the paths to the image files. I’m hoping to use PixPlot by the Yale Digital Humanities Lab and/or TensorFlow to analyze and organize images. Using these tools will be my first attempt and if the wheels come off, I’ll have to figure out another solution.

OK, onward!

list.files("C:/path/to/somewhere/else", full.names = TRUE, recursive = TRUE)

Using the line of code above for reference, I printed the full path to each file, including the file name (e.g. 01.jpg), and the recursive argument told R to look inside every subfolder (I knew there were many).

Then, I assigned the results to a variable I created, poy_img_paths, using the assignment operator. It looks like this:

poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)

What I love about R, and RStudio to be exact, is that you can view the data within R, so I could see exactly what I was getting.
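
A couple of quick checks go a long way here:

length(poy_img_paths)  #how many image files were found
head(poy_img_paths)    #peek at the first few paths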

Now that I had my list of paths to the images, I wanted to create a dataframe and then export it to an Excel file.

#one-column data frame; stringsAsFactors = FALSE keeps the paths as plain strings
poyImagePaths.data <- data.frame(poy_img_paths, stringsAsFactors = FALSE)

Then, after trying various Excel or xlsx packages, I ran into a wall. Turns out some of them require Java and others just plumb wouldn't work. I got loads of errors – oh yeah. I still can't find a solution for the errors, so I went another route, but you can see the repeated, ugly results here.

I then tried readr's Excel-friendly CSV writers:

library(readr)
write_excel_csv(poyImagePaths.data, "poyImgPaths.xlsx")
write_excel_csv2(poyImagePaths.data, "poyImgPaths.xlsx")

No luck, and at the time I wasn't 100 percent sure why. Looking back, I suspect part of the problem is that write_excel_csv still writes CSV content (just encoded so Excel reads it cleanly), so giving the file an .xlsx extension produces something Excel flags as corrupted. And the true xlsx packages have dependencies like Java that were not working for me.

So, what do you do? Move on to the alternative: export a plain CSV with a .csv extension.

write_csv(poyImagePaths.data, "poyImgPaths.csv")

Boom.

I have no idea if I need active links to the images but I’ll soon find out!

Right now, I have more cleaning to do. With this data, every day is full of surprises.

How to Add Links to Images to an Excel Sheet

This is one of the search phrases I used to figure out an efficient way to add the image paths from the thousands of images I have from the POYi archives to an Excel sheet. I searched for any combination of the following:

  • “hyperlink to photographs”
  • “URL”
  • “batch processing”
  • “image file paths”

It is always hard and slightly frustrating when you are learning how to do something and have no clue what keywords to search for, because you don't know what you don't know. Then there is the uncertainty that what you have found will even work. But hey, this is learning, and if you find this post, I hope I can save you some pain.

The disclaimer is that there might be a much faster way, but I have yet to discover it. If you are OK with that, keep on reading, and if I do find a more efficient way of handling this problem, I'll update this post. *Pinky promise*

HYPERLINK(link_location, [friendly_name])

But first, I want to share that I did try the =HYPERLINK function in Excel, but after several attempts I finally found a post/tutorial claiming the function only works with URLs. I took that to mean the images needed to be on a server. Yet according to the Microsoft Office documentation, I should have been able to use =HYPERLINK to access the image files in a folder on my desktop. So I'm puzzled why it didn't work for me, and rather than spend another hour trying to sort that out, I went back to searching and found an interim solution courtesy of James Cone in the Microsoft Office Dev Center forums.

There’s more than one way to skin a [you fill it in because I can’t even write it].

Make sure you have your images organized

I'm using an external SSD (solid-state drive) to store the images for my project. That may change, but for now I wanted to keep the image paths as clean as possible. I still need to run one more test before I commit to keeping the images on the external drive, but so far my current solution feels right. Dropbox is such a processing hog, and the way it handles files and URLs… I don't trust it. There could be something to their API, though, and I need to look into that as a possibility for serving up images. In the meantime, on to the next step.

Copy the absolute file path to the folder of images

I didn't know how to do this in Catalina. In previous Mac operating systems, you could click on the folder, use Command-I or "Get Info," and see the absolute path under "Where." I'm not sure when it changed, but now, to get the absolute path, you need to do some trickery.

First, make sure "Show Path Bar" is selected under "View" in the Finder menu bar.

Then, with the folder containing your images open, look for the path at the bottom and Command-click the folder name. You should get a context menu. Select "Copy [the name of your folder] as Pathname."

Open a new window in Firefox

Paste the path you copied into the new window's address bar and hit return or refresh the page. You should see a list of your image file names along with the size and last-modified information.

Copy the links from Firefox to Excel

You should get something like the image above. The first link in the image is different from the rest because I changed its display format; I don't need a "friendly" display.

Click on one of the links to double-check

It’s good practice to make sure the links and paths are mapping to the correct file. So, when you click on one of the links, the image should open up in your default preview application.

So, that’s it!

Now, I do realize there might be a more efficient way. I'm definitely going to look into another Excel function, try the =HYPERLINK function again, or use a Python script or even R to handle this. But since that would take longer to learn, I opted for a solution that works in the interim so I can test and learn how to execute a process for image plotting in Python.
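
If the R route pans out, I imagine it would look something like the sketch below. The folder path is a placeholder, and whether Excel treats the cells as live formulas when opening the CSV depends on your version and settings, so consider this an idea rather than a recipe.

library(readr)

#placeholder folder; point this at your own image directory
img_paths <- list.files("/Volumes/MySSD/poyi-images",
                        full.names = TRUE, recursive = TRUE)

#build one =HYPERLINK() formula per image, using the file name as display text
links <- sprintf('=HYPERLINK("%s","%s")', img_paths, basename(img_paths))

write_csv(data.frame(link = links), "image_links.csv")

When Excel imports the CSV, cells beginning with = are typically interpreted as formulas, which is what would make the links clickable.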

Learning How to Wrangle Data with R

From Wikipedia:

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.[1]

Now that I have as much of the POYi data as I believe exists, I’ve been in the process of learning how to use R to wrangle data and learning about possible ways to create a database. A few people have recommended a database but I’m not sure I really need one, yet.

One great source I've discovered is this post by Sharon Musings titled Great R packages for data import, wrangling and visualization. So, this is where I've started. If I run into a roadblock there, I'll see about a database. Either path is rich with learning, but at least with R I have some familiarity with dplyr, tidyr, magrittr, ggplot2, and others in the list because of my last big data viz project about Anne Sexton.
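
To give a flavor of the kind of wrangling these packages enable, here is a toy sketch (the columns and values are made up, not my real data):

library(dplyr)

#toy data standing in for the real archive
poy <- data.frame(
  division = c("Newspaper", "Magazine", "Newspaper"),
  award    = c("1st", "First Place", "2nd")
)

#standardize inconsistent award labels, then count winners per division
poy %>%
  mutate(award = recode(award, "First Place" = "1st")) %>%
  count(division, award)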

Here I go!