Use R to Extract Thousands of Image File Paths and Write to CSV

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

After spending way too much time going through the process I outlined in my previous post about adding links and image paths to an Excel file, I decided to figure out another, possibly more efficient solution with either R or Python. It was worth the effort, and it’s amazing how far down the rabbit hole you can go testing and trying out different methods. I had to cut myself off – lol.

Every project means learning something new, even if it seems basic. Since I’ve been learning R entirely on my own using the big bad Google and a few books, I’m trying not to beat myself up for learning how to ride my bike for the first time. So, after a few attempts to figure out a solution for extracting links to thousands of images in folders and subfolders with Python, I decided to try R and was finally successful.

After some digging, I stumbled upon this post about working with files and folders in R. It was exactly what I needed to understand how to eventually extract the file paths to every single image on my external hard drive.

#list the files in the working directory
list.files()

#get the full name (path and file name) of each file
list.files(full.names = TRUE)

#list subfolders and the files within the subfolders
list.files(recursive = TRUE)
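One extra trick I’ll note here, in case it helps someone: list.files() also takes a pattern argument (a regular expression), so you can limit the listing to just the image files. This is a sketch of my own, not something from the post I linked; the extension pattern is only an example.

```r
# Hypothetical sketch: list only .jpg/.jpeg files (any capitalization),
# descending into every subfolder of the working directory.
jpgs <- list.files(full.names = TRUE, recursive = TRUE,
                   pattern = "\\.jpe?g$", ignore.case = TRUE)
```

Handy if your hard drive also has sidecar files (.xmp, .txt, etc.) mixed in with the photographs.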

I could finally make a visual connection between what was printed in R and what was on my hard drive – yay.

Now, before I go any further I should explain why I need the paths to the image files. I’m hoping to use PixPlot by the Yale Digital Humanities Lab and/or TensorFlow to analyze and organize images. Using these tools will be my first attempt and if the wheels come off, I’ll have to figure out another solution.

OK, onward!

list.files("C:/path/to/somewhere/else", full.names = TRUE, recursive = TRUE)

Using the line of code above for reference, I printed the full path to the file, the name of the file (e.g. 01.jpg) and provided instructions to look inside every subfolder (I knew there were many).

Then, I fed the list of results into a variable I created, poy_img_paths, using the assignment operator. It looks like this:

poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)

What I love about R (and RStudio, to be exact) is that you can view the data within R, so I could see exactly what I was getting.

Now that I had my list of paths to the images, I wanted to create a data frame and then export it to an Excel file.

poyImagePaths.data <- data.frame(poy_img_paths, stringsAsFactors=FALSE)

Then, after trying various Excel or xls packages, I ran into a wall. Turns out some of them require Java and others just plumb wouldn’t work. I got loads of errors – oh yeah. I still can’t find a solution for the errors, so I went another route, but you can see the repeated, ugly results here:

I then tried:
write_excel_csv(poyImagePaths.data, "poyImgPaths.xlsx")
write_excel_csv2(poyImagePaths.data, "poyImgPaths.xlsx")

No luck, and I’m still not 100 percent sure why, but clearly there are dependencies like Java that aren’t working, and I got some errors about corrupted files.

So, what do you do? Move on to the alternative: export a CSV.
write_csv(poyImagePaths.data, "poyImgPaths.csv")
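For reference, here’s the whole workflow consolidated into a few lines. This is my own recap rather than an exact transcript of what I ran, and it assumes the readr package is installed. (One thing I learned afterward: readr’s write_excel_csv() writes a CSV with a UTF-8 byte-order mark so Excel opens it cleanly – it does not produce a real .xlsx file, which may explain those corrupted-file errors when saving with an .xlsx extension.)

```r
library(readr)

# 1. Collect the full path to every file under the working directory,
#    descending into every subfolder.
poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)

# 2. Wrap the character vector in a one-column data frame.
poyImagePaths.data <- data.frame(poy_img_paths, stringsAsFactors = FALSE)

# 3. Write it out as a plain CSV.
write_csv(poyImagePaths.data, "poyImgPaths.csv")
```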

Boom.

I have no idea if I need active links to the images but I’ll soon find out!

Right now, I have more cleaning to do. With this data, every day is full of surprises.

How to Add Links to Images to an Excel Sheet

This is one of the search phrases I used to figure out an efficient way to add the image paths from the thousands of images I have from the POYi archives to an Excel sheet. I searched for any combination of the following:

  • “hyperlink to photographs”
  • “URL”
  • “batch processing”
  • “image file paths”

It is always hard and slightly frustrating when you are learning how to do something and you have no clue what keywords to search for in your quest because you don’t know what you don’t know. Then, there is the uncertainty that what you have found will even work. But, hey, this is learning and if you find this post, I hope I can save you some pain.

The disclaimer is that there might be a much faster way but I have yet to discover it. If you are OK with that, keep on reading and if I do find a more efficient way of handling this problem, I’ll update this post. *Pinky promise*

HYPERLINK(link_location, [friendly_name])

But first, I want to share that I did try using the =HYPERLINK function in Excel, but after several attempts, I finally found a post/tutorial that informed me this function only works with URLs. I took that to mean the images need to be on a server. But, according to the Microsoft Office documentation, I should have been able to use =HYPERLINK to access the image files in my folder on my desktop. So, I’m puzzled why it didn’t work for me and, rather than spend another hour trying to sort that out, I went back to searching and found an interim solution courtesy of James Cone in the Microsoft Office Dev Center forums.
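As a related sketch (my own idea, not from the tutorial I found): since I already have the file paths in R, I could generate the =HYPERLINK() formulas there and paste the resulting column into Excel. The "images" folder and output file name below are hypothetical.

```r
# Hypothetical sketch: build Excel =HYPERLINK() formulas for a folder
# of local image paths, one formula per image.
img_paths <- list.files("images", full.names = TRUE, recursive = TRUE)

# Each cell becomes e.g. =HYPERLINK("images/01.jpg", "01.jpg")
formulas <- sprintf('=HYPERLINK("%s", "%s")', img_paths, basename(img_paths))

# Write one formula per line, ready to paste into a column.
writeLines(formulas, "hyperlink_formulas.txt")
```

No promises this sidesteps whatever made =HYPERLINK fail for me with local files, but it saves typing the formulas by hand.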

There’s more than one way to skin a [you fill it in because I can’t even write it].

Make sure you have your images organized

I’m using an external SSD (Solid State Drive) to store my images for my project and this may change, but for now, I wanted to keep the image paths as clean as possible. I still need to do one more test before I commit to keeping the images on the external hard drive, and so far, my current solution feels right. Dropbox is such a processing hog and the way it handles files and URLs…I don’t trust it. Now, there could be something to their API and I need to look into that as a possibility for serving up images. In the meantime, on to the next step.

Copy the absolute file path to the folder of images

I didn’t know how to do this in Catalina. In previous Mac operating systems you could click on the folder and use command-i or “get info” and see the absolute path in “where”. I’m not sure when it changed but now in order to get the absolute path, you need to do some trickery.

First, make sure you have “Show path bar” selected under “View” in the OS menu bar.

Then, with the folder containing your images open, look for the path at the bottom and command-click the folder name. You should get a context menu. Select “Copy [the name of your folder] as Pathname”.

Open a new window in Firefox

Paste the path you copied into the new window’s address bar and hit return or refresh the page. You should see a list of your image file names along with the size, and last modified information.

Copy the links from Firefox to Excel

You should get something like the image above and the first link in the image is different from the rest because I changed the display format of that first link. I don’t need a “friendly” display.

Click on one of the links to double-check

It’s good practice to make sure the links and paths are mapping to the correct file. So, when you click on one of the links, the image should open up in your default preview application.

So, that’s it!

Now, I do realize there might be a more efficient way. I’m definitely going to look into another Excel function, try the =HYPERLINK function again, or use a Python script or even R to handle this. But since that would take longer for me to learn, I opted for a solution that worked in the interim so I can test and learn how to execute a process for image plotting in Python.

Learning How to Wrangle Data with R

From Wikipedia:

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.[1]

Now that I have as much of the POYi data as I believe exists, I’ve been in the process of learning how to use R to wrangle data and learning about possible ways to create a database. A few people have recommended a database but I’m not sure I really need one, yet.

One great source I’ve discovered is this post by Sharon Machlis titled, Great R packages for data import, wrangling and visualization. So, this is where I’ve started. If I run into a roadblock there, I’ll see about a database. Either path is rich with learning, but at least with R I have some familiarity with dplyr, tidyr, magrittr, ggplot2 and others in the list because of my last big data viz project about Anne Sexton.

Here I go!

Capstone: Visualizing the Pictures of the Year International (POYi) archives

It’s the beginning of my last semester at UM. I can hardly believe it. If all goes well (and why shouldn’t it?) I’ll graduate this May with an MFA in Interactive Media. Cool.

Last week, I officially proposed my capstone project. Working title: Reframing Photographs: Visualizing 75 years of the Pictures of the Year International archives. The goal is to explore and understand the archives. Then, create an interactive experience (most likely a website) that presents my findings and create a tool that allows people to explore the data on their own. Both will be true feats that I will pull off.

Need to Learn

To get started, I’ve created a rough project schedule and scheduled a meeting with our local data librarian tomorrow. I’m hoping he’ll be able to give me some insight on what options there might be for building the exploratory tool. My guess is that this will be a test of my JavaScript skills. This semester I’m also learning more D3.js and I’m taking a WebGIS course. Most likely I’ll be using the skills I learn in both those classes, as well as what I learn on my own, for this project. R will definitely come in handy as well.

The data

What makes this data interesting so far is that it mixes images and text, black-and-white and color photographs, and that in some places data is missing. I’m still working my way through it, but some years have more holes than others. Figuring out how to present the missing data will be interesting.

Questions

These are some of the questions I’ve come up with and I got some criticism that these questions are too obvious. I’m not sure I agree with that but it is good to consider.

Next Steps

I’m still cleaning up the data but soon I’ll start analyzing the data using R and sketching using Flourish, RawGraphs, Data Illustrator or Illustrator. Wish me luck!