Use R to Extract Thousands of Image File Paths and Write to CSV

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

After spending way too much time on the process I outlined in my previous post about adding links and image paths to an Excel file, I decided to figure out another, possibly more efficient solution in either R or Python. It was worth the effort, and it’s amazing how far down the rabbit hole you can go testing and trying out different methods. I had to cut myself off – lol.

Every project means learning something new, even if it seems basic. Since I’ve been learning R entirely on my own using the big bad Google and a few books, I’m trying not to beat myself up for learning how to ride my bike for the first time. So, after a few attempts at figuring out a solution for extracting links to thousands of images in folders and subfolders with Python, I decided to try R and was finally successful.

After some digging, I stumbled upon this post about working with files and folders in R. It was exactly what I needed to understand how to eventually extract the file paths to every single image on my external hard drive.


#get the full name (path and file name) of each file
list.files(full.names = TRUE)

#list of subfolders and files within the subfolders
list.files(recursive = TRUE)

I could finally make a visual connection between what was printed in R and what was on my hard drive – yay.

Now, before I go any further I should explain why I need the paths to the image files. I’m hoping to use PixPlot by the Yale Digital Humanities Lab and/or TensorFlow to analyze and organize images. Using these tools will be my first attempt and if the wheels come off, I’ll have to figure out another solution.

OK, onward!

list.files("C:/path/to/somewhere/else", full.names = TRUE, recursive = TRUE)

Using the line of code above for reference: it prints the full path to each file, including the file name (e.g. 01.jpg), and tells R to look inside every subfolder (I knew there were many).
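To make the behavior concrete, here’s a tiny self-contained sketch – the folder and file names are made up just for the demo:

```r
# Build a small throwaway folder tree in the temp directory
root <- file.path(tempdir(), "poy_demo")
dir.create(file.path(root, "1944"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "1944", "01.jpg"))

# recursive = TRUE descends into every subfolder;
# without full.names you get paths relative to root
list.files(root, recursive = TRUE)                      # "1944/01.jpg"

# full.names = TRUE prepends the folder you started from
list.files(root, full.names = TRUE, recursive = TRUE)
```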

Then, I fed the list of results into a variable I created, poy_img_paths, using the assignment operator. It looks like this:

poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)

What I love about R, and RStudio to be exact, is that you can view the data within R, so I could see exactly what I was getting.

Now that I had my list of paths to the images, I wanted to create a data frame and then export it to an Excel file:

poy_df <- data.frame(poy_img_paths, stringsAsFactors = FALSE)

Then, after trying various Excel or xlsx packages, I ran into a wall. It turns out some of them require Java and others just plumb wouldn’t work. I got loads of errors – oh yeah. I still can’t find a solution for the errors, so I went another route, but you can see the repeated, ugly results here:

I then tried:
write_excel_csv(poy_df, "poyImgPaths.xlsx")
write_excel_csv2(poy_df, "poyImgPaths.xlsx")

No luck, and I’m still not 100 percent sure why, but clearly there are dependencies like Java that weren’t working, and I got some errors about corrupted files. (In hindsight, write_excel_csv actually writes CSV text, so saving it with an .xlsx extension may be what made Excel complain about corruption.)

So, what do you do? Move on to the alternative: export a CSV.
write_csv(poy_df, "poyImgPaths.csv")
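Putting the whole pipeline together, this is roughly what the final script looks like. The drive path and the poy_df name here are placeholders – substitute your own:

```r
library(readr)  # provides write_csv()

# Full path to every file under the drive, descending into all subfolders
poy_img_paths <- list.files("E:/POYi", full.names = TRUE, recursive = TRUE)

# Wrap the character vector in a one-column data frame
poy_df <- data.frame(poy_img_paths, stringsAsFactors = FALSE)

# Export to CSV; no Java or Excel dependencies required
write_csv(poy_df, "poyImgPaths.csv")
# (base R's write.csv(poy_df, "poyImgPaths.csv", row.names = FALSE) works too)
```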


I have no idea if I need active links to the images but I’ll soon find out!

Right now, I have more cleaning to do. With this data, every day is full of surprises.

How to Add Links to Images to an Excel Sheet


This is one of the search phrases I used to figure out an efficient way to add the image paths from the thousands of images I have from the POYi archives to an Excel sheet. I searched for any combination of the following:

  • “hyperlink to photographs”
  • “URL”
  • “batch processing”
  • “image file paths”

It is always hard and slightly frustrating when you are learning how to do something and have no clue what keywords to search for, because you don’t know what you don’t know. Then there is the uncertainty that what you have found will even work. But hey, this is learning, and if you find this post, I hope I can save you some pain.

The disclaimer is that there might be a much faster way but I have yet to discover it. If you are OK with that, keep on reading and if I do find a more efficient way of handling this problem, I’ll update this post. *Pinky promise*

HYPERLINK(link_location, [friendly_name])

But first, I want to share that I did try using the =HYPERLINK function in Excel. After several attempts, I finally found a post/tutorial claiming that this function only works with URLs, which I took to mean the images need to be on a server. But according to the Microsoft Office documentation, I should have been able to use =HYPERLINK to access the image files in a folder on my desktop. So I’m puzzled why it didn’t work for me, and rather than spend another hour sorting that out, I went back to searching and found an interim solution courtesy of James Cone in the Microsoft Office Dev Center forums.
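For reference, HYPERLINK takes a path plus an optional display label. If the =HYPERLINK route ever does pan out, one way to avoid typing thousands of formulas by hand would be to generate them as text in R and paste the column into Excel. This is just a sketch – the paths below are made up, and I haven’t verified the result against Excel:

```r
# Hypothetical image paths -- substitute the real ones from list.files()
img_paths <- c("/Volumes/SSD/POYi/1944/01.jpg",
               "/Volumes/SSD/POYi/1944/02.jpg")

# One =HYPERLINK formula per image, using the bare file name as the label
formulas <- sprintf('=HYPERLINK("%s", "%s")', img_paths, basename(img_paths))

writeLines(formulas)
# =HYPERLINK("/Volumes/SSD/POYi/1944/01.jpg", "01.jpg")
# =HYPERLINK("/Volumes/SSD/POYi/1944/02.jpg", "02.jpg")
```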

There’s more than one way to skin a [you fill it in because I can’t even write it].

Make sure you have your images organized

I’m using an external SSD (Solid State Drive) to store the images for my project. This may change, but for now I wanted to keep the image paths as clean as possible. I still need to do one more test before I commit to keeping the images on the external hard drive, and so far my current solution feels right. Dropbox is such a processing hog, and the way it handles files and URLs…I don’t trust it. Now, there could be something to their API, and I need to look into that as a possibility for serving up images. In the meantime, on to the next step.

Copy the absolute file path to the folder of images

I didn’t know how to do this in Catalina. In previous Mac operating systems you could click on the folder and use command-i or “get info” and see the absolute path in “where”. I’m not sure when it changed but now in order to get the absolute path, you need to do some trickery.

First, make sure you have “Show path bar” selected under “View” in the OS menu bar.

Then, with the folder containing your images open, look for the path at the bottom and command-click the folder name. You should get a context menu. Select “Copy [the name of your folder] as Pathname”.

Open a new window in Firefox

Paste the path you copied into the new window’s address bar and hit return or refresh the page. You should see a list of your image file names along with the size, and last modified information.

Copy the links from Firefox to Excel

You should get something like the image above. The first link in the image is different from the rest because I changed its display format – I don’t need a “friendly” display.

Click on one of the links to double-check

It’s good practice to make sure the links and paths are mapping to the correct file. So, when you click on one of the links, the image should open up in your default preview application.

So, that’s it!

Now, I do realize there might be a more efficient way. I’m definitely going to look into another Excel function, try the =HYPERLINK function again, or write a Python script, or even use R to handle this. But since that would take longer for me to learn, I opted for a solution that works for me in the interim, so I can test and learn how to execute a process for image plotting in Python.

Learning How to Wrangle Data with R


From Wikipedia:

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.[1]

Now that I have as much of the POYi data as I believe exists, I’ve been in the process of learning how to use R to wrangle data and learning about possible ways to create a database. A few people have recommended a database but I’m not sure I really need one, yet.

One great source I’ve discovered is this post by Sharon Machlis titled “Great R packages for data import, wrangling and visualization.” So, this is where I’ve started. If I run into a roadblock there, I’ll see about a database. Either path is rich with learning, but at least with R I have some familiarity with dplyr, tidyr, magrittr, ggplot2 and others in the list because of my last big data viz project about Anne Sexton.
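As a warm-up, here’s the kind of small wrangling step I have in mind, done in base R so it runs anywhere. The paths are toy examples, not the real archive structure:

```r
# A toy version of the image-paths data frame from the earlier posts
poy_df <- data.frame(
  path = c("POYi/1944/01.jpg", "POYi/1944/02.jpg", "POYi/1951/01.jpg"),
  stringsAsFactors = FALSE
)

# The folder one level up from each file name is the contest year
poy_df$year <- basename(dirname(poy_df$path))

# Count images per year -- the same idea as dplyr's count()
table(poy_df$year)
```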

Here I go!

Waiting for Data…


Between the holidays, an emergency appendectomy (yep, that happened and I’m still not 100%), weather, and various other acts of God, I’m trying to keep calm and carry on. It turns out that I’m missing a lot of data. So, I made the effort to acquire it but there’s no guarantee. So, I decided to focus on what I can do and can control while I wait.

“Think about the visual form”

This advice (below) comes from Nadieh Bremer in the interview “Forge a career in data viz with these pro tips and tricks,” published ahead of her generateJS Conference talk.

I like that she puts an emphasis on “getting the basic chart form right”. Even bar charts have a lot of variation and you really have to be careful. Alberto Cairo taught us about being responsible designers. I took that to heart and the last thing I ever want to do or be accused of is misleading people as that can have serious consequences.

Design Precedents

So, based on what I do know about the archive and the data so far, I looked for design precedents, works that focused on aspects I wanted to explore. Some work with photos, others with paintings or even the covers of magazines. All work with large datasets.

I’m still sorting out how to focus, given how many different directions I could go. It feels on many levels like a giant bunch of tangled yarn!

Selfie City

One person who is new to me and whose work I’ve come to admire is Moritz Stefaner. Below is Selfie City, an explanatory and exploratory visualization of selfies from five international cities. The “imageplots” are so much fun. From this screenshot (below), you can see the patterns and relationships between each city and within each city. Then, as you hover, you see the individual within the larger context. The visualization becomes more than a data point. (Source: Manovich)

Executive Colors

Luis Melgar, a graduate of the IMFA program (Go Canes!), did a fun personal exploration of the colors of presidential portraits.

175 Years of Scientific American’s Covers

And, this visualization by Jen Christiansen and Nicholas Rougeux on 175 years of Scientific American covers gives me ideas on how to possibly present the photographs in the archives. It’s accessible, fun, and engaging. But also very telling about technology and editorial direction over time. Yes, there are stories from magazine covers!

Understand the tools I need to learn

Based on the design precedents above, I know this project will force me to learn or reacquaint myself with new tools, including programming languages (eep!). My semester turned into a programming semester with a heavy emphasis on D3.js.

Python may be needed, as well as more R. And if I can acquire the missing data, I’ve been speaking with the data librarians and GIS specialists available to me at Richter Library about using ESRI’s ArcGIS to plot images on a map. Mapbox is another possible solution.

There are a lot of possibilities so I made an effort to research what I may need to learn to execute my ideas.

Capstone: Visualizing the Pictures of the Year International (POYi) archives

It’s the beginning of my last semester at UM. I can hardly believe it. If all goes well (and why shouldn’t it?) I’ll graduate this May with an MFA in Interactive Media. Cool.

Last week, I officially proposed my capstone project. Working title: Reframing Photographs: Visualizing 75 years of the Pictures of the Year International archives. The goal is to explore and understand the archives. Then, create an interactive experience (most likely a website) that presents my findings and create a tool that allows people to explore the data on their own. Both will be true feats that I will pull off.

Need to Learn

To get started, I’ve created a rough project schedule and scheduled a meeting with our local data librarian tomorrow. I’m hoping he’ll be able to give me some insight on what options there might be for building the exploratory tool. My guess is that this will be a test of my JavaScript skills. This semester I’m also learning more D3.js and taking a WebGIS course. Most likely I’ll be using the skills I learn in both those classes, as well as what I learn on my own, for this project. R will definitely come in handy as well.

The data

What makes this data interesting so far is that there is a mix of images and text, a mix of black and white and color images and that in some places data is missing. I’m still working my way through it but some years have more holes than others. Figuring out how to present the missing data will be interesting.


These are some of the questions I’ve come up with, and I got some criticism that they are too obvious. I’m not sure I agree with that, but it is good to consider.

Next Steps

I’m still cleaning up the data but soon I’ll start analyzing the data using R and sketching using Flourish, RawGraphs, Data Illustrator or Illustrator. Wish me luck!