Use R to Extract Thousands of Image File Paths and Write to CSV

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go, so I’m all ears for more efficient solutions. Please let me know in the comments!

After spending way too much time on the process I outlined in my previous post about adding links and image paths to an Excel file, I decided to look for a more efficient solution in either R or Python. It was worth the effort, and it’s amazing how far you can go down the rabbit hole testing and trying out different methods. I had to cut myself off, lol.

Every project means learning something new, even if it seems basic. Since I’ve been learning R entirely on my own using the big bad Google and a few books, I’m trying not to beat myself up; it’s like learning to ride a bike for the first time. So, after a few failed attempts at extracting links to thousands of images in folders and subfolders with Python, I decided to try R and was finally successful.

After some digging, I stumbled upon this post about working with files and folders in R. It was exactly what I needed to understand how to eventually extract the file paths to every single image on my external hard drive.

#list the files and folders in the current working directory
list.files()

#get the full name (path and file name) of each file
list.files(full.names = TRUE)

#list the files inside every subfolder as well
list.files(recursive = TRUE)
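
To see how the three calls differ, here is a small sketch that builds a throwaway folder tree in a temporary directory and lists it each way. The folder and file names (poy_demo, 1950, 01.jpg, readme.txt) are made up for illustration and are not from the actual archive.

```r
# Build a tiny, hypothetical folder tree in a temporary directory
root <- file.path(tempdir(), "poy_demo")
dir.create(file.path(root, "1950"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "readme.txt"))
file.create(file.path(root, "1950", "01.jpg"))

# Top level only, names only: the subfolder and the text file
top_level <- list.files(root)

# Top level only, with the path prepended to each name
with_paths <- list.files(root, full.names = TRUE)

# Every file in every subfolder, relative to root (folders themselves are not listed)
everything <- list.files(root, recursive = TRUE)

print(top_level)
print(with_paths)
print(everything)
```

Passing the folder as the first argument means the result does not depend on whatever the working directory happens to be.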

I could finally make a visual connection between what was printed in R and what was on my hard drive – yay.

Now, before I go any further I should explain why I need the paths to the image files. I’m hoping to use PixPlot by the Yale Digital Humanities Lab and/or TensorFlow to analyze and organize images. Using these tools will be my first attempt and if the wheels come off, I’ll have to figure out another solution.

OK, onward!

list.files("C:/path/to/somewhere/else", full.names = TRUE, recursive = TRUE)

Using the line of code above for reference, I printed the full path to each file, including the file name (e.g. 01.jpg), by setting full.names = TRUE, and told R to look inside every subfolder with recursive = TRUE (I knew there were many).

Then I stored the results in a variable I created, poy_img_paths, using the assignment operator (<-). It looks like this:
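
If the folders also hold non-image files, list.files() can filter by file extension through its pattern argument, which takes a regular expression. This is a sketch under assumptions: the folder tree, the file names, and the set of extensions are invented, so adjust the pattern to whatever the archive actually contains.

```r
# Hypothetical folder with a mix of image and non-image files
root <- file.path(tempdir(), "poy_pattern_demo")
dir.create(file.path(root, "1950"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "1950", c("01.jpg", "02.JPG", "notes.txt", "Thumbs.db")))

# pattern is a regular expression matched against each file name;
# ignore.case = TRUE lets it match .JPG as well as .jpg
image_paths <- list.files(
  root,
  pattern = "\\.(jpe?g|png|tiff?)$",
  full.names = TRUE,
  recursive = TRUE,
  ignore.case = TRUE
)

print(basename(image_paths))
```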

poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)
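
One thing to watch: called without a folder argument, list.files() returns paths relative to the current working directory. If a downstream tool needs absolute paths, normalizePath() can convert them. The sketch below assumes a made-up folder in a temporary directory rather than the real hard drive.

```r
# Hypothetical folder standing in for the external hard drive
root <- file.path(tempdir(), "poy_abs_demo")
dir.create(file.path(root, "1950"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "1950", "01.jpg"))

# Paths relative to root
relative_paths <- list.files(root, recursive = TRUE)

# Rebuild the full path, then resolve it to an absolute one
absolute_paths <- normalizePath(file.path(root, relative_paths), winslash = "/")

print(relative_paths)
print(absolute_paths)
```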

What I love about R, and RStudio to be exact, is that you can view the data within it, so I could see exactly what I was getting.

Now that I had my list of paths to the images, I wanted to create a data frame and then export it to an Excel file.

poyImagePaths.data <- data.frame(poy_img_paths, stringsAsFactors=FALSE)
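
A data frame with a single column works, but splitting each path into its folder and file name can make later cleaning easier. Here is a minimal sketch using base R's dirname() and basename(); the three paths and the column names (path, folder, file_name) are invented for illustration.

```r
# Stand-in for the real list.files() result; these paths are invented
poy_img_paths <- c("./1950/01.jpg", "./1950/02.jpg", "./1951/news/01.jpg")

poy_df <- data.frame(
  path      = poy_img_paths,
  folder    = dirname(poy_img_paths),   # everything before the file name
  file_name = basename(poy_img_paths),  # e.g. 01.jpg
  stringsAsFactors = FALSE
)

# Show the structure: three character columns, one row per image
str(poy_df)
```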

Then, after trying various Excel (xls) packages, I ran into a wall. Turns out some of them required Java and others just plumb wouldn’t work. I got loads of errors, oh yeah. I still can’t find a solution for the errors, so I went another route, but you can see the repeated, ugly results here:

I then tried:
#write_excel_csv() and write_excel_csv2() come from the readr package
library(readr)
write_excel_csv(poyImagePaths.data, "poyImgPaths.xlsx")
write_excel_csv2(poyImagePaths.data, "poyImgPaths.xlsx")

No luck, and I’m still not 100 percent sure why, but clearly there are dependencies like Java that are not working, and I got some errors about corrupted files.
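
One possible reason for the corrupted-file errors, offered as a guess rather than a diagnosis: write_excel_csv() and write_excel_csv2() are readr functions that write CSV text (with a byte order mark so Excel detects UTF-8), not an Excel workbook. A file saved this way under an .xlsx name is still a CSV, and Excel may refuse to open it. The sketch below illustrates the idea with base R's write.csv() standing in for the readr call; the file name and the single path are made up.

```r
# Write plain CSV text to a file that merely has an .xlsx name
fake_xlsx <- file.path(tempdir(), "poyImgPaths_demo.xlsx")
write.csv(data.frame(path = "./1950/01.jpg"), fake_xlsx, row.names = FALSE)

# A real .xlsx workbook is a zip archive, so it starts with the bytes "PK"
first_two <- rawToChar(readBin(fake_xlsx, what = "raw", n = 2))
looks_like_xlsx <- identical(first_two, "PK")

# The file holds CSV text, so the check fails despite the extension
print(looks_like_xlsx)
```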

So, what do you do? Move on to the alternative: export a CSV.
write_csv(poyImagePaths.data, "poyImgPaths.csv")
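
For anyone without readr installed, base R's write.csv() can do the same job. This is a sketch with invented paths and a temporary output file, not the actual export; reading the file back is a quick way to confirm nothing was lost.

```r
# Invented paths standing in for the real data frame
poy_df <- data.frame(
  poy_img_paths = c("./1950/01.jpg", "./1950/02.jpg"),
  stringsAsFactors = FALSE
)

csv_file <- file.path(tempdir(), "poyImgPaths_demo.csv")

# row.names = FALSE keeps R's row numbers out of the file
write.csv(poy_df, csv_file, row.names = FALSE)

# Read the file back and compare it with the original column
check <- read.csv(csv_file, stringsAsFactors = FALSE)
round_trip_ok <- identical(check$poy_img_paths, poy_df$poy_img_paths)
print(round_trip_ok)
```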

Boom.

I have no idea if I need active links to the images but I’ll soon find out!

Right now, I have more cleaning to do. With this data, every day is full of surprises.
