Use R to Extract Thousands of Image File Paths and Write to CSV

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

After spending way too much time on the process I outlined in my previous post about adding links and image paths to an Excel file, I decided to figure out another, possibly more efficient solution with either R or Python. It was worth the effort, and it’s amazing how far down the rabbit hole you can go testing and trying out different methods. I had to cut myself off – lol.

Every project means learning something new, even if it seems basic. Since I’ve been learning R entirely on my own using the big bad Google and a few books, I’m trying not to beat myself up for learning how to ride my bike for the first time. So, after a few attempts with Python to extract links to thousands of images in folders and subfolders, I decided to try R and was finally successful.

After some digging, I stumbled upon this post about working with files and folders in R. It was exactly what I needed to understand how to eventually extract the file paths to every single image on my external hard drive.

list.files()

#get the full name (path and file name) of each file
list.files(full.names = TRUE)

#list of subfolders and files within the subfolders
list.files(recursive = TRUE)

I could finally make a visual connection between what was printed in R and what was on my hard drive – yay.
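To compare those three calls without pointing R at a real drive, here is a minimal sketch that builds a tiny throwaway folder first. The folder and file names (poy_demo, 1950, 01.jpg, readme.txt) are made up for illustration; only list.files() and its full.names and recursive arguments come from the post.

```r
# Build a small, disposable folder tree inside R's temp directory.
# All names below are invented for the demo.
demo_dir <- file.path(tempdir(), "poy_demo")
dir.create(file.path(demo_dir, "1950"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "readme.txt")))
invisible(file.create(file.path(demo_dir, "1950", "01.jpg")))

# Names only, top level only (the subfolder appears as an entry)
top_level <- list.files(demo_dir)

# The same entries, with the folder path in front of each name
with_paths <- list.files(demo_dir, full.names = TRUE)

# Walk into subfolders; only files are listed, not the folders themselves
everything <- list.files(demo_dir, recursive = TRUE)

print(top_level)
print(with_paths)
print(everything)
```

Comparing the three printed vectors side by side is a quick way to see what each argument adds before running the call on thousands of files.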

Now, before I go any further I should explain why I need the paths to the image files. I’m hoping to use PixPlot by the Yale Digital Humanities Lab and/or TensorFlow to analyze and organize images. Using these tools will be my first attempt and if the wheels come off, I’ll have to figure out another solution.

OK, onward!

list.files("C:/path/to/somewhere/else", full.names = TRUE, recursive = TRUE)

Using the line of code above for reference, I asked for the full path to each file, including the file name (e.g. 01.jpg), with full.names = TRUE, and told R to look inside every subfolder (I knew there were many) with recursive = TRUE.
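A drive full of images usually has a few non-image files mixed in. One possible refinement, sketched below, is to pass a pattern so only image files come back. The pattern and ignore.case arguments are part of list.files(); the folder name, the sample files and the list of extensions in the regular expression are assumptions for the demo.

```r
# Hypothetical folder with a mix of images and other files
img_root <- file.path(tempdir(), "poy_filter_demo")
dir.create(file.path(img_root, "1962"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(img_root, "1962", c("01.jpg", "02.JPG", "notes.txt", "Thumbs.db"))
))

# Regular expression for common image extensions (an assumed list; adjust as needed)
image_pattern <- "\\.(jpe?g|png|gif|tiff?)$"

img_only <- list.files(
  img_root,
  pattern     = image_pattern,
  ignore.case = TRUE,   # match .JPG as well as .jpg
  full.names  = TRUE,
  recursive   = TRUE
)

print(basename(img_only))
```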

Then, I used the assignment operator to store the list (results) in a variable I created, poy_img_paths. It looks like this:

poy_img_paths <- list.files(full.names = TRUE, recursive = TRUE)

What I love about R, and RStudio to be exact, is that you can view the data within R, so I could see exactly what I was getting.

Now that I had my list of paths to the images, I wanted to create a data frame and then export it to an Excel file.

poyImagePaths.data <- data.frame(poy_img_paths, stringsAsFactors=FALSE)

Then, after trying various Excel or xls packages, I ran into a wall. Turns out some of them required Java and others just plumb wouldn’t work. I got loads of errors – oh yea. I still can’t find a solution for the errors so I went another route but you can see the repeated, ugly results here:

I then tried:
library(readr)  # write_excel_csv(), write_excel_csv2() and write_csv() all come from readr
write_excel_csv(poyImagePaths.data, "poyImgPaths.xlsx")
write_excel_csv2(poyImagePaths.data, "poyImgPaths.xlsx")

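No luck, and I’m still not 100 percent sure why. Clearly there are dependencies like Java that are not working, and I got some errors about corrupted files. One likely cause of the corrupted-file errors: write_excel_csv() and write_excel_csv2() write CSV text that Excel can read, not a true .xlsx workbook, so a file saved with an .xlsx extension this way is really a CSV that Excel may refuse to open.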

So, what do you do? Move on to the alternative: export a CSV.
write_csv(poyImagePaths.data, "poyImgPaths.csv")

Boom.
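For anyone who wants to try the whole flow without installing a package, here is a minimal base R sketch of the same idea: list the files, build a data frame, write a CSV, and read it back to check the row count. It swaps in base R’s write.csv() for write_csv(), and the folder, file and column names are hypothetical.

```r
# Hypothetical image folder created just for this demo
img_root <- file.path(tempdir(), "poy_csv_demo")
dir.create(file.path(img_root, "1975"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(img_root, "1975", c("01.jpg", "02.jpg"))))

# 1. Collect every file path under the folder
paths <- list.files(img_root, full.names = TRUE, recursive = TRUE)

# 2. Put the paths in a data frame; normalizePath() makes them absolute
paths_df <- data.frame(
  path      = normalizePath(paths, winslash = "/"),
  file_name = basename(paths),
  stringsAsFactors = FALSE
)

# 3. Write the CSV (row.names = FALSE avoids an extra numbered column)
out_file <- file.path(tempdir(), "poyImgPaths_demo.csv")
write.csv(paths_df, out_file, row.names = FALSE)

# 4. Read it back and confirm nothing was lost along the way
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(nrow(check) == length(paths))
```

Reading the file back is a cheap sanity check before handing the CSV to another tool.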

I have no idea if I need active links to the images but I’ll soon find out!

Right now, I have more cleaning to do. With this data, every day is full of surprises.

How to Add Links to Images to an Excel Sheet

This is one of the search phrases I used to figure out an efficient way to add the image paths from the thousands of images I have from the POYi archives to an Excel sheet. I searched for any combination of the following:

  • “hyperlink to photographs”
  • “URL”
  • “batch processing”
  • “image file paths”

It is always hard and slightly frustrating when you are learning how to do something and you have no clue what keywords to search for in your quest because you don’t know what you don’t know. Then, there is the uncertainty that what you have found will even work. But, hey, this is learning and if you find this post, I hope I can save you some pain.

The disclaimer is that there might be a much faster way but I have yet to discover it. If you are OK with that, keep on reading and if I do find a more efficient way of handling this problem, I’ll update this post. *Pinky promise*

HYPERLINK(link_location, [friendly_name])

But first, I want to share that I did try using the =hyperlink function in Excel but after several attempts, I finally found a post/tutorial that informed me this function only works with URLs. I took that to mean the images need to be on a server. But, according to the Microsoft Office documentation, I should have been able to use =hyperlink to access the image files in my folder on my desktop. So, I’m puzzled why it didn’t work for me and rather than spend another hour trying to sort that out, I went back to searching and found an interim solution courtesy of James Cone in the Microsoft Office Dev Center forums.

There’s more than one way to skin a [you fill it in because I can’t even write it].
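Another way to approach the HYPERLINK(link_location, [friendly_name]) idea is to build the formula text outside of Excel and paste or import it. The sketch below does that in R for a couple of made-up paths. The /Volumes/POYi location and the make_hyperlink() helper are hypothetical, and whether Excel evaluates the formula on import can vary by version and settings, so it is worth testing on a few rows first.

```r
# Made-up paths standing in for the real list of image files
img_paths <- c("/Volumes/POYi/1950/01.jpg", "/Volumes/POYi/1950/02.jpg")

# Hypothetical helper: wrap a path in Excel's HYPERLINK(link_location, friendly_name)
make_hyperlink <- function(path) {
  # Inside an Excel formula string, a double quote is escaped by doubling it
  safe_path <- gsub('"', '""', path, fixed = TRUE)
  sprintf('=HYPERLINK("%s", "%s")', safe_path, basename(safe_path))
}

links_df <- data.frame(
  path         = img_paths,
  link_formula = make_hyperlink(img_paths),
  stringsAsFactors = FALSE
)

print(links_df$link_formula)
```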

Make sure you have your images organized

I’m using an external SSD (Solid State Drive) to store my images for my project and this may change but for now, I wanted to keep the image paths as clean as possible. I still need to do one more test before I commit to keeping the images on the external drive and so far, my current solution feels right. Dropbox is such a processing hog and the way it handles files and URLs… I don’t trust it. Now, there could be something to their API and I need to look into that as a possibility for serving up images. In the meantime, on to the next step.

Copy the absolute file path to the folder of images

I didn’t know how to do this in Catalina. In previous Mac operating systems you could click on the folder and use command-i or “get info” and see the absolute path in “where”. I’m not sure when it changed but now in order to get the absolute path, you need to do some trickery.

First, make sure you have “Show path bar” selected under “View” in the Finder menu bar.

Then, with the folder containing your images open, look for the path at the bottom and control-click (or right-click) the folder name. You should get a context menu. Select “Copy [the name of your folder] as Pathname”.

Open a new window in Firefox

Paste the path you copied into the new window’s address bar and hit return or refresh the page. You should see a list of your image file names along with the size, and last modified information.
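The same kind of listing (file name, size, last modified) can also be produced in R with file.info(). This is a minimal sketch using a throwaway folder; the folder name, the file and the column names are assumptions for the demo.

```r
# Hypothetical folder with one stand-in "image" file
img_root <- file.path(tempdir(), "poy_info_demo")
dir.create(img_root, showWarnings = FALSE)
writeLines("not really an image", file.path(img_root, "01.jpg"))

paths <- list.files(img_root, full.names = TRUE)

# file.info() returns one row per file, including size (bytes) and mtime
info <- file.info(paths)

listing <- data.frame(
  file_name     = basename(paths),
  size_bytes    = info$size,
  last_modified = info$mtime,
  stringsAsFactors = FALSE
)

print(listing)
```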

Copy the links from Firefox to Excel

You should get something like the image above and the first link in the image is different from the rest because I changed the display format of that first link. I don’t need a “friendly” display.

Click on one of the links to double-check

It’s good practice to make sure the links and paths are mapping to the correct file. So, when you click on one of the links, the image should open up in your default preview application.
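Clicking works for a handful of links. For thousands of paths, a rough equivalent is to ask R whether each path points at a file that exists. The sketch below uses file.exists() on one real temporary file and one deliberately missing one; both file names are invented.

```r
# One file that exists and one that does not (both names are made up)
real_file <- file.path(tempdir(), "poy_check_demo.jpg")
invisible(file.create(real_file))

paths_to_check <- c(real_file, file.path(tempdir(), "missing_image.jpg"))

# TRUE where the path resolves to something on disk, FALSE otherwise
found <- file.exists(paths_to_check)

# Keep the paths that need a second look
missing_paths <- paths_to_check[!found]

print(basename(missing_paths))
```

This only confirms the file is there, not that it is the right image, so opening a few by hand is still worthwhile.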

So, that’s it!

Now, I do realize there might be a more efficient way. I’m definitely going to look into another Excel function, try the =hyperlink function again, write a Python script or even use R to handle this. Since that would take longer for me to learn, I opted for a solution that works for the interim so I can test and learn how to execute a process for image plotting in Python.

Learning How to Wrangle Data with R

From Wikipedia:

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.[1]
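Applied to a list of image paths, those general steps (extract, munge, deposit) might look something like the base R sketch below. The year/category/file folder layout and every name in it are assumptions made for illustration, not the actual structure of the POYi archive.

```r
# Extract: raw relative paths (made-up examples; real ones would come from list.files())
raw_paths <- c("1962/news/02.jpg", "1950/portrait/01.jpg", "1962/news/01.jpg")

# Munge: split each path into pieces and parse them into a fixed structure
parts <- strsplit(raw_paths, "/", fixed = TRUE)
wrangled <- data.frame(
  year      = as.integer(vapply(parts, `[`, character(1), 1)),
  category  = vapply(parts, `[`, character(1), 2),
  file_name = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)

# Munge: sort so related images sit together
wrangled <- wrangled[order(wrangled$year, wrangled$category, wrangled$file_name), ]
rownames(wrangled) <- NULL

# Deposit: write the result to a "data sink" (here, a CSV in the temp directory)
sink_file <- file.path(tempdir(), "poy_wrangled_demo.csv")
write.csv(wrangled, sink_file, row.names = FALSE)

print(wrangled)
```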

Now that I have as much of the POYi data as I believe exists, I’ve been in the process of learning how to use R to wrangle data and learning about possible ways to create a database. A few people have recommended a database but I’m not sure I really need one, yet.

One great source I’ve discovered is this post by Sharon Machlis titled Great R packages for data import, wrangling and visualization. So, this is where I’ve started. If I run into a roadblock there, I’ll see about a database. Either path is rich with learning but at least with R I have some familiarity with dplyr, tidyr, magrittr, ggplot2 and others in the list because of my last big data viz project about Anne Sexton.

Here I go!

Capstone: Visualizing the Pictures of the Year International (POYi) archives

It’s the beginning of my last semester at UM. I can hardly believe it. If all goes well (and why shouldn’t it?) I’ll graduate this May with an MFA in Interactive Media. Cool.

Last week, I officially proposed my capstone project. Working title: Reframing Photographs: Visualizing 75 years of the Pictures of the Year International archives. The goal is to explore and understand the archives, then create an interactive experience (most likely a website) that presents my findings, along with a tool that allows people to explore the data on their own. Both will be true feats that I will pull off.

Need to Learn

To get started, I’ve created a rough project schedule and scheduled a meeting with our local data librarian tomorrow. I’m hoping he’ll be able to give me some insight on what options there might be for building the exploratory tool. My guess is that this will be a test of my JavaScript skills. This semester I’m also learning more D3.js and I’m taking a WebGIS course. Most likely I’ll be using the skills I learn in both those classes as well as what I learn on my own for this project. R will definitely come in handy as well.

The data

What makes this data interesting so far is that there is a mix of images and text, a mix of black and white and color images and that in some places data is missing. I’m still working my way through it but some years have more holes than others. Figuring out how to present the missing data will be interesting.

Questions

These are some of the questions I’ve come up with and I got some criticism that these questions are too obvious. I’m not sure I agree with that but it is good to consider.

Next Steps

I’m still cleaning up the data but soon I’ll start analyzing the data using R and sketching using Flourish, RawGraphs, Data Illustrator or Illustrator. Wish me luck!

Two Books That Kept Me Going When I Wasn’t Feeling 100% in Grad School

Sometimes, ok, most of the time, life as a grad student can be overwhelming. So, you need friends who insist you need to get some fresh air or stretch and you need books to remind you that pretty much everyone else is going through something similar and you are not alone. Two books filled that need and I found myself returning to them periodically throughout the semester.

Am I Overthinking This?

This book by Michelle Rial is one of my new all-time favorites. It makes me smile and oftentimes think, “So true” or “Yep, been there”. Her book helped provide some levity when I felt like I was in the middle of analysis paralysis.

Take this chart for example. I don’t drink soda but the number of coffee cups that I collected around me while jumping from one project to the next was downright hilarious. What I would add to this would be the number of snack wrappers. I would not have survived without KIND bars.

From Am I Overthinking This? by Michelle Rial.

In my case, it would be the stove. Even the hot stove indicator light didn’t help (don’t get me started on the design of stovetops) because the LED would be on to communicate, “Hey, this is still hot” (even when off) and “Hey, this is on”. So just by looking I couldn’t tell.

The number of times I glanced at the stove trying to figure out if it was on or off as I walked out the door with my bike is countless. Or, the number of times I would feel a slight panic thinking I left it on when I had already reached campus.

It gets worse when you are sleep deprived and your cognitive abilities start to seriously decline. I put eggs in the cupboard, my hot coffee in the fridge, my cell phone in the freezer and would flat-out forget what I was going to do next.

From Am I Overthinking This? by Michelle Rial.

This last one I’ll share is one I have near my computer. Now, I know this chart just from life experience but when you are learning new things every single day and completely out of your comfort zone every single day, it is easy to forget you’ve already failed numerous times and are still successful.

The academic measures (grades) somehow also manage to cloud the main reasons why you are in school or in my case, back to school. In my lucid, non-stressed state, I know I’m not back in school for grades. But when a professor tells you, “You know, your grade will be affected by this”, it’s hard not to care; to feel like you decreased your chances for success.

From Am I Overthinking This? by Michelle Rial.

Info We Trust

I didn’t read this book from the beginning to the end. What I love about this book is that I could skip around and still am. I don’t know why but I tend to open books from the back (no idea how I picked up this quirk). For Info We Trust, by RJ Andrews, I landed in Chapter 19, “Creative Routines”. How apropos!

Here are a few highlights:

Creatives have routines

Creatives, according to RJ, did not have a similar pattern of activity but what they shared was a routine. That surprised me because I tend to think of creatives as well…creative. In my mind that is a bit of chaos and work when the mood strikes. My thinking does not come from research but most likely TV or movies. I’m happy to learn I’ve been misinformed.

Magical aha! moments are lovely when they arrive. But real creative production is about steady discipline, not waiting around for inspiration. You must create the time and space for work to happen.

RJ Andrews, Info We Trust, p. 198

Two professors, Dr. Barbara Millet and Alberto Cairo, would often share how important it is to establish a routine. It wasn’t enough just to establish one, though; you also had to focus on one thing for a specific timeframe. For example: If you are going to read journal articles, you might want to set aside Fridays to read and write in the mornings from 5 am to 7 am. Multitasking, after all, is horrible for your brain.

At first, I thought I couldn’t possibly change how I work. Also, within the first semester, I learned that everyone else’s schedule can put a wrench in your best plans for a routine. Teamwork and graduate studies don’t equal routines, especially when you are the sole morning person and the rest of your team prefers to work between 7 p.m. and whatever time it takes into the wee hours of the morning. Rough. So, focus on what you can control and when.

Experimentation, Learning, Exploration — Play, is Critical

Allow your curiosity to get the upper hand

RJ Andrews, Info We Trust, p. 198

This may sound crazy coming from me but coding has become a great place to play for me. A little music and something to learn, I find I can get in a zone where I’m willing to try as much as I need to figure out a bug. Granted, I’m not coding for the critical deployment of software. Released from that pressure, coding is becoming a great source of play. I’m learning and when something works, the emotion is off the charts. Look what I made!

Compare yourself—Try Not

One consequence of [learning from many different kinds of experts] is unfairly comparing yourself to specialists. That can lead to feeling like an imposter. Do not be too hard on yourself.

RJ Andrews, Info We Trust, p. 199

That last sentence… I struggle with that one—a lot. You too, right?

All I have to say about that right now is this: It’s a humbling experience going back to school full-time in your forties. I know a lot and all of a sudden I feel like I know nothing. Starting over is tough, rough and takes a lot of persistence. It requires remembering this is for the long game. As a former colleague reminded me, “It’s a marathon; not a sprint”.

Consume as many data stories as possible

In order to be a better data journalist or data visualization designer, look at and study more charts and data stories. Review what has been done in the past as it can influence what you do in the future.

Deconstruct past work to reveal your own unique blend of technical and temporal biases.

RJ Andrews, Info We Trust, p. 202

Alberto Cairo recommended this in his introductory class. He strongly recommended that we subscribe to The New York Times or any print edition of a major newspaper so that we can discover maps, charts, diagrams and data stories. The experience of print is different from browsing online.

I did just that. While there are some days when the paper piles up, like when I was a subscriber to The New Yorker, I would spend a half-hour flipping through the paper either when I got home from a full day of classes or in the mornings with my coffee.

The pile of newspapers I’ve clipped and saved is embarrassing but I discovered a lot of stories that could become data-driven stories. I’m hoping to make them personal projects so I can keep practicing what I’ve learned.

Networks

I wrote briefly about networks before at the end of this post and RJ has this to say about connections:

Creativity is all about making new mappings between previously unconnected things.

RJ Andrews, Info We Trust, p. 200

This. This place of “mental fireworks” is what makes going back to school worth the sometimes unbearable feelings of frustration and insecurity. I felt these fireworks twice this semester and I cannot describe how magical it truly feels.

For me, the addition of an Artificial Intelligence class brought many concepts and ideas together from the current and previous semesters plus the research I had been reading and writing about as a GA in the UX Lab. My neurons were firing at a rapid pace and the buzz was noted.

Be Active and Get Sleep

So this made me laugh out loud. Being active I could do because I rode my bike nearly every day to campus. The rides to and from campus reminded me of the beauty of mornings and gave me a way to decompress after hours of classes and work. Sleep on the other hand…

Sleep is essential for health, but it is also a productive creative tool. Taking a nap or sleeping on it overnight creates a natural space for the brain to ingest new information.

RJ Andrews, Info We Trust, p. 201

RJ has some great quotes in this section and I completely agree with this idea of “loading data overnight” but as a student sleep becomes a rare and cherished state.

Still, when I could, I did choose sleep a few times (even just 3-4 hours) over pushing through the night without it, so that whatever felt like a big hurdle or a complex interaction could marinate a bit. It definitely worked. With fresh eyes, I was more productive and more often than not found a solution.

Learning Takes Time and Sometimes Making Tough Choices

I had to make what still feels like major sacrifices this semester. This fact was hard to reconcile in my mind. Either I spend the time to write a blog post about a book I read or I read the peer-reviewed papers in order to write a very important literature review for a job for which I am getting paid (UX Lab). Either I spend the time learning how to code to complete my project or I spend the time creating a visual style guide and running all of my colors through WebAim for documentation.

All were important. How do you choose? I chose to be a responsible employee so that my boss and her professional endeavors and schedule aren’t compromised. I chose to code over visual styles because it is a skill I have not mastered. Those decisions may not have been right but those were the choices I made and I have no regrets. I learned plenty of skills and I learned and continue to learn a great deal about myself. Most of all, I’m proud of the work I created.

These two books helped me get some perspective and keep my sanity. I’m certain I’ll refer back to them many more times in the near future. Thank you RJ Andrews and Michelle Rial.