Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!
Image paths – Check.
The realization my master Excel workbook (merged from multiple workbooks with multiple sheets) had a sort order different from the workbook with my image paths? Roadblock.
What I thought would be a simple copy entire column A from workbook 1 (all image file paths) to column B in workbook 2 (empty column where paths should go) soon became a hot mess.
Because the order of the image IDs in my master was not consistent with the order of the image paths I had extracted I realized I had to manually match, then copy-paste so there was a match between the image file paths and the rows with the image IDs in my master. What makes this dataset so dirty is that the file naming conventions were changed periodically so there might be a year or a few where the names of images were consistent then suddenly they were modified.
After several hours, it hit me that I should try this using a function either in R or Excel. I consulted The Google for ways to match the ID with a file path across/between multiple workbooks, tables, or sheets. I wasn’t sure what I was looking for but finally came upon two options that seemed worth the effort to learn.
=VLOOKUP or =Index + Match?
Testing a small sample set with some of the IDs and file paths was the only way to go and just. try. something. I think R would have been the best route IF the dataframes already existed in R but I wasn’t ready to commit because my thought was I still had loads of data to merge.
So I decided to go with Excel functions and decided to try Index with Match or VLookup. This website, ExcelJet, came to my rescue. Finally.
After several tries with Match and Index, I kept getting
#NAME! error is easy to fix since that’s an alert that your syntax needs help but the
#N/A was surprising since I knew there were matches. I believe the issue was that
match was looking for an exact string? Possibly. 🤷🏻♀️
Not having any luck with Index + Match, I decided to try VLOOKUP. Someone told me that if I spend more than 20 minutes on a problem and don’t get anywhere to a) take a break and come back to it b) ask for help or c) move on to something else. Twenty minutes seems too short so I give myself an hour (ok, much more sometimes because I can be stubborn) and switch gears.
After several attempts with the same results as Index + Match it occurred to me that once again an exact match was needed. Enter the * wildcard. It seems obvious to me now but if you are like me and don’t use wildcards in your everyday activities, it is hard to remember it exists. Am I right?
VLOOKUP Across Multiple Sheets
Using a sample from my image file paths and the image IDs from the master workbook, I ran the following function (below) and boom — happy dance.
=VLOOKUP("*" & A33681 & "*",Sheet2!A47:A62035,1,FALSE)
Now, I took that function to my master and moved slowly because the wildcards might be a bit too loose and insert an incorrect path. Thankfully those instances were few and by moving slowly down my sheet instead of filling the entire column with the function I saved myself a future headache.
I wish I could say this was the last step in cleaning my data but alas, there is so much more to do. In fact, I’m missing quite a bit. Mild panic.