Getting Feedback: Pandemic Edition

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my journey, here are my posts tagged with capstone.

So, as you may have guessed, most universities went remote with classes. We had an extra week of spring break so that faculty and students could make some adjustments. I was thrilled to hear it because I’m still juggling caring for my hubs and sorting out how to get groceries, what to cook (empty shelves of meat – really?), how long viruses live on surfaces (deliveries), and where to find masks and hand sanitizer. I was prepared long before most people, but things have clearly gotten much worse, and the scale of how unprepared the U.S. is… I’m still digesting that.

Still, I want to make it clear that my situation is much better than most. I’m one of the lucky ones. I just wish people weren’t taking this out on Asians. But I’m not going there. It’s just another layer of stress. There have been a few incidents at the local grocery stores. Another reason I have avoided going out.

Mural

One of the hardest aspects of going remote is that everyone uses different tools. I hear Zoom is the meeting app of choice, but I prefer Slack (yes, there is video) because it allows you to share all types of content and the integrations are stellar. Zoom doesn’t allow for sharing of content, only sharing of screens.

So I had to figure out a way to share the many charts I’ve been exploring. Sending out screen grabs and attachments in an email is tedious and doesn’t work. I thought about using Google Slides but the format and proportions are clunky.

Enter Mural.

I discovered this tool eons ago but rediscovered it while doing some teamwork for, I believe, our UX Research Methods class. It is an awesome whiteboard/collaboration tool. Super easy to use.

My explorations using RawGraphs, Flourish, and DataWrapper

The great part about Mural is that people can leave notes and draw on the board. Even during a screen share, you can see where people are drawing so communication is clear. I’m sure I haven’t used all the great features they offer and I’m looking forward to learning more about it. Right now, everything feels like “just learn enough” to get the project done!

Weekly Meetings

One of my advisors, Dr. Barbara Millet, set up a weekly meeting for all of her advisees. This was great because I could get additional feedback on the direction of my charts and my focus from three classmates plus Dr. Millet, on top of feedback from Alberto Cairo. Having two advisors is a win-win. They offer different perspectives and see different aspects. I couldn’t have asked for a better “committee”. I also met with Lenny Martinez, who is teaching many of us an intro to D3.js. My other source of feedback was my husband, who not long ago expressed his frustration with charts online: “Some take too long to figure out”. He was my litmus test.

So, the best combination for feedback during a pandemic?

It depends on your project, but for dataviz I found weekly meetings and two advisors, along with Mural, WhatsApp, Slack, and Zoom, to be helpful.

But the first piece is you. If you are healthy and have the time to get out of bed, turn off Netflix … create!

Focus on Gender

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my journey, here are my posts tagged with capstone.

Well, it’s roughly mid-semester and I feel like I haven’t made much progress, though I’ve been working non-stop acquiring data and manually fixing errors as much as I am able. In my excitement, I proposed a project that is probably more than what can be done in just a few months. Given I’m a one-woman band and it’s been established that there’s quite a bit of missing data, analyzing the entire POYi archives — photos, text and numbers — won’t be feasible in the time I have left.

Two Paths

  1. Focus on 10 or 20 years (which years, to be determined) and go deep; meaning: look at gender, color analysis, text analysis, similarity, etc.
  2. Focus on gender because it is the one variable that has a low percentage of missing data.

I’m more excited about the first path than the second, but what concerns me about going deep is that each one of those types of analysis could become huge. I would want to learn the process Nicholas Rougeux took to analyze the colors of The New Yorker, and for similarity, I would follow the process outlined by the Yale Digital Humanities Lab. In fact, I already started testing (see gif below), but the next steps require some serious brainpower (focus), and honestly, the coronavirus is making me a bit crazy and news-obsessed.

I got this to work! A test using TensorFlow from documentation by the Yale Digital Humanities Lab

Practical Path

So, the ambitious side of me needs to chill, which means I’ll go with path 2 and focus on gender. It’s realistic, I think. But I still need to determine which time period to cover and how long it should be. Ten years seems too little. Twenty seems like a good timeframe. I want a period where there might be some great context for the charts.

My big concern is that an analysis of demographics might be too boring. Who wants to be boring??? If there were information about race or ethnicity, that could be more interesting, but alas, there is not.

Time to not worry about it and just move forward.

Data Visualization Design Process

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my journey, here are my posts tagged with capstone.

Documentation

One of the most challenging aspects of grad school, while juggling a research assistant position and course work, is setting aside time to write documentation about your process. You are moving so fast and switching mindsets between projects, classes, and types of writing (documentation and narratives are different from an academic paper) that documentation is the least of your concerns. But it has to be done because without it, you’ll forget. There’s nothing quite as bad as realizing you’ve made a mistake and need to process data all over again, or recreate a chart with new data, after filtering many different sets!

“Expect to make mistakes”.

Cameron Riopelle, Head of Data Services at Richter Library, University of Miami Libraries, said that to me during one of our meetings. I took it to heart, especially given that I’m doing a lot of manual data munging.

So, what is my data visualization process?

A design process is exactly what I’ve been thinking about since starting this capstone because I never really put one together. My approach was always organic, and most of what I’ve been learning is Design Thinking and Human-Centered Design (HCD), with the added benefit of UX Research Methods.

I think there are definite overlaps and applications of design thinking and HCD in data visualization, since most designers (I hope) place emphasis on effective, efficient, and satisfactory experiences. Data visualization is about communication, after all. Optimizing for a positive user experience is at the heart of most design solutions; however, the struggle I’ve had is when a project is forced into a process that isn’t the best fit and the deliverables aren’t the right match. But to say that and not provide an alternative is a problem in itself. It’s like criticizing a person’s work without offering up ideas or suggestions.

Look to the experts

In an effort to offer an alternative and figure out a design process that is specific to data visualization for my own sanity, I consulted books and the Google. I discovered another great article by those fine peeps at DataWrapper. What Questions to Ask When Creating Charts hit the nail on the head for me. As Lisa says, “The process of creating a data visualisation can be messy”.

Two of the six data viz workflows she shares resonate with me more than the others, though Lisa’s own process is a definite bonus and functions as a reminder when you lose your way. I’ll try to explain later, but for now, the first data viz workflow is from Ben Fry. It’s from his book, “Visualizing Data: Exploring and Explaining Data with the Processing Environment”.

The Seven Stages of Data Visualization by Ben Fry.

What I like about this process is that it is very much focused on the data and understanding it. It is also not an assembly line: it is iterative, and you may need to go back to some stages as you progress through others.


The other workflow I felt was a good fit is Andy Kirk’s Four Stages of the Data Visualization Design Process:

Andy Kirk’s process from his book, Data Visualization: A Handbook for Data Driven Design, which is in its second edition.

It has fewer steps at the top level, but there are similarities that encompass a lot of Ben Fry’s process at the second level within each stage. I’ve been trying to read this off and on for days … He shares the process above early in the book.


The third is Lisa Charlotte Rost’s process.

For me, her process falls somewhere between Ben Fry’s Represent and Refine phases and Andy Kirk’s Stage 3 and Stage 4. Only once you understand the data (What’s your Point?) can you move to Proof and Explaining.


“Reducing the randomness of your approach”

That’s from Andy Kirk’s book, Data Visualization: A Handbook for Data Driven Design. It seems obvious enough, right? Well, if I get to read more this semester, perhaps that crazy tangled ball of yarn won’t be so tangled in the future!

Learning When to Stop

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

I’ve been fortunate to have spent the last year and a half with classmates who have engineering and programming backgrounds. They taught me to persist when programming. If something doesn’t work, try again. console.log, Google, Stack Overflow, and talking through the syntax out loud are all options I practice on a daily basis these days.

But when do you stop?

Scraping One Year of the Pictures of the Year Website

Looks simple enough, right?

I needed the following from the 62nd POYi competition (2004) to fill in a massive hole in my data:

  • Division
  • Category
  • Award
  • Photographer names
  • Publication
  • Title of the Photograph
  • Caption

I have experience with CSS and HTML so I was thinking it couldn’t be too hard. After digging around for a way to use R, I landed on rvest and was thrilled. This should be easy and fast.

Boy was I wrong.

Time to Stop and Go Manual

Ta-da! Nested tables. Remember when we used <table> for layout back in the early aughts? 😱 In this case, the entire website was structured with table after table. In addition, many of the tables and rows were given the same class names! So much for targeting selectors.

Of course, this didn’t dawn on me right away. I stumbled for quite a while trying to make various scraping methods work. Surely there was a way to work around the nested tables? Well, after several tries over too many hours, I ended up with itty-bitty bits of partial and/or duplicate data. It hit me that manually copy-pasting the information I needed was going to be more expedient than merging, deleting, and joining potentially hundreds of partial CSV files that were as messy as the nested tables.
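For context, my rvest attempts looked roughly like the sketch below. The URL and the class name are placeholders (I’m not reproducing the actual POYi markup here), but it shows why grabbing tables wholesale falls apart when the page layout itself is built from nested tables that share class names.

# A rough sketch of what I was trying, not the exact code.
# The URL and the "results" class are hypothetical placeholders.
library(rvest)

page <- read_html("https://example.com/poyi/62/winners")

# Grabbing every <table> returns the layout tables *and* the tables
# nested inside them, so the same rows come back again and again.
all_tables <- page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

# Targeting a class doesn't narrow things down when many tables
# share the same class name.
maybe_winners <- page %>%
  html_nodes("table.results") %>%
  html_table(fill = TRUE)

length(all_tables)  # dozens of overlapping, partial data frames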

So, in between a rewrite of a conference paper (in less than a day), tweaking prototypes for an empirical study, classes, and other design work, I set about going through every link on this page to capture the data I was missing in my master sheet. I think the picture stories were the most tedious since the pagination dots were so small, and even though I had created a pretty sweet copy-paste flow to minimize mistakes, they still happened.

Plus, I ran into the same issue I had when migrating the image file paths to the master sheet. The rows in the master were in a different sort order than the winners’ list and the galleries it contained. So, what’s a girl to do?

Matching the unique ID of images with the actual image, photographer, award, and caption.

Triple-check, Double-check, Cross-check!

The downside of the manual approach is that there is plenty of room for error. I pasted into the wrong row or the wrong column a few times. Plus, the master was not labeled as clearly as the sheets I had been copying and pasting my data into. Some sections were easier than others but wow, it was a maze. The bright side? I only had to do this for one year. Rather, I gave myself that constraint since I’m feeling incredibly behind on my schedule.

Getting to this point was a struggle. My first lesson, only a little over a year ago, was that 80% of the work is in finding, cleaning, and analyzing the data. I keep reminding myself of this when I feel like I haven’t moved the dial forward.

So, one big gaping hole in my data is no more. BUT, there are still holes, so I need to figure out a way to manage them. I think the next best thing is to create some pivot tables to get a better look. The goal is to see if I can scale down the scope as a way to deal with the gaps. 🤔
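Instead of (or alongside) pivot tables, a quick grouped summary in R might show the gaps per year. Below is just a sketch; the file name and column names (year, gender, publication) are hypothetical stand-ins for my master sheet.

# A sketch for sizing up missing data by year. File and column names
# are made-up placeholders, not my actual master workbook.
library(readxl)
library(dplyr)

master <- read_excel("master.xlsx")

master %>%
  group_by(year) %>%
  summarise(
    entries        = n(),
    missing_gender = sum(is.na(gender)),
    missing_pub    = sum(is.na(publication))
  ) %>%
  arrange(year)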

Match Image File Paths to Image IDs Across Excel Sheets

Note: I’m a graduate student at the University of Miami working on my capstone, a visualization of the Pictures of the Year International Archives. If you’re curious about my process, here are my posts tagged with capstone. Keep in mind, I’m learning as I go so I’m more than all ears for more efficient solutions. Please let me know in the comments!

Image paths – Check.

The realization that my master Excel workbook (merged from multiple workbooks with multiple sheets) had a different sort order from the workbook with my image paths? Roadblock.

What I thought would be a simple copy of entire column A from workbook 1 (all image file paths) into column B of workbook 2 (the empty column where the paths should go) soon became a hot mess.

Because the order of the image IDs in my master was not consistent with the order of the image paths I had extracted, I realized I would have to manually match each file path, then copy-paste it into the row with the corresponding image ID in my master. What makes this dataset so dirty is that the file naming conventions were changed periodically, so there might be a year or a few where the image names were consistent and then, suddenly, they were modified.

After several hours, it hit me that I should try this using a function either in R or Excel. I consulted The Google for ways to match the ID with a file path across/between multiple workbooks, tables, or sheets. I wasn’t sure what I was looking for but finally came upon two options that seemed worth the effort to learn.

=VLOOKUP or =INDEX + MATCH?

Testing a small sample set with some of the IDs and file paths was the only way to go and just. try. something. I think R would have been the best route IF the data frames had already existed in R, but I wasn’t ready to commit because I still had loads of data to merge.
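For the record, had the data already been in R, the partial matching could be done with a straightforward substring lookup. Here is a sketch with made-up IDs, paths, and column names:

# Sketch: match each image ID to the first file path containing it.
# The IDs, paths, and column names below are invented for illustration.
library(stringr)

master <- data.frame(image_id = c("62POYi_001", "62POYi_002"),
                     stringsAsFactors = FALSE)
paths <- c("img/2004/62POYi_002_story.jpg",
           "img/2004/62POYi_001.jpg")

# For each ID, keep the first path that contains it as a literal substring
master$file_path <- vapply(master$image_id, function(id) {
  hit <- str_subset(paths, fixed(id))
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

master

That is the same idea the Excel wildcard lookup below ends up expressing.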

So I went with Excel functions and decided to try INDEX with MATCH or VLOOKUP. This website, ExcelJet, came to my rescue. Finally.

After several tries with MATCH and INDEX, I kept getting #NAME? or #N/A. The #NAME? error is easy to fix since it’s an alert that your syntax needs help, but the #N/A was surprising since I knew there were matches. I believe the issue was that MATCH was looking for an exact string? Possibly. 🤷🏻‍♀️

Not having any luck with INDEX + MATCH, I decided to try VLOOKUP. Someone told me that if I spend more than 20 minutes on a problem and don’t get anywhere, I should a) take a break and come back to it, b) ask for help, or c) move on to something else. Twenty minutes seems too short, so I give myself an hour (ok, much more sometimes because I can be stubborn) and then switch gears.

After several attempts with the same results as INDEX + MATCH, it occurred to me that, once again, the function was looking for an exact match when what I needed was a partial one. Enter the * wildcard. It seems obvious to me now, but if you are like me and don’t use wildcards in your everyday activities, it is hard to remember they exist. Am I right?

Enter ExcelJet’s Partial Match with VLOOKUP article.

VLOOKUP Across Multiple Sheets

Using a sample of my image file paths and the image IDs from the master workbook, I ran the formula below and boom — happy dance.

=VLOOKUP("*" & A33681 & "*",Sheet2!A47:A62035,1,FALSE)

Now, I took that formula to my master and moved slowly because the wildcards might be a bit too loose and insert an incorrect path. Thankfully, those instances were few, and by moving slowly down my sheet instead of filling the entire column at once, I saved myself a future headache.

I wish I could say this was the last step in cleaning my data but alas, there is so much more to do. In fact, I’m missing quite a bit. Mild panic.