Exercise 6 | Theory of Data Graphics I

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Practical considerations

We heard that we may not need graphics at all in some circumstances

A table, or just presenting the data in text, can make the most sense in some situations

But often we can convey much more, and much better, with data graphics

Consider the time series of the outgoing mail of the U.S. House of Representatives, which peaks every two years, just before election day

More on useful tools…

https://software-carpentry.org/

| Lesson | Repository | Site |
| --- | --- | --- |
| The Unix Shell | swcarpentry/shell-novice | rendered |
| Version Control with Git | swcarpentry/git-novice | rendered |
| Version Control with Mercurial | swcarpentry/hg-novice | rendered |
| Using Databases and SQL | swcarpentry/sql-novice-survey | rendered |
| Programming with Python | swcarpentry/python-novice-inflammation | rendered |
| Programming with R | swcarpentry/r-novice-inflammation | rendered |
| R for Reproducible Scientific Analysis | swcarpentry/r-novice-gapminder | rendered |
| Programming with MATLAB | swcarpentry/matlab-novice-inflammation | rendered |
| Automation and Make | swcarpentry/make-novice | rendered |
| Instructor Training | carpentries/instructor-training | rendered |

Software Carpentry

“A Software Carpentry workshop is a hands-on training that covers the core skills needed to be productive in a small research team.

Short tutorials alternate with practical exercises, and all instruction is done via live coding.”

Local workshops take place regularly in many parts of the world

All lessons are also available on GitHub

https://github.com/swcarpentry/swcarpentry

More on less useful tools…

Hand-in Exercise

🙌

This is the first of two hand-in exercises that must be completed in order to take the exam

Due October 23rd, 23:59 (AoE)

To be handed in on ILIAS (upload form provided there)

Everybody works on it on their own and uploads it individually (again, necessary for the exam!)

I do check for substantial overlaps between your handed-in materials (code, visualization and text) and those of other students

Hand-in Exercise Format

Your submission should consist of three parts, in this order:

  1. Your visualization (as a vector graphic!), which may also consist of multiple sub-plots with annotations, for example

  2. The documented code that produces your visualization (each line commented), in R or Python

  3. Half a page (A4) explaining the reasoning behind your design choices, the questions you wanted to answer, how you structured the data for your chosen visualization, and any challenges you faced and how you overcame them

Submit your solution as 1 (!) PDF whose filename includes your name. Additionally, put a personal identifier (name and student number) on every A4 page of the PDF document

If you have separate PDFs, you can combine them with, for example, the following command-line tool:

pdftk file1.pdf file2.pdf file3.pdf cat output first_submission_name.pdf

Or use any other tool of your choice (also consider creating your document directly in R Markdown or IPython notebooks)

Hand-in Exercise Data

This data comes from Hollywood Age Gap via Data Is Plural:

An informational site showing the age gap between movie love interests.

The data follows certain rules:

  • The two (or more) actors play actual love interests (not just friends, coworkers, or some other non-romantic type of relationship)

  • The youngest of the two actors is at least 17 years old

  • Not animated characters

“Note: The age gaps dataset includes ‘gender’ columns, which always contain the values ‘man’ or ‘woman’. These values appear to indicate how the characters in each film identify. Some of these values do not match how the actor identifies. We apologize if any characters are misgendered in the data!!”

age_gaps.csv

| variable | class | description |
| --- | --- | --- |
| movie_name | character | Name of the film |
| release_year | integer | Release year |
| director | character | Director of the film |
| age_difference | integer | Age difference between the characters in whole years |
| couple_number | integer | An identifier for the couple in case multiple couples are listed for this film |
| actor_1_name | character | The name of the older actor in this couple |
| actor_2_name | character | The name of the younger actor in this couple |
| character_1_gender | character | The gender of the older character, as identified by the person who submitted the data for this couple |
| character_2_gender | character | The gender of the younger character, as identified by the person who submitted the data for this couple |
| actor_1_birthdate | date | The birthdate of the older member of the couple |
| actor_2_birthdate | date | The birthdate of the younger member of the couple |
| actor_1_age | integer | The age of the older actor when the film was released |
| actor_2_age | integer | The age of the younger actor when the film was released |

age_gaps.csv

https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv

age_gaps.csv in R

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2023-02-14')
tuesdata <- tidytuesdayR::tt_load(2023, week = 7)

age_gaps <- tuesdata$age_gaps

# Or read in the data manually

age_gaps <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv')
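
Since the hand-in may also be done in Python, the manual read above could look roughly as follows. This is only a minimal sketch using the standard library: the function names `parse_age_gaps` and `load_age_gaps` are hypothetical, and the list of integer columns is copied from the variable table above. With pandas installed, `pandas.read_csv` on the same URL would be the more common route.

```python
import csv
import io
import urllib.request

# URL as given on the slide above
AGE_GAPS_URL = (
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/"
    "master/data/2023/2023-02-14/age_gaps.csv"
)

# Columns documented as "integer" in the variable table (assumed to hold whole numbers)
INT_COLUMNS = (
    "release_year",
    "age_difference",
    "couple_number",
    "actor_1_age",
    "actor_2_age",
)


def parse_age_gaps(csv_text):
    """Parse CSV text into a list of dicts, converting the integer columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in INT_COLUMNS:
            # Leave the value untouched if the column is absent or empty
            if row.get(col):
                row[col] = int(row[col])
        rows.append(row)
    return rows


def load_age_gaps(url=AGE_GAPS_URL):
    """Download the CSV (needs network access) and parse it."""
    with urllib.request.urlopen(url) as response:
        return parse_age_gaps(response.read().decode("utf-8"))
```

Calling `load_age_gaps()` returns one dict per couple, so `rows[0]["age_difference"]` is then an integer. All other columns, including the birthdates, stay as strings.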

Bechdel Test

“We previously provided a dataset about the Bechdel Test. It might be interesting to see whether there is any correlation between these datasets! The Bechdel Test dataset also included additional information about the films that were used in that dataset.”

raw_bechdel.csv

| variable | class | description |
| --- | --- | --- |
| year | integer | Year of release |
| id | integer | ID of film |
| imdb_id | character | IMDB ID |
| title | character | Title of film |
| rating | integer | Rating (0-3): 0 = unscored; 1 = it has to have at least two [named] women in it; 2 = who talk to each other; 3 = about something besides a man |

movies.csv

| variable | class | description |
| --- | --- | --- |
| year | double | Year |
| imdb | character | IMDB |
| title | character | Title of movie |
| test | character | Bechdel Test outcome |
| clean_test | character | Bechdel Test cleaned |
| binary | character | Binary pass/fail of bechdel |
| budget | double | Budget as of release year |
| domgross | character | Domestic gross in release year |
| intgross | character | International gross in release year |
| code | character | Code |
| budget_2013 | double | Budget normalized to 2013 |
| domgross_2013 | character | Domestic gross normalized to 2013 |
| intgross_2013 | character | International gross normalized to 2013 |
| period_code | double | Period code |
| decade_code | double | Decade Code |
| imdb_id | character | IMDB ID |
| plot | character | Plot of movie |
| rated | character | Rating of movie |
| response | character | Response? |
| language | character | Language of film |
| country | character | Country produced in |
| writer | character | Writer of film |
| metascore | double | Metascore rating (0-100) |
| imdb_rating | double | IMDB Rating 0-10 |
| director | character | Director of movie |
| released | character | Released date |
| actors | character | Actors |
| genre | character | Genre |
| awards | character | Awards |
| runtime | character | Runtime |
| type | character | Type of film |
| poster | character | Poster image |
| imdb_votes | character | IMDB Votes |
| error | character | Error? |

raw_bechdel.csv & movies.csv

https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/raw_bechdel.csv

https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv

raw_bechdel.csv & movies.csv in R

# Either ISO-8601 date or year/week works!
tuesdata <- tidytuesdayR::tt_load('2021-03-09')
tuesdata <- tidytuesdayR::tt_load(2021, week = 11)

# tt_load names each list element after its file (raw_bechdel.csv, movies.csv)
raw_bechdel <- tuesdata$raw_bechdel
movies <- tuesdata$movies

# Or read in the data manually
raw_bechdel <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/raw_bechdel.csv')
movies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv')
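
To look for a relationship between the two datasets, the films first have to be linked. One possible way, sketched here in Python with the standard library only, is to match on title and release year. The function names are hypothetical, the column names (`movie_name`, `release_year`, `age_difference`, `title`, `year`, `rating`) are taken from the tables above, and the sketch assumes that titles are spelled the same in both files, which will not always hold. Rows are expected as dicts, for example as produced by `csv.DictReader`.

```python
from collections import defaultdict
from statistics import mean


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.lower().split())


def join_by_title_year(age_gaps_rows, bechdel_rows):
    """Attach the Bechdel rating to every couple whose film matches on (title, year)."""
    ratings = {}
    for film in bechdel_rows:
        rating = str(film["rating"])
        if not rating.isdigit():
            continue  # skip missing or non-numeric ratings
        # If a (title, year) pair occurs more than once, the last row wins
        ratings[(normalize_title(film["title"]), int(film["year"]))] = int(rating)

    joined = []
    for couple in age_gaps_rows:
        key = (normalize_title(couple["movie_name"]), int(couple["release_year"]))
        if key in ratings:
            # Build a new dict so the input rows are not modified
            joined.append({**couple, "bechdel_rating": ratings[key]})
    return joined


def mean_gap_by_rating(joined_rows):
    """Average age difference per Bechdel rating, as a first descriptive summary."""
    groups = defaultdict(list)
    for row in joined_rows:
        groups[row["bechdel_rating"]].append(int(row["age_difference"]))
    return {rating: mean(gaps) for rating, gaps in sorted(groups.items())}
```

Couples whose film has no match are dropped, so it is worth comparing `len(joined)` with the number of input rows before interpreting any summary. Matching on `imdb_id` is not possible here because `age_gaps.csv` has no such column.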

🤲

Acknowledgements

https://www.youtube.com/watch?v=AdSZJzb-aX8#!

https://www.youtube.com/watch?v=Meq3CyuKOjM

https://yy.github.io/dviz-course/