Lecture 3 | Basics of Data Analysis I

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Aim

These course units are intended as a supplement to your actual work with data

They aim to teach you some tricks that are often not taught

🔨🧰🪛

Some Caveats

Don’t expect a full-fledged course that answers it all for you

That also doesn’t fit the subject matter

Data science is more like dentistry than particle physics

But the aim is to bring everybody to the same level, so that you can actually do visualizations (while also providing content that even the more advanced students very likely haven’t heard yet)

The units should also convey some of the (softer) skills that you often actually need

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).”

Wickham, 2014

Tidy Data

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). https://doi.org/10.18637/jss.v059.i10

What makes a data set tidy?

“each variable is a column”

“each observation is a row”

“each type of observational unit is a table” (also called data frame or data table)

“data tidying: structuring datasets to facilitate analysis”

It provides a “philosophy of data”

What makes a data set untidy?

Generally, data sets can be structured in every bizarre way imaginable

Wide vs. long formats

Creating and using tidy data is also in the interest of reproducibility and open science (think of git too!)
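To make "wide vs. long" concrete, here is a minimal pandas sketch. The table and its column names (`patient`, `treatment_a`, `treatment_b`) are invented for illustration; only `melt()` and `pivot()` are real pandas methods.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per treatment.
wide = pd.DataFrame({
    "patient": ["A", "B", "C"],
    "treatment_a": [5.0, 7.0, 6.0],
    "treatment_b": [4.0, 8.0, 9.0],
})

# melt() turns the treatment columns into two variables:
# "treatment" (the former column name) and "result" (the former cell value).
long = wide.melt(id_vars="patient", var_name="treatment", value_name="result")

# pivot() goes back from long to wide.
wide_again = long.pivot(index="patient", columns="treatment", values="result")

print(long)
```

In the long version each row is one observation (one patient under one treatment), which is the tidy layout described above.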

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

library(data.table)
DT = as.data.table(iris)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# FROM[WHERE, SELECT, GROUP BY]
# DT  [i,     j,      by]

DT[Petal.Width > 1.0, mean(Petal.Length), by = Species]

##       Species       V1
## 1: versicolor 4.362791
## 2:  virginica 5.552000
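For comparison, the same "filter, group, aggregate" pattern can be sketched in pandas. The rows below are toy values with iris-style column names, not the real iris data, so the numbers differ from the output above.

```python
import pandas as pd

# Toy rows with iris-style column names; the values are made up for illustration.
df = pd.DataFrame({
    "Species": ["setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "Petal.Length": [1.4, 4.0, 5.0, 5.5, 6.5],
    "Petal.Width": [0.2, 1.3, 1.5, 2.0, 2.2],
})

result = (
    df[df["Petal.Width"] > 1.0]          # i:  filter rows   (WHERE)
    .groupby("Species")["Petal.Length"]  # by: group         (GROUP BY)
    .mean()                              # j:  compute       (SELECT)
)
print(result)
```

As in the data.table call, setosa drops out because none of its rows pass the filter.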

https://pandas.pydata.org/

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html#min-tut-03-subset

import pandas as pd

titanic = pd.read_csv("data/titanic.csv")

titanic.head()

##    PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
## 0            1         0       3  ...   7.2500   NaN         S
## 1            2         1       1  ...  71.2833   C85         C
## 2            3         1       3  ...   7.9250   NaN         S
## 3            4         1       1  ...  53.1000  C123         S
## 4            5         0       3  ...   8.0500   NaN         S
## 
## [5 rows x 12 columns]

ages = titanic["Age"]
ages.head()

## 0    22.0
## 1    38.0
## 2    26.0
## 3    35.0
## 4    35.0
## Name: Age, dtype: float64

above_35 = titanic[titanic["Age"] > 35]
above_35.head()

##     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
## 1             2         1       1  ...  71.2833   C85         C
## 6             7         0       1  ...  51.8625   E46         S
## 11           12         1       1  ...  26.5500  C103         S
## 13           14         0       3  ...  31.2750   NaN         S
## 15           16         1       2  ...  16.0000   NaN         S
## 
## [5 rows x 12 columns]

titanic["Age"] > 35

## 0      False
## 1       True
## 2      False
## 3      False
## 4      False
##        ...  
## 886    False
## 887    False
## 888    False
## 889    False
## 890    False
## Name: Age, Length: 891, dtype: bool
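The boolean Series above is just a mask, and masks can be combined. A small sketch, assuming a toy table whose column names follow the Titanic data but whose values are invented:

```python
import pandas as pd

# Toy passengers; column names follow the Titanic data, values are made up.
passengers = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [22.0, 38.0, None, 54.0],
    "Pclass": [3, 1, 3, 1],
})

older = passengers["Age"] > 35          # a missing age compares as False
first_class = passengers["Pclass"] == 1

# Combine masks with & (and), | (or), ~ (not). Conditions written inline
# need parentheses because of operator precedence.
older_first_class = passengers[older & first_class]
age_known = passengers[passengers["Age"].notna()]

print(older.sum())  # True counts as 1, so this counts the matching rows
```

Note that rows with a missing age silently fail the `> 35` test, which is worth keeping in mind when counting.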

“Two sides to data analysis”

Specialized programming languages like R (or the right packages in Python) are often well suited for your tasks

As we already learned: the bottleneck is usually RAM (because whole objects are kept in memory)

Small command line tools, on the other hand, work differently, usually line by line

This is often because those tools are very old and date from times of severe hardware limitations

–> very efficient ways to do specific, simple operations
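The line-by-line idea can be imitated in Python. The sketch below averages one column while holding only a single line in memory; the function name `running_mean` is hypothetical, and it assumes a simple CSV without quoted commas.

```python
import io

def running_mean(lines, column):
    """Average one numeric CSV column, reading one line at a time.

    `lines` must be an iterator over lines, such as an open file.
    """
    header = next(lines).rstrip("\n").split(",")
    idx = header.index(column)
    total, count = 0.0, 0
    for line in lines:                       # only one line in memory at a time
        value = line.rstrip("\n").split(",")[idx]
        if value:                            # skip empty (missing) fields
            total += float(value)
            count += 1
    return total / count if count else float("nan")

# Stand-in for a large file on disk; an open file object behaves the same way.
stream = io.StringIO("id,fare\n1,7.25\n2,\n3,8.05\n")
print(running_mean(stream, "fare"))
```

Memory use stays constant no matter how long the file is, which is the main difference from reading everything into a data frame first.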

GNU toolchain

Can come in extremely handy

Caveat: it is best to use them for exactly the task they were designed for; even small deviations can cause a lot of headaches

That is because these programs often lack very basic concepts that are common today

Usually, those tools work on the lines of “human-readable” files that you could open with any text editor (for example plain text files)

A line has a start and an end (usually the newline character)

The small programs that we will discuss now were pioneers in tackling specific tasks that come up often

That’s why their functionality has been copied by practically all later tools (sometimes even under the same name)

They give you an idea of how to think “algorithmically” about a task, which often helps massively in finding a solution

They also help you to ask the question in the right way:

AWK

https://stackoverflow.com/questions/11532157/remove-duplicate-lines-without-sorting
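A commonly cited AWK one-liner for this kind of task is `awk '!seen[$0]++'`: print a line only the first time it is seen. The same "algorithmic" idea as a Python sketch (the function name `unique_lines` is hypothetical):

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

print(list(unique_lines(["b", "a", "b", "c", "a"])))
```

Like the AWK version, this needs memory for every distinct line, but no sorting, and it works on a stream.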

grep

grep 'Smith' data/titanic.csv

## 175,0,1,"Smith, Mr. James Clinch",male,56,0,0,17764,30.6958,A7,C
## 261,0,3,"Smith, Mr. Thomas",male,,0,0,384461,7.75,,Q
## 285,0,1,"Smith, Mr. Richard William",male,,0,0,113056,26,A19,S
## 347,1,2,"Smith, Miss. Marion Elsie",female,40,0,0,31418,13,,S
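grep matches anywhere in the line and knows nothing about columns or quoting. The sketch below contrasts that with a column-aware search using Python's `csv` module. The first data row is taken from the output above (shortened to three columns); passenger 999 is invented to show the difference.

```python
import csv
import io

text = (
    'PassengerId,Name,Ticket\n'
    '261,"Smith, Mr. Thomas",384461\n'
    '999,"Doe, Mr. John",Smith & Co\n'      # invented row, for illustration only
)

# Line-based, like grep: any line containing the substring matches.
line_hits = [line for line in text.splitlines() if "Smith" in line]

# Column-aware: the csv module respects the quotes, so the comma inside
# the Name field does not split it, and only the Name column is searched.
rows = csv.DictReader(io.StringIO(text))
name_hits = [row["PassengerId"] for row in rows if "Smith" in row["Name"]]

print(len(line_hits), name_hits)
```

This is one of the "basic concepts" that line-oriented tools lack: for a quick look grep is perfect, for anything that depends on columns a CSV-aware tool is safer.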

wc

“word count”, but also counts lines with the right option:

wc -l data/titanic.csv

## 892 data/titanic.csv

Extremely handy for quick sanity checks, e.g. was all of the data transferred?
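The 892 lines are consistent with the 891 rows that pandas reported earlier, plus one header line. A minimal sketch of the same check in Python, using a tiny in-memory stand-in instead of the real file (`count_lines` is a hypothetical helper):

```python
import io

def count_lines(stream):
    """Count lines without loading the whole file, similar in spirit to wc -l."""
    return sum(1 for _ in stream)

# Stand-in for an open CSV file: one header line plus three data rows.
stream = io.StringIO("PassengerId,Survived\n1,0\n2,1\n3,1\n")
n_lines = count_lines(stream)
n_rows = n_lines - 1          # subtract the header line
print(n_lines, n_rows)
```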

paste

cat data/file1.txt

## Suse
## Fedora
## CentOS
## OEL
## Ubuntu

cat data/file2.txt

## Linux
## Unix
## Solaris
## HPUX
## AIX

paste data/file1.txt data/file2.txt

## Suse	Linux
## Fedora	Unix
## CentOS	Solaris
## OEL	HPUX
## Ubuntu	AIX

paste -d"," data/file1.txt data/file2.txt

## Suse,Linux
## Fedora,Unix
## CentOS,Solaris
## OEL,HPUX
## Ubuntu,AIX
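What paste does can be sketched in a few lines of Python. The function below is a hypothetical stand-in that covers only the two-file case shown above.

```python
def paste(left, right, delimiter="\t"):
    """Join two sequences of lines pairwise, like paste with two files."""
    return [f"{a}{delimiter}{b}" for a, b in zip(left, right)]

file1 = ["Suse", "Fedora", "CentOS", "OEL", "Ubuntu"]
file2 = ["Linux", "Unix", "Solaris", "HPUX", "AIX"]

for line in paste(file1, file2, delimiter=","):
    print(line)
```

One difference to be aware of: `zip` stops at the shorter input, whereas paste keeps going and leaves the missing field empty. `itertools.zip_longest(left, right, fillvalue="")` would be closer to that behavior.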

Learn how to use the terminal!

Looping over files

Allows you to directly script in any directory of your file system

This is often much faster (and sometimes also safer) than using a Python or R script for that

But still, many unintended things can happen, so be careful!

Basic wildcard matching is usually also possible and can come in very handy, for example to select all files with a specific naming scheme (e.g. date) or file ending

ls

## 03_basics_of_data_analysis_I_lecture.html
## 03_basics_of_data_analysis_I_lecture.Rmd
## awk_dedup_cropped.png
## bash_cropped.png
## data
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png
## missing_semester_cropped.png
## missing_semester_why_cropped.png
## molten_data_cropped.png
## pandas_cropped.png
## pandas.png
## paste_cropped.png
## pipe_abstract.png
## pipe_example_cropped.png
## posit_cropped.png
## tidydata_cropped.png
## tidyverse.png
## titanic_data_docu_cropped.png
## untidy_data_cropped.png
## usage_data.table.png
## wc_cropped.png
## why_data.table.png
## wickham_bio_cropped.png

for i in *.png; do echo "$i"; done

## awk_dedup_cropped.png
## bash_cropped.png
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png
## missing_semester_cropped.png
## missing_semester_why_cropped.png
## molten_data_cropped.png
## pandas_cropped.png
## pandas.png
## paste_cropped.png
## pipe_abstract.png
## pipe_example_cropped.png
## posit_cropped.png
## tidydata_cropped.png
## tidyverse.png
## titanic_data_docu_cropped.png
## untidy_data_cropped.png
## usage_data.table.png
## wc_cropped.png
## why_data.table.png
## wickham_bio_cropped.png
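If you do need the same loop inside a script, wildcard matching is also available in Python through `pathlib`. The sketch below works in a throwaway directory with invented file names, so it touches no real files.

```python
import tempfile
from pathlib import Path

# Work in a temporary directory so the example touches no real files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["pandas.png", "tidyverse.png", "notes.txt"]:
        (folder / name).touch()

    # Same idea as the shell loop over *.png
    png_names = sorted(path.name for path in folder.glob("*.png"))

print(png_names)
```

`glob()` makes no promise about the order of its results, hence the `sorted()`; the shell expands wildcards in sorted order on its own.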

Chaining (or piping)

Allows you to chain simple tools together

Those tools often cover only a very narrow set of tasks (but usually handle them very efficiently)

Chaining them is extremely powerful as you can build up very complex pipelines from those simple tools

Pipe characters: | (or %>% or %|% or many others)

ls | grep png | head -10

## awk_dedup_cropped.png
## bash_cropped.png
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png

ls | grep png | grep features

## features_data.table.png
## features_pandas.png
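The same chaining idea can be imitated with Python generators: each stage consumes items from the previous one, one at a time. `grep` and `head` below are hypothetical stand-ins that do plain substring matching rather than regular expressions, and the file names are a small subset of the listing above.

```python
from itertools import islice

def grep(pattern, lines):
    """Pass through only the lines that contain pattern (substring match)."""
    return (line for line in lines if pattern in line)

def head(lines, n=10):
    """Pass through at most the first n lines."""
    return islice(lines, n)

names = ["data", "features_data.table.png", "features_pandas.png", "pandas.png"]

# Equivalent in spirit to: ls | grep png | grep features
result = list(grep("features", grep("png", names)))
print(result)
```

Nothing is computed until the final `list()` pulls items through the chain, which mirrors how a shell pipe streams lines between programs.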

Take a look at ./missing-semester

https://missing.csail.mit.edu/

You learn about small tools and tricks that can be enormous time savers

Especially important, learning about command line interfaces and job control: