These course units are intended as a supplement to your actual work with data
They aim to teach you some tricks that are often not taught
Don’t expect a full-fledged course that answers everything for you
That would not fit the subject matter anyway
Data science is more like dentistry than particle physics
Still, the aim is to bring everybody to the same level so that everyone can actually do visualizations (while also providing content that even the more advanced students have very likely not heard yet)
They should convey some of the (softer) skills that you need often in practice
“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).”
Wickham, 2014
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). https://doi.org/10.18637/jss.v059.i10
“each variable is a column”
“each observation is a row”
“each type of observational unit is a table” (also called data frame or data table)
“data tidying: structuring datasets to facilitate analysis”
Tidy data provides a “philosophy of data”
In general, data sets can be structured in every bizarre way imaginable
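To make the three tidy data rules concrete, here is a minimal sketch in plain Python that reshapes a "wide" table into a tidy one. The table, the column names and the helper melt are made up for illustration; in practice you would use a data frame library rather than hand-written loops.

```python
# Hypothetical "wide" table: the treatment is hidden in the column names,
# so one row holds two observations.
wide = [
    {"name": "Alice", "treatment_a": 12, "treatment_b": 7},
    {"name": "Bob", "treatment_a": 5, "treatment_b": 9},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the id column is repeated on every output row
            tidy_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return tidy_rows


# After melting: each variable is a column, each observation is a row.
tidy = melt(wide, id_key="name", var_name="treatment", value_name="result")
for record in tidy:
    print(record)
```

The tidy version has four rows (two people times two treatments), which makes filtering or grouping by treatment a plain column operation.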
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##    PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
## 0            1         0       3  ...   7.2500    NaN         S
## 1            2         1       1  ...  71.2833    C85         C
## 2            3         1       3  ...   7.9250    NaN         S
## 3            4         1       1  ...  53.1000   C123         S
## 4            5         0       3  ...   8.0500    NaN         S
##
## [5 rows x 12 columns]
## 0 22.0
## 1 38.0
## 2 26.0
## 3 35.0
## 4 35.0
## Name: Age, dtype: float64
## 0 False
## 1 True
## 2 False
## 3 False
## 4 False
## ...
## 886 False
## 887 False
## 888 False
## 889 False
## 890 False
## Name: Age, Length: 891, dtype: bool
Specialized programming languages like R (or the right packages in Python) are often well suited for your tasks
As we already learned: the bottleneck is usually RAM (because whole objects are kept in memory)
Small command line tools, on the other hand, work differently, usually line by line
This is often because those tools are ancient and date from times of severe hardware limitations
→ They offer very efficient ways to do specific, simple operations
They can come in extremely handy
Caveat: it is best to use them exactly for the task they were designed for; even small deviations toward other tasks can cause a lot of headaches
That is because these programs often lack very basic concepts that are common today
Usually, those tools work on the lines of “human-readable” files that you could open with any text editor (for example lines of text)
A line has a start and an end (usually the newline character)
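The difference between "whole object in memory" and "line by line" can be sketched in Python. This is only an illustration of the idea, not how the command line tools are implemented; the function name and the sample text are made up.

```python
import io


def count_matching_lines(stream, needle):
    """Read a text stream one line at a time and count lines containing needle."""
    count = 0
    for line in stream:  # only the current line is held in memory
        if needle in line:
            count += 1
    return count


# Stand-in for a file on disk; an object returned by open(...) is iterated
# line by line in the same way.
sample = io.StringIO("alpha\nbeta\nalpha beta\n")
print(count_matching_lines(sample, "alpha"))
```

Because the loop never keeps more than one line, memory use stays roughly constant regardless of how large the input is.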
The small programs that we will discuss now were pioneers in tackling specific tasks that come up often
That’s why their functionality has been copied by practically all later developments (sometimes even under the same name)
Knowing them gives you an idea of how to think “algorithmically” about a task, which often helps massively in finding a solution
It also helps you to phrase your question well:
https://stackoverflow.com/questions/11532157/remove-duplicate-lines-without-sorting
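The task named in the linked question, removing duplicate lines while keeping the original order, is a good example of thinking algorithmically: remember what you have already seen and emit a line only the first time it appears. A minimal Python sketch of that idea (the function name is made up; the awk idiom often given for this task, `awk '!seen[$0]++'`, follows the same logic):

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)  # remember the line so later copies are skipped
            yield line


for line in unique_lines(["b", "a", "b", "c", "a"]):
    print(line)
```

No sorting is needed, but the set of seen lines does grow with the number of distinct lines, so this is not constant-memory in the way a pure line filter is.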
## 175,0,1,"Smith, Mr. James Clinch",male,56,0,0,17764,30.6958,A7,C
## 261,0,3,"Smith, Mr. Thomas",male,,0,0,384461,7.75,,Q
## 285,0,1,"Smith, Mr. Richard William",male,,0,0,113056,26,A19,S
## 347,1,2,"Smith, Miss. Marion Elsie",female,40,0,0,31418,13,,S
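The output above looks like the result of a pattern search for "Smith" in the Titanic CSV file. The underlying idea, keep only the lines that match a pattern, can be sketched in Python as follows. The function name grep is chosen for illustration, and the header line in the sample is an assumption about the file; the two data lines are copied from the output above.

```python
import re


def grep(pattern, lines):
    """Yield the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


lines = [
    # Assumed header line of the CSV file (not shown in the output above).
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
    '175,0,1,"Smith, Mr. James Clinch",male,56,0,0,17764,30.6958,A7,C',
    '261,0,3,"Smith, Mr. Thomas",male,,0,0,384461,7.75,,Q',
]

for hit in grep(r"Smith", lines):
    print(hit)
```

Note that such a filter treats every line as plain text: it knows nothing about columns, so a pattern would also match if it occurred in a different field.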
wc: “word count”, but it also counts lines with the right option (-l):
## 892 data/titanic.csv
Extremely handy for quick sanity checks, e.g. was all of the data transferred?
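The 892 lines reported above are consistent with the 891 data rows seen earlier plus, presumably, one header line. A minimal Python sketch of the same sanity check, counting newline characters in fixed-size chunks so that the file never has to fit into memory (function name, chunk size and sample data are made up for illustration):

```python
import io


def count_newlines(binary_stream, chunk_size=65536):
    """Count newline characters by reading the stream in fixed-size chunks."""
    total = 0
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:  # an empty read means the end of the stream
            break
        total += chunk.count(b"\n")
    return total


# Stand-in for open("some_file.csv", "rb"): a header plus three data rows.
sample = io.BytesIO(b"id,name\n1,a\n2,b\n3,c\n")
n_lines = count_newlines(sample)
n_rows = n_lines - 1  # subtract the header line
print(n_lines, n_rows)
```

Comparing this number before and after a transfer is a cheap first check that no rows were lost.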
## Suse Linux
## Fedora Unix
## CentOS Solaris
## OEL HPUX
## Ubuntu AIX
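The output above appears to come from joining two files side by side, line by line. A minimal Python sketch of that operation, assuming the two input columns are the ones visible in the output (the function name paste and the tab delimiter are assumptions for illustration):

```python
from itertools import zip_longest


def paste(*columns, delimiter="\t"):
    """Join the i-th entries of all columns into one line each."""
    # zip_longest pads shorter columns with empty strings instead of
    # silently dropping the surplus lines of longer ones.
    return [
        delimiter.join(parts)
        for parts in zip_longest(*columns, fillvalue="")
    ]


distros = ["Suse", "Fedora", "CentOS", "OEL", "Ubuntu"]
systems = ["Linux", "Unix", "Solaris", "HPUX", "AIX"]

for line in paste(distros, systems):
    print(line)
```

The pairing is purely positional: line 1 goes with line 1, line 2 with line 2, and so on, so both inputs must already be in matching order.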
The shell allows you to script directly in any directory of your file system
This is often much faster (and sometimes also safer) than using a Python or R script for that
But still, many unintended things can happen, so be careful!
Basic wildcard matching is usually also possible and can come in very handy, for example to select all files with a specific naming scheme (e.g. a date) or file extension
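Wildcard matching can be tried out safely in Python before touching any real files. The sketch below applies shell-style patterns to a list of names with the standard fnmatch module; the names are a small subset of the directory listing shown in this section, and the patterns are chosen for illustration.

```python
import fnmatch

# A few file names taken from the directory listing in this section.
names = [
    "03_basics_of_data_analysis_I_lecture.html",
    "03_basics_of_data_analysis_I_lecture.Rmd",
    "data",
    "pandas.png",
    "features_pandas.png",
    "features_data.table.png",
]

# "*" matches any run of characters, so "*.png" selects by file extension.
png_files = fnmatch.filter(names, "*.png")

# A prefix pattern selects by naming scheme instead.
feature_files = fnmatch.filter(names, "features_*")

print(png_files)
print(feature_files)
```

Listing what a pattern matches before using it in a destructive command (delete, move, overwrite) is a simple way to avoid the unintended effects mentioned above.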
## 03_basics_of_data_analysis_I_lecture.html
## 03_basics_of_data_analysis_I_lecture.Rmd
## awk_dedup_cropped.png
## bash_cropped.png
## data
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png
## missing_semester_cropped.png
## missing_semester_why_cropped.png
## molten_data_cropped.png
## pandas_cropped.png
## pandas.png
## paste_cropped.png
## pipe_abstract.png
## pipe_example_cropped.png
## posit_cropped.png
## tidydata_cropped.png
## tidyverse.png
## titanic_data_docu_cropped.png
## untidy_data_cropped.png
## usage_data.table.png
## wc_cropped.png
## why_data.table.png
## wickham_bio_cropped.png
## awk_dedup_cropped.png
## bash_cropped.png
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png
## missing_semester_cropped.png
## missing_semester_why_cropped.png
## molten_data_cropped.png
## pandas_cropped.png
## pandas.png
## paste_cropped.png
## pipe_abstract.png
## pipe_example_cropped.png
## posit_cropped.png
## tidydata_cropped.png
## tidyverse.png
## titanic_data_docu_cropped.png
## untidy_data_cropped.png
## usage_data.table.png
## wc_cropped.png
## why_data.table.png
## wickham_bio_cropped.png
Pipes allow you to chain simple tools together
Those tools often have only a narrow range of applications (but usually handle them very efficiently)
Chaining them is extremely powerful, as you can build up very complex pipelines from those simple tools
Pipe characters: | (or %>% or %|% or many others)
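The idea behind a pipe, feeding the output of one small tool into the next, can be imitated in Python with generators. This is only a sketch of the concept: the helpers grep, head and pipe are made up for illustration and merely mimic what a shell pipeline does with real programs.

```python
from functools import reduce
from itertools import islice


def grep(needle):
    """Build a stage that keeps only lines containing needle."""
    return lambda lines: (line for line in lines if needle in line)


def head(n):
    """Build a stage that passes through at most the first n lines."""
    return lambda lines: islice(lines, n)


def pipe(source, *stages):
    """Feed source through each stage in turn, like tool1 | tool2 | tool3."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


names = [
    "awk_dedup_cropped.png",
    "bash_cropped.png",
    "features_data.table.png",
    "features_pandas.png",
    "grep_cropped.png",
]

# Two simple stages combined into one pipeline.
result = list(pipe(names, grep("features"), head(1)))
print(result)
```

Each stage does one narrow job and knows nothing about the others, which is what makes it possible to recombine them freely.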
## awk_dedup_cropped.png
## bash_cropped.png
## data_manipulation_cropped.png
## data.table_cropped.png
## dplyr_cropped.png
## features_data.table.png
## features_pandas.png
## grep_cropped.png
## job_control_cropped.png
## logo-stackoverflow.png
## features_data.table.png
## features_pandas.png
https://missing.csail.mit.edu/
You learn about small tools and tricks that can be enormous time savers