Exercise 1 | Introduction & Logistics

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Let’s get your basic setups running!

🛠

R

Install R

https://cran.r-project.org/

Use RStudio

https://posit.co/products/open-source/rstudio/

Install R packages

install.packages()

R packages

data.table

ggplot2

tidyverse

quanteda

...

Functions in R are usually well documented: run any function name prefixed with ? (for example ?install.packages) to get help if you are stuck.

Visualization and R

Data Analysis in R

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

Python

Install Spyder

https://www.spyder-ide.org/

Use pip as package manager

https://pip.pypa.io/en/stable/installation/

After installing pip, install packages with `pip install`

Python packages

wordshiftgraphs

matplotlib

seaborn

altair

nltk

spacy

pytorch

transformers

...

The Python Package Index (PyPI, https://pypi.org/) usually also provides links to package documentation
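To see which of the packages listed above are already available in your environment, a minimal sketch using only the standard library could look like this. The list of import names is an assumption: import names can differ from the names on the slide (PyTorch, for example, is imported as torch), and wordshiftgraphs is left out because its import name is not given here. The helper name installed is made up for illustration.

```python
from importlib.util import find_spec

# Import names for some of the packages listed above (an assumption: these
# can differ from the slide names, e.g. PyTorch is imported as "torch").
IMPORT_NAMES = [
    "matplotlib", "seaborn", "altair", "nltk", "spacy", "torch", "transformers",
]


def installed(names):
    """Map each top-level import name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


for name, ok in installed(IMPORT_NAMES).items():
    status = "installed" if ok else "missing (try pip install)"
    print(f"{name:<14}{status}")
```

Anything reported as missing can then be added with `pip install`, using the package name shown on PyPI.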

Visualization and Python

Data Analysis in Python

https://pandas.pydata.org/

R vs. Python?

In this course, you are free to use either

The popularity of one over the other is currently determined largely by disciplinary tastes and traditions (economics leans towards R, computer science towards Python), and this course has an interdisciplinary audience

They are both non-commercial and have dedicated communities

R may still have an edge in concise statistical computing and in visualization, but Python has caught up a lot

Python is a general purpose language and the de-facto standard in deep learning

To do

Catch up on using your favorite visualization package

Take special care to check out all ways to customize your plots, e.g.

  • How to change the theme of a plot

  • How to set custom axis limits

  • How to set custom axis ticks and labels

You will need those skills later in the course
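As a starting point, the three customizations above might look as follows in matplotlib. This is only a sketch under assumptions: the data are made up, the "ggplot" style sheet is just one of the themes shipped with matplotlib, and the tick labels are arbitrary. Other packages (ggplot2, seaborn, altair) offer equivalent options under different names.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Theme: pick one of the style sheets shipped with matplotlib.
plt.style.use("ggplot")

# Made-up illustration data.
x = list(range(11))
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y)

# Custom axis limits.
ax.set_xlim(0, 10)
ax.set_ylim(0, 120)

# Custom axis ticks and labels.
ax.set_xticks([0, 5, 10])
ax.set_xticklabels(["start", "middle", "end"])
ax.set_xlabel("position")
ax.set_ylabel("value")

# fig.savefig("customized_plot.png")  # uncomment to write the figure to disk
```

Running `print(plt.style.available)` lists the other themes you can try.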

To do

Also start refreshing your data wrangling skills

How to load data in

How to handle most common preprocessing steps

You obviously need those skills to be able to do data visualization
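A minimal sketch of loading and preprocessing with pandas is shown here. The CSV content, the column names and the chosen cleaning steps are all made up for illustration; which steps you need depends on your data.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file; with a file on disk you
# would pass its path to pd.read_csv instead.
raw = io.StringIO(
    "name,score,date\n"
    "alice,3.5,2024-01-01\n"
    "bob,,2024-01-02\n"
    "carol,4.0,2024-01-03\n"
)

df = pd.read_csv(raw, parse_dates=["date"])  # load data, parse the date column
df = df.dropna(subset=["score"])             # drop rows with a missing score
df["name"] = df["name"].str.title()          # normalise a text column
df = df.sort_values("score", ascending=False).reset_index(drop=True)

print(df)
```

In R, data.table covers the same steps; the vignette linked earlier walks through them.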

git

https://rogerdudler.github.io/git-guide/

If you work alone on your repository, the following commands are usually all you need

git clone https://github.com/USERNAME/REPONAME.git

git add .

git commit -m "add first files"

git push

Versioning tools are an excellent way to do backups of your code and to share it with other people systematically

If you work together with others, who also push to the same repo, you will need commands like

git pull

too, to bring your local repo up to date before pushing to the remote one.

LaTeX

To write your document in LaTeX, create a free account on overleaf.com

You can start writing from a template, for example the one provided by Overleaf for submissions to Nature Scientific Reports

Overleaf also offers a good introduction to LaTeX, if you have not used it before

Regular Expressions

If you have ever used a line like find *.txt, you have performed pattern matching

Much more complex patterns are possible with regex
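To illustrate the difference, here is a small sketch comparing a shell-style wildcard with a regular expression, using Python's standard library. The file names and the regex are made up for illustration.

```python
import fnmatch
import re

filenames = ["notes.txt", "data.csv", "report_2023.txt", "report_final.txt"]

# Shell-style wildcard: "*" matches any run of characters.
txt_files = fnmatch.filter(filenames, "*.txt")

# Regex: only "report_" followed by exactly four digits and ".txt".
pattern = re.compile(r"report_\d{4}\.txt")
yearly_reports = [f for f in filenames if pattern.fullmatch(f)]

print(txt_files)
print(yearly_reports)
```

The wildcard can only say "anything ending in .txt", while the regex can require a specific structure inside the name.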

For a quick introduction: https://www.codemag.com/article/0305041/Getting-Started-With-Regular-Expressions

Regex is also used by many other useful basic tools like awk, grep, sed

Those are standalone command line tools but they are such classics that their functionality is also mimicked in other programming languages (for example in R there is grepl, gsub, …)
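Python's built-in re module offers similar functionality. The following sketch shows rough counterparts of grep (filter lines by pattern) and gsub (replace every match). The example strings are made up.

```python
import re

lines = ["error: disk full", "ok: backup done", "error: timeout"]

# grep-like: keep only the lines that match a pattern.
errors = [line for line in lines if re.search(r"^error", line)]

# gsub-like: replace every match in a string.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")

print(errors)
print(cleaned)
```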

Regular Expressions

You will stumble over regex in many contexts (for example in this course)

There is an enormous number of small variations (“flavours”) among them; for a comparison see for example: https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816

This (and other factors) can often make writing regex a frustrating experience

Tools like ChatGPT can help by fixing dysfunctional regex or by explaining them to you!

Learning by doing

https://norvig.com/21-days.html

Questions?