Exercise 3 | Basics of Data Analysis I

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Data

How to store data?

We will talk about two broad kinds of data formats

Human readable formats

Classical: Comma-separated values (CSV) or tab-separated values (TSV)

Also: JSON, YAML and many others

“Binary” formats

Those would show up as garbage in your text editor

Maybe not so accessible, but other advantages

Human readable formats

Easy to inspect

Pretty straightforward to use

What about data types? Is a long number a float/numeric or a character string? Is a field a character string or a date object? (see the sketch after this list)

Performance? I/O is comparatively slow, because every value has to be parsed from text

Corruption? What if the field separator is contained within a field? Problems like this have led to many quoting and escaping rules and other differences between users and software packages

Convenience?
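To make the data-type pitfall concrete, here is a minimal R sketch (the file and column names are made up for illustration): left to guess, read.csv may parse a long user ID as numeric and print it in scientific notation.

# Hypothetical file users.csv with columns user_id and signup_date
df <- read.csv("users.csv",
               colClasses = c(user_id = "character",  # keep IDs as strings
                              signup_date = "Date"))  # parse dates directly
str(df)  # user_id stays character, signup_date is a Date object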

Binary formats

Can store the data type of each column: no more user IDs read in as numeric and converted to scientific notation

Can optimize for read speed

Can optimize for disk space (compression)

Depending on the format, can offer column subsetting when reading, i.e. loading only the columns you need

Advanced: with special packages you can even run a database query language (SQL) directly on some formats (see the sketch below, which covers both points)
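A minimal R sketch of both points, assuming a hypothetical file posts.parquet with columns id and created_at:

library(arrow)
# Read only two columns instead of the whole table
posts <- read_parquet("posts.parquet", col_select = c("id", "created_at"))

# Advanced: DuckDB can run SQL directly on the same file
library(duckdb)
con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT COUNT(*) AS n FROM 'posts.parquet'")
dbDisconnect(con, shutdown = TRUE)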

How to load data?

Easy answer: Depends on the storage format

For human-readable data there are a lot of different functions (for example from pandas, data.table, readr, …): read.csv, read_csv, fread, *.from_csv, …

The binary formats usually offer a library that can be loaded to provide import and export functions
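For example, the same (hypothetical) file data.tsv read three ways in R, plus a binary import via its own library:

df1 <- read.csv("data.tsv", sep = "\t")      # base R
df2 <- data.table::fread("data.tsv")         # data.table: fast, guesses the separator
df3 <- readr::read_tsv("data.tsv")           # tidyverse
df4 <- arrow::read_feather("data.feather")   # binary format via the arrow package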

How to decide between formats?

If there are no big constraints on disk space, I/O performance and similar things: (almost) everybody can work with TSVs

If you repeatedly need to read in large files, you can make your life much easier by choosing a format that optimizes I/O (like Feather from Apache Arrow, for example)

If disk space is an issue, use a format that supports good compression, like Parquet (see the sketch below)
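A minimal R sketch of writing both formats with the arrow package (toy data for illustration):

library(arrow)
df <- data.frame(id = 1:1e6, value = rnorm(1e6))  # toy data
write_feather(df, "df.feather")  # Feather: optimized for fast (re-)reading
write_parquet(df, "df.parquet")  # Parquet: optimized for compact storage
file.size("df.feather", "df.parquet")  # compare disk usage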

A note on CSVs

What if the field separator (a comma, for example) appears within a field? (in a text field, for example)

One solution: We quote the field: a,"<TEXTWITHCOMMA>",b

What about an actual quotation mark appearing in <TEXTWITHCOMMAANDQUOTATIONMARK>?

We escape (\"), double ("") or change to single quotation marks (')

-> Quoting and escaping rules with no real standard

TSVs are a bit “safer” (because tab characters are rarer within fields), but it is still better to pay attention to the possibility (see the sketch below)
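A small R sketch of how common quoting rules handle both cases (a comma inside a field, and a doubled quotation mark inside a quoted field):

csv_text <- 'id,text\n1,"Hello, world"\n2,"She said ""hi"""'
read.csv(text = csv_text)  # both fields parse correctly:
                           # 'Hello, world' and 'She said "hi"'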

How to manipulate the data?

Usually the bottleneck is RAM

In R as well as in Python, all objects are handled in memory

Some tricks can help like the one we will discuss shortly

Most important: avoid unnecessary copies!

Many functions that are not carefully implemented copy a lot (see the sketch below)
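A minimal R sketch of spotting copies with base R's tracemem and avoiding them with data.table's by-reference update:

x <- runif(1e6)
tracemem(x)  # prints a message whenever this vector is copied
y <- x       # no copy yet: both names point to the same vector
y[1] <- 0    # copy-on-write: now the full vector gets duplicated

library(data.table)
dt <- data.table(v = runif(1e6))
dt[, v := v * 2]  # updated by reference, no copy of the table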

One tip from practice

Make use of column subsetting, i.e. specify the columns you need in the loading function (many formats support this)

Especially handy when you work with text: huge data sets tend to overwhelm the memory you have on typical machines quite easily

One workaround can be to work with a (row) index instead of the full text and do all kinds of preprocessing on the (lightweight) metadata

Then write out the indices of those rows that you actually keep and use a lightweight UNIX tool like AWK to select, line by line, only those rows from the original text file (see the sketch below)
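A minimal R sketch of this workflow (file and column names are made up for illustration):

library(data.table)
# Load only the lightweight metadata columns, not the heavy text column
meta <- fread("big_corpus.tsv", select = c("id", "lang", "date"))
keep <- which(meta$lang == "en")  # row numbers of the rows to keep
writeLines(as.character(keep + 1L), "keep_idx.txt")  # +1 to account for the header line
# Then filter the original file line by line with AWK, e.g.:
#   awk 'NR==FNR { keep[$1]; next } FNR in keep' keep_idx.txt big_corpus.tsv > subset.tsv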

A not(e) on loops in R

R works best on vectors (“vectorized” functions are usually way faster)

Avoid loops wherever possible

Functions like the apply family (lapply, sapply, …) also loop internally, but avoid some performance bottlenecks

If you have to loop, pre-allocate at least the output object (for example a list of the right length) first, as in the sketch below
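A minimal R sketch of the three variants:

x <- runif(1e5)

res_vec <- sqrt(x)  # vectorized: one call, fastest

res_loop <- numeric(length(x))  # pre-allocate the output first
for (i in seq_along(x)) res_loop[i] <- sqrt(x[i])

res_apply <- vapply(x, sqrt, numeric(1))  # apply family: still iterates,
                                          # but handles allocation for you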

Use the remaining time to

Load the data set that you selected in the last exercise

Save it in different binary formats:

Apache Arrow (Feather), Parquet

Depending on your choice of programming language: pickle or saveRDS

Load each of the saved files again and take note of any differences (in terms of speed, functionality, disk space); a possible starting point is sketched below
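One possible R starting point, assuming your data set is loaded as a data frame called dat (all file names are placeholders):

library(arrow)
write_feather(dat, "dat.feather")  # Apache Arrow on-disk format (Feather)
write_parquet(dat, "dat.parquet")
saveRDS(dat, "dat.rds")            # R's native serialization

file.size("dat.feather", "dat.parquet", "dat.rds")  # disk space
system.time(read_feather("dat.feather"))            # read speed
system.time(read_parquet("dat.parquet"))
system.time(readRDS("dat.rds"))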