We will talk about two broad kinds of data formats
Human-readable formats
Classical: comma-separated values (CSV) or tab-separated values (TSV)
Also: JSON, YAML and many others
“Binary” formats
Those would show up as garbage in your text editor
Maybe not as accessible, but they come with other advantages
Human-readable formats are easy to inspect and pretty straightforward to use
What about data types? Is a value a long/float/numeric or a character string? A character string or a date object?
Performance? I/O speed?
Corruption? What if the field separator is contained within the field? Problems like this lead to many quoting and escaping rules and other differences between users and software packages
Convenience?
Can store the data type of each column: no more user IDs read in as numeric and converted to scientific notation
Can optimize for read speed
Can optimize for disk space (compression)
Depending on the format, can offer column subsetting when reading in (see the sketch below)
Advanced: for some formats you can even use a database query language via special packages
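As a small illustration of these advantages, a minimal pandas/Parquet sketch; the DataFrame contents and the file name are made up, and it assumes pyarrow (or fastparquet) is installed:

```python
import pandas as pd

# Hypothetical DataFrame; the user ID is deliberately a string (note the leading zeros).
df = pd.DataFrame({
    "userid": ["000123", "000124"],
    "value": [1.5, 2.0],
    "comment": ["ok", "check later"],
})

# Parquet stores the column types in the file, so the string ID
# survives the round trip unchanged.
df.to_parquet("example.parquet")

# Column subsetting at read time: only the requested columns are loaded.
subset = pd.read_parquet("example.parquet", columns=["userid", "value"])
print(subset.dtypes)   # userid comes back as a string/object column
```

For the last point, query engines such as DuckDB can, for example, run SQL directly against Parquet files.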
How do we read the data in? Easy answer: it depends on the storage format
For human-readable data there are a lot of different functions (for example from pandas, data.table, dplyr, …): read.csv, read_csv, fread, *.from_csv, …
The binary formats usually offer a library that can be loaded to provide import and export functions
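For example, on the human-readable side the data-type questions from above can be answered explicitly when reading; a minimal pandas sketch with made-up column names and values:

```python
import io
import pandas as pd

# Hypothetical CSV: an ID with leading zeros and a date column.
csv_text = "userid,signup,value\n000123,2021-05-01,1.5\n"

# Default: pandas guesses the types -- the ID loses its leading zeros (int64)
# and the date stays a plain string (object).
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed.dtypes)

# Explicit: keep the ID as a string and parse the date column.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"userid": "string"},
    parse_dates=["signup"],
)
print(explicit.dtypes)
```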
If there are no big constraints on disk space, I/O performance and similar things: (almost) everybody can work with TSVs
If you repeatedly need to read in large files, you can make your life much easier by choosing a format that optimizes I/O (like Feather from Apache Arrow, for example)
If disk space is an issue, use a format that supports good compression, like Parquet (see the sketch below)
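A minimal sketch of both recommendations, assuming pandas with pyarrow installed and a hypothetical input file mydata.csv:

```python
import pandas as pd

df = pd.read_csv("mydata.csv")   # hypothetical input file

# Feather (Apache Arrow): optimized for fast reading and writing.
df.to_feather("mydata.feather")
fast = pd.read_feather("mydata.feather")

# Parquet: columnar storage with good compression, for saving disk space.
df.to_parquet("mydata.parquet", compression="zstd")
small = pd.read_parquet("mydata.parquet")
```

On the R side, the arrow package offers analogous functions (read_feather/write_feather, read_parquet/write_parquet).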
What if the field separator (a comma, for example) appears within a field? (in a text field, for example)
One solution: we quote the field: a,"<TEXTWITHCOMMA>",b
What about an actual quotation mark appearing in <TEXTWITHCOMMAANDQUOTATIONMARK>?
We escape (\"), double ("") or change to single quotation marks (')
-> Quoting and escaping rules with no real standard (see the sketch below)
TSVs are a bit “safer” (because tab characters are rarer within fields), but it is still better to pay attention to this possibility
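A small sketch with Python's csv module illustrating the two common conventions (the example strings are made up):

```python
import csv
import io

rows = [["a", 'text, with a comma and a "quote"', "b"]]

# Default dialect: fields containing the separator are quoted,
# and an embedded quotation mark is doubled ("").
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())   # a,"text, with a comma and a ""quote""",b

# Alternative: escape the quotation mark instead of doubling it.
buf = io.StringIO()
csv.writer(buf, doublequote=False, escapechar="\\").writerows(rows)
print(buf.getvalue())   # a,"text, with a comma and a \"quote\"",b
```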
Usually the bottleneck is RAM
For R as well as Python all objects are handled in memory
Some tricks can help like the one we will discuss shortly
Most important: avoid unnecessary copies!
Many functions that are not well-implemented copy a lot
Make use of column subsetting, i.e. specify the columns you need in the loading function (many formats support this)
Especially handy when you work with text: huge data sets easily overwhelm the memory of a typical machine
One workaround is to work with a (row) index instead of the full text and do all preprocessing on the (lightweight) metadata
Then write out the index of the rows you actually keep and use a lightweight UNIX tool like AWK to select only those lines from the original text file (see the sketch below)
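A sketch of this workflow, assuming a hypothetical tab-separated file big_corpus.tsv with made-up metadata columns doc_id, lang and year next to a heavy text column:

```python
import pandas as pd

# Load only the lightweight metadata columns, not the full text.
meta = pd.read_csv(
    "big_corpus.tsv",
    sep="\t",
    usecols=["doc_id", "lang", "year"],
)

# Do the filtering / preprocessing on the metadata only.
keep = meta[(meta["lang"] == "en") & (meta["year"] >= 2020)]

# Write out the line numbers of the rows to keep
# (+2 accounts for the header line and pandas' 0-based index).
(keep.index + 2).to_series().to_csv("keep_lines.txt", index=False, header=False)

# Then select exactly those lines (plus the header) from the original file, e.g. with AWK:
#   awk 'NR==FNR { keep[$1]; next } FNR == 1 || FNR in keep' keep_lines.txt big_corpus.tsv > subset.tsv
```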
R works best on vectors (“vectorized” functions are usually way faster)
Avoid loops wherever possible
Functions like the apply family (lapply, sapply, …) also loop internally, but they avoid some performance bottlenecks
If you have to loop, at least preallocate the output object (for example a list) to its final size first
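The points above are phrased for R, but the same principle holds on the Python side; a minimal NumPy sketch contrasting a plain loop with the vectorized equivalent, and preallocating when a loop is unavoidable:

```python
import numpy as np

x = np.random.rand(1_000_000)

# Slow: an explicit Python loop that grows a list element by element.
squared_loop = []
for value in x:
    squared_loop.append(value ** 2)

# Fast: the vectorized operation works on the whole array at once.
squared_vec = x ** 2

# If a loop cannot be avoided, at least preallocate the output to its final size.
out = np.empty(len(x))
for i, value in enumerate(x):
    out[i] = value ** 2
```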
Load the data set that you selected in the last exercise
Save it in different binary formats:
Apache Arrow (Feather), Parquet
Depending on your choice of programming language: pickle or saveRDS
Check and load each of the saved files again and take note of any differences (in terms of speed, functionality, disk space)
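A possible starting point for the exercise, sketched in Python; the file names are placeholders, and R users would use arrow::write_feather/write_parquet and saveRDS/readRDS instead of pickle:

```python
import os
import time

import pandas as pd

df = pd.read_csv("my_dataset.csv")   # the data set chosen in the last exercise (name assumed)

def timed(label, func):
    """Run func once and print how long it took."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# Save in the different binary formats.
timed("write feather", lambda: df.to_feather("data.feather"))
timed("write parquet", lambda: df.to_parquet("data.parquet"))
timed("write pickle",  lambda: df.to_pickle("data.pickle"))

# Load each file again.
timed("read feather", lambda: pd.read_feather("data.feather"))
timed("read parquet", lambda: pd.read_parquet("data.parquet"))
timed("read pickle",  lambda: pd.read_pickle("data.pickle"))

# Compare disk space.
for path in ["data.feather", "data.parquet", "data.pickle"]:
    print(path, round(os.path.getsize(path) / 1e6, 2), "MB")
```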