Exercise 4 | Basics of Data Analysis II

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Spreadsheets?

Excel?

Sources

https://www.theverge.com/2020/10/5/21502141/uk-missing-coronavirus-cases-excel-spreadsheet-error

https://www.bbc.com/news/technology-54423988

https://warwick.ac.uk/fac/soc/economics/research/ centres/cage/publications/workingpapers/ 2020/does_contact_tracing_work_quasi_ experimental_evidence_from_an_excel_error_in_england/

and many more…

Sources

https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

Ziemann, M., Eren, Y., & El-Osta, A. (2016). Gene name errors are widespread in the scientific literature. Genome Biology, 17(1), 177. https://doi.org/10.1186/s13059-016-1044-7

https://www.nature.com/articles/d41586-021-02211-4

https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646

Draw your own conclusions

In the end, you are responsible for the results that you communicate at the end of a research project, in a publication, in a report written as employee of a company, …

Spreadsheets can be a useful tool for small tasks such as for example collecting expenses or annotating small data

But even for that you may find better alternatives

Be aware of the signalling power of Excel

Some example code on different ways to store and load data

Human readable

library(data.table)
dt <- fread("filename.csv")

# typical parameters: sep="auto", quote="\"",
# nrows=Inf, header="auto", ...
import pandas as pd
df = pd.read_csv("filename.csv")

# typical parameters: delimiter=None, header='infer',
# names=_NoDefault.no_default, index_col=None, usecols=None, ...

Binary formats

library("arrow")
write_parquet(df, "df.parquet")
df <- read_parquet("df.parquet")

https://www.r-bloggers.com/2021/09/understanding-the-parquet-file-format/

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                   index=list('abc'))

table = pa.Table.from_pandas(df)
import pyarrow.parquet as pq

pq.write_table(table, 'example.parquet')

table2 = pq.read_table('example.parquet')

table2.to_pandas()

Replicating Anscombes Quartett

library(datasets)
datasets::anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Replicating Anscombes Quartett

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {
    'I': (x, y1),
    'II': (x, y2),
    'III': (x, y3),
    'IV': (x4, y4)
}