Lecture 2 | Motivation

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

Data Visualization?

Why do you need that?

Visual communication is important in all areas: industry, science, …

But also for yourself, to learn and to remember

In the context of science especially: to get new insights

Especially important for very large data sets

But keep in mind: pure eyeballing can also mislead (connected to the problem of induction, “reasoning after the facts”)

A tale of epidemics…

The Cholera Epidemic in London, 1854

Cholera broke out in the Broad Street area of central London on the evening of August 31, 1854

Causes and possible interventions were unclear

“Miasma theory” (“pollution”) held that “bad air” or “night air” was responsible

Miasma could emanate from rotting organic matter, for example from the burial grounds of plague victims from two centuries earlier

John Snow, who had investigated earlier epidemics, had a different theory of the causes

He wasn’t successful in verifying his suspicions directly

So he tried an indirect strategy to find the causes: Data Visualization

He obtained a list of 83 deaths from cholera (including the addresses of the victims)

And plotted them on a map of the part of London that was affected by cholera

Intervention

The deaths on the map clustered around the water pump on Broad Street, so John Snow had the handle of that pump removed

Causes

The epidemic soon ended

Revolutionized our understanding of transmission processes: germ theory of disease

In 1886: discovery of the bacterium Vibrio cholerae

What actually made the water impure and dangerous?

Industrial Revolution: rapid urbanization but no infrastructure

“Nightmen”

https://www.gutenberg.org/files/60440/60440-h/60440-h.htm

“Leaching cesspools” (Illustration)

What makes this investigation so strong?

Providing context, with the right graphic display

From a one-dimensional temporal ordering into a two-dimensional spatial comparison

Quantitative comparisons: Why did no workers at the brewery so close to the pump die?

They were allowed a daily quantity of beer. The owner of the brewery believed “they do not drink water at all”

Considering alternative explanations and contrary cases

Seemingly unconnected cases of cholera in other areas reveal connections: a cabinet-maker worked near the pump, a girl went to school close by

Assessment of possible errors in the numbers reported in graphics

“An area of the map may be free of cases merely because it is not populated”; here, however, the whole area was very densely populated

A Note

Evidence for the effect of the intervention is actually not that clear-cut: daily deaths were already declining when the pump handle was removed

You could also aggregate the data differently, to artificially boost the story:

(Tufte calls this “chartjunk”, as we will see later in the course)
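To make the aggregation point concrete, here is a minimal Python sketch. The daily counts and the intervention day are invented for illustration and are not Snow's data. It prints the same series twice: once day by day, and once collapsed into just two bins split at the intervention.

```python
# Hypothetical daily death counts, invented for illustration (NOT Snow's data):
# a sharp peak on day 2 followed by a steady decline.
daily = [5, 40, 70, 60, 45, 35, 28, 20, 14, 12, 9, 7, 5, 4, 3, 2, 2, 1, 1, 0, 0]
intervention_day = 8  # hypothetical index of the day of the intervention


def text_bars(labels, values, scale=10):
    """Return a crude text bar chart, one '#' per `scale` units (rounded up)."""
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * -(-value // scale)  # ceiling division
        lines.append(f"{label:>8} | {bar} {value}")
    return "\n".join(lines)


# View 1: the full daily series. The decline starts well before the intervention.
peak_day = daily.index(max(daily))
print(text_bars([f"day {i}" for i in range(len(daily))], daily))
print()

# View 2: the same data collapsed into two bins, split exactly at the intervention.
before = sum(daily[:intervention_day])
after = sum(daily[intervention_day:])
print(text_bars(["before", "after"], [before, after]))
```

Both views use identical numbers. The two-bin view suggests a dramatic drop caused by the intervention, while the daily view shows that the decline began several days earlier.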

Enriching visual displays

From a visualization point of view, John Snow actually used a very simple mechanism

Marking deaths on a map

Going beyond that, graphics can really excel at condensing and bringing much disparate information together to make it comparable
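The "marking deaths on a map" idea can be sketched in a few lines of Python. All coordinates and pump names below are hypothetical placeholders on an arbitrary grid, not Snow's data. The sketch draws a crude text map and, as one possible way of making the map comparable, counts the deaths closest to each pump by straight-line distance (real walking distances along streets would differ).

```python
import math
from collections import Counter

# Hypothetical coordinates on an arbitrary grid (NOT Snow's data).
pumps = {"Pump A": (5.0, 5.0), "Pump B": (12.0, 3.0), "Pump C": (2.0, 11.0)}
deaths = [(4.5, 5.5), (5.2, 4.1), (6.0, 6.3), (4.0, 4.8), (5.5, 5.0),
          (11.0, 3.5), (3.0, 10.0), (6.5, 4.0)]


def nearest_pump(point, pumps):
    """Name of the pump with the smallest straight-line distance to `point`."""
    return min(pumps, key=lambda name: math.dist(point, pumps[name]))


def render(pumps, deaths, width=14, height=13):
    """Crude text map: 'o' marks a death, 'P' a pump, '.' an empty cell."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for x, y in deaths:
        grid[int(y)][int(x)] = "o"
    for x, y in pumps.values():
        grid[int(y)][int(x)] = "P"
    # Reverse the rows so that y increases upwards, as on a map.
    return "\n".join("".join(row) for row in reversed(grid))


counts = Counter(nearest_pump(d, pumps) for d in deaths)
print(render(pumps, deaths))
for name, n in counts.most_common():
    print(f"{name}: {n} deaths nearest")
```

Even with this very simple mechanism, a cluster around one pump stands out immediately, both in the picture and in the counts.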

What about today’s “large-scale” data?

Often even more powerful in uncovering hidden phenomena!

The Follower Factory

A very good example of data journalism

Put the spotlight on identity theft and fake accounts in social media

https://www.economist.com/leaders/2010/02/25/the-data-deluge

Paradoxically, visualization techniques can sometimes profit from “too much data”

It sometimes allows you to see through the deluge

But be careful: visualizations of big data can mislead just as much as visualizations of small data

We will see examples of that in the course

https://www.reddit.com/r/datascience/comments/16dk5b6/r_vs_python_detailed_examples_from_proficient/

Acknowledgements

https://www.gutenberg.org/files/60440/60440-h/60440-h.htm#i_image23

https://www.reddit.com/user/Useful-Possibility80/