Lecture 5 | Types of Data Visualization

Max Pellert

IS 616: Large Scale Data Analysis and Visualization

The beginnings

“The earliest seeds of visualization arose in geometric diagrams, in tables of the positions of stars and other celestial bodies, and in the making of maps to aid in navigation and exploration.”

“The idea of coordinates was used by ancient Egyptian surveyors in laying out towns, earthly and heavenly positions were located by something akin to latitude and longitude by at least 200 B.C.,

and the map projection of a spherical earth into latitude and longitude by Claudius Ptolemy [c. 85–c. 165] in Alexandria would serve as reference standards until the 14th century.”

Friendly, M. (2008). A Brief History of Data Visualization. In C. Chen, W. Härdle, & A. Unwin, Handbook of Data Visualization (pp. 15–56). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_2

Heer, J., Bostock, M., & Ogievetsky, V. (2010). A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure. Queue, 8(5), 20–30. https://doi.org/10.1145/1794514.1805128

“Creating a visualization requires a number of nuanced judgments.”

“One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color.”

“The challenge is that for any given data set the number of visual encodings—and thus the space of possible visualization designs—is extremely large.”

1D data

If you have, for example, 10 data points, what would be the most direct way to visualize them?*

1D Scatterplot or “strip chart”

Problems?

1D “Jittered” Scatterplot

Using transparency (“alpha”)

Using empty symbols such as rings
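A minimal sketch of the jittering idea, assuming a small made-up data set with ties. The function names, the jitter width and the fixed seed are illustrative choices, not a standard API; the optional drawing helper assumes matplotlib is installed and combines jitter with transparency and ring-shaped markers.

```python
import random

def jitter_positions(values, width=0.2, seed=0):
    """Return (x, y) pairs for a 1D strip chart with vertical jitter.

    x keeps the data value; y is random noise in [-width, width] that
    only separates overlapping points and carries no meaning.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(v, rng.uniform(-width, width)) for v in values]

# Hypothetical data with ties that would overlap in a plain strip chart
data = [1.0, 1.0, 1.0, 2.5, 2.5, 4.0, 4.1, 4.1, 7.0, 9.5]
points = jitter_positions(data)

def plot_strip(points, alpha=0.5, hollow=True):
    """Draw the jittered points with matplotlib (assumed to be installed)."""
    import matplotlib.pyplot as plt
    xs, ys = zip(*points)
    fig, ax = plt.subplots(figsize=(6, 1.5))
    ax.scatter(xs, ys, alpha=alpha,
               facecolors="none" if hollow else None,  # rings if hollow
               edgecolors="tab:blue")
    ax.set_ylim(-1, 1)
    ax.set_yticks([])  # the y-axis is meaningless here
    return fig
```

The jitter only changes where a point is drawn, never its value, which is why the y-axis ticks are hidden.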

https://r-charts.com/distribution/beeswarm/

https://en.wikipedia.org/wiki/Rug_plot

With a lot of data, we may need to aggregate or summarize so as not to be overwhelmed by the mass of single data points

Histograms

show the prevalence of values grouped into bins

Histograms can mislead
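A small sketch of why the bin choice matters, using a made-up sample with two clusters. The helper `histogram_counts` is a hypothetical name and assumes all values lie within [lo, hi].

```python
def histogram_counts(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins covering [lo, hi].

    Assumes every value lies within [lo, hi].
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so that v == hi is counted
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample with two clusters (around 2 and around 8)
sample = [1.8, 1.9, 2.0, 2.0, 2.1, 2.2, 7.8, 7.9, 8.0, 8.0, 8.1, 8.2]

for n_bins in (1, 2, 5):
    print(n_bins, histogram_counts(sample, n_bins, lo=0, hi=10))
# 1 [12]
# 2 [6, 6]
# 5 [2, 4, 0, 2, 4]
```

With one bin the structure disappears, with two bins the clusters show up, and with five bins the bin edges at 2 and 8 cut each cluster in two and suggest four groups.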

https://twitter.com/NicholasStrayer/status/1026893778404225024

http://nickstrayer.me/histogram_bins/

https://en.wikipedia.org/wiki/Histogram

Boxplots

This refers to the box-and-whisker plot, which conveys statistical features such as the median, the quartile boundaries, and extreme outliers (some variants also mark the mean).
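A sketch of the numbers behind a box-and-whisker plot, assuming the common convention that whiskers reach to the most extreme points within 1.5 times the interquartile range. Other conventions and quartile definitions exist; the data and the function name are made up for illustration.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Compute median, quartiles, whisker ends and outliers.

    Assumes whiskers extend to the most extreme data points that lie
    within `whisker` times the interquartile range from the box.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "median": median, "q1": q1, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
    }

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(data)
print(stats)
```

For this sample the box spans 3.25 to 7.75 with the median at 5.5, and the value 30 is flagged as an outlier.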

Wickham, H., & Stryjewski, L. (2012). 40 years of boxplots. had.co.nz.

Histogram vs. Boxplot

What are their strengths and weaknesses?

As a summarization method, a boxplot may be useful if you want to compare multiple (well-behaved) distributions. Boxplots will immediately and precisely show the median, the quartiles, and the rough range of the distribution.

On the other hand, a boxplot may hide details in the distribution, particularly when the distribution is far from a normal distribution.

A histogram, as we have seen, is sensitive to parameter choices such as the binning

Bar charts

Similar to histograms, but the height of the bars need not be a count (or frequency), and the data can have “natural” categories rather than artificial bins

Rather, any (numeric) variable can be displayed

https://en.wikipedia.org/wiki/Bar_chart

Pie charts

Similar to bar charts, but the values are shown as circle segments rather than bars

They have a very bad reputation; it is better to avoid them so as not to provoke readers (Few, S. (2007). Save the Pies for Dessert.)

There are also other good reasons not to use them:

2D Scatter plots

So far, we have mostly been concerned with the x-axis

For example, the x-axis was set by the histogram bins or, more generally, by groups or categories in the bar chart

If we relax this to plot arbitrary (numeric) variables on the x- and y-axes, we get 2D scatter plots

“Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA.

This chart suggests there are generally two types of eruptions: short-wait-short-duration, and long-wait-long-duration.”

https://en.wikipedia.org/wiki/Scatter_plot

Very common, and often the first thing you plot, usually with points

That makes a lot of sense, as it is often an “honest” strategy that reveals a lot

It can prevent you from overlooking things, which may be embarrassing later (see Anscombe’s quartet)

With a lot of data points, the same problems arise as in the 1D scatter plot, and similar strategies help, for example alpha or rings
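Anscombe's quartet can be checked numerically. The sketch below uses the four published data sets and shows that their means and correlations are nearly identical, although their scatter plots look completely different. The helper `pearson` is written out by hand here to keep the example free of extra dependencies.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x, mean of y and correlation for each of the four data sets
summaries = {
    name: (statistics.fmean(xs), statistics.fmean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}
for name, (mean_x, mean_y, r) in summaries.items():
    print(f"{name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```

All four rows print almost the same summary, which is the point: only plotting the points reveals the curve, the outlier and the single high-leverage point.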

The “visualization zoo”

Time series

“Values changing over time”

Like a scatter plot, but the x-axis is a time dimension now

Often, lines are plotted instead of (or in addition to) points

Raw values are often less important than relative changes

Multiple lines can often only be compared meaningfully when they are normalized in some way

Multiple stocks may have totally different baseline prices for example

Index chart
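A minimal sketch of the normalization behind an index chart: every series is expressed as percentage change relative to its value at a chosen reference point. The prices and the function name `to_index` are made up for illustration.

```python
def to_index(series, base=0):
    """Rescale a series to percentage change relative to series[base]."""
    ref = series[base]
    return [100.0 * (v - ref) / ref for v in series]

# Hypothetical prices for two stocks with very different baselines
cheap = [10.0, 11.0, 12.0, 9.0]
expensive = [500.0, 525.0, 600.0, 450.0]

print(to_index(cheap))      # [0.0, 10.0, 20.0, -10.0]
print(to_index(expensive))  # [0.0, 5.0, 20.0, -10.0]
```

On the raw scale the cheap stock would look flat next to the expensive one; after indexing, both lines start at 0 and their relative changes can be compared directly.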

Stacked graphs

Stacked Graph of Unemployed U.S. Workers by Industry, 2000-2010

By stacking area charts on top of each other, we arrive at a visual summation of time-series values

Also called “stream graph”

Some limitations:

A stacked graph does not support negative numbers and is meaningless for data that should not be summed (temperatures, for example)
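The stacking itself is just a running sum over the layers. A sketch under that assumption, with a hypothetical helper that also refuses negative values, since they would make layers overlap:

```python
def stack_layers(layers):
    """Return (lower, upper) boundaries per layer for a stacked area chart.

    layers: list of equally long series. Layer i sits on top of the sum
    of layers 0..i-1, so the top boundary of the last layer is the total.
    """
    baseline = [0.0] * len(layers[0])
    bands = []
    for series in layers:
        if any(v < 0 for v in series):
            raise ValueError("stacking is undefined for negative values")
        top = [b + v for b, v in zip(baseline, series)]
        bands.append((baseline, top))
        baseline = top  # the next layer starts where this one ends
    return bands

# Hypothetical counts for two industries over three time steps
bands = stack_layers([[1, 2, 3], [4, 0, 1]])
print(bands[-1][1])  # upper boundary of the top layer = total per time step
```

Only the bottom layer has a flat baseline, so the shape of every other layer is distorted by the layers below it.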

Small multiples instead

Horizon graph

We start with a standard area chart, with positive values colored blue and negative values colored red

“The horizon graph is a technique for increasing the data density of a time-series view while preserving resolution.”

We divide the graph into horizontal bands and layer them to create a nested form.

The result is a chart that preserves data resolution but uses only a quarter of the space.
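A sketch of the banding step, assuming two bands of equal height and mirrored negative values. Mirroring halves the height and layering two bands halves it again, which is where the quarter comes from. The function name and the band height are illustrative assumptions.

```python
def horizon_bands(values, band_height, n_bands=2):
    """Split each value into layered bands for a horizon graph.

    Returns (bands, signs). Entry t of bands[i] is the height, between 0
    and band_height, that the value at time t fills within band i.
    Negative values are mirrored; signs tells the drawing code whether to
    use the positive (blue) or negative (red) color. Values beyond
    n_bands * band_height are clipped.
    """
    signs = [1 if v >= 0 else -1 for v in values]
    bands = []
    for i in range(n_bands):
        floor = i * band_height
        bands.append([min(max(abs(v) - floor, 0.0), band_height) for v in values])
    return bands, signs

# Hypothetical series with positive and negative values
bands, signs = horizon_bands([0.5, 1.5, -2.0, 3.0], band_height=1.0)
print(bands)  # [[0.5, 1.0, 1.0, 1.0], [0.0, 0.5, 1.0, 1.0]]
print(signs)  # [1, 1, -1, 1]
```

When drawing, all bands share the same strip of height band_height, with higher bands in a darker shade on top of lower ones.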

Statistical Distributions

Often, we want to do exploratory data analysis:

To gain insight into how data is distributed to inform data transformation and modeling decisions

We already covered the histogram and the boxplot, but there are many more techniques

Stem-and-leaf plots

Stem-and-Leaf Plot of Mechanical Turk Participation Rates

It typically bins numbers according to the first significant digit and then stacks the values within each bin by the second significant digit.

This minimalistic representation uses the data itself to paint a frequency distribution, replacing the “information-empty” bars of a traditional histogram bar chart and allowing one to assess both the overall distribution and the contents of each bin.
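A text-only sketch of a stem-and-leaf plot, assuming non-negative integers where the tens digit is the stem and the last digit is the leaf. The data values are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Format non-negative integers as a stem-and-leaf plot.

    The stem is value // 10 and the leaf is the last digit; empty stems
    are kept so that gaps in the distribution stay visible.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)

# Hypothetical rates in percent
print(stem_and_leaf([12, 15, 21, 21, 27, 40, 43]))
#  1 | 25
#  2 | 117
#  3 |
#  4 | 03
```

The length of each row plays the role of a histogram bar, while the digits still show the individual values.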

Q-Q plots

The Q-Q plot compares two probability distributions by graphing their quantiles

If the two are similar, the plotted values will lie roughly along the central diagonal
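A sketch of how the points of a Q-Q plot can be computed with the standard library. The number of quantiles and the function names are illustrative choices; the second helper assumes the comparison distribution is a normal distribution fitted to the sample.

```python
import statistics

def qq_points(sample_a, sample_b, n=10):
    """Pair up the n - 1 cut points (quantiles) of two samples."""
    qa = statistics.quantiles(sample_a, n=n)
    qb = statistics.quantiles(sample_b, n=n)
    return list(zip(qa, qb))

def qq_against_normal(sample, n=10):
    """Pair quantiles of a fitted normal distribution with sample quantiles."""
    fitted = statistics.NormalDist.from_samples(sample)
    return list(zip(fitted.quantiles(n=n), statistics.quantiles(sample, n=n)))

# Hypothetical samples: b is a shifted by 5
a = list(range(1, 21))
b = [v + 5 for v in a]

same = qq_points(a, a)     # identical distributions: points on the diagonal
shifted = qq_points(a, b)  # same shape, shifted: points parallel to the diagonal
normal = qq_against_normal(a)
```

Plotting each pair as an (x, y) point gives the Q-Q plot; systematic bends away from a straight line indicate differences in shape, not just in location or scale.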

SPLOM (scatter plot matrix)

Scatter Plot Matrix of Automobile Data

Small multiples of scatter plots showing a set of pairwise relations among variables

A SPLOM enables visual inspection of correlations between any pair of variables.
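A sketch of the grid layout behind a SPLOM: every cell of a k by k grid shows one pair of variables. The automobile measurements are made up, and the optional drawing helper assumes matplotlib is installed.

```python
from itertools import product

def splom_panels(names):
    """Map each grid cell (row, col) to the (x, y) variables it displays."""
    k = len(names)
    return {(r, c): (names[c], names[r]) for r, c in product(range(k), repeat=2)}

# Hypothetical automobile measurements, one list per variable
cars = {
    "weight": [1.2, 1.5, 1.8, 2.1, 2.4],
    "horsepower": [70, 95, 120, 150, 200],
    "mpg": [40, 34, 29, 24, 18],
}

panels = splom_panels(list(cars))

def plot_splom(columns):
    """Draw the matrix with matplotlib (assumed to be installed)."""
    import matplotlib.pyplot as plt
    names = list(columns)
    k = len(names)
    fig, axes = plt.subplots(k, k, figsize=(2 * k, 2 * k), squeeze=False)
    for (r, c), (x_name, y_name) in splom_panels(names).items():
        axes[r][c].scatter(columns[x_name], columns[y_name], s=8)
        if r == k - 1:
            axes[r][c].set_xlabel(x_name)  # label only the bottom row
        if c == 0:
            axes[r][c].set_ylabel(y_name)  # label only the left column
    return fig
```

The number of panels grows with the square of the number of variables, which limits a SPLOM to a handful of dimensions.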

Parallel coordinates

Parallel coordinates take a different approach to visualizing multivariate data in a more compact way

Instead of graphing every pair of variables in two dimensions, we repeatedly plot the data on parallel axes and then connect the corresponding points with lines

Each line represents a single row in the database

Line crossings between dimensions often indicate inverse correlation

Reordering dimensions can aid pattern finding
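A sketch of the two computations behind parallel coordinates, with made-up rows: rescaling every dimension to a common axis height, and counting line crossings between neighboring axes as a rough indicator of inverse correlation. Both function names are hypothetical.

```python
def normalize_columns(rows):
    """Rescale every dimension to [0, 1] so all axes share one height."""
    spans = [(min(col), max(col)) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5 for v, (lo, hi) in zip(row, spans)]
        for row in rows
    ]

def count_crossings(rows, axis):
    """Count pairs of polylines that cross between axis and axis + 1."""
    crossings = 0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            left = rows[i][axis] - rows[j][axis]
            right = rows[i][axis + 1] - rows[j][axis + 1]
            if left * right < 0:  # the order flips between the two axes
                crossings += 1
    return crossings

# Hypothetical rows: dimension 0 rises while dimensions 1 and 2 fall together
rows = [[1, 30, 300], [2, 20, 200], [3, 10, 100]]
scaled = normalize_columns(rows)

print(count_crossings(scaled, 0))  # 3: every pair of lines crosses
print(count_crossings(scaled, 1))  # 0: the lines run in parallel
```

Since crossings are only visible between neighboring axes, trying different axis orders changes which relations can be read off the chart.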

*Do you need a graphic at all?

Combined visualization types

The types of graphs we have covered can be combined in interesting ways

Some of those advanced techniques will be covered in later course units

For example: geo-spatial placement of stacked time series

https://erdavis.com/2022/02/09/how-i-made-the-viral-map/

As is often the case, there are also examples that do not serve well as role models

https://www.ft.com/content/3888bdba-d0d6-49a1-9e78-4d07ce458f42

https://theconversation.com/three-charts-that-show-where-the-coronavirus-death-rate-is-heading-137103

Acknowledgements

https://yy.github.io/dviz-course/