“The earliest seeds of visualization arose in geometric diagrams, in tables of the positions of stars and other celestial bodies, and in the making of maps to aid in navigation and exploration.”
“The idea of coordinates was used by ancient Egyptian surveyors in laying out towns, earthly and heavenly positions were located by something akin to latitude and longitude by at least 200 B.C.,
and the map projection of a spherical earth into latitude and longitude by Claudius Ptolemy [c. 85–c. 165] in Alexandria would serve as reference standards until the 14th century.”
Friendly, M. (2008). A Brief History of Data Visualization. In C. Chen, W. Härdle, & A. Unwin, Handbook of Data Visualization (pp. 15–56). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_2
Heer, J., Bostock, M., & Ogievetsky, V. (2010). A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure. Queue, 8(5), 20–30. https://doi.org/10.1145/1794514.1805128
“Creating a visualization requires a number of nuanced judgments.”
“One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color.”
“The challenge is that for any given data set the number of visual encodings—and thus the space of possible visualization designs—is extremely large.”
If you have, for example, 10 data points, what would be the most direct way to visualize them?
Problems?
1D “Jittered” Scatterplot
Using transparency (“alpha”)
Using empty symbols such as rings
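A minimal sketch of the jitter idea, assuming ten made-up values: each point keeps its data value on the x-axis and receives a small random vertical offset, so repeated values no longer sit exactly on top of each other. The function name `jitter` and the `spread` parameter are illustrative and not taken from any library.

```python
import random

def jitter(values, spread=0.2, seed=0):
    """Return (x, y) pairs: x is the data value, y is a small random offset."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(v, rng.uniform(-spread, spread)) for v in values]

# Hypothetical data with repeated values that would overlap on a plain 1D plot
data = [1, 2, 2, 2, 3, 3, 5, 5, 5, 5]
points = jitter(data)
for x, y in points:
    print(f"x={x}  y={y:+.3f}")
```

The resulting pairs can be passed to any plotting tool; transparency or ring-shaped markers can then be applied on top of the jittered positions.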
show the prevalence of values grouped into bins
https://twitter.com/NicholasStrayer/status/1026893778404225024
This refers to the box-and-whisker plot, which conveys statistical features such as the mean, median, quartile boundaries, or extreme outliers.
Wickham, H., & Stryjewski, L. (2012). 40 years of boxplots. had.co.nz.
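The numbers behind a box-and-whisker plot can be sketched as follows. This assumes the common convention of whiskers reaching to the most extreme points within 1.5 times the interquartile range; other conventions exist. The helper `boxplot_stats` and the sample values are made up for illustration.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Compute quartiles, whisker ends, and outliers for one distribution."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }

# Hypothetical sample with one extreme value
sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 30]
stats = boxplot_stats(sample)
print(stats)
```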
What are their strengths and weaknesses?
As a summarization method, a boxplot may be useful if you want to compare multiple (well-behaving) distributions. Boxplots will immediately and precisely show the median, the quartiles, and the rough range of the distribution.
On the other hand, a boxplot may hide details in the distribution, particularly when the distribution is far from a normal distribution.
A histogram is sensitive to parameter choice as we have seen
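The sensitivity to the number of bins can be illustrated with a small equal-width binning sketch. The data values and the helper `histogram` are assumptions for illustration only.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical data with two clusters and a gap in between
data = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9]
for bins in (2, 4, 8):
    counts = histogram(data, bins)
    print(f"{bins} bins:", " ".join("#" * c or "." for c in counts))
```

With two bins the gap between the clusters is invisible; with four or eight bins it shows up as empty bins.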
Similar to histograms, but the height of the bars need not be a count (or frequency), and the data can have “natural” categories rather than artificial bins
Rather, any (numeric) variable can be displayed
Similar to bar charts, but values are shown as circle segments instead of bars
Have a bad reputation; better to avoid them so as not to provoke readers (Few, S. (2007). Save the Pies for Dessert.)
There are also other good reasons not to use them:
So far, we were more or less only concerned with the x-axis
For example, the x-axis was set by the histogram bins or, more generally, by groups or categories in the bar chart
If we relax this to plot arbitrary (numeric) variables on the x and y-axis, we get 2D scatter plots
“Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA.
This chart suggests there are generally two types of eruptions: short-wait-short-duration, and long-wait-long-duration.”
Very common; often the first thing you plot, usually using points
That makes a lot of sense, as it is often an “honest” strategy that reveals a lot
It can prevent you from overlooking things, which may be embarrassing later (see Anscombe’s quartet)
With many data points, the same overplotting problems arise as in the 1D scatter plot, and similar strategies help, for example alpha or rings
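Why plotting the raw points matters can be checked numerically with Anscombe's quartet (Anscombe, 1973): four data sets with nearly identical means and correlations that look entirely different as scatter plots. The sketch below only computes the summary statistics; the helper `pearson` is written out by hand for illustration.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient, written out explicitly."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: (mean(xs), mean(ys), pearson(xs, ys)) for name, (xs, ys) in quartet.items()}
for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The summary lines are practically indistinguishable, while a scatter plot of each set would immediately reveal a line, a curve, an outlier on a line, and a single leverage point.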
“Values changing over time”
Like a scatter plot, but the x-axis is a time dimension now
Often, instead of (or in addition to) points, lines are plotted
Raw values are often less important than relative changes
Multiple lines can often only be meaningfully compared when they are normalized in some way
Multiple stocks may have totally different baseline prices for example
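One common normalization, sketched here under the assumption of made-up prices, is to index every series to its first value so that all lines start at 100 and show relative change. The names `index_to_first`, `prices`, and the tickers "A" and "B" are hypothetical.

```python
def index_to_first(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [base * v / first for v in series]

# Hypothetical stocks with very different baseline prices
prices = {
    "A": [10.0, 11.0, 12.0],
    "B": [500.0, 525.0, 450.0],
}
indexed = {name: index_to_first(series) for name, series in prices.items()}
for name, series in indexed.items():
    print(name, [round(v, 1) for v in series])
```

After indexing, a rise from 10 to 12 and a fall from 500 to 450 become directly comparable as +20% and -10%.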
Stacked Graph of Unemployed U.S. Workers by Industry, 2000-2010
By stacking area charts on top of each other, we arrive at a visual summation of time-series values
Also called “stream graph”
Some limitations:
A stacked graph does not support negative numbers and is meaningless for data that should not be summed (temperatures, for example)
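The stacking itself amounts to a running sum: each layer is drawn between the accumulated total of the layers below it and that total plus its own values. A minimal sketch with made-up series follows; the helper `stack` is illustrative and rejects negative values, in line with the limitation above.

```python
def stack(layers):
    """Return a (lower, upper) boundary pair for each layer, stacked from zero."""
    baseline = [0.0] * len(layers[0])
    bands = []
    for layer in layers:
        if any(v < 0 for v in layer):
            raise ValueError("stacked graphs do not support negative values")
        top = [b + v for b, v in zip(baseline, layer)]
        bands.append((baseline, top))
        baseline = top  # the next layer sits on top of this one
    return bands

# Hypothetical values for three categories over three time steps
layers = [[1.0, 2.0, 3.0], [2.0, 2.0, 1.0], [0.5, 1.0, 1.5]]
bands = stack(layers)
for lower, upper in bands:
    print(lower, "->", upper)
```

The upper boundary of the last layer is the total across all series, which is why the chart is only meaningful for data that can sensibly be summed.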
We start with a standard area chart, with positive values colored blue and negative values colored red
“The horizon graph is a technique for increasing the data density of a time-series view while preserving resolution.”
We divide the graph into horizontal bands and layer them to create a nested form.
The result is a chart that preserves data resolution but uses only a quarter of the space.
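The banding step can be sketched as follows, assuming two bands and made-up values: negative values are mirrored onto the positive side (their sign is kept for coloring), and each magnitude is cut into slices of one band height that are then drawn on top of each other. The function `horizon_bands` is a hypothetical helper, not a library call.

```python
def horizon_bands(values, n_bands=2):
    """Split each value into per-band heights after mirroring negatives."""
    peak = max(abs(v) for v in values)
    band_height = peak / n_bands
    result = []
    for v in values:
        magnitude = abs(v)  # mirroring: negatives are flipped upward
        slices = [
            min(max(magnitude - i * band_height, 0.0), band_height)
            for i in range(n_bands)
        ]
        result.append(("neg" if v < 0 else "pos", slices))
    return band_height, result

# Hypothetical time series with positive and negative values
values = [1.0, 3.0, 4.0, -2.0, -4.0]
band_height, layered = horizon_bands(values)
for v, (sign, slices) in zip(values, layered):
    print(f"{v:+.1f} -> {sign} {slices}")
```

Mirroring halves the vertical extent and two bands halve it again, which is where the factor of a quarter comes from in this configuration.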
Often, we want to do exploratory data analysis:
To gain insight into how data is distributed to inform data transformation and modeling decisions
We already covered the histogram and the boxplot, but there are many more techniques
Stem-and-Leaf Plot of Mechanical Turk Participation Rates
It typically bins numbers according to the first significant digit and then stacks the values within each bin by the second significant digit.
This minimalistic representation uses the data itself to paint a frequency distribution,
replacing the “information-empty” bars of a traditional histogram bar chart and allowing one to assess both the overall distribution and the contents of each bin.
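A stem-and-leaf plot is simple enough to build by hand. The sketch below assumes non-negative two-digit integers, so the tens digit is the stem and the ones digit is the leaf; the values are made up and are not the Mechanical Turk data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by tens digit (stem) and collect ones digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical two-digit values
rates = [12, 15, 21, 23, 23, 27, 34, 38, 38, 39, 45]
plot = stem_and_leaf(rates)
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(d) for d in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each printed row has the length of a histogram bar, but the digits in it still spell out the individual values.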
The Q-Q plot compares two probability distributions by graphing their quantiles
If the two are similar, the plotted values will lie roughly along the central diagonal
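The points of a Q-Q plot can be computed by taking the same quantiles of both samples and pairing them up. This is a rough sketch using deciles and two made-up samples; `qq_points` is an illustrative helper name.

```python
import statistics

def qq_points(sample_a, sample_b, n=10):
    """Pair up the n-1 cut points (e.g. deciles for n=10) of two samples."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical samples: b has the same shape as a but is shifted by 5
a = list(range(1, 101))
b = [v + 5 for v in a]

same = qq_points(a, a)
shifted = qq_points(a, b)
for (x, y) in shifted:
    print(f"{x:6.1f} {y:6.1f}")
```

For identical samples every point lies exactly on the diagonal; the shifted sample produces a line parallel to it, which indicates the same shape at a different location.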
Scatter Plot Matrix of Automobile Data
Small multiples of scatter plots showing a set of pairwise relations among variables
A SPLOM enables visual inspection of correlations between any pair of variables.
Parallel coordinates take a different approach to visualizing multivariate data in a more compact way
Instead of graphing every pair of variables in two dimensions, we repeatedly plot the data on parallel axes and then connect the corresponding points with lines
Each line represents a single row in the database
Line crossings between dimensions often indicate inverse correlation
Reordering dimensions can aid pattern finding
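The construction and the crossing heuristic can be sketched as follows. Each column is scaled to the range 0 to 1 so that all axes share the same height, each row becomes one polyline, and two polylines cross between neighboring axes when their order flips. The data values and the helpers `normalize_columns` and `crossings` are assumptions for illustration, not the automobile data set.

```python
from itertools import combinations

def normalize_columns(rows):
    """Min-max scale every column to [0, 1]; each row becomes one polyline."""
    spans = [(min(col), max(col)) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5 for v, (lo, hi) in zip(row, spans)]
        for row in rows
    ]

def crossings(lines, axis):
    """Count pairs of polylines that cross between `axis` and `axis + 1`."""
    return sum(
        1
        for a, b in combinations(lines, 2)
        if (a[axis] - b[axis]) * (a[axis + 1] - b[axis + 1]) < 0
    )

# Hypothetical rows: (horsepower, miles per gallon, weight)
rows = [(50, 40, 1800), (100, 30, 2500), (150, 20, 3200), (200, 15, 4000)]
lines = normalize_columns(rows)
print("horsepower vs mpg crossings:", crossings(lines, 0))

# Reordering the dimensions to (horsepower, weight, mpg)
reordered = normalize_columns([(hp, w, mpg) for hp, mpg, w in rows])
print("horsepower vs weight crossings:", crossings(reordered, 0))
```

In this made-up example every pair of lines crosses between horsepower and miles per gallon (inverse relation), while none cross once weight is placed next to horsepower.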
The types of graphs that we covered can be combined in interesting ways
Some of those advanced techniques will be covered in later course units
For example: geo-spatial placement of stacked time series
https://www.ft.com/content/3888bdba-d0d6-49a1-9e78-4d07ce458f42