Friendly data graphics:
Words are spelled out, mysterious and elaborate encoding avoided
Words run from left to right, the usual direction for reading occidental languages
Little messages help explain data
Elaborately encoded shadings, crosshatching, and colors are avoided; instead, labels are placed on the graphic itself; no legend is required
Graphic attracts viewer, provokes curiosity
Colors, if used, are chosen so that the color-deficient and color-blind (5 to 10% of viewers) can make sense of the graphic (blue can be distinguished from other colors by most color-deficient people)
Type is clear, precise, modest; lettering may be done by hand
Type is upper-and-lower case, with serifs
Unfriendly data graphics:
Abbreviations abound, requiring the viewer to sort through text to decode abbreviations
Words run vertically, particularly along the Y-axis; words run in several different directions
Graphic is cryptic, requires repeated references to scattered text
Obscure codings require going back and forth between legend and graphic
Graphic is repellent, filled with chartjunk
Design insensitive to color-deficient viewers; red and green used for essential contrasts
Type is clotted, overbearing
Type is all capitals, sans serif
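Two of the principles listed above, labels placed on the graphic itself instead of a legend and colors that color-deficient viewers can tell apart, can be sketched in ggplot2 as follows. This is only an illustration under assumptions: the built-in `Orange` data set and the Okabe-Ito palette (available through `grDevices::palette.colors()` in R 4.0 or later) are choices made here, not part of the original material.

```r
library(ggplot2)

# Assumption: the built-in Orange data (growth of five orange trees)
# serves as example data.
orange <- as.data.frame(datasets::Orange)

# A palette designed to remain distinguishable for color-deficient viewers.
# unname() matters: scale_colour_manual() would otherwise try to match
# the palette's color names against the Tree levels.
okabe_ito <- unname(grDevices::palette.colors(n = 5, palette = "Okabe-Ito"))

# Last observation of each tree: the position for its direct label.
line_ends <- orange[orange$age == max(orange$age), ]

p_direct <- ggplot(orange, aes(age, circumference, colour = Tree)) +
  geom_line(linewidth = 0.6) +
  # Labels sit next to the lines, so no legend is needed.
  geom_text(data = line_ends, aes(label = paste("Tree", Tree)),
            hjust = 0, nudge_x = 20, size = 4) +
  scale_colour_manual(values = okabe_ito, guide = "none") +
  expand_limits(x = 1800) +  # leave room for the labels on the right
  labs(x = "Age (days)", y = "Circumference (mm)") +
  theme_classic()
p_direct
```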
When working with data graphics, we are usually free to choose the dimensions (width and height) of the graphic
Vector graphics can be scaled to arbitrary dimensions, even very large ones, without loss of quality
Bitmap graphics have a limit beyond which quality degrades noticeably
An important question concerns the ratio of width to height, the aspect ratio
Which aspect ratio should we choose for a specific data graphic?
Heavier lines should be a data measure
As an example consider a time series plot:
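A minimal sketch of such a comparison follows. It rests on assumptions: the built-in `sunspot.year` series stands in for the series used in the original figure, and `height_for()` is a hypothetical helper introduced here for illustration. The same plot is saved at the same width with two different aspect ratios.

```r
library(ggplot2)

# Assumption: the built-in yearly sunspot numbers (1700-1988) serve as
# the example time series.
sunspots_df <- data.frame(
  year  = as.numeric(time(datasets::sunspot.year)),
  count = as.numeric(datasets::sunspot.year)
)

p_ts <- ggplot(sunspots_df, aes(year, count)) +
  geom_line(linewidth = 0.4) +
  labs(x = "", y = "") +
  theme_classic()

# Hypothetical helper: height for a given width and aspect ratio,
# with aspect ratio = width / height as defined above.
height_for <- function(width, aspect_ratio) width / aspect_ratio

# Same plot and width, two aspect ratios. PDF is a vector format,
# so the chosen dimensions do not affect quality.
out_wide   <- file.path(tempdir(), "sunspots_wide.pdf")
out_square <- file.path(tempdir(), "sunspots_square.pdf")
ggsave(out_wide,   p_ts, width = 20, height = height_for(20, 5), units = "cm")
ggsave(out_square, p_ts, width = 20, height = height_for(20, 1), units = "cm")
```

Opening the two files side by side shows how strongly the aspect ratio shapes the visual impression of the same data. For a bitmap format such as PNG, the `dpi` argument of `ggsave()` would matter in addition to width and height.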
library(ggplot2)
anscombe <- datasets::anscombe
# earlier function
# create_plot <- function(dataset_x,dataset_y,size_points=4,size_text=21){
# ggplot(anscombe,
# aes({{ dataset_x }},{{ dataset_y }})) +
# geom_point(
# size = size_points) +
# geom_smooth(method="lm", se=F, fullrange = TRUE,
# color="darkgrey") +
# scale_x_continuous(
# breaks = seq(0,20,2)) +
# scale_y_continuous(
# breaks = seq(0,14,2)) +
# expand_limits(x = c(0,20), y = c(0,14)) +
# labs(x = deparse(substitute(dataset_x)),
# y = deparse(substitute(dataset_y))) +
# theme_bw() +
# theme(text=element_text(size=size_text))
# }
p06 <- p05 + labs(x = "", y = "") +
  theme(text = element_text(size = 16),
        axis.text.x = element_text(hjust = 0.5),
        axis.line = element_line(colour = "black", linewidth = 0.6),
        axis.ticks = element_line(colour = "black", linewidth = 0.5),
        axis.ticks.length = unit(.15, "cm"),
        plot.margin = unit(c(.2, .5, .2, .2), "cm"))
p06
p1 <- ggplot(anscombe, aes(x1, y1)) +
  geom_point(size = 2.5) +
  # geom_smooth(method = "lm", se = FALSE, fullrange = TRUE,
  #             color = "darkgrey") +
  annotate("text", x = 2, y = 13, size = 10, family = "Times", label = "I") +
  scale_x_continuous(
    labels = c("", "10", "", "20"), breaks = c(5, 10, 15, 20), expand = c(0, 0)) +
  scale_y_continuous(breaks = c(5, 10)) +
  expand_limits(x = c(0, 20), y = c(0, 14)) +
  labs(x = "", y = "") +
  theme_classic() +
  # theme_bw() +
  theme(text = element_text(size = 16),
        axis.line = element_line(colour = "black", linewidth = 0.6),
        axis.ticks = element_line(colour = "black", linewidth = 0.5),
        axis.ticks.length = unit(.15, "cm"),
        plot.margin = unit(c(.2, .5, .2, .2), "cm"))
p2 <- ggplot(anscombe, aes(x2, y2)) +
  geom_point(size = 2.5) +
  annotate("text", x = 2, y = 13, size = 10, family = "Times", label = "II") +
  scale_x_continuous(
    labels = c("", "10", "", "20"), breaks = c(5, 10, 15, 20), expand = c(0, 0)) +
  scale_y_continuous(breaks = c(5, 10)) +
  expand_limits(x = c(0, 20), y = c(0, 14)) +
  labs(x = "", y = "") +
  theme_classic() +
  theme(text = element_text(size = 16),
        axis.line = element_line(colour = "black", linewidth = 0.6),
        axis.ticks = element_line(colour = "black", linewidth = 0.5),
        axis.ticks.length = unit(.15, "cm"),
        # white axis text hides the tick labels in this panel
        axis.text.y = element_text(colour = "white"),
        axis.text.x = element_text(colour = "white"),
        plot.margin = unit(c(.2, .5, .2, .2), "cm"))
p3 <- ggplot(anscombe, aes(x3, y3)) +
  geom_point(size = 2.5) +
....
…to detect outliers and label outliers
Could you provide an alternative solution not using patchwork or a similar package?
Think about rearranging the data (and about using facets in ggplot2, for example)
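One possible approach to this question, sketched under assumptions rather than given as the reference solution: reshape `anscombe` from wide to long format with base R and let `facet_wrap()` produce the four panels. The names `anscombe_long`, `Series`, `X`, and `Y` are chosen here for illustration.

```r
library(ggplot2)

anscombe <- datasets::anscombe

# Rearrange from wide (x1..x4, y1..y4) to long: one row per point,
# with a Series column identifying the data set (I to IV).
series_labels <- c("I", "II", "III", "IV")
anscombe_long <- data.frame(
  Series = factor(rep(series_labels, each = nrow(anscombe)),
                  levels = series_labels),
  X = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  Y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One panel per series; all panels share the same axes.
p_facets <- ggplot(anscombe_long, aes(X, Y)) +
  geom_point(size = 2.5) +
  facet_wrap(vars(Series), nrow = 2) +
  scale_x_continuous(breaks = c(5, 10, 15, 20)) +
  scale_y_continuous(breaks = c(5, 10)) +
  expand_limits(x = c(0, 20), y = c(0, 14)) +
  labs(x = "", y = "") +
  theme_classic() +
  theme(text = element_text(size = 16),
        strip.background = element_blank())
p_facets
```

Because the facets share scales by default, tick labels appear only on the outer panels, which replaces the manual hiding of axis text in the individual plots.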
For an alternative solution in Python, see Vega-Altair
## Series X Y
## 0 I 10 8.04
## 1 I 8 6.95
## 2 I 13 7.58
## 3 I 9 8.81
## 4 I 11 8.33
## 5 I 14 9.96
## 6 I 6 7.24
## 7 I 4 4.26
## 8 I 12 10.84
## 9 I 7 4.81
## 10 I 5 5.68
## 11 II 10 9.14
## 12 II 8 8.14
## 13 II 13 8.74
## 14 II 9 8.77
## 15 II 11 9.26
## 16 II 14 8.10
## 17 II 6 6.13
## 18 II 4 3.10
## 19 II 12 9.13
## 20 II 7 7.26
## 21 II 5 4.74
## 22 III 10 7.46
## 23 III 8 6.77
## 24 III 13 12.74
## 25 III 9 7.11
## 26 III 11 7.81
## 27 III 14 8.84
## 28 III 6 6.08
## 29 III 4 5.39
## 30 III 12 8.15
## 31 III 7 6.42
## 32 III 5 5.73
## 33 IV 8 6.58
## 34 IV 8 5.76
## 35 IV 8 7.71
## 36 IV 8 8.84
## 37 IV 8 8.47
## 38 IV 8 7.04
## 39 IV 8 5.25
## 40 IV 19 12.50
## 41 IV 8 5.56
## 42 IV 8 7.91
## 43 IV 8 6.89
Visualize the Berkeley data as an informative graphic (and as a table, and possibly as a combination of both), investigating admission rates by gender (following the design principles discussed in the course so far)
What do you find? Think about possible reasons for your findings
Now create small multiples, split by the study program that applicants applied to (variable “Major”), and look at admission rates by gender again for each of the majors
What do you find now? What could be the reasons for your earlier findings?
What other informative aspects of the data are there to be uncovered? Visualize them with techniques that you deem adequate
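As a starting point for these exercises, the admission rates can be computed and drawn as small multiples along the following lines. This is a sketch under assumptions: base R's `UCBAdmissions` table stands in for the course data set, its `Dept` variable plays the role of “Major”, and `admission_rate()` is a hypothetical helper introduced here.

```r
library(ggplot2)

# Assumption: base R's UCBAdmissions stands in for the course data;
# its Dept variable plays the role of "Major".
berkeley <- as.data.frame(datasets::UCBAdmissions)  # Admit, Gender, Dept, Freq

# Hypothetical helper: share of admitted applicants within each group.
admission_rate <- function(data, groups) {
  admitted <- aggregate(
    Freq ~ ., FUN = sum,
    data = data[data$Admit == "Admitted", c(groups, "Freq")])
  total <- aggregate(Freq ~ ., FUN = sum, data = data[, c(groups, "Freq")])
  out <- merge(admitted, total, by = groups,
               suffixes = c("_admitted", "_total"))
  out$rate <- out$Freq_admitted / out$Freq_total
  out
}

rate_overall  <- admission_rate(berkeley, "Gender")
rate_by_major <- admission_rate(berkeley, c("Gender", "Dept"))

# Small multiples: one panel per department, bars labelled directly.
p_majors <- ggplot(rate_by_major, aes(Gender, rate)) +
  geom_col(width = 0.6, fill = "grey40") +
  geom_text(aes(label = sprintf("%.0f%%", 100 * rate)),
            vjust = -0.4, size = 4) +
  facet_wrap(vars(Dept), nrow = 1) +
  scale_y_continuous(limits = c(0, 1), labels = NULL) +
  labs(x = "", y = "") +
  theme_classic() +
  theme(axis.ticks.y = element_blank(), axis.line.y = element_blank())
p_majors
```

Comparing `rate_overall` with `rate_by_major` is the step the questions above ask about; `Freq_total` in `rate_by_major` also shows how many applicants of each gender applied to each department.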