Module 3 : Data visualization with ggplot2
Tuesday, July 15, 2025
Overview
Package ggplot2
Initialization with the ggplot() function
Collection of color palettes with package ggsci
Interactive graphs with package plotly
Little help with package esquisse
Conclusion
The R Graph Gallery is a website that showcases the main chart types available in ggplot2, with tutorials to help you build them.
ggplot2
ggplot2 is an R package created by Hadley Wickham and Winston Chang for producing data visualizations.
You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
This package is part of the tidyverse framework.
ggplot() function
For structure, we go over the 7 component parts that come together as a set of instructions on how to draw a chart.
Out of these components, ggplot2 needs at least the following three to produce a chart: data, a mapping, and a layer.
The scales, facets, coordinates, and themes have reasonable defaults.
To build a graph with ggplot2, you need to define several elements:
the data: ggplot2 lets you work with vectors and data frames;
the mapping: the aesthetic (or aes) defines the mapping, i.e. the correspondence between visual elements and variables. It is in the aesthetic that we declare what we want to represent, which depends on the variables (which variable on the x axis, on the y axis, which variable to define a color scale…);
the layer (“geometric form”): this defines the graphical representation in which the above parameters are represented. In ggplot2, these functions are of the form geom_xxx().
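As an illustration of the three required elements, here is a minimal sketch. It assumes the built-in mtcars dataset (not part of this course's data) and a hypothetical object name p_min.

```r
library(ggplot2)

# mtcars ships with R; it is used here only for illustration
p_min <- ggplot(data = mtcars,                     # 1. the data
                mapping = aes(x = wt, y = mpg)) +  # 2. the mapping
  geom_point()                                     # 3. the layer

p_min # printing the object draws the chart
```

Scales, facets, coordinates and theme are not mentioned: ggplot2 falls back on its defaults.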
library(tidyverse) # loads ggplot2, dplyr and the pipe %>% used below
library(grstat) # assumed here to be the package providing grstat_example()
# ?grstat_example # help - see the documentation of the function
db = grstat_example(N=200, seed = 42) # generate a list of simulated datasets
# str(db) # displays an overview of each element of the list
# names(db) # displays the names of objects in the list db
df_enrol = db$enrolres
df_recist = db[["recist"]]
Overview of the first observations with the function head()
# display the 5 first rows
head(x = df_enrol, n = 5)
#> # A tibble: 5 × 5
#>   subjid arm       arm3        date_inclusion crfname
#>    <int> <fct>     <fct>       <date>         <chr>
#> 1      1 Control   Treatment B 2026-03-26     enrolres
#> 2      2 Control   Control     2023-06-29     enrolres
#> 3      3 Treatment Treatment A 2027-11-17     enrolres
#> 4      4 Control   Control     2024-02-04     enrolres
#> 5      5 Treatment Treatment A 2027-06-15     enrolres
Frequency table with the function table() (prop.table() to compute percentages).
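A short sketch of both functions, assuming the built-in mtcars dataset as a stand-in (the object names freq and pct are hypothetical). With the course data, the same calls would apply to a column such as df_enrol$arm.

```r
# mtcars ships with R; cyl (number of cylinders) plays the role of a categorical variable
freq <- table(mtcars$cyl) # counts per category
freq

# prop.table() turns counts into proportions; multiply by 100 for percentages
pct <- round(100 * prop.table(freq), 1)
pct
```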
Note
You can add as many additional functions as you want.
Each function is linked to the previous one with a +, much like the pipe %>% in the tidyverse.
Tip
Start a new line after each +, so that each instruction sits on its own line. The code will be easier to read.
Do not forget to indent your code (CTRL + I).
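The layout recommended above could look like the following sketch, which assumes the built-in mtcars dataset and a hypothetical object name p_chain.

```r
library(ggplot2)

# one instruction per line, each line ending with +
p_chain <- ggplot(data = mtcars, mapping = aes(x = factor(cyl))) +
  geom_bar() +
  xlab("Number of cylinders") +
  ylab("Number of cars")

p_chain
```

The + must end a line rather than start the next one, otherwise R considers the instruction finished and stops before the following layer.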
ggplot2 uses data to construct a plot. The best format is a rectangular dataframe structure where rows are observations and columns are variables.
As the first step in many plots, you would pass the data to the ggplot() function, which stores the data to be used later by other parts of the plotting system.
The mapping of a plot is a set of instructions on how parts of the data are mapped onto aesthetic attributes of geometric objects.
A mapping can be made by using the aes() function.
Note
There is no need to prefix variables with the table name, as in df_recist$subjid or df_recist[, "rctlsum_b"], because ggplot2 automatically looks for the variable in the data table given by the data parameter.
In addition, the aes() function accepts other arguments that can be used to modify the appearance of the graph.
Note
It is possible to specify parameters that will be valid for the entire graph. These parameters are the same as those available in aes(), but must be passed outside the aesthetic.
# preparation of the data
df_recist %>%
  filter(rcvisit == 1) %>%
  left_join(df_enrol %>%
              select(subjid, arm3),
            by = "subjid") %>%
  # figure
  ggplot(mapping = aes(x = subjid,
                       y = rctlsum,
                       fill = arm3,            # fill colour for points
                       colour = arm3,          # outline colour for points
                       alpha = rcresp,         # transparency
                       size = rctlsum_b / 100) # point size
  ) +
  geom_point()
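To contrast a parameter mapped inside aes() with the same parameter passed outside it, here is a small sketch. It assumes the built-in mtcars dataset; p_mapped and p_fixed are hypothetical names.

```r
library(ggplot2)

# inside aes(): colour depends on a variable, one colour per level, with a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# outside aes(): colour, size and alpha are constants applied to every point, no legend
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue", size = 3, alpha = 0.6)
```

Writing colour = "steelblue" inside aes() would not give blue points: ggplot2 would treat the text as a one-level variable and pick a colour from its default scale.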
The heart of any graphic is the layers. Every layer consists of three important parts:
The geometry that determines how data are displayed, such as points, lines, or rectangles.
The statistical transformation that may compute new variables from the data and affect what of the data is displayed.
The position adjustment that primarily determines where a piece of data is being displayed.
A layer can be constructed using the geom_xxx() functions. These functions often determine the geometry, while the other two parts can still be specified through their stat and position arguments.
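The three parts of a layer can be made explicit in a single call. A minimal sketch, assuming the built-in mtcars dataset and a hypothetical object name p_layer:

```r
library(ggplot2)

p_layer <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(                # geometry: bars
    stat = "count",        # statistical transformation: count the rows in each group
    position = "dodge"     # position adjustment: bars side by side instead of stacked
  )

p_layer
```

Here stat = "count" is already the default of geom_bar(); it is spelled out only to show where the statistical transformation is declared.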
Warning
The layers you can use, and the resulting display, depend on the variables you put in the mapping. Some geometries need only one variable whereas others require several; hence, if you choose a layer that does not match your mapping, R will return an error or a warning message.
Reminder
Each new graphic element is added in the form of a layer separated by +.
The labs() function labels all possible elements of the aesthetic, as well as the title, subtitle and caption.
Note
There are several other ways of specifying these elements using specific layers.
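A possible use of labs() is sketched below, assuming the built-in mtcars dataset; the label texts and the name p_labs are only illustrative.

```r
library(ggplot2)

p_labs <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel consumption and weight",
       subtitle = "mtcars dataset",
       caption = "Illustrative example",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders") # title of the legend attached to the colour aesthetic

p_labs
```

The same labels can also be set one by one with ggtitle(), xlab() and ylab().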
Scales are responsible for updating the limits of a plot, setting the breaks, formatting the labels, and possibly applying a transformation.
To use scales, use one of the scale functions that are patterned as scale_{aesthetic}_{type}() functions, where {aesthetic} is one of the pairings made in the mapping part of a plot.
library(RColorBrewer)
# preparation of the data
df_recist %>%
  filter(rcvisit == 0) %>%
  left_join(df_enrol %>%
              select(subjid, arm3),
            by = "subjid") %>%
  # figure histogram
  ggplot(mapping = aes(x = rctlsum_b,
                       fill = arm3,
                       colour = arm3)) +
  geom_histogram(binwidth = 20, boundary = 0) +
  scale_x_continuous(name = "Baseline tumor size",
                     breaks = seq(from = 0, to = 180, by = 20)) +
  scale_fill_brewer(name = "Treatment arm") +
  scale_colour_brewer(name = "Treatment arm",
                      type = "div",
                      palette = "PRGn") +
  scale_y_continuous(limits = c(0, 100))
Facets can be used to separate small multiples, or different subsets of the data. It is a powerful tool to quickly split up the data into smaller panels, based on one or more variables, to display patterns or trends (or the lack thereof) within the subsets.
The facets have their own mapping that can be given as a formula with the function facet_grid().
df_recist %>%
  filter(rcvisit == 1) %>%
  left_join(df_enrol %>%
              select(subjid, arm3),
            by = "subjid") %>%
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp,
                       colour = rcresp)) +
  geom_boxplot() +
  xlab(label = "Overall response at visit 1") +
  ylab(label = "Tumor size (mm)") +
  guides(colour = "none", fill = "none") + # remove legend
  scale_x_discrete(labels = c(`Complete response` = "Complete \n Response", # \n inserts a line break
                              `Partial response` = "PR",
                              `Stable disease` = "SD",
                              `Progressive disease` = "PD",
                              `Not evaluable` = "NA")) +
  scale_fill_brewer(type = "div",
                    palette = "RdYlBu", # display.brewer.pal(4, "RdYlBu")
                    direction = -1) +   # reverse the order
  scale_colour_brewer(type = "div",
                      palette = "RdYlBu",
                      direction = -1) +
  scale_y_continuous(limits = c(0, 180)) +
  facet_grid( ~ arm3)
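The facet formula reads rows ~ columns. The sketch below, which assumes the built-in mtcars dataset and a hypothetical object name p_facet, uses both sides of the formula.

```r
library(ggplot2)

p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl) # one row of panels per value of am, one column per value of cyl

p_facet
```

Leaving the left-hand side empty, as in facet_grid( ~ arm3), produces a single row of panels.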
ggplot2 themes modify the appearance of your graphics. Appearance refers to everything that does not relate to the data, such as fonts, grids and background.
There are predefined themes in ggplot2 that you can use directly.
The theme system controls almost all visual elements of the plot that are not controlled by the data and is therefore important for the look and feel of the plot.
You can use many of the built-in theme_*() functions and/or detail specific aspects with the theme() function. The element_*() functions control the graphical attributes of theme components.
df_recist %>%
  filter(rcvisit == 1) %>%
  left_join(df_enrol %>%
              select(subjid, arm3),
            by = "subjid") %>%
  mutate(count = n(), .by = c("arm3", "rcresp")) %>%
  # figure boxplot
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(size = 10, colour = "red"),
        axis.title.x = element_blank()) + # remove the x-axis title and the space dedicated to it
  xlab(label = "Overall response at visit 1") + # not displayed, because axis.title.x is blank
  ylab(label = "Tumor size (mm)") +
  scale_fill_brewer(type = "div",
                    palette = "RdYlBu", # display.brewer.pal(4, "RdYlBu")
                    direction = -1) +   # reverse the order
  scale_y_continuous(limits = c(0, 180)) +
  guides(fill = guide_legend(theme = theme(
    legend.text.position = "top",
    legend.position = "top",
    legend.text = element_text(hjust = 1))))
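The predefined themes can be combined with theme() adjustments. A minimal sketch, assuming the built-in mtcars dataset and a hypothetical object name p_theme:

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  theme_minimal() +                  # predefined theme, applied first
  theme(legend.position = "bottom")  # specific adjustment, applied on top of it

p_theme
```

The order matters: a theme_*() function added after theme() would reset the adjustments made before it.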
As mentioned at the start, you can layer all of the pieces to build a customized plot of your data.
df_recist %>%
  filter(rcvisit == 1) %>%
  left_join(df_enrol %>%
              select(subjid, arm3),
            by = "subjid") %>%
  mutate(count = n(), .by = c("arm3", "rcresp")) %>%
  # figure boxplot
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, height = 0.2) +
  geom_text(aes(label = count, x = rcresp, y = 150)) +
  xlab(label = "Overall response at visit 1") +
  ylab(label = "Tumor size (mm)") +
  guides(colour = "none", fill = "none") + # remove legend
  scale_x_discrete(labels = c(`Complete response` = "Complete \n Response", # \n inserts a line break
                              `Partial response` = "PR",
                              `Stable disease` = "SD",
                              `Progressive disease` = "PD",
                              `Not evaluable` = "NA")) +
  scale_fill_brewer(type = "div",
                    palette = "RdYlBu", # display.brewer.pal(4, "RdYlBu")
                    direction = -1) +   # reverse the order
  scale_y_continuous(limits = c(0, 180)) +
  facet_grid( ~ arm3)
ggsci
External packages, such as ggthemes or hrbrthemes, can be used to extend the collection of themes.
The package ggsci provides color palettes inspired by scientific journals, such as The Lancet, the Journal of Clinical Oncology, NEJM and BMJ.
library(ggsci)
df_recist %>%
  filter(rcvisit >= 0 & rcvisit <= 1) %>%
  filter(!is.na(rcnew) & !is.na(rcresp)) %>%
  ggplot() +
  aes(
    x = rctlsum_b,
    y = rctlsum,
    fill = rcresp,
    colour = rcresp,
    size = rcnew
  ) +
  geom_point() +
  xlab("Baseline tumor size") +
  ylab("Tumor size at visit 1") +
  labs(fill = "Global response",
       colour = "Global response",
       size = "New lesion") +
  scale_fill_lancet() +
  scale_color_lancet() # The Lancet palette; scale_color_jco() gives the Journal of Clinical Oncology palette
plotly
The package plotly is a complement to ggplot2: it converts an existing ggplot into an interactive graph.
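A minimal sketch of the conversion, assuming the built-in mtcars dataset; p_static and p_inter are hypothetical names.

```r
library(ggplot2)
library(plotly)

# an ordinary static ggplot
p_static <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# ggplotly() converts it into an interactive widget (tooltips, zoom, legend filtering)
p_inter <- ggplotly(p_static)
p_inter
```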
esquisse
The package esquisse lets you explore and visualize your data interactively. It helps you build the main body of a ggplot by selecting the variables you want to plot. You can then adapt the generated code to your taste.
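Launching the interface takes a single call. The sketch below assumes the built-in mtcars dataset; with the course data you would pass a table such as df_recist instead.

```r
# esquisser() opens a Shiny interface, so it is only launched in an interactive session
if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
  esquisse::esquisser(data = mtcars)
}
```

Once the graph looks right, the interface displays the corresponding ggplot2 code, which you can copy into your script.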
You are now able to:
import databases into the R environment and manipulate them (training course n°2 led by Charlotte);
create basic and complex graphs with reasoning and structure;
customize aesthetic parameters.
The next steps will be covered with Dan.
Training course n°4
presentation of the R package EDCimport, a toolbox for importing and checking TrialMaster data;
description of data tables with the package crosstable.
Training course n°5
main hypothesis tests;
usual statistical models (linear and binary regressions, survival models…);
presentation of the R package grstat for Adverse Events tables.
The source code for this presentation is available on GitHub.