ONCOSTAT & BBE Team
Training course R

Module 3: Data visualization with ggplot2

Nusaïbah IBRAHIMI

Tuesday, July 15, 2025

Table of contents

  • Overview

  • Package ggplot2

  • Initialization with ggplot() function

  • Collection of color palettes with package ggsci

  • Interactive graphs with package plotly

  • Little help with package esquisse

  • Conclusion

Overview

The R Graph Gallery is a website that showcases the main types of figures available in ggplot2, with tutorials to help you reproduce them.

Package ggplot2

ggplot2 is an R package created by Hadley Wickham and Winston Chang for producing data visualizations.

You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

ggplot2 is part of the tidyverse, a collection of R packages, so loading the tidyverse also loads ggplot2.

# The easiest way to get ggplot2 is to install the whole tidyverse:
# install.packages("tidyverse")
library(tidyverse)


# Alternatively, install just ggplot2:
# install.packages("ggplot2")
library(ggplot2)

Cheatsheet

Initialization with ggplot() function

A ggplot2 chart is described by 7 components that together form a set of instructions on how to draw it: data, mapping, layers, scales, facets, coordinates, and theme.

Out of these components, ggplot2 needs at least the following three to produce a chart: data, a mapping, and a layer.

The scales, facets, coordinates, and themes have reasonable defaults.

Initialization with ggplot() function

To build a graph with ggplot2, you need to define several elements:

  1. the data: ggplot2 works best with a data frame (or tibble) in which each column is a variable;

  2. the mapping: the aesthetic (or aes) defines the mapping, i.e. the correspondence between visual elements and variables. The aesthetic is where we declare what we want to represent, which depends on the variables (which variable goes on the x axis, which on the y axis, which variable defines a color scale, etc.);

  3. the layer (“geometric form”): this defines the graphical representation in which the above parameters are drawn. In ggplot2, these functions are of the form geom_xxx().
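
The three elements above can be sketched in a few lines. This is a minimal illustration that uses the built-in mtcars dataset instead of the course data (an assumption made so that it runs without grstat); the object name p_min is only illustrative.

```r
library(ggplot2)

# 1. data: a data frame (built-in mtcars, used here only for illustration)
# 2. mapping: wt on the x axis, mpg on the y axis
# 3. layer: points
p_min <- ggplot(data = mtcars,
                mapping = aes(x = wt, y = mpg)) +
  geom_point()

p_min # printing the object draws the chart
```

Scales, facets, coordinates and theme are not specified here, so ggplot2 falls back on its defaults.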

Case study

# install.packages(pkgs = "devtools")
library(package = devtools)
install_github(repo = "Oncostat/grstat@v0.1.0.9010")
library(grstat)
# ?grstat_example # help - see the documentation of the function

db = grstat_example(N=200, seed = 42) # generate a list of simulated datasets 

# str(db)   # displays an overview of each element of the list
# names(db)  # displays the names of objects in the list db

df_enrol = db$enrolres
df_recist = db[["recist"]]

Overview of the first observations with the function head()

# display the first 5 rows
head(x = df_enrol, n = 5) 
#> # A tibble: 5 × 5
#>   subjid arm       arm3        date_inclusion crfname 
#>    <int> <fct>     <fct>       <date>         <chr>   
#> 1      1 Control   Treatment B 2026-03-26     enrolres
#> 2      2 Control   Control     2023-06-29     enrolres
#> 3      3 Treatment Treatment A 2027-11-17     enrolres
#> 4      4 Control   Control     2024-02-04     enrolres
#> 5      5 Treatment Treatment A 2027-06-15     enrolres

Frequency table with the function table() (use prop.table() on the result to compute proportions).

# number of patients in each treatment arm
table(df_enrol$arm3)
#> 
#>     Control Treatment A Treatment B 
#>          67          67          66
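
As a small sketch of how prop.table() turns counts into percentages, the example below uses a short hypothetical vector of arm labels (not the course data) so that it runs on its own.

```r
# Hypothetical arm labels, for illustration only (not the course data)
arm <- c("Control", "Treatment A", "Treatment B", "Control", "Treatment A")

counts <- table(arm)                          # absolute frequencies
percent <- round(100 * prop.table(counts), 1) # share of each arm, in percent
percent
```

The same pattern would apply to the course data, e.g. prop.table(table(df_enrol$arm3)).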

Syntax

# barplot
ggplot(data = df_enrol,
       mapping = aes(x = arm3,
                     fill = arm3)) + # Change fill color by groups
  
  geom_bar(width = 0.7) +
  ggtitle("Distribution of patients by treatment arm") +
  xlab("") +
  ylab("Number of patients") +
  labs(fill = "Group") # labs() also accepts title, subtitle, x and y

Note

You can add as many additional functions as you want.

Functions are chained with a +, much like the pipe %>% chains data steps in the tidyverse packages.
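
The difference between the two operators can be sketched as follows, again with the built-in mtcars dataset as a stand-in for the course data. The object names p_base and p_layers are illustrative.

```r
library(ggplot2)
library(dplyr)

# The pipe passes a data frame from one data step to the next...
p_base <- mtcars %>%
  filter(cyl != 6) %>%
  ggplot(mapping = aes(x = wt, y = mpg))

# ...while + adds components to an existing ggplot object
p_layers <- p_base +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

length(p_base$layers)   # no layer yet
length(p_layers$layers) # two layers: points and regression line
```

Once ggplot() has been called, the chain must continue with + and no longer with %>%.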

Tip

Start a new line after each +. The code will be easier to read.

Do not forget to indent your code (CTRL + I in RStudio).

Data

ggplot2 uses data to construct a plot. The best format is a rectangular dataframe structure where rows are observations and columns are variables.

As the first step in many plots, you would pass the data to the ggplot() function, which stores the data to be used later by other parts of the plotting system.

ggplot(data = df_recist %>% filter(rcvisit == 0))
df_recist %>% 
  filter(rcvisit == 0) %>% 
  ggplot() # it will understand that the data is 
           # df_recist filtered by baseline visit

Mapping

The mapping of a plot is a set of instructions on how parts of the data are mapped onto aesthetic attributes of geometric objects.

A mapping can be made by using the aes() function.

# preparation of the data
df_recist %>% 
  filter(rcvisit == 0) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
# figure
  ggplot(mapping = aes(x = subjid,  
                       y = rctlsum_b))

Note

There is no need to specify the table name in the form df_recist$subjid or df_recist[, "rctlsum_b"], as ggplot2 automatically looks for the variable in the data table given in the data parameter.

Mapping

In addition, the aes() function admits other arguments that can be used to modify the appearance of the graph.

Note

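It is also possible to set a parameter to a fixed value that applies to every element drawn, instead of mapping it to a variable. These parameters are the same as those available in aes() (colour, fill, size, alpha…), but must be passed outside aes(), typically as arguments of the geom_xxx() function.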
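
A minimal sketch of the difference between mapping and setting a parameter, using the built-in mtcars dataset as a stand-in for the course data (the object names are illustrative):

```r
library(ggplot2)

# Mapped: colour depends on a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Set: one fixed value for all points, passed outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue", size = 3, alpha = 0.5)

# layer_data() returns the values actually used for drawing
unique(layer_data(p_fixed)$colour)
```

Writing aes(colour = "blue") instead would treat "blue" as a data value with a single level, and the points would get the first colour of the default palette rather than blue.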

# preparation of the data
df_recist %>% 
  filter(rcvisit == 1) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
# figure
  ggplot(mapping = aes(x = subjid,  
                       y = rctlsum,
                       fill = arm3, # fill colour (only visible for point shapes 21 to 25)
                       colour = arm3, # outline colour of the points
                       alpha = rcresp, # transparency
                       size = rctlsum_b/100) # point size
         ) +
  geom_point()

Layers

The heart of any graphic is the layers. Every layer consists of three important parts:

  1. The geometry that determines how data are displayed, such as points, lines, or rectangles.

  2. The statistical transformation that may compute new variables from the data and affect which part of the data is displayed.

  3. The position adjustment that primarily determines where a piece of data is being displayed.

A layer can be constructed using the geom_xxx() functions. These functions mainly determine the geometry, while the other two parts have defaults that can still be changed through the stat and position arguments.
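
The three parts of a layer can be made explicit in a single call. The sketch below uses the built-in mtcars dataset as a stand-in for the course data; "count" is the default stat of geom_bar() and is spelled out here only for illustration.

```r
library(ggplot2)

p_bar <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(stat = "count",     # statistical transformation: count rows per group
           position = "dodge") # position adjustment: bars side by side, not stacked

# The stat computed a new variable, count, that does not exist in mtcars
layer_data(p_bar)[, c("x", "fill", "count")]
```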

# preparation of the data
df_recist %>% 
  filter(rcvisit == 0) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
# figure
  ggplot(mapping = aes(x = subjid,  
                       y = rctlsum_b,
                       colour = arm3))+
  geom_point()

Layers

Warning

The layer you can use and the resulting display depend on the variables you put in the mapping. Some types of figure need only one variable (a histogram needs only x), whereas others need several (a scatter plot needs x and y). If you choose a layer that does not match the mapping, R will return an error or a warning message.
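
As an illustration of such a mismatch, the sketch below asks for a histogram of a discrete variable, using the built-in mtcars dataset as a stand-in for the course data. The tryCatch() wrapper is only there to capture the message instead of stopping the script.

```r
library(ggplot2)

# A histogram needs a continuous x; here x is discrete (a factor)
p_wrong <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_histogram()

# The problem only surfaces when the plot is built (i.e. printed)
res <- tryCatch(ggplot_build(p_wrong),
                error   = function(e) e,
                warning = function(w) w)
conditionMessage(res)

# geom_bar() is the layer suited to a discrete x
p_right <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
```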

# preparation of the data
fig_point = df_recist %>% 
  filter(rcvisit == 0) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
  # figure scatter plot
  ggplot(mapping = aes(x = subjid,  
                       y = rctlsum_b,
                       colour = arm3))+
  geom_point(shape = 6) # triangle form 
fig_point

# preparation of the data
df_recist %>% 
  filter(rcvisit == 0) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
# figure histogram
  ggplot(mapping = aes(x = rctlsum_b,
                       fill = arm3,
                       colour = arm3))+
  
  geom_histogram(binwidth = 20, boundary = 0)

Layers

Reminder

Each new graphic element is added in the form of a layer separated by +.

The labs() function sets the label of any element of the aesthetic (x, y, colour, fill…), as well as the title, subtitle and caption.

fig_point + 
  labs(title = "TITLE",
       subtitle = "subtitle",
       colour = "Legend")

Layers

Note

There are several other ways of specifying these elements using specific layers.

fig_point + 
  ggtitle(label = "Scatter plot of tumor size at baseline (N = 200)",
       subtitle = NULL) +
  xlab("Patient") +
  ylab("Tumor size at baseline (mm)")

Scales

Scales are responsible for updating the limits of a plot, setting the breaks, formatting the labels, and possibly applying a transformation.

To use scales, use one of the scale functions that are patterned as scale_{aesthetic}_{type}() functions, where {aesthetic} is one of the pairings made in the mapping part of a plot.
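
The naming pattern can be read directly from the function names. The sketch below uses the built-in mtcars dataset as a stand-in for the course data; the colours and labels are arbitrary choices for illustration.

```r
library(ggplot2)

p_scale <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point() +
  # aesthetic = x, type = continuous
  scale_x_continuous(name = "Weight (1000 lbs)",
                     breaks = seq(from = 1, to = 6, by = 1)) +
  # aesthetic = colour, type = manual (colours chosen by hand)
  scale_colour_manual(name = "Transmission",
                      values = c("0" = "grey40", "1" = "darkorange"),
                      labels = c("0" = "Automatic", "1" = "Manual"))

p_scale
```

A scale only has an effect if its aesthetic is used in the mapping: scale_fill_*() does nothing on a plot that maps colour but not fill.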

library(RColorBrewer)
# preparation of the data
df_recist %>% 
  filter(rcvisit == 0) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
# figure histogram
  ggplot(mapping = aes(x = rctlsum_b,
                       fill = arm3,
                       colour = arm3)) +
  geom_histogram(binwidth = 20, boundary = 0) + 
  scale_x_continuous(name = "Baseline tumor size",
                     breaks = seq(from = 0, to = 180, by = 20)) +
  scale_fill_brewer(name = "Treatment arm") + 
  scale_colour_brewer(name = "Treatment arm",
                      type = "div", 
                      palette = "PRGn") +
  scale_y_continuous(limits = c(0,100))

# display.brewer.pal(3, "PRGn")  # displays 3 colors from the palette named "PRGn"

Facets

Facets can be used to separate small multiples, or different subsets of the data. It is a powerful tool to quickly split up the data into smaller panels, based on one or more variables, to display patterns or trends (or the lack thereof) within the subsets.

The facets have their own mapping that can be given as a formula with the function facet_grid().
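
The formula is read as rows ~ columns. A minimal sketch with the built-in mtcars dataset as a stand-in for the course data:

```r
library(ggplot2)

p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl) # rows ~ columns: one row per am, one column per cyl

p_facet
```

Leaving one side of the formula empty, as in facet_grid( ~ cyl), gives a single row of panels.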

df_recist %>% 
  filter(rcvisit == 1) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp,
                       colour = rcresp)) +
  geom_boxplot() + 
  xlab(label = "Overall response at visit 1") +
  ylab(label = "Tumor size (mm)") +
  guides(colour = "none", fill = "none") + # remove legend
  scale_x_discrete(labels = c(`Complete response`="Complete \n Response", # \n inserts a line break
                              `Partial response` = "PR",
                              `Stable disease` = "SD",
                              `Progressive disease` = "PD",
                              `Not evaluable` = "NE")) +
  scale_fill_brewer(type = "div", 
                      palette = "RdYlBu",  # display.brewer.pal(4,"RdYlBu") 
                      direction = -1) +    # reverse the order
  scale_colour_brewer(type = "div", 
                      palette = "RdYlBu",  
                      direction = -1) + 
  scale_y_continuous(limits = c(0,180))+
  facet_grid( ~ arm3)

Theme

ggplot2 themes modify the appearance of your graphics. Appearance refers to everything that does not relate to the data, such as fonts, grids and background.

There are predefined themes in ggplot2 that you can already use.

Theme

fig_point + 
  theme_void()   # not relevant here: it removes the axes, grid and background

Theme

The theme system controls almost every visual element of the plot that is not controlled by the data, and is therefore important for the look and feel of the plot.

You can use many of the built-in theme_*() functions and/or detail specific aspects with the theme() function. The element_*() functions control the graphical attributes of theme components.

df_recist %>% 
  filter(rcvisit == 1) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
  mutate(count = n(), .by = c("arm3","rcresp")) %>%
  # figure boxplot
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp)) +
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        axis.text.y = element_text(size = 10, colour = "red"),
        axis.title.x = element_blank())+ # remove the x axis title and the space reserved for it
  xlab(label = "Overall response at visit 1") + # not displayed, because axis.title.x is set to element_blank()
  ylab(label = "Tumor size (mm)") +
  scale_fill_brewer(type = "div", 
                    palette = "RdYlBu",  # display.brewer.pal(4,"RdYlBu") 
                    direction = -1) +    # reverse the order 
  scale_y_continuous(limits = c(0,180))+
  guides(fill = guide_legend(theme = theme(
  legend.text.position = "top",
  legend.position = "top",
  legend.text = element_text(hjust = 1))))

Combining

As mentioned at the start, you can layer all of the pieces to build a customized plot of your data.

df_recist %>% 
  filter(rcvisit == 1) %>% 
  left_join(df_enrol %>% 
              select(subjid, arm3),
            by = "subjid") %>% 
  mutate(count = n(), .by = c("arm3","rcresp")) %>%
  # figure boxplot
  ggplot(mapping = aes(x = rcresp,
                       y = rctlsum,
                       fill = rcresp)) +
  geom_boxplot() + 
  geom_jitter(width = 0.2, height = 0.2) +
  geom_text(aes(label = count, x= rcresp,y =150) ) + 
  xlab(label = "Overall response at visit 1") +
  ylab(label = "Tumor size (mm)") +
  guides(colour = "none", fill = "none") + # remove legend
  scale_x_discrete(labels = c(`Complete response`="Complete \n Response", # \n inserts a line break
                              `Partial response` = "PR",
                              `Stable disease` = "SD",
                              `Progressive disease` = "PD",
                              `Not evaluable` = "NE")) +
  scale_fill_brewer(type = "div", 
                    palette = "RdYlBu",  # display.brewer.pal(4,"RdYlBu") 
                    direction = -1) +    # reverse the order
  scale_y_continuous(limits = c(0,180))+
  facet_grid( ~ arm3)

Collection of color palettes with package ggsci

Package ggsci

External packages, such as ggthemes or hrbrthemes, can be used to extend the collection of themes.

The package ggsci provides color palettes inspired by scientific journals, such as The Lancet, the Journal of Clinical Oncology, NEJM and BMJ.

# install.packages("ggsci")
library(ggsci)
df_recist %>%
 filter(rcvisit >= 0 & rcvisit <= 1) %>%
  filter(!is.na(rcnew) & !is.na(rcresp)) %>% 
 ggplot() +
  aes(
    x = rctlsum_b,
    y = rctlsum,
    fill = rcresp,
    colour = rcresp,
    size = rcnew
  ) +
  geom_point() +
  xlab("Baseline tumor size")+
  ylab("Tumor size at visit 1")+
  labs(fill = "Global response",
       colour = "Global response", 
       size = "New lesion")+
  scale_fill_lancet() +
  scale_color_lancet() # The Lancet palette; use scale_color_jco() for the Journal of Clinical Oncology
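
To check which colors a journal palette contains before using it in a scale, the palette generators of ggsci can be called directly. This is a small sketch; the object names lancet_cols and jco_cols are illustrative.

```r
library(ggsci)

# pal_lancet() returns a function; calling it with n gives n color codes
lancet_cols <- pal_lancet()(3)
lancet_cols

# Same idea for the Journal of Clinical Oncology palette
jco_cols <- pal_jco()(3)
jco_cols
```

These vectors can also be passed to scale_colour_manual(values = ...) when only some of the colors are wanted.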

Interactive graphs with package plotly

Package plotly

The package plotly converts a ggplot object into an interactive graph with the function ggplotly().

# install.packages("plotly") 
library(plotly)
ggplotly()   # by default it will display the last plot (hence the one of the previous slide)

Package plotly

ggplotly(p = fig_point) # as argument it needs a ggplot object
fig_point %>%    # another way to code with the pipe
  ggplotly() 
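
As a further sketch, the tooltip argument of ggplotly() selects which aesthetics are shown when hovering over a point. The example uses the built-in mtcars dataset as a stand-in for the course data, and the file name in the commented line is hypothetical.

```r
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# tooltip selects which aesthetics are shown when hovering over a point
p_inter <- ggplotly(p = p, tooltip = c("x", "y", "colour"))

# The result is an HTML widget; it can be saved as a standalone web page
# htmlwidgets::saveWidget(p_inter, file = "scatter.html")
```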

Little help with package esquisse

Package esquisse

The package esquisse lets you explore and visualize your data interactively. It helps you build the main body of the ggplot code by selecting the variables you want to plot. You can then adapt the generated code to your taste.

# install.packages("esquisse")
# install.packages("plotly")
library(esquisse)
library(plotly)
# esquisser()
# esquisser(df_recist)

Conclusion

You are now able to:

  • import databases into R environment and manipulate them (training course n°2 led by Charlotte);

  • create basic and complex graphs with reasoning and structure;

  • customize aesthetic parameters.

Next steps

The next steps will be covered with Dan.

Training course n°4

  • presentation of the R package EDCimport, a toolbox for importing and checking TrialMaster data;

  • descriptive tables with the package crosstable.

Training course n°5

  • main hypothesis tests;

  • usual statistical models (linear and binary regressions, survival models…);

  • presentation of the R package grstat for Adverse Events tables.

Thank you for your attention