R for Data Science

Session 4

Data Visualization with ggplot2

Master the art of creating publication-quality graphics using ggplot2. Learn how to build complex visualizations layer by layer, from basic scatter plots to advanced faceted displays. Discover how to tell compelling stories with data through effective visualization.

Data Visualization Overview

Data visualization helps in understanding patterns, trends, and relationships in data. It is a crucial element in scientific research, enabling researchers to interpret and communicate their results effectively. Visualization transforms complex datasets into intuitive visual representations that reveal insights at a glance.

Types of Data Visualization

Data visualizations can be categorized by the number of variables they display. Univariate visualizations show a single variable (histograms, box plots). Bivariate visualizations display relationships between two variables (scatter plots, line plots, bar charts). Multivariate visualizations handle three or more variables (heatmaps, pair plots, violin plots). Specialized visualizations like pie charts, bubble charts, and word clouds serve specific purposes. Time series visualizations track data changes over time (time series line plots, autocorrelation plots).

Best Practices in Data Visualization

Effective visualizations follow key principles: Choose the right chart type for your data (bar charts for categories, line charts for trends). Follow design principles emphasizing simplicity, consistency, and accessibility. Use storytelling to highlight key insights and structure visuals logically. Avoid common pitfalls like misleading scales, cluttered visuals, and unnecessary 3D effects. Always consider your audience and the message you want to communicate.

Introduction to ggplot2

ggplot2 is a plotting system for R developed by Hadley Wickham and based on the Grammar of Graphics, a framework introduced by Leland Wilkinson. It builds complex, publication-quality graphics in layers, which are combined with the '+' operator. ggplot2 is widely regarded as the standard for data visualization in R, offering consistent syntax, seamless integration with tidyverse packages, support for faceting and grouping, and access to over 100 extensions. Because each layer is added separately, plots are easier to customize and to reason about.

Install and Setup ggplot2

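# Install ggplot2 (only needed once per machine)
install.packages("ggplot2")

# Load ggplot2 (needed in every new R session)
library(ggplot2)

# Set working directory to the folder that contains the dataset
setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

# Import dataset
# Note: the examples that follow use the native pipe |>, which requires R 4.1 or later
gss <- read.csv("GSSsubset.csv")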
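
Before plotting, it can help to confirm that the imported columns have the expected types. The sketch below is illustrative only: it writes a tiny made-up CSV to a temporary file (the values are invented and do not come from GSSsubset.csv) so that it runs anywhere. The same checks could then be applied to gss.

```r
# Toy stand-in for an imported file; the values are invented for illustration
toy_path <- tempfile(fileext = ".csv")
writeLines(c("age,income,sex",
             "25,30000,Male",
             "40,52000,Female",
             "33,41000,Female"),
           toy_path)

toy <- read.csv(toy_path)

# Show column names, types, and a preview of the values
str(toy)

# read.csv() keeps text columns as character (R 4.0 and later);
# convert to factor when a fixed set of categories is wanted
toy$sex <- factor(toy$sex)
levels(toy$sex)
```

Numeric columns suit scatter plots and histograms, while factor or character columns suit bar charts, box plot groups, and facets.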

Essential Layers in ggplot2

A ggplot2 visualization is built from up to seven layers. Three are needed for every plot: Data is the foundation where you define the dataset. Aesthetics map variables to visual aspects like color, size, and position using aes(). Geometries specify the plot type (bar, line, scatter) using geom_*() functions. The remaining four are optional and fall back to defaults when omitted: Facets create subplots for different data subsets. Statistics add transformations like mean lines or trend lines. Coordinates control the plot's coordinate system. Theme adjusts appearance including grid lines, fonts, and background.
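
The layers described above can be sketched in a single plot. This is a minimal illustration, not part of the course dataset: it uses mtcars, which ships with R, so it runs without any file, and the object name p is arbitrary.

```r
library(ggplot2)

# mtcars ships with R; it stands in for any data frame here
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +               # data + aesthetics
  geom_point() +                                          # geometry
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                               # statistic: fitted line
  facet_wrap(~ cyl) +                                     # facets: one panel per cyl value
  coord_cartesian(ylim = c(10, 35)) +                     # coordinates: visible y range
  theme_minimal()                                         # theme

# Printing the object draws the plot
p
```

Removing any of the last four lines still gives a valid plot, which is one way to see which layers are optional.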

Scatter Plots

Scatter plots visualize relationships between two continuous variables. Each point represents an observation, with position determined by the x and y values. Scatter plots are excellent for identifying correlations, clusters, and outliers in your data.

Creating Scatter Plots

# Basic scatter plot
gss |>
  ggplot(aes(x = age, y = income)) +
  geom_point()

# Scatter plot with color
gss |>
  ggplot(aes(x = age, y = income, colour = marital)) +
  geom_point()

# Scatter plot with size and color
gss |>
  ggplot(aes(x = age, y = income, colour = marital)) +
  geom_point(size = 4)

# Scatter plot with labels and theme
gss |>
  ggplot(aes(x = age, y = income)) +
  geom_point(color = "blue") +
  labs(title = "Income vs Age", x = "Age", y = "Income") +
  theme_minimal()
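
The examples above use colour in two different ways: inside aes() to map it to a variable, and outside aes() to set one fixed value. The sketch below shows the two side by side. It uses the built-in mtcars data rather than gss, and the object names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# Mapping: colour varies with a variable, so it goes inside aes()
# and ggplot2 adds a legend automatically
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Setting: one fixed colour for every point, so it goes outside aes()
# as an argument to the geom; no legend is drawn
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue")

p_mapped
p_fixed
```

A common slip is writing aes(colour = "blue"), which treats "blue" as a data value with a single category rather than as a colour name.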

Histograms

Histograms display the distribution of a single continuous variable by dividing the data into bins and showing the frequency of observations in each bin. They help identify the shape of the distribution, central tendency, and spread of your data.

Creating Histograms

# Basic histogram (defaults to 30 bins; set bins or binwidth to choose your own)
gss |>
  ggplot(aes(x = income)) +
  geom_histogram()

# Histogram with custom color and bins
gss |>
  ggplot(aes(x = income)) +
  geom_histogram(fill = "red", bins = 5, color = "black")

# Histogram with labels and theme
gss |>
  ggplot(aes(x = income)) +
  geom_histogram(fill = "red", bins = 5, color = "black") +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title = "Histogram showing Income Distribution",
       caption = "Source: CDAM Experts, 2026") +
  theme_classic()
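
The apparent shape of a histogram depends on how the bins are chosen, so it is worth comparing a few settings. The sketch below contrasts bins (a fixed number of bins) with binwidth (a fixed width in data units). It uses the built-in faithful dataset instead of gss, and the object names are illustrative.

```r
library(ggplot2)

# faithful ships with R; waiting is the time between eruptions in minutes

# Fixed number of bins: ggplot2 works out the width
p_bins <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 5, fill = "red", colour = "black")

# Fixed bin width: each bar covers 5 minutes
p_width <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5, fill = "red", colour = "black")

p_bins
p_width
```

binwidth is often easier to explain in a caption because it is expressed in the variable's own units.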

Bar Charts

Bar charts compare categorical data by displaying the count or summary statistic for each category. They are ideal for comparing values across groups and are one of the most commonly used visualization types.

Creating Bar Charts

# Basic bar chart
gss |>
  ggplot(aes(x = age)) +
  geom_bar()

# Bar chart with custom color
gss |>
  ggplot(aes(x = age)) +
  geom_bar(fill = "blue")

# Bar chart with labels and theme
gss |>
  ggplot(aes(x = age)) +
  geom_bar(fill = "blue") +
  labs(x = "Age in Years",
       y = "No. of Respondents",
       title = "Bar Chart showing Age Distribution",
       caption = "Source: CDAM Experts, 2026") +
  theme_classic()
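
geom_bar() counts rows for each x value. To plot a summary statistic instead, one option is to compute the summary first and draw it with geom_col(), which uses the y values as given. The sketch below assumes the built-in mtcars data and uses base R aggregate() for the summary; avg is an illustrative name.

```r
library(ggplot2)

# Mean miles per gallon for each cylinder count
avg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Treat cylinder count as a category rather than a number
avg$cyl <- factor(avg$cyl)

# geom_col() draws bar heights directly from the y column
p_col <- ggplot(avg, aes(x = cyl, y = mpg)) +
  geom_col(fill = "blue") +
  labs(x = "Number of cylinders",
       y = "Mean miles per gallon",
       title = "Bar chart of a summary statistic") +
  theme_classic()

p_col
```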

Box Plots

Box plots display the distribution of a continuous variable across categories. They show the median, quartiles, and outliers, making them excellent for detecting outliers and comparing distributions across groups.

Creating Box Plots

# Basic box plot
gss |>
  ggplot(aes(x = degree, y = income)) +
  geom_boxplot()

# Box plot with fill color
gss |>
  ggplot(aes(x = degree, y = income, fill = sex)) +
  geom_boxplot()

# Box plot with jitter points
gss |>
  ggplot(aes(x = degree, y = income, fill = sex)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2)

# Box plot with labels and theme
gss |>
  ggplot(aes(x = degree, y = income, fill = sex)) +
  geom_boxplot() +
  labs(x = "Education Qualification",
       y = "Income Levels",
       title = "Box Plot of Income Distribution by Degree",
       caption = "Source: CDAM Experts, 2026") +
  theme_classic() +
  theme(legend.position = "top")
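
geom_boxplot() draws points as outliers when they lie more than 1.5 times the interquartile range beyond the box. The calculation can be reproduced by hand, as in the sketch below. It uses the hp column of the built-in mtcars data as a stand-in for income, and the variable names are illustrative.

```r
# Horsepower values from the built-in mtcars data
x <- mtcars$hp

# Quartiles: the box runs from q[1] to q[3], with the median at q[2]
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# Fences: whiskers stop at the most extreme observations inside these limits
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Observations outside the fences are drawn as individual outlier points
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Checking the flagged values this way helps when deciding whether an outlier is a data entry error or a genuine extreme observation.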

Violin Plots

Violin plots are similar to box plots but also show the kernel probability density of the data at different values. They provide a more detailed view of the distribution shape and are particularly useful for comparing distributions across categories.

Creating Violin Plots

# Basic violin plot (x must be categorical: one violin is drawn per degree level)
gss |>
  ggplot(aes(x = degree, y = income)) +
  geom_violin(fill = "tomato") +
  geom_jitter(alpha = 0.2) +
  theme_classic()
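
A common variant places a narrow box plot inside each violin, so the median and quartiles are visible alongside the density shape. The sketch below uses the built-in mtcars data, with cyl converted to a factor to serve as the categorical variable; p_violin is an illustrative name.

```r
library(ggplot2)

p_violin <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(fill = "tomato") +      # density shape for each group
  geom_boxplot(width = 0.1) +         # narrow box: median and quartiles
  labs(x = "Number of cylinders", y = "Miles per gallon") +
  theme_classic()

p_violin
```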

Faceting: Creating Small Multiples

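Facets split one plot into multiple panels, one for each value of a categorical variable (or each combination of several variables). This small-multiples approach makes it easy to compare distributions or relationships across groups. The examples here use facet_wrap(), which takes a one-sided formula such as ~ sex. A related function, facet_grid(), arranges panels in rows and columns defined by two variables.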

Creating Faceted Plots

# Faceted scatter plot
gss |>
  ggplot(aes(x = age, y = income)) +
  geom_point(color = "blue") +
  facet_wrap(~ sex) +
  labs(title = "Income vs Age", x = "Age", y = "Income") +
  theme_classic()

# Faceted histogram
gss |>
  ggplot(aes(x = income)) +
  geom_histogram(fill = "red", bins = 5, color = "black") +
  facet_wrap(~ sex) +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title = "Histogram showing Income Distribution",
       caption = "Source: CDAM Experts, 2026") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
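
When panels should be split by two variables at once, facet_grid() lays them out as rows and columns. The sketch below is a minimal illustration with the built-in mtcars data; label_both is a ggplot2 labeller that prints the variable name next to each value.

```r
library(ggplot2)

# Rows: transmission type (am); columns: number of cylinders (cyl)
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue") +
  facet_grid(am ~ cyl, labeller = label_both) +
  labs(x = "Weight", y = "Miles per gallon") +
  theme_classic()

p_grid
```

With the gss data, an equivalent call might facet by sex in rows and degree in columns, assuming both columns are categorical.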

Themes and Customization

ggplot2 includes several preset themes (theme_classic(), theme_minimal(), theme_bw()) and allows further customization with the generic theme() function. Themes control grid lines, font styles, background, legend position, and other non-data elements. Place theme() adjustments after any preset theme: a preset theme replaces all theme settings, so a theme() call placed before it is overwritten.
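
The order of theme calls matters, and a small comparison makes this concrete. The sketch below uses the built-in mtcars data. The object names base, p_lost, and p_kept are illustrative.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# theme() tweak first, preset theme second:
# the preset replaces the tweak, so the legend returns to its default position
p_lost <- base +
  theme(legend.position = "top") +
  theme_minimal()

# Preset theme first, theme() tweak second: the tweak is kept
p_kept <- base +
  theme_minimal() +
  theme(legend.position = "top")

p_lost
p_kept
```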

Exporting Plots

You can export plots to various formats using the ggsave() function. By default ggsave() saves the last plot displayed, so it is safer to assign the plot to an object and pass it through the plot argument. You can specify width and height (in inches by default) and resolution (dpi). The file type is determined by the file extension, for example .png or .pdf. Use informative names for output files to make them easily identifiable.

Exporting Plots with ggsave()

# Create and export a plot
plot_1 <- gss |>
  ggplot(aes(x = income)) +
  geom_histogram(fill = "blue", bins = 5, color = "black") +
  facet_wrap(~ sex) +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title = "Histogram showing Income Distribution",
       caption = "Source: CDAM Experts, 2026") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Export the plot
ggsave("Plot_1.png", plot = plot_1, width = 10, height = 6, dpi = 300)
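
The same plot object can be written to more than one format by changing the file extension. The sketch below is illustrative: it uses the built-in mtcars data and writes to R's temporary directory so that it does not depend on the working directory. The file names are made up for the example.

```r
library(ggplot2)

p_out <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Raster output: dpi controls the pixel resolution
out_png <- file.path(tempdir(), "mpg_vs_weight.png")
ggsave(out_png, plot = p_out, width = 10, height = 6, units = "in", dpi = 300)

# Vector output: scales without loss, so dpi is not needed
out_pdf <- file.path(tempdir(), "mpg_vs_weight.pdf")
ggsave(out_pdf, plot = p_out, width = 16, height = 10, units = "cm")

# Confirm that both files were written
file.exists(c(out_png, out_pdf))
```

Setting units explicitly avoids confusion when a journal specifies figure sizes in centimetres.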

Interactive Visualizations

Use the plotly package to create interactive visualizations. The ggplotly() function converts ggplot2 plots to interactive versions, allowing users to hover over data points, zoom, pan, and toggle series on and off. Interactive visualizations are particularly useful for exploratory data analysis and presentations.

Creating Interactive Plots

# Create interactive visualization with plotly
# install.packages("plotly")  # run once if plotly is not yet installed
library(plotly)

# Create a ggplot
p <- gss |>
  ggplot(aes(x = degree, y = income, fill = sex)) +
  geom_boxplot() +
  labs(x = "Education Qualification",
       y = "Income Levels",
       title = "Box Plot of Income Distribution by Degree",
       caption = "Source: CDAM Experts, 2026") +
  theme_classic() +
  theme(legend.position = "top")

# Convert to interactive plot
# layout(boxmode = "group") keeps the boxes for each sex side by side
ggplotly(p) |>
  layout(boxmode = "group")

Homework

Create visualizations for a dataset of your choice using ggplot2. Perform the following tasks: (1) Create a scatter plot to visualize the relationship between two continuous variables. (2) Create a bar plot to display the count or summary of a categorical variable. (3) Create a histogram to show the distribution of a single variable. (4) Create a box plot to summarize the distribution of a continuous variable across categories. (5) Customize your plots with titles, labels, themes, and aesthetic modifications. Bonus: Create faceted plots and export them as high-resolution PNG files.