Session 3

Data Manipulation with dplyr

Master the art of data transformation using the powerful dplyr package. Learn how to filter rows, select columns, create new variables, sort data, and chain operations together to efficiently prepare your data for analysis.

Data Manipulation Overview

Data manipulation in R refers to the process of cleaning, transforming, organizing, and preparing data so it can be analyzed or visualized effectively. In simple terms, it means taking raw data and modifying it to get useful information. Key operations include cleaning data by removing errors and duplicates, sorting and filtering to select what you need, transforming data format, combining data from different sources, and summarizing data to calculate totals and averages.

Key Ideas in Data Manipulation

Data manipulation encompasses several critical operations: Cleaning data removes errors, duplicates, or missing values. Sorting and filtering arranges data or selects only what you need. Transforming data changes its format (e.g., numbers to percentages, text to categories). Combining data merges data from different sources. Summarizing data calculates totals, averages, and other statistics. These operations are widely used in data science, statistics, computer science, and business analytics.

The dplyr Package

dplyr is a powerful R package designed for data manipulation and summarization. It provides a consistent set of verbs (functions) that help you solve the most common data manipulation challenges. The package helps you select variables, filter rows, create new variables, arrange data, and summarize data easily. dplyr uses a pipe operator (|>) to chain operations together, making your code more readable and efficient.

Install and Load dplyr

# Install and load dplyr package
install.packages("dplyr")
library(dplyr)

Setting Up the Environment

Before performing data manipulation, ensure that you have the required libraries installed and your working directory is set correctly. This ensures that R can find your data files and that all necessary functions are available.

Setup and Data Import

# Load the library
library(dplyr)

# Set working directory
setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

# Import dataset
gss <- read.csv("GSSsubset.csv")

Using filter() - Select Rows

The filter() function selects rows that meet specific conditions. This is useful when you want to subset your data based on certain criteria. You can use comparison operators (>, <, ==, !=) and logical operators (& for AND, | for OR) to create complex filtering conditions.

Filter Rows with Conditions

# Example: Filter rows where age > 30
gss_filtered <- gss |>
  dplyr::filter(age > 30)

# Filter with multiple conditions
gss_filtered <- gss |>
  dplyr::filter(age > 30 & income > 50000)

Using select() - Choose Columns

The select() function allows you to choose specific columns from your dataset. This is useful when you want to focus on particular variables or reduce the size of your dataset by removing unnecessary columns. You can select columns by name or use helper functions to select ranges.

Select Specific Columns

# Example: Select only 'sex' and 'income' columns
gss_selected <- gss |>
  dplyr::select(sex, income)

# Select multiple columns
gss_selected <- gss |>
  dplyr::select(degree, income, sex, age)

Using mutate() - Create New Variables

The mutate() function creates new variables or modifies existing ones. This is powerful for deriving new features from your data, such as creating categorical variables from continuous data, calculating ratios, or applying mathematical transformations.

Create and Modify Variables

# Example: Add a new column 'salary_category' based on income
gss_mutated <- gss |>
  dplyr::mutate(salary_category = ifelse(income > 50000, "High", "Low"))

# Create multiple new variables
gss_mutated <- gss |>
  dplyr::mutate(
    salary_category = ifelse(income > 50000, "High", "Low"),
    age_group = ifelse(age > 40, "Senior", "Junior")
  )

Using arrange() - Sort Data

The arrange() function sorts rows based on one or more variables. By default, it sorts in ascending order. Use the desc() function to sort in descending order. Sorting is useful for identifying patterns, finding extreme values, or preparing data for visualization.

Sort Data with arrange()

# Example: Arrange data by descending income
gss_arranged <- gss |> 
  dplyr::arrange(desc(income))

# Sort by multiple columns
gss_arranged <- gss |>
  dplyr::arrange(sex, desc(income))

Using summarize() - Compute Summary Statistics

The summarize() function computes summary statistics for your data. When combined with group_by(), it allows you to calculate statistics for different groups within your dataset. This is powerful for comparing groups and understanding patterns in your data.

Summarize Data by Groups

# Example: Calculate mean income grouped by gender
gss |> 
  dplyr::group_by(sex) |>
  dplyr::summarize(mean_income = mean(income, na.rm = TRUE))

# Multiple summary statistics
gss |>
  dplyr::group_by(sex) |>
  dplyr::summarize(
    mean_income = mean(income, na.rm = TRUE),
    median_income = median(income, na.rm = TRUE),
    count = n()
  )

Chaining Operations with |>

The pipe operator (|>) allows you to chain multiple dplyr operations together into a single pipeline. This makes your code more readable and efficient by avoiding the creation of intermediate variables. Each operation passes its result to the next operation in the chain.

Chain Multiple Operations

# Example: Chain multiple operations
gss_processed <- gss |>
  dplyr::filter(age > 30) |>                # Step 1: Filter rows where age > 30
  dplyr::select(degree, income, sex) |>    # Step 2: Select specific columns
  dplyr::mutate(income_category = ifelse(income > 50000, "High", "Low")) |>  # Step 3: Add a new column
  dplyr::arrange(desc(income))               # Step 4: Arrange by descending income

# View the result
head(gss_processed)

Reorganizing the Dataset

Beyond dplyr, R provides base functions for subsetting and reorganizing data. These include bracket notation for selecting specific rows and columns, the $ operator for accessing columns by name, and logical indexing for conditional subsetting. These techniques are fundamental to data manipulation in R.

Reorganize Data with Base R

# Subsetting data with bracket notation
gss[1,1]                 # first row first column
gss[,1]                  # first column
gss[1,]                  # first row
gss[,1:2]                # first 2 columns

# Subsetting by specific criteria
gss$income               # access income column
gss[gss$income > 100000,]  # rows where income > 100000
gss[gss$degree == "GRADUATE",]  # rows where degree is GRADUATE
gss[gss$marital == "DIVORCED",] # rows where marital status is DIVORCED

Summary Report

Generate comprehensive summary statistics for your dataset using the summary() function. This provides a quick overview of all variables in your dataset, including mean, median, quartiles, and other descriptive statistics. You can also use aggregate() for more customized summaries.

Generate Summary Reports

# Overall summary
summary(gss)

# Aggregate functions for custom summaries
aggregate(income ~ sex, data = gss, mean)  # mean income for each gender
aggregate(income ~ sex, data = gss, max)   # maximum income for each gender
aggregate(income ~ sex + degree, data = gss, mean)  # mean income by gender and education
aggregate(income ~ marital + sex + age, data = gss, mean)  # complex aggregation

Homework

Practice dplyr functions using a car dataset. Perform the following tasks: (1) Filter rows based on specific conditions (e.g., Wheelbase > 110). (2) Select specific columns from the dataset. (3) Create or modify variables using mutate(). (4) Sort rows using arrange(). (5) Compute summary statistics using summarize() grouped by different variables. Combine all these operations into a single pipeline using the pipe operator (|>).