R for Data Science

Session 5

Hypothesis Testing in R

Master the fundamentals of statistical hypothesis testing. Learn to formulate hypotheses, select appropriate tests, and interpret results with confidence. This session covers t-tests, chi-square tests, correlation tests, and assumption checking techniques.

Learning Objectives

By the end of this session, you will be able to: (1) Understand the logic and framework of hypothesis testing. (2) Formulate null and alternative hypotheses correctly. (3) Select appropriate statistical tests for different scenarios. (4) Perform hypothesis tests in R. (5) Interpret results in a statistically sound manner.

Conceptual Foundation

Hypothesis testing is a statistical inference method used to make decisions about a population parameter based on sample data. It answers the fundamental question: Is the observed effect real, or due to random chance? This framework allows researchers to test claims about populations using sample data.

Key Terminology

Understanding key terms is essential for hypothesis testing. The Null Hypothesis (H₀) assumes no effect or no difference. The Alternative Hypothesis (H₁) assumes there is an effect or difference. The Significance Level (α) is the probability of rejecting H₀ when it is true (commonly 0.05). The p-value is the probability, computed assuming H₀ is true, of observing results at least as extreme as those in the sample. The Test Statistic is a value calculated from the sample data that measures how far the data depart from what H₀ predicts. A Type I Error occurs when a true H₀ is rejected (a false positive). A Type II Error occurs when a false H₀ is not rejected (a false negative).
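
The link between α and Type I Error can be illustrated with a small simulation. The sketch below is not part of the GSS analysis: the seed, the sample size of 30, and the number of simulated studies are arbitrary assumptions chosen for illustration. Every sample is drawn with H₀ true, so each rejection is a Type I Error, and the rejection rate should land close to α.

```r
# Simulate many studies in which H0 is true (the population mean really is 0)
set.seed(42)          # arbitrary seed so the run is reproducible
alpha  <- 0.05
n_sims <- 2000        # illustrative number of simulated studies

p_values <- replicate(n_sims, {
  x <- rnorm(30, mean = 0, sd = 1)   # one sample drawn under H0
  t.test(x, mu = 0)$p.value          # p-value for H0: mu = 0
})

# Share of simulated studies that reject H0; each such rejection is a Type I Error
type1_rate <- mean(p_values <= alpha)
type1_rate
```

The exact rate varies from run to run, but with enough simulated studies it should stay near 0.05.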

General Steps in Hypothesis Testing

The hypothesis testing process follows a systematic approach: (1) Define hypotheses (H₀ and H₁). (2) Choose significance level (α, typically 0.05). (3) Select appropriate test based on data type and research question. (4) Compute test statistic from sample data. (5) Calculate p-value. (6) Make decision: If p ≤ α, reject H₀; If p > α, fail to reject H₀.
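
The six steps above can be traced by hand for a one-sample t-test. This is a minimal sketch using made-up values and a hypothesized mean of 30. It computes the test statistic and p-value from their formulas and then cross-checks them against t.test().

```r
# Step 1: H0: mu = 30, H1: mu != 30 (two-sided)
x   <- c(22, 24, 26, 28, 30, 32, 34, 36)   # illustrative sample
mu0 <- 30

# Step 2: choose the significance level
alpha <- 0.05

# Step 3: one numeric sample vs. a constant -> one-sample t-test

# Step 4: test statistic t = (sample mean - mu0) / standard error
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 5: two-sided p-value from the t distribution with n - 1 df
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Step 6: decision rule
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"
decision

# Cross-check against the built-in function
builtin <- t.test(x, mu = mu0)
all.equal(unname(builtin$statistic), t_stat)
all.equal(builtin$p.value, p_value)
```

For these values the sample mean is 29, which is close to 30 relative to the spread of the data, so the decision is to fail to reject H₀.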

Loading and Inspecting Data

Before performing hypothesis tests, you must load and inspect your data. The General Social Survey (GSS) dataset contains categorical variables (gender, marital status, education level, political views) and numerical variables (age, income, years of education). Always check variable types, identify missing values, and understand your data structure before analysis.

Load and Inspect Data

# Load required libraries
library(tidyverse)

# Import dataset
gss <- read.csv("GSSsubset.csv")

# Inspect structure
str(gss)
glimpse(gss)

# Check missing values
colSums(is.na(gss))
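
The same checks can be tried without the CSV file. The sketch below uses a small made-up data frame; the column names and values are assumptions for illustration, not the actual GSS contents. It shows how to check variable types, convert a categorical column to a factor, and see the effect of missing values.

```r
# Toy data frame standing in for a survey extract (hypothetical values)
toy <- data.frame(
  sex = c("Male", "Female", "Female", "Male", NA, "Female"),
  age = c(34, 51, NA, 28, 45, 62),
  stringsAsFactors = FALSE
)

# Variable types: sex is character, age is numeric
sapply(toy, class)

# Store the categorical variable as a factor
toy$sex <- factor(toy$sex)
levels(toy$sex)

# Missing values per column
colSums(is.na(toy))

# A single NA makes the mean NA unless it is removed explicitly
mean(toy$age)
mean(toy$age, na.rm = TRUE)

# Keep only the rows with no missing values
complete <- toy[complete.cases(toy), ]
nrow(complete)
```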

Selecting the Right Test

Different research questions require different statistical tests. For comparing a sample mean to a constant, use a one-sample t-test. For comparing means between two independent groups, use a two-sample t-test. For paired observations (before/after), use a paired t-test. For associations between categorical variables, use a chi-square test. For relationships between two numerical variables, use a correlation test.

One-Sample t-Test

The one-sample t-test compares a sample mean with a hypothesized population value. It answers questions like: Is the average age different from 40? The null hypothesis states that the population mean equals the hypothesized value, and the test produces a t-statistic and p-value that determine whether to reject it.

One-Sample t-Test Examples

# Example: One-sample t-test
sample_data <- c(22, 24, 26, 28, 30, 32, 34, 36)
known_mean <- 30

# Perform one-sample t-test
result <- t.test(sample_data, mu = known_mean)

# Print results
print(result)

# Test if average age is different from 40
t.test(gss$age, mu = 40)
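
The object returned by t.test() is a list, so individual pieces of the output can be pulled out for reporting instead of being read off the printed summary. A minimal sketch, reusing the illustrative vector from the first example:

```r
sample_data <- c(22, 24, 26, 28, 30, 32, 34, 36)
result <- t.test(sample_data, mu = 30)

result$statistic   # t-statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # 95% confidence interval for the population mean
result$estimate    # sample mean

# Apply the decision rule programmatically
alpha <- 0.05
if (result$p.value <= alpha) {
  message("Reject H0: the mean differs from 30")
} else {
  message("Fail to reject H0: no evidence the mean differs from 30")
}
```

Here the confidence interval contains 30, which agrees with a p-value above 0.05.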

Two-Sample t-Test

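The two-sample t-test compares means between two independent groups. This test answers questions like: Does income differ by gender? It determines whether the difference between two group means is statistically significant. By default, t.test() runs Welch's version of the test, which does not assume equal variances in the two groups, so the degrees of freedom are often not a whole number. The output includes means for each group, a confidence interval for the difference in means, the t-statistic, degrees of freedom, and the p-value.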

Two-Sample t-Test Examples

# Example: Two-sample t-test with vectors
group1 <- c(22, 24, 26, 28, 30)
group2 <- c(32, 34, 36, 38, 40)

# Perform two-sample t-test
result <- t.test(group1, group2)

# Print results
print(result)

# Test if income differs by gender using formula notation
t.test(income ~ sex, data = gss)

# Interpretation example:
# A two-sample (Welch) t-test was conducted to compare income between genders.
# The results showed a statistically significant difference
# (t = -8.9504, df = 825.42, p-value < 2.2e-16).
# Therefore, we reject the null hypothesis and conclude that mean income differs by gender.
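
The fractional degrees of freedom in the interpretation example come from Welch's correction, which t.test() applies by default. The sketch below uses made-up groups with clearly unequal spread (an assumption for illustration) to compare the default with the pooled-variance version requested via var.equal = TRUE.

```r
# Two illustrative groups with very different variances
group_a <- c(22, 24, 26, 28, 30)
group_b <- c(20, 30, 40, 50, 60, 70)

# Default: Welch's t-test (does not assume equal variances)
welch <- t.test(group_a, group_b)

# Pooled-variance (Student's) t-test, which assumes equal variances
pooled <- t.test(group_a, group_b, var.equal = TRUE)

welch$method
welch$parameter    # fractional df, smaller than n1 + n2 - 2
pooled$parameter   # n1 + n2 - 2 = 9

# The two versions can give different p-values when variances differ
c(welch = welch$p.value, pooled = pooled$p.value)
```

When the group variances look similar the two versions usually agree closely. When they differ, the Welch result is generally the safer one to report.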

Paired t-Test

The paired t-test compares means of two related samples, such as before and after measurements on the same subjects. This test is appropriate when observations are dependent. It measures whether there is a significant difference in the paired observations.

Paired t-Test Examples

# Example: Paired t-test
before <- c(120, 130, 125, 140)
after <- c(115, 128, 120, 135)

# Perform paired t-test
result <- t.test(before, after, paired = TRUE)

# Print results
print(result)

# Another example with illustrative before/after scores
before_scores <- c(21, 24, 26, 28, 30)
after_scores <- c(24, 26, 27, 30, 32)

t.test(before_scores, after_scores, paired = TRUE)
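
A paired t-test is equivalent to a one-sample t-test on the within-subject differences, which is why it suits dependent observations. A minimal sketch with the same illustrative before/after values:

```r
before <- c(120, 130, 125, 140)
after  <- c(115, 128, 120, 135)

# Paired t-test on the two related samples
paired_result <- t.test(before, after, paired = TRUE)

# One-sample t-test on the differences, testing H0: mean difference = 0
differences <- before - after
diff_result <- t.test(differences, mu = 0)

# Both approaches give the same statistic and p-value
all.equal(unname(paired_result$statistic), unname(diff_result$statistic))
all.equal(paired_result$p.value, diff_result$p.value)

mean(differences)   # the estimated mean difference
```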

Chi-Square Test for Independence

The chi-square test determines whether there is a significant association between two categorical variables. It tests whether the variables are independent or dependent. This test is appropriate for categorical data organized in contingency tables.

Chi-Square Test Examples

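# Example: Chi-square test with a 2 x 2 matrix of counts
# (named `observed` to avoid masking the base R function data())
observed <- matrix(c(10, 20, 30, 40), nrow = 2)

# Perform chi-square test
# (for 2 x 2 tables, chisq.test() applies Yates' continuity correction by default)
result <- chisq.test(observed)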

# Print results
print(result)

# Test association between education level and marital status
table_data <- table(gss$degree, gss$marital)

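chi_result <- chisq.test(table_data)
print(chi_result)

# Interpretation:
# If p-value < 0.05, reject H0: there is evidence that the variables are associated
# If p-value >= 0.05, fail to reject H0: there is not enough evidence of an association
# (this does not prove that the variables are independent)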
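
The chi-square approximation depends on the expected cell counts being reasonably large. A common rule of thumb (stated here as a guideline, not a strict requirement) is that all expected counts should be at least 5. The sketch below uses made-up tables, with hypothetical row and column labels, to inspect the expected counts and to show Fisher's exact test as one option for small tables.

```r
# Illustrative 2 x 2 table of counts (labels are hypothetical)
observed <- matrix(
  c(10, 20, 30, 40), nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("Yes", "No"))
)

chi_result <- chisq.test(observed)

# Expected counts under independence: (row total * column total) / grand total
chi_result$expected
all(chi_result$expected >= 5)

# 2 x 2 tables get Yates' continuity correction by default
chi_result$method
uncorrected <- chisq.test(observed, correct = FALSE)

# For a small table with low expected counts, Fisher's exact test is an alternative
small_table   <- matrix(c(3, 1, 1, 4), nrow = 2)
fisher_result <- fisher.test(small_table)
fisher_result$p.value
```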

Correlation Test

The correlation test measures the strength and significance of the linear relationship between two numerical variables. It answers questions like: Is age correlated with income? The test produces a correlation coefficient (r) ranging from -1 to 1 and a p-value indicating statistical significance.

Correlation Test Examples

# Example: Correlation test
# Test if age is correlated with income
cor.test(gss$age, gss$income)

# Interpretation:
# Correlation coefficient (r): strength and direction of relationship
# p-value: statistical significance
# If p < 0.05, the correlation is statistically significant
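
The pieces of a correlation test can be extracted in the same way as for a t-test. The sketch below uses simulated age and income values; the seed, sample size, and the built-in linear relationship are assumptions for illustration, not GSS data. It also shows Spearman's rank correlation, which cor.test() offers through its method argument for relationships that are monotonic but not linear.

```r
# Simulated data with a built-in positive linear relationship
set.seed(1)
age    <- runif(50, min = 20, max = 70)
income <- 20000 + 500 * age + rnorm(50, sd = 5000)

# Pearson correlation test (the default method)
pearson <- cor.test(age, income)
pearson$estimate   # correlation coefficient r
pearson$conf.int   # 95% confidence interval for r
pearson$p.value

# The estimate matches cor() applied to the same vectors
all.equal(unname(pearson$estimate), cor(age, income))

# Rank-based alternative
spearman <- cor.test(age, income, method = "spearman")
spearman$estimate  # Spearman's rho
```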

Assumption Checking

Statistical tests have underlying assumptions that should be verified before analysis. t-tests assume that the data in each group (or, for a paired test, the differences) are approximately normally distributed, which matters most when samples are small. The Shapiro-Wilk test can formally test normality, or you can check it visually with histograms and Q-Q plots. Equal variance between two groups can be tested with the F-test. If assumptions are violated, non-parametric alternatives like the Wilcoxon test can be used.

Checking Statistical Assumptions

# Check normality using Shapiro-Wilk test
# (shapiro.test() accepts between 3 and 5000 non-missing values)
shapiro.test(gss$age)

# Visual check for normality
hist(gss$age)
qqnorm(gss$age)
qqline(gss$age)

# Test for equal variance (F-test)
var.test(income ~ sex, data = gss)

# Non-parametric alternative (Wilcoxon test)
wilcox.test(income ~ sex, data = gss)
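
These checks can be combined into a simple workflow for a two-group comparison. The sketch below runs on simulated data. The group names, seed, and the 0.05 cut-offs for the preliminary tests are assumptions for illustration, and choosing a test from preliminary checks is one common classroom approach rather than the only defensible one.

```r
# Simulated two-group data (hypothetical groups and values)
set.seed(123)
scores <- data.frame(
  group = rep(c("control", "treatment"), each = 25),
  value = c(rnorm(25, mean = 50, sd = 5), rnorm(25, mean = 55, sd = 5))
)

# Normality is checked within each group, not on the pooled values
normality_p <- tapply(scores$value, scores$group,
                      function(v) shapiro.test(v)$p.value)
normality_p

# F-test for equal variances between the two groups
variance_p <- var.test(value ~ group, data = scores)$p.value

if (all(normality_p > 0.05)) {
  # Normality looks plausible: use a t-test, pooling variances
  # only if the F-test gives no evidence against equal variances
  chosen <- t.test(value ~ group, data = scores,
                   var.equal = variance_p > 0.05)
} else {
  # Normality is doubtful: fall back to the Wilcoxon rank-sum test
  chosen <- wilcox.test(value ~ group, data = scores)
}

chosen$method
chosen$p.value
```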

Practical Exercise

Exercise

Apply your hypothesis testing skills to real research questions. Test whether the mean age differs from 35. Determine if income differs between two education levels (a t-test compares exactly two groups, so restrict the data to two categories first). Check for association between gender and marital status. Compute the correlation between age and education years. For each analysis, clearly state your hypotheses, perform the appropriate test, and interpret the results.

Homework Assignment

Homework

Part A - Hypothesis Development: Formulate 3 research questions from the GSS dataset. Define H₀ and H₁ clearly for each question. Part B - Implementation in R: Perform one t-test, one chi-square test, and one correlation test. Part C - Interpretation: For each test, report the test statistic, p-value, decision (reject or fail to reject H₀), and provide a real-world interpretation of your findings.