Correlation and Regression Analysis
Master the techniques for analyzing relationships between variables. Learn to measure correlation strength, build predictive models with regression, and interpret results for real-world insights. This session covers correlation analysis, simple and multiple linear regression, and model diagnostics.
Introduction to Correlation Analysis
Correlation measures the strength and direction of a linear relationship between two variables. Correlation coefficients range from -1 to +1. A coefficient of +1 indicates a perfect positive relationship where both variables increase together. A coefficient of -1 indicates a perfect negative relationship where one increases while the other decreases. A coefficient of 0 indicates no linear relationship. Common correlation measures include Pearson's correlation coefficient (r) for continuous, normally distributed data, Spearman's rank correlation for ranked/ordinal data, and Kendall's tau for ordinal data with ties.
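Of these three measures, only Pearson and Spearman are demonstrated in detail below, so here is a minimal sketch of Kendall's tau using the same base-R interface; the x and y vectors are illustrative toy values, not data from this session.
# Example: Kendall's tau (illustrative toy data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 7, 6)
# Calculate Kendall's tau
kendall_corr <- cor(x, y, method = "kendall")
print(paste("Kendall's Tau:", kendall_corr))
# Test significance of Kendall's tau
kendall_test <- cor.test(x, y, method = "kendall")
print(kendall_test)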
Pearson Correlation
Pearson's correlation coefficient measures the linear relationship between two continuous variables. It is the most commonly used correlation measure; its significance test assumes the data are approximately normally distributed. The coefficient ranges from -1 to +1, with values closer to ±1 indicating stronger relationships. Pearson correlation is sensitive to outliers and assumes a linear relationship.
Calculating Pearson Correlation
# Example: Pearson correlation
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6)
)
# View data
head(data)
# Calculate Pearson correlation
pearson_corr <- cor(data$x, data$y, method = "pearson")
print(paste("Pearson Correlation:", pearson_corr))
# Test significance of Pearson correlation
pearson_test <- cor.test(data$x, data$y, method = "pearson")
print(pearson_test)
Spearman Rank Correlation
Spearman's rank correlation measures the monotonic relationship between two variables. Because it is based on ranks rather than raw values, it is robust to outliers and can detect monotonic relationships that are not strictly linear. It is appropriate for ordinal data or when the linearity assumption behind Pearson correlation is in doubt.
Calculating Spearman Correlation
# Example: Spearman correlation
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6)
)
# Calculate Spearman correlation
spearman_corr <- cor(data$x, data$y, method = "spearman")
print(paste("Spearman Correlation:", spearman_corr))
# Test significance of Spearman correlation
spearman_test <- cor.test(data$x, data$y, method = "spearman")
print(spearman_test)
Introduction to Regression Analysis
Regression models the relationship between a dependent variable (response) and one or more independent variables (predictors). The purpose is to explain and predict how changes in X affect Y. Types of regression include Simple Linear Regression (one predictor, equation: Y = a + bX), Multiple Linear Regression (several predictors), Logistic Regression (for categorical outcomes), and Polynomial Regression (for nonlinear relationships). Regression coefficients are interpreted as the average change in Y for each unit change in X.
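Simple and multiple linear regression are demonstrated in the sections that follow. Logistic and polynomial regression are not revisited, so here is a minimal hedged sketch using base R's glm() and poly(); the vectors y_binary and y_curved are made-up illustrations, not data from this session.
# Example: logistic and polynomial regression (illustrative toy data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_binary <- c(0, 0, 0, 1, 0, 1, 1, 1) # a binary (0/1) outcome
y_curved <- c(2, 5, 9, 16, 24, 35, 48, 62) # a nonlinear response
# Logistic regression for a binary outcome
logit_model <- glm(y_binary ~ x, family = binomial)
summary(logit_model)
# Polynomial regression (degree 2) for a nonlinear relationship
poly_model <- lm(y_curved ~ poly(x, 2))
summary(poly_model)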
Simple Linear Regression
Simple linear regression models the relationship between one independent variable (X) and one dependent variable (Y). The regression equation is Y = a + bX, where a is the intercept and b is the slope. The intercept represents the expected value of Y when X is zero. The slope represents the average change in Y for each unit increase in X. The lm() function in R fits linear regression models.
Fitting Simple Linear Regression
# Example: Simple linear regression
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6)
)
# Fit the simple linear regression model
model <- lm(y ~ x, data = data)
# Print summary of the model
summary(model)
# Extract model parameters
model$coefficients
summary(model)$r.squared
# Interpretation (the fitted equation for this data is y = 1.5 + 1.1*x):
# Intercept (1.5): Expected y when x = 0
# Slope (1.1): Average increase in y for each unit increase in x
Multiple Linear Regression
Multiple linear regression extends simple linear regression to include multiple independent variables. The model equation is Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ. Each coefficient represents the average change in Y for each unit change in that variable, holding other variables constant. Multiple regression allows you to model more complex relationships and control for confounding variables.
Fitting Multiple Linear Regression
# Example: Multiple linear regression
data <- data.frame(
  y = c(2, 4, 5, 4, 6),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(3, 5, 7, 6, 8)
)
# Fit the multiple linear regression model
model <- lm(y ~ x1 + x2, data = data)
# Print summary of the model
summary(model)
# Extract coefficients
summary(model)$coefficients
# View R-squared value
summary(model)$r.squared
Interpreting Regression Results
Regression output includes several important components: Coefficients show the estimated parameters (intercept and slopes). Standard errors indicate the precision of coefficient estimates. t-values and p-values test whether coefficients are significantly different from zero. R-squared measures the proportion of variance in Y explained by the model (ranges from 0 to 1). Adjusted R-squared accounts for the number of predictors. F-statistic tests overall model significance.
Understanding Regression Output
# Example: Interpreting regression results
# Assume we have fitted a model
# View coefficients with standard errors, t-values, and p-values
summary(model)$coefficients
# View R-squared value (proportion of variance explained)
summary(model)$r.squared
# View adjusted R-squared
summary(model)$adj.r.squared
# View F-statistic (value and degrees of freedom; no p-value is stored)
fstat <- summary(model)$fstatistic
fstat
# Compute the overall model p-value from the F-statistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
# Interpretation guidelines:
# - Coefficient p-value < 0.05: Variable is statistically significant
# - R-squared: Higher values indicate better model fit
# - F-statistic p-value < 0.05: Overall model is significant
Model Diagnostics
Model diagnostics help assess whether regression assumptions are met. Residual plots show the distribution of prediction errors. The Q-Q plot checks normality of residuals. The scale-location plot checks homogeneity of variance. The residuals vs. fitted plot identifies patterns or non-linearity. Variance Inflation Factor (VIF) detects multicollinearity among predictors. Always perform diagnostics to ensure your model is valid.
Performing Model Diagnostics
# Example: Model diagnostics
# Assume we have fitted a model
# Generate diagnostic plots
par(mfrow = c(2, 2)) # Set up a 2x2 plot layout
plot(model) # Generate four diagnostic plots
par(mfrow = c(1, 1)) # Reset the plot layout
# Check for multicollinearity using VIF (requires two or more predictors)
library(car) # install.packages("car") if the package is not yet installed
vif(model)
# Interpretation guidelines:
# - Residuals should be randomly scattered around zero (no patterns)
# - Q-Q plot: Points should follow the diagonal line (normality)
# - Scale-location plot: Points should be randomly scattered (homogeneity)
# - VIF < 5: Generally acceptable (no multicollinearity)
# - VIF > 10: Indicates a multicollinearity problem
Model Comparison and Selection
When building regression models, you often need to compare several candidate models and select the best one. Criteria for comparison include R-squared (higher is better), adjusted R-squared (which penalizes complexity), AIC (Akaike Information Criterion), and BIC (Bayesian Information Criterion); for AIC and BIC, lower values are better. The anova() function can compare nested models, and cross-validation assesses model performance on new data.
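A minimal sketch of these comparisons, assuming the data frame (y, x1, x2) from the multiple regression example above; the closing predict() call uses made-up new values (x1 = 6, x2 = 9) to illustrate assessing performance on unseen data.
# Example: comparing nested models
model_simple <- lm(y ~ x1, data = data)
model_full <- lm(y ~ x1 + x2, data = data)
# F-test comparing the nested models
anova(model_simple, model_full)
# Information criteria (lower values indicate a better trade-off)
AIC(model_simple, model_full)
BIC(model_simple, model_full)
# Adjusted R-squared penalizes the extra predictor
summary(model_simple)$adj.r.squared
summary(model_full)$adj.r.squared
# Predict the response for new data with the chosen model
predict(model_full, newdata = data.frame(x1 = 6, x2 = 9))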
Regression Assumptions
Linear regression relies on several key assumptions: (1) Linearity: The relationship between X and Y is linear. (2) Independence: Observations are independent. (3) Normality: Residuals are normally distributed. (4) Homoscedasticity: Residuals have constant variance. (5) No Multicollinearity: Predictors are not highly correlated. Violations of these assumptions can lead to invalid conclusions. Always check assumptions using diagnostic plots and tests.
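As a complement to the diagnostic plots shown earlier, here is a minimal sketch of formal tests for assumptions (2), (3), and (4), assuming a fitted model object and the car package; on toy data this small, such tests have very little power.
# Example: formal tests of regression assumptions
# Normality (3): Shapiro-Wilk test on the residuals
shapiro.test(residuals(model))
# Homoscedasticity (4): non-constant variance test from the car package
library(car)
ncvTest(model)
# Independence (2): Durbin-Watson test, also from car
durbinWatsonTest(model)
# p-values above 0.05 are consistent with the assumption holding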
Practical Applications
Regression analysis is widely used in real-world applications: predicting house prices from features such as size and location, forecasting sales from advertising spending, analyzing the effect of education on income, understanding the factors that drive customer satisfaction, and modeling disease risk from patient characteristics. Each application requires careful model building, assumption checking, and result interpretation.
Homework Assignment
Using the GSS dataset or a dataset of your choice: (1) Calculate correlation coefficients between at least three pairs of variables using both Pearson and Spearman methods. (2) Build a simple linear regression model and interpret the coefficients, R-squared, and p-values. (3) Build a multiple linear regression model with at least three predictors. (4) Perform model diagnostics and check regression assumptions. (5) Compare the simple and multiple regression models and discuss which is better and why. (6) Provide real-world interpretations of your findings.