
Research Methods Primer
Professor
Department of Political Science
011C Calvin Hall
meflynn@ksu.edu
2024-09-17
What do we want to get out of today’s class?
How you’ll want to read it:
How you’ll want to read it:
How you should read it (as a rough guide):
We use lots of different methods in the social sciences, but let’s look at a few key approaches/tools:
First, what are our goals?
Prediction
Overall model fit
In-sample fit?
Out-of-sample fit?
Description
Overall model fit
Individual coefficients
Associations between groups
Causal Inference
What is a variable?
A variable is a characteristic, trait, attribute, or condition that can be measured or observed
A “definable quantity that can take on two or more values” (Kellstedt and Whitten 2018, 22).
Examples:
What is a coefficient?
\[\begin{equation} Y_i = \alpha + {\color{blue} \beta_1} X_i + {\color{blue}\beta_2} Z_i + \epsilon_i \end{equation}\]
The correlation between \(X\) and \(Y\), conditional on other predictors and model
Typically denotes the magnitude of the change in \(Y\) that corresponds to a one-unit change in \(X\)
In some cases this is direct and easy, less so in other cases
Different models and measurement schemes mean we need to be cautious in interpreting coefficients

Bias: Systematic error of some sort that can affect our inferences and conclusions (Howard 2017, 151)
What it’s not:
What is confounding?
Confounding is when the predictor (i.e. X) and the outcome (i.e. Y) share a common ancestor (i.e. Z).
If we don’t take into account the relationship between Z and the other variables, our estimates of X are biased.
The most common way to do this is to include Z into our model
\[\begin{align} Y & = \alpha + \beta_1 X_i + \epsilon_i \\ vs \\ Y & = \alpha + \beta_1 X_i + \beta_2 Z_i + \epsilon_i \end{align}\]

Let’s imagine we’re interested in the relationship between gas pedal pressure and speed, so we collect data on
Call:
lm(formula = speed ~ pedal_pressure, data = cardata)
Coefficients:
(Intercept) pedal_pressure
70.0441843 -0.0004776

Confounding can mask causal relationships
We need to understand the data generating process to help us make sense of results that may or may not themselves make sense
Adjusting for confounders in our model can reduce bias and help us better estimate causal relationships
Call:
lm(formula = speed ~ pedal_pressure + incline, data = cardata)
Coefficients:
(Intercept) pedal_pressure incline
45.110 1.244 -1.492

What is it and why does it matter?

The underlying population matters!
The rate of an event may vary across groups
Knowing something about the reference population is key
Another way to think of this is the symmetry/asymmetry of conditional probabilities
\[Pr(H \mid V) \neq Pr(V \mid H)\]

What is it and why does it matter?
What is it and why does it matter?
Simpson’s Paradox is a special case of confounding
Simple bivariate relationship between X and Y may appear to show one thing
But when we adjust for other relevant variables (i.e. Z) the apparent relationship between X and Y reverses
Common in lots of fields of study
Example:
Imagine we have data on the exercise activity and cholesterol of a few hundred adults
We plot the individuals’ cholesterol levels against their level of exercise (we’ll assume we directly observe this and it’s not self-reported)
In the plot we see that exercise appears to correlate positively with cholesterol levels. A simple regression model supports this eyeball estimate.
Call:
lm(formula = y ~ x, data = data2)
Coefficients:
(Intercept) x
182.372 2.078

Example:
But what if we’re omitting some key variables?
It’s possible for results to reverse if we’re omitting a variable that could also affect cholesterol levels
Another way to say this is that the effect of exercise might be biased because we’re leaving other important things out of the model.
One good variable would be patient age, so let’s see what that looks like
Call:
lm(formula = y ~ x + age, data = data2)
Coefficients:
(Intercept) x ageAge 2 ageAge 3 ageAge 4 ageAge 5
193.1930 -0.6774 4.2166 8.1891 11.8306 15.1400

What is collider bias?
Collider bias can be introduced by adjusting for a common effect.
It can also be a form of selection bias.
Similar to Simpson’s Paradox it can make effects appear to be the opposite of what they are, but also depends on the population of interest.
In this example Z is a collider because both X and Y cause Z.

Example: College Admissions
Let’s assume we’re interested in looking at students who have been admitted to college to see what the relationship is between their high school extracurricular activities and GPA
We collect data on college freshman and compare their extracurricular activities in high school with their GPA.
Surprisingly, we find a negative relationship.
That’s sad.

Example: College Admissions
But wait!
If we’re looking at students who have already been admitted to college we’re implicitly conditioning on a collider variable (i.e. acceptance/admission status)
If we step back and look at the entire population of high school students we find something different
Students who have better grades and/or better extracurricular activities are more likely to be admitted to college in the first place.

What is theory?
Let’s start with what it’s not
What is theory?
More accurately, theory is the story we tell about which variables are important and how those variables are related.
It both describes and explains how variables affect one another.
It also allows us to make predictions about what we should expect to see in various scenarios
Theory: A tentative conjecture about the causes of some phenomenon of interest (Kellstedt and Whitten 2019, 22)
Theory: A logically consistent set of statements that explains a phenomenon of interest (Frieden, Lake, and Schultz 2022, xxix)