library(tidyverse)
library(knitr)
# pseudo-data for first-gen students
model_data <- read_rds("data/model_data.rds")
Regression for Humans!
Data Leadership
This ThIRsdays project contributes to the larger idea of “data leadership.” In IR work we create a lot of data products, including regulatory data uploads, dashboards for administrators, operational reports on surveys and outcomes, and bespoke analyses that address some area of interest. In my experience, this work doesn’t often affect institutional strategy. It’s more common for a university to outsource strategic analysis, which is expensive and often substandard.
I think there are many opportunities for good data analysis to help guide decisions toward better outcomes, and that most institutions could better integrate IR functions into such deliberations. However, there are significant challenges to be overcome, and this is what I’m calling data leadership. This is a work in progress, which I hope will be developed by the IR community as a whole, since I certainly don’t have all the answers.
Potential barriers to the effective use of analysis include the following.
The IR office doesn’t have the capacity to spend time on deep analysis, because it’s consumed with standard reports.
The data and/or analysis isn’t obviously usable to decision-makers. This problem is partly technical and partly psychological.
The IR office is part of a social hierarchy that’s lower on the org chart than the cabinet. This means that data products won’t be understood on their own terms, but in this social milieu. For example, any ideas from IR are subject to a complicated distribution of blame if things go wrong.
The main ThIRsdays project and associated resources and certificate classes are aimed at the first two of these issues, helping you switch from using Excel and SPSS to automated scripts with R/Tidyverse. If IR offices can work more efficiently, freeing time to be spent on deeper problems, the content of reports can become more useful, and you’ll be more comfortable with the statistical methods entailed. That’s the technical part. But there’s also a psychological issue in how we ourselves understand such an analysis and how we communicate it to the cabinet in a way that seems useful. That’s what today’s topic is about–this intersection of data sophistication with the need to be understandable.
In the bigger picture, my hope is that we can provide a natural pathway from IR offices to the cabinet; that more IR directors become vice presidents and can lead strategy laterally instead of upwards through the org chart.
Problem 1: Oversimplification
One of the problems we should try to solve is oversimplification. This occurs when decision-makers rely on a model of reality that isn’t sophisticated enough to capture the important features of the problem. We have too much data and not enough time, and it’s easy for an organization to become comfortable with data presentations that are the equivalent of junk food. My evidence for this is the reflexive request for a “dashboard” to inform decision-makers at every level. It seems that sophistication in a dashboard means more complexity: more data sources, more filters and ways to show aggregate statistics. There’s nothing wrong with a dashboard as long as we don’t mistake it for analysis, but that’s what tends to happen.
For example, if a dashboard shows that first generation students have lower GPAs than others, the conclusion is naturally that the lower grades are because of first-gen status. That is, instead of being read as merely descriptive, oversimplified data summaries can lead the unwary to conclusions about causality that aren’t justified. A dashboard that showed the frequencies of eating ice cream and sunburn would show a strong relationship. Does eating ice cream cause sunburn? No: both are caused by sunshine and its effects on behavior. Similarly, those first-gen students are receiving lower grades in part because of academic and financial factors that are more direct causes than their parents’ education.
More complex understandings of interaction allow us to contextualize issues like GPA gaps within a general model of how grades are determined. So part of the problem we face is to wean leaders off of junk food data. We also have to become proficient at the more sophisticated analysis that can help them make better decisions.
Problem 2: Magic
When humans are faced with complex decisions, we have a tendency to look for a magic solution, by which I mean a method we don’t really understand. If we oversell the statistical techniques by impressing decision-makers with our Bayesian posterior distributions, or otherwise use technical jargon that makes the results seem more certain than they are, we’re using magic. Consultants are very good at this, and they have the added advantage of an external source to blame if things go wrong.
Phrases like “data-driven decision-making” are a subtle form of magic, because they oversell the data, implying that decisions will be easy-peasy once we have the right analysis in hand. I don’t think that’s true except in rare cases. Another bit of magic is a data warehouse advertised as “one source of Truth.” Philosophers have worked on that problem for over two thousand years and still don’t know what Truth is.
Proposed Solution
As I’ll explore below, the idea is to avoid data junk food, but also avoid magical solutions. This means being sophisticated, but also transparent about limitations. Ideally this presents decision-makers with the right level of analysis, so they can use their best judgment on it. There’s a plot twist to this that involves Daniel Kahneman, which I’ll come to later.
The Flaw of Averages
The main way we oversimplify a data summary is by using averages or proportions (a proportion is just the average of a binary variable). This presentation isn’t to discourage you from using averages, but to open up a new way to think about them that can unlock a trick for communicating to the leadership. There’s actually a book called The Flaw of Averages, which gives examples of an analysis going terribly wrong through a chain of reasoning that uses averaging throughout, instead of considering the whole distribution of each variable. This problem stems from the limited extrapolations we can draw from averages. For example, the average of \(\{1, 2, 3, 4, 5\}\) is 3, and if we square that we get 9. But if we first square each of the numbers and then take the average, we get 11 instead. The point is that simple numerical transformations on a column of data won’t be reflected in an average or proportion in an intuitive way1. This includes multiplying variables.
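A quick check in R, using just the numbers from the example:

# squaring then averaging is not the same as averaging then squaring
x <- 1:5
mean(x)^2   # square of the average: 9
mean(x^2)   # average of the squares: 11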
It’s natural to think that if we have two variables, \(X\) and \(Y\), then the average of their product is the product of their averages. For example, we might predict next year’s revenue by multiplying predicted student enrollment by the average tuition revenue per student. But if enrollment and revenue are correlated at the student level, this calculation will be biased. For example, if students who pay more are more likely to drop out, then the realized revenue will be lower than the product of the averages. To account for these complexities, we need a more sophisticated understanding of how enrollment and revenue interact than simple averages.
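Here’s a simulated sketch of that revenue example. The numbers and the retention function are made up, but the point carries: when per-student revenue and the chance of returning are negatively correlated, the product-of-averages shortcut overestimates revenue.

# hypothetical data: higher-paying students are more likely to leave
set.seed(42)
n <- 1000
net_tuition <- runif(n, 5000, 45000)           # per-student net tuition
p_return    <- plogis(2 - net_tuition / 20000) # probability of returning next year

sum(net_tuition * p_return)        # expected revenue, student by student
mean(net_tuition) * sum(p_return)  # average tuition x expected enrollment: too high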
Beyond the mathematical issues with using averages, the degree of data compression in calculating an average is often too limiting for decision-making. That’s because effective decisions rely on answers to “why” questions, and averages don’t answer those. For example, if we take the average GPA of first-year students, we can say that the average is 3.0, but we can’t say why it’s 3.0. We can only say that some students earned a 4.0, some earned a 2.0, and so on. Beyond that is a mystery.
Lurking behind the average is some interesting story, or lots of them. For example, maybe if we looked more closely we’d find that students admitted with weaker credentials tend to pick easier courses and easier majors. Maybe we’d see changes in grading patterns over time that can’t be explained by student characteristics. Maybe the course sequence from Calculus 1 to Calculus 4 shows grade patterns that don’t make sense, like students earning an A in the first course who fail the second. We might wonder about things like the relationship between sense of belonging on campus (via surveys) and grades, or work-study and grades, or financial pressures and grades. These are all examples of useful questions we could ask, but they can’t be answered with simple averages.
The important step is to think in terms of models that specify a relationship between the original data and our summary of it. If we take an average of first year grades, the number we get answers the following model question: “Suppose all students earned the same grade the first year. What (common) grade did they earn?” The model is to assume that the data distribution takes on a single value. If we ask for the average age of a first-year student at a residential college, that model isn’t terrible, since most students will be 18 years old. But in most cases, this model is oversimplified. The go-to upgrade is a regression analysis.
Regression
It’s easy to think of linear (or other) regression as a special technique with a lot of fiddly tests we have to apply. This unfortunate idea is often learned in graduate school, and it’s what we see in journals: lots of ritual that amounts to another kind of magic. I’d like to encourage you to think of regression as the main tool for analysis for most problems. Use it reflexively. Do regression all day long, until you’re so comfortable with the process of preparing data, trying out models, and interpreting results that it’s second nature. This is the skill you’ll need to graduate from junk food models. The bonus is that it’s quite easy to do in R; most of the work is prepping the data.
An average is just a very simple regression model. If we take the numbers from 1 to 5 and average them, we get 3. We’re asking the model question “if all the numbers are the same, what’s that unique value?” Expressed as a regression model in math terms, this is
\[ y_i = \beta + \varepsilon_i \]
where \(y_i\) is the data we’re modeling (a column in a data sheet, usually), so that \(y_1 = 1, y_2 = 2, ...\) . The \(\beta\) is a constant, that is, a fixed number (the average) to be estimated, and it represents our model of the data. The last term, \(\varepsilon_i\), hedges our bets in case the model doesn’t exactly match the data. If the numbers aren’t all the same, there will be a mismatch between the model and the data, so \(\varepsilon_1\) is the error that captures the difference between the first data point and the predicted value. We can think of the \(\varepsilon_i\) as a term for “other causes we can’t explain.”
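We can verify the “average as intercept-only regression” claim with a tiny sketch using the numbers from before:

toy <- data.frame(y = 1:5)
lm(y ~ 1, data = toy) |> coef()  # intercept is 3
mean(toy$y)                      # also 3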
I’ll use some data on first generation students as an example. In the data set of 2000 students, the first year GPA for each is averaged to create a total average of 3.22. The model question is “if all students had the same GPA, what would it be?” The difference between the average and the real value for a given student is “due to other causes than the GPA being constant.”
graph TD
  B[Students are identical] --> A[Student GPA]
  C[Other explanations] --> A
This kind of model is okay for quickly conveying information about groups, but not good enough for decision-making. Knowing the average annual temperature in your town is not sufficient information to plan a day at the park.
The model in Figure 1 can be written and computed as a regression model, and we can estimate it in R with the lm() function. The code below shows how to do this for the student data.
lm(Y1GPA ~ 1, data = model_data) |> summary()
Call:
lm(formula = Y1GPA ~ 1, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-3.2159 -0.3159 0.0841 0.4841 0.7841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.21590 0.01328 242.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5939 on 1999 degrees of freedom
The model specified in the code above is “Y1GPA ~ 1”, where the 1 means “some constant.” This explicitly states the model: we want to assume all the values are the same.
You may be familiar with the formula for standard error of the mean. It’s calculated from the standard deviation of the data, \(SD_y\), and the number of observations, \(N_y\) with
\[ SE = \frac{SD_y}{\sqrt{N_y}}. \]
The standard error of the mean is a measure of how much we expect the average to vary from sample to sample, and is used to construct confidence intervals around the average, e.g. plus or minus two standard errors. In this example, we can calculate this in R with
model_data |>
  summarise(N = n(),
            mean = mean(Y1GPA),
            SE = sd(Y1GPA)/sqrt(N)) |>
  kable(digits = 3)
N | mean | SE |
---|---|---|
2000 | 3.216 | 0.013 |
Notice that we get the same values we did with the regression model that assumes \(Y1GPA\) is a constant. The point is that taking an average is the same as creating the simplest-possible regression model, where we imagine that all the data is constant.
When we calculate an average or proportion, we’re doing regression.
There’s a standard way to visualize a regression model, by comparing predicted values to actuals with a scatterplot. The code below shows how to create such a plot for a data frame.
const_model <- lm(Y1GPA ~ 1, data = model_data)
model_data$y_hat <- const_model$fitted.values # just the constant, 3.22

# plot the predicted vs actual
model_data |>
  ggplot(aes(x = y_hat, y = Y1GPA)) +
  geom_point() +
  # we usually want to add these two reference lines:
  # geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  # geom_smooth(se = FALSE) +
  theme_bw() +
  xlab("Predicted Y1GPA")
The pattern in Figure 2 looks weird because predicting the values of the data set with a single constant is a poor approximation. The dots form a vertical line because all the predictions are the same, while the actual GPAs range from 0 to 4. The points also fall on top of each other, so we can’t tell that they are denser at the top. These problems can all be remedied, but it’s not worth the trouble with this model.
The point to remember is that an average is a poor representation of reality. We should be mindful of this when using and presenting data summaries. We don’t usually want to base decisions on simple averages.
Disaggregation
In dashboards and outcomes data summaries like graduation rates, we often see averages that are disaggregated by a categorical variable like race or gender. This is fine for getting a sense of the situation, but as with a simple average, it shouldn’t be thought of as analysis; don’t make decisions based on such summaries without trying to figure out what’s going on. If the model assumption for an average (all values are the same) is bad, disaggregation can make it worse in two ways. First, we’re making the data sets smaller when we split into groups, which leads to worse estimates of the statistics we’re interested in. Second, the model assumes that the members of each categorical group are identical (stereotyping). If a decision-maker looks at the difference in group averages and thinks of it as causal, this attributes all the differences to the categories. For example, if first year GPA is lower for first-generation students, we could be tempted to believe that we should create a special program to boost the grades of those students. That’s attributing the category as the cause. What’s more likely going on is that students who attended high schools in less wealthy areas are less prepared for college, and first-gen students happen to overlap with that condition. We’ll dig into those details below.
Data summaries like averages only describe a situation. An analysis attempts to understand why it exists. The practical difference is in the complexity of the model we use.
In our data set, we’ll disaggregate first year college grades by first-generation (FirstGen) status.
# take the average for each category
model_data |>
  group_by(FirstGen) |>
  summarise(N = n(),
            gpa = mean(Y1GPA),
            SE = sd(Y1GPA)/sqrt(N)) |>
  kable(digits = 3)
FirstGen | N | gpa | SE |
---|---|---|---|
No | 1792 | 3.234 | 0.014 |
Yes | 208 | 3.058 | 0.045 |
The figures in Table 3 show a difference in first year grades between the two categories of first-generation status. To create a “confidence interval” around each mean, we typically add and subtract two standard errors.
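If we want those intervals in the table itself, one option is to extend the summary chunk above (same data and columns; the interval bounds are my addition):

model_data |>
  group_by(FirstGen) |>
  summarise(N = n(),
            gpa = mean(Y1GPA),
            SE = sd(Y1GPA)/sqrt(N)) |>
  mutate(ci_low  = gpa - 2 * SE,   # approximate 95% interval
         ci_high = gpa + 2 * SE) |>
  kable(digits = 3)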
We can get similar results with a regression model that only uses the category as a predictor. This is better than using a single constant to predict all the output values; now we have two constants, but it’s not that much better.
cat_model <- lm(Y1GPA ~ 0 + FirstGen, data = model_data)

cat_model |>
  summary()
Call:
lm(formula = Y1GPA ~ 0 + FirstGen, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-3.2343 -0.3343 0.0657 0.4657 0.9423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
FirstGenNo 3.23426 0.01398 231.42 <2e-16 ***
FirstGenYes 3.05769 0.04102 74.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5916 on 1998 degrees of freedom
Multiple R-squared: 0.9673, Adjusted R-squared: 0.9673
F-statistic: 2.956e+04 on 2 and 1998 DF, p-value: < 2.2e-16
The regression model in the code block specifies that the model constant is zero, which forces the output to include both categories (No and Yes) as coefficients. It also makes the R-squared value untrustworthy; the actual \(R^2\) is very close to zero in this case.
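One way to see the honest \(R^2\) is to refit with the usual intercept, or equivalently square the correlation between fitted and actual values. A quick sketch:

# R-squared with the intercept included (the usual parameterization)
summary(lm(Y1GPA ~ FirstGen, data = model_data))$r.squared

# equivalently, the squared correlation between fitted and actual values
cor(cat_model$fitted.values, model_data$Y1GPA)^2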
Compare the values in Table 4 to those in Table 3. The coefficients table shows the estimates for the two categories, which are the same as the averages in Table 3. The standard errors are nearly the same, with the difference that the regression model uses all the data in calculating standard errors, whereas the averages in Table 3 are calculated separately for each category. The regression model is a little more useful than the single-constant model (averaging all the values), but not by much. If we plotted the predicted values from the regression model, we would now see two vertical lines, one for each category. This isn’t very useful, and we don’t want to show that to humans. We can make the point better with a density plot, which is like a smoothed histogram.
model_data |>
  mutate(y_hat = cat_model$fitted.values) |>
  ggplot(aes(x = Y1GPA, group = FirstGen, fill = FirstGen)) +
  geom_density(alpha = .4) +
  geom_vline(aes(xintercept = y_hat, color = FirstGen), linetype = "dashed") +
  theme_bw() +
  xlim(1, 4)
We should be careful with density plots like Figure 3, because they smooth the data, and the shape depends on how much smoothing is done. The plot shows that the distributions of the two categories of students overlap substantially, but that non-FirstGen students have many more of the higher GPAs, above 3.5, say. When we disaggregate the groups and take averages, we get the values indicated by the two dashed lines. If we characterize all FirstGen students as having the GPA at the blue dashed line (the disaggregated average) and all non-FirstGen students as having the GPA at the red line, we’re missing the main story. The overlap in distributions suggests that the real causes of GPA differences also overlap within these groups, and an analysis would try to uncover what those are. One way to explain this is to point out that if all we have is a student’s GPA, it’s almost impossible to reverse engineer the student’s category.
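To see how much the shape depends on the smoothing, one option is to vary the adjust argument of geom_density(), which scales the default bandwidth:

# same densities with heavier smoothing; try adjust = 0.5, 1, and 2 and compare shapes
model_data |>
  ggplot(aes(x = Y1GPA, fill = FirstGen)) +
  geom_density(alpha = .4, adjust = 2) +
  theme_bw() +
  xlim(1, 4)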
Sometimes we really can treat differences between groups as caused by group membership. For example, we might run an A-B test of a new program to help students, where a randomly chosen group receives the “treatment.” This is difficult to do with real students, but in principle we can set up experiments where the categories are meaningful on their own. Usually, however, there are causes in the background that are more fundamental to the things we care about, like grades and retention rates.
Correlation
One problem with presenting averages of categories of students is that it emphasizes the difference between groups, which can distract from more productive models. For example, the connection between a student’s parental education and that student’s grades in college is weak. It’s more meaningful to compare high school grades to college grades. We start by looking at a scatterplot.
model_data |>
  ggplot(aes(x = HSGPA, y = Y1GPA)) +
  geom_jitter(alpha = .1) +
  geom_smooth(se = FALSE, method = "lm") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(x = "HSGPA", y = "Y1GPA") +
  theme_minimal()
Figure 4 has one dot per student, showing the grade average in high school along the x-axis and the first year college grades along the y-axis. The red line shows where high school GPA and first year college GPA would be equal. The blue line is the linear regression line, drawn to give the best estimate of the average Y1GPA for any given value of HSGPA. For example, at HSGPA = 3.0 the red line says the average Y1GPA would also be 3.0 (equality), but the blue line shows that the average college grade is only about 2.7. This is probably because college is harder than high school.
What we call a “linear regression” that models Y from X is a line that estimates the average value of Y for any given value of X. This is another reason to think “regression = average.”
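One way to make “regression = average” concrete is to compare binned averages of Y1GPA with the fitted line; a sketch (the quarter-point binning is my choice):

# binned averages of Y1GPA versus the linear fit
model_data |>
  mutate(hs_bin = round(HSGPA * 4) / 4) |>   # quarter-point HSGPA bins
  group_by(hs_bin) |>
  summarise(mean_y1gpa = mean(Y1GPA), n = n()) |>
  ggplot(aes(x = hs_bin, y = mean_y1gpa)) +
  geom_point(aes(size = n)) +
  geom_smooth(data = model_data, aes(x = HSGPA, y = Y1GPA),
              method = "lm", se = FALSE, inherit.aes = FALSE) +
  theme_minimal()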
The (Pearson) correlation coefficient is a measure of how well the regression line can fit the averages of Y1GPA, on a scale of 0 to 1 (it can be negative if the line slopes the other way, but we’re just interested in the magnitude here). In this case, the correlation is 0.55. This isn’t perfect as a measure of model accuracy, but it has the advantage that everyone has heard of correlation, so we can develop that understanding without adding new technical terms. To compute the specifics of the blue regression line, we use the lm() function in R, which stands for “linear model.” The code below shows how to do this.
gpa_model <- lm(Y1GPA ~ HSGPA, data = model_data)
summary(gpa_model)
Call:
lm(formula = Y1GPA ~ HSGPA, data = model_data)
Residuals:
Min 1Q Median 3Q Max
-3.09087 -0.24517 0.08529 0.34821 1.31775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20873 0.10272 2.032 0.0423 *
HSGPA 0.84769 0.02879 29.448 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4961 on 1998 degrees of freedom
Multiple R-squared: 0.3027, Adjusted R-squared: 0.3023
F-statistic: 867.2 on 1 and 1998 DF, p-value: < 2.2e-16
The output in Table 5 shows the coefficients for the regression model. There’s a lot of useful information here, but I’ll pull out the two items of most interest. First, the coefficient of HSGPA is .85, which means that incremental high school grades get “discounted” by 15% when predicting college grades. That is, an increase in HSGPA of 1 translates to an increase in Y1GPA of .85. The second point of interest is the \(R^2\) value of .30, which is often explained as how much variance is “explained by the model.” But this is a foreign language to humans, since the units of variance are squared. Who goes around thinking about \(GPA^2\)? Fortunately, \(R^2\) is just the square of the correlation coefficient, so we can take the square root and talk about correlation instead.
The \(R^2\) for a regression model is the square of a correlation coefficient, and it’s often better to communicate \(R\). It can be explained as the correlation between predicted and actual values, on a scale of 0 to 1, where 0 is no predictive ability at all, and 1 is perfect prediction.
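As a sanity check using the gpa_model fit above, the two ways of getting \(R\) agree:

sqrt(summary(gpa_model)$r.squared)              # square root of R-squared
cor(gpa_model$fitted.values, model_data$Y1GPA)  # predicted vs actual correlation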
Scatterplots showing predicted versus actual values are an essential part of our analytical toolkit. You may find that your audience also responds to this level of detail.
Psychology of Models
Google created a position called “Chief Decision Officer,” to help the company learn how to make better decisions. One strategy coming out of that work is to map out data’s relationship to decisions before seeing the real data, for example, “if we find that X performance measure is below Y, we’ll close the Scranton Office.” This forces us to think about the decision-making process in advance to avoid post-hoc rationalizations. That’s probably more than an IR office can bite off, but there are other ways we can facilitate good decision-making.
In his blog on the data viz industry, Benn Stancil connected data analysis to the decision-making pathways described in Daniel Kahneman’s book Thinking, Fast and Slow. The book describes two systems of thinking, which Kahneman calls System 1 and System 2. System 1 is fast, intuitive, and emotional, while System 2 is slow, deliberate, and logical. Stancil’s observation is that data analysts usually provide System 2-type thinking, but decisions are usually made via System 1. This isn’t a criticism of the decision-makers, because they are in their positions for the very reason that the choices they make are hard ones, depending largely on experience and intuition. Stancil suggests that as analysts we should be trying to understand the System 1 model that decisions are made with, and nudge it toward whatever facts our System 2 analysis has uncovered. This is a powerful idea that comports with my experience in higher education.
One way to put this idea into practice is to strategically move conversations away from averages and proportions to regression models that have a better chance of representing the interactions that matter. But we can’t talk about standard errors, residuals, and \(R^2\) values. To communicate with decision-makers, we want to avoid technical terms, even if they have some familiarity with statistics. There are two good reasons for this. One is that jargon may lend the results more certainty than actually exists, because we’re “doing science” (magic). There are many problems with this kind of overpromising, and one is that it makes it harder to be flexible. When a model is not good enough, we want to find out as soon as possible, which means admitting we were wrong the first time. The second reason is that technical terms can shut down the natural conversation we want to have to understand the System 1 in place. The goal is to nudge decision-makers out of the comfort zone of averages and proportions, and toward a new comfort with the idea of models of interaction, while avoiding technical jargon.
One way to strike a balance between natural language and regression models is to visualize the model with a simple drawing of a relevant chain of logic or cause and effect. This puts us all on the same footing, since anyone can draw bubbles and arrows. We can start with a conversation that tries to uncover how decisions are currently made. For example, in a different ThIRsdays, I showed how most thinking about tuition discount rates is oversimplified.
Example: Tuition discounting
Here’s a typical model of tuition discounting I see in the press.
graph TD
  A[Higher tuition discount] --> B[Less tuition revenue]
  A --> C[More aid to students]
This model can lead to fixed aid budgets, which can reduce net tuition revenue. The real situation is more complicated, and a better model would look like this.
graph TD
  A[Price] --> C[Net revenue per student]
  B[Demand] --> C
  A[Price] --> D[Total enrollment]
  B[Demand] --> D
  D --> E[Total revenue]
  C --> E
  C --> F[Discount rate]
The important switch in the two models is that discount rate is seen as causal in the first one, and a side effect in the second. The second model leads to more sophisticated thinking about how budgets, aid leveraging, and enrollment strategy interact. If you’re paying a consultant to model financial aid awards, I hope they’re not using the first version.
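A toy calculation (numbers invented for illustration) shows the second view in action: price, aid, and enrollment drive revenue, and the discount rate just falls out at the end.

# hypothetical inputs
sticker_price <- 40000
enrollment    <- 500
avg_award     <- 22000   # institutional aid per student

gross_tuition <- sticker_price * enrollment
total_aid     <- avg_award * enrollment
net_revenue   <- gross_tuition - total_aid

total_aid / gross_tuition   # the discount rate is a byproduct, not a lever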
Example: Aid and Retention
At one of my jobs, I analyzed retention rates and found that financial aid was a big predictor. In particular, students who received no institutional aid awards were much more likely to drop out. A simple model could be drawn like this.
graph TD
  A[No institutional aid] --> B[Attrition]
  C[Other risk factors] --> B
Model of retention.
Seems straightforward, right? It turns out that when a student dropped out, the financial aid office would delete the record of their awards, so it only appeared to have been a zero in retrospect. The actual model looked like this.
graph TD
  C[Risk factors] --> B[Attrition]
  B --> A[Award records deleted]
Model of retention revised.
This is an example of reverse causality (zero awards as an effect of attrition, not the other way around), which is a problem to watch out for. Drawing pictures can help explain what’s going on.
Example: First generation students
Going back to an earlier example, suppose we create a dashboard that disaggregates first-year GPA by first-generation status, and we find that first-gen students have lower GPAs than others. The simple version of this relationship is
graph TD
  A[First-gen status] --> B[GPA]
  C[Unknown factors] --> B
Here we’re admitting that GPA is influenced by other factors, but we don’t yet have opinions about what those might be. Because we have so much data on GPA, it would be better to start with the more general question of what predicts grades, and then see if first-gen status is a significant predictor. For example, we might assume that high school grade performance would affect college grades, as we explored earlier.
graph TD
  A[First-gen status] --> B[College GPA]
  C[Unknown factors] --> B
  D[High school GPA] --> B
If we have data to map to these variables, we can turn it directly into a regression model. However, a conversation with the decision-makers might lead us to think about other factors that could affect college GPA, especially ones related to first-generation status. You might come up with a diagram like the following.
graph TD
  A[K-12 school quality] --> F[Academic preparation]
  B[Parent occupation] --> A
  C[Parent earnings] --> A
  B --> C
  D(Parent education) --> A
  D --> B
  G[Social integration] --> E[GPA]
  F --> E
  C --> G
The model in Figure 7 is more complex, and it may not be “true” in the sense that we can verify all the parts with data. But the technique of thinking through possible causes of GPA is useful, because it (1) puts on paper an informal model of System 1 we can discuss with decision-makers, and (2) points toward regression models we can estimate and assess fit. We might have data proxies on hand already, or this exercise might point to opportunities to collect new types of data.
In this case, the diagram focuses attention on the fact that first generation status is only about parental education, a fact denoted with rounded corners in the figure. Occupation and earnings are related to education, but not directly indicated by FirstGen status. The point of departure between high school and college is the quality of K-12 education, and how each student navigates those years in preparation for further studies.
In converting an on-paper model like this into the world of IR, we are limited by the data we have, and not everything we dream up is easily measurable. Here’s a map that tries to capture the model in Figure 7 using real data elements.
Parental education -> first generation status (FirstGen = Yes/No)
Parental occupation -> too messy to be mapped in a simple model (omitted)
Parental earnings -> Pell status (Pell = 0/1)
K-12 school quality -> Number of AP courses taken in high school (AP_N = 0, 1, 2, …)
K-12 school quality -> did the student submit ACT or SAT (TestOptional = 0/1)
Additionally, we’ll include a proxy for a student’s academic ability in high school as HSGPA (0 to 4). The outcome variable we’ll predict is first year college GPA (Y1GPA, 0 to 4). I limited it to the first year so we don’t have to account for first year attrition. I also include the difficulty of courses taken by a student, which adjusts the GPA up or down accordingly. See my paper “Grades and Learning” in J. Assessment and Institutional Effectiveness for how to do that.
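The chunk that fits this expanded model isn’t reproduced here; reconstructing it from the coefficient names in the table further below, it was presumably something along these lines (the exact formula is my inference):

# inferred from the coefficient names shown below; includes a squared HSGPA term
gpa_model <- lm(Y1GPA ~ poly(HSGPA, 2) + Difficulty + FirstGen +
                  Pell + TestOptional + AP_N,
                data = model_data)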
The blue lines in Figure 8 are allowed to bend here. They estimate the average Y1GPA for each predicted value using local smoothing, a type of regression that isn’t required to be a straight line, and it’s easy to generate with ggplot. The red line is where predicted and actual values are the same when we force the model to be linear. Ideally the blue line coincides with the red one, and in this case it’s a pretty good match. As an analytical tool, we can use this kind of plot to adjust the model toward a better match. Our audience doesn’t need to know about that technicality.
The two plots are disaggregating the data by FirstGen status, but there’s only one model that fits both types. The correlation between predicted and actual values is \(R=\) 0.7, which is an improvement over the earlier model that only used HSGPA.
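One way a plot like Figure 8 could be produced (a sketch, not the original chunk) is to facet the predicted-versus-actual scatterplot by FirstGen and add the usual reference lines:

# predicted vs actual for the expanded model, one panel per FirstGen group
model_data |>
  mutate(y_hat = gpa_model$fitted.values) |>
  ggplot(aes(x = y_hat, y = Y1GPA)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(se = FALSE) +
  facet_wrap(~ FirstGen) +
  theme_bw() +
  xlab("Predicted Y1GPA")

cor(gpa_model$fitted.values, model_data$Y1GPA)  # about 0.7, per the text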
We are now in a position to evaluate the importance of FirstGen status in the model by looking at the coefficients.
gpa_model |>
  broom::tidy() |>
  kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.258 | 0.286 | -11.389 | 0.000 |
poly(HSGPA, 2)1 | 8.519 | 0.487 | 17.502 | 0.000 |
poly(HSGPA, 2)2 | 1.153 | 0.429 | 2.690 | 0.007 |
Difficulty | 1.971 | 0.087 | 22.632 | 0.000 |
FirstGenYes | -0.162 | 0.033 | -4.944 | 0.000 |
PellYes | -0.044 | 0.026 | -1.692 | 0.091 |
TestOptionalYes | -0.177 | 0.021 | -8.419 | 0.000 |
AP_N | 0.013 | 0.003 | 4.254 | 0.000 |
The model includes both HSGPA and its square, because the relationship to Y1GPA is curved. The FirstGen coefficient can’t be interpreted as a causal effect, only as a contribution to the prediction as part of this ensemble of variables. As such, it’s showing that first-gen students have, on average, a slightly lower GPA than other students once we take the other information into account. But that other information includes Pell and TestOptional, which are related to first-gen status. While Pell status relates to FirstGen status, the two aren’t identical. Similarly, TestOptional is a proxy for the student’s academic preparation, but it doesn’t mean that all students who are first-gen are unprepared. The model here seeks a general way to predict Y1GPA, and FirstGen status is just one of many variables that contribute to that prediction. Taking this approach allows us to avoid the oversimplification of thinking that first-gen status is the cause of lower GPAs.
What we gain from this analysis is a broad understanding of how first year college grades are related to other factors, including first-generation status. Since college GPAs range from about 2.5 to 4, most of that variation is not explained by FirstGen, which has a coefficient of only .16. Most of the variation corresponds to high school grades. We might have a conversation with the decision-maker that makes these points:
We can predict first year grades well enough to be useful for interventions.
The main link is to high school grades, but other factors include FirstGen, Pell, and TestOptional.
FirstGen students tend to take less difficult classes (I didn’t show this part of the analysis).
There are probably barriers to FirstGen success that can’t be explained by the other factors we’re considering, so this is worth following up on. A lot has been written about social integration and so on that might be useful in mitigating problems.
An advantage of zooming out like this from the original question is that we can begin to link together models from other studies. The course difficulty angle is one of those aspects. We could connect this study to engagement indicators like participating in study away. It might prompt us to inventory indicators of social integration and improve that source of information.
Summary
The connection between data analysis and decision-making can have gaps that lead to unrealized potential in data use and suboptimal decisions. This paper is an attempt to explore that issue and suggest ideas that may or may not work for you. It’s intended to be a starting point, not a recipe.
The first challenge for many IR offices will be to switch from manual work (pointing and clicking) to writing scripts that describe the work. I prefer R for several reasons, but there are other options like Python. Once we switch to automation, it unlocks iterative improvement and the time to do more analysis, like the sample one shown here.
I’m advertising linear regression (and its many cousins) as a general tool for relating data fields of interest. It’s a kind of averaging (which is familiar) that can accommodate complex interactions. The results won’t usually tell us about causality, but association is usually good enough to help with decision-making. It’s not difficult to create linear models in R, but the details of that are beyond the scope of this piece.
We need complex models for most of the important decisions typical to higher ed. The models are always incomplete, probably wrong, but (as George Box put it) sometimes useful. In combination with professional training and good judgment, the odds of making the right decision are increased.
I’d love to know your stories. What data influenced what big decisions, and how did it turn out? I hope we can have a follow-up ThIRsdays to compare notes.
Footnotes
There’s an important mathematical relationship about this called Jensen’s Inequality.