Regression Analysis
16 minute read
Regressions are one of the most powerful data analysis tools. One of the great advantages of Stata is the ease and flexibility with which you can estimate regressions. This section provides an introduction to regression analysis using Stata.
5.1 Correlation Tables and T-tests
An initial step in assessing the relationship between variables is to create a correlation table using the correlate command, followed by the variables you want to correlate.
Practical Exercise: Correlation table of variables.
Using the life expectancy data we see that GNP per capita and life expectancy have a positive correlation coefficient of around 0.7.
. sysuse lifeexp, clear
(Life expectancy, 1998)
. corr lexp gnppc
(obs=63)
| lexp gnppc
──────────────────────────
lexp | 1.0000
gnppc | 0.7182 1.0000
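A closely related command, pwcorr, reports pairwise correlations and can also display significance levels and observation counts. A small sketch using the same data (output not shown):
. pwcorr lexp gnppc, obs sig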
In policy research, you often only collect data on a small random sample of the whole population. The population has one true mean value, but each random sample will contain a slightly different set of values with a slightly different mean. If you take enough samples from a population, the sample means will form a distribution around the true population mean. This is called the sampling distribution.
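To build intuition, you can simulate a sampling distribution yourself. The sketch below is a hypothetical illustration (not part of the original exercise): it draws 500 samples of 50 observations from a normal population with mean 70 and plots the distribution of the sample means.
* define a program that draws one sample and returns its mean
capture program drop onesample
program define onesample, rclass
    clear
    set obs 50
    generate x = rnormal(70, 5)    // population mean 70, sd 5
    summarize x
    return scalar mean = r(mean)
end
preserve                           // keep the current dataset safe
simulate mean = r(mean), reps(500) seed(12345): onesample
histogram mean, normal             // the sample means cluster around the true mean of 70
restore                            // bring the original data back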
We can ask how likely it is that the population mean takes on a certain value, given our sample mean, using the ttest command.
In the example below, we test the hypothesis that the population mean life expectancy is equal to 70 using our life expectancy sub-sample. The main numbers of interest in this t-test output table are the p-values (last row). The one-tailed p-values evaluate the null against the alternatives that the mean is less than 70 (left test) and greater than 70 (right test). If the p-value is less than the pre-specified significance level (usually .05 or .01), we conclude that the mean is statistically significantly less or greater than the null hypothetical value. In this case, we conclude that the mean is significantly greater than 70.
The two-tailed t-test in the bottom centre evaluates the null against the alternative that the mean is not equal to 70. Since the p-value is less than 0.05, we conclude that the mean life expectancy is statistically significantly different from 70.
Practical Exercise: Run a mean comparison t-test.
Suppose we want to know whether the population mean of life expectancy is significantly different from 70. We can test the null hypothesis that the mean is equal to 70 using the following command:
. ttest lexp == 70
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
lexp | 68 72.27941 .5718159 4.715315 71.13806 73.42076
------------------------------------------------------------------------------
mean = mean(lexp) t = 3.9863
H0: mean = 70 Degrees of freedom = 67
Ha: mean < 70 Ha: mean != 70 Ha: mean > 70
Pr(T < t) = 0.9999 Pr(|T| > |t|) = 0.0002 Pr(T > t) = 0.0001
5.2 Linear Regression
We can use a linear regression to further investigate the relationship between life expectancy and income. The command used for linear regressions is regress. The command takes the form regress y x1 x2 x3 … xn, where y is our dependent variable and x1 … xn are our covariates (all of the independent variables which are thought to influence y). Stata automatically includes a constant term, the variable _cons, so you do not need to create a constant term yourself.
Practical Exercise: Run a linear regression.
. sysuse lifeexp, clear
(Life expectancy, 1998)
. gen log_gnppc = log(gnppc)
(5 missing values generated)
. regress lexp log_gnppc popgrowth safewater
Source | SS df MS Number of obs = 37
-------------+---------------------------------- F(3, 33) = 33.30
Model | 722.137692 3 240.712564 Prob > F = 0.0000
Residual | 238.56501 33 7.22924274 R-squared = 0.7517
-------------+---------------------------------- Adj R-squared = 0.7291
Total | 960.702703 36 26.6861862 Root MSE = 2.6887
------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
log_gnppc | 1.493245 .551269 2.71 0.011 .3716798 2.61481
popgrowth | -.5310035 .5548132 -0.96 0.345 -1.659779 .5977725
safewater | .1385733 .0394751 3.51 0.001 .0582607 .2188859
_cons | 49.6063 3.547374 13.98 0.000 42.38911 56.82348
------------------------------------------------------------------------------
5.2.1 Interpreting Regression Output Tables
The output will be presented as in the regression tables shown above. Interpreting these tables may seem confusing at first but will soon become second nature.
The most important of these numbers are the coefficient estimates (2nd column). The coefficients tell us the effect of each x variable listed in the 1st column on life expectancy. When neither variable is in log form the interpretation is straightforward: the coefficient tells us how life expectancy changes if the x variable changes by one unit. Since GNP per capita is in logs, the interpretation is that a 1% increase in GNP per capita is associated with an increase of about 0.015 years in life expectancy (if GNP per capita increases by 1%, its log increases by roughly 0.01, so the effect is 1.493 × 0.01 ≈ 0.015). Conversely, if the y variable is in log form and x is not, then 100 times the x coefficient approximates the percentage change in y when x goes up by one unit. If both variables are in logs, then the x coefficient tells you the percentage change in y when x goes up by 1% (i.e., the elasticity of y with respect to x).
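As a quick check of this interpretation, you can compute the implied effect directly from the stored coefficient (a minimal sketch, run immediately after the regression above):
* _b[log_gnppc] holds the estimated coefficient; a 1% rise in GNP per
* capita raises its log by ln(1.01), which is approximately 0.00995
display _b[log_gnppc] * ln(1.01)
* roughly 1.493 * 0.00995, or about 0.015 years of life expectancy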
The standard error of the coefficient (3rd column) measures how precisely the model estimates the coefficient’s unknown value. When we think that the errors do not have uniform variance (i.e., heteroskedastic errors) we must use robust standard errors by adding the robust option to our regression, e.g., regress lexp log_gnppc, robust.
The t-statistic (4th column) is equal to the coefficient estimate divided by its standard error. Because we know the distribution of the t-statistic under the null hypothesis, we can assess the probability of observing an estimate this far from zero if the population coefficient were truly zero. This probability is the p-value displayed in the 5th column. In the last regression, we see that this probability is greater than 0.05 for the coefficient on population growth, so we fail to reject the null hypothesis that the coefficient on population growth is equal to 0. More succinctly, the effect of population growth on life expectancy is statistically insignificant at the 5% level.
5.2.2 Dummy Variables
Suppose we also want to include the region variable in our regression. It would be incorrect to run the model regress lexp log_gnppc popgrowth safewater region, as region is not a continuous variable. Instead, we want to include a separate regression intercept for each region. The easiest way to do this in Stata is to include the i. prefix before the categorical variable you wish to include.
In the practical example below, which adds region dummies to our life expectancy regression, notice that there are only two dummies even though there are three regions. Each dummy coefficient tells us the effect of that region on life expectancy relative to the base region (in our case, Europe and Central Asia). So, North and South America have lower life expectancy than Europe and Central Asia, although the differences are not statistically significant.
We can choose which region to use as the base. To specify North America (coded as 2) as the base, use the following command: regress lexp log_gnppc popgrowth safewater b2.region.
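Most Stata documentation writes the base-level operator together with the factor-variable prefix, which is equivalent:
. regress lexp log_gnppc popgrowth safewater ib2.region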
Practical Exercise: Run a linear regression with dummy variables.
. regress lexp log_gnppc popgrowth safewater i.region
Source | SS df MS Number of obs = 37
-------------+---------------------------------- F(5, 31) = 18.95
Model | 723.836248 5 144.76725 Prob > F = 0.0000
Residual | 236.866455 31 7.64085337 R-squared = 0.7534
-------------+---------------------------------- Adj R-squared = 0.7137
Total | 960.702703 36 26.6861862 Root MSE = 2.7642
--------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------+----------------------------------------------------------------
log_gnppc | 1.586536 .6088676 2.61 0.014 .3447419 2.828329
popgrowth | -.2782813 .7849757 -0.35 0.725 -1.87925 1.322687
safewater | .1335213 .0423067 3.16 0.004 .0472362 .2198064
|
region |
North America | -.612925 1.400706 -0.44 0.665 -3.469683 2.243833
South America | -.6228757 1.497229 -0.42 0.680 -3.676495 2.430743
|
_cons | 49.23628 3.759722 13.10 0.000 41.56827 56.90428
--------------------------------------------------------------------------------
5.2.3 Interaction Effects
We can have interaction effects between two continuous variables, a continuous and a categorical variable, or two categorical variables.
Starting with two continuous variables, suppose we think the effect of safe drinking water on life expectancy is not linear. We can examine this by including the square of the safewater variable in the regression. One way to do this would be to generate a new variable (see section 3.6) and include it in the regression. Alternatively, we can include the interaction term using the # operator. Remember to add a c. before safewater to tell Stata that it is dealing with a continuous variable.
Practical Exercise: Interact continuous variables.
. regress lexp log_gnppc safewater c.safewater#c.safewater
Source | SS df MS Number of obs = 37
-------------+---------------------------------- F(3, 33) = 37.04
Model | 740.746457 3 246.915486 Prob > F = 0.0000
Residual | 219.956246 33 6.66534078 R-squared = 0.7710
-------------+---------------------------------- Adj R-squared = 0.7502
Total | 960.702703 36 26.6861862 Root MSE = 2.5817
----------------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
------------------------+---------------------------------------------------------------
log_gnppc | 2.212724 .5762273 3.84 0.001 1.04038 3.385067
safewater | .4402642 .1591055 2.77 0.009 .1165615 .7639668
|
c.safewater#c.safewater | -.002323 .001194 -1.95 0.060 -.0047522 .0001062
|
_cons | 34.25524 7.264115 4.72 0.000 19.47629 49.0342
----------------------------------------------------------------------------------------
The coefficient estimate on the square of safewater is negative. This means that the effect is diminishing in safe water access: for countries with higher access to safe water, the marginal gains from improving safe water access are smaller (note, though, that the squared term is only significant at the 10% level, with p = 0.060).
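To see this diminishing effect directly, you can ask margins for the slope of safewater at different levels of access (a minimal sketch, run immediately after the regression above; output not shown):
. margins, dydx(safewater) at(safewater = (50 75 100))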
We might also think that the effect of safe drinking water on life expectancy varies by region. To test this, interact the safewater variable with our regional dummies using the # operator, specifying whether variables are continuous (c.) or categorical (i.).
Practical Exercise: Interact continuous variables with categorical variables.
. regress lexp log_gnppc popgrowth safewater i.region#c.safewater
Source | SS df MS Number of obs = 37
-------------+---------------------------------- F(5, 31) = 18.94
Model | 723.770406 5 144.754081 Prob > F = 0.0000
Residual | 236.932296 31 7.6429773 R-squared = 0.7534
-------------+---------------------------------- Adj R-squared = 0.7136
Total | 960.702703 36 26.6861862 Root MSE = 2.7646
------------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------------+----------------------------------------------------------------
log_gnppc | 1.523157 .580035 2.63 0.013 .3401674 2.706146
popgrowth | -.5400662 .7261172 -0.74 0.463 -2.020992 .9408597
safewater | .1362441 .0409431 3.33 0.002 .0527401 .2197482
|
region#c.safewater |
North America | .004114 .0172491 0.24 0.813 -.0310659 .0392938
South America | -.003567 .0180154 -0.20 0.844 -.0403096 .0331756
|
_cons | 49.53485 3.779901 13.10 0.000 41.82569 57.24401
------------------------------------------------------------------------------------
The two interaction terms tell us that the effect of safe water access on life expectancy does not vary significantly by region (since the p-values are greater than 0.05).
Interacting two categorical variables is done in the same way (i.catvar#i.catvar). If you want to include the dummy variables separately as well as the interaction, a shortcut in Stata is to use the double ##, which includes the full expansion of the variables being interacted in the regression, as the sketch below shows.
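For instance, the following two commands fit the same model using our data:
* ## expands to main effects plus the interaction
regress lexp log_gnppc i.region##c.safewater
* which is equivalent to spelling everything out
regress lexp log_gnppc i.region c.safewater i.region#c.safewater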
5.3 Hypothesis Testing
We have already established the meaning of the t-statistics and p-values in the regression output tables: these refer to a test of the null hypothesis that the coefficient on that variable in the population is equal to zero. However, we can also test any linear hypothesis using the test command. For example, if we wanted to test whether the log GNP per capita effect is significantly different from 1, we would run the command below.
Practical Exercise: Hypothesis test.
. quiet regress lexp log_gnppc popgrowth safewater
. test _b[log_gnppc] == 1
( 1) log_gnppc = 1
F( 1, 33) = 0.80
Prob > F = 0.3774
Here I use quiet to suppress the regression output, which is useful when we are only interested in the hypothesis test results. As the p-value is greater than 0.05, we fail to reject the null hypothesis that the coefficient on income is equal to 1. We can also test hypotheses about linear combinations of coefficients, e.g., test _b[log_gnppc] - _b[safewater] == 0. To test non-linear combinations, use the testnl command. For joint hypotheses, include each test in separate brackets: test (_b[log_gnppc] == 0) (_b[safewater] == 0).
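As a concrete sketch (run after the quiet regression above; the restrictions are purely illustrative, not substantive hypotheses):
* joint test that both coefficients are zero
test (_b[log_gnppc] == 0) (_b[safewater] == 0)
* non-linear test, e.g. that the ratio of the two effects equals 10
testnl _b[log_gnppc] / _b[safewater] == 10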
5.4 Binary Regressions
We have explored how to deal with categorical variables on the right-hand side of our regressions. Now let us see how to deal with binary outcome variables using a probit (or logit) regression in Stata. To illustrate this, let us generate a binary variable equal to 1 if a country’s life expectancy is greater than 70 and zero otherwise.
gen high_lexp = lexp > 70
(Note that this expression codes any missing values of lexp as 1; lexp has no missing values here, but in general it is safer to add if !missing(lexp).)
The reason we can’t run a simple linear regression is that the outcome variable can only take on the values 0 or 1, but a linear regression would generate a continuous prediction which could be lower than 0 or greater than 1. To examine this graphically, use the predict command, which stores the predicted values of a regression (see help predict for more detail), and plot these predicted values against our binary outcome.
. quiet reg high_lexp log_gnppc
. predict high_lexp_prediction
(option xb assumed; fitted values)
(5 missing values generated)
The issue is clear from the graph below: for countries with a log GNP per capita greater than 10, the predicted value is greater than 1.
. twoway (scatter high_lexp log_gnppc, mcolor(navy)) ///
(scatter high_lexp_prediction log_gnppc, mcolor(dkgreen)) ///
, xtitle("Log GNP per capita") ///
ytitle("High Life Expectancy (binary variable)") ///
legend(order(1 "Actual values" 2 "Fitted values")) ///
scheme(s1color)
Two of the most popular alternatives when dealing with binary outcome variables are the probit and logit estimators. The two estimators involve slightly different distributional assumptions but should produce roughly the same results. To run these regressions, use the probit (or logit) command.
It is important to recognise that the coefficient estimates from these regressions are not the same as the output generated by the simple linear regressions above. They are not marginal effects, i.e., they do not tell us the average effect of a unit change in the X variable on Y (dy/dx). However, we can calculate marginal effects from the probit (or logit) regression output using the margins command. The syntax is margins, dydx(xvar). As the marginal effects differ depending on the values of the x variables, there are a number of ways of calculating them. By default, Stata reports average marginal effects, averaging dy/dx across the observations in the sample; you can instead evaluate the effects at the means of the covariates with the atmeans option, or at other values by specifying the at(var = value) option.
Practical Exercise: Run a probit regression and calculate the marginal effects.
We use a probit model to estimate the effect of log GNP per capita on the likelihood of being a “high life expectancy” country. We can summarise the predicted values to verify that they lie between 0 and 1.
. probit high_lexp log_gnppc
Iteration 0: log likelihood = -41.345943
Iteration 1: log likelihood = -28.858118
Iteration 2: log likelihood = -28.512014
Iteration 3: log likelihood = -28.511079
Iteration 4: log likelihood = -28.511078
Probit regression Number of obs = 63
LR chi2(1) = 25.67
Prob > chi2 = 0.0000
Log likelihood = -28.511078 Pseudo R2 = 0.3104
------------------------------------------------------------------------------
high_lexp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
log_gnppc | .7923054 .1956773 4.05 0.000 .4087849 1.175826
_cons | -5.951921 1.519856 -3.92 0.000 -8.930783 -2.973058
------------------------------------------------------------------------------
. predict high_lexp_probit
(option pr assumed; Pr(high_lexp))
(5 missing values generated)
. sum high_lexp_probit
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
high_lexp_~t | 63 .6432734 .2870433 .1026455 .9927263
. margins, dydx(log_gnppc)
Average marginal effects Number of obs = 63
Model VCE: OIM
Expression: Pr(high_lexp), predict()
dy/dx wrt: log_gnppc
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
log_gnppc | .1994692 .0241786 8.25 0.000 .1520799 .2468585
------------------------------------------------------------------------------
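If you instead want the marginal effect at the means of the covariates, or at a particular value, the atmeans and at() options do this (a short sketch; output not shown):
* marginal effect evaluated at the sample means of the covariates
margins, dydx(log_gnppc) atmeans
* marginal effect at a chosen value, e.g. log GNP per capita of 8
margins, dydx(log_gnppc) at(log_gnppc = 8)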
5.5 Instrumental Variable Regression
An ordinary least squares (OLS) regression can lead to biased estimates if there is correlation between the independent (X) variable and the error term in the model. This can arise under a number of scenarios, including omitted variables, simultaneity and measurement error.
One common example is the effect of schooling on wages. Higher education usually correlates with higher wages. However, individuals with more schooling may also have more motivation, which also correlates with higher wages. If we were to regress wages on education without controlling for unobservable motivation, we could overestimate the positive effect of schooling, because we would also be capturing the hidden motivation effect.
This problem of omitted variable bias can be fixed by using an instrumental variable (IV) which is correlated with the X variable (in our case education) but not with any of the other determinants of Y (like motivation or ability, which may affect wages). This isolates the exogenous part of X, which we can use to examine the true causal effect of X on Y. In the schooling example, we can use proximity to the nearest college, since individuals who live near a college are likely to have more years of education but are not necessarily more motivated or able than students who live far from one.
To estimate an instrumental variable regression in Stata, we can use the ivregress 2sls command, which takes the syntax ivregress 2sls yvar xvars (endogenous variable(s) = IV(s)); the older ivreg command used in the example below takes the same syntax without naming the estimator. To illustrate this, load US data on individual-level education (educ), an indicator for growing up near a four-year college (nearc4), labour market experience (exper), and log wages (lwage).
Practical Exercise: Estimate an instrumental variable regression.
. use http://fmwww.bc.edu/ec-p/data/wooldridge/card, clear
. ivreg lwage (educ = nearc4) exper
Instrumental variables 2SLS regression
Source | SS df MS Number of obs = 3,010
-------------+---------------------------------- F(2, 3007) = 29.01
Model | -245.292502 2 -122.646251 Prob > F = 0.0000
Residual | 837.934114 3,007 .278661162 R-squared = .
-------------+---------------------------------- Adj R-squared = .
Total | 592.641611 3,009 .196956335 Root MSE = .52788
------------------------------------------------------------------------------
lwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .2620434 .0344996 7.60 0.000 .1943982 .3296886
exper | .1119277 .0147441 7.59 0.000 .0830182 .1408373
_cons | 1.794982 .5869679 3.06 0.002 .6440832 2.945882
------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper nearc4
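With the modern ivregress 2sls command you can also inspect the strength of the first stage after estimation (a short sketch; output not shown):
. ivregress 2sls lwage exper (educ = nearc4)
. estat firststage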
5.6 Exporting Regression Tables
You can save estimates from regressions using the estimates store
command. To then present multiple estimates in the same table in Stata, use esttab
. There are several options to add information to the esttab
table of estimate, see help esttab
. Usually, we want to at least include the adjusted R-squared value by specifying the ar2
option. You can suppress the constant term by specifying the nocons
option.
esttab
output can be displayed on your results screen but can also be exported in various formats, including HTML, Word documents, Excel documents, LaTeX, or PDF. To export an esttab table to a word document use: esttab estimates using "tablename.doc", replace
. We can also export estimates to Latex or pdf by specifying a “.tex” or “.pdf” files, e.g., esttab estimates using "tablename.tex", replace
.
Practical Exercise: Exporting a regression output table.
. quiet reg lexp log_gnppc popgrowth safewater
. est store regression1
. quiet reg lexp log_gnppc popgrowth safewater i.region c.safewater#c.safewater
. est store regression2
. esttab regression1 regression2, ar2 nocons
--------------------------------------------
(1) (2)
lexp lexp
--------------------------------------------
log_gnppc 1.493* 2.418**
(2.71) (3.56)
popgrowth -0.531 -0.0124
(-0.96) (-0.02)
safewater 0.139** 0.500**
(3.51) (3.01)
2.region -1.134
(-0.85)
3.region -1.756
(-1.18)
c.safewate~r -0.00290*
(-2.27)
--------------------------------------------
N 37 37
adj. R-sq 0.729 0.748
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
. esttab regression1 regression2 using "regression_table_example.doc", ar2
(output written to regression_table_example.doc)
By Teresa Hall, Thomas Monk & Jeremiah Dittmar.
© Copyright 2023, London School of Economics.