Regression
For this lab, let’s use county-level 2020 presidential election results.
. * Set your working directory to the location of pres_county2020.dta
. use pres_county2020.dta, clear
Simple and Multiple Regression
. * SYNTAX: reg y x
. reg trump20pct trumppct
Source | SS df MS Number of obs = 3,112
-------------+---------------------------------- F(1, 3110) = 73277.15
Model | 775987.71 1 775987.71 Prob > F = 0.0000
Residual | 32934.1652 3,110 10.5897637 R-squared = 0.9593
-------------+---------------------------------- Adj R-squared = 0.9593
Total | 808921.875 3,111 260.019889 Root MSE = 3.2542
------------------------------------------------------------------------------
trump20pct | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
trumppct | .9776738 .0036117 270.70 0.000 .9705923 .9847554
_cons | -.0372979 .2477361 -0.15 0.880 -.5230407 .4484449
------------------------------------------------------------------------------
.
. * let's visualize:
. scatter trump20pct trumppct [pw=total20votes], ///
> msize(small) mcolor(maroon%5) mlwidth(none) ///
> graphregion(color(white)) ///
> || lfit trump20pct trumppct, lcolor(maroon) ///
> legend(off) xtitle("Trump % of Two-Party Vote, 2016") ///
> ytitle("Trump % of Two-Party Vote, 2020")
Now let’s conduct a multiple regression, adding county race and education variables:
. * SYNTAX: reg y x1 x2...
. reg trump20pct trumppct whitenoncollege whitecollege black latinx
Source | SS df MS Number of obs = 3,111
-------------+---------------------------------- F(5, 3105) = 20398.24
Model | 781995.993 5 156399.199 Prob > F = 0.0000
Residual | 23806.9313 3,105 7.66728866 R-squared = 0.9705
-------------+---------------------------------- Adj R-squared = 0.9704
Total | 805802.924 3,110 259.100619 Root MSE = 2.769
---------------------------------------------------------------------------------
trump20pct | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
trumppct | .9396296 .004703 199.79 0.000 .9304082 .948851
whitenoncollege | 7.180532 .8133951 8.83 0.000 5.585685 8.775379
whitecollege | -13.35032 .9881584 -13.51 0.000 -15.28783 -11.41281
black | 4.517834 .8052197 5.61 0.000 2.939017 6.096651
latinx | 10.53558 .8616352 12.23 0.000 8.846147 12.22501
_cons | -.6150971 .7359075 -0.84 0.403 -2.058012 .8278176
---------------------------------------------------------------------------------
Predictions and Residuals
Stata has built-in postestimation commands which help you extract additional information from your model results. predict is a very useful postestimation command, available for a wide range of model types, including OLS. We can use predict to calculate fitted values ($\hat{y}_i$) as well as residuals ($y_i - \hat{y}_i$). predict will call up your most recent model results to use for analysis, or you can store your regression results and then restore whichever model's results you want.
. quietly reg trump20pct trumppct
. * save regression results and name them "m1"
. estimates store m1
.
. quietly reg trump20pct trumppct whitenoncollege whitecollege black latinx
. * save regression results and name them "m2"
. estimates store m2
.
. * now restore m1 results
. estimates restore m1
(results m1 are active now)
.
. *predict y_hat:
. * SYNTAX: predict newvar, options
. predict y_hat_m1
(option xb assumed; fitted values)
(107 missing values generated)
.
. *we could calculate residuals directly as well:
. predict residual_m1, residuals
(149 missing values generated)
.
. estimates restore m2
(results m2 are active now)
. predict y_hat_m2
(option xb assumed; fitted values)
(147 missing values generated)
. predict residual_m2, residuals
(150 missing values generated)
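If you want to confirm what predict is doing, you can compute the m2 residuals by hand as $y_i - \hat{y}_i$ and compare them to the variable predict created (residual_check_m2 below is just an illustrative name, not part of the lab's dataset):
. * hand-compute the m2 residuals and compare them to predict's version
. gen residual_check_m2 = trump20pct - y_hat_m2
. summarize residual_m2 residual_check_m2
. * the two variables should match wherever both are nonmissing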
How well do our models match the actual Trump % in 2020? We could plot the predicted values against our original DV:
. scatter trump20pct y_hat_m1, msize(vsmall) mcolor(maroon%15)
Pretty well!
. scatter trump20pct y_hat_m2, msize(vsmall) mcolor(maroon%15)
Even better, we can plot the residuals directly:
. scatter residual_m1 trump20pct , msize(vsmall) mcolor(maroon%15)
And again:
. scatter residual_m2 trump20pct, msize(vsmall) mcolor(maroon%15)
Regression with nominal data
To incorporate independent variables measured at the categorical/nominal level, you need to create a series of dummy (dichotomous) variables. You could do so by hand using the recode command, or you could use Stata’s factor variable notation: add an i. in front of the variable name. An additional strategy is to use tab with the gen() option. I use strategies two and three in the example below.
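For reference, the first (by-hand) strategy might look like the sketch below; it assumes region is coded 1 = Northeast, 2 = Midwest, 3 = South, 4 = West (as in this dataset), and the variable names midwest and midwest2 are just illustrative:
. * strategy one: build a Midwest dummy by hand
. gen midwest = (region == 2) if !missing(region)
. * or, equivalently, with recode:
. recode region (2 = 1) (1 3 4 = 0), gen(midwest2)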
. * add region dummies to the regression model
.
. *first create the dummies:
. tab region, gen(regdum)
region | Freq. Percent Cum.
------------+-----------------------------------
1 | 217 7.01 7.01
2 | 1,056 34.10 41.10
3 | 1,366 44.11 85.21
4 | 458 14.79 100.00
------------+-----------------------------------
Total | 3,097 100.00
. describe regdum*
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
regdum1 byte %8.0g region== 1.0000
regdum2 byte %8.0g region== 2.0000
regdum3 byte %8.0g region== 3.0000
regdum4 byte %8.0g region== 4.0000
.
. *now run the regression, being sure to leave out one dummy (in this case, the South)
. reg trump20pct trumppct whitenoncollege whitecollege black latinx regdum1 regdum2 regdum4
Source | SS df MS Number of obs = 3,055
-------------+---------------------------------- F(8, 3046) = 13200.57
Model | 772218.983 8 96527.3729 Prob > F = 0.0000
Residual | 22273.4542 3,046 7.31236185 R-squared = 0.9720
-------------+---------------------------------- Adj R-squared = 0.9719
Total | 794492.437 3,054 260.148146 Root MSE = 2.7041
---------------------------------------------------------------------------------
trump20pct | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
trumppct | .9302781 .00513 181.34 0.000 .9202195 .9403367
whitenoncollege | 6.080273 .820377 7.41 0.000 4.471724 7.688821
whitecollege | -14.28692 .9763185 -14.63 0.000 -16.20123 -12.37261
black | .9215254 .8427074 1.09 0.274 -.7308073 2.573858
latinx | 9.670075 .8456495 11.44 0.000 8.011973 11.32818
regdum1 | -.1633985 .2292055 -0.71 0.476 -.6128117 .2860147
regdum2 | -.6834723 .1389635 -4.92 0.000 -.955944 -.4110006
regdum4 | -2.318332 .1741424 -13.31 0.000 -2.659781 -1.976884
_cons | 1.814021 .7460239 2.43 0.015 .3512594 3.276782
---------------------------------------------------------------------------------
. estimates store m3
.
. *we could do a similar regression with i.region; by default, Stata will use the
. * first category (here, the Northeast) as the reference. You can set a different
. * base category with the ib. prefix, e.g., ib3.region.
. reg trump20pct trumppct whitenoncollege whitecollege black latinx i.region
Source | SS df MS Number of obs = 3,055
-------------+---------------------------------- F(8, 3046) = 13200.57
Model | 772218.983 8 96527.3729 Prob > F = 0.0000
Residual | 22273.4542 3,046 7.31236185 R-squared = 0.9720
-------------+---------------------------------- Adj R-squared = 0.9719
Total | 794492.437 3,054 260.148146 Root MSE = 2.7041
---------------------------------------------------------------------------------
trump20pct | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
trumppct | .9302781 .00513 181.34 0.000 .9202195 .9403367
whitenoncollege | 6.080273 .820377 7.41 0.000 4.471724 7.688821
whitecollege | -14.28692 .9763185 -14.63 0.000 -16.20123 -12.37261
black | .9215254 .8427074 1.09 0.274 -.7308073 2.573858
latinx | 9.670075 .8456495 11.44 0.000 8.011973 11.32818
|
region |
2 | -.5200738 .2103211 -2.47 0.013 -.9324594 -.1076882
3 | .1633985 .2292055 0.71 0.476 -.2860147 .6128117
4 | -2.154934 .237672 -9.07 0.000 -2.620947 -1.68892
|
_cons | 1.650622 .7534941 2.19 0.029 .1732137 3.128031
---------------------------------------------------------------------------------
Interaction terms and marginal effects
The easiest way to incorporate interaction terms into your regression analysis is to use Stata’s built-in interaction notation. ## signifies an interaction between two independent variables and incorporates both the constituent terms and the interaction term into the model. For example, reg y c.x##c.z (assuming you want to treat x and z as ordinal or interval data) will include the constituent terms x and z along with the interaction of x and z. If x is nominal, you’d use i.x instead of c.x.
The margins command lets you calculate the predicted effect of your constituent variables AND their interaction on the dependent variable. We call these conditional marginal effects: the effect of one variable, conditional on the other variable’s values (because of the interaction term, the predicted effect of x varies with the value of z).
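To make the notation concrete: ## expands to the constituent terms plus their product, so the two (commented-out) specifications below, which use variables from this dataset purely for illustration, fit the same model.
. * c.x##c.z expands to c.x c.z c.x#c.z, so these two commands are equivalent:
. * reg trump20pct c.trumppct##c.whitecollege
. * reg trump20pct c.trumppct c.whitecollege c.trumppct#c.whitecollege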
. reg trump20pct trumppct whitenoncollege black latinx regdum1 i.regdum2##c.whitecollege regdum4
Source | SS df MS Number of obs = 3,055
-------------+---------------------------------- F(9, 3045) = 11772.68
Model | 772297.501 9 85810.8334 Prob > F = 0.0000
Residual | 22194.9365 3,045 7.2889775 R-squared = 0.9721
-------------+---------------------------------- Adj R-squared = 0.9720
Total | 794492.437 3,054 260.148146 Root MSE = 2.6998
----------------------------------------------------------------------------------------
trump20pct | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
trumppct | .9291103 .0051341 180.97 0.000 .9190436 .939177
whitenoncollege | 6.034341 .8191837 7.37 0.000 4.428132 7.64055
black | .5535949 .8487943 0.65 0.514 -1.110673 2.217863
latinx | 9.384419 .8487705 11.06 0.000 7.720198 11.04864
regdum1 | -.0818514 .2301836 -0.36 0.722 -.5331824 .3694796
1.regdum2 | -1.652678 .3262698 -5.07 0.000 -2.292409 -1.012947
whitecollege | -15.77708 1.075309 -14.67 0.000 -17.88549 -13.66867
|
regdum2#c.whitecollege |
1 | 4.885712 1.488598 3.28 0.001 1.966954 7.80447
|
regdum4 | -2.275541 .1743519 -13.05 0.000 -2.617401 -1.933682
_cons | 2.23724 .7559097 2.96 0.003 .7550952 3.719385
----------------------------------------------------------------------------------------
. estimates store m4
. * the dydx(regdum2) option asks Stata to calculate the marginal effect of being in the Midwest
. * over the values specified in the at() option, in this case, as whitecollege moves from 0
. * to .6 in steps of .1.
. margins, dydx(regdum2) at(whitecollege=(0(.1).6))
Average marginal effects Number of obs = 3,055
Model VCE : OLS
Expression : Linear prediction, predict()
dy/dx w.r.t. : 1.regdum2
1._at : whitecollege = 0
2._at : whitecollege = .1
3._at : whitecollege = .2
4._at : whitecollege = .3
5._at : whitecollege = .4
6._at : whitecollege = .5
7._at : whitecollege = .6
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0.regdum2 | (base outcome)
-------------+----------------------------------------------------------------
1.regdum2 |
_at |
1 | -1.652678 .3262698 -5.07 0.000 -2.292409 -1.012947
2 | -1.164107 .2017281 -5.77 0.000 -1.559644 -.7685696
3 | -.6755355 .1387622 -4.87 0.000 -.9476125 -.4034584
4 | -.1869643 .205266 -0.91 0.362 -.5894382 .2155097
5 | .3016069 .3306536 0.91 0.362 -.3467199 .9499337
6 | .790178 .4699446 1.68 0.093 -.1312627 1.711619
7 | 1.278749 .6137447 2.08 0.037 .0753535 2.482145
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
.
. * marginsplot will graph the results from the margins command. Check out help marginsplot to
. * see some of the options you can use to change the look of the graph.
. marginsplot
Variables that uniquely identify margins: whitecollege
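As a quick illustration (these particular options are just one possibility, not the graph used in this lab), you could recast the point estimates as a line and the confidence intervals as a shaded band:
. * draw the marginal effects as a line with a shaded confidence band
. marginsplot, recast(line) recastci(rarea) graphregion(color(white))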
Presentation of Results
My favorite ways of presenting regression results require adding user-written Stata commands. You can install such commands from Boston College’s Statistical Software Components (SSC) archive with ssc install. We’ll use two packages: estout and coefplot.
esttab
The esttab command, from the estout package, is, in my opinion, the best way to generate regression tables in Stata. The command lets you make tables of multiple regression models, choose which parameter estimates to include, and control the formatting and look of the table.
. * install the estout package from the SSC
. ssc install estout
checking estout consistency and verifying not already installed...
all files already exist and are up to date.
.
. * SYNTAX: esttab model_names, options
. * the m1 through m4 items below are our stored model names from previous regressions
. * b se options tell Stata to include the coefficients (b) and standard errors (se)
. * star() option lists a symbol and then the significance level
. * the wide option places standard errors to the right of coefficients instead of
. * underneath the coefficients - good if you have multiple models to include
. * in the table.
. * the ar2 option adds the Adj. R^2 score at the bottom of the table.
.
. esttab m1 m2 m3 m4, b se star(+ .1 * .05 ** .01) wide ar2
----------------------------------------------------------------------------------------------------------------------------
(1) (2) (3) (4)
trump20pct trump20pct trump20pct trump20pct
----------------------------------------------------------------------------------------------------------------------------
trumppct 0.978** (0.00361) 0.940** (0.00470) 0.930** (0.00513) 0.929** (0.00513)
whitenonco~e 7.181** (0.813) 6.080** (0.820) 6.034** (0.819)
whitecollege -13.35** (0.988) -14.29** (0.976) -15.78** (1.075)
black 4.518** (0.805) 0.922 (0.843) 0.554 (0.849)
latinx 10.54** (0.862) 9.670** (0.846) 9.384** (0.849)
regdum1 -0.163 (0.229) -0.0819 (0.230)
regdum2 -0.683** (0.139)
regdum4 -2.318** (0.174) -2.276** (0.174)
0.regdum2 0 (.)
1.regdum2 -1.653** (0.326)
0.regdum2#~e 0 (.)
1.regdum2#~e 4.886** (1.489)
_cons -0.0373 (0.248) -0.615 (0.736) 1.814* (0.746) 2.237** (0.756)
----------------------------------------------------------------------------------------------------------------------------
N 3112 3111 3055 3055
adj. R-sq 0.959 0.970 0.972 0.972
----------------------------------------------------------------------------------------------------------------------------
Standard errors in parentheses
+ p<.1, * p<.05, ** p<.01
.
. * You can save the table directly to your computer!
. quietly esttab m1 m2 m3 m4 using regmodels.csv, b se star(+ .1 * .05 ** .01) wide r2 csv replace
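esttab can also write other file types, such as rtf (which opens in Word) or tex; as a sketch, the same table in rtf format (regmodels.rtf is just an example filename):
. * save a Word-friendly version of the same table
. quietly esttab m1 m2 m3 m4 using regmodels.rtf, b se star(+ .1 * .05 ** .01) wide ar2 rtf replace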
coefplot
The coefplot command offers an easy way to graph coefficient estimates and confidence intervals around those estimates. coefplot works best when your variables are measured on similar scales; if they aren’t, you can rescale them, as in the last example below.
. * install the coefplot package
.
. ssc install coefplot
checking coefplot consistency and verifying not already installed...
all files already exist and are up to date.
.
. * load the m3 model results
. estimates restore m3
(results m3 are active now)
.
. coefplot
Let’s switch to vertical orientation:
. * now let's make the dots vertical:
. coefplot, vertical
And clean up the graph to make it easier to read:
.
. * let's rescale our proportion variables and change the labels, marker size, and graph color:
. coefplot, vertical msize(small) ciopts(lwidth(vthin)) rescale(whitenoncollege = .01 whitecollege = .01 black = .01 latinx = .01) ///
> coeflabels(trumppct = `""Trump" "2016 %""' whitenoncollege = `""% White" "No BA""' whitecollege = `""% White" "BA""' ///
> black = "% Black" latinx = "% Latinx" regdum1 = "Northeast" regdum2 = "Midwest" regdum4 = "West" ///
> _cons = "Constant" , labsize(small)) graphregion(color(white))