POL 200 Lab 4: Comparing Using Crosstabs and Mean Comparisons
In previous labs, we covered describing data, simple data cleaning, and variable creation in Stata. Now let’s move on to comparing the values of two variables. We will focus on two simple methods: cross-tabulations (crosstabs) and mean comparisons. Both are appropriate when the independent variable is categorical or ordinal. Use a crosstab when the dependent variable is also categorical or ordinal; use a mean comparison when the dependent variable is continuous (interval) or a dummy (dichotomous) variable.
For this lab, we will use the July 2020 AP-NORC Poll, available from the Roper Center. See the previous lab for instructions on downloading and accessing the data.
First hypothesis: respondents exposed to the coronavirus are more likely to support closing bars and restaurants than are those who have not been exposed.
Second hypothesis: respondents worried about the coronavirus infection are more likely to say the country is headed in the wrong direction.
Third hypothesis: respondents experiencing economic hardship are more likely to say the country is headed in the wrong direction.
. * Change the file path below to the appropriate working directory for your machine
.
. cd h:\POL200\labs
h:\POL200\labs
. use 31117583.dta, clear
. * Recode the variables we'll use in the analysis, making sure to code
. * missing data as periods (.)
. * We can also specify value labels directly in the recode command if
. * we are creating a new variable using the "gen" option
.
. codebook CUR1
--------------------------------------------------------------------------------------------------------------------------------------------------------------
CUR1 CUR1: Generally speaking, would you say things in this country are heading in th
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: CUR1
Range: [1,99] Units: 1
Unique values: 3 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
197 1 (1) Right direction
851 2 (2) Wrong direction
9 99 (99) DON'T KNOW/SKIPPED ON
WEB/REFUSED (VOL)
. recode CUR1 (1=1 "Right direction")(2=0 "Wrong direction") ///
> (99=.), gen(rightdir)
(860 differences between CUR1 and rightdir)
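A quick way to confirm that a recode did what we intended is to cross-tabulate the old and new variables, including missing values. The sketch below assumes the recode above has already been run. The 860 changes Stata reports should correspond to the 851 "Wrong direction" responses moved from 2 to 0 plus the 9 responses coded 99 that were set to missing.

```stata
* Cross-tabulate the original and recoded variables; the missing option
* keeps the 99s that were recoded to missing (.) visible in the table
tab CUR1 rightdir, missing

* The number of changed values reported by recode: 851 + 9 (displays 860)
display 851 + 9
```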
. codebook politics B2AB
--------------------------------------------------------------------------------------------------------------------------------------------------------------
politics POLITICS: Do you consider yourself a Democrat, a Republican, an independent or n
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: POLITICS
Range: [1,99] Units: 1
Unique values: 5 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
347 1 (1) Democrat
324 2 (2) Republican
258 3 (3) Independent
119 4 (4) None of these
9 99 (99) DON'T KNOW/SKIPPED ON
WEB/REFUSED (VOL)
--------------------------------------------------------------------------------------------------------------------------------------------------------------
B2AB B2AB: And how would you describe the financial situation in your own household t
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: B2AB
Range: [1,7] Units: 1
Unique values: 7 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
168 1 (1) Very good
359 2 (2) Somewhat good
188 3 (3) Lean toward good
1 4 (4) Neither good nor poor
133 5 (5) Lean toward poor
146 6 (6) Somewhat poor
62 7 (7) Very poor
.
. codebook VIRUS2A
--------------------------------------------------------------------------------------------------------------------------------------------------------------
VIRUS2A VIRUS2A: [The coronavirus] How worried are you about you or someone in your fami
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: VIRUS2A
Range: [1,99] Units: 1
Unique values: 6 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
266 1 (1) Extremely worried
239 2 (2) Very worried
329 3 (3) Somewhat worried
138 4 (4) Not too worried
82 5 (5) Not at all worried
3 99 (99) DON'T KNOW/SKIPPED ON
WEB/REFUSED (VOL)
. recode VIRUS2A (1=5 "Extremely Worried")(2=4)(3=3 "Somewhat worried") ///
> (4=2)(5=1 "Not at all worried")(99=.), gen(worried)
(728 differences between VIRUS2A and worried)
.
. codebook VIRUS7A
--------------------------------------------------------------------------------------------------------------------------------------------------------------
VIRUS7A VIRUS7A: [Requiring bars and restaurants to close] In response to the coronaviru
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: VIRUS7A
Range: [1,99] Units: 1
Unique values: 6 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
290 1 (1) Strongly favor
272 2 (2) Somewhat favor
170 3 (3) Neither favor nor oppose
193 4 (4) Somewhat oppose
127 5 (5) Strongly oppose
5 99 (99) DON'T KNOW/SKIPPED ON
WEB/REFUSED (VOL)
. recode VIRUS7A (1=5 "Strongly favor")(2=4)(3=3 "Neither favor nor oppose") ///
> (4=2)(5=1 "Strongly Oppose")(99=.), gen(closebars)
(887 differences between VIRUS7A and closebars)
.
. recode VIRUS14 (1=1 "Yes")(2=0 "No")(99=.), gen(gotcorona)
(777 differences between VIRUS14 and gotcorona)
.
. * To make our tables easier to read later, let's change a few variable labels:
.
. label var worried "How worried are you about Covid-19"
. label var closebars "Requiring bars/restaurants to close"
. label var gotcorona "Resp or close friend has had Covid-19"
. label var rightdir "Is country going in right direction"
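The recodes above labeled only some of the new values, so values 2 and 4 of closebars and worried appear as bare numbers in the tables that follow. One optional way to fill them in is sketched below. It assumes recode stored each set of value labels under the new variable's name (Stata's default when gen() is used without the label() option); run label list first to confirm. The label text comes from the original codebook entries. The output shown in the rest of this lab was produced without this step.

```stata
* Check which value labels recode created
label list closebars worried

* Add labels for the middle categories without overwriting existing ones
* (original 4 "Somewhat oppose" became 2; original 2 "Somewhat favor" became 4)
label define closebars 2 "Somewhat oppose" 4 "Somewhat favor", modify
label define worried 2 "Not too worried" 4 "Very worried", modify

* Confirm the result
tab closebars
tab worried
```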
Crosstabs
A crosstab is a two-way frequency table: it shows how your observations are jointly distributed across two variables. We can use it to evaluate the relationship between X and Y by seeing how each value of the Y variable becomes more (or less) likely as we move across categories of the X variable. Several commands can generate a crosstab; a quick one is tab. Put the dependent variable in the rows and the independent variable in the columns, and be sure to specify the col option to calculate column percentages. To interpret the crosstab, compare the column percentages across columns within a row.
. * SYNTAX: tab dv iv, col
. tab closebars gotcorona, col
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
Requiring | Resp or close friend
bars/restaurants to | has had Covid-19
close | No Yes | Total
----------------------+----------------------+----------
Strongly Oppose | 100 27 | 127
| 13.09 9.64 | 12.16
----------------------+----------------------+----------
2 | 142 50 | 192
| 18.59 17.86 | 18.39
----------------------+----------------------+----------
Neither favor nor opp | 130 39 | 169
| 17.02 13.93 | 16.19
----------------------+----------------------+----------
4 | 194 76 | 270
| 25.39 27.14 | 25.86
----------------------+----------------------+----------
Strongly favor | 198 88 | 286
| 25.92 31.43 | 27.39
----------------------+----------------------+----------
Total | 764 280 | 1,044
| 100.00 100.00 | 100.00
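To see where the column percentages come from, we can reproduce a few by hand: each one is a cell frequency divided by its column total. The short sketch below uses only numbers from the table above and is meant as an illustration, not a required step.

```stata
* Strongly favor among respondents with Covid-19 exposure (Yes column): 88 of 280
display %5.2f 100 * 88/280

* Strongly favor among respondents without exposure (No column): 198 of 764
display %5.2f 100 * 198/764

* Percentage-point difference between the two columns
display %5.2f 100 * 88/280 - 100 * 198/764
```

The first two results should match the 31.43 and 25.92 in the Strongly favor row, a gap of about 5.5 percentage points in the direction the first hypothesis predicts.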
This is a nice little table, and the syntax is easy to remember. But tab has a big downside: it offers no direct way to export the table to a program that lets you share your findings with others. So instead, let’s turn to Stata’s table command, which we’ve used before.
. * CROSSTAB SYNTAX: table dv iv, stat(percent, across(dv)) stat(freq)
. table closebars gotcorona, stat(percent, across(closebars)) stat(freq)
--------------------------------------------------------------------------------
| Resp or close friend has had Covid-19
| No Yes Total
------------------------------------+-------------------------------------------
Requiring bars/restaurants to close |
Strongly Oppose |
Percent | 13.09 9.64 12.16
Frequency | 100 27 127
2 |
Percent | 18.59 17.86 18.39
Frequency | 142 50 192
Neither favor nor oppose |
Percent | 17.02 13.93 16.19
Frequency | 130 39 169
4 |
Percent | 25.39 27.14 25.86
Frequency | 194 76 270
Strongly favor |
Percent | 25.92 31.43 27.39
Frequency | 198 88 286
Total |
Percent | 100.00 100.00 100.00
Frequency | 764 280 1,044
--------------------------------------------------------------------------------
As always with the table command, we can combine it with collect for exporting.
. collect table closebars gotcorona, stat(percent, across(closebars)) stat(freq)
--------------------------------------------------------------------------------
| Resp or close friend has had Covid-19
| No Yes Total
------------------------------------+-------------------------------------------
Requiring bars/restaurants to close |
Strongly Oppose |
Percent | 13.09 9.64 12.16
Frequency | 100 27 127
2 |
Percent | 18.59 17.86 18.39
Frequency | 142 50 192
Neither favor nor oppose |
Percent | 17.02 13.93 16.19
Frequency | 130 39 169
4 |
Percent | 25.39 27.14 25.86
Frequency | 194 76 270
Strongly favor |
Percent | 25.92 31.43 27.39
Frequency | 198 88 286
Total |
Percent | 100.00 100.00 100.00
Frequency | 764 280 1,044
--------------------------------------------------------------------------------
. collect export crosstab1.xlsx, replace
(collection Table exported to file crosstab1.xlsx)
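The same collection can be written to other file types, and the percentages can be rounded before exporting. The sketch below assumes Stata 17 or later (the version that introduced this table syntax and collect). The file name crosstab1.docx is just an example.

```stata
* nformat() controls how a statistic is displayed; here one decimal place
table closebars gotcorona, stat(percent, across(closebars)) stat(freq) ///
    nformat(%5.1f percent)

* table stores its results in a collection, so we can preview and export it
collect preview
collect export crosstab1.docx, replace
```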
Mean Comparisons
Mean comparisons follow a similar logic: what happens to the mean of the dependent variable as we move across categories of the independent variable? Does the average value of the DV change in the hypothesized way? We can also run a mean comparison with the tab command, this time using the sum option. The IV should be categorical or ordinal, and the DV should be continuous or a dummy variable. When the DV is a dummy such as rightdir (coded 1 for right direction and 0 for wrong direction), its mean is the proportion of respondents coded 1.
. tab rightdir
Is country |
going in right |
direction | Freq. Percent Cum.
----------------+-----------------------------------
Wrong direction | 851 81.20 81.20
Right direction | 197 18.80 100.00
----------------+-----------------------------------
Total | 1,048 100.00
.
. * SYNTAX: tab iv, sum(dv)
. tab worried, sum(rightdir)
How worried |
are you | Summary of Is country going in
about | right direction
Covid-19 | Mean Std. dev. Freq.
------------+------------------------------------
Not at al | .2375 .428236 80
2 | .30434783 .46180692 138
Somewhat | .22769231 .4199896 325
4 | .13445378 .34185816 238
Extremely | .10984848 .31329473 264
------------+------------------------------------
Total | .18755981 .39054716 1,045
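Because rightdir is coded 0/1, each mean in the table is simply the proportion of that group saying the country is headed in the right direction. A minimal check, assuming the recoded variables from above are in memory:

```stata
* The mean of a 0/1 variable is the share of 1s; compare with the Mean column above
summarize rightdir if worried == 1
summarize rightdir if worried == 5

* Gap between the least and most worried groups, using the means in the table
display %6.4f .2375 - .10984848
```

The gap is roughly 13 percentage points: the least worried respondents are more likely to say "right direction" than the most worried, which is the direction the second hypothesis predicts.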
Again, this is a perfectly nice table, except that we can’t easily use it in a report or presentation. Let’s use table instead.
. * MEAN COMPARISON SYNTAX: table iv, stat(mean dv)
. table worried, stat(mean rightdir)
----------------------------------------------
| Mean
-----------------------------------+----------
How worried are you about Covid-19 |
Not at all worried | .2375
2 | .3043478
Somewhat worried | .2276923
4 | .1344538
Extremely Worried | .1098485
Total | .1875598
----------------------------------------------
.
. * You could specify a couple more statistics if you wanted, and then export
. * the table using collect
. collect table worried, stat(mean rightdir) stat(sd rightdir) stat(count rightdir)
--------------------------------------------------------------------------------------------------
| Mean Standard deviation Number of non-missing values
-----------------------------------+--------------------------------------------------------------
How worried are you about Covid-19 |
Not at all worried | .2375 .428236 80
2 | .3043478 .4618069 138
Somewhat worried | .2276923 .4199896 325
4 | .1344538 .3418582 238
Extremely Worried | .1098485 .3132947 264
Total | .1875598 .3905472 1,045
--------------------------------------------------------------------------------------------------
. collect export meancomp1.xlsx, replace
(collection Table exported to file meancomp1.xlsx)
Now let’s create another mean comparison and append it to our existing Excel file using collect export.
.
. * We can tell Stata to modify an existing excel file
. * and to write the table starting at a specific cell
. * using the modify and cell() options
.
. collect table B2AB, stat(mean rightdir) ///
> stat(sd rightdir) ///
> stat(count rightdir)
------------------------------------------------------------------------------------------------------------------------------------------------
| Mean Standard deviation Number of non-missing values
---------------------------------------------------------------------------------+--------------------------------------------------------------
B2AB: And how would you describe the financial situation in your own household t |
(1) Very good | .327381 .4706604 168
(2) Somewhat good | .220339 .4150619 354
(3) Lean toward good | .1621622 .3695998 185
(4) Neither good nor poor | 1 . 1
(5) Lean toward poor | .0977444 .2980914 133
(6) Somewhat poor | .0758621 .2656951 145
(7) Very poor | .1451613 .355139 62
Total | .1879771 .3908804 1,048
------------------------------------------------------------------------------------------------------------------------------------------------
. collect export meancomp1.xlsx, modify cell(A13)
(collection Table exported to file meancomp1.xlsx)
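Instead of working out which cell to start at, another option is to put each table on its own worksheet. The sketch below assumes that the sheet() option of collect export is available for .xlsx files in your version of Stata (check help collect export). The sheet name finances is hypothetical.

```stata
* Rebuild the second table, then add it to the existing workbook on a new sheet
table B2AB, stat(mean rightdir) stat(sd rightdir) stat(count rightdir)
collect export meancomp1.xlsx, sheet(finances) modify
```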
Accounting for Confounding Variable Z
There are several ways to “control” for a confounding variable. In a crosstab or mean comparison, we can hold Z constant by looking at the relationship between X and Y within each category of Z. Let’s do this for both the crosstab (controlling for gender) and the mean comparison (controlling for political party).
. * Perhaps the simplest way to control for Z is to run the
. * crosstab command multiple times, each time selecting
. * different categories of Z:
.
. * Let's look at the values of Z
. codebook gender
--------------------------------------------------------------------------------------------------------------------------------------------------------------
gender GENDER: Gender
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: GENDER
Range: [1,2] Units: 1
Unique values: 2 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
423 1 (1) Male
634 2 (2) Female
. * Now let's re-run our crosstab, once for men and once for women,
. * using "if" to select each category
. table closebars gotcorona if gender==1, ///
> stat(percent, across(closebars)) stat(freq)
--------------------------------------------------------------------------------
| Resp or close friend has had Covid-19
| No Yes Total
------------------------------------+-------------------------------------------
Requiring bars/restaurants to close |
Strongly Oppose |
Percent | 15.43 11.46 14.52
Frequency | 50 11 61
2 |
Percent | 18.83 17.71 18.57
Frequency | 61 17 78
Neither favor nor oppose |
Percent | 14.20 21.88 15.95
Frequency | 46 21 67
4 |
Percent | 26.85 19.79 25.24
Frequency | 87 19 106
Strongly favor |
Percent | 24.69 29.17 25.71
Frequency | 80 28 108
Total |
Percent | 100.00 100.00 100.00
Frequency | 324 96 420
--------------------------------------------------------------------------------
. table closebars gotcorona if gender==2, ///
> stat(percent, across(closebars)) stat(freq)
--------------------------------------------------------------------------------
| Resp or close friend has had Covid-19
| No Yes Total
------------------------------------+-------------------------------------------
Requiring bars/restaurants to close |
Strongly Oppose |
Percent | 11.36 8.70 10.58
Frequency | 50 16 66
2 |
Percent | 18.41 17.93 18.27
Frequency | 81 33 114
Neither favor nor oppose |
Percent | 19.09 9.78 16.35
Frequency | 84 18 102
4 |
Percent | 24.32 30.98 26.28
Frequency | 107 57 164
Strongly favor |
Percent | 26.82 32.61 28.53
Frequency | 118 60 178
Total |
Percent | 100.00 100.00 100.00
Frequency | 440 184 624
--------------------------------------------------------------------------------
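Running the same command once per category gets tedious when Z has many categories. One way to automate it, sketched here, is a forvalues loop over the values of gender that exports each table as it is created. The file names are hypothetical.

```stata
* Loop over gender (1 = Male, 2 = Female): one crosstab and one file per group
forvalues g = 1/2 {
    table closebars gotcorona if gender == `g', ///
        stat(percent, across(closebars)) stat(freq)
    collect export crosstab_gender`g'.xlsx, replace
}
```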
We can also control for Z using a single, exportable table command. Notice the addition of the variable gender in the command below. Now the columns of gotcorona are nested inside the categories of gender.
.
. * SYNTAX: table dv (z iv), stat(percent, across(dv)) stat(freq)
. * To suppress the Total rows and columns, add the nototals option
. table closebars (gender gotcorona), stat(percent, across(closebars)) stat(freq)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GENDER: Gender
| (1) Male (2) Female Total
| Resp or close friend has had Covid-19 Resp or close friend has had Covid-19 Resp or close friend has had Covid-19
| No Yes Total No Yes Total No Yes Total
------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------
Requiring bars/restaurants to close |
Strongly Oppose |
Percent | 15.43 11.46 14.52 11.36 8.70 10.58 13.09 9.64 12.16
Frequency | 50 11 61 50 16 66 100 27 127
2 |
Percent | 18.83 17.71 18.57 18.41 17.93 18.27 18.59 17.86 18.39
Frequency | 61 17 78 81 33 114 142 50 192
Neither favor nor oppose |
Percent | 14.20 21.88 15.95 19.09 9.78 16.35 17.02 13.93 16.19
Frequency | 46 21 67 84 18 102 130 39 169
4 |
Percent | 26.85 19.79 25.24 24.32 30.98 26.28 25.39 27.14 25.86
Frequency | 87 19 106 107 57 164 194 76 270
Strongly favor |
Percent | 24.69 29.17 25.71 26.82 32.61 28.53 25.92 31.43 27.39
Frequency | 80 28 108 118 60 178 198 88 286
Total |
Percent | 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Frequency | 324 96 420 440 184 624 764 280 1,044
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
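The nested table is wide, in part because of the Total columns. Here is a sketch of the same command with the totals suppressed, assuming the nototals option of table in Stata 17 or later:

```stata
* Same nested crosstab, without the Total rows and columns
table closebars (gender gotcorona), ///
    stat(percent, across(closebars)) stat(freq) nototals
```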
. * SYNTAX: table iv z, stat(mean dv)
. table worried politics, stat(mean rightdir)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
| POLITICS: Do you consider yourself a Democrat, a Republican, an independent or n
| (1) Democrat (2) Republican (3) Independent (4) None of these (99) DON'T KNOW/SKIPPED ON WEB/REFUSED (VOL) Total
-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------
How worried are you about Covid-19 |
Not at all worried | 0 .2619048 .125 .3846154 1 .2375
2 | .1818182 .3648649 .2941176 .1666667 0 .3043478
Somewhat worried | .1348315 .3583333 .0857143 .2727273 .5 .2276923
4 | .0384615 .3111111 .1363636 .2 .3333333 .1344538
Extremely Worried | .0526316 .2439024 .1538462 .0833333 0 .1098485
Total | .0724638 .326087 .1474104 .2184874 .375 .1875598
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
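This table is messy: politics still includes the “None of these” category and the (99) “Don’t know” category, which contains only 9 respondents. Let’s restrict the analysis to just Democrats and Republicans.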
. * the "<=" below means "less than or equal to"
. * alternatively, we could have done: politics == 1 | politics == 2
. * where the "|" means "or"
.
. table worried politics if politics <= 2, stat(mean rightdir)
--------------------------------------------------------------------------------------------------------------------------
| POLITICS: Do you consider yourself a Democrat, a Republican, an independent or n
| (1) Democrat (2) Republican Total
-----------------------------------+--------------------------------------------------------------------------------------
How worried are you about Covid-19 |
Not at all worried | 0 .2619048 .22
2 | .1818182 .3648649 .3411765
Somewhat worried | .1348315 .3583333 .2631579
4 | .0384615 .3111111 .1208054
Extremely Worried | .0526316 .2439024 .0977011
Total | .0724638 .326087 .1949025
--------------------------------------------------------------------------------------------------------------------------
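An alternative to the if qualifier is to build a two-category party variable once and reuse it. This is only a sketch: the variable name party2 and its label are hypothetical, and it should give the same means as the table above because everyone other than Democrats and Republicans is set to missing.

```stata
* Two-category party variable; all other responses become missing
recode politics (1=1 "Democrat") (2=2 "Republican") (3 4 99=.), gen(party2)
label var party2 "Party identification"

* Same comparison as the "if politics <= 2" version, with a shorter column header
table worried party2, stat(mean rightdir)
```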
OK, one more mean comparison. Let’s repeat our table command, this time using B2AB (financial situation of household) as the IV. Comparing this table with the previous one, what do we learn about how worry about Covid-19 and personal finances each relate to right/wrong direction evaluations among Democrats and Republicans in 2020?
. table B2AB politics if politics <= 2, stat(mean rightdir)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| POLITICS: Do you consider yourself a Democrat, a Republican, an independent or n
| (1) Democrat (2) Republican Total
---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------
B2AB: And how would you describe the financial situation in your own household t |
(1) Very good | .106383 .4683544 .3333333
(2) Somewhat good | .1010101 .3357664 .2372881
(3) Lean toward good | .0508475 .2413793 .1452991
(5) Lean toward poor | .0434783 .1578947 .0769231
(6) Somewhat poor | .0322581 .2 .0804598
(7) Very poor | .09375 .2 .1081081
Total | .0724638 .3281734 .1961078
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
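Before reading too much into any single cell, it helps to see how many respondents are behind each mean. Categories such as Very poor contain few respondents overall (62 in the full sample), so their party-specific means may be unstable. The sketch below adds cell counts and rounds the means to two decimals. The nformat() option assumes Stata 17 or later.

```stata
* Mean of rightdir and number of non-missing cases in each finances-by-party cell
table B2AB politics if politics <= 2, ///
    stat(mean rightdir) stat(count rightdir) nformat(%4.2f mean)
```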