POL 200 Lab 2: Describing Your Data
Descriptive statistics
Let’s begin by opening the Nominate data for the 116th Congress (Lewis et al. 2020). The goal of today’s lab is to learn how to describe our data through calculations of central tendency and dispersion, and how to create simple tables that present such statistics to your readers. Along the way, we will learn how to run a command while suppressing its output, how to use the display command to call up stored information and print it with text, how to use local macros in Stata, and how to write tables to an Excel file using collect. Ready? Ok - let’s get started! Open the dataset and use the describe command to list your variables.
NOTE: If you do not already have the Nominate data downloaded, converted to Stata format and saved to your computer, please see this before proceeding.
. * Be sure to set your working directory - where you have saved
. * your files and to where you want to write new files. Example:
. * cd "c:\myfolder\"
.
. use HS116_members.dta, clear
What variables are included in this dataset?
. describe
Contains data from HS116_members.dta
Observations: 544
Variables: 22 25 Aug 2020 15:22
--------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------------------------
congress int %8.0g
chamber str6 %9s
icpsr long %12.0g
state_icpsr byte %8.0g
district_code byte %8.0g
state_abbrev str2 %9s
party_code int %8.0g
occupancy byte %8.0g
last_means byte %8.0g
bioname str42 %42s
bioguide_id str7 %9s
born int %8.0g
died byte %8.0g
nominate_dim1 float %9.0g
nominate_dim2 float %9.0g
nominate_log_~d float %9.0g
nominate_geo_~y float %9.0g
nominate_num~es int %8.0g
nominate_num~rs int %8.0g
conditional byte %8.0g
nokken_poole_~1 float %9.0g
nokken_poole_~2 float %9.0g
--------------------------------------------------------------------------------------------------
Sorted by:
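Several names in the listing above are abbreviated with a tilde (for example, nominate_log_~d). As a small sketch, the fullnames option of describe prints each name in full, which helps when you need to type a variable name exactly (output omitted):
. describe, fullnames
. * or describe only the variables whose names start with nominate
. describe nominate*, fullnames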
Several commands are useful for describing your variables, including tabulate (abbreviated tab), summarize, and table.
tabulate
The tabulate command presents one-way and two-way frequency tables, along with a range of other functions. We can use tabulate to view variable values, the percent of cases with a given value, and cumulative percentages.
. tab state_abbrev
state_abbre |
v | Freq. Percent Cum.
------------+-----------------------------------
AK | 3 0.55 0.55
AL | 9 1.65 2.21
AR | 6 1.10 3.31
AZ | 11 2.02 5.33
CA | 56 10.29 15.63
CO | 9 1.65 17.28
CT | 7 1.29 18.57
DE | 3 0.55 19.12
FL | 29 5.33 24.45
GA | 17 3.13 27.57
HI | 4 0.74 28.31
IA | 6 1.10 29.41
ID | 4 0.74 30.15
IL | 20 3.68 33.82
IN | 11 2.02 35.85
KS | 6 1.10 36.95
KY | 8 1.47 38.42
LA | 8 1.47 39.89
MA | 11 2.02 41.91
MD | 11 2.02 43.93
ME | 4 0.74 44.67
MI | 17 3.13 47.79
MN | 10 1.84 49.63
MO | 10 1.84 51.47
MS | 6 1.10 52.57
MT | 3 0.55 53.12
NC | 16 2.94 56.07
ND | 3 0.55 56.62
NE | 5 0.92 57.54
NH | 4 0.74 58.27
NJ | 15 2.76 61.03
NM | 5 0.92 61.95
NV | 6 1.10 63.05
NY | 30 5.51 68.57
OH | 18 3.31 71.88
OK | 7 1.29 73.16
OR | 7 1.29 74.45
PA | 21 3.86 78.31
RI | 4 0.74 79.04
SC | 9 1.65 80.70
SD | 3 0.55 81.25
TN | 11 2.02 83.27
TX | 38 6.99 90.26
UT | 6 1.10 91.36
VA | 13 2.39 93.75
VT | 3 0.55 94.30
WA | 12 2.21 96.51
WI | 11 2.02 98.53
WV | 5 0.92 99.45
WY | 3 0.55 100.00
------------+-----------------------------------
Total | 544 100.00
.
. * or we could use "if" to select just Democrats or just Republicans
. * the "quietly" prefix will suppress the output while still
. * running the command
.
. quietly tab state_abbrev if party_code == 100
. quietly tab state_abbrev if party_code == 200
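tabulate also handles the two-way tables mentioned above. A minimal sketch, crossing chamber with party_code; the row option, which adds row percentages, is chosen here only for illustration (output omitted):
. * two-way table: rows are chambers, columns are party codes
. tab chamber party_code
. * add the percentage of each chamber that belongs to each party
. tab chamber party_code, row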
summarize
summarize reports a pre-set group of statistics for the variables you list, such as the mean, the standard deviation, and the number of observations. You can use summarize with one or more variables. When you include the detail option, Stata provides additional statistics, including the variance, the median, and the lowest and highest observations. Using the detail option also saves those additional statistics in Stata’s memory as scalars that we can call up when needed using the display command. display can also be used to print information and to conduct calculations based on what is stored or on new information we pass to Stata.
. sum nominate_dim1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
nominate_d~1 | 544 .0506949 .4545393 -.77 .916
. display r(mean)
.05069485
. quietly sum nominate_dim1, detail
. display "The mean Nominate score in the 116th Congress is " r(mean) - r(p50) " points greater th
> an the median."
The mean Nominate score in the 116th Congress is .20269486 points greater than the median.
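To see everything that summarize leaves behind, the return list command prints the stored r() results. A short sketch (output omitted); the range calculation is only an illustration of combining stored scalars:
. quietly sum nominate_dim1, detail
. * list every result that summarize, detail stored in r()
. return list
. * combine two stored results in a calculation
. display "Range of first-dimension scores: " r(max) - r(min)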
Let’s do one more example using summarize and display. What proportion of Republicans in Congress could we consider “moderate”, defined as having a Nominate score at least two standard deviations below (more liberal than) the GOP mean? To make our lives a bit easier, let’s save the scalars from the summarize command as local macros that we can call up later, even if we run summarize again.
. quietly sum nominate_dim1 if party_code == 200, detail
. display "Nominate mean = " r(mean)
Nominate mean = .50268077
. local mn_gop = r(mean)
. local sd_gop = r(sd)
.
. * We can access the local macro by enclosing the name in a left single
. * quote (`) and a right single quote (')
. display "Average Republican Nominate Score: " `mn_gop' "; " ///
> "Republican standard deviation: " `sd_gop'
Average Republican Nominate Score: .50268077; Republican standard deviation: .1463517
.
. * Let's use the local macro "threshold" to hold the value of two
. * standard deviations below the mean
. local threshold = `mn_gop' - 2*`sd_gop'
.
. quietly sum nominate_dim1 if party_code==200
. quietly local obs_gop = r(N)
.
. sum nominate_dim1 if party_code==200 & nominate_dim1<=`threshold'
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
nominate_d~1 | 5 .1646 .0327536 .112 .202
.
. * notice that I have shortened the display command to just di,
. * which is Stata's accepted abbreviation for display.
. di "Only " r(N) " Republicans have Nominate scores lower (more liberal) " ///
> "than two standard deviations below the mean GOP DW_Nominate score, " ///
> r(N) / `obs_gop' " of the congressional GOP."
Only 5 Republicans have Nominate scores lower (more liberal) than two standard deviations below
> the mean GOP DW_Nominate score, .01923077 of the congressional GOP.
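The same proportion can be computed without printing a summarize table. This is a sketch, assuming the local macros threshold and obs_gop defined above are still in memory (in a do-file, run these lines together with the lines that define them). It uses count, which stores the number of matching observations in r(N), and a display format (%4.1f) to print a rounded percentage (output omitted):
. quietly count if party_code == 200 & nominate_dim1 <= `threshold'
. di "Share of Republicans below the threshold: " %4.1f 100*r(N)/`obs_gop' " percent"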
table
Another option for creating tables of descriptive statistics is table, which lets the user choose which statistics to include, each calculated on whatever variables we choose. Let’s make a table of descriptive statistics with Lewis et al.’s first- and second-dimension Nominate scores as well as the Nokken and Poole scores.
. * this command will tell Stata to calculate the range of statistics listed
. * in the statistics() option
. * by adding the "save" option, we can allow the table results to be accessed later
.
. table (var), ///
> stat(count nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(mean nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(median nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(sd nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(min nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(max nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2)
--------------------------------------------------------------------------------------------------------------------------
                  |  Number of non-missing values       Mean   Median   Standard deviation   Minimum value   Maximum value
------------------+-------------------------------------------------------------------------------------------------------
nominate_dim1     |                           544   .0506949    -.152             .4545393            -.77            .916
nominate_dim2     |                           544   .0205331    .0165             .2997723           -.975            .879
nokken_poole_dim1 |                           543   .0450331    -.157             .4637939           -.808               1
nokken_poole_dim2 |                           543   .0405285     .053             .3298611           -.975            .879
--------------------------------------------------------------------------------------------------------------------------
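Because the same four variable names are repeated in every stat() option, a local macro can hold the list once. This is a sketch, assuming you are running from a do-file (the /// line continuation is only understood there). The macro name vars is hypothetical, and the nformat() option, which rounds the mean and standard deviation to three decimals, is an optional extra (output omitted):
. local vars nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2
. table (var), ///
> stat(count `vars') stat(mean `vars') stat(median `vars') ///
> stat(sd `vars') stat(min `vars') stat(max `vars') ///
> nformat(%6.3f mean sd)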
Now this is useful, but it could be even more useful if we could export this table to Word or Excel or some other word-processing or typesetting program for use in academic writing. We can use Stata’s new collect prefix and collect export command to first save the table results, and then to export them.
. collect table (var), ///
> stat(count nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(mean nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(median nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(sd nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(min nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
> stat(max nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2)
--------------------------------------------------------------------------------------------------------------------------
                  |  Number of non-missing values       Mean   Median   Standard deviation   Minimum value   Maximum value
------------------+-------------------------------------------------------------------------------------------------------
nominate_dim1     |                           544   .0506949    -.152             .4545393            -.77            .916
nominate_dim2     |                           544   .0205331    .0165             .2997723           -.975            .879
nokken_poole_dim1 |                           543   .0450331    -.157             .4637939           -.808               1
nokken_poole_dim2 |                           543   .0405285     .053             .3298611           -.975            .879
--------------------------------------------------------------------------------------------------------------------------
.
. * the syntax below will export the collected table to the Microsoft Excel file,
. * lab2_descriptive_statistics.xlsx. Many other file formats are possible.
. * Check out the help file for collect export to find out more.
.
. * the "replace" option will allow you to overwrite an existing table with the same name.
. collect export lab2_descriptive_statistics.xlsx, replace
(collection Table exported to file lab2_descriptive_statistics.xlsx)
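collect export chooses the file format from the file extension. As a sketch, the lines below preview the collected table and then write it to a Word document; the file name lab2_descriptive_statistics.docx is only an example (output omitted):
. * show the current table from the collection
. collect preview
. * export the same table to a Word document
. collect export lab2_descriptive_statistics.docx, replace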