POL 200 Lab 2: Describing Your Data

Descriptive statistics

Let’s begin by opening the Nominate data for the 116th Congress (Lewis et al. 2020). The goal of today’s lab is to learn how to describe our data with measures of central tendency and dispersion, and how to create simple tables that present those statistics to your readers. Along the way, we will learn how to run a command while suppressing its output, how to use the display command to call up stored results and print them alongside text, how to use local macros in Stata, and how to write tables to an Excel file using collect. Ready? OK, let’s get started! Open the dataset and use the describe command to list your variables.

NOTE: If you do not already have the Nominate data downloaded, converted to Stata format and saved to your computer, please see this before proceeding.

. * Be sure to set your working directory - where you have saved 
. *   your files and to where you want to write new files. Example:
. *   cd "c:\myfolder\" 
. 
. use HS116_members.dta, clear

What variables are included in this dataset?

. describe

Contains data from HS116_members.dta
 Observations:           544                  
    Variables:            22                  25 Aug 2020 15:22
--------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------
congress        int     %8.0g                 
chamber         str6    %9s                   
icpsr           long    %12.0g                
state_icpsr     byte    %8.0g                 
district_code   byte    %8.0g                 
state_abbrev    str2    %9s                   
party_code      int     %8.0g                 
occupancy       byte    %8.0g                 
last_means      byte    %8.0g                 
bioname         str42   %42s                  
bioguide_id     str7    %9s                   
born            int     %8.0g                 
died            byte    %8.0g                 
nominate_dim1   float   %9.0g                 
nominate_dim2   float   %9.0g                 
nominate_log_~d float   %9.0g                 
nominate_geo_~y float   %9.0g                 
nominate_num~es int     %8.0g                 
nominate_num~rs int     %8.0g                 
conditional     byte    %8.0g                 
nokken_poole_~1 float   %9.0g                 
nokken_poole_~2 float   %9.0g                 
--------------------------------------------------------------------------------------------------
Sorted by: 
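
The describe output above abbreviates long variable names with a tilde (for example, nominate_log_~d). As a small sketch, describe also accepts a variable list with wildcards and a fullnames option, which should print those names in full. The nominate_* and nokken_poole_* patterns are chosen here only for illustration.

. * list only the Nominate and Nokken-Poole variables, without abbreviating names
. describe nominate_* nokken_poole_*, fullnames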

Several commands are useful for describing your variables, including tabulate (abbreviated tab), summarize, and table.

tabulate

The tabulate command presents one-way and two-way frequency tables, along with a range of other functions. We can use tabulate to view a variable’s values, the percent of cases with each value, and cumulative percentages.

. tab state_abbrev 

state_abbre |
          v |      Freq.     Percent        Cum.
------------+-----------------------------------
         AK |          3        0.55        0.55
         AL |          9        1.65        2.21
         AR |          6        1.10        3.31
         AZ |         11        2.02        5.33
         CA |         56       10.29       15.63
         CO |          9        1.65       17.28
         CT |          7        1.29       18.57
         DE |          3        0.55       19.12
         FL |         29        5.33       24.45
         GA |         17        3.13       27.57
         HI |          4        0.74       28.31
         IA |          6        1.10       29.41
         ID |          4        0.74       30.15
         IL |         20        3.68       33.82
         IN |         11        2.02       35.85
         KS |          6        1.10       36.95
         KY |          8        1.47       38.42
         LA |          8        1.47       39.89
         MA |         11        2.02       41.91
         MD |         11        2.02       43.93
         ME |          4        0.74       44.67
         MI |         17        3.13       47.79
         MN |         10        1.84       49.63
         MO |         10        1.84       51.47
         MS |          6        1.10       52.57
         MT |          3        0.55       53.12
         NC |         16        2.94       56.07
         ND |          3        0.55       56.62
         NE |          5        0.92       57.54
         NH |          4        0.74       58.27
         NJ |         15        2.76       61.03
         NM |          5        0.92       61.95
         NV |          6        1.10       63.05
         NY |         30        5.51       68.57
         OH |         18        3.31       71.88
         OK |          7        1.29       73.16
         OR |          7        1.29       74.45
         PA |         21        3.86       78.31
         RI |          4        0.74       79.04
         SC |          9        1.65       80.70
         SD |          3        0.55       81.25
         TN |         11        2.02       83.27
         TX |         38        6.99       90.26
         UT |          6        1.10       91.36
         VA |         13        2.39       93.75
         VT |          3        0.55       94.30
         WA |         12        2.21       96.51
         WI |         11        2.02       98.53
         WV |          5        0.92       99.45
         WY |          3        0.55      100.00
------------+-----------------------------------
      Total |        544      100.00

. 
. *  or we could use "if" to select just Democrats or just Republicans
. *  the "quietly" prefix will suppress the output while still
. *  running the command
. 
. quietly tab state_abbrev if party_code == 100

. quietly tab state_abbrev if party_code == 200
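
The two-way tables mentioned above are not shown in this lab, so here is a minimal sketch using two variables from this dataset. The pairing of chamber with party_code, and the use of the sort and row options, are illustrative choices rather than part of the original exercise.

. * one-way table of states, ordered from most to fewest members
. tab state_abbrev, sort
. 
. * two-way table: chamber by party, with row percentages added to each cell
. tab chamber party_code, row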

summarize

summarize reports a pre-set group of statistics for the variables you list, such as the mean, the standard deviation, and the number of observations. You can list one or more variables. With the detail option, Stata provides additional statistics, including the variance, the median, and the lowest and highest observations. Using the detail option also saves those additional statistics in Stata’s memory as scalars, such as r(p50) for the median, that we can call up when needed using the display command. display can also print text and carry out calculations based on stored results or on new information we pass to Stata.

. sum nominate_dim1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
nominate_d~1 |        544    .0506949    .4545393       -.77       .916

. display r(mean)
.05069485

. quietly sum nominate_dim1, detail

. display "The mean Nominate score in the 116th Congress is " r(mean) - r(p50) ///
>     " points greater than the median."
The mean Nominate score in the 116th Congress is .20269486 points greater than the median.
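
To see everything summarize has stored, not just r(mean), one option is the return list command. The short sketch below assumes the dataset is still in memory. The wording of the display line is only an example.

. quietly sum nominate_dim1, detail
. 
. * print every stored r() result left behind by the last command
. return list
. 
. * any of these can be used with display, for instance the median and the variance
. display "Median = " r(p50) "; variance = " r(Var)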

Let’s do one more example using summarize and display. What proportion of Republicans in Congress could we consider “moderate”, defined as having a Nominate score at least two standard deviations below (more liberal than) the GOP mean? To make our lives a bit easier, let’s save the scalars from the summarize command as local macros that we can call up later, even after we run summarize again.

. quietly sum nominate_dim1 if party_code == 200, detail

. display "Nominate mean = " r(mean)
Nominate mean = .50268077

. local mn_gop = r(mean)

. local sd_gop = r(sd)

. 
. * We can access the local macro by enclosing its name in a left quote (`) and a right quote (')
. display "Average Republican Nominate Score: " `mn_gop' "; " /// 
>     "Republican standard deviation: " `sd_gop'
Average Republican Nominate Score: .50268077; Republican standard deviation: .1463517

. 
. * Let's use the local macro "threshold" to hold the value of two 
. *   standard deviations below the mean
. local threshold = `mn_gop' - 2*`sd_gop'

. 
. quietly sum nominate_dim1 if party_code==200

. local obs_gop = r(N)

.  
. sum nominate_dim1 if party_code==200 & nominate_dim1<=`threshold'

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
nominate_d~1 |          5       .1646    .0327536       .112       .202

. 
. * notice that I have shortened the display command to just di, 
. *   which is Stata's accepted abbreviation for display. 
. di "Only " r(N) " Republicans have Nominate scores lower (more liberal) "  ///
>     "than two standard deviations below the mean GOP DW_Nominate score, " ///
>     r(N) / `obs_gop' " of the congressional GOP."
Only 5 Republicans have Nominate scores lower (more liberal) than two standard deviations below the
>  mean GOP DW_Nominate score, .01923077 of the congressional GOP.
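
The same proportion can also be computed without printing a summarize table. The sketch below is one possible variation, not the lab's own method: it reads r(mean), r(sd), and r(N) from a single summarize call, uses count to tally the cases, and applies a display format (%4.1f, chosen here only for illustration) to print a percentage.

. quietly sum nominate_dim1 if party_code == 200
. local obs_gop = r(N)
. local threshold = r(mean) - 2 * r(sd)
. 
. * count stores the number of matching observations in r(N)
. quietly count if party_code == 200 & nominate_dim1 <= `threshold'
. 
. * the format %4.1f applies to the expression that follows it
. di "Moderate Republicans: " %4.1f (100 * r(N) / `obs_gop') " percent of the congressional GOP"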

table

Another option for creating tables of descriptive statistics is table, which offers the user the ability to choose which statistics she would like to include in the table, each calculated on whatever variables we choose. Let’s make a table of descriptive statistics with Lewis et al.’s first and second-dimension Nominate scores as well as the Nokken and Poole scores.

. * this command will tell Stata to calculate the range of statistics listed
. *   in the stat() options
. * each stat() option names one statistic, followed by the variables to apply it to
. 
. table (var), ///
>         stat(count nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(mean nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(median nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(sd nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(min nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(max nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2)  

--------------------------------------------------------------------------------------------------------------------------
                  |  Number of non-missing values       Mean   Median   Standard deviation   Minimum value   Maximum value
------------------+-------------------------------------------------------------------------------------------------------
nominate_dim1     |                           544   .0506949    -.152             .4545393            -.77            .916
nominate_dim2     |                           544   .0205331    .0165             .2997723           -.975            .879
nokken_poole_dim1 |                           543   .0450331    -.157             .4637939           -.808               1
nokken_poole_dim2 |                           543   .0405285     .053             .3298611           -.975            .879
--------------------------------------------------------------------------------------------------------------------------
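
Because the same four variable names repeat in every stat() option, the local macros introduced earlier offer one way to shorten the command. This is only a sketch: the macro name vars is an assumption made for this example, and in a do-file the local line and the table command must be run together, since local macros disappear when the do-file (or the selection being run) ends.

. local vars nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2
. 
. table (var), ///
>         stat(count `vars') stat(mean `vars') stat(median `vars') ///
>         stat(sd `vars') stat(min `vars') stat(max `vars')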

Now this is useful, but it would be even more useful if we could export this table to Word, Excel, or another word-processing, spreadsheet, or typesetting program for use in academic writing. We can use Stata’s new collect prefix and the collect export command, first to save the table results and then to export them.

. collect table (var), ///
>         stat(count nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(mean nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(median nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(sd nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(min nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2) ///
>         stat(max nominate_dim1 nominate_dim2 nokken_poole_dim1 nokken_poole_dim2)  

--------------------------------------------------------------------------------------------------------------------------
                  |  Number of non-missing values       Mean   Median   Standard deviation   Minimum value   Maximum value
------------------+-------------------------------------------------------------------------------------------------------
nominate_dim1     |                           544   .0506949    -.152             .4545393            -.77            .916
nominate_dim2     |                           544   .0205331    .0165             .2997723           -.975            .879
nokken_poole_dim1 |                           543   .0450331    -.157             .4637939           -.808               1
nokken_poole_dim2 |                           543   .0405285     .053             .3298611           -.975            .879
--------------------------------------------------------------------------------------------------------------------------

. 
. * the syntax below will export the collected table to the Microsoft Excel file 
. *   lab2_descriptive_statistics.xlsx. Many other file formats are possible. 
. *   Check out the help file for collect export to find out more. 
. 
. * the "replace" option will allow you to overwrite an existing file with the same name.
. collect export lab2_descriptive_statistics.xlsx, replace
(collection Table exported to file lab2_descriptive_statistics.xlsx)
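
As a further sketch, the collected table can be previewed, reformatted, and written to other file types before it leaves Stata. The three-decimal format and the .docx file name below are illustrative assumptions, and the collection is assumed to be the one created by the command above.

. * show the current layout of the collection
. collect preview
. 
. * display the mean and standard deviation with three decimal places
. collect style cell result[mean sd], nformat(%6.3f)
. 
. * export the same table as a Word document instead of an Excel file
. collect export lab2_descriptive_statistics.docx, replace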