POL 200 Lab 3: Cleaning and Coding Your Data

In this lab, we will learn basic data manipulation and cleaning tools in Stata. Before we can analyze the relationship between our independent and dependent variables, we need to make sure that each variable's coding matches the concept we are trying to measure and is appropriate for the tests we plan to run.

We will use public opinion data available from the Roper Center, a comprehensive repository of survey data. TCNJ has an institutional membership to Roper, so you can access and download the data. To get the dataset:

Access Roper using your TCNJ institutional credentials here.
Search for study #31117583, the July 2020 AP-NORC Poll, and select the “Studies/Datasets” tab.
Click the spreadsheet with the Stata icon to download the dataset.
Be sure to place 31117583.dta in a folder you can easily access.
Set your working directory to that folder using the cd command.

Now, let’s get started!

. * Change the file path below to the appropriate working directory for your machine
. 
. *cd "h:\POL200\labs"
. use 31117583.dta, clear

Creating and deleting variables

generate

The generate command allows you to create new variables by setting the new variable to a specified value or values. We can refer to existing variables or call up built-in Stata functions to do so.

.  * SYNTAX: gen some_new_name = some_value
. 
.  * first, let's create a numeric variable called "type"
.  *   this code will give every observation the value of 1
. generate type = 1

. codebook type

--------------------------------------------------------------------------------------------------------------------------------
type                                                                                                                 (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1,1]                         Units: 1
         Unique values: 1                         Missing .: 0/1,057

            Tabulation: Freq.  Value
                        1,057  1

.  * second, let's create a string variable named "category"
.  *   by enclosing the value in quotes
. generate category = "Group A"

. codebook category

--------------------------------------------------------------------------------------------------------------------------------
category                                                                                                             (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------

                  Type: String (str7)

         Unique values: 1                         Missing "": 0/1,057

            Tabulation: Freq.  Value
                        1,057  "Group A"

               Warning: Variable has embedded blanks.

.  * third, let's create a new variable using Stata's random number function
.  *   this command will set the new variable to a random number between
.  *   0 and 1
. 
.  * note: you can shorten generate to gen
. gen randnumb = runiform()

. sum randnumb

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    randnumb |      1,057    .5114684    .2900753   .0020405   .9994879

.  * fourth, let's create new variables computed from an existing variable
.  *   the first gen command takes the natural log of randnumb
.  *   the second divides randnumb by 1000
. 
. gen lnrandnumb = ln(randnumb)

. gen randnumb2 = randnumb/1000

. 
. table (var), ///
>     stat(count  randnumb randnumb2 lnrandnumb) ///
>         stat(mean randnumb randnumb2 lnrandnumb) ///
>         stat(range randnumb randnumb2 lnrandnumb) 

-----------------------------------------------------------------
           |  Number of non-missing values        Mean      Range
-----------+-----------------------------------------------------
randnumb   |                         1,057    .5114684   .9974474
randnumb2  |                         1,057    .0005115   .0009974
lnrandnumb |                         1,057   -.9450384   6.194039
-----------------------------------------------------------------
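
Because runiform() draws fresh values every time it runs, the numbers you see will not match the output above, and they will change from run to run. Below is a minimal sketch of one way to make the draws reproducible with set seed. It builds a toy dataset instead of using the AP-NORC file, and the seed value and the name randcheck are arbitrary choices for illustration. Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example (not the AP-NORC data): reproducible random numbers
clear
set obs 1057

* any fixed seed works; this value is arbitrary
set seed 20200701
gen randnumb = runiform()

* resetting the same seed reproduces the same sequence of draws
set seed 20200701
gen randcheck = runiform()

* assert stops with an error if the condition fails for any observation
assert randnumb == randcheck
summarize randnumb randcheck
```

Setting the seed once near the top of a do-file is usually enough to make every later random draw in that file repeatable.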

drop

Use drop to delete variables. But always be careful when doing so: if you drop a variable and then save over your dataset, that variable is gone for good. Again, always generate and drop variables in your script, not at the command line, so that every change is recorded and can be rerun. You can drop one or more variables at a time.

. * Let's drop our new, unnecessary variables:
. drop category type
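
If you want to see what a drop does before committing to it, one option is to wrap it in preserve and restore. The sketch below uses a small made-up dataset (the variable id is hypothetical) and is meant only as an illustration of the pattern. Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example: try out a drop safely with preserve and restore
clear
set obs 5
gen id = _n
gen type = 1
gen category = "Group A"

* take a temporary snapshot of the data in memory
preserve
drop type category
describe

* bring the snapshot back; type and category are available again
restore
confirm variable type category
```

Run the whole block at once from a do-file: when a do-file finishes without reaching restore, Stata restores the preserved data automatically.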

Changing variable values

recode

The easiest way to change an existing variable’s values is recode. The syntax asks the user to assign a new value to one or more old values. Always use the gen(newvar) option to create a new variable instead of replacing the existing one. That way you can always recover (and understand) the coding change.

.  * SYNTAX: recode oldvar (oldvalue = newvalue) (oldvalue=newvalue), gen(newvar)
.  
.  * Example: create an indicator variable for Latinx respondents by coding
.  *    Hispanic respondents (raceth = 3) as 1 and all other respondents as 0
.  
. codebook raceth

--------------------------------------------------------------------------------------------------------------------------------
raceth                                                                                                    RACETH: Race/ethnicity
--------------------------------------------------------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: RACETH

                 Range: [1,4]                         Units: 1
         Unique values: 4                         Missing .: 0/1,057

            Tabulation: Freq.   Numeric  Label
                          788         1  (1) White, non-Hispanic
                           50         2  (2) African American,
                                         non-Hispanic
                          135         3  (3) Hispanic
                           84         4  (4) Other

. recode raceth (1 2 4= 0)(3=1), gen(latinx)
(1057 differences between raceth and latinx)

. 
.  * alternatively, we could run:
.  * recode raceth (1=0)(2=0)(3=1)(4=0), gen(latinx)

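.  * We can also recode a range of values using / . When two ranges share
.  *   an endpoint (such as .25), the first matching rule applies.
.  *   The new variable has four categories, so we call it rand4cat:
. recode randnumb (0/.25=0)(.25/.5=1)(.5/.75=2)(.75/1=3), gen(rand4cat)
(1057 differences between randnumb and rand4cat)

. tabulate rand4cat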

  RECODE of |
   randnumb |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        267       25.26       25.26
          1 |        236       22.33       47.59
          2 |        284       26.87       74.46
          3 |        270       25.54      100.00
------------+-----------------------------------
      Total |      1,057      100.00
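
recode can also attach value labels as it creates the new variable, and a cross-tabulation of the old variable against the new one is a quick way to confirm that each code landed where you intended. The following is a sketch on five made-up respondents that reuses the raceth codes shown above; the name latinx2 and the label text are illustrative choices. Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example: five made-up respondents using the raceth codes shown above
clear
input byte raceth
1
2
3
4
3
end

* text after each new value is stored as a value label and attached to latinx2
recode raceth (3 = 1 "Latinx") (1 2 4 = 0 "Non-Latinx"), gen(latinx2)

* cross-tabulate old against new to check the recode
tabulate raceth latinx2
```

Every row of the cross-tabulation should have its observations in a single column; a row split across both columns would signal a mistake in the rules.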

replace

The replace command reassigns values of an existing variable. Let’s create an indicator variable denoting Black respondents by combining gen and replace, with an if qualifier to select specific observations.

. * first, create the new variable, setting all values to missing data (.)
. gen black = .
(1,057 missing values generated)

. 
. * now set our Black respondents to equal 1 on the new variable. We
. *   can select Black respondents from the existing raceth variable
. replace black = 1 if raceth==2
(50 real changes made)

. * now set the other respondents to 0. The "|" means "or"
. replace black = 0 if raceth ==1 | raceth==3 | raceth==4
(1,007 real changes made)

. * check the new variable:
. codebook black

--------------------------------------------------------------------------------------------------------------------------------
black                                                                                                                (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [0,1]                         Units: 1
         Unique values: 2                         Missing .: 0/1,057

            Tabulation: Freq.  Value
                        1,007  0
                           50  1
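
The same indicator can be built in one line, because Stata evaluates a logical expression such as (raceth == 2) to 1 when true and 0 when false. The sketch below compares the two approaches on made-up data that includes one missing value; black2 is a hypothetical name used only here. Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example: four raceth codes plus one missing value
clear
input byte raceth
1
2
3
4
.
end

* two-step approach from the lab
gen black = .
replace black = 1 if raceth == 2
replace black = 0 if raceth == 1 | raceth == 3 | raceth == 4

* one-line alternative; the if qualifier keeps missing raceth values missing
gen black2 = (raceth == 2) if !missing(raceth)

list raceth black black2
assert black == black2
```

Without the if !missing(raceth) qualifier, a respondent with missing raceth would be coded 0 rather than missing, which is rarely what you want.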

Labeling

label variable

Variable labels are useful for data transparency and as a guide for the future you who will analyze the data. Labels appear in the Variables window of the Stata GUI and in the output of the codebook and describe commands in the Results window.

. * SYNTAX: label variable varname "label"
. label var latinx "Latinx respondent indicator variable (Latinx = 1)"

. label var black "Black respondent indicator variable (Black = 1)"

. describe latinx black

Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
latinx          byte    %9.0g                 Latinx respondent indicator variable (Latinx = 1)
black           float   %9.0g                 Black respondent indicator variable (Black = 1)
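
Once variables are labeled, the labels can help you find and check them. As a small sketch on made-up data, lookfor searches variable names and labels for a keyword, and the extended macro function variable label copies a label into a local macro (lbl is a hypothetical macro name). Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example: one labeled variable
clear
set obs 3
gen latinx = 0
label variable latinx "Latinx respondent indicator variable (Latinx = 1)"

* search variable names and labels for a keyword
lookfor respondent

* copy the variable label into a local macro and print it
local lbl : variable label latinx
display "`lbl'"
```

Run the last two lines together from a do-file: a local macro defined in one do-file run is not available in the next.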

label define and label values

Labeling the values of a variable is a two-step process. The label define command creates a named set of value labels that is saved with the dataset and can be applied to any variable. The label values command then attaches that set to a specific variable.

. * SYNTAX: label define labelname value "label" value "label"
. * let's create labels called blacklbl and latinxlbl:
. label define blacklbl 1 "Black Respondent" 0 "Non-Black Respondent"

. label define latinxlbl 1 "Latinx Respondent" 0 "Non-Latinx Respondent"

. * SYNTAX: label values varname labelname
. * Now attach the labels to the variables:
. label values black blacklbl

. label values latinx latinxlbl 

. 
. * let's check whether the labeling worked. The value label names should
. *   appear in the value label column of the describe output below
. describe latinx black

Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
latinx          byte    %21.0g     latinxlbl
                                              Latinx respondent indicator variable (Latinx = 1)
black           float   %20.0g     blacklbl   Black respondent indicator variable (Black = 1)
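
describe confirms that a label is attached, but not what it says. A short sketch of how you might inspect the label text itself, using three made-up observations: label list prints the value-to-text mapping, and tabulate with and without the nolabel option shows the labeled and the underlying numeric values. Run it in a separate Stata session, because clear discards the data in memory.

```stata
* Toy example: three made-up observations of the black indicator
clear
input byte black
0
1
1
end

label define blacklbl 1 "Black Respondent" 0 "Non-Black Respondent"
label values black blacklbl

* show the value-to-text mapping stored in the label
label list blacklbl

* tabulate prints the labels; nolabel prints the underlying numbers
tabulate black
tabulate black, nolabel
```

The labels change only how values are displayed. The variable still stores 0 and 1, so conditions such as if black == 1 continue to work.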