POL 200 Lab 3: Cleaning and Coding Your Data
In this lab, we will learn basic data manipulation and cleaning tools in Stata. Before we can analyze the relationship between our independent and dependent variables, we need to make sure each variable's coding matches the concept we are trying to measure and is appropriate for the tests we plan to run.
We will use public opinion data available from the Roper Center, a comprehensive repository of survey data. TCNJ has an institutional membership to Roper, so you can access and download the data. To access:
- Access Roper using your TCNJ institutional credentials here.
- Search for study #31117583, the July 2020 AP-NORC Poll, and select the “Studies/Datasets” tab.
- Click on the spreadsheet with the Stata icon to download the dataset.
- Be sure to place 31117583.dta in a folder you can easily access.
- Set your working directory to that folder using the cd command.
Now, let’s get started!
. * Change the file path below to the appropriate working directory for your machine
.
. *cd "h:\POL200\labs"
. use 31117583.dta, clear
Creating and deleting variables
generate
The generate command allows you to create new variables by setting the new variable to a specified value or values. We can refer to existing variables or call up built-in Stata functions to do so.
. * SYNTAX: gen some_new_name = some_value
.
. * first, let's create a numeric variable called "type"
. * this code will give every observation the value of 1
. generate type = 1
. codebook type
--------------------------------------------------------------------------------------------------------------------------------
type (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (float)
Range: [1,1] Units: 1
Unique values: 1 Missing .: 0/1,057
Tabulation: Freq. Value
1,057 1
. * second, let's create a string variable named "category"
. * by enclosing the value in quotes
. generate category = "Group A"
. codebook category
--------------------------------------------------------------------------------------------------------------------------------
category (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------
Type: String (str7)
Unique values: 1 Missing "": 0/1,057
Tabulation: Freq. Value
1,057 "Group A"
Warning: Variable has embedded blanks.
. * third, let's create a new variable using Stata's random number function
. * this command will set the new variable to a random number between
. * 0 and 1
.
. * note: you can shorten generate to gen
. gen randnumb = runiform()
. sum randnumb
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
randnumb | 1,057 .5114684 .2900753 .0020405 .9994879
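Because runiform() draws new random numbers every time it runs, your values will not match the ones printed above. If you want the same draws each time you run your script, you can set the random-number seed first. The following is a minimal sketch on a toy dataset, not the lab data; the seed value and the variable names draw1 and draw2 are arbitrary choices made for illustration.

```stata
* Toy dataset with 5 empty observations (not the lab data)
clear
set obs 5

* set the seed, then draw random numbers
set seed 20200701
generate draw1 = runiform()

* resetting the same seed reproduces the same sequence of draws
set seed 20200701
generate draw2 = runiform()

* the two columns should be identical
list draw1 draw2
```

Placing set seed near the top of a script is a common way to make results that depend on random numbers reproducible.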
. * fourth, let's create a new variable that calls data from an existing variable
. * in the first gen command, the new variable is the natural log of randnumb
. * in the second, the new variable is randnumb divided by 1000
.
. gen lnrandnumb = ln(randnumb)
. gen randnumb2 = randnumb/1000
.
. table (var), ///
> stat(count randnumb randnumb2 lnrandnumb) ///
> stat(mean randnumb randnumb2 lnrandnumb) ///
> stat(range randnumb randnumb2 lnrandnumb)
-----------------------------------------------------------------
| Number of non-missing values Mean Range
-----------+-----------------------------------------------------
randnumb | 1,057 .5114684 .9974474
randnumb2 | 1,057 .0005115 .0009974
lnrandnumb | 1,057 -.9450384 6.194039
-----------------------------------------------------------------
drop
Use drop to delete variables. Always be careful when doing so: if you drop a variable and then save your dataset over the original file, that variable is lost for good. Again, always generate and drop variables in your script, not at the command line, so that every change is documented and can be rerun. You can drop one or more variables at a time.
. * Let's drop our new, unnecessary variables:
. drop category type
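One way to guard against losing variables permanently is to save the cleaned data under a new file name and leave the raw file untouched. The sketch below uses Stata's built-in auto dataset rather than the lab data, and the file name auto_clean.dta is a hypothetical name chosen for illustration.

```stata
* Toy example using Stata's built-in auto dataset (not the lab data)
sysuse auto, clear

* create two throwaway variables, then drop them
generate type = 1
generate category = "Group A"
drop category type

* check that a dropped variable is really gone:
* confirm fails for a missing variable, and capture stores the
* resulting nonzero return code in _rc instead of stopping the script
capture confirm variable type
display "return code: " _rc

* save under a NEW (hypothetical) file name so the original file is not overwritten
save "auto_clean.dta", replace
```

If a drop turns out to be a mistake, you can rerun the script from the untouched raw file.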
Changing variable values
recode
The easiest way to change an existing variable’s values is recode. The syntax asks the user to assign a new value to one or more old values. Always use the gen(newvar) option to create a new variable instead of replacing the existing one. That way you can always recover (and understand) the coding change.
. * SYNTAX: recode oldvar (oldvalue = newvalue) (oldvalue=newvalue), gen(newvar)
.
. * Example: create an indicator variable for Latinx respondents by setting
. * all those identifying as Latinx as 1 and all other respondents as 0
.
. codebook raceth
--------------------------------------------------------------------------------------------------------------------------------
raceth RACETH: Race/ethnicity
--------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: RACETH
Range: [1,4] Units: 1
Unique values: 4 Missing .: 0/1,057
Tabulation: Freq. Numeric Label
788 1 (1) White, non-Hispanic
50 2 (2) African American,
non-Hispanic
135 3 (3) Hispanic
84 4 (4) Other
. recode raceth (1 2 4= 0)(3=1), gen(latinx)
(1057 differences between raceth and latinx)
.
. * alternatively, we could run:
. * recode raceth (1=0)(2=0)(3=1)(4=0), gen(latinx)
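After any recode, it is worth checking the new variable against the old one. A minimal sketch of that check, using a small made-up dataset in place of the lab data (the five raceth values typed in below are invented for illustration):

```stata
* Toy data: five made-up respondents (not the lab data)
clear
input byte raceth
1
2
3
4
3
end

recode raceth (1 2 4 = 0) (3 = 1), gen(latinx)

* cross-tabulate old against new: every raceth==3 row should fall
* in the latinx==1 column, and every other row in the latinx==0 column
tabulate raceth latinx

* assert stops the script with an error if the coding is not what we expect
assert latinx == (raceth == 3)
```

The cross-tab shows the mapping at a glance, while assert lets the script check it for you each time it runs.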
. * We can also recode a range of values using / :
. recode randnumb (0/.25=0)(.25/.5=1)(.5/.75=2)(.75/1=3), gen(rand4cat)
(1057 differences between randnumb and rand4cat)
. tabulate rand4cat
RECODE of |
randnumb | Freq. Percent Cum.
------------+-----------------------------------
0 | 267 25.26 25.26
1 | 236 22.33 47.59
2 | 284 26.87 74.46
3 | 270 25.54 100.00
------------+-----------------------------------
Total | 1,057 100.00
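Notice that the adjacent ranges in the recode command above share their endpoints (.25 appears in both the first and second rules). As far as we understand Stata's documentation, recode applies its rules from left to right, so a value sitting exactly on a boundary would go to the first rule that matches it. With random draws that is very unlikely to matter, but when boundaries do matter you may prefer to spell them out yourself. The sketch below does so with gen and replace on made-up values; the variable names u and quartile are chosen for illustration.

```stata
* Toy data: hand-picked values, including two boundary values and a missing value
clear
input float u
.1
.25
.6
.75
.
end

* start with all missing, then fill in each category with explicit boundaries
generate byte quartile = .
replace quartile = 0 if u < .25
replace quartile = 1 if u >= .25 & u < .5
replace quartile = 2 if u >= .5 & u < .75
* Stata treats missing (.) as larger than any number, so exclude it explicitly
replace quartile = 3 if u >= .75 & !missing(u)

list u quartile
```

Here each boundary value is assigned to the higher category, and the observation with a missing u stays missing on quartile.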
replace
The replace command reassigns values of an existing variable. Let’s create an indicator variable for Black respondents by combining gen and replace, with if qualifiers to select specific observations.
. * first, create the new variable, setting all values to missing data (.)
. gen black = .
(1,057 missing values generated)
.
. * now set our Black respondents to equal 1 on the new variable. We
. * can select Black respondents from the existing raceth variable
. replace black = 1 if raceth==2
(50 real changes made)
. * now set the other respondents to 0. The "|" means "or"
. replace black = 0 if raceth ==1 | raceth==3 | raceth==4
(1,007 real changes made)
. * check the new variable:
. codebook black
--------------------------------------------------------------------------------------------------------------------------------
black (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (float)
Range: [0,1] Units: 1
Unique values: 2 Missing .: 0/1,057
Tabulation: Freq. Value
1,007 0
50 1
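The gen and replace steps above can also be collapsed into a single line, because a logical expression such as raceth==2 evaluates to 1 when true and 0 when false. A minimal sketch on made-up data follows; the variable name black_alt is hypothetical, and the if !missing() qualifier is there because a bare logical expression would otherwise code respondents with a missing raceth as 0.

```stata
* Toy data: four made-up respondents plus one with missing race/ethnicity
clear
input byte raceth
1
2
3
4
.
end

* (raceth == 2) is 1 when true and 0 when false;
* the if qualifier leaves black_alt missing when raceth is missing
generate byte black_alt = (raceth == 2) if !missing(raceth)

* check the result, showing missing values in the table
tabulate raceth black_alt, missing
```

In the lab data raceth has no missing values, so both approaches give the same result there.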
Labeling
label variable
Variable labels are useful for data transparency and for guiding your future self as you analyze the data. Labels appear in the Variables window of the Stata GUI and in the Results window output of the codebook and describe commands.
. * SYNTAX: label variable varname "label"
. label var latinx "Latinx respondent indicator variable (Latinx = 1)"
. label var black "Black respondent indicator variable (Black = 1)"
. describe latinx black
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------------------------------------------------------
latinx byte %9.0g Latinx respondent indicator variable (Latinx = 1)
black float %9.0g Black respondent indicator variable (Black = 1)
label define and label values
Labeling values of a variable is a two-step process. The label define command creates a value label that is saved with the dataset and can then be applied to any variable. The label values command attaches that label to a variable.
. * SYNTAX: label define labelname value "label" value "label"
. * let's create labels called blacklbl and latinxlbl:
. label define blacklbl 1 "Black Respondent" 0 "Non-Black Respondent"
. label define latinxlbl 1 "Latinx Respondent" 0 "Non-Latinx Respondent"
. * SYNTAX: label values varname labelname
. * Now attach the labels to the variables:
. label values black blacklbl
. label values latinx latinxlbl
.
. * let's check whether the labeling worked. the value label names should
. * show up in the value label column from the describe command below
. describe latinx black
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------------------------------------------------------
latinx byte %21.0g latinxlbl
Latinx respondent indicator variable (Latinx = 1)
black float %20.0g blacklbl Black respondent indicator variable (Black = 1)
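Besides describe, two other quick checks are label list, which prints the contents of a value label, and tabulate, which displays the labels in place of the numeric codes. The sketch below runs the full labeling sequence on a small made-up dataset (the four values of black are invented for illustration) and then inspects the result.

```stata
* Toy data: four made-up respondents (not the lab data)
clear
input byte black
0
1
1
0
end

* step 0: variable label
label variable black "Black respondent indicator variable (Black = 1)"

* step 1: define the value label; step 2: attach it to the variable
label define blacklbl 0 "Non-Black Respondent" 1 "Black Respondent"
label values black blacklbl

* inspect the value label itself
label list blacklbl

* tabulate shows the labels; adding nolabel shows the underlying numbers
tabulate black
tabulate black, nolabel
```

The underlying data are still 0 and 1; the labels only change how the values are displayed.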