Data Management II
In this lab, we will learn how to import, merge, and append datasets in Stata, using datasets from the U.S. Census Bureau. Datasets are from https://data.census.gov/. Download tables “S0802” and “S1903” (American Community Survey, 2018, 5-year estimates, All counties within the U.S.). Before processing by Stata, please save the .csv files as “S0802.csv” and “S1903.csv”, respectively. Please manually delete the second header row in each table before saving. The final dataset is gastax_API.dta, created from the American Petroleum Institute’s annual fuel tax report. Please download from Canvas and place in your working directory.
Importing data
There is more than one way to import data in Stata. A suite of commands follow the same general syntax. We will be using import delimited, but there is also import sas, import spss, and import excel commands for importing datasets created using those proprietary programs. import delimited is appropriate for .csv files or other “delimited” datasets which have a specific character reserved for separating columns. You will occasionally find .txt files with other delimiters, like ;, spaces, or |. But commas (csv) are most common.
import delimited
. * Syntax: import delimited [vars] using filename.csv, varnames(1) clear
. * the [vars] can be replaced with a list of variables
. * you want to include. Or leave out if you want all vars.
. * the varnames(1) option tells Stata that the first row contains
. * variable names, not data.
.
. * Set your working directory using the cd command
.
. import delimited using S0802.csv, varnames(1) clear
(encoding automatically selected: ISO-8859-1)
(810 vars, 3,220 obs)
We will need a number of variables, including the geographic identifiers starting with “geo.” The rest can be dropped. It is much better to drop data using your do file than by deleting the variables directly from the .csv file. That way you can always change the code to add in new variables when you rerun the file as needed.
. * the keep command will keep the variables you list and drop the rest
. keep geo_id *c01_090e *c02_001e *c01_001e
.
. * now let's rename these variables
. rename s0802_c01_001e totalworkers
. rename s0802_c02_001e drovealone
. rename s0802_c01_090e mediantravel
.
. * save the dataset for use later
. save S0802.dta, replace
file S0802.dta saved
Let’s repeat these steps with the next dataset, this time keeping median family income, variable s1903_c03_015e.
. import delimited using S1903.csv, varnames(1) clear
(encoding automatically selected: ISO-8859-1)
(242 vars, 3,220 obs)
.
. keep geo_id *c03_015e
.
. * now let's rename the variables
. rename s1903_c03_015e medianfamilyinc
Merging
You can merge datasets together using the merge command. Another option is joinby, but in my experience merge is a safer option for newer users. The key to successful merges is to understand your unit of analysis in the datasets you’re combining. Stata expects there to be one variable with identical names in both datasets, which shows how observations in one dataset are related to observations in the other. In the examples below, we will use the geo_id variable to first combine the ACS county-level datasets, and then to merge in state-level gas tax data into the county dataset.
merge
The merge command can handle a number of different types of merges, but before running, you must feel understand your data. The dataset you have opened in Stata is considered the master dataset. The file saved on your computer is the using dataset. The other distinction is between one-to-one, one-to-many, many-to-one, or (less likely) many-to-many merges. The master dataset is listed first, with the using dataset refered to second. So a one-to-many merge takes one observation in the master dataset and joins it to (potentially) multiple observations in the using dataset.
. * Syntax: merge 1:1 varname using using_dataset.dta, gen(newvar)
. * 1:1 can be replaced with 1:m, m:1, or m:m, depending on
. * the type of merge you want to run.
.
. * varname should be replaced with the identifier that
. * shows how obs in one dataset match those in the other.
. * this variable must appear in both datasets.
.
. * The filename in the command should refer to the
. * using dataset, not the master.
.
. * The gen() option will allow Stata to save a new variable
. * with a numeric code telling you for which observations
. * the merge worked. This variable cannot exist in your dataset
. * before running the command.
.
. merge 1:1 geo_id using S0802.dta, gen(mg_county)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 3,220 (mg_county==3)
-----------------------------------------
Let’s take a look at our data now:
. list in 1/10, compress clean
geo_id med~c tot~s dro~e med~l mg_county
1. 0500000US01001 73530 24428 21136 25.8 Matched (3)
2. 0500000US01003 71951 91420 77051 27.4 Matched (3)
3. 0500000US01005 44339 8538 7118 N Matched (3)
4. 0500000US01007 54840 7946 6747 N Matched (3)
5. 0500000US01009 59882 21148 18225 N Matched (3)
6. 0500000US01011 42550 3940 3115 28.6 Matched (3)
7. 0500000US01013 46312 7769 6513 24 Matched (3)
8. 0500000US01015 56402 47147 39985 N Matched (3)
9. 0500000US01017 51386 14396 12273 23.6 Matched (3)
10. 0500000US01019 52375 9671 8038 N Matched (3)
Uh oh! The dataset has been imported as string varibles, not numeric variables. Strings are fine for the geo_id, which is supposed to contain non-numeric characters, but not for our other variables. We can see that they are strings using describe:
. describe
Contains data
Observations: 3,220
Variables: 6
--------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------------------------------------------------------
geo_id str14 %14s GEO_ID
medianfamilyinc str6 %9s S1903_C03_015E
totalworkers str7 %9s S0802_C01_001E
drovealone str7 %9s S0802_C02_001E
mediantravel str4 %9s S0802_C01_090E
mg_county byte %23.0g _merge Matching result from merge
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: geo_id
Note: Dataset has changed since last saved.
Before we convert the strings to numeric variables, let’s learn about loops.
Loops
loops are ways of automating the running of commands over variables, observations, values, datasets, files, etc. The most common loops in Stata are foreach and forvalues loops. The former can loop over numbers or text; the latter is expecting a range of integers (like: forvalues x=1/10 { ). You can check out the help file by typing help forvalues in the command line.
foreach
The most flexible foreach syntax is: foreach local_macro_name in some list {. On the next line, type your command, with whatever object is being looped replaced with the local macro you specified above, enclosed in single quotes. You can loop over as many commands as you want. When you’re done, end the loop with }.
Let’s loop over our variable names, running the destring command to convert the string variables to numeric variables.
. * the destring command will convert to numeric, ignoring the characters
. * listed in the ignore() options
.
. foreach vars in medianfamilyinc totalworkers drovealone mediantravel {
2. destring `vars', ignore("-" "*" "null" "N") replace
3. }
medianfamilyinc: characters n u l removed; replaced as long
(1 missing value generated)
totalworkers: characters n u l removed; replaced as long
(1 missing value generated)
drovealone: characters n u l removed; replaced as long
(1 missing value generated)
mediantravel: characters n u l N removed; replaced as double
(1737 missing values generated)
.
. describe
Contains data
Observations: 3,220
Variables: 6
--------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------------------------------------------------------
geo_id str14 %14s GEO_ID
medianfamilyinc long %10.0g S1903_C03_015E
totalworkers long %10.0g S0802_C01_001E
drovealone long %10.0g S0802_C02_001E
mediantravel double %10.0g S0802_C01_090E
mg_county byte %23.0g _merge Matching result from merge
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: geo_id
Note: Dataset has changed since last saved.
String Functions
To conduct our last merge, we need to unpack some information stored in the geo_id string variable. gastax_API.dta has several state identifiers, including the state FIPS code ( fips variable) that we could use for the merge. Let’s take a look at our options from our merged county dataset:
. list geo_id in 1/5, compress clean
geo_id
1. 0500000US01001
2. 0500000US01003
3. 0500000US01005
4. 0500000US01007
5. 0500000US01009
The two digits after “US” contain the state FIPS code, with the final three digits holding the county FIPS code. Together, the last five characters of geo_id unique identify U.S. counties. Lots of datasets will use FIPS codes to refer to various geographic units in the U.S.
The substr() function is very useful to grab parts of a string variable and save it as a new variable:
. * the -5, 2 in the following command tells Stata to start in the 5th character
. * from the end and then go two characters to the right. This info
. * gets saved in a new variable called "fips"
. gen fips = substr(geo_id,-5, 2)
.
. * now let's convert to numeric:
. destring fips, replace
fips: all characters numeric; replaced as byte
Great! Now we can merge with the state-level gas tax dataset using merge.
. * This time we use m:1, because we have many counties in the master dataset
. * which will match with a single observation in the using dataset (state).
.
. merge m:1 fips using gastax_API.dta, gen(mg_gas)
Result Number of obs
-----------------------------------------
Not matched 79
from master 79 (mg_gas==1)
from using 0 (mg_gas==2)
Matched 3,141 (mg_gas==3)
-----------------------------------------
Stata’s output tells us that each of the 50 states include in the gastax_API.dta dataset were successfully merged, but that 79 observations in the county dataset did not have a match in the gas tax dataset. My guess is those observations are from DC and Puerto Rico. Let’s check:
. tab fips if mg_gas==1
fips | Freq. Percent Cum.
------------+-----------------------------------
11 | 1 1.27 1.27
72 | 78 98.73 100.00
------------+-----------------------------------
Total | 79 100.00
Yup - 11 is the code for DC and 72 is the code for PR. We are safe to move forward with our analysis. Let’s save our data for easy access in the future:
. save transportation.dta, replace
file transportation.dta saved