Data Management II

In this lab, we will learn how to import, merge, and append datasets in Stata, using datasets from the U.S. Census Bureau. Datasets are from https://data.census.gov/. Download tables “S0802” and “S1903” (American Community Survey, 2018, 5-year estimates, All counties within the U.S.). Before processing by Stata, please save the .csv files as “S0802.csv” and “S1903.csv”, respectively. Please manually delete the second header row in each table before saving. The final dataset is gastax_API.dta, created from the American Petroleum Institute’s annual fuel tax report. Please download from Canvas and place in your working directory.

Importing data

There is more than one way to import data in Stata. A suite of commands follow the same general syntax. We will be using import delimited, but there is also import sas, import spss, and import excel commands for importing datasets created using those proprietary programs. import delimited is appropriate for .csv files or other “delimited” datasets which have a specific character reserved for separating columns. You will occasionally find .txt files with other delimiters, like ;, spaces, or |. But commas (csv) are most common.

import delimited

. * Syntax: import delimited [vars] using filename.csv, varnames(1) clear  
. *     the [vars] can be replaced with a list of variables
. *     you want to include. Or leave out if you want all vars.
. *     the varnames(1) option tells Stata that the first row contains
. *     variable names, not data.
. 
. * Set your working directory using the cd command 
. 
. import delimited using S0802.csv, varnames(1) clear
(encoding automatically selected: ISO-8859-1)
(810 vars, 3,220 obs)

We will need a number of variables, including the geographic identifiers starting with “geo.” The rest can be dropped. It is much better to drop data using your do file than by deleting the variables directly from the .csv file. That way you can always change the code to add in new variables when you rerun the file as needed.

. * the keep command will keep the variables you list and drop the rest
. keep geo_id *c01_090e *c02_001e *c01_001e

. 
. * now let's rename these variables
. rename s0802_c01_001e totalworkers

. rename s0802_c02_001e drovealone

. rename s0802_c01_090e mediantravel

. 
. * save the dataset for use later
. save S0802.dta, replace
file S0802.dta saved

Let’s repeat these steps with the next dataset, this time keeping median family income, variable s1903_c03_015e.

. import delimited using S1903.csv, varnames(1) clear
(encoding automatically selected: ISO-8859-1)
(242 vars, 3,220 obs)

. 
. keep geo_id *c03_015e

. 
. * now let's rename the variables
. rename s1903_c03_015e medianfamilyinc

Merging

You can merge datasets together using the merge command. Another option is joinby, but in my experience merge is a safer option for newer users. The key to successful merges is to understand your unit of analysis in the datasets you’re combining. Stata expects there to be one variable with identical names in both datasets, which shows how observations in one dataset are related to observations in the other. In the examples below, we will use the geo_id variable to first combine the ACS county-level datasets, and then to merge in state-level gas tax data into the county dataset.

merge

The merge command can handle a number of different types of merges, but before running, you must feel understand your data. The dataset you have opened in Stata is considered the master dataset. The file saved on your computer is the using dataset. The other distinction is between one-to-one, one-to-many, many-to-one, or (less likely) many-to-many merges. The master dataset is listed first, with the using dataset refered to second. So a one-to-many merge takes one observation in the master dataset and joins it to (potentially) multiple observations in the using dataset.

. * Syntax: merge 1:1 varname using using_dataset.dta, gen(newvar)
. *    1:1 can be replaced with 1:m, m:1, or m:m, depending on 
. *    the type of merge you want to run. 
. 
. *    varname should be replaced with the identifier that 
. *    shows how obs in one dataset match those in the other.
. *    this variable must appear in both datasets. 
. 
. *    The filename in the command should refer to the 
. *    using dataset, not the master. 
. 
. *    The gen() option will allow Stata to save a new variable
. *    with a numeric code telling you for which observations
. *    the merge worked. This variable cannot exist in your dataset 
. *    before running the command. 
. 
. merge 1:1 geo_id using S0802.dta, gen(mg_county)

Result                      Number of obs
-----------------------------------------
Not matched                             0
Matched                             3,220  (mg_county==3)
-----------------------------------------

Let’s take a look at our data now:

. list in 1/10, compress clean 

geo_id   med~c   tot~s   dro~e   med~l     mg_county  
1.   0500000US01001   73530   24428   21136    25.8   Matched (3)  
2.   0500000US01003   71951   91420   77051    27.4   Matched (3)  
3.   0500000US01005   44339    8538    7118       N   Matched (3)  
4.   0500000US01007   54840    7946    6747       N   Matched (3)  
5.   0500000US01009   59882   21148   18225       N   Matched (3)  
6.   0500000US01011   42550    3940    3115    28.6   Matched (3)  
7.   0500000US01013   46312    7769    6513      24   Matched (3)  
8.   0500000US01015   56402   47147   39985       N   Matched (3)  
9.   0500000US01017   51386   14396   12273    23.6   Matched (3)  
10.   0500000US01019   52375    9671    8038       N   Matched (3)

Uh oh! The dataset has been imported as string varibles, not numeric variables. Strings are fine for the geo_id, which is supposed to contain non-numeric characters, but not for our other variables. We can see that they are strings using describe:

. describe

Contains data
Observations:         3,220                  
Variables:             6                  
--------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
geo_id          str14   %14s                  GEO_ID
medianfamilyinc str6    %9s                   S1903_C03_015E
totalworkers    str7    %9s                   S0802_C01_001E
drovealone      str7    %9s                   S0802_C02_001E
mediantravel    str4    %9s                   S0802_C01_090E
mg_county       byte    %23.0g     _merge     Matching result from merge
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: geo_id
Note: Dataset has changed since last saved.

Before we convert the strings to numeric variables, let’s learn about loops.

Loops

loops are ways of automating the running of commands over variables, observations, values, datasets, files, etc. The most common loops in Stata are foreach and forvalues loops. The former can loop over numbers or text; the latter is expecting a range of integers (like: forvalues x=1/10 { ). You can check out the help file by typing help forvalues in the command line.

foreach

The most flexible foreach syntax is: foreach local_macro_name in some list {. On the next line, type your command, with whatever object is being looped replaced with the local macro you specified above, enclosed in single quotes. You can loop over as many commands as you want. When you’re done, end the loop with }.

Let’s loop over our variable names, running the destring command to convert the string variables to numeric variables.

. * the destring command will convert to numeric, ignoring the characters 
. *     listed in the ignore() options
. 
. foreach vars in medianfamilyinc totalworkers drovealone mediantravel {
2.     destring `vars', ignore("-" "*" "null" "N") replace 
3. }
medianfamilyinc: characters n u l removed; replaced as long
(1 missing value generated)
totalworkers: characters n u l removed; replaced as long
(1 missing value generated)
drovealone: characters n u l removed; replaced as long
(1 missing value generated)
mediantravel: characters n u l N removed; replaced as double
(1737 missing values generated)

. 
. describe

Contains data
Observations:         3,220                  
Variables:             6                  
--------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
geo_id          str14   %14s                  GEO_ID
medianfamilyinc long    %10.0g                S1903_C03_015E
totalworkers    long    %10.0g                S0802_C01_001E
drovealone      long    %10.0g                S0802_C02_001E
mediantravel    double  %10.0g                S0802_C01_090E
mg_county       byte    %23.0g     _merge     Matching result from merge
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: geo_id
Note: Dataset has changed since last saved.

String Functions

To conduct our last merge, we need to unpack some information stored in the geo_id string variable. gastax_API.dta has several state identifiers, including the state FIPS code ( fips variable) that we could use for the merge. Let’s take a look at our options from our merged county dataset:

. list geo_id in 1/5, compress clean

geo_id  
1.   0500000US01001  
2.   0500000US01003  
3.   0500000US01005  
4.   0500000US01007  
5.   0500000US01009

The two digits after “US” contain the state FIPS code, with the final three digits holding the county FIPS code. Together, the last five characters of geo_id unique identify U.S. counties. Lots of datasets will use FIPS codes to refer to various geographic units in the U.S.

The substr() function is very useful to grab parts of a string variable and save it as a new variable:

. * the -5, 2 in the following command tells Stata to start in the 5th character 
. *    from the end and then go two characters to the right. This info
. *    gets saved in a new variable called "fips"
. gen fips = substr(geo_id,-5, 2)

. 
. * now let's convert to numeric: 
. destring fips, replace 
fips: all characters numeric; replaced as byte

Great! Now we can merge with the state-level gas tax dataset using merge.

. * This time we use m:1, because we have many counties in the master dataset
. *    which will match with a single observation in the using dataset (state).
. 
. merge m:1 fips using gastax_API.dta, gen(mg_gas)

Result                      Number of obs
-----------------------------------------
Not matched                            79
from master                        79  (mg_gas==1)
from using                          0  (mg_gas==2)

Matched                             3,141  (mg_gas==3)
-----------------------------------------

Stata’s output tells us that each of the 50 states include in the gastax_API.dta dataset were successfully merged, but that 79 observations in the county dataset did not have a match in the gas tax dataset. My guess is those observations are from DC and Puerto Rico. Let’s check:

. tab fips if mg_gas==1

fips |      Freq.     Percent        Cum.
------------+-----------------------------------
11 |          1        1.27        1.27
72 |         78       98.73      100.00
------------+-----------------------------------
Total |         79      100.00

Yup - 11 is the code for DC and 72 is the code for PR. We are safe to move forward with our analysis. Let’s save our data for easy access in the future:

. save transportation.dta, replace
file transportation.dta saved