Data Manipulation

22 minute read

This section introduces basic commands for manipulating data in Stata. We will work with one dataset at a time. In Section 6 we will cover how to work with multiple datasets using “frames”.

3.1 Importing Data

Stata holds data as .dta files. Importing .dta files is simple: the command is use filename.dta if the file is in your working directory. You will have to include the folder path wrapped in quotation marks if you have not set the working directory use “pathtofolder\filename.dta”. Alternatively, go through “File -> Open”.

Stata can also read data files saved in other formats, including .xls and .csv files. The easiest way to do this is going through “File -> Import” then select the type of file you want to import.

Before loading a new dataset, you must clear the data Stata is currently holding in its memory using the clear command (or add the , clear option after the use command).

3.2 Viewing Data in Stata

Stata comes with a few sample data files. One of these has information on life expectancy, gross national product (GNP) per capita, population growth, and the percentage of the population with access to safe water for 68 countries in 1998.

Practical Exercise: Load life expectancy data.

To load the life expectancy data file, type sysuse lifeexp.dta.

. sysuse lifeexp, clear 
(Life expectancy, 1998)

Stata holds the data in memory like an excel file. But unlike excel, the main Stata interface does not show us our rows and columns of data. To see the actual data, select “Data - > Data Editor -> Data Browse” from the menus or click the data browse icon in the toolbar. Note there is also a “Data Edit” option which allows you to modify datapoints. We never use this option in data analysis, as we want to ensure that all changes made to our data are reproducible.

When you browse the life expectancy data file, you will see 6 columns and 68 rows. The columns in our data are called variables – values that measure the same underlying attribute (like life expectancy) across observed units. The rows are observations – an observation contains all values measured on the same unit (in our case countries) across attributes.

Another way to view data in Stata is using the list command. This will print the entire dataset. You can also list only specific variables (e.g., list country) or list specific observations (e.g., for the first 5 observations list in 1/5). We can also list observations based on their value of a variable. For example, to list all observations with a GNP per capita of less than 500 use the command list if gnppc < 500. This example illustrates a powerful feature of Stata: the action of any command can be restricted to a subset of the data using the conditional if.

Practical Exercise: Describe life expectancy data.

To see how many variables and observations are in a data file we do not need to look directly at the data. We can type describe in the command window.

. describe 
 
Contains data from \\rlab-app\StataSE$\ado\base/l/lifeexp.dta 
Observations:            68                  Life expectancy, 1998 
    Variables:             6                  26 Mar 2020 09:40 
                                              (_dta has notes) 
────────────────────────────────────────────────────────────────────────────────────────
Variable      Storage   Display    Value 
    name         type    format    label      Variable label 
────────────────────────────────────────────────────────────────────────────────────────
region          byte    %16.0g     region     Region 
country         str28   %28s                  Country 
popgrowth       float   %9.0g               * Avg. annual % growth 
lexp            byte    %9.0g               * Life expectancy at birth 
gnppc           float   %9.0g               * GNP per capita 
safewater       byte    %9.0g               * Safe water 
                                            * indicated variables have notes 
────────────────────────────────────────────────────────────────────────────────────────

This produces a description of the dataset in memory, listing the variable names and their labels. The dataset has notes that you can see by typing notes. Four of the variables have annotations that you can see by typing notes varname. You can also see some of this information at a glance in the Variables window.

Variable names can have up to 32 characters, are case sensitive, and cannot include spaces or start with a number. To change the name of a variable, use the rename command followed by the old variable name and new variable name. label variable allows you to change or include a brief description of the variable.

Practical Exercise: Renaming variables and changing the variable label.

Change the name and label of the safewater variable, adding more information which is found in the notes.

. rename safewater access_to_safe_water
. label var access_to_safe_water "Access to safe water, % of population"

3.3 Types of Variables

There are essentially two types of variables in Stata, words (referred to as strings) and numbers. Each is handled slightly differently when you are manipulating your data. Within numerical variables, there are two further types: continuous data, such as income or height, and categorical data, such as level of education or gender. In the second case a value will take on a particular meaning. For example, 1=male, 2=female, and 3 = non-binary.

In the life expectancy data, variables popgrowth, lexp, gnppc, and safewater are clearly numerical variables (these appear in black in the data browser). However, region and country are different. country is a string variable, with the entries appearing as words in the dataset (strings appear in red).

On the other hand, for region the entry for observation 1 appears as “Europe & C. Asia” in the data browser, but “1” when you click on it. This means the data is in numerical form (in this case 1-3) but has been labelled so that each number refers to a different region.

You can create or edit variable labels in Stata using label define. Stata has a two-step approach to defining labels. First you define a named label set which associates integer codes with labels of up to 80 characters, using the label define command. Then you associate the set of labels with a variable, using the label values command.

Practical Exercise: Changing value labels.

Replace the current labels in our data with shorthand labels “EU & C.A”, “NA”, “SA”.

. label list region
region:
           1 Europe & C. Asia
           2 North America
           3 South America

. label define region 1 "EU & C.A" 2 "NA" 3 "SA", replace

. label values region region

. label list region
region:
           1 EU & C.A
           2 NA
           3 SA

If a variable has been read as a string but really contains only numerical values, you will want to use the command destring, specifying either the , replace option to replace the original string with a numeric variable, or , gen(varname) to keep both variables and save the numeric variable with varname.

If instead you want to convert string data into a numeric variable, use the command encode, specifying the name of the new variable you want to generate , gen(varname). The new numeric variable will be automatically labelled using the original string names.

To convert a labelled numeric variable to a string, we similarly use the decode command. Note that this will only work with numeric variables which are labelled. Otherwise, we can use tostring to create strings of numbers from unlabelled numeric variables.

Practical Exercise: Converting between string and numeric variables.

Let us convert the string variable country to a labelled numeric variable and back to a string again.

. encode country, gen(country_code)

. decode country_code, gen(country_string)

If you have done this correctly, country_string and country should be identical (check in data browser).

3.4 Summarising Data

One of the first things you will want to do with your data is to summarise its main features. The summarize command shows us the mean, standard deviation, minimum, maximum, and number of observations for each variable in our data. Alternatively, specifying a variable after the command (e.g. summarize gnppc) Stata will summarize each variable on its own. If you would like more precise information (e.g., percentiles) then you can add the detail option to the end of that command, i.e., summarize gnppc, detail.

Practical Exercise: Summary statistics of a variable.

Run simple descriptive statistics for GNP per capita.

. summarize gnppc 
 
    Variable │        Obs        Mean    Std. dev.       Min        Max 
─────────────┼───────────────────────────────────────────────────────── 
       gnppc │         63    8674.857    10634.68        370      39980

GNP per capita ranges from $370 to $39,980 with an average of $8,675. We also see that Stata reports only 63 observations on GNP per capita, so we must have some missing values. We discuss how to deal with missing values in section 3.6.6. below.

Sometimes you may wish to break down summaries by a particular variable. The bysort command is very useful here. It is best explained with an example. Suppose you want to see how GNP per capita varies by region. The command here would be bysort region: summarize gnppc. This means, in English, “summarize GNP per capita for every distinct value of region.”

3.5 Tables

Frequency tables are useful for summarising categorical data. They can be created in Stata using the table command. The simplest frequency table is a one-way tabulation of a variable (i.e., tells you how many times each answer is given in response to a particular variable).

Practical Exercise: Frequency table of a categorical variable.

Create a frequency table of regions in the life expectancy data, showing how many countries there are in each region and what percentage of all countries each region makes up.

. table region, stat(frequency) stat(percent)
───────────────────┬─────────────────────
                   │  Frequency   Percent
───────────────────┼─────────────────────
Region             |                     
  Europe & C. Asia |         44     64.71
  North America    |         14     20.59
  South America    |         10     14.71
  Total            |         68    100.00
───────────────────┼─────────────────────

This is only suitable for variables with relatively few entries, such as categorical data. As with all commands, this can also be accessed through the menus via “Statistics -> Summaries -> Frequency tables -> One-way tables”.

We can also obtain two-way frequency tables which give a breakdown of a variable by another. For example, table region lexp, stat(frequency) stat(percent) will show how many countries with each life expectancy there are in each region. The same results can be achieved using the tabulate command. However, the table command is a more flexible and powerful tool. For example, it allows us to create three-way frequency tables (table var1 var2 var3, stat(frequency) stat(percent)).

The table command also allows us to explore summary statistics of continuous variables, like averages. To do this, simply add the continuous variable we want to summarise and summary statistic we want to report to the stat() option. The summary statistics we can report using table are in the command Helpfile. Common statistics include mean, variance, sd, count, median, min, max, range.

To specify the number of decimal places to show in our tables use the nformat() option. See help format for all of the formatting sub-options. Using %12.2fc specifies two decimal places and adds a comma to make large numbers more readable.

Practical Exercise: Frequency table with mean of other variables.

Add the mean life expectancy and GNP per capita for each region to our frequency table.

. table region, stat(freq) stat(mean gnppc) stat(mean lexp) nformat(%12.2fc)
───────────────────────────────────────────────────────────────────────────
                   |  Frequency                      Mean                  
                   |              GNP per capita   Life expectancy at birth
───────────────────────────────────────────────────────────────────────────
Region             |                                                       
  Europe & C. Asia |      44.00        10,738.05                      73.07
  North America    |      14.00         5,817.17                      71.21
  South America    |      10.00         3,645.00                      70.30
  Total            |      68.00         8,674.86                      72.28
───────────────────────────────────────────────────────────────────────────

We can also use the table command in a similar way to summarize by specifying all the summary statistics of a single variable.

Practical Exercise: Summary statistics using table.

For GNP per capita, create a table with the average, standard deviation, number of observations, and min/max values. This gives us exactly the same results as using the summarize command we learned above.

. table  , statistic(count gnppc) statistic(mean gnppc) statistic(sd gnppc) statistic(max gnppc) statistic(min gnppc) nformat(%12.2fc)

────────────────────────────────────────
Number of nonmissing values |      63.00
Mean                        |   8,674.86
Standard deviation          |  10,634.68
Maximum value               |  39,980.00
Minimum value               |     370.00
────────────────────────────────────────

3.6 Generating New Variables

We have briefly seen how new variables are generated when we switch between strings and labelled numeric data using encode/decode. This is just one example of the many instances in which we might want to create new variables from our existing data. The most important Stata commands for creating new variables are generate, replace, and recode.

3.6.1 Generate, Replace

The generate command creates a new variable using an expression of other variables or constants. For example, gen gnppc_sq = gnppc^2. If we have made a mistake in generating a variable, we can drop it using drop varname.

Arithmetic operators allowed are:

Code	Meaning
+	Add
-	Subtract
*	Multiply
/	Divide
^	Raise to the power of
+	String concatenation

generate is often used in conjunction with a logical expression using if. For example, if we wanted a variable equal to 1 for all countries with a high life expectancy (say above 75), we would type gen high = 1 if lexp >75. To make this variable equal to zero for all countries with life expectancy not above 75 (i.e., to create a dummy variable), we use the replace function: replace high = 0 if lexp <=75. More succinctly, gen high = lexp > 75 gives us exactly the same dummy variable.

Logical operators allowed are:

Code	Meaning
==	Equal
!=	Not equal
>	Greater than
<	Less than
>=	Greater than or equal
<=	Less than or equal
l	Or
&	And
!	Not

We can also generate variables using Stata’s inbuild functions. It is very common for economists to take the natural logarithm of large variables, such as GNP per capita. This is especially useful when a variable displays a non-linear relationship with another variable (more on this in section 5). To take the natural logarithm of GNP per capita, type gen log_gnp = log(gnppc). Notice that this will create missing values for any zero value, since the natural logarithm of zero is undefined.

Here are a few frequently used functions (type help mathfun for the full list):

Code	Meaning
abs(x)	the absolute value of x
exp(x)	the exponential function of x
ln(x) or log(x)	the natural logarithm of x if x>0
log10(x)	the log base 10 of x (for x>0)
max(x1,x2,…,xn)	the maximum of x1, x2, …, xn, ignoring missing values
min(x1,x2,…,xn)	the minimum of x1, x2, …, xn, ignoring missing values
round(x)	x rounded to the nearest whole number
sqrt(x)	the square root of x if x >= 0

3.6.2 Extended Generate

The egen (“extended gen”) command works just like gen but with extra options, which you can explore in the egen Helpfile. For example, for a variable of mean life expectancy egen avg_lexp = mean(lexp). With egen we can also break commands down into percentiles very easily. For example, to create a variable equal to the 99th percentile of GNP per capita, enter egen high_gnp = pctile(gnp), p(99). Changing the 99 to 50 in that command would produce a variable equal to the median price.

egen is often combined with the bysort command to obtain a breakdown of a particular statistic by another variable. For example, we could obtain a variable containing the minimum life expectancy by region by typing: bysort region: egen min_lexp = min(lexp). This essentially groups our data into regions and then applies the egen command separately to each region.

3.6.3 Summarize, Generate

Results of the summarize command are stored by Stata in r(). These can also be used to generate new variables.

Practical Exercise: Demeaning a variable

In quantitative analysis, demeaning or centering is when we calculate the deviation of a variable from its mean value. The summarize command in conjunction with generate can be used to achieve this. For example, if we wanted a new variable with the deviation of life expectancy from its mean value, we would run the following:

. sum lexp

    Variable |        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        lexp |         68    72.27941    4.715315         54         79

. gen dm_lexp = lexp - r(mean)

3.6.4 Recode

The recode command is used to group a numeric variable into categories.

Suppose for example, we wanted to group countries into three life expectancy age categories: “high” for above 75, “medium” for between 65 and 75 inclusive, and “low” for below 65. This could be done by:

recode lexp (min/64 = 1) (65/75 = 2) (76/max = 3), gen(lexp_category)

A range is specified using a slash and includes the two boundaries. So 65/75 is 65 to 75 inclusive. You can use min to refer to the smallest value and max to refer to the largest value of the variable you are recoding.

Then apply labels using commands described above:

label define lexp_category 1 "low" 2 "medium" 3 "high", replace
label values lexp_category lexp_category

Or you can specify value labels in each recoding rule:

recode lexp (min/64 = 1 low) (65/75 = 2 medium) (76/max = 3 high), gen(lexp_cat)

3.6.5 Sorting and Counting

You can sort data by any chosen variable value using the gsort command. For example, to sort our countries by life expectancy in ascending order (lowest to highest), type gsort lexp. For descending order add a minus symbol before the variable you want to sort, gsort -lexp. You can also sort by multiple variables. To sort countries by life expectancy and then GNP per capita in ascending order, gsort lexp gnppc.

The subscripts _n and _N can be used to create variables that count and rank your data. _N is used to count the total number of observations. For example, gen total_country = _N gives us a new variable which is the total number of observations in the dataset. _n tells us the ranking of each observation in the dataset. If we wanted to create a variable which ranks GNP per capita from highest to lowest we would first gsort -gnppc then gen gnp_rank = _n. This is useful in quantitative analysis, where we are sometimes more interested in the rank than value of a variable, especially if the variable changes non-linearly across observations or over time.

The _N and _n subscripts are useful when used with the bysort command. For example, to generate a variable with the total number of countries in each region, bysort region: gen reg_tot = _N. Or for the regional ranking of GNP per capita, from highest to lowest, first gsort region -gnppc then bysort region: gen reg_rank = _n.

3.6.6 Missing Values

Missing values are common, particularly in survey and census data where individuals might not respond to all questions.

Often missing values will just be blank entries, represented in Stata using a dot (.). For most commands, like the regression commands we will explore in section 5 below, observations which have missing values (for any of the variables which are involved in running that command) are excluded.

However, Stata actually equates missing values with infinity. So, if you try something like gen gnppc_lb = 500 if gnppc>500 then the missing values will be (unintentionally) included. This would have to be reversed by replace gnppc_lb = . if missing(gnppc).

In many established surveys missing values will be coded as numbers and appropriately labelled: -9 , 99, 9999 are often used. Reading the documentation for datasets you are working with will tell you how to identify missing values. To recode these values as missing in Stata, recode var (99=.) or replace var=. if var==99.

More conceptually, the challenge posed by missing values are a very important topic in policy analysis. A key question to consider is whether values are missing in a systematic way. For example, if in our life expectancy data only countries with good access to safe water report the % of the population with access, it can lead us to believe that this value is higher than it actually is. Our conclusions about the relationship between clean water and life expectancy would reflect a selected (or non-random) subset of the underlying evidence.

Practical Exercise: Generating new variables using generate.

Create 3 new variables from the life expectancy data:

A dummy variable called rich equal to 1 if GNP per capita for a country is greater than the sample mean GNP per capita, and 0 otherwise.
A continuous variable called wellbeing equal to the natural logarithm of GNP per capita added a quarter of life expectancy.
A dummy variable called missing_any equal to one if any variable is missing for the observation.

. sysuse lifeexp, clear 
(Life expectancy, 1998) 

. sum gnppc

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       gnppc |         63    8674.857    10634.68        370      39980

. gen rich = gnppc > r(mean)

. replace rich = . if missing(gnppc)
(5 real changes made, 5 to missing)

. gen wellbeing = ln(gnppc) + lexp/4
(5 missing values generated)

. gen missing_any = 0

. replace missing_any = 1 if missing(region) | missing(country) | missing(popgrowth) | missing(lexp) | missing(gnppc) | missing(safewater) | missing(rich)
(31 real changes made)

3.7 Reshaping Datasets

Data can either be in “wide” form or “long” form and we can switch between these. Reshaping is most often needed when we work with panel datasets that contain multiple units (e.g., countries) being observed over time. “Long” data will have a time variable (e.g., year) and units (e.g., countries) being repeated for each time period. “Wide” data will have each unit observed only once, and different variables for each time period.

This is difficult to explain without an example. Load the system blood pressure dataset in its long format: sysuse bplong, clear. Looking in our data browser we will see that we have 120 patients, identified by their patient id, with their blood pressure bp, observed in two time periods (when) “before” and “after”, coded as 1 and 2 respectively. Loading the wide version of the same data, sysuse bpwide, clear, you will see that we have the same 120 patients, this time with a separate variable for their blood pressure in the “before” and “after” period.

To go between “long” and “wide” data, we use the reshape command. The syntax is a little tricky, but the most important part is to identify our outcome variables (in our case blood pressure bp), our time variable (in our case when), and our unit identifier (in our case patient id). The command then takes the form reshape “destination shape (wide or long)” “outcome variables”, i(“identifier”) j(“time variable”).

Practical Exercise: Reshaping data.

Reshape the blood pressure data from long to wide and back again.

. sysuse bplong, clear
(Fictional blood-pressure data)

. reshape wide bp, i(patient) j(when)
(j = 1 2)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations              240   ->   120         
Number of variables                   5   ->   5           
j variable (2 values)              when   ->   (dropped)
xij variables:
                                     bp   ->   bp1 bp2
-----------------------------------------------------------------------------

. reshape long bp, i(patient) j(when)
(j = 1 2)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              120   ->   240         
Number of variables                   5   ->   5           
j variable (2 values)                     ->   when
xij variables:
                                bp1 bp2   ->   bp
-----------------------------------------------------------------------------
. label define when 1 "before" 2 "after", replace

. label values when when

3.8 Joining Datasets

There are two different situations when you will need to join datasets, each requires a different command. Broadly these involve adding more observations or adding more variables.

Adding new observations is easier. This is done using the append command. Load a dataset form the Stata website with information on the population of three counties in California (Los Angeles, Orange, and Ventura). We can append another dataset with the population of three counties in Illinois (Cook, DeKalb, and Will) using the append command and specifying the option generate() to create a variable identifying from which dataset each observation originally came. We can then label this generated variable to identify which state each county is in.

Notice that the variable names need to be the same for Stata to append the datasets successfully.

Practical Exercise: Appending two datasets.

Append data on the population of counties in California and Illinois.

. use https://www.stata-press.com/data/r18/capop, clear

. list
     +-----------------------+
     |      county       pop |
     |-----------------------|
  1. | Los Angeles   9878554 |
  2. |      Orange   2997033 |
  3. |     Ventura    798364 |
     +-----------------------+

.  append using https://www.stata-press.com/data/r18/ilpop, generate(state)
. label define state 0 "CA" 1 "IL"

. label values state state

. list
     +-------------------------------+
     |      county       pop   state |
     |-------------------------------|
  1. | Los Angeles   9878554      CA |
  2. |      Orange   2997033      CA |
  3. |     Ventura    798364      CA |
  4. |        Cook   5285107      IL |
  5. |      DeKalb    103729      IL |
  6. |        Will    673586      IL |
     +-------------------------------+

Adding new variables is slightly more difficult. This is done using the merge command. We can’t just add the information at random, we need to make sure that each observation is matched in each dataset. So, you need a variable in each dataset that uniquely identifies each observation (e.g., country, county, individual, firm). For example, consider two datasets from the Stata website on the weight and length / price and mileage of old car makes. To merge this data, we first need to identify the observation variable, which in this case is the car make. We then tell Stata that one unique car make in the first dataset corresponds to one unique car make in the second by using the command merge 1:1 make. Many-to-one (m:1) and one-to-many (1:m) merges are also possible, see merge help for more details.

When a merge is performed in Stata, Stata generates a variable called _merge which tells us whether a match was found for each observation. If the observation is present in both the original and merged data, _merge == 3. If the observation is only in the original data, _merge == 1. If its only in the data being merged in, _merge == 2. If we wanted to keep only those car makes for which all information is present, we would use the command keep if _merge == 3.

Practical Exercise: Merging two datasets.

Merge data on the weight and length of car makes to information on price and mileage.

. use https://www.stata-press.com/data/r18/autosize, clear
(1978 automobile data)

. list
     +------------------------------------+
     | make               weight   length |
     |------------------------------------|
  1. | Toyota Celica       2,410      174 |
  2. | BMW 320i            2,650      177 |
  3. | Cad. Seville        4,290      204 |
  4. | Pont. Grand Prix    3,210      201 |
  5. | Datsun 210          2,020      165 |
  6. | Plym. Arrow         3,260      170 |
     +------------------------------------+

. use https://www.stata-press.com/data/r18/autoexpense, clear
(1978 automobile data)

. list
     +---------------------------------+
     | make                price   mpg |
     |---------------------------------|
  1. | Toyota Celica       5,899    18 |
  2. | BMW 320i            9,735    25 |
  3. | Cad. Seville       15,906    21 |
  4. | Pont. Grand Prix    5,222    19 |
  5. | Datsun 210          4,589    35 |
     +---------------------------------+

. merge 1:1 make using https://www.stata-press.com/data/r18/autosize 
    Result                      Number of obs
    -----------------------------------------
    Not matched                             1
        from master                         0  (_merge==1)
        from using                          1  (_merge==2)

    Matched                                 5  (_merge==3)
    -----------------------------------------

. list
     +--------------------------------------------------------------------+
     | make                price   mpg   weight   length           _merge |
     |--------------------------------------------------------------------|
  1. | BMW 320i            9,735    25    2,650      177      Matched (3) |
  2. | Cad. Seville       15,906    21    4,290      204      Matched (3) |
  3. | Datsun 210          4,589    35    2,020      165      Matched (3) |
  4. | Pont. Grand Prix    5,222    19    3,210      201      Matched (3) |
  5. | Toyota Celica       5,899    18    2,410      174      Matched (3) |
  6. | Plym. Arrow             .     .    3,260      170   Using only (2) |
     +--------------------------------------------------------------------+