Introduction to Stata

The Stata Interface

Welcome to OpenIntro Labs using Stata! Stata is a program for statistical computing that will allow you to interact with data and apply the statistical methods you learn in the OpenIntro Statistics course. This first lab will provide an introduction to Stata, including the interface, getting data into Stata, and some basic commands.

You can apply statistical methods in Stata using either the drop-down series of menus or using some programming. These labs will focus on teaching programming skills using Stata.

This lab assumes that you have already installed Stata. Stata is available for purchase at http://www.stata.com. You may also wish to see whether your university has Stata licenses available for your use.

First, launch Stata as you would with any program on your computer. The first screen you see should look something like this:

Stata interface

You will notice several areas of the screen. The bottom rectangle is the Command window, where you can enter Stata code interactively (or “on the fly’’) that you wish to run. The results will display in the Results window above the command line. When you open Stata, the Results window displays information about the version of Stata that you are running.

The right panel summarizes what information, for example data, is currently available to work with in Stata. In this screen above, we do not have any data or information loaded into Stata, so that area is blank.

Working directory

Each time you open Stata, it is a good idea to know where your working directory is set to. The working directory is where Stata looks for Stata-related files, such as datasets or code. We recommend creating a folder for your labs on your computer and setting your working directory there. You should save all datasets for these labs in this folder.

You can view the current working directory by looking at the bottom of the screen:

Stata interface

You can also check the current working directory by typing pwd into the Command window and pressing enter.

Stata interface

If your working directory is not set to where you would like Stata to look, you can change this to another folder on your computer. You can set the working directory through the dropdown menu:

Stata interface

You can also set the working directory using the command “cd” followed by the path on your computer.

Stata interface

Do-files

In Stata, you can enter code interactively (“on the fly”) in the Command window. We have shown this above with the pwd and cd commands. You can also save your code in a separate file called a do-file. It is always better to save code in a separate do-file because:

You can save your code cleanly.
You can annotate your code with comments explaining what you were trying to do.
You can easily share your work with others.

In these labs, you should always use a do-file to record any code that you create. To create a new do-file, select the button at the top corresponding to Do-file Editor:

Stata interface

Or choose file -> new -> do file editor from the drop down menu.

Do-file

Using either approach, a new window will open that is your do-file. Save this file to wherever you would like to store your code for this lab.

When you type commands into your do-file, they will not automatically run when you press enter. Highlight the code you wish you run in your do-file, and click the “do” button on the top right to send your code to Stata.

Do-file

Once you highlight and click the “do” button, the Stata results window will show the command and any output. You can also use a keyboard shortcut to run code in your do-file:

Windows: Type ctrl-D (control button-letter D)
Mac: Type command-shift-D

Dr. Arbuthnot’s Baptism Records

Loading data into Stata

In this lab, we will use a datafile called arbuthnot.dta. Datafiles with the extension “.dta” are Stata datafiles, and they are the easiest to read into Stata. All of the following assume that Stata is available on your computer. There are three options for reading .dta datafiles into your computer:

Double click the dataset in your file folder on your computer.
Choose file -> open and select the file folder on your computer.
Using Stata code in your do-file.

We will focus on the last option because it is “reproducible” in the sense that others can see from your do-file what dataset you have loaded. First, open up a new do-file using the instructions above and save it to a folder called “Lab-dofiles” within your OpenIntroLabs folder. From now on, we assume that you type the given code into your do-file and save it there.

Make sure your working directory is set to your OpenIntroLabs folder, and that the dataset “arbuthnot.dta” is within this folder. Then enter the following into your do-file and run the code:

use "arbuthnot.dta"

You’ll notice that Stata does not provide any feedback, other than repeating the code you wrote in the Results window. If you do no receive an error in red font, the data likely has been loaded into Stata.

Once the dataset is loaded, there will be additional information in the Variables window in the top right corner. Specifically, we see three variables: year, boys, girls. If you click on each variable in the top right, information about the variable is presented in the Properties window in the bottom right. Additionally, if you scroll down in the Properties window, you can see information about the Arbuthnot baptism data, including the number of variables (3) and the number of observations (82).

Background on Arbuthnot baptism data

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18^th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.

Interacting with data

To view the dataset in Stata, you can click the “Data Browser” button at the top of Stata.

Data browser

Here, you can see the three variables again: year, boys, and girls. Each row represents a different year, starting with 1629 and ending with 1710, for a total of 82 observations. The columns “boys” and “girls” indicate the number of boys and girls baptized in each year, respectively.

Some Exploration

Going back to your Stata do-file, the command codebook provides a summary for each variable in your dataset. This is not recommended for variables with many (hundreds of variables), but will be sufficient here.

codebook

If we would like to only look at information about one variable, we could follow codebook with the variable of interest, e.g. girls.

codebook girls

 . codebook girls

---------------------------------------------------------------------------------------------------------------------------
girls                                                                                                           (unlabeled)
---------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (long)

                 range:  [2722,7779]                  units:  1
         unique values:  80                       missing .:  0/82

                  mean:   5534.65
              std. dev:   1592.14

           percentiles:        10%       25%       50%       75%       90%
                              3181      4457      5718      7158      7483

What command would you use to extract just information about boys baptized? Try it!

Data visualization

In Stata, the scatter command gives a simple plot of the number of girls baptized per year.

scatter girls year

Scatterplot of girls over time

You will notice that the first variable, girls, appears first in the command, and also appears on the y-axis. The second variable, year, appears second in the command, and appears on the x-axis. We can modify the plot using options in Stata. Options in Stata are frequently specified after the main command and are preceded by a comma. For example, if we wanted to change the color of the plotted points to red, we would type:

scatter girls year, mcolor("red")

To read about all the options available for scatterplots, you can access the help files by typing into Stata help scatter.

Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)

Stata as a calculator

We can also use Stata to compute mathematical expressions, just like a calculator. We only need to precede our mathematical expression with the command display to make Stata do computations. For example, if we want to see the total number of baptisms in 1629, we could type:

display 5218 + 4683

Adding a new variable to the data frame

Alternatively, we could create a new variable corresponding to the sum of boys and girls using the generate command:

generate total = boys + girls

After running the generate command, open up the Data Browser and verify the new variable, total, is added to the dataset. We can make a plot of the total number of baptisms per year with the command:

scatter total year

Similarly to how we computed the total number of births, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with

display 5218 / 4683

or we can add a new variable that contains all the ratios by using the command generate:

generate ratio = boys / girls

We can also compute the proportion of newborns that are boys in 1629:

display 5218 / (5218 + 4683)

or this may also be computed for all years simultaneously and appended to the dataset:

generate boyratio = boys / total

Note that we are using the new total variable we created earlier in our calculations.

Now, generate a plot of the proportion of boys born over time. What do you see?

Finally, in addition to simple mathematical operators like subtraction and division, you can ask Stata to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression

generate moreboys = boys > girls

This command add a new variable to the arbuthnot dataframe containing the values of either 1 if that year had more boys than girls, or 0 if that year did not (the answer may surprise you).

More Practice

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. First, save the “present.dta” dataset to your OpenIntroLabs folder. Then you will need to clear the current dataset from Stata using the command clear. At this stage, you should have saved all your code in a do-file, and you do not need to save your datafile. Load the present day data with the following command.

clear
use "present.dta"

The data are stored in a data frame called present.

What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Exercise 3 above, just replace the dataframe name.
In what year did we see the most total number of births in the U.S.? Hint: First calculate the totals and save it as a new variable. Then, sort your dataset in descending order based on the total column. You can do this by running the command sort total.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for Stata by Jenna R. Krall and John Muschelli and adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

Resources for learning Stata

This lab is meant to provide you with a brief introduction to Stata. You may find it helpful to use Google to help you find code. In addition, there is a useful list of resources at http://www.stata.com/links/resources-for-learning-stata/.