Stata is a commonly used tool for empirical research. It comes with an extensive library of statistical methods, and user-written packages extend its functionality even further.
Stata holds one dataset in memory at a time by default, arranged as a single rectangular table. If you are familiar with Microsoft Excel workbooks, think of it as a single worksheet in which each column (a variable) has a name and each row (an observation) is numbered from 1 to the total number of rows in the dataset.
This tutorial aims to introduce you to the key features of Stata and its documentation so you can start your own empirical work.
The display command is useful for showing values at the command line.
. display 1 + 2
3
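As a small sketch of what else display can do, the lines below evaluate a few other expressions. The particular expressions are only illustrations and are not part of the original tutorial.

```stata
* display evaluates any Stata expression, not only simple sums
display 2^10
display sqrt(16)
* a format such as %9.2f controls how the number is printed
display %9.2f 1/3
* strings and expressions can be mixed in one display command
display "One plus two is " 1 + 2
```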
Use the Page Up key to recall the previous command. This is particularly useful if you need to fix a typo.
Commands can be abbreviated: di is equivalent to display. I prefer to use the whole command name because it makes code explicit.
Use the help command if you know the name of a command and want more details. Use the findit command if you want to search for a command. I end up using Google more than findit, but this may be a mistake.
Unfortunately, the help command opens a new window each time you use it. Use the nonew option to prevent this behavior, for example help help, nonew.
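A brief sketch of how these commands might be used in practice. The topics looked up here (regress, summarize, panel data) are arbitrary examples.

```stata
* open the help file for a command whose name you already know
help regress
* the same for another command, with the nonew option described above
help summarize, nonew
* keyword searches for when you do not know the command name
findit panel data
search panel data
```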
There are many different ways to read data into Stata. For a good overview of how to import data, type help import in Stata’s Command window. The commands I use most are import excel and insheet. import excel is great if you are working with an Excel workbook, while insheet is great if you have a comma-separated values (csv) file. As of Stata 13, insheet has been superseded by import delimited, although insheet continues to work.
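The round trip below is a minimal sketch of these import commands. It assumes Stata 12 or later (Stata 13 or later for the last line), and it writes its own input files first so that it can be run as is. The file names auto_copy.csv and auto_copy.xlsx are hypothetical and are created in the current working directory.

```stata
* auto is an example dataset shipped with Stata
sysuse auto, clear
* write the data out as csv and Excel files (hypothetical file names)
outsheet using "auto_copy.csv", comma replace
export excel using "auto_copy.xlsx", firstrow(variables) replace
* read the csv file back in; the first row holds the variable names
insheet using "auto_copy.csv", comma names clear
* read the Excel file back in, treating the first row as variable names
import excel using "auto_copy.xlsx", firstrow clear
* in Stata 13 or later, import delimited is the successor to insheet
import delimited using "auto_copy.csv", varnames(1) clear
```

After each import it is worth running describe to check that variable names and types came through as expected.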
Stata datasets are generally stored in files with a .dta extension. To read a Stata dataset, use the use command. For the purpose of this tutorial we will use a dataset about automobiles that is shipped with Stata. Type sysuse auto to load the dataset into memory; adding the clear option discards any data already in memory.
. sysuse auto, clear
(1978 Automobile Data)
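For your own .dta files the pattern is the same, with use in place of sysuse. The sketch below saves a copy of the auto data under a hypothetical file name (auto_copy.dta, in the current working directory) and reads it back.

```stata
sysuse auto, clear
* save a copy under a hypothetical file name in the working directory
save "auto_copy.dta", replace
* load it again; clear drops whatever is currently in memory
use "auto_copy.dta", clear
* load only some variables and observations from the file
use make price mpg if price > 10000 using "auto_copy.dta", clear
* reload the full shipped dataset so later commands see all variables
sysuse auto, clear
```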
The describe command reports the number of observations (rows) in the dataset along with the name, storage type, display format, and label of each variable.
. describe
Contains data from /Applications/Stata/ado/base/a/auto.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2011 17:45
size: 3,182 (_dta has notes)
--------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
--------------------------------------------------------------------------------------------------------
Sorted by: foreign
The summarize command gives some useful summary statistics for each variable.
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        make |         0
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
    headroom |        74    2.993243    .8459948        1.5          5
-------------+--------------------------------------------------------
       trunk |        74    13.75676    4.277404          5         23
      weight |        74    3019.459    777.1936       1760       4840
      length |        74    187.9324    22.26634        142        233
        turn |        74    39.64865    4.399354         31         51
displacement |        74    197.2973    91.83722         79        425
-------------+--------------------------------------------------------
  gear_ratio |        74    3.014865    .4562871       2.19       3.89
     foreign |        74    .2972973    .4601885          0          1
You’ll notice that 11 of the 12 variables in the auto dataset are numeric and that make is a string variable, which is why summarize reports 0 observations for it. To see what the make variable looks like, we can list the first few observations.
. list make if _n <= 5
+---------------+
| make |
|---------------|
1. | AMC Concord |
2. | AMC Pacer |
3. | AMC Spirit |
4. | Buick Century |
5. | Buick Electra |
+---------------+
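The if qualifier used above has a companion, in, and both work with most commands. The following is a short sketch; the price threshold of 10000 is an arbitrary choice for illustration.

```stata
sysuse auto, clear
* in 1/5 selects the first five observations, like if _n <= 5
list make in 1/5
* if keeps the observations for which the expression is true
list make price if price > 10000
* qualifiers work with most commands, including summarize
summarize price mpg if foreign == 1
* the detail option adds percentiles, skewness, and kurtosis
summarize price, detail
```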
To see if make uniquely identifies each row in the dataset, we can use the isid command.
. isid make
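When isid says nothing, the variable list uniquely identifies each row. The duplicates report command gives the same information as a table. Are cars also uniquely identified by their weight and length? They are not: the second report below shows four observations that share their weight and length with another car.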
. duplicates report make
Duplicates in terms of make
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 74 0
--------------------------------------
. duplicates report weight length
Duplicates in terms of weight length
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 70 0
2 | 4 2
--------------------------------------
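To see which cars share a weight and length, one option is duplicates tag. This is a sketch only; the variable name dup is my own choice.

```stata
sysuse auto, clear
* capture suppresses the error so that we can inspect the return code
capture isid weight length
* a nonzero return code means the variables do not identify the rows
display _rc
* dup counts how many other rows share each row's weight and length
duplicates tag weight length, generate(dup)
list make weight length if dup > 0
```

Using capture this way is handy in do-files, where an isid failure would otherwise stop execution.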
Imagine we are interested in how foreign and domestic cars differ. As a first step, it would be good to examine some summary statistics for each group. The tabstat command makes this fairly easy.
. tabstat price mpg weight length, by(foreign) stat(mean sd)
Summary statistics: mean, sd
  by categories of: foreign (Car type)
 foreign |     price       mpg    weight    length
---------+----------------------------------------
Domestic |  6072.423  19.82692  3317.115  196.1346
         |  3097.104  4.743297  695.3637  20.04605
---------+----------------------------------------
 Foreign |  6384.682  24.77273  2315.909  168.5455
         |  2621.915  6.611187  433.0035  13.68255
---------+----------------------------------------
   Total |  6165.257   21.2973  3019.459  187.9324
         |  2949.496  5.785503  777.1936  22.26634
--------------------------------------------------
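tabstat accepts other statistics by name, and the by prefix offers a second route to group summaries. The sketch below shows both; the choice of variables and statistics is arbitrary.

```stata
sysuse auto, clear
* request other statistics by name; n is the number of nonmissing values
tabstat price mpg rep78, by(foreign) statistics(n mean median min max)
* the by prefix repeats any command within each group
bysort foreign: summarize price mpg
* a frequency table of the grouping variable itself
tabulate foreign
```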
You may have noticed from the output of the summarize command that rep78 has 5 missing values. We can look at those observations using the list command:
. list if missing(rep78)
+---------------------------------------------------------------------------------------------+
3. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| AMC Spirit | 3,799 | 22 | . | 3.0 | 12 | 2,640 | 168 | 35 | 121 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.08 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
7. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Buick Opel | 4,453 | 26 | . | 3.0 | 10 | 2,230 | 170 | 34 | 304 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 2.87 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
45. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Plym. Sapporo | 6,486 | 26 | . | 1.5 | 8 | 2,520 | 182 | 38 | 119 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.54 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
51. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Pont. Phoenix | 4,424 | 19 | . | 3.5 | 13 | 3,420 | 203 | 43 | 231 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.08 | Domestic |
+---------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
64. | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
| Peugeot 604 | 12,990 | 14 | . | 3.5 | 14 | 3,420 | 192 | 38 | 163 |
|---------------------------------------------------------------------------------------------|
| gear_r~o | foreign |
| 3.58 | Foreign |
+---------------------------------------------------------------------------------------------+
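Missing values deserve care in if conditions, because Stata treats a missing numeric value as larger than any number. The sketch below illustrates the pitfall; the cutoff of 3 is arbitrary.

```stata
sysuse auto, clear
* number of cars with a missing repair record
count if missing(rep78)
* this count silently includes the cars whose rep78 is missing
count if rep78 > 3
* excluding missing values explicitly gives the intended count
count if rep78 > 3 & !missing(rep78)
* listing only a few variables keeps the output on one line per car
list make rep78 foreign if missing(rep78)
```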
There are good graph galleries provided by StataCorp, UCLA, and Survey Design and Analysis Services. Below is a simple scatter plot of weight versus length:
. graph twoway scatter weight length
. graph export scatter.png, replace
(file scatter.png written in PNG format)
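Graph commands take many options. As a sketch, the version below distinguishes domestic from foreign cars and adds titles. The title text and the output file name scatter_by_origin.png are my own choices, and the /// line continuation only works when the code is run from a do-file.

```stata
sysuse auto, clear
* overlay two scatter plots, one per car type; /// continues the command
graph twoway (scatter weight length if foreign == 0) ///
    (scatter weight length if foreign == 1), ///
    legend(order(1 "Domestic" 2 "Foreign")) ///
    title("Weight versus length") ///
    xtitle("Length (in.)") ytitle("Weight (lbs.)")
graph export scatter_by_origin.png, replace
```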
There are a number of ways to create new variables or modify existing ones. The most important command in this section is the generate command. Imagine we are curious about cars that are heavy for their length. We could create a new variable:
. generate weight_per_length = weight / length
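This creates a new column in the dataset: for each car, we have calculated the ratio of its weight to its length. Let’s take a look at the five cars with the highest weight per inch of length. The gsort command with a minus sign in front of the variable sorts the data in descending order.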
. gsort -weight_per_length
. list make weight_per_length if _n <= 5
+------------------------------+
| make weight~h |
|------------------------------|
1. | Cad. Seville 21.02941 |
2. | Linc. Continental 20.77253 |
3. | Linc. Mark V 20.52174 |
4. | Cad. Deville 19.59276 |
5. | Olds Toronado 19.56311 |
+------------------------------+
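generate has two frequent companions: replace, which changes an existing variable, and label variable, which documents it. The sketch below is illustrative only. The variable name heavy and the cutoff of 18 pounds per inch are hypothetical, and preserve and restore are used so that the data in memory are left as they were.

```stata
* keep a copy of the data in memory so it can be restored afterwards
preserve
sysuse auto, clear
generate weight_per_length = weight / length
label variable weight_per_length "Weight per inch of length (lbs./in.)"
* indicator variable: 1 if above the (arbitrary) cutoff, 0 otherwise
generate byte heavy = weight_per_length > 18 if !missing(weight_per_length)
* replace modifies an existing variable, here rounding to one decimal
replace weight_per_length = round(weight_per_length, 0.1)
tabulate heavy foreign
restore
```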
Another very useful command for generating new variables is the egen command. This is particularly useful if you want to merge summary statistics for groups of cars back into the larger dataset. For instance, we might be curious to see how a car’s price compares to the average price among foreign or domestic cars. We can find the average price for foreign and domestic cars using tabstat, but how do we make a column in the dataset with these values?
. tabstat price, by(foreign)
Summary for variables: price
by categories of: foreign (Car type)
foreign | mean
---------+----------
Domestic | 6072.423
Foreign | 6384.682
---------+----------
Total | 6165.257
--------------------
. egen ave_price = mean(price), by(foreign)
. list foreign ave_price
+---------------------+
| foreign ave_pr~e |
|---------------------|
1. | Domestic 6072.423 |
2. | Domestic 6072.423 |
3. | Domestic 6072.423 |
4. | Domestic 6072.423 |
5. | Domestic 6072.423 |
|---------------------|
6. | Domestic 6072.423 |
7. | Domestic 6072.423 |
8. | Domestic 6072.423 |
9. | Domestic 6072.423 |
10. | Domestic 6072.423 |
|---------------------|
11. | Domestic 6072.423 |
12. | Domestic 6072.423 |
13. | Domestic 6072.423 |
14. | Domestic 6072.423 |
15. | Foreign 6384.682 |
|---------------------|
16. | Domestic 6072.423 |
17. | Domestic 6072.423 |
18. | Domestic 6072.423 |
19. | Domestic 6072.423 |
20. | Domestic 6072.423 |
|---------------------|
21. | Domestic 6072.423 |
22. | Domestic 6072.423 |
23. | Domestic 6072.423 |
24. | Domestic 6072.423 |
25. | Domestic 6072.423 |
|---------------------|
26. | Domestic 6072.423 |
27. | Domestic 6072.423 |
28. | Domestic 6072.423 |
29. | Domestic 6072.423 |
30. | Domestic 6072.423 |
|---------------------|
31. | Domestic 6072.423 |
32. | Domestic 6072.423 |
33. | Domestic 6072.423 |
34. | Domestic 6072.423 |
35. | Foreign 6384.682 |
|---------------------|
36. | Domestic 6072.423 |
37. | Domestic 6072.423 |
38. | Domestic 6072.423 |
39. | Domestic 6072.423 |
40. | Domestic 6072.423 |
|---------------------|
41. | Domestic 6072.423 |
42. | Domestic 6072.423 |
43. | Domestic 6072.423 |
44. | Foreign 6384.682 |
45. | Domestic 6072.423 |
|---------------------|
46. | Domestic 6072.423 |
47. | Foreign 6384.682 |
48. | Foreign 6384.682 |
49. | Foreign 6384.682 |
50. | Domestic 6072.423 |
|---------------------|
51. | Domestic 6072.423 |
52. | Foreign 6384.682 |
53. | Foreign 6384.682 |
54. | Domestic 6072.423 |
55. | Foreign 6384.682 |
|---------------------|
56. | Domestic 6072.423 |
57. | Foreign 6384.682 |
58. | Foreign 6384.682 |
59. | Foreign 6384.682 |
60. | Domestic 6072.423 |
|---------------------|
61. | Foreign 6384.682 |
62. | Domestic 6072.423 |
63. | Domestic 6072.423 |
64. | Foreign 6384.682 |
65. | Foreign 6384.682 |
|---------------------|
66. | Foreign 6384.682 |
67. | Foreign 6384.682 |
68. | Foreign 6384.682 |
69. | Foreign 6384.682 |
70. | Domestic 6072.423 |
|---------------------|
71. | Foreign 6384.682 |
72. | Foreign 6384.682 |
73. | Foreign 6384.682 |
74. | Domestic 6072.423 |
+---------------------+
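With the group average stored as a column, comparing each car to its group takes one more generate. The following is a sketch under my own naming (price_diff and max_price are hypothetical variable names), listing only the first five rows rather than all 74.

```stata
* keep a copy of the data in memory so it can be restored afterwards
preserve
sysuse auto, clear
egen ave_price = mean(price), by(foreign)
* how far each car's price is from the average of its group
generate price_diff = price - ave_price
* the by prefix is an alternative to the by() option
bysort foreign: egen max_price = max(price)
* show the five cars that are most expensive relative to their group
gsort -price_diff
list make foreign price ave_price price_diff in 1/5
restore
```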
To further explore the relationship between weight and length we can run a regression.
. regress weight length
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =  613.27
       Model |  39461306.8     1  39461306.8           Prob > F      =  0.0000
    Residual |  4632871.55    72  64345.4382           R-squared     =  0.8949
-------------+------------------------------           Adj R-squared =  0.8935
       Total |  44094178.4    73  604029.841           Root MSE      =  253.66
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |   33.01988   1.333364    24.76   0.000     30.36187    35.67789
       _cons |  -3186.047   252.3113   -12.63   0.000     -3689.02   -2683.073
------------------------------------------------------------------------------
We see that, on average, each additional inch of length is associated with about 33 additional pounds of weight. We can plot the predicted values from the regression on the scatter plot from above.
. graph twoway (scatter weight length) (lfit weight length)
. graph export scatter_lfit.png, replace
(file scatter_lfit.png written in PNG format)
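The fitted line is just the regression equation evaluated at each length. As a sketch, the coefficients can be read back with _b[] and the fitted values stored with predict. The 200-inch example car and the variable names weight_hat and weight_resid are my own. With the coefficients reported above, the predicted weight of a 200-inch car is -3186.047 + 33.01988 * 200, or about 3,418 pounds.

```stata
* keep a copy of the data in memory so it can be restored afterwards
preserve
sysuse auto, clear
regress weight length
* coefficients from the most recent regression
display _b[length]
* predicted weight for a hypothetical car that is 200 inches long
display _b[_cons] + _b[length] * 200
* store fitted values and residuals as new variables
predict weight_hat
predict weight_resid, residuals
list make weight weight_hat weight_resid in 1/5
restore
```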
Germán Rodríguez’s Stata Tutorial is an excellent introduction to Stata.
These notes on writing code by Matthew Gentzkow and Jesse Shapiro have excellent suggestions on how to program with Stata.