Stata Rules


Applied research requires great care, and it is very important that you follow a set of rules to organize your work. The rules proposed here serve the following goals:

  • Replicability of results
  • Ease of collaboration with other researchers, even if located elsewhere
  • Modularity, to allow integration of data and results from different projects

The simple rules that follow are designed for Stata, but, as you'll see, we also apply some of them to other project documents. The general principles behind them are equally valid for any statistical package (such as R or SPSS).

Program files, not mouse clicks


Forget Excel. Excel is evil. That's why we use Stata (or other similar programs). These are "programs that can be programmed", that is, we can write files containing a series of commands. In Stata, such files have a ".do" extension. For this reason, we call them ".do files".

So, when we use Stata we make minimal use of its menus, and we never use them to run commands. We run commands within .do files. These .do files load datasets, process them, often merge them together, and then carry out the desired analyses. Almost invariably, a project is composed of several .do files, each one dealing with a block of logically related tasks within the project. For example, a first .do file may load the raw data and process it, a second one may compute descriptive statistics of various types, and a third .do file may estimate some statistical model.

The project folder


The project folder is where we put all the material of a project: software and documents. We will use Dropbox, so a project folder is simply a Dropbox folder shared with your collaborators.

Each project folder should have the following subdirectories:
  • /do_files (where we put .do files)
  • /in_data (where we put the raw data. These may be spreadsheets, datasets saved in the Stata ".dta" format, CSV files, etc.)
  • /out_data (where we put data processed by one or more .do files, ready to be loaded by another .do file. These are invariably .dta files)
  • /temp (where we put temporary datasets that are used within a single .do file. These are invariably .dta files)
  • /log_files (where we put log files, that is, files containing results)
  • /docs (where we place documents, for example, the paper that we are writing).

Moreover, within /do_files we create a subdirectory named /do_files_old, and within the /docs subdirectory, one named /docs_old.

The /docs folder may also have subfolders of various types, depending on the needs.
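
The layout above can be created from within Stata itself. The following is a minimal sketch, assuming a hypothetical project root (the path in the cd line is only an illustration and must be replaced with your own shared folder):

```stata
* Hypothetical project root; replace with the path of your shared folder
cd "C:/Users/lp/Dropbox/GOVEU"

* capture suppresses the error Stata raises when a folder already exists
foreach folder in do_files in_data out_data temp log_files docs {
    capture mkdir "`folder'"
}

* The archive folders are created after their parent folders
capture mkdir "do_files/do_files_old"
capture mkdir "docs/docs_old"
```

Because of the capture prefix, the sketch can be run again on an existing project without stopping on folders that are already there.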


File naming rules


These rules apply to all types of files.

  • Use mnemonic file names
  • File names should not have spaces. Different strings are linked by underscores
  • Every file name begins with a short string identifying the project
  • Every file name ends with a date, of the type yyyy_mm_dd
  • In a collaborative project, the initials of the person who last modified the file come immediately before the date.

Examples of file names
  • GOVEU_data_process_LP_2015_04_28.do
(a .do file, part of the "GOVEU" project, presumably dedicated to data processing, last updated by LP, on the 28th of April of 2015)
  • GOVEU_determinants_happiness_LP_2015_04_28.odt
(an OpenOffice document, such as a paper researching the determinants of happiness, etc.)
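
When a .do file saves a dataset, a name of this form can be assembled with local macros rather than typed by hand. This is only a sketch: the macro names (today, stamp, project, initials, fname) and the "data_clean" label are hypothetical choices, not part of the rules.

```stata
* Convert today's date (e.g. "28 Apr 2015") into a Stata date value
local today = date(c(current_date), "DMY")

* Build the yyyy_mm_dd stamp, zero-padding month and day
local stamp = string(year(`today')) + "_" + ///
              string(month(`today'), "%02.0f") + "_" + ///
              string(day(`today'), "%02.0f")

* Hypothetical project tag and initials
local project "GOVEU"
local initials "LP"

* Assemble and show the resulting file name
local fname "`project'_data_clean_`initials'_`stamp'.dta"
display "`fname'"
```

The macro `fname' can then be used in a save command, so that the saved dataset always carries the date on which the .do file was run.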

Housekeeping


In a collaborative project, it is very important that whenever a team member modifies a file, she changes the name of the file to reflect the new date and her initials.

As one or more people work on a project, they generate several versions of the same files. It is important to regularly move older versions of .do files to the /do_files_old subdirectory (and older versions of documents to /docs_old).

Once it is known that those old versions are no longer needed, they may also be deleted.

Documenting .do files


It is essential that .do files are well documented. Each .do file should start with a standard header. Remember: your code should be understood by your collaborators who may be far away, by the scientific community at large, and by yourself in ten years, should you need to replicate your own results, or use them within a new project.
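
The rules do not prescribe the exact content of the header, so the following is only one possible layout. The field names (Project, File, Purpose, and so on) are assumptions chosen for illustration, and the file name is the one used in the naming example.

```stata
*------------------------------------------------------------------
* Project:      GOVEU
* File:         GOVEU_data_process_LP_2015_04_28.do
* Purpose:      load the raw data and prepare it for analysis
* Inputs:       (raw files read from in_data)
* Outputs:      (datasets saved to out_data)
* Last edited:  LP, 2015_04_28
* Run after:    (name of the preceding .do file, if any)
*------------------------------------------------------------------

* Start from a clean session so the file does not depend on leftovers
clear all
set more off
```

Keeping the same fields in every .do file also makes it easy to copy the headers into the readme file.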

The "readme" file


As you proceed, you should update a readme_[date].txt file, which starts with a standard header describing the project, and then lists the headers of the individual .do files in the same order in which they have to be run.

This file should allow any person familiar with Stata to replicate all results of the project, simply by running all the .do files in the same order in which they are listed in the "readme_[date].txt" file.
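
The run order listed in the readme file can also be mirrored in a short .do file that calls the others in sequence. This is an optional illustration rather than one of the rules; the second and third file names are hypothetical, and the sketch assumes that the working directory is the project folder.

```stata
* Run the project's .do files in the order listed in the readme file
* (hypothetical file names; paths are relative to the project folder)
do "do_files/GOVEU_data_process_LP_2015_04_28.do"
do "do_files/GOVEU_descriptives_LP_2015_04_28.do"
do "do_files/GOVEU_estimation_LP_2015_04_28.do"
```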

Obviously, path names will have to be modified to run the .do files of a project on a different computer. To minimize the number of modifications needed, it is important to always use relative path names, and to declare the relevant working directory at the beginning of each .do file. In this way, a single change is enough to make a .do file run on a different computer.
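
A minimal sketch of this pattern follows. The project root and the dataset names are hypothetical; only the cd line would need to change on another computer.

```stata
* The only machine-specific line: hypothetical project folder
cd "C:/Users/lp/Dropbox/GOVEU"

* Record the results of this run in /log_files
capture log close
log using "log_files/GOVEU_data_process_LP_2015_04_28.log", replace text

* From here on, every path is relative to the project folder
use "in_data/GOVEU_raw.dta", clear    // hypothetical raw dataset

* ... data processing commands go here ...

save "out_data/GOVEU_clean_LP_2015_04_28.dta", replace

log close
```

Forward slashes are used in the paths because Stata accepts them on Windows, Mac, and Linux alike, which avoids a second source of machine-specific edits.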