Best Practices
Below, we present general data management and efficiency best practices, provide software-specific data recommendations, and highlight strongly recommended reading.
General Data Management & Efficiency Best Practices
- For large projects, keep a README file in the top-level directory with a project summary that includes who was involved, dates, and a listing of the directory structure and important files within that project folder.
- Avoid creating unnecessary data sets: combine multiple data steps into a single step if possible.
- Keep files zipped or compressed if you aren't using them.
- Check for duplicate files when sharing a project folder with multiple users.
- Do not keep duplicate copies of raw data in different software formats.
- Avoid keeping unnecessary interim data sets.
- Store common sub-expressions in variables rather than re-computing them.
- Identify which portions of the program are using the most time. In Stata, "set rmsg on" causes the run time to be displayed after each command; in MATLAB, use the "tic" and "toc" functions to compute elapsed time.
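The last two points can be illustrated outside Stata and MATLAB as well. The following is a minimal Python sketch, not a benchmark: the function names and the sample data are hypothetical, and `time.perf_counter` plays the role that "tic" and "toc" play in MATLAB. It times one version that re-computes a common sub-expression for every element and one that stores it in a variable first.

```python
import math
import time

def normalize_recompute(values):
    # The norm is re-computed for every element, so the work grows as n^2.
    return [v / math.sqrt(sum(x * x for x in values)) for v in values]

def normalize_cached(values):
    # The norm is a common sub-expression: compute it once and reuse it.
    norm = math.sqrt(sum(x * x for x in values))
    return [v / norm for v in values]

def timed(func, *args):
    # Return the function's result and its elapsed wall-clock time in seconds.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

values = [float(i) for i in range(1, 2001)]
slow_result, slow_seconds = timed(normalize_recompute, values)
fast_result, fast_seconds = timed(normalize_cached, values)
print(f"recompute: {slow_seconds:.4f}s, cached: {fast_seconds:.4f}s")
```

Both functions return the same values; only the amount of repeated work differs, which is what the timing output is meant to expose.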
Optimization & Maximum Likelihood (Any Language)
- Supply analytic derivatives and the Hessian if possible.
- Supply thoughtful starting values.
- If calculations don't depend on the parameters being estimated, move them outside the likelihood or objective function so they are done only once, and save the results in global variables.
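As a rough illustration of the three points above, here is a small Python sketch that fits a one-parameter Poisson model, y_i ~ Poisson(exp(b * x_i)), by Newton-Raphson. The model, the function name `fit_poisson_slope`, and the data are assumptions chosen for the example, not something this guide prescribes. The sketch supplies the analytic gradient and Hessian, accepts a starting value, and computes the parameter-independent sums once before the iterations begin (here as local variables rather than globals).

```python
import math

def fit_poisson_slope(x, y, start=0.0, tol=1e-10, max_iter=50):
    """Maximize the Poisson log-likelihood for y_i ~ Poisson(exp(b * x_i))."""
    # These quantities do not depend on b, so compute them once, up front.
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    x_squared = [xi * xi for xi in x]

    b = start
    for _ in range(max_iter):
        mu = [math.exp(b * xi) for xi in x]
        # Analytic first derivative of the log-likelihood with respect to b.
        gradient = sum_xy - sum(xi * mi for xi, mi in zip(x, mu))
        # Analytic second derivative (a 1x1 Hessian); negative if any x_i != 0.
        hessian = -sum(x2 * mi for x2, mi in zip(x_squared, mu))
        step = gradient / hessian
        b -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    return b

# Hypothetical data, for illustration only.
x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1, 2, 2, 5, 7]
b_hat = fit_poisson_slope(x, y, start=0.5)
print(f"estimated slope: {b_hat:.4f}")
```

With numerical derivatives, each iteration would need several extra evaluations of the likelihood; the analytic forms avoid that cost and are usually more accurate.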
We encourage HBS researchers with large datasets to contact Research Computing Services (research@hbs.edu) for a one-on-one customized review of your data including techniques and tricks that can be used to enhance and expedite your research.
Software-Specific Data Recommendations
SAS
- For large data sets, use a LENGTH statement to reduce the size of variables.
- If you have long character strings, consider leaving them out or using a FORMAT to convert between strings and shorter codes.
- For PROC GLM, if you have categorical variables with large numbers of levels, use the ABSORB statement when appropriate.
- In PROC MIXED, speed can depend on how the model is specified. For example, using RANDOM INTERCEPT/SUBJECT=xxx can be faster than RANDOM xxx.
- For large multilevel models in PROC MIXED, consider using specialized software such as MLwiN or HLM instead.
- Determine whether you need all of your variables in the working dataset. You may save space and computing time by moving long character strings or extraneous variables to a separate dataset. Non-essential variables can be merged back into the main dataset when needed.
Stata
- Use built-in commands rather than commands implemented in ado-files if a built-in command is available with the appropriate functionality.
- On the research grid, use stata-large or stata-xl only when you need more memory than the standard stata wrapper will provide you.
- If you are getting unneeded output (e.g. with "by" group processing), use "quietly".
- Avoid macro variable loops if possible; substitute vector-oriented data set processing.
MATLAB
- Use sparse matrices where applicable.
- Use the profiler to identify sections of code that are using the most execution time and optimize those.
- Use vector and matrix operations rather than loops.
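MATLAB has its own built-in sparse matrix type, so the following is only a language-neutral illustration of why sparse storage helps, written in Python. The helper names `to_sparse` and `sparse_matvec` are hypothetical. The sketch stores just the nonzero entries of a matrix and multiplies it by a vector while touching only those entries.

```python
def to_sparse(dense):
    """Keep only the nonzero entries of a dense matrix as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }

def sparse_matvec(sparse, vector, n_rows):
    """Multiply a sparse matrix by a vector, touching only stored entries."""
    result = [0.0] * n_rows
    for (i, j), value in sparse.items():
        result[i] += value * vector[j]
    return result

# A 4x4 matrix with 16 cells but only 4 nonzero entries.
dense = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 1.0],
]
sparse = to_sparse(dense)
product = sparse_matvec(sparse, [1.0, 2.0, 3.0, 4.0], n_rows=4)
print(len(sparse), product)
```

Storage and multiplication cost scale with the number of nonzero entries rather than with the full matrix size, which is where the savings come from when most entries are zero.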
Strongly Recommended Reading
The following references are must-reads to get going on best practices for working with research data as a part of the data management lifecycle. If pressed for time, read only the top five:
- Support Your Data: A Research Data Management Guide for Researchers (Borghi et al.)
- Our path to better science in less time using open data science tools (Lowndes et al.)
- Good enough practices in scientific computing (Wilson et al.)
- Data organization in spreadsheets (Broman & Woo)
- Code and Data for the Social Sciences: A Practitioner’s Guide (Gentzkow & Shapiro)
- Data Management for Researchers (Briney)
- FAIR Guiding Principles for scientific data management and stewardship (Wilkinson et al.)
- A Quick Guide to Organizing Computational (Biology) Projects (Noble)
- Excuse me, do you have a minute to talk about version control? (Bryan)
- Best Practices for Scientific Computing (Wilson et al.)