Data Tips & Toolkits


Causal Inference

R Packages

Useful R packages for causal inference include causalTree, CausalImpact, Counterfactual, BayesTree, rrd, FindIt, causaldrf, uplift, Synth, MatchIt, pcalg, wfe, Matching, medflex, BCEE, dagitty, causaleffect, mediation, pampe, and beanz.
 

Converting Audio to Text 

  • YouTube: automatically generates text captions for the audio of uploaded videos.
  • IBM Watson Speech to Text API: a machine learning API that can transcribe audio files into text, among other capabilities.
  • VLC player: a free, open source media player that can be used on Linux in conjunction with other programs to transcribe audio.
  • Additional free speech-to-text apps include:
    • Google Gboard
    • Just Press Record
    • Speechnotes
    • Transcribe
    • Windows 10 Speech recognition
  • The Dragon series by Nuance: software providing audio to text services.

Converting PDFs to TXT Files 

  • Tabula: a tool for PDF table extraction, with a Python wrapper (tabula-py)
  • Camelot: a Python package that extracts tables from PDFs

Fuzzy Matching

Python

Python’s dedupe package supports fuzzy matching using machine learning: https://github.com/dedupeio/dedupe.

R

R’s fastLink package supports fuzzy matching by implementing a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information: https://cran.r-project.org/web/packages/fastLink/index.html.
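
To make the idea of fuzzy matching concrete, here is a minimal sketch that uses only Python's standard-library difflib module. It is not how dedupe or fastLink work (those rely on machine learning and probabilistic record linkage, respectively); it only illustrates scoring string similarity and accepting the best candidate above a cutoff. The function names and the 0.8 threshold are hypothetical choices made for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] after light normalization."""
    # Lowercase and collapse repeated whitespace so trivial differences
    # in formatting do not lower the score.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to `name`, or None if no
    candidate reaches `threshold` (an arbitrary, illustrative cutoff)."""
    scored = [(similarity(name, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return candidate if score >= threshold else None


if __name__ == "__main__":
    schools = ["Harvard Business School", "Harvard Law School"]
    print(best_match("Harvard Busness School", schools))
    print(best_match("MIT Sloan", schools))
```

For real record linkage, where several fields must be compared and some are missing, the dedicated packages above are likely a better fit than a single string-similarity score.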

Machine Learning

R

The caret package (short for “Classification and Regression Training”) provides a uniform interface to a large number of machine learning algorithms for supervised learning problems. If you are familiar with machine learning in Python, you may notice similarities with scikit-learn.

Missing Data

This blog post describes several useful R packages for handling missing data: https://www.r-bloggers.com/2016/11/missing-values-data-science-and-r/

Natural Language Processing (NLP)

Python

  • NLTK is a robust NLP package in Python.
  • For an introduction and hands-on experience using the NLTK package, DataCamp provides a free module as part of their NLP fundamentals course.

R

  • Information on the NLP packages in R can be found here.
  • An introduction to text-mining can be found here.

Power Analysis

  • Optimal Design Software: free software to calculate power for group-level interventions
  • PowerUp!: Excel tool to calculate and detect main, moderator, and mediator effects for experiments and quasi-experiments. 

R

  • WebPower: Collection of tools for conducting both basic and advanced statistical power analysis including correlation, proportion, t-test, one-way ANOVA, two-way ANOVA, linear regression, logistic regression, Poisson regression, mediation analysis, longitudinal data analysis, structural equation modeling and multilevel modeling.
  • pwr: Power analysis functions along the lines of Cohen (1988).
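
As a rough illustration of what these tools compute, the sketch below estimates the per-group sample size for a two-sided, two-sample comparison of means using the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2, where d is Cohen's standardized effect size. This is a simplification: exact t-based calculations, such as those in dedicated power analysis packages, typically give slightly larger values. The function names are hypothetical and only the Python standard library is used.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test
    of means, given a standardized effect size (Cohen's d)."""
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power with n observations per group, ignoring the
    negligible rejection probability in the opposite tail."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(effect_size) * math.sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    n = n_per_group(0.5)  # a "medium" effect in Cohen's terminology
    print(n, round(approx_power(0.5, n), 3))
```

Halving the effect size roughly quadruples the required sample size, which is why a realistic effect size estimate matters more than any other input.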

Visualizations

Word Cloud Resources

  • Tableau Public: popular software with free academic licenses.
  • Wordcloud Python Package
  • Wordcloud R Package

Web Scraping

Before embarking on a web scraping project, you must check the site’s robots.txt file and Terms of Service. For example, if you were interested in scraping data from Facebook, we’d recommend checking https://www.facebook.com/robots.txt (their robots.txt file) and https://www.facebook.com/legal/terms (their Terms of Service). The robots.txt file indicates that almost all paths are disallowed for automated crawlers. Their Terms of Service corroborate this: “You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.” In this case, Facebook provides an API (https://developers.facebook.com/docs/apis-and-sdks/) that may provide the data that you are interested in.
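
The robots.txt check can also be done programmatically. The sketch below uses Python's standard-library urllib.robotparser on an in-memory sample file so that it runs without network access. The sample rules, the example.com URLs, and the helper names are made up for illustration. Passing this check does not replace reading the Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/data.csv"):
        print(url, is_allowed(rules, url))
```

Against a live site, one would instead call `set_url()` with the address of the site's robots.txt file and then `read()` on the parser before calling `can_fetch()`.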

  • DownThemAll: a Firefox add-on that allows you to download all the files or images in a webpage and more.

Python

  • Requests
  • lxml (supports XPath expressions)
  • Beautiful Soup
  • Selenium and Headless Chrome (useful if you need to interact with a browser or the site needs to run JavaScript). 

R

rvest. An example using this package can be found on the RCS Statistics Blog.
Research Computing Services (RCS) 
Harvard Business School
Baker Library, B90, 25 Harvard Way
Boston, MA 02163
Phone: 617.495.6100
Email: research@hbs.edu
Copyright © President & Fellows of Harvard College