
Data Tips & Toolkits


Causal Inference

R Packages

Useful R packages for causal inference include causalTree, CausalImpact, Counterfactual, BayesTree, rrd, FindIt, causaldrf, uplift, Synth, MatchIt, pcalg, wfe, Matching, medflex, BCEE, dagitty, causaleffect, mediation, pampe, and beanz.
 

Converting Audio to Text 

  • YouTube: automatically generates text captions for the audio of uploaded videos.
  • Watson speech-to-text API: a machine learning API that can transcribe audio files into text, among other capabilities.
  • VLC player: a free, open source media player that can be used on Linux in conjunction with other programs to transcribe audio.
  • Additional free speech to text apps include:
    • Google Gboard
    • Just Press Record
    • Speechnotes
    • Transcribe
    • Windows 10 Speech recognition
  • The Dragon series by Nuance: software providing audio to text services.

Converting PDFs to TXT Files 

  • Tabula: a tool for PDF table extraction with a nice Python wrapper
  • Camelot: a Python package that extracts tables from PDFs

Fuzzy Matching

Python

Python’s dedupe package supports fuzzy matching using machine learning: https://github.com/dedupeio/dedupe.
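
To make the idea of fuzzy matching concrete without installing anything, here is a minimal sketch that uses difflib from Python’s standard library rather than dedupe. The company names and the 0.8 cutoff are hypothetical, and the approach differs from dedupe, which learns its matching rules from labeled examples instead of applying a fixed string-similarity threshold.

```python
from difflib import SequenceMatcher, get_close_matches


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so formatting differences do not affect the score."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, cutoff=0.8):
    """Return the candidate most similar to query, or None if none reaches the cutoff."""
    by_norm = {normalize(c): c for c in candidates}
    hits = get_close_matches(normalize(query), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None


# Hypothetical records, for illustration only.
firms = ["Acme Corporation", "Globex Inc.", "Initech LLC"]
print(best_match("ACME Corporation,", firms))
print(best_match("Umbrella Group", firms))
```

A fixed cutoff is easy to reason about but needs tuning for each dataset: set too low it links unrelated records, set too high it misses genuine variants.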

R

R’s fastLink package supports fuzzy matching by implementing a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information: https://cran.r-project.org/web/packages/fastLink/index.html.

Machine Learning

R

The caret package (short for “Classification and Regression Training”) provides a uniform interface to a large number of supervised machine learning algorithms. If you are familiar with machine learning in Python, you may notice similarities with scikit-learn.

Missing Data

This blog post describes several useful R packages for handling missing data: https://www.r-bloggers.com/2016/11/missing-values-data-science-and-r/

Natural Language Processing (NLP)

Python

General Packages

  • NLTK is a robust NLP package in Python.
  • For an introduction and hands-on experience using the NLTK package, DataCamp provides a free module as part of their NLP fundamentals course.
  • Hugging Face is a hub for open-source pre-trained models and provides user-friendly tools for applying them. It also offers numerous informative NLP tutorials.

Topic Modeling Packages

  • Gensim is often used for Latent Dirichlet Allocation (LDA) topic modeling. A good introduction/tutorial can be found here.
  • To explore topic modeling with transformer-based language models such as BERT (a bidirectional transformer whose representations capture more context than bag-of-words methods like LDA), try the BERTopic package.

Sentiment Analysis Packages

Sentiment analysis is usually approached in one of two ways: a lexicon approach, based on predefined lists of words (sentiment lexicons), or a non-lexicon approach, based on machine learning:

  • VADER, TextBlob, and AFINN are popular Python packages employing the lexicon approach.
  • Non-lexicon options include LLM-based, pre-trained sentiment models such as bert-base-multilingual-uncased, which build on BERT, RoBERTa, or DistilBERT and are available on Hugging Face.
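
The lexicon approach can be illustrated with a toy scorer. This is a sketch only: the six-word lexicon, the negation list, and the sign-flipping rule are all hypothetical simplifications, and packages such as VADER use far larger lexicons and richer rules.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons contain thousands of scored words.
LEXICON = {"good": 1, "great": 2, "excellent": 3, "bad": -1, "poor": -2, "terrible": -3}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text: str) -> int:
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, tok in enumerate(tokens):
        score = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            score = -score
        total += score
    return total


def label(text: str) -> str:
    """Map the summed score to a coarse sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("The service was great and the food was good."))
print(label("The product was not good."))
print(label("The package arrived on Tuesday."))
```

Because the score depends only on the word list, the method is transparent and fast, but it will miss sarcasm, domain-specific vocabulary, and any word not in the lexicon.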

Packages Capturing How Text is Communicated

  • Spelling and grammar error detection
  • Detect language of text
  • Reading level. Two metrics are typically used for this: the Flesch-Kincaid Grade Level, which estimates the U.S. school grade level required to understand the text, and the Gunning Fog Index, which estimates the years of formal education needed to understand the text. In both cases, higher scores indicate text that is harder to read.
  • Word counts
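
The two reading-level metrics are simple formulas over word, sentence, and syllable counts. The sketch below implements the standard formulas; the syllable counter is a rough vowel-group heuristic assumed here for illustration (dedicated packages use more careful rules), and the sample sentences are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text: str):
    """Return (words, sentences, syllables, complex_words) for a non-empty passage."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of three or more syllables as "complex".
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return len(words), sentences, syllables, complex_words


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index from raw counts."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


simple = "The cat sat. The dog ran."
dense = "Organizational heterogeneity complicates longitudinal econometric identification."
for passage in (simple, dense):
    w, s, syl, cx = text_stats(passage)
    print(round(flesch_kincaid_grade(w, s, syl), 2), round(gunning_fog(w, s, cx), 2))
```

Both scores rise with longer sentences and longer words, so very short passages can produce extreme or even negative values.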

Text Similarity

  • tf-idf
  • Cosine similarity
  • Word2Vec
  • GloVe
  • N-gram overlap, which compares the n-grams (contiguous sequences of words) shared by two texts. You can then apply Jaccard similarity or cosine similarity to the n-gram sets to compute a similarity score.
  • spaCy, whose similarity functionality uses pre-trained word vectors (for example, from GloVe) to calculate text similarity.
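
Two of the measures above, n-gram overlap with Jaccard similarity and cosine similarity, can be sketched with the standard library alone. As a simplifying assumption, the cosine function here uses raw word counts rather than tf-idf weights or word vectors, and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(text, n=2):
    """Set of contiguous n-word sequences in the text."""
    toks = tokens(text)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = "the quick brown fox"
b = "the quick red fox"
print(jaccard(ngrams(a), ngrams(b)))
print(cosine(a, b))
```

Bigram overlap is stricter than word-level cosine similarity because it requires shared word order, which is why the two scores can differ noticeably for the same pair of texts.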

R

  • Information on the NLP packages in R can be found here.
  • An introduction to text-mining can be found here.

Power Analysis

  • ClinCalc: web-based platform for conducting power analysis 
  • PowerUp!: Excel tool to calculate and detect main, moderator, and mediator effects for experiments and quasi-experiments. 
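
For intuition about what these tools compute, the sketch below estimates the per-group sample size for a two-sided, two-sample comparison of means using the normal approximation. The function name and defaults are assumptions made for illustration; dedicated tools typically use the t distribution and may report slightly larger sample sizes.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Cohen's conventional small, medium, and large effect sizes.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The required sample size scales with the inverse square of the effect size, so halving the expected effect roughly quadruples the number of participants needed.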

R

  • WebPower: Collection of tools for conducting both basic and advanced statistical power analysis including correlation, proportion, t-test, one-way ANOVA, two-way ANOVA, linear regression, logistic regression, Poisson regression, mediation analysis, longitudinal data analysis, structural equation modeling and multilevel modeling.
  • pwr: Power analysis functions along the lines of Cohen (1988).

Visualizations

Word Cloud Resources

  • Tableau Public: popular software with free academic licenses.
  • Wordcloud Python Package
  • Wordcloud R Package
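
Word cloud tools size each word by how often it appears, so the underlying input is a table of word frequencies. A minimal sketch of building that table with the standard library follows; the stop-word list and sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real projects typically use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "Data cleaning is the first step. Clean data makes the analysis of data easier."
print(word_frequencies(sample, top_n=3))
```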

Web Scraping

Before embarking on a web scraping project, check the university guidance on web scraping as well as the site’s robots.txt file and Terms of Service.

For example, if you were interested in scraping data from Facebook, we’d recommend checking https://www.facebook.com/robots.txt (their robots.txt file) and https://www.facebook.com/legal/terms (their Terms of Service). The robots.txt file indicates that almost all paths are off limits to automated crawlers. The Terms of Service corroborate this, stating: “You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.” In this case, Facebook offers an API (https://developers.facebook.com/docs/apis-and-sdks/) that may provide the data you are interested in.
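
A robots.txt check can also be done programmatically. The sketch below uses urllib.robotparser from Python’s standard library on a hypothetical robots.txt (the rules and example.com URLs are invented for illustration), so it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "https://example.com/public/page.html"))
print(allowed(ROBOTS_TXT, "https://example.com/private/data.html"))
```

For a live site, the same parser can load the file directly with set_url() followed by read(). A permissive robots.txt does not override a site’s Terms of Service, so both still need to be checked.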

  • DownThemAll: A Firefox add-on that lets you download all the files or images on a web page, and more.

Python

  • Requests
  • lxml (supports XPath expressions)
  • Beautiful Soup
  • Selenium and Headless Chrome (useful if you need to interact with a browser or the site needs to run JavaScript). 
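
The parsing step that these libraries handle can be sketched with the standard library. The example below uses html.parser as a stand-in for Beautiful Soup or lxml, which offer more convenient and more robust interfaces, and it parses a hypothetical HTML string instead of fetching a live page. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content; in practice this would come from an HTTP response.
PAGE = """
<html><body>
  <h1>Reports</h1>
  <a href="/reports/2023.pdf">Annual report 2023</a>
  <a href="/reports/2024.pdf">Annual <b>report</b> 2024</a>
</body></html>
"""


def extract_links(html):
    """Return all (href, text) pairs found in the HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


print(extract_links(PAGE))
```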

R

rvest. An example using this package can be found on the RCS Statistics Blog.
Research Computing Services (RCS) 
Harvard Business School
Baker Library, B90, 25 Harvard Way
Boston, MA 02163
Phone: 617.495.6100
Email: research@hbs.edu
Copyright © President & Fellows of Harvard College.