Data Tips & Toolkits

Causal Inference

R Packages

causalTree, CausalImpact, Counterfactual, BayesTree, rrd, FindIt, causaldrf, uplift, Synth, MatchIt, pcalg, wfe, Matching, medflex, BCEE, dagitty, causaleffect, mediation, pampe, and beanz.

Converting Audio to Text

YouTube: the popular platform provides automatic text captioning of the audio uploaded onto YouTube.
Watson speech-to-text API: a machine learning API that can transcribe audio files into text, among other capabilities.
VLC player: a free, open source media player that can be used on Linux in conjunction with other programs to transcribe audio.
Additional free speech to text apps include:
- Google Gboard
- Just Press Record
- Speechnotes
- Transcribe
- Windows 10 Speech recognition
The Dragon series by Nuance: software providing audio to text services.

Converting PDFs to TXT Files

Tabula: a tool for extracting tables from PDFs; the tabula-py package provides a Python wrapper
Camelot: a Python package that extracts tables from PDFs

Fuzzy Matching

Python

Python’s dedupe package supports fuzzy matching using machine learning: https://github.com/dedupeio/dedupe.
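
To illustrate the basic idea of fuzzy matching, here is a minimal sketch using only Python's standard library. It is not how dedupe works (dedupe learns matching rules from labeled examples); it simply scores string similarity with difflib and keeps the best candidate above a cutoff. The function names, the 0.85 threshold, and the sample names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.85):
    """Return the most similar candidate, or None if none clears the threshold.

    The threshold is an arbitrary illustrative value; tune it on your own data.
    """
    if not candidates:
        return None
    score, candidate = max((similarity(query, c), c) for c in candidates)
    return candidate if score >= threshold else None


# Hypothetical records, for illustration only.
names = ["John Smith", "Mary Jones", "Jon Smyth"]
print(best_match("Jon Smith", names))
```

A simple ratio like this is reasonable for small, clean lists; for large or messy datasets, a dedicated record linkage package is likely the better choice.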

R

R’s fastLink package supports fuzzy matching by implementing a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information: https://cran.r-project.org/web/packages/fastLink/index.html.

Machine Learning

R

The caret package (short for “Classification and Regression Training”) provides a uniform interface to a large number of machine learning algorithms for supervised learning problems, along with tools for data splitting, pre-processing, and model tuning. If you are familiar with machine learning in Python, you may notice similarities with scikit-learn.

Missing Data

This blog post describes several useful R packages for handling missing data: https://www.r-bloggers.com/2016/11/missing-values-data-science-and-r/

Natural Language Processing (NLP)

Python

General Packages

NLTK is a robust NLP package in Python.
For an introduction and hands-on experience using the NLTK package, DataCamp provides a free module as part of their NLP fundamentals course.
HuggingFace is a hub for open-source pre-trained models and provides a user-friendly way to use them. It also offers numerous informative NLP tutorials.

Topic Modeling Packages

Gensim is often used for Latent Dirichlet Allocation (LDA) topic modeling. A good introduction/tutorial can be found here.
To explore topic modeling using large language models (LLMs) such as BERT (a bi-directional transformer language model that is more contextually aware than something like LDA), try out the BERTopic package.

Sentiment Analysis Packages

Sentiment analysis is usually approached in one of two ways: a lexicon approach based on predefined lists of words (sentiment lexicons), or a non-lexicon approach involving machine learning.

VADER, TextBlob, and AFINN are popular Python packages employing the lexicon approach.
Non-lexicon packages include LLM-based, pre-trained sentiment models, such as bert-base-multilingual-uncased, that build on BERT, RoBERTa, and DistilBERT and are available on HuggingFace.
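
The lexicon approach can be sketched in a few lines of standard-library Python. This is a toy illustration of the idea, not the algorithm used by VADER, TextBlob, or AFINN: the six-word lexicon, its scores, and the one-word negation rule are all made up for the example.

```python
import re

# Hypothetical toy lexicon mapping words to polarity scores. Real sentiment
# lexicons contain thousands of scored entries.
TOY_LEXICON = {
    "good": 2, "great": 3, "happy": 2,
    "bad": -2, "terrible": -3, "sad": -2,
}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text: str) -> int:
    """Sum the polarity of known words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        score = TOY_LEXICON.get(token, 0)
        # Assumed rule: a negation word directly before a term reverses it.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            score = -score
        total += score
    return total


print(lexicon_score("The food was great"))
print(lexicon_score("The service was not good"))
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. The established packages handle intensifiers, punctuation, and much richer vocabularies, so a sketch like this is only useful for understanding the mechanics.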

Packages Capturing How Text is Communicated

Spelling and grammar error detection
Detect language of text
Reading level. Two metrics are typically used for this: the Flesch-Kincaid Grade Level, which estimates the grade level required to understand the text, and the Gunning Fog Index, which estimates the years of formal education needed to understand the text. In both cases, higher scores indicate text that is harder to read.
Word counts
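
The two reading-level metrics above follow simple published formulas, sketched here with the standard library. The formulas are the standard ones; the syllable counter is a rough vowel-group heuristic assumed for illustration, so scores from this sketch will differ somewhat from those of dedicated readability packages.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str):
    """Return (sentences, words, syllables, complex_words) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of three or more syllables as "complex".
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return len(sentences), len(words), syllables, complex_words


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(sentences: int, words: int, complex_words: int) -> float:
    """Gunning Fog Index from raw counts."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


sample = "The committee deliberated extensively. It reached no conclusion."
n_sent, n_words, n_syll, n_complex = text_counts(sample)
print(round(flesch_kincaid_grade(n_sent, n_words, n_syll), 2))
print(round(gunning_fog(n_sent, n_words, n_complex), 2))
```

Keeping the formulas separate from the counting step makes it easy to swap in a better syllable counter later.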

Text Similarity

tf-idf
Cosine similarity
Word2Vec
GloVe
N-gram overlap, which compares the n-grams (contiguous sequences of words) shared by two texts. You can then compute the Jaccard similarity or cosine similarity of the two n-gram sets to obtain a similarity score.
spaCy's similarity functionality, which uses pre-trained word vectors from GloVe to calculate text similarity.
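
As an illustration of the n-gram overlap approach listed above, the following sketch computes Jaccard and cosine similarity on word bigrams using only the standard library. Whitespace tokenization, lowercasing, and the function names are simplifying assumptions; a real pipeline would likely use a proper tokenizer.

```python
from collections import Counter
from math import sqrt


def ngrams(text: str, n: int = 2):
    """Return the list of word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by all distinct n-grams across both texts."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the two n-gram count vectors."""
    counts_a, counts_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_1 = "the cat sat on the mat"
text_2 = "the cat sat on the rug"
print(round(jaccard_similarity(text_1, text_2), 3))
print(round(cosine_similarity(text_1, text_2), 3))
```

Jaccard ignores how often an n-gram repeats, while cosine similarity on counts takes frequency into account, so the two scores can diverge on longer texts.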

R

Information on the NLP packages in R can be found here.
An introduction to text-mining can be found here.

Power Analysis

ClinCalc: web-based platform for conducting power analysis
PowerUp!: Excel tool to calculate and detect main, moderator, and mediator effects for experiments and quasi-experiments.
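
For a sense of what these tools compute, here is a minimal sketch of a sample-size calculation for comparing two group means. It uses the common normal approximation, n per group = 2 * ((z_(1-alpha/2) + z_power) / d)^2, where d is the standardized effect size (Cohen's d). This is an approximation: calculations based on the t distribution typically give a slightly larger n. The function name and defaults are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    Uses the normal approximation rather than the exact t-based calculation.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Conventional "small", "medium", and "large" effect sizes from Cohen (1988).
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The output shows how quickly the required sample grows as the expected effect shrinks, which is the main intuition behind running a power analysis before collecting data.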

R

WebPower: Collection of tools for conducting both basic and advanced statistical power analysis including correlation, proportion, t-test, one-way ANOVA, two-way ANOVA, linear regression, logistic regression, Poisson regression, mediation analysis, longitudinal data analysis, structural equation modeling and multilevel modeling.
pwr: Power analysis functions along the lines of Cohen (1988).

Visualizations

Word Cloud Resources

Tableau Public: popular software with free academic licenses.
Wordcloud Python Package
Wordcloud R Package

Web Scraping

Please remember that before embarking on a web scraping project, you should check the university guidance on web scraping and the site’s robots.txt file and Terms of Service.

For example, if you were interested in scraping data from Facebook, we’d recommend checking their robots.txt file (https://www.facebook.com/robots.txt) and their Terms of Service (https://www.facebook.com/legal/terms). The robots.txt file indicates that automated access is disallowed for almost all paths. The Terms of Service corroborate this, stating: “You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.” In this case, Facebook provides an API (https://developers.facebook.com/docs/apis-and-sdks/) that may offer the data you are interested in.
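
A robots.txt check like the one described above can also be done programmatically with Python's standard library. The sketch below parses a small, made-up robots.txt (not Facebook's actual file) so that it runs without network access; the sample rules, URLs, and helper names are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/public/page.html"))
print(is_allowed(parser, "https://example.com/private/data.html"))
```

To check a live site instead, call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`. Passing this check does not replace reading the Terms of Service, which robots.txt does not capture.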

DownThemAll: A Firefox add-on that allows you to download all the files or images in a webpage and more.

Python

Requests
lxml (supports XPath expressions; more information)
Beautiful Soup
Selenium and Headless Chrome (useful if you need to interact with a browser or the site needs to run JavaScript).

R

rvest. An example using this package can be found on the RCS Statistics Blog.
