Data Tips & Toolkits
Causal Inference
R Packages
causalTree, CausalImpact, Counterfactual, BayesTree, rrd, FindIt, causaldrf, uplift,
Synth, MatchIt, pcalg, wfe, Matching, medflex, BCEE, dagitty, causaleffect, mediation, pampe,
and beanz.
Converting Audio to Text
- YouTube: automatically generates text captions for the audio of uploaded videos.
- Watson speech-to-text API: a machine learning API that can transcribe audio files into text, among other capabilities.
- VLC player: a free, open source media player that can be used on Linux in conjunction with other programs to transcribe audio.
- Additional free speech-to-text apps include:
  - Google Gboard
  - Just Press Record
  - Speechnotes
  - Transcribe
  - Windows 10 Speech recognition
- The Dragon series by Nuance: software providing audio to text services.
Converting PDFs to TXT Files
- Tabula: a tool for extracting tables from PDFs, with a Python wrapper (tabula-py)
- Camelot: a Python package that extracts tables from PDFs
Fuzzy Matching
Python
Python’s dedupe package supports fuzzy matching using machine learning: https://github.com/dedupeio/dedupe.
R
R’s fastLink package supports fuzzy matching by implementing a Fellegi-Sunter probabilistic
record linkage model that allows for missing data and the inclusion of auxiliary information:
https://cran.r-project.org/web/packages/fastLink/index.html.
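To give a feel for the Fellegi-Sunter idea mentioned above, here is a minimal Python sketch of how per-field agreement and disagreement weights can be summed into a match score. It is not fastLink's implementation: the m and u probabilities are hypothetical values chosen for illustration (fastLink estimates them from the data), fields are compared by exact equality rather than string distance, and skipping missing fields is a simplifying assumption.

```python
import math

# Hypothetical per-field probabilities (illustrative, not estimated from data):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "first_name": {"m": 0.95, "u": 0.01},
    "last_name": {"m": 0.97, "u": 0.005},
    "birth_year": {"m": 0.90, "u": 0.02},
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, p in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            # Assumption: a missing value contributes no evidence either way.
            continue
        if a == b:
            total += math.log2(p["m"] / p["u"])  # agreement weight (positive)
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight (negative)
    return total


rec_a = {"first_name": "maria", "last_name": "lopez", "birth_year": 1984}
rec_b = {"first_name": "maria", "last_name": "lopez", "birth_year": None}
rec_c = {"first_name": "john", "last_name": "smith", "birth_year": 1990}

print(match_weight(rec_a, rec_b))  # positive: names agree, birth year is missing
print(match_weight(rec_a, rec_c))  # negative: every compared field disagrees
```

In practice, pairs scoring above a chosen threshold would be treated as links and pairs below a lower threshold as non-links; where to set those thresholds is a judgment call that depends on the data.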
Machine Learning
R
The caret package, short for “Classification and Regression Training”, provides a uniform
interface to a large number of machine learning algorithms for supervised learning
problems. If you are familiar with machine learning in Python, you may notice
similarities with scikit-learn.
Missing Data
A blog post describing a number of useful R packages for handling missing values: https://www.r-bloggers.com/2016/11/missing-values-data-science-and-r/
Natural Language Processing
R
- Information on the NLP packages in R can be found here.
- An introduction to text-mining can be found here.
Power Analysis
- Optimal Design Software: free software to calculate power for group-level interventions.
- PowerUp!: an Excel tool for calculating the power to detect main, moderator, and mediator effects in experiments and quasi-experiments.
R
- WebPower: Collection of tools for conducting both basic and advanced statistical power analysis including correlation, proportion, t-test, one-way ANOVA, two-way ANOVA, linear regression, logistic regression, Poisson regression, mediation analysis, longitudinal data analysis, structural equation modeling and multilevel modeling.
- pwr: power analysis functions along the lines of Cohen (1988).
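As a rough illustration of what these tools compute, the sketch below approximates the power of a two-sided, two-sample comparison of means for a standardized effect size d (Cohen's d), using only the Python standard library. It relies on a normal approximation, which is an assumption made here for simplicity; the function names are hypothetical, and packages such as pwr use t-based calculations that typically call for slightly larger samples.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = abs(d) * math.sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)


# Example: a "medium" effect (d = 0.5) in Cohen's terminology.
print(two_sample_power(0.5, n_per_group=64))
print(n_per_group_for_power(0.5, power=0.80))
```

The sample-size formula ignores the small contribution of the opposite rejection tail, which is a common simplification when the effect is not close to zero.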
Visualizations
Word Cloud Resources
- Tableau Public: popular software with free academic licenses.
- Wordcloud Python Package
- Wordcloud R Package
Web Scraping
Before embarking on a web scraping project, you must check the site’s robots.txt file
and Terms of Service. For example, if you were interested in scraping data from
Facebook, we’d recommend checking https://www.facebook.com/robots.txt (their robots.txt file) and https://www.facebook.com/legal/terms (their Terms of Service). The robots.txt file indicates that almost
all paths are off limits to automated crawlers. Their Terms of Service corroborate this by stating,
“You may not access or collect data from our Products using automated means
(without our prior permission) or attempt to access data you do not have permission
to access.” In this case, Facebook provides an API (https://developers.facebook.com/docs/apis-and-sdks/) that may offer the data you are interested in.
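The robots.txt check described above can also be done programmatically. The following is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content, the example.org URLs, and the user-agent name are hypothetical and are only there so the example runs offline. It does not replace reading the Terms of Service, which robots.txt does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-research-bot"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.org/public/page.html"))
print(is_allowed(parser, "https://example.org/private/data.csv"))

# For a live site, the file can be fetched instead of inlined:
#   live = RobotFileParser()
#   live.set_url("https://example.org/robots.txt")
#   live.read()
```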
- DownThemAll: a Firefox add-on that lets you download all the files or images on a webpage, and more.
Python
- Requests
- lxml (supports XPath expressions)
- Beautiful Soup
- Selenium and Headless Chrome (useful if you need to interact with a browser or the site needs to run JavaScript).
R
- rvest: an example using this package can be found on the RCS Statistics Blog.