Data Practices
Data Tips & Toolkits
Causal Inference
R Packages
Converting Audio to Text
- YouTube: the platform automatically generates text captions for the audio in uploaded videos.
- Watson speech-to-text API: a machine learning API that can transcribe audio files into text, among other capabilities.
- VLC player: a free, open source media player that can be used on Linux in conjunction with other programs to transcribe audio.
- Additional free speech-to-text apps include:
  - Google Gboard
  - Just Press Record
  - Speechnotes
  - Transcribe
  - Windows 10 Speech Recognition
- The Dragon series by Nuance: software providing audio to text services.
Converting PDFs to TXT Files
- Tabula: a tool for PDF table extraction with a nice Python wrapper
- Camelot: a Python package that extracts tables from PDFs
Fuzzy Matching
Python
R
Machine Learning
R
Missing Data
Natural Language Processing (NLP)
Python
General Packages
- NLTK is a robust NLP package in Python.
- For an introduction and hands-on experience using the NLTK package, DataCamp provides a free module as part of their NLP fundamentals course.
- Hugging Face is a hub for open-source pre-trained models and provides user-friendly tools for implementing them. It also offers numerous NLP tutorials.
Topic Modeling Packages
- Gensim is often used for Latent Dirichlet Allocation (LDA) topic modeling. A good introduction/tutorial can be found here.
- To explore topic modeling with transformer-based language models such as BERT (a bidirectional transformer model that captures more context than a bag-of-words method like LDA), try the BERTopic package.
Sentiment Analysis Packages
Sentiment analysis is usually approached in one of two ways: a lexicon approach, which is based on predefined lists of words (sentiment lexicons), or a non-lexicon approach, which involves machine learning:
- VADER, TextBlob, and AFINN are popular Python packages employing the lexicon approach.
- Non-lexicon packages include pre-trained sentiment models, such as bert-base-multilingual-uncased, that build on BERT, RoBERTa, and DistilBERT and are available on Hugging Face.
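The lexicon approach described above can be illustrated with a minimal sketch. The word list, the scores, and the one-word negation rule below are hypothetical and chosen only for illustration; they are not taken from VADER, TextBlob, or AFINN, which use much larger lexicons and more refined rules.

```python
import re

# Hypothetical toy lexicon (word -> polarity score), for illustration only.
TOY_LEXICON = {"good": 2, "great": 3, "happy": 2, "bad": -2, "terrible": -3, "sad": -2}
# Hypothetical negation words; a real package handles negation more carefully.
NEGATIONS = {"not", "no", "never"}


def lexicon_sentiment(text):
    """Sum the polarity of known words, flipping the sign right after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = TOY_LEXICON.get(token, 0)  # unknown words contribute nothing
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


print(lexicon_sentiment("The service was great and the staff were happy"))  # 5
print(lexicon_sentiment("The food was not good"))  # -2
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. Words outside the lexicon are ignored, which is the main limitation that machine learning models try to address.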
Packages Capturing How Text is Communicated
- Spelling and grammar error detection
- Detect language of text
- Reading level. Two metrics are typically used for this: the Flesch-Kincaid Grade Level, which estimates the grade level required to understand the text, and the Gunning Fog Index, which estimates the years of formal education needed to understand the text. In both cases, higher scores indicate more complex text.
- Word counts
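As a rough sketch of how the two reading-level metrics are computed, the example below applies the standard Flesch-Kincaid Grade Level and Gunning Fog formulas. The syllable counter is a simple vowel-group heuristic assumed here for illustration, so its scores will differ somewhat from those of dedicated readability packages.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return (Flesch-Kincaid Grade Level, Gunning Fog Index) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # Gunning Fog treats words with three or more syllables as "complex".
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)

    fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    fog_index = 0.4 * (words_per_sentence + 100 * complex_share)
    return fk_grade, fog_index


fk, fog = readability("The cat sat. The dog ran.")
print(round(fk, 2), round(fog, 2))
```

Very short, simple texts can produce a negative Flesch-Kincaid grade, which is expected behavior for the formula rather than an error.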
Text Similarity
- tf-idf
- Cosine similarity
- Word2Vec
- GloVe
- N-gram overlap, which compares the n-grams (contiguous sequences of words) shared by two texts. You can then compute similarity by applying Jaccard similarity to the n-gram sets or cosine similarity to the n-gram counts.
- spaCy's similarity functionality, which uses pre-trained word vectors (such as GloVe vectors) to calculate text similarity.
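The n-gram overlap idea can be sketched with the standard library alone. The function names and the whitespace tokenization below are illustrative assumptions; a real project would likely use a proper tokenizer.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return the list of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(a, b, n=2):
    """Size of the shared n-gram set divided by the size of the combined set."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the two n-gram count vectors."""
    counts_a, counts_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_a = "the cat sat on the mat"
text_b = "the cat lay on the mat"
print(jaccard_similarity(text_a, text_b))  # 3 shared bigrams out of 7 distinct
print(cosine_similarity(text_a, text_b))
```

Jaccard similarity ignores how often each n-gram appears, while cosine similarity on counts gives more weight to repeated n-grams.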
R
- Information on the NLP packages in R can be found here.
- An introduction to text-mining can be found here.
Power Analysis
R
- WebPower: Collection of tools for conducting both basic and advanced statistical power analysis including correlation, proportion, t-test, one-way ANOVA, two-way ANOVA, linear regression, logistic regression, Poisson regression, mediation analysis, longitudinal data analysis, structural equation modeling and multilevel modeling.
- pwr: Power analysis functions along the lines of Cohen (1988).
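To give a sense of the kind of calculation these power analysis packages perform, here is a minimal Python sketch of the sample size needed per group for a two-sided, two-sample comparison of means. It assumes the normal approximation and a standardized effect size (Cohen's d); the function name is hypothetical. Dedicated packages typically use the t distribution and may report slightly larger values.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of means.

    Uses the normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


# Medium effect (d = 0.5) at the conventional alpha = 0.05 and power = 0.80.
print(sample_size_per_group(0.5))  # 63
```

Smaller effect sizes require much larger samples, because the required n grows with the inverse square of the effect size.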
Visualizations
Word Cloud Resources
- Tableau Public: popular software with free academic licenses.
- Wordcloud Python Package
- Wordcloud R Package
Web Scraping
Please remember that before embarking on a web scraping project, you should check the university guidance on web scraping and the site’s robots.txt file and Terms of Service.
For example, if you were interested in scraping data from Facebook, we’d recommend checking https://www.facebook.com/robots.txt (their robots.txt file) and https://www.facebook.com/legal/terms (their Terms of Service). Examination of the robots.txt file indicates that scraping is disallowed for almost all paths. Their Terms of Service corroborate this by stating, “You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.” In this case, Facebook provides an API (https://developers.facebook.com/docs/apis-and-sdks/) that may offer the data you are interested in.
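A robots.txt check like the one described above can also be done programmatically with Python's standard library. The sketch below parses a small, hypothetical robots.txt (not Facebook's actual file) and a made-up user agent name, so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-research-bot" is a made-up user agent name.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.com/public/page.html"))   # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.com/private/data.html"))  # False
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the real file. Passing this check does not replace reading the Terms of Service, which may prohibit automated collection even where robots.txt is silent.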
- DownThemAll: a Firefox add-on that lets you download all the files or images on a webpage, and more.
Python
- Requests
- lxml (supports XPath expressions; more information)
- Beautiful Soup
- Selenium and Headless Chrome (useful if you need to interact with a browser or the site needs to run JavaScript).
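Once a page has been fetched, the core scraping step is parsing the HTML and pulling out the elements of interest. The sketch below extracts links using only the standard library's `html.parser` as a stand-in; in practice, Beautiful Soup or lxml would usually be the more convenient choice. The sample HTML, URLs, and class name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content, standing in for HTML fetched from a site.
SAMPLE_HTML = """
<html><body>
<a href="/reports/2023.pdf">2023 report</a>
<a href="https://example.org/data.csv">Data</a>
<a name="top">No link here</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


for link in extract_links(SAMPLE_HTML, "https://example.com/index.html"):
    print(link)
```

This approach only sees the HTML as delivered by the server. Pages that build their content with JavaScript need a browser-driven tool such as Selenium.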