Publications
  • 2023
  • Other Article
  • Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track

The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

By: Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers and Stuart Shieber
  • Format: Electronic
  • Pages: 38

Abstract

Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Though the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, machine learning offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed with the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike other NLP patent datasets, HUPD contains the inventor-submitted versions of patent applications, not the final versions of granted patents, allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured data alongside the text of patent filings: by providing each application's metadata along with all of its text fields, HUPD enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community: patent acceptance prediction. We additionally show that the structured metadata provided in HUPD allows us to conduct explicit studies of concept shifts for this task. We find that performance on patent acceptance prediction decays when models trained in one context are evaluated on different innovation categories and over time. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and abstractive summarization. Put together, our publicly available dataset aims to advance research extending language and classification models to diverse and dynamic real-world data distributions.
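
The concept-shift evaluation described in the abstract (train in one context, evaluate in another) can be illustrated with a minimal sketch. This is not the authors' code or the dataset's actual schema: the records are made-up toy data, the field names `filing_year`, `category`, and `decision` are assumptions, and a majority-class baseline stands in for a real model. It only shows the shape of the protocol: split by filing year, fit on the earlier period, and report accuracy per group on the later period.

```python
from collections import Counter, defaultdict

# Made-up toy records for illustration only; these are not HUPD data, and the
# field names ("filing_year", "category", "decision") are assumptions.
TOY_RECORDS = [
    {"filing_year": 2010, "category": "G06", "decision": "ACCEPTED"},
    {"filing_year": 2010, "category": "H04", "decision": "ACCEPTED"},
    {"filing_year": 2010, "category": "G06", "decision": "REJECTED"},
    {"filing_year": 2011, "category": "H04", "decision": "ACCEPTED"},
    {"filing_year": 2011, "category": "G06", "decision": "REJECTED"},
    {"filing_year": 2015, "category": "G06", "decision": "ACCEPTED"},
    {"filing_year": 2015, "category": "H04", "decision": "REJECTED"},
    {"filing_year": 2015, "category": "G06", "decision": "REJECTED"},
    {"filing_year": 2016, "category": "H04", "decision": "REJECTED"},
    {"filing_year": 2016, "category": "G06", "decision": "REJECTED"},
]


def split_by_year(records, cutoff_year):
    """Split into (train, test): filed in or before cutoff_year vs. after it."""
    train = [r for r in records if r["filing_year"] <= cutoff_year]
    test = [r for r in records if r["filing_year"] > cutoff_year]
    return train, test


def majority_label(records):
    """Return the most frequent decision label; a stand-in for model training."""
    counts = Counter(r["decision"] for r in records)
    return counts.most_common(1)[0][0]


def accuracy_by_group(records, predict, key):
    """Compute accuracy of predict(record) separately for each value of key."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        hits[group] += int(predict(r) == r["decision"])
    return {group: hits[group] / totals[group] for group in totals}


# Fit the baseline on the earlier period, then evaluate on both periods.
train, test = split_by_year(TOY_RECORDS, cutoff_year=2011)
label = majority_label(train)


def baseline(record):
    # Always predicts the majority label seen in the training period.
    return label


print("in-period accuracy by year: ", accuracy_by_group(train, baseline, "filing_year"))
print("out-of-period accuracy by year:", accuracy_by_group(test, baseline, "filing_year"))
```

Passing `"category"` instead of `"filing_year"` as the grouping key gives the analogous breakdown across innovation categories. With a real classifier, `baseline` would be replaced by the model's prediction function; the grouping logic stays the same.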

Keywords

USPTO; Natural Language Processing; Classification; Summarization; Patent Novelty; Patent Trolls; Patent Enforceability; Patents; Innovation and Invention; Intellectual Property; AI and Machine Learning; Analytics and Data Science

Citation

Suzgun, Mirac, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart Shieber. "The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications." Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track 36 (2023).

About The Author

Scott Duke Kominers

Entrepreneurial Management

More from the Authors

    • June 2025
    • Journal of Finance

    Collusion in Brokered Markets

    By: John William Hatfield, Scott Duke Kominers and Richard Lowery
    • March 14, 2025
    • Harvard Crimson

    Harvard Students Should Ignore Calls to Boycott Israel Trek

    By: Jesse M. Fried, Paul A. Gompers, Scott Kominers and Mark C. Poznansky
    • March 2025
    • Faculty Research

    O2X: Optimizing to the X

    By: Scott Duke Kominers, Thomas Jennings and Maisie Wiltshire-Gordon