Publications
  • 2023
  • Article
  • Proceedings of the Conference on Empirical Methods in Natural Language Processing

MoPe: Model Perturbation-based Privacy Attacks on Language Models

By: Marvin Li, Jason Wang, Jeffrey Wang and Seth Neel
  • Format: Electronic
  • Pages: 14

Abstract

Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present Model Perturbations (MoPe), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in log-likelihood at a given point x, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. Across language models ranging from 70M to 12B parameters, we show that MoPe is more effective than existing loss-based attacks and recently proposed perturbation-based methods. We also examine the role of training point order and model size in attack success, and empirically demonstrate that MoPe accurately approximates the trace of the Hessian in practice. Our results show that the loss of a point alone is insufficient to determine extractability: there are training points we can recover using our method that have average loss. This casts some doubt on prior works that use the loss of a point as evidence of memorization or unlearning.
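
The perturbation statistic described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the function names (`mope_statistic`, `quadratic_loss`, `estimate_hessian_trace`), the noise scale, and the number of perturbations are all assumptions chosen for illustration. It relies on the standard second-order Taylor argument that, for isotropic Gaussian noise of scale sigma, the expected increase in loss is roughly (sigma^2 / 2) times the trace of the Hessian. The toy loss is quadratic with a known diagonal Hessian, so the estimate can be checked against the true trace.

```python
import random


def mope_statistic(loss_fn, theta, sigma=0.01, n_perturbations=20, seed=0):
    """Average increase in loss at one point after adding Gaussian noise
    to the parameters (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    base_loss = loss_fn(theta)
    total = 0.0
    for _ in range(n_perturbations):
        # Perturb every parameter with independent N(0, sigma^2) noise.
        noisy = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        total += loss_fn(noisy) - base_loss
    return total / n_perturbations


def quadratic_loss(theta, curvature=(1.0, 2.0, 3.0)):
    """Toy loss whose Hessian is diag(curvature), so its trace is sum(curvature)."""
    return 0.5 * sum(c * t * t for c, t in zip(curvature, theta))


def estimate_hessian_trace(loss_fn, theta, sigma=0.1, n_perturbations=20000, seed=0):
    """Rescale the perturbation statistic by 2 / sigma^2 to estimate the trace."""
    stat = mope_statistic(loss_fn, theta, sigma, n_perturbations, seed)
    return 2.0 * stat / sigma ** 2


if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0]
    print("estimated trace:", estimate_hessian_trace(quadratic_loss, theta))
    print("true trace:     ", 1.0 + 2.0 + 3.0)
```

In a membership-inference setting, one would presumably compute such a statistic per candidate text using the language model's loss and threshold it; the sketch above only demonstrates the link between the perturbation statistic and the Hessian trace.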

Keywords

Large Language Model; AI and Machine Learning; Cybersecurity

Citation

Li, Marvin, Jason Wang, Jeffrey Wang, and Seth Neel. "MoPe: Model Perturbation-based Privacy Attacks on Language Models." Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023): 13647–13660.

About The Author

Seth Neel

Technology and Operations Management

More from the Authors

    • 2023
    • Faculty Research

    Black-box Training Data Identification in GANs via Detector Networks

    By: Lukman Olagoke, Salil Vadhan and Seth Neel
    • 2023
    • Faculty Research

    In-Context Unlearning: Language Models as Few Shot Unlearners

    By: Martin Pawelczyk, Seth Neel and Himabindu Lakkaraju
    • 2023
    • Faculty Research

    PRIMO: Private Regression in Multiple Outcomes

    By: Seth Neel