When Chats Become Evidence: Court Affirms Order Requiring OpenAI to Produce 20 Million De-Identified ChatGPT Logs

By Roma Patel on January 15, 2026

On January 5, 2026, the U.S. District Court for the Southern District of New York upheld two discovery orders requiring OpenAI to produce a sample of 20 million de-identified ChatGPT user logs as part of wide-ranging copyright litigation brought by news organizations and class plaintiffs. The decision offers important insight into how federal courts are currently approaching the intersection of discovery, user privacy, and the relevance of data from large language models.

Factual and Procedural Background

The plaintiffs sought discovery of logs reflecting users’ conversations with ChatGPT, including both prompts and model outputs. OpenAI, which retains tens of billions of such logs in the ordinary course of business, initially resisted a July 2025 motion by the plaintiffs to compel the production of a 120-million-log sample. OpenAI instead proposed a smaller sample of 20 million de-identified conversations and indicated it would remove personally identifiable and other private information from the sample using a custom de-identification tool. Plaintiffs agreed to this smaller log sample, but preserved their request for a larger production if warranted.

In October 2025, OpenAI changed its position, offering to run search terms across the 20 million-log sample and produce only those conversations that implicated the plaintiffs’ works. OpenAI argued that this approach would better protect the privacy of ChatGPT users. The following day, plaintiffs responded with a renewed motion to compel production of the entire de-identified 20 million-log sample rather than just a filtered subset.

On November 7, 2025, Magistrate Judge Wang granted the plaintiffs’ motion and ordered production of the full 20 million-log de-identified sample. OpenAI’s motion for reconsideration was denied. The court concluded that the full sample, comprising logs both relevant and seemingly irrelevant to the plaintiffs’ claims, was necessary for a complete analysis, noting that even logs not directly implicating plaintiffs’ content could be relevant to OpenAI’s asserted fair use defenses. Judge Wang also weighed privacy considerations but determined that these concerns were sufficiently addressed by three main safeguards: (i) reducing the volume from tens of billions of logs to 20 million; (ii) de-identification; and (iii) a standing protective order governing the use of discovery in the case.

District Court’s Analysis

OpenAI objected to Magistrate Judge Wang’s orders, arguing that they inadequately balanced privacy interests against the requested discovery and that the court should have adopted its proposed, less burdensome production method. District Court Judge Stein, reviewing the objections, affirmed both discovery orders.

Key findings from the January 5, 2026, order include:

  • Balancing relevance and privacy: The District Court found that Magistrate Judge Wang adequately balanced user privacy with discovery needs. In particular, the reduction in sample size, the use of de-identification, and a protective order were sufficient to address privacy concerns in the context of this litigation;
  • No requirement for least burdensome discovery: The District Court rejected OpenAI’s argument that Magistrate Judge Wang was obligated to order the “least burdensome” means of production, such as filtered search-term results. Notably, the District Court emphasized that no applicable authority required such a standard in these circumstances;
  • Distinguishing Rajaratnam: OpenAI relied primarily on Securities and Exchange Comm’n v. Rajaratnam, 622 F.3d 159 (2d Cir. 2010), to argue that stronger privacy protections were necessary. The District Court distinguished Rajaratnam, noting that it involved surreptitiously recorded, potentially illegal wiretaps and far greater privacy interests. By contrast, ChatGPT users voluntarily provided their data to OpenAI as part of ordinary platform usage, and there is no question regarding the legality of OpenAI’s retention of logs here; and
  • Relevance beyond direct infringement: Echoing Magistrate Judge Wang’s findings, the District Court observed that even logs that do not reproduce plaintiffs’ works may help OpenAI assert defenses such as fair use and are thus “relevant for this case” under the governing discovery standard.

Implications for Organizations: Legal and Governance Considerations

The court’s order carries several practical takeaways for organizations:

  • Discovery of de-identified user data: Even where data is produced in de-identified form and subject to a protective order, courts may still require production at scale if the dataset is relevant and proportional. Privacy risk management for AI interactions should assume that de-identification is a risk-reduction step and not a guarantee. Further, protective order terms and access controls can become as important as the underlying redaction method;
  • Adequacy of safeguards: Here, the analysis credited a bundle of controls: reduced scope (tens of billions down to 20 million), OpenAI’s de-identification, and an existing protective order. Organizations should expect the “safeguards” inquiry to be fact-specific and cumulative. For example, if a vendor cannot describe its de-identification workflow or cannot operationalize access restrictions and auditing for large text datasets, a court might be less receptive to privacy objections than if those controls are mature and demonstrable;
  • Relevance is not limited to “copies”: The court accepted that logs “disconnected from” plaintiffs’ works could still be relevant, including to OpenAI’s fair use defense. This holding implies that, once a party’s defenses turn on how a system behaves across many interactions, the discoverability of “non-infringing” examples can increase. Organizations should anticipate that litigation positions taken on defenses (not only claims) can influence how much data becomes discoverable and what sampling methodologies a court will accept;
  • No absolute right to “minimally intrusive” discovery: The court held that OpenAI identified no caselaw requiring a court to order the least burdensome discovery possible or to specifically explain why it rejected a party’s alternative discovery proposal, and it upheld Magistrate Judge Wang’s decision to require production of the full de-identified sample. For organizations, a “we will run the search and give you results” approach can be characterized not only as burden-reducing, but also as control-shifting. When one party controls the tooling, index, and match logic, the other party may argue it cannot validate completeness or test alternative hypotheses. This order suggests courts may favor approaches that preserve the requesting party’s ability to analyze the dataset when proportionality and privacy safeguards are otherwise satisfied;
  • Vendor and platform data practices (contracting should account for discovery posture, not just steady-state privacy): The order reflects that OpenAI retained “tens of billions” of chat logs in the ordinary course of business, and that those logs became a central discovery target. Organizations should focus on vendor diligence. Contracts should address discovery realities, including what is retained, for how long, under what governance controls, and what mechanisms exist to support de-identification and controlled production if litigation compels it. A contract that is silent on these points can leave customers with limited leverage once a vendor’s logs are in scope;
  • Highlighting litigation risks in AI use: Magistrate Judge Wang treated users’ “sincere” privacy interests as one factor in proportionality but still found production appropriate given the safeguards. For organizations, this is a reminder that user expectations about privacy are relevant, but they do not necessarily prevent disclosure in civil discovery. That is especially important for employees using consumer-facing tools for work-related tasks, where the organization may not control retention settings and may not even know what was submitted; and
  • Importance of comprehensive AI governance: Because this dispute is about “conversations” defined as prompts and outputs, organizations should treat conversational AI data as a discoverable record category and build governance accordingly. This could mean mapping what AI interaction data exists, clarifying approved and prohibited data types, implementing technical controls to reduce sensitive inputs, litigation holds, and vendor coordination.
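To make the last point concrete, one of the "technical controls to reduce sensitive inputs" mentioned above can be as simple as a pre-submission redaction step that screens prompts before they reach an external AI service. The sketch below is purely illustrative and is not drawn from the order or from OpenAI's actual de-identification tooling; the patterns and function names are assumptions for demonstration, and production-grade de-identification is far more sophisticated (entity recognition, context-aware redaction, logging, and auditing):

```python
import re

# Illustrative identifier patterns only. A real de-identification workflow
# would use broader pattern libraries and named-entity recognition.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace each matched identifier with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt

print(redact("Contact jane.doe@example.com or 555-867-5309 re: SSN 123-45-6789."))
# → Contact [EMAIL REDACTED] or [PHONE REDACTED] re: SSN [SSN REDACTED].
```

A control like this reduces, but does not eliminate, the volume of sensitive material that ends up in vendor-retained logs, which is precisely why the order's lesson is that de-identification should be treated as one safeguard among several rather than a guarantee.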

The January 5, 2026, order is a reminder that, in AI litigation, interaction logs can play a key role. Courts may be willing to compel production of very large datasets when relevance is framed broadly enough to include defenses like fair use, and when sampling, de-identification, and protective orders are presented as workable privacy safeguards. For organizations, the most durable lesson is not simply “be careful what you type into AI,” but that AI governance now includes discovery posture: what is retained, what can be produced, who controls the tooling, and what protections actually function at scale.

Roma Patel

Roma Patel focuses her practice on a broad range of data privacy and cybersecurity matters. She handles comprehensive responses to cybersecurity incidents, including business email compromises, network intrusions, inadvertent disclosures and ransomware attacks. In response to privacy and cybersecurity incidents, Roma guides clients through initial response, forensic investigation, and regulatory obligations in a manner that balances legal risks and business or organizational needs. Read her full rc.com bio here.

  • Posted in:
    Intellectual Property
  • Blog:
    Data Privacy + Cybersecurity Insider
  • Organization:
    Robinson & Cole LLP

Copyright © 2026, LexBlog. All Rights Reserved.