In federal court, lawyers for Microsoft and OpenAI defended the scraping of news stories to train LLMs and urged the dismissal of news outlets' copyright claims

Courthouse News Service 2025-01-15 Josh Russell

Context & Ripple Effects

The hearing advances the copyright fight launched when the Times sued OpenAI and Microsoft over use of its articles for model training. OpenAI had already framed training as fair use and pointed to an opt-out in its initial response to the Times lawsuit.

The immediate dispute turns on a central distinction: publishers say outputs can reproduce protected expression and obscure copyright-management information, while the companies say model weights are not a retrievable archive. Earlier coverage noted uncertainty over how fair-use precedent would apply, making this dismissal bid consequential for both sides' legal theories.

First-order effects

Microsoft and OpenAI are seeking to end the publishers' copyright claims at the pleading stage, while the Times and Daily News must persuade the court that training and alleged output behavior state actionable harms.
The parties are putting competing technical characterizations of LLMs before the court: statistical models using weights versus systems capable of reproducing protected language or summaries.

Second-order effects

The outcome of the dismissal arguments is likely to influence how other publishers frame claims around verbatim outputs, attribution, and copyright-management information rather than relying on a broad objection to web scraping alone.
AI developers and news publishers have stronger incentives to clarify training permissions, opt-outs, and output safeguards while the legal boundary between model training and content retrieval remains contested.

Third-order effects

If courts repeatedly treat training data as distinct from a retrievable content database, copyright disputes may increasingly hinge on demonstrable output substitution and provenance rather than ingestion alone.
The case is part of a broader effort to define whether news content is a freely usable training input or an asset requiring enforceable controls and compensation.

The trend: Generative-AI copyright litigation is shifting from the fact of scraping toward technical evidence of what models retain, reproduce, and displace.

Discussion

@tyleraking.com Tyler King on bluesky
FWIW, when AI writing app Sudowrite (built on GPT and Claude APIs) described how chatbots work on its website, it specifically stated a bot will reliably regurgitate Harry Potter when prompted with text from the books. It has since updated its website.
@knibbs Kate Knibbs on bluesky
And we're done! “I have a lot to think about and you'll get an opinion in due course” judge says as a goodbye. — I really have no idea how those arguments landed.
@knibbs Kate Knibbs on bluesky
Daily News & NYT lawyer now speaking. Pushing back on the idea that the publishers should've known about the ingestion prior to statute of limitations cut-off
@knibbs Kate Knibbs on bluesky
Different NYT lawyer now. Is saying that LLMs don't actually learn “facts” as Microsoft asserted. “It only absorbs the expression of the fact, it never learns the underlying fact. These are just statistical models”
@knibbs Kate Knibbs on bluesky
Judge: “My confusion is that I read that the models are always learning” — (also, no shade to the court. this shit is hard to understand as a lay person!)
@knibbs Kate Knibbs on bluesky
Listening into today's hearing in NYT v. Microsoft/OpenAI — the judge is currently going over how the training process works with the NYT lawyer. “The information is stored in packets?”
@martyswant Marty Swant on x
Judge: What's the injury? Lieberman: “You're leaving people open for massive copyright infringement without the ability to trace it...It's like it causes the alarm system in your house to go down.”
@jason_kint Jason Kint on x
Attended oral arguments in NYT+ v OpenAI/Microsoft this morning @ SDNY. It was good to finally hear some back and forth - majority of claims imho will clearly survive OpenAI's attempts to dismiss - but all of OpenAI's PR offensive and talking points were on full display. /1
@martyswant Marty Swant on x
Judge just asked OpenAI's lawyer if GPT models have NYT content in its database. Lawyer said not doesn't, but then clarified it doesn't have a database but instead relies on weights. I wonder how NYT's lawyers will respond to that.
@martyswant Marty Swant on x
NYT's lawyer Ian Crosby said LLM outputs are just “the expression of the words and facts, not the underlying facts in the words and the models.” “They're not next-generation search engines, but answer engines,” Crosby said, adding that they're not substitutional.
@martyswant Marty Swant on x
Just joined the NYT v OpenAI oral arguments. Every time someone joins the audio feed, it announces the person w/ each person's voice & name...Awkward and distracting!
@jason_kint Jason Kint on x
A convo on “memorization” also interesting.. discovery here may be enlightening. I say this based on Facebook discovery unsealed last night showing how it balanced “memorization rate” with risk acknowledging the issue it creates. Don't miss this. /9 https://x.com/...
@martyswant Marty Swant on x
Lieberman: When copyright management information (CMI) is retracted, outputs will be “either the verbatim language or a summary without the attribution of the new york times or the daily news.”
@johnlegere John Legere on x
Publications like the New York Times are going after OpenAI about their use of articles to teach their AI. Will this mean more companies will go after not only OpenAI but other companies using AI too? https://www.npr.org/...
r/technology on reddit
‘The New York Times’ takes OpenAI to court. ChatGPT's future could be on the line

Chronicles