OpenAI's Data Deletion Blunder Complicates New York Times Copyright Lawsuit

In a development that has complicated the ongoing copyright lawsuit between OpenAI and major news publishers, OpenAI engineers accidentally erased search data that could have served as evidence in the case.

The incident occurred on November 14 when OpenAI engineers deleted data stored on one of two virtual machines provided to The New York Times and Daily News legal teams. These virtual machines were meant to help the publishers search for their copyrighted content within OpenAI's AI training datasets.

According to a letter filed in the U.S. District Court for the Southern District of New York, the publishers' legal teams and experts had spent over 150 hours since November 1 examining OpenAI's training data. OpenAI recovered most of the deleted information, but the folder structure and file names were permanently lost, so the recovered data cannot be used to determine how the publishers' articles were used in building OpenAI's models.

The publishers' legal team has been forced to restart its investigation from scratch, costing additional time and resources. While the deletion appears unintentional, the incident has highlighted the difficulty of examining OpenAI's vast datasets for potential copyright violations.

The lawsuit centers on allegations that OpenAI used copyrighted content from these publishers without permission to train its AI models, including GPT-4. OpenAI maintains that training its models on publicly available data falls under fair use, though the company has recently secured licensing agreements with several major publishers, including the Associated Press and Axel Springer.

The case reflects a broader debate in the AI industry about the use of copyrighted materials for AI training. While OpenAI has established paid partnerships with some media organizations, reportedly paying up to $16 million annually in some cases, it has neither confirmed nor denied using specific copyrighted works without permission in its training processes.

OpenAI has declined to comment on this latest development in the ongoing legal battle.