Meta used copyrighted books for AI training despite warnings from its lawyers
Meta Platforms, formerly known as Facebook, allegedly used thousands of pirated books to train its AI models despite legal warnings from its own lawyers, according to a recent filing in a copyright infringement lawsuit. Comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon, and other notable authors brought the lawsuits, claiming that Meta used their works without permission to train its AI language model, Llama.
The new filing, consolidating two separate lawsuits, includes chat logs revealing a Meta-affiliated researcher discussing the procurement of the dataset in a Discord server. The chat logs suggest that Meta was aware of potential legal issues regarding the use of the books for AI training. In the logs, researcher Tim Dettmers discussed his interactions with Meta's legal department, mentioning that the data could not be used for legal reasons, possibly due to concerns about books with active copyrights.
While Dettmers did not elaborate on the legal concerns, others in the chat suggested that training on data containing books with active copyrights could be problematic. They debated whether the use of such data should be considered "fair use," a legal doctrine in the U.S. that protects certain unlicensed uses of copyrighted works.
The filing indicates that Meta released the first version of its Llama language model in February 2023, acknowledging the use of "the Books3 section of ThePile" for training. However, the company did not disclose the training data for the latest version, Llama 2, which was made available for commercial use that summer.
This legal dispute adds to the challenges faced by tech companies in 2023, as content creators pursue lawsuits accusing them of using copyrighted works without permission to develop generative AI models. Successful outcomes in such cases could impact the generative AI field by potentially increasing the cost of building data-intensive models and compelling AI companies to compensate content creators for the use of their works. Additionally, new provisional rules in Europe regulating artificial intelligence may require companies to disclose the data used to train their models, exposing them to additional legal risks.