
Meta is at the center of a scandal: the company is accused of illegally downloading 81.7 TB of pirated books to train its Llama language model. Court documents indicate that Meta employees bulk-downloaded the data from LibGen, Z-Library, and other shadow libraries.
Meta allegedly even acted as a seeder, distributing the downloaded files to others, and tried to avoid being traced by keeping the downloads off its own infrastructure. The allegations emerged in “Kadrey et al. v. Meta Platforms,” a case filed in 2023 by authors who accuse Meta of using their works without permission.
The lawsuit against Meta was brought by writers Richard Kadrey, Sarah Silverman, and Christopher Golden, who allege that the company used a huge amount of copyrighted content to train its Llama large language model (LLM).
Meta previously claimed that 85 GB of its training data came from open sources, in particular from The Pile dataset, which includes 197,000 books from pirated libraries. However, the new documents point to a far larger volume of pirated books: 81.7 TB is nearly a thousand times 85 GB.
Meta denies the accusations, citing the principle of “fair use.” The company argues that using publicly available datasets to train AI is legal and promotes technological development. Publishers, however, are demanding more information about how this data was used and maintain that Meta violated copyright law.
The scandal over the use of pirated materials could have serious consequences for Meta and for the entire generative AI industry. The outcome of the case may determine whether such practices become the norm or whether companies will be forced to reconsider how they train their models.