According to court documents, the company may have intentionally removed metadata to hide the origin of its content.
According to court documents, Meta used the LibGen database to train its artificial intelligence models. In a conversation between two Meta engineers, one questioned the legality of downloading data from pirate sites to corporate devices, saying “It doesn’t seem right to download torrents to Meta laptops.” The plaintiffs allege that Meta not only used illegally obtained data, but also intentionally removed copyright management information (CMI) to conceal that the models were trained on copyrighted content.
According to the lawsuit, Meta removed “unhashed source metadata” and other data that could indicate the origin of the content. In addition, Meta’s programmers allegedly created “controlled samples” of the data to ensure that the Llama model did not output information that would reveal the use of copyrighted material. The plaintiffs argue that this shows an attempt to hide the company’s conduct. In response, Meta filed legal documents claiming that its use of LibGen was not covert, but open.
Library Genesis (LibGen) is one of the world’s largest illegal archives of digital books and scientific materials. The site allows users to download, for free, content that infringes publishers’ copyrights. In recent years, the shadow library has become a frequent source of training data because it contains vast amounts of information well suited to training artificial intelligence models. This is not the first time that Meta has come under fire over its handling of data: in 2023, the company was fined a record €1.2 billion for GDPR violations.
Allegations of illegal use of data could have a significant impact on Meta’s reputation and its future AI projects. If the case develops further, it could lead to new regulatory restrictions on data collection for training AI models.