The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models.