A new kind of large language model, developed by researchers at the Allen Institute for AI (Ai2), makes it possible to control how training data is used even after a model has been built.
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models. Millions of images of passports, credit cards ...