DALL·E 2 Pre-Training Mitigations

We observed that an internal predecessor to DALL·E 2 would sometimes reproduce training images verbatim. We did not want this behavior, since we want DALL·E 2 to create original, unique images by default rather than simply “stitching together” pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if the training data contains photos of people).

To get a better handle on the problem of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used the trained model to sample images for 50,000 prompts drawn from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we manually inspected the top matches and found only a few hundred true duplicate pairs out of the 50,000 prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push it to 0 for the reasons stated above.
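The post does not say which perceptual-similarity metric was used for this ranking. As a rough sketch, assuming embeddings for the generated samples and their corresponding training images have already been computed with some perceptual encoder (e.g., a CLIP image encoder), the ranking step might look like the following; the array names, sizes, and cutoff are illustrative only.

```python
import numpy as np

# Hypothetical inputs: one embedding per prompt, precomputed with a
# perceptual encoder. Random data stands in for real image features.
rng = np.random.default_rng(0)
generated_emb = rng.normal(size=(50_000, 512)).astype(np.float32)
training_emb = rng.normal(size=(50_000, 512)).astype(np.float32)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

gen = l2_normalize(generated_emb)
train = l2_normalize(training_emb)

# Cosine similarity between each generated sample and its own training image.
similarity = np.sum(gen * train, axis=1)

# Rank prompts from most to least similar; the top of this list is what
# would then be inspected by hand for true duplicates.
ranking = np.argsort(-similarity)
top_candidates = ranking[:500]
```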

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize because they contain little information. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic that looks like a clock showing the time 1 o’clock, and then we would find a training sample containing the same clock showing 2 o’clock, then 3 o’clock, and so on. Once we realized this, we used a distributed nearest-neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other work has observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above findings suggested that deduplicating the dataset could solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[^footnote-2]

However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since the whole dataset contains hundreds of millions of images, finding all the duplicates naively would require checking hundreds of quadrillions of image pairs. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost. Consider what happens if we cluster the dataset before performing deduplication. Since nearby samples usually fall into the same cluster, most duplicate pairs do not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of it, while missing only a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every pair of images.[^footnote-3]
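As a concrete sketch of this within-cluster approach: cluster the image embeddings, then inside each cluster keep an image only if it is not too similar to one already kept. The use of scikit-learn’s MiniBatchKMeans, the similarity threshold, and the helper name below are illustrative assumptions; the post only describes the general strategy.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def deduplicate_within_clusters(embeddings: np.ndarray,
                                n_clusters: int = 1024,
                                threshold: float = 0.95,
                                seed: int = 0) -> np.ndarray:
    """Return indices of images to keep, checking for duplicates only inside clusters."""
    # Normalize rows so that dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    labels = MiniBatchKMeans(n_clusters=n_clusters,
                             random_state=seed).fit_predict(emb)

    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        kept_local = []  # positions within idx of cluster members kept so far
        for j, i in enumerate(idx):
            if kept_local:
                sims = emb[idx[kept_local]] @ emb[i]
                if sims.max() >= threshold:
                    continue  # near-duplicate of an image we already kept
            kept_local.append(j)
            keep.append(i)
    return np.array(sorted(keep))

# Toy usage with random embeddings standing in for real image features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(5_000, 64)).astype(np.float32)
kept = deduplicate_within_clusters(toy, n_clusters=32)
print(f"kept {len(kept)} of {len(toy)} images")
```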

We tested this approach empirically on a small subset of our data and found that it recovered 85% of all duplicate pairs when using K=1024 clusters. To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the decision boundaries of the resulting clusters are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters. In practice, this found 97% of all duplicate pairs on the subset of our data.
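To illustrate how multiple clusterings raise recall, the sketch below fits a clustering to a different random subset of the data for each of five rounds, assigns every image to a cluster of that clustering, and takes the union of the duplicate pairs found within clusters across rounds. Only the overall strategy (five clusterings, union of candidate pairs) comes from the post; the clustering implementation, subset size, and threshold are assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def within_cluster_duplicate_pairs(emb: np.ndarray, labels: np.ndarray,
                                   threshold: float = 0.95) -> set[tuple[int, int]]:
    """Find (i, j) pairs with cosine similarity >= threshold inside each cluster."""
    pairs = set()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T
        ii, jj = np.where(np.triu(sims, k=1) >= threshold)
        pairs.update((int(idx[a]), int(idx[b])) for a, b in zip(ii, jj))
    return pairs

rng = np.random.default_rng(0)
emb = rng.normal(size=(5_000, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Union of pairs found under five clusterings, each fit on a random half of the data.
all_pairs: set[tuple[int, int]] = set()
for seed in range(5):
    subset = np.random.default_rng(seed).choice(len(emb), size=len(emb) // 2,
                                                replace=False)
    km = MiniBatchKMeans(n_clusters=32, random_state=seed).fit(emb[subset])
    labels = km.predict(emb)  # assign every image to a cluster of this clustering
    all_pairs |= within_cluster_duplicate_pairs(emb, labels)
print(f"candidate duplicate pairs found: {len(all_pairs)}")
```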

Surprisingly, almost a quarter of the dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset may include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we trained a model on the deduplicated data, we reran the regurgitation search we had previously done over 50,000 prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for that image from the training dataset. To take this test a step further, we also performed a nearest-neighbor search over the entire training dataset for each of the 50,000 generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
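The post does not name the tooling used for this final nearest-neighbor check. One way to sketch it is with a FAISS inner-product index over normalized embeddings of the full training set, querying the embedding of each generated image and flagging anything above a similarity threshold for manual review. The index type, threshold, sizes, and array names below are assumptions.

```python
import numpy as np
import faiss  # assumed choice of nearest-neighbor library; the post does not name one

rng = np.random.default_rng(0)
d = 512
train_emb = rng.normal(size=(100_000, d)).astype(np.float32)  # stand-in for the full training set
gen_emb = rng.normal(size=(50_000, d)).astype(np.float32)     # stand-in for generated samples

# Normalize in place so that inner product equals cosine similarity.
faiss.normalize_L2(train_emb)
faiss.normalize_L2(gen_emb)

index = faiss.IndexFlatIP(d)  # exact search; an IVF index could be substituted at larger scale
index.add(train_emb)

# For each generated image, retrieve its single closest training image.
similarities, neighbors = index.search(gen_emb, 1)

# Flag suspiciously close matches for manual inspection.
threshold = 0.95
flagged = np.where(similarities[:, 0] >= threshold)[0]
print(f"{len(flagged)} generated images flagged for review")
```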