We noticed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default, not just stitch together pieces of existing images. In addition, verbatim reproduction of training images raises legal questions around copyright infringement, ownership, and privacy (if photos of people were present in the training data).
To better understand the problem of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and ranked the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
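The post does not specify the similarity metric, but the ranking step can be sketched roughly as follows, assuming each image has been mapped into some perceptual embedding space (CLIP-style image embeddings would be one plausible choice; the function name and normalization are our assumptions):

```python
import numpy as np

def rank_samples_by_similarity(sample_embs: np.ndarray, train_embs: np.ndarray):
    """Rank generated samples by perceptual similarity to their source image.

    sample_embs[i] and train_embs[i] are assumed to be unit-normalized
    embeddings of the i-th generated sample and the training image its
    prompt came from. Returns prompt indices sorted from most to least
    similar, so the top of the list is what you would inspect by hand.
    """
    sims = np.sum(sample_embs * train_embs, axis=1)  # cosine similarity per pair
    order = np.argsort(-sims)
    return order, sims[order]
```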
When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o'clock, but then we would discover a training sample containing the same clock showing 2 o'clock, then 3 o'clock, and so on. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.
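On a single machine, the verification step might look like the sketch below, using the FAISS library for the nearest neighbor search (the actual system was distributed, and both the library choice and the 0.95 threshold are our assumptions, not details from the post):

```python
import faiss
import numpy as np

def confirm_duplicates(regurg_embs: np.ndarray, train_embs: np.ndarray,
                       threshold: float = 0.95):
    """For each regurgitated image, find its nearest training image.

    Embeddings are assumed to be unit-normalized float32, so the inner
    product equals cosine similarity. Returns (query, neighbor, similarity)
    triples whose similarity clears the (illustrative) threshold.
    """
    index = faiss.IndexFlatIP(train_embs.shape[1])  # exact inner-product index
    index.add(train_embs)
    sims, ids = index.search(regurg_embs, 1)        # top-1 neighbor per query
    return [(i, int(ids[i, 0]), float(sims[i, 0]))
            for i in range(len(regurg_embs)) if sims[i, 0] >= threshold]
```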
The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.(^footnote-2)
However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost. Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of it, while missing only a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer need to check every single pair of images.(^footnote-3)
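A minimal sketch of this cluster-then-deduplicate idea, assuming unit-normalized float32 image embeddings and using k-means from FAISS (K=1024 matches the post; the threshold and k-means settings are illustrative assumptions):

```python
import faiss
import numpy as np

def dedup_within_clusters(embs: np.ndarray, k: int = 1024,
                          threshold: float = 0.95) -> np.ndarray:
    """Cluster embeddings with k-means, then deduplicate within clusters only.

    `embs` is assumed to be unit-normalized float32, so dot products are
    cosine similarities. Returns the indices of images to keep.
    """
    kmeans = faiss.Kmeans(embs.shape[1], k, niter=20, seed=0)
    kmeans.train(embs)
    _, assign = kmeans.index.search(embs, 1)  # nearest centroid per image
    keep = np.ones(len(embs), dtype=bool)
    for c in range(k):
        members = np.where(assign[:, 0] == c)[0]
        # Pairwise similarities inside one cluster only. This is the win:
        # the cost scales with the sum of squared cluster sizes rather
        # than with the square of the full dataset size.
        sims = embs[members] @ embs[members].T
        for i in range(len(members)):
            if not keep[members[i]]:
                continue
            # Drop every later member that near-duplicates a kept image.
            keep[members[i + 1:][sims[i, i + 1:] >= threshold]] = False
    return np.where(keep)[0]
```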
When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs using K=1024 clusters. To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary in one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on five clusterings, which means that we search for duplicates of each image in the union of five different clusters. This found 97% of all duplicate pairs on a subset of our data.
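Building on the sketch above, the multi-clustering trick amounts to taking the union of within-cluster candidate pairs over several clusterings. The post varies the random subset used for clustering; for simplicity, this sketch varies only the random seed, which similarly shifts the cluster boundaries:

```python
import faiss
import numpy as np

def candidate_pairs(embs: np.ndarray, k: int = 1024,
                    n_clusterings: int = 5) -> set:
    """Collect duplicate-candidate pairs from the union of several clusterings.

    Each k-means run gets a different seed, so its cluster boundaries shift;
    a pair split by one clustering is often co-clustered in another.
    """
    pairs = set()
    for seed in range(n_clusterings):
        kmeans = faiss.Kmeans(embs.shape[1], k, niter=20, seed=seed)
        kmeans.train(embs)
        _, assign = kmeans.index.search(embs, 1)
        for c in range(k):
            members = np.where(assign[:, 0] == c)[0]
            # Every within-cluster pair is a candidate; unioning over the
            # clusterings recovers pairs that any single clustering splits.
            pairs.update((int(a), int(b))
                         for i, a in enumerate(members)
                         for b in members[i + 1:])
    return pairs
```

Each extra clustering adds only a linear amount of work, but gives every duplicate pair another chance to land inside a single cluster, which is why five clusterings recover 97% of pairs versus 85% for one.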
Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them involved meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock's appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like these might hurt the model's performance.
To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we had used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.
Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for that image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a single case of image regurgitation.