The researchers’ analysis also suggests that Labeled Faces in the Wild (LFW), a data set introduced in 2007 and the first to use face images scraped from the internet, has morphed multiple times through nearly 15 years of use. Whereas it began as a resource for evaluating research-only facial recognition models, it’s now used almost exclusively to evaluate systems meant for use in the real world. This is despite a warning label on the data set’s website that cautions against such use.

More recently, the data set was repurposed in a derivative called SMFRD, which added face masks to each of the images to advance facial recognition during the pandemic. The authors note that this could raise new ethical challenges. Privacy advocates have criticized such applications for fueling surveillance, for example—and especially for enabling government identification of masked protestors.

“This is a really important paper, because people’s eyes have not generally been open to the complexities, and potential harms and risks, of data sets,” says Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.

For a long time, the culture within the AI community has been to assume that data exists to be used, she adds. This paper shows how that can lead to problems down the line. “It’s really important to think through the various values that a data set encodes, as well as the values that having a data set available encodes,” she says.

A fix

The study authors provide several recommendations for the AI community moving forward. First, creators should communicate more clearly about the intended use of their data sets, both through licenses and through detailed documentation. They should also place harder limits on access to their data, perhaps by requiring researchers to sign terms of agreement or asking them to fill out an application, especially if they intend to construct a derivative data set.

Second, research conferences should establish norms about how data should be collected, labeled, and used, and they should create incentives for responsible data set creation. NeurIPS, the largest AI research conference, already includes a checklist of best practices and ethical guidelines.

Mitchell suggests taking it even further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under a rigorous standard of ethics, she’s been experimenting with the idea of creating data set stewardship organizations—teams of people that not only handle the curation, maintenance, and use of the data but also work with lawyers, activists, and the general public to make sure it complies with legal standards, is collected only with consent, and can be removed if someone chooses to withdraw personal information. Such stewardship organizations wouldn’t be necessary for all data sets—but certainly for scraped data that could contain biometric or personally identifiable information or intellectual property.

“Data set collection and monitoring isn’t a one-off task for one or two people,” she says. “If you’re doing this responsibly, it breaks down into a ton of different tasks that require deep thinking, deep expertise, and a variety of different people.”

In recent years, the field has increasingly moved toward the belief that more carefully curated data sets will be key to overcoming many of the industry’s technical and ethical challenges. But it’s now clear that constructing more responsible data sets isn’t nearly enough. Those working in AI must also make a long-term commitment to maintaining them and using them ethically.

