
Data’s Journey: The Story Behind AI Datasets

Jul 29

9 min read

My name is Ayşegül Güzel, and my work has been dedicated to looking for systems that can bring joy to humanity. Along that journey, I’ve worn many different hats: innovation and strategy consultant, time banker, community facilitator, storyteller, and data scientist. As a fellow at Humane Intelligence, I am conducting a landscape analysis of the datasets, benchmarks, and hands-on tools currently used in AI evaluations.


While exploring all these different components is important, it is essential to understand the story and the standards behind datasets and benchmarks in the AI field so far. With this aim (and with great inspiration from the book I have just finished reading, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence by Kate Crawford), I am writing this article as a journey through several questions:

  • Why is data important in the first place?

  • How does machine learning work and how do we use data?

  • What data is used in machine learning?

Let's start.

Why is data important in the first place?

  • Data as a world-making process: Data is fundamental to the function and development of artificial intelligence. How data is understood, captured, classified, and labeled is an act of world-making and containment. Every dataset used in machine learning reflects a specific worldview, embedding political, cultural, and social choices into the AI systems we build. The AI industry often portrays data collection as a benign and necessary practice for advancing computational intelligence. However, this overlooks the power dynamics involved, favoring those who benefit from data exploitation while minimizing accountability for its consequences. The logic of extraction, historically applied to natural resources and human labor, also defines the data used in AI.

  • The automatic reliance on training data: The reliance on training data has set new standards in fields like natural language processing, which repurposes vast text archives to train language models. These archives are often treated as neutral collections, assuming all text is interchangeable. However, language is context-dependent, and the origin of the text—whether from Reddit or corporate emails—affects the resulting model. Skews and biases in the data are inherent and significant, impacting the system's output. Moreover, languages with less available data are often neglected, leading to inequalities in AI applications.

  • Origins of training data: The origins of training data are crucial, yet there is a lack of standardized practices to document where data comes from or its biases.

  • Separation of ethical and technical: Separating ethical considerations from technical development is a broader issue in AI. Understanding and addressing the origins and biases of training data is essential for creating fair and accountable AI systems. 

How does machine learning work and how do we use data?

In the realm of machine learning, data is indispensable, and not just data but huge amounts of it. It is useful to consider why machine learning systems currently demand such massive amounts of data. In short, data is what trains machines to see. But what does that mean?

From Rules-Based Approaches to Statistical Methods

In the 1970s, AI researchers primarily used expert systems, which relied on rules-based programming to simulate logical reasoning. However, these systems proved inadequate for real-world applications due to their inability to handle uncertainty and complexity. By the mid-1980s, researchers shifted towards probabilistic and brute force methods, which utilized extensive computing power to explore numerous options and identify optimal outcomes.

A pivotal shift occurred in the field of speech recognition. At IBM Research, a team led by Fred Jelinek and Lalit Bahl, with members like Peter Brown and Robert Mercer (who later became a billionaire associated with funding Cambridge Analytica, Breitbart News, and Donald Trump’s 2016 presidential campaign), transitioned from linguistic methods to statistical approaches. Instead of focusing on grammatical rules, they analyzed how frequently words appeared together.

This required vast amounts of real speech and text data, transforming speech into mere data points for modeling and interpretation, independent of linguistic understanding. This reduction of context to data and meaning to statistical patterns became a standard approach in AI development.
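To make the statistical approach concrete, here is a minimal sketch in Python of the kind of word co-occurrence counting it relies on. The three-sentence corpus is invented for illustration and merely stands in for the huge text archives described above; a bigram model like this estimates the probability of the next word purely from observed frequencies, with no grammatical rules involved.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the large text archives described in the article.
corpus = [
    "the court will recess until tomorrow",
    "the court will hear the motion",
    "the witness will answer the question",
]

# Count how often each word is followed by another (bigram counts).
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev_word, next_word in zip(words, words[1:]):
        bigram_counts[prev_word][next_word] += 1

def next_word_probability(prev_word, next_word):
    """Estimate P(next_word | prev_word) purely from observed frequencies."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return counts[next_word] / total if total else 0.0

# The model only reflects whatever happened to be in the data,
# not any understanding of grammar or meaning.
print(next_word_probability("will", "hear"))   # 0.333... (1 of 3 observed continuations)
print(next_word_probability("court", "will"))  # 1.0
```

Whatever the corpus contains, whether legal depositions, newswire, or Reddit posts, is what the model learns to reproduce; that is exactly why the origin of the text matters so much.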

Key components of machine learning

Data forms the basis for training algorithms that enable computers to learn from examples rather than explicit instructions. For example, creating a computer vision system typically involves collecting massive amounts of images from the internet, classifying them, and using them as a foundation to train the system's perception. 

This massive amount of data is labeled by humans, a process called ground truthing. These collections, known as training datasets, shape the system’s understanding of reality. Two types of algorithms play a role in machine learning: learners and classifiers. The learner algorithm is trained on labeled data and then guides the classifier algorithm in analyzing new inputs to make predictions, such as identifying faces in images or detecting spam emails. The larger the training dataset and the more examples of correctly labeled data it contains, the more accurate the classifier algorithm and its predictions become.

For example, to build a system that can tell apples and oranges apart in pictures, one would need to gather, label, and train a neural network on many images of apples and oranges. The software analyzes these images to learn the difference between the two. If all the training pictures of apples are red, the system might assume all apples are red. This is called inductive inference—making a guess based on the data it has. So, if it sees a green apple, it might not recognize it as an apple.
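To illustrate that inductive bias, here is a minimal sketch in Python, assuming scikit-learn is available. The (R, G, B) color values are made up and stand in for real image features: a nearest-neighbor classifier trained only on red apples and orange oranges ends up mislabeling a green apple, because nothing in its training data resembles one.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up (R, G, B) color features standing in for real images.
# Every apple in the training set happens to be red; every orange is orange.
X_train = [
    [200, 30, 30],   # red apple
    [210, 40, 35],   # red apple
    [220, 25, 20],   # red apple
    [230, 140, 20],  # orange
    [240, 150, 30],  # orange
    [235, 145, 25],  # orange
]
y_train = ["apple", "apple", "apple", "orange", "orange", "orange"]

# The "learner" fits the labeled examples; the fitted model then acts as
# the "classifier" for new, unseen inputs.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# A green apple was never seen during training, so the inductive guess is
# based only on which training colors it most resembles -- here, the oranges.
green_apple = [[90, 180, 60]]
print(model.predict(green_apple))  # ['orange'] -- an apple, misclassified
```

The only remedy is more, and more varied, labeled data, which is precisely why data collection and labeling carry so much weight in the rest of this story.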

In summary, training data is the cornerstone of contemporary machine learning systems, shaping how AI perceives and interacts with the world. However, even the largest datasets cannot fully capture the complexity of the real world, highlighting the inherent limitations of current AI systems.

What Data Is Used in Machine Learning?

It is impossible to summarize all the datasets that have been used across machine learning approaches so far. Here I will give a few examples of important datasets, again drawing on Kate Crawford’s book.

The Importance of Large Datasets

The shift to statistical methods underscored the necessity of large datasets. Bigger datasets improved probability estimates and captured rare occurrences more effectively. Robert Mercer famously encapsulated this ethos by stating, "There’s no data like more data." However, acquiring large amounts of data was challenging. In the early days, researchers struggled to find even a million words in digital form. They scoured various sources, from IBM technical manuals to children’s novels and patents, to compile enough text.

Unexpected data sources: the federal antitrust lawsuit against IBM

One of the most significant data sources for IBM's speech recognition project came from a federal antitrust lawsuit against IBM, filed in 1969. The extensive legal proceedings generated a vast corpus of digitized deposition transcripts, amounting to a hundred million words by the mid-1980s. This unexpected treasure trove of data proved invaluable for developing IBM’s speech recognition systems.

Penn Treebank Project

From 1989 to 1992, the Penn Treebank Project at the University of Pennsylvania focused on annotating a substantial corpus of American English text to aid in training natural language processing systems. This team of linguists and computer scientists collected four and a half million words from sources such as Department of Energy abstracts, Dow Jones newswire articles, and Federal News Service reports on South American "terrorist activity."

The Enron Corpus

Following Enron Corporation's massive bankruptcy, the Federal Energy Regulatory Commission seized the emails of 158 employees for legal discovery. In a notable decision, the Commission released these emails online, asserting that public disclosure outweighed privacy concerns. This resulted in a remarkable dataset of over half a million everyday exchanges, providing rich linguistic material. Despite its widespread use in thousands of academic papers, the Enron corpus is seldom scrutinized, earning the New Yorker’s description as “a canonic research text that no one has actually read.”

NIST Special Database 32 – Multiple Encounter Dataset

This collection includes thousands of mug shots taken over multiple arrests, capturing the same individuals over several years. The photographs, often distressing and showing visible injuries, are shared online for researchers testing facial recognition software. These mug shots, stripped of context and personal stories, serve purely as data points. The dataset presents these images uniformly, without indicating whether the individuals were charged, acquitted, or imprisoned. This repurposing shifts the mug shots' original intent—from identifying individuals within the criminal justice system to serving as a technical baseline for developing and testing facial recognition technologies.

https://www.nist.gov/srd/faceimg

Data in the Internet Era

The internet revolutionized AI research, becoming a vast resource for data. By 2019, an average day saw 350 million photos uploaded to Facebook and 500 million tweets sent. The internet's vastness made it a prime source of training data for AI, giving tech giants a constant stream of new images and text. Users labeled their photos with names and locations for free, providing accurate, labeled data for machine vision and language models. These collections became highly valuable and rarely shared due to privacy concerns and their competitive advantage.

However, academic computer science labs also needed access to such data. Harvesting and labeling this data manually was impractical, so new methods emerged: combining web-scraped images and text with the labor of low-paid crowd workers. One of the most significant datasets in AI, ImageNet, began this way. Conceived in 2006 by Professor Fei-Fei Li, ImageNet aimed to create an enormous dataset for object recognition. By 2009, the team had harvested over 14 million images from the internet, organized into more than 20,000 categories. Ethical concerns about using people's data were not mentioned in their papers, despite many images being personal or compromising.

The team initially planned to hire undergraduate students to label the images, but it proved too costly and time-consuming. The solution came via Amazon Mechanical Turk, allowing the team to access a large, low-cost labor force to sort images. This approach enabled ImageNet to grow rapidly, becoming the world’s largest academic user of Mechanical Turk at the time. ImageNet demonstrated that data collection, often viewed as grunt work, was crucial for AI advancement.


https://www.chiark.greenend.org.uk/~ijackson/2019/ImageNet-Roulette-cambridge-2017.html

Large, labeled datasets led to breakthroughs in computer vision research. A pivotal moment came with ImageNet Roulette, a tool developed by artist Trevor Paglen and scholar Kate Crawford. It allowed users to upload images and see how a neural network labeled them, exposing biases and offensive categorizations. This led to widespread public awareness and criticism, forcing ImageNet to make changes and sparking discussions about the harms of algorithmic systems. In response, around 600,000 images were removed from ImageNet in 2019. Over a decade, ImageNet became a cornerstone for object recognition in machine learning. Its methods of mass data extraction and low-cost labeling became standard practice, but these practices, along with the data they generated, eventually sparked significant ethical concerns and public scrutiny.

Cases of Data Extraction without Consent and Underpaid Labeling

  • University of Colorado Case: At the University of Colorado, Colorado Springs, a professor covertly installed a camera on the campus’s main walkway, capturing over seventeen hundred photos of students and faculty without their knowledge. This collection aimed to develop a private facial recognition system.

  • Duke University Case: Similarly, Duke University undertook a project named DukeMTMC, funded by the U.S. Army Research Office and the National Science Foundation. The project involved recording footage of more than two thousand students moving between classes, which was subsequently published online. This dataset drew criticism when it was discovered that the Chinese government used it to develop surveillance systems targeting ethnic minorities.


https://exposing.ai/duke_mtmc/

  • Stanford University Case: At Stanford University, researchers took control of a webcam in a busy San Francisco café, amassing nearly twelve thousand images of everyday patrons without consent. These images were later used for machine learning research in automated imaging systems.

  • Microsoft’s MS-Celeb Dataset: Microsoft’s MS-Celeb dataset, gathered in 2016, scraped around ten million images of approximately one hundred thousand celebrities from the internet. Many individuals in this dataset, ironically, were known for their critiques of surveillance and facial recognition technologies.

Conclusion

The history of datasets in AI reveals the field’s evolution from rules-based approaches to statistical methods relying on vast amounts of data. While datasets have enabled significant breakthroughs, they also reflect the worldviews and biases of their creators. As AI systems become more pervasive, it is crucial to scrutinize the origins, labeling, and use of training data to ensure responsible development and mitigate potential harms.

The journey of datasets underscores a fundamental tension in AI: the drive for innovation often conflicts with the need for ethical integrity and accountability. The examples of data collection without consent, exploitation of underpaid labor, and the use of personal images highlight the darker side of AI’s rapid progress. These practices raise important questions about privacy, consent, and the equitable distribution of AI’s benefits and burdens. Addressing these issues requires a collective effort from researchers, policymakers, and the public to establish guidelines and regulations that prioritize ethical considerations and human rights.

Ultimately, the path forward for AI and its datasets lies in balancing technological advancement with ethical responsibility. By understanding and acknowledging the historical and present challenges, we can work towards creating AI systems that are not only powerful but also fair, transparent, and inclusive. This involves rethinking the processes of data collection and utilization, emphasizing transparency in AI development, and fostering collaboration across disciplines to create a more just and equitable AI landscape. As we continue to harness the power of AI, let us do so with a commitment to ensuring that the benefits of this technology are shared broadly and ethically.


