Easing the Burden of Building Good Training Datasets

Michael Blum • Apr 14, 2021

One of the hardest parts of building a machine learning (ML) pipeline is making sure you have enough clean, labeled data to train a model. Good data can occasionally be easy to come by, but good labeled data is much rarer, and labeling a dataset manually can become quite expensive. That is why data scientists use Amazon Mechanical Turk (MTurk). MTurk relies on contracted or temporary workers, commonly referred to as “gig workers,” who can help label massive amounts of information.

Gig workers are part of many startups and Big Tech organizations. In 2018, the Bureau of Labor Statistics reported that 55 million people in the U.S. are gig workers. These free-agent workers can complete a variety of tasks.

Even though many companies are putting their focus on emerging technologies, gig workers still help companies meet financial and technical goals. As an online marketplace for temporary workers, MTurk is part of the growing digital transformation.

The McKinsey Global Institute, an economic research firm, found that up to 162 million people in Europe and the United States—or 20 to 30 percent of the working-age population—engage in some form of independent, gig work.

The digital age has created a litany of jobs requiring outsourcing — not only to intelligent machines, but to a variety of humans. Part of strengthening ML capabilities involves occasionally relying on gig workers. There are still many jobs that require a human touch, such as gathering information via conversations or writing product reviews.

Gig workers help data scientists gather data for ML. This data can be used by developers to build algorithms. Depending on the scope of the project, the process can be time-consuming and expensive. MTurk can help large-scale organizations advance their ML techniques.

Training for Machine Learning Models

MTurk is a way for companies to access the power of a large swath of temporary workers who can bring diversity in talents, ages, ethnicities, cultures and locations, while collecting data and labeling it for ML.

It can take years for an organization to develop this type of workforce. With so many workers readily available, there is also inherent flexibility that employers and gig workers can take advantage of. Employers can draw on MTurk’s large workforce on an as-needed basis, which helps relieve stressed budgets, while workers can supplement their income, especially during a recession.

Because a large workforce brings so much diversity, MTurk has been used in political campaigns for canvassing, in social science for research and in psychology to gauge opinions that are representative of the general population. For organizations with sensitive data that can only be shared with a private group, there are other labeling options, such as choosing one of the third-party vendors pre-selected by AWS.

One of MTurk’s biggest perks is its integration with another Amazon service, SageMaker. The two come together in Amazon SageMaker Ground Truth, which builds labeled data that can be used to train ML models. Developers can use it to label their data with human annotators drawn from MTurk’s public workforce, third-party vendors or their own employees.

Several global companies use MTurk and Amazon SageMaker for ML. Airbnb uses MTurk to generate and maintain high-quality data in order to train and test ML models. T-Mobile uses labeled data to create high-performing ML models and to lighten the task loads of its data scientists and software engineers. ZipRecruiter uses it to help its ML model extract relevant data automatically from uploaded resumes. Codazen also uses MTurk to gather data for several of our innovative projects in computer vision and deep learning.

The Amazon Mechanical Turk Blog describes the process:

Amazon SageMaker Ground Truth learns from these annotations in real time and can automatically apply labels to much of the remaining dataset, reducing the need for human review. Amazon SageMaker Ground Truth creates highly accurate training data sets, saves time and complexity, and reduces costs by up to 70 percent when compared to human annotation.
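
The idea in that description, letting a model label the items it is confident about and sending the rest to people, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Ground Truth is implemented: the function name `route_items`, the 0.9 confidence threshold and the sample predictions are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical model output: item id -> (predicted label, confidence in [0, 1]).
Predictions = Dict[str, Tuple[str, float]]


def route_items(
    predictions: Predictions, threshold: float = 0.9
) -> Tuple[Dict[str, str], List[str]]:
    """Split items into auto-labeled ones and ones that still need a human.

    Items whose confidence meets the (assumed) threshold keep the model's
    label; everything else is queued for human annotators.
    """
    auto_labeled: Dict[str, str] = {}
    needs_human: List[str] = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labeled[item_id] = label
        else:
            needs_human.append(item_id)
    return auto_labeled, needs_human


# Made-up sample data for illustration only.
predictions: Predictions = {
    "img-001": ("cat", 0.97),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.92),
    "img-004": ("dog", 0.71),
}

auto_labeled, needs_human = route_items(predictions, threshold=0.9)
print("Auto-labeled:", auto_labeled)
print("Sent to human annotators:", needs_human)
```

In this toy run, half of the items skip human review. Raising the threshold would send more items to annotators in exchange for fewer automatic labeling mistakes.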

Codazen Solutions: Making Data Collection Easy

Human intelligence collaborating with artificial intelligence is the trend of the future. However, companies must know not only how to use Amazon Mechanical Turk or other gig work providers, but also how to use them in a way that meets bottom-line revenue goals while leveraging the right technologies.

MTurk can come in handy because it allows the tedious tasks of data collection to be completed by gig workers without burdening the workflow processes of developers and data scientists.

Codazen can help organizations leverage MTurk as a way to speed up data collection and labeling, while staying on track with project timelines and leveraging AI capabilities. Labeled data is integral to building sophisticated algorithms—especially on a gig economy platform. After our team collects massive amounts of data, we work on training machine learning algorithms.

Datasets for deep learning are an integral part of this process, and our experts collaborate with client teams to provide creative, technical solutions. We work with companies that have existing AI capabilities and those that do not. Codazen has the expertise and experience many companies need, but may lack, to make full use of the MTurk platform.

Codazen developers and data scientists create effective projects, experiments, and surveys using MTurk. We can help organizations minimize the costs and maximize the returns of tedious workflows and complex machine learning models.

To learn more about how to leverage the multiple benefits of MTurk and ML, contact Codazen.
