When embarking on any machine learning project, there’s a million-dollar question you need to ask: How is data prepared for machine learning? The entire process begins with careful planning and problem formulation, similar to making any major business decision. You’re essentially defining what problem you want the machine learning model to solve.
From there, the real work begins—building a training dataset. But here comes the first major hurdle: How much data is enough? Is a handful of samples sufficient, or should you aim for thousands or even millions of examples? The answer is tricky because there is no universal rule for determining the ideal dataset size.
Multiple factors, such as the complexity of the problem and the machine learning algorithm used, play a role in deciding this.
While there is no one-size-fits-all solution for the amount of data needed, the general rule is to gather as much data as possible.
But “as much as possible” can sound pretty vague, so let’s dive into some real-world examples to get a better sense of the ideal dataset size.
Consider Gmail’s smart reply feature, which helps users quickly respond to emails with automated suggestions.
Google’s team used 238 million sample messages to train the model for this task. For Google Translate, the numbers were even larger, with trillions of data points.
However, not all machine learning projects require data on such a massive scale. Take, for example, a model created by AI researcher Chang at Tamkang University.
He successfully trained a neural network with only 630 data samples to predict the compressive strength of high-performance concrete.
The key takeaway here? The complexity of your project largely dictates how much data is required.
Now that we’ve established the importance of collecting a lot of data, let’s talk about quality. Data quality is as important—if not more so—than data quantity.
There’s an old saying in data science: Garbage in, garbage out. In other words, if the data you feed into your model is inaccurate or irrelevant, your model’s predictions will be equally flawed.
Consider Amazon’s failed AI hiring tool. The tool was trained on a decade of past résumés, most of which came from men, so it learned to favor male candidates and produced biased hiring recommendations.
No matter how sophisticated the algorithm or how well the model was built, poor-quality data doomed the project from the start.
Ensuring that your data is not only high-quality but also relevant to your specific task is critical. Imagine you’re building a model to forecast turkey sales in the U.S. during Thanksgiving, but your historical data is from Canada.
While both countries celebrate Thanksgiving, the cultural differences, including the timing of the holiday and its significance, make the data inadequate for predicting U.S. sales.
Once you’ve ensured your data is both high-quality and relevant, the next step is transforming it into a format your machine learning model can understand.
In supervised learning, this usually involves labeling the data: assigning the “correct answer” to each data point so the model has something to learn from.
For instance, if you’re teaching a model to recognize apples, you would label images of apples accordingly so that the model learns to distinguish them from other fruits.
For machine learning models to work effectively, each data point needs to be described by features—measurable attributes that help the model make predictions.
In the case of apples, features could include their shape, color, and texture. When enough examples of these features are provided, the model can start making accurate predictions on new data.
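To make this concrete, here is a minimal sketch of what labeled, feature-based training data might look like in practice. The fruit measurements, column names, and the scikit-learn classifier are illustrative assumptions, not part of any real apple-recognition system:

```python
# A minimal sketch of labeled training data: each row is one fruit,
# described by measurable features, with the label as the "correct answer".
# All values below are made up purely for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "roundness":  [0.92, 0.88, 0.95, 0.40, 0.35],   # shape
    "red_ratio":  [0.80, 0.75, 0.10, 0.85, 0.05],   # color
    "smoothness": [0.90, 0.85, 0.88, 0.30, 0.25],   # texture
    "label":      ["apple", "apple", "apple", "strawberry", "kiwi"],
})

X = data[["roundness", "red_ratio", "smoothness"]]  # features
y = data["label"]                                   # labels

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# A new, unseen fruit described by the same three features.
new_fruit = pd.DataFrame([[0.90, 0.78, 0.87]], columns=X.columns)
print(model.predict(new_fruit))  # likely "apple"
```

With only five rows this is a toy, but the structure is the same at any scale: features in, labels out.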
But data preparation doesn’t end with labeling and feature identification. There are several other important processes, including data reduction, data cleansing, and data wrangling.
Although collecting vast amounts of data is important, not all of it will be useful. Data reduction involves removing irrelevant or redundant features that don’t add much value to your model.
For instance, if all your customers are from the U.S., the “country” feature won’t contribute to the accuracy of your model and can be safely removed.
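As a rough illustration, here is how such a constant, uninformative column could be detected and dropped with pandas. The column names and values are made up for the example:

```python
# A data-reduction sketch: drop columns that carry no signal, such as a
# "country" column where every value is identical.
import pandas as pd

df = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4],
    "country":       ["US", "US", "US", "US"],   # constant -> no predictive value
    "monthly_spend": [120.0, 85.5, 310.2, 42.0],
})

constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df_reduced = df.drop(columns=constant_cols)
print(df_reduced.columns.tolist())  # ['customer_id', 'monthly_spend']
```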
Another important pre-processing step is data cleansing. Often, datasets contain missing or inaccurate data points.
Cleansing your data ensures that your model isn’t fed faulty information. Missing values can be filled in using estimates, while corrupted or irrelevant data can be removed entirely.
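A simple cleansing pass might look something like the sketch below, assuming a pandas DataFrame with hypothetical columns; it removes an obviously corrupted row and fills the remaining gaps with column medians as the estimate:

```python
# A basic cleansing sketch: drop impossible values, then impute missing ones.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "turkeys_sold": [150, np.nan, 420, 900, -5],   # -5 is clearly corrupted
    "revenue":      [1500.0, 3200.0, np.nan, 13000.0, 700.0],
})

# Remove rows with impossible (negative) sales counts.
df = df[df["turkeys_sold"].isna() | (df["turkeys_sold"] >= 0)]

# Fill remaining gaps with a reasonable estimate: the column median.
df = df.fillna(df.median(numeric_only=True))
print(df)
```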
Data wrangling transforms raw data into a format that best describes the problem you’re trying to solve.
This might involve converting file formats (e.g., from Excel to CSV) or normalizing data to ensure that all features are measured on the same scale.
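The format-conversion step can be as small as the following pandas sketch; the file names are placeholders, and reading Excel assumes an engine such as openpyxl is installed:

```python
# A wrangling sketch: convert an Excel file to CSV so downstream tools can
# read it. File names are placeholders, not real data.
import pandas as pd

raw = pd.read_excel("turkey_sales.xlsx")      # requires an Excel engine, e.g. openpyxl
raw.to_csv("turkey_sales.csv", index=False)   # plain CSV, without the pandas index column
```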
For example, in a dataset about turkey sales, the number of turkeys sold may range from 100 to 900, while the dollar amount from sales ranges from 1,500 to 13,000.
If left unnormalized, the model might assign more importance to the sales figures just because they are larger numbers.
Normalization scales the data, ensuring that each feature has equal weight in the model’s predictions. One common normalization technique is min-max normalization, which scales features to a range between 0 and 1.
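Using the turkey-sales figures above, a min-max normalization sketch with scikit-learn’s MinMaxScaler might look like this (the three sample rows are invented to span the stated ranges):

```python
# Min-max normalization: x_scaled = (x - x_min) / (x_max - x_min) per column.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

sales = pd.DataFrame({
    "turkeys_sold": [100, 450, 900],
    "revenue":      [1500.0, 7000.0, 13000.0],
})

scaled = MinMaxScaler().fit_transform(sales)
print(pd.DataFrame(scaled, columns=sales.columns))
```

After scaling, both columns lie between 0 and 1, so neither dominates simply because its raw numbers are larger.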
Sometimes, the features you need to make accurate predictions aren’t immediately available in your raw data.
This is where feature engineering comes in. By creating new features from existing ones, you can make your model more efficient.
For instance, if you’re working with a dataset that includes date-time information, you might split it into two features: one for the date and one for the time.
This is useful when predicting customer demand for hotel rooms, as demand fluctuates based on the day of the week and time of day.
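Here is a small sketch of that idea with pandas: deriving day-of-week and hour-of-day features from a raw booking timestamp. The column names and timestamps are hypothetical:

```python
# A feature-engineering sketch: split a raw timestamp into day-of-week and
# hour features that a demand model can actually use.
import pandas as pd

bookings = pd.DataFrame({
    "booked_at": pd.to_datetime([
        "2024-11-22 09:15", "2024-11-23 18:40", "2024-11-24 07:05",
    ]),
    "rooms": [2, 5, 1],
})

bookings["day_of_week"] = bookings["booked_at"].dt.day_name()
bookings["hour"]        = bookings["booked_at"].dt.hour
print(bookings[["day_of_week", "hour", "rooms"]])
```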
One of the biggest challenges in machine learning is ensuring that the data is unbiased. Models trained on biased data will produce biased results, as seen in Amazon’s hiring tool example.
Although bias isn’t the only reason models fail, it plays a significant role. Striving to eliminate biases in your dataset will improve your model’s accuracy and performance.
In machine learning, data preparation is one of the most crucial steps. It’s said that up to 80% of a data scientist’s time is spent on this phase, and for good reason. A machine learning model is only as good as the data you provide.
No matter how sophisticated the algorithm, if your data is flawed or incomplete, your model will be too.
Investing time and effort in data preparation—ensuring its quality, relevance, and usability—will pay off in the long run, leading to more accurate and reliable predictions.
Whether you’re working with a few hundred data samples or billions, the principles of good data preparation remain the same.