Data is Key

Large amounts of quality data are key to the success of any machine learning algorithm. After all, models can only be as good as the data used to train them. The bottleneck in an ML pipeline (see The ML Pipeline) is often access to large quantities of high-quality data. It is not uncommon for the majority of the development time needed to implement an ML solution to be spent collecting, creating, organizing, cleaning, and preprocessing your data (see Normalization & Preprocessing).

Data Quantities

There are so many factors at play in any given ML algorithm that it is often difficult to know the minimum amount of data needed for your model to start performing well in practice. Simple data statistics like standard deviation, the number and quality of features, the number of model parameters, and the task itself all play a significant role in determining how much data is needed. That said, the answer to the question of "how much data do I need?" is almost always "more data."

I've seen rules of thumb suggesting that the minimum number of samples needed to train a model ranges anywhere from 10xN to N^2, where N is the number of features (columns) in your data. The truth is that the amount of data required scales with model capacity and complexity. Like many things in machine learning, the quantity of data required to solve a given problem must be determined empirically.
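
One way to test this empirically is to plot a learning curve: train on progressively larger subsets of your data and watch how held-out performance changes. Below is a minimal sketch using scikit-learn's learning_curve helper; the dataset and classifier are stand-ins, so substitute your own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in dataset; substitute your own design matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)

# Train on 10%, 32.5%, 55%, 77.5%, and 100% of the available training data,
# scoring each size with 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:4d} samples -> mean validation accuracy {score:.3f}")
```

If validation accuracy is still climbing as you add samples, more data will probably help; if it has flattened out, your time may be better spent on features or model choice.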

Data Representation

The way that you represent your data is arguably more important than the amount of raw data itself. Data is represented using features, and the ones you choose (or better yet, learn) influence how effectively your model can learn from the data (see Features & Design Matrices). Just because you have 20 columns in your database to represent a user doesn't mean that all 20 of those columns are necessary to solve a simple classification task. You may find that your model performs best using only 4 features from each data sample.
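
As a rough illustration, the sketch below compares a model trained on all 20 features of a synthetic dataset against one trained on only the 4 features most associated with the label, using scikit-learn's SelectKBest. The dataset, feature counts, and classifier are stand-ins for whatever you are actually working with.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a 20-column table where only a few columns matter.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           random_state=0)

# Cross-validated accuracy using all 20 features.
all_20 = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Keep only the 4 features most strongly associated with the label.
X_top4 = SelectKBest(f_classif, k=4).fit_transform(X, y)
top_4 = cross_val_score(LogisticRegression(max_iter=1000), X_top4, y, cv=5)

print(f"all 20 features: {all_20.mean():.3f}")
print(f"top 4 features:  {top_4.mean():.3f}")
```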

It's also very common not to use the features from your dataset directly, but rather to process them first to compute new features that are more helpful for your model. For instance, if you were training a model to predict the durations of metro-rail commutes given a database of card swipe timestamps, a derived ride-duration feature (swipe_out - swipe_in) may be more beneficial than two separate swipe timestamp features. Knowing how best to represent your data is difficult, and empirical testing is often the best way to determine which features to use or derive from your data when training your models.
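
Here is a small sketch of deriving that kind of feature with pandas; the swipe_in and swipe_out column names are hypothetical.

```python
import pandas as pd

# Hypothetical swipe records; column names are assumptions for illustration.
rides = pd.DataFrame({
    "swipe_in":  pd.to_datetime(["2024-05-01 08:02", "2024-05-01 17:31"]),
    "swipe_out": pd.to_datetime(["2024-05-01 08:24", "2024-05-01 18:05"]),
})

# Derived feature: ride duration in minutes, computed from the raw timestamps.
rides["duration_min"] = (
    (rides["swipe_out"] - rides["swipe_in"]).dt.total_seconds() / 60
)

print(rides)
```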

Training Data vs Test Data

There is a critically important distinction to be made between data used to train a model and data used to test the performance of that model (see Performance Measures). The same data cannot be used to both train your model and evaluate its performance; doing so produces a misleadingly optimistic evaluation and hides overfitting. Training any supervised machine learning model requires that you split your data, using the majority of it for training and holding out the minority for testing and model evaluation. A split of 80% training data and 20% test data is common. If you have a lot of data, you can experiment with 85%+ training data. Model performance will almost always be better on your training data than on your test data, but hopefully only by a few percentage points. Test performance is a measure of how well your model will generalize to the unseen real-world data it will encounter "in the wild" once it is deployed. Data holdout is an important topic and I recommend checking out the data split Wikipedia page for more info.
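
Here is a minimal sketch of an 80/20 split using scikit-learn's train_test_split, with a placeholder dataset standing in for your own:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder dataset; substitute your own features X and labels y.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"training accuracy: {model.score(X_train, y_train):.3f}")
print(f"test accuracy:     {model.score(X_test, y_test):.3f}")
```

Fixing random_state makes the split reproducible, which keeps evaluation results comparable across runs.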

Finally, Kaggle is an amazing data science community that hosts data science competitions (many with cash prizes) and publishes publicly available datasets and code examples. Perusing the site should give you a good overview of the types of data representations that are often used in machine learning.

Next: Machine Learning Models
Previous: General Purpose Algorithms

Return to the main page.

All source code in this document is licensed under the GPL v3 or any later version. All non-source code text is licensed under a CC-BY-SA 4.0 international license. You are free to copy, remix, build upon, and distribute this work in any format for any purpose under those terms. A copy of this website is available on GitHub.