Decision trees and random forests: from simple to robust models
In machine learning, decision trees are among the most widely used techniques for classification or prediction. Unlike linear or logistic regression, which identify parameters describing a direct relationship between variables (known as parametric methods), a decision tree is a non-parametric approach. Rather than estimating a coefficient linking two variables, it progressively splits the data into homogeneous subgroups.
A decision tree is essentially a sequence of successive choices, each step aiming to make the information clearer and more consistent. For example, if we want to distinguish dogs from cats based on their characteristics, the algorithm might start by splitting animals according to ear shape, then refine with the presence or absence of whiskers, and finally with the shape of the face. Each subdivision makes the groups more homogeneous (cats on one side, dogs on the other) until a reliable prediction is reached. This progressive segmentation is what makes decision trees so appealing: they are intuitive and easy to interpret.
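To make this concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on an invented cat-versus-dog toy dataset; the feature names and data are illustrative assumptions, not taken from the article.

```python
# A minimal sketch: fit a small decision tree on invented cat/dog data.
# Features (all illustrative): ear_pointed, has_whiskers, face_round (1 = yes, 0 = no).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [1, 1, 1],  # cat
    [1, 1, 0],  # cat
    [0, 1, 0],  # dog
    [0, 0, 0],  # dog
    [1, 0, 0],  # dog
    [1, 1, 1],  # cat
]
y = ["cat", "cat", "dog", "dog", "dog", "cat"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The learned splits can be printed as readable if/then rules
print(export_text(tree, feature_names=["ear_pointed", "has_whiskers", "face_round"]))
print(tree.predict([[1, 1, 0]]))  # prediction for a new animal
```

Printing the rules shows exactly the kind of successive choices described above: a first split on one characteristic, then finer splits until each group contains only cats or only dogs.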
While useful, decision trees remain fragile: a small change in the data can completely alter their structure. To make the model more robust, the random forest method was developed. Its principle is simple to understand: instead of building a single tree, many trees are constructed, each trained on a different random subset of the observations (and, at each split, considering only a random subset of the variables).
How does a random forest work?
Each tree makes its own prediction, and the forest then combines all results by taking either the majority vote (for classification) or the average (for regression). This significantly reduces the risk of error from relying on a single tree. Returning to the dog-and-cat example: a single tree might misclassify if ear shape alone is not a sufficient criterion. But by combining dozens of trees built on different criteria and samples, the forest produces a more reliable and stable answer.
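The aggregation mechanism can be sketched in a few lines: draw bootstrap samples, train one tree per sample, and let the trees vote. The dataset and parameters below are illustrative assumptions; in practice scikit-learn's RandomForestClassifier packages this (plus the per-split feature randomness) for you.

```python
# Illustrative sketch of bagging: many trees on bootstrap samples, combined by majority vote.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (purely for demonstration)
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement so each tree sees a different subset
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Each tree votes; the ensemble's prediction is the majority class per example
all_votes = np.array([t.predict(X_test) for t in trees])
ensemble_pred = np.array([Counter(col).most_common(1)[0][0] for col in all_votes.T])

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single tree accuracy:", (single_tree.predict(X_test) == y_test).mean())
print("bagged trees accuracy:", (ensemble_pred == y_test).mean())
```

Comparing the two accuracies on noisy data is a simple way to see the stabilizing effect of aggregation described above.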
It is this aggregation mechanism that explains the success of random forests: they provide higher accuracy, greater resistance to variations in the data, and strong adaptability in real-world contexts where information is rarely perfect.
For businesses, random forests are particularly appealing. They stand out for their ability to handle a large number of variables simultaneously and to combine multiple perspectives to improve reliability. They also demonstrate great robustness with imperfect data: whereas some models require meticulous preparation, random forests tolerate noise, outliers, and inconsistencies comparatively well, and cope with missing values after only light preparation in many implementations. This makes them highly operational and suitable for complex, dynamic, and imperfect environments. They are also relatively quick to implement once data is available, which increases their value for high-stakes decision-making projects.
Their main limitation lies in interpretation. A single decision tree can be visualized with clear and intuitive rules, understandable even for non-specialists. In contrast, a random forest aggregates hundreds of trees, making it much harder to explain precisely how the model arrived at a given result. The predictions are strong and reliable, but the reasoning remains largely opaque. For decision-makers, this lack of transparency can be a drawback when clarity and detailed justification are required.
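The contrast is easy to see in code. In this illustrative sketch (data and parameters are assumptions for demonstration only), a single fitted tree can be printed as a short set of human-readable rules, while a forest is simply a large collection of such trees, with no single rule set to read.

```python
# Illustrative sketch: a single tree yields readable rules; a forest is many trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# One shallow tree: its decision rules fit on a few lines
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(data.feature_names)))

# A forest: hundreds of trees, each with its own rules
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(len(forest.estimators_), "trees - no single rule set to read")
```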