Paolo Giordani, SMARTboost Learning for Tabular Data, Journal of Financial Econometrics, Volume 23, Issue 3, 2025, nbae028, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jjfinec/nbae028
We introduce SMARTboost (boosting of symmetric smooth additive regression trees), an extension of gradient boosting machines with improved accuracy when the underlying function is smooth or the sample small or noisy. In extensive simulations, we find that the combination of smooth symmetric trees and of carefully designed priors gives SMARTboost a large edge (in comparison with XGBoost and BART) on data generated by the most common parametric models in econometrics, and on a variety of other smooth functions. XGBoost outperforms SMARTboost only when the sample is large, and the underlying function is highly discontinuous. SMARTboost’s performance is illustrated in two applications to global equity returns and realized volatility prediction.
“Applications of prediction algorithms have focused, to sensational effects, on discrete target spaces—Amazon recommendations, translation programs, driving directions—where smoothness is irrelevant. The natural desire to use them for scientific investigation may hasten development of smoother, more physically plausible algorithms.” Bradley Efron (2020)
This article introduces SMARTboost (boosting of symmetric smooth additive regression trees), a statistical machine learning tool designed for ease of use and good performance in a wide class of financial tabular data.1 SMARTboost extends well-known gradient boosting machines like XGBoost, LightGBM and CatBoost by employing smooth rather than hard (or “sharp”) assignments so that an observation can potentially be allocated, with varying degrees, to all the final leaves of the trees. While a standard tree splits the variable space in non-overlapping subspaces, a smooth tree is more general in that it allows overlapping subspaces. If the underlying function to be recovered is smooth, a smooth tree can achieve a given approximation with far fewer splits than a standard tree, leading to superior in-sample and out-of-sample accuracy. While the idea of smooth trees is not novel, their large computational costs had prevented their applicability to all but the smallest datasets. The key contributions of SMARTboost are: (i) a series of strategies (starting with symmetric trees) to reduce running time, and (ii) thoughtful priors (penalizations) to improve accuracy and greatly reduce the need for parameter cross-validation (CV). As a result of these contributions, SMARTboost is the first application of smooth trees suitable for serious work in financial econometrics.
Economic and financial relations are often reasonably characterized as smooth (although rarely linear), with low signal-to-noise ratios, and featuring a time series or panel-data structure absent from mainstream machine learning. Variables showing high autocorrelation or possible nonstationarities are also common, and, in panel data, cross-correlations within each time period can be sizable. In many cases, econometrics is, by the standards of ML, effectively dealing with small effective sample sizes even when the nominal size is fairly large; limited predictability (small signal-to-noise ratios) and (in panel data) sizable cross-correlations can be both thought of as reducing the effective sample of independent observations. (A more formal connection is made in Section 4.1.2.) SMARTboost attempts to address some of the specific needs of empirical work in economics and finance. Its use of smooth trees improves accuracy whenever the data-generating process is smooth or at least locally smooth. Fitting smooth symmetric trees can also improve accuracy (compared to standard trees) when the sample size is small or the signal-to-noise low.
The article proceeds as follows. SMARTboost is presented in Section 1. Extensive simulations are conducted in Section 2, leading us to conclude that SMARTboost is very promising for a wide range of data-generating processes familiar to financial econometrics and machine learning, capable of efficiently (compared to other tree boosting methods) capturing both simple linear relations and complex high-dimensional nonlinear relations. In many simulations, SMARTboost matches the in-sample and out-of-sample fit of XGBoost with one-tenth of the data. Time series and panel data require a different process for CV compared to the randomization common in ML; the SMARTboost default for CV and prior calibration is discussed in Section 3. Two empirical applications illustrate the method on real data in Section 4. On monthly global equity indexes, SMARTboost offers strong performance; partial effects and marginal effects plots offer insights into strong interactions which we summarize as “high valuations are fragile (to negative momentum and high volatility).” On realized log volatility data, SMARTboost outperforms OLS even if linearity turns out to be a fairly good approximation.
1 Boosting Smooth Symmetric Trees
SMARTboost fits an ensemble of symmetric smooth trees by boosting. The main novel elements are: (i) the use of smooth symmetric trees as a base learner, (ii) other strategies to speed up computations, particularly the distinction between selecting the split variable in a first phase and the split location and smoothness parameters in a second phase, and (iii) the form of the priors as well as their default settings, which improve performance for small samples and low SNR and reduce the need for CV. In this section, we introduce boosting of standard trees, and then explore each of these novel elements.
1.1 Boosting and Standard Trees
1.2 Smooth Trees
Standard regression trees use hard (or "sharp") splits to divide the feature space into non-overlapping regions, and the fitted function is therefore not continuous. As an example, consider how boosted trees approximate a linear relation with a step function induced by a large number of splits (and hence a large number of parameters), shown in Figures 1 and 2. Continuity of the approximating function would seem desirable in most instances, particularly when the target variable is itself continuous (Efron 2020). It is then perhaps not surprising that the literature on smooth trees reaches back several decades: the idea of a fuzzy tree is developed in Chang and Pavlidis (1977) and Jang (1994), while Olaru and Wehenkel (2003) and Irsoy, Yildiz, and Alpaydin (2012) propose soft decision trees. These single trees are still not competitive in forecasting. The extensions to an ensemble context are recent, with the article most similar to ours being Linero and Yang (2018), who generalize the fully Bayesian tree ensemble BART (Chipman, George, and McCulloch 2010) to allow for probabilistic assignments. Linero and Yang (2018) propose an ensemble of soft trees (SBART or SoftBART), where, for each tree, a single coefficient determines the softness of the allocation in all nodes. (SMARTboost relaxes this assumption.) They conduct full Bayesian inference by MCMC. SoftBART shows excellent performance in many applications, but is limited to fairly small samples. Full MCMC in this setting is slow and difficult to parallelize, and starts to mix poorly at sample sizes as small as a few thousand.3 SMARTboost incorporates elements of Bayesian thinking, but avoids full MCMC with the goal of providing a computationally feasible algorithm for much larger sample sizes. Another noteworthy result in Linero and Yang (2018) is to establish conditions under which the ensemble of smooth trees consistently estimates a smooth function.
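To fix ideas, the following minimal sketch shows how a smooth tree allocates a single observation across its leaves. The logistic gate and the parameter names (`mu` for the split point, `tau` for the smoothness) are illustrative assumptions, not the package's exact parameterization; the point is only that each split contributes a weight between zero and one, so every leaf receives some mass.

```julia
# Illustrative sketch (not the paper's exact parameterization): the membership weights of a
# single observation in the leaves of a smooth tree. Each split sends the observation to the
# right branch with weight s(x; mu, tau) rather than 0 or 1, so every leaf gets some mass.
s(x, mu, tau) = 1 / (1 + exp(-tau * (x - mu)))   # logistic gate; recovers a sharp split as tau grows

function leaf_weights(x::AbstractVector, splits)
    # splits: one (feature_index, mu, tau) per depth, as in the symmetric trees of Section 1.3
    w = [1.0]                                    # start at the root
    for (i, mu, tau) in splits
        g = s(x[i], mu, tau)                     # weight of the right branch at this depth
        w = vcat(w .* (1 - g), w .* g)           # every current leaf splits in two
    end
    return w                                     # nonnegative, sums to one
end

leaf_weights([0.3, -1.2], [(1, 0.0, 2.0), (2, 0.0, 2.0)])   # depth-2 tree, four leaf weights
```

With a sharp split the gate returns exactly zero or one and the weights collapse onto a single leaf, which is the standard-tree special case.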

SMART and XGBoost fits on four different univariate functions with standard normal features, N(0,0.25) noise, and n = 200.

SMART and XGBoost fits on four different univariate functions with standard normal features, N(0,0.25) noise, and n = 1000.

Two sigmoid functions, for four different values of the smoothness parameter. The sqrt sigmoid is faster to compute and is the default in SMARTboost.
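The exact functional form of the sqrt sigmoid is not reproduced here; the sketch below shows one algebraic gate of this type next to the logistic, with the 0.5 scaling mentioned in the footnotes so that the two curves are comparable for the same smoothness value. Avoiding the exponential is the source of the speed advantage.

```julia
# Sketch: a logistic gate versus an algebraic ("sqrt") gate of the kind described above.
# The exact form used by SMARTboost may differ; the 0.5 scaling (see the footnotes) keeps
# the two curves comparable for the same smoothness value tau.
logistic_gate(x, mu, tau) = 1 / (1 + exp(-tau * (x - mu)))

function sqrt_gate(x, mu, tau)
    z = 0.5 * tau * (x - mu)
    return 0.5 * (1 + z / sqrt(1 + z^2))      # no exp call, hence cheaper to evaluate
end

# Compare the two gates on a grid of points for the same (mu, tau)
xs = -3:0.1:3
maximum(abs.(logistic_gate.(xs, 0.0, 2.0) .- sqrt_gate.(xs, 0.0, 2.0)))
```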
Building the leaf-weight matrix now requires time-consuming multiplications. Also computationally demanding is the fact that its cross-product matrix is not diagonal, and therefore the least-squares estimator of the leaf values requires computing the full cross-products. For example, a full tree of depth four has sixteen leaves and eight splits at the deepest level; in a smooth tree, evaluating each of these eight splits requires building the entire leaf-weight matrix and computing its full cross-products, so that the computational burden grows exponentially with the depth of the tree. In contrast, in a standard tree each response is assigned to one leaf only and has no impact on any other split, so the computing cost (using clever algorithms) for eight splits is roughly the same as for one split.
1.3 Smooth Symmetric Trees for Speed
SMARTboost greatly reduces computing times for smooth trees by working with symmetric trees. Standard trees (non-symmetric, sharp splits) are built with a different split, defined by its own feature and split point, fitted at each node of the tree. The limited literature on smooth trees has extended the same structure to soft splits. A less common structure, called Symmetric Trees (or Oblivious Trees or Decision Tables), imposes the same split on all nodes at a given depth. Since the split point can take any value, this definition of symmetry does not imply that left and right branches have the same share of observations. Symmetric trees with sharp splits are not faster to fit than standard trees, but have faster execution5 (Prokhorenkova et al. 2018), which is an advantage in ML applications requiring extreme speed (like page ranking). The small literature comparing the out-of-sample performance of symmetric and standard trees suggests that symmetric trees are competitive in most situations.6 Imposing symmetry is a form of regularization, an approximation to a proper Bayesian prior which would suggest symmetry without imposing it. While the form of parameter-sharing imposed by symmetric trees may seem a harsh constraint, a non-symmetric tree can in fact be equivalently represented as either a deeper symmetric tree or as a sum of symmetric trees of the same size.7
SMARTboost extends symmetry to smooth trees, forcing the same splitting feature, split point, and smoothness on all nodes at a given depth. Unlike in the sharp-split case, this produces large speed gains in fitting the model, for the following reason. In trees with sharp splits, each leaf value is estimated independently, based exclusively on the observations reaching that leaf. The nodes at a given depth can therefore be updated sequentially, with total cost independent of the number of nodes. In smooth trees, in contrast, the entire vector of leaf values must always be estimated jointly, since the cross-product matrix is non-diagonal (see Section 1.5.2 for details). This means that evaluating a single node is as expensive as evaluating all the splits at that depth jointly. Evaluating all the nodes sequentially would therefore incur a cost proportional to the number of nodes, and evaluating all the nodes jointly would require an expensive high-dimensional optimization. Symmetric trees provide a solution: all the nodes at a given depth can be evaluated jointly because they share the same parameters. When moving to a new depth, symmetry thus reduces computing costs to approximately the reciprocal of the number of nodes at that depth. For a tree of depth four (five), this is a time saving of 88% (94%). In summary, combining smoothness and symmetry (an innovation of SMARTboost) drastically reduces computing time and is likely to improve performance in noisy environments.
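A compact way to see where the speed gain comes from is to build the full leaf-weight matrix of a symmetric smooth tree: one shared split rule per depth doubles the number of columns in a single vectorized pass, so all nodes at a depth are handled jointly. The algebraic gate and the names `mu` and `tau` below are illustrative assumptions rather than the package's exact choices.

```julia
# Sketch: leaf-weight matrix G (n x 2^depth) of a symmetric smooth tree, in which every
# node at a given depth shares the same (feature, mu, tau).
function leaf_matrix(X::AbstractMatrix, splits)
    G = ones(size(X, 1), 1)
    for (i, mu, tau) in splits                      # one shared split rule per depth
        z = 0.5 .* tau .* (X[:, i] .- mu)
        g = 0.5 .* (1 .+ z ./ sqrt.(1 .+ z .^ 2))   # weight of the right branch
        G = hcat(G .* (1 .- g), G .* g)             # every existing column splits in two
    end
    return G                                        # rows sum to one; G'G is not diagonal
end
```

The resulting matrix has non-diagonal cross-products, which is why the leaf values must then be estimated jointly (Section 1.5.2).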
1.4 Strategies for Faster and More Robust Optimization
SMARTboost reduces computing costs by building symmetric trees (roughly 8 (16) times faster at depth 4 (5)) and by using the sqrt sigmoid function (9) (roughly ten times faster than the logistic). Section 1.5 describes how thoughtful priors greatly reduce the need for CV, leading to large computational savings. Here we describe our approach to speeding up parameter optimization, working with the concentrated likelihood: the leaf values are concentrated out (the model in (5) is linear-Gaussian conditional on the split parameters) and the optimization is two-dimensional, over the split point and the smoothness.
When evaluating a new split, it is still very expensive to fully optimize with respect to the split point and the smoothness while looping through all the features, even though the loop is parallelized. SMARTboost tackles this problem by selecting the splitting feature using a rough grid of (split point, smoothness) values, and then performing a full optimization of these two parameters only for the selected feature. All the simulations and real examples in this article use a grid with three smoothness values (near linear, moderately nonlinear, near-threshold) and ten deciles of the feature as candidate split points. We have chosen this two-step approach and this grid size because it produces results just as good as full optimization over all features in all the simulated and real examples of this article. (The default grid is 3–4 times faster than full optimization.) These rough grids are sufficiently accurate for the task of selecting the feature on which to split. In highly nonlinear functions, they may occasionally fail to select the best feature at each split, but a modest degree of inaccuracy in this choice should not seriously compromise performance, as various ways of injecting some randomness into feature selection are often considered beneficial in boosting (e.g., Chen and Guestrin 2016 for a discussion of column and row sub-sampling). The full optimization is then sped up by running, in parallel, a line search over one parameter for a finer grid of values of the other in a neighborhood of the starting values. Besides making use of parallelization, this procedure is more robust than attempting joint optimization of the two parameters using derivative-based methods: a grid search over one parameter conducted in parallel over various values of the other is extremely reliable numerically. Tree depth has a large impact on computing time for smooth trees, because the number of elements in the full cross-product grows exponentially with depth; four or five are good defaults.
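The two-stage search can be sketched as follows. The three smoothness values, the local refinement grid, and the `split_loss` callback (which would evaluate the concentrated likelihood of a candidate split) are placeholders rather than the package defaults.

```julia
using Statistics

# Two-stage split search (sketch): a coarse grid of (smoothness, split point) per feature
# selects the splitting feature; a finer local grid then refines the two parameters for that
# feature only. The three smoothness values and the `split_loss` callback are placeholders.
function select_split(X, r, split_loss; taus = (0.5, 5.0, 50.0))
    best = (Inf, 0, 0.0, 0.0)                              # (loss, feature, mu, tau)
    for j in 1:size(X, 2)
        mus = quantile(X[:, j], 0.05:0.1:0.95)             # ten decile-style candidate split points
        for tau in taus, mu in mus
            l = split_loss(X[:, j], mu, tau, r)
            l < best[1] && (best = (l, j, mu, tau))
        end
    end
    _, j, mu0, tau0 = best
    finer_mus = range(mu0 - 0.5, mu0 + 0.5, length = 11)   # refine around the winning values
    for tau in tau0 .* (0.5, 1.0, 2.0), mu in finer_mus
        l = split_loss(X[:, j], mu, tau, r)
        l < best[1] && (best = (l, j, mu, tau))
    end
    return best
end
```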
1.5 SMARTboost Priors on Parameters
The addition of penalization terms to the log-likelihood is one of the reasons behind the success of machine learning. These penalizations are similar (and in some cases identical) to Bayesian priors followed by maximum-a-posteriori (MAP) inference. XGBoost can penalize the leaf output parameters as well as the number of leaves (tree depth); there is no explicit penalization on the split points other than the requirement of a minimum number of observations in each leaf. SMARTboost formulates proper Bayesian priors (followed by MAP) for both the split parameters and the leaf values; the recommended defaults encourage smooth functions and discourage near-discontinuous, jumpy behavior. The use of well-specified Bayesian priors is meant to eliminate, or at least reduce, the benefits of CV on some hyperparameters. Besides reducing computing time (an important consideration with smooth trees), limiting the number of hyperparameters to cross-validate is desirable with small and/or very noisy samples, since the choice of how much to cross-validate also involves a bias-variance trade-off; the cross-validated loss is itself a random variable, and therefore noisy, so that excessive CV can result in overfitting, particularly in small samples (Cawley and Talbot 2010). SMARTboost trees are always grown to maximum depth (default four). Priors are used to control tree complexity. The default priors are centered on smooth functions and are informative but not dogmatic, so that functions of arbitrary complexity can be captured if the sample size is sufficiently large (and/or the signal-to-noise ratio sufficiently high).
1.5.1 Priors that encourage smooth functions
Smooth trees become sharp for high values of the smoothness parameter, so an uninformative prior on it would imply no a-priori opinion on whether smooth or discontinuous functions are more likely. We wish to formulate a prior suggesting (but not forcing) smooth functions. In doing so, we must consider that many economic and financial variables have fat-tailed or otherwise highly non-Gaussian distributions. The default priors on the split point and the smoothness attempt to capture the following assumptions: (i) smooth functions are more likely than highly nonlinear and near-discontinuous functions (in financial data); (ii) highly skewed and leptokurtic features are less likely to generate linear functions compared to near-Gaussian features. We also wish for the priors to be fairly insensitive to choices of data transformation such as winsorizing or otherwise eliminating leverage points, and to produce good performance even if leverage points are not purged and non-Gaussian features are not transformed to Gaussian or uniform, thus retaining one of the main advantages of boosted trees.
The denominator is an outlier-robust estimate of dispersion. It is asymptotically equal to the standard deviation if the feature is Gaussian, but smaller when the feature is leptokurtic. For example, for a standard Student-t distribution with few degrees of freedom the variance can be arbitrarily large, but the denominator would remain close to one. Compared to common standardization (de-mean and divide by the standard deviation), this choice implies a prior that more asymmetric and fat-tailed features are less likely to induce linear functions over the full range of realized values. The prior mean implies nearly linear behavior between minus and plus two standard deviations, and then a tapering of the relation; when the feature has high kurtosis, the use of a robust estimate of the standard deviation meaningfully affects the prior. This is illustrated in Figure 4 for four highly leptokurtic features from a dataset of US large-cap stocks.
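The paper's exact robust estimator is not reproduced here; the following sketch uses an interquartile-range-based scale, which has the property described above of matching the standard deviation for Gaussian data while remaining bounded for fat-tailed data.

```julia
using Statistics

# Sketch of an outlier-robust standardization with the property described above
# (asymptotically equal to the standard deviation for Gaussian data, smaller for
# leptokurtic data). The paper's exact estimator may differ.
robust_scale(x) = (quantile(x, 0.75) - quantile(x, 0.25)) / 1.349
standardize(x)  = (x .- median(x)) ./ robust_scale(x)
```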

The histograms are from a monthly dataset of US stocks (500 largest), period 1998–2020. Each series is winsorized at 0.5% and 99.5%. The red continuous line below each histogram shows the corresponding sqrt logistic at the prior mean for the split point and smoothness when scaling using the SMARTboost default in Equation (13). The blue dashed line shows how the same sqrt logistic would look when simply standardizing (de-mean and divide by the standard deviation). On Gaussian features the red and blue lines would asymptotically overlap.
It is worth emphasizing that the priors on the split point and the smoothness apply to individual layers of individual trees; they are easy to calibrate on a single stump (depth = 1), but the implications for functions built from ensembles of deeper trees are more complex and less transparent. In practice, the prior amount of smoothness on the overall function is not as strong as suggested by plots for individual splits. These priors have worked well across a variety of dgps in simulations (see Section 2) and, for small samples, produce fairly large gains, in particular compared to weaker priors on the smoothness.
1.5.2 Prior on the leaf values and MAP inference
XGBoost disciplines the leaf values by augmenting the loss function at each iteration of the boosting process with a ridge-type penalization on the sum of squared leaf values (default penalty weight of one). A positive penalty discourages leaves with very large values, which often coincide with splitting points leaving few observations in some leaf. In most applications of boosting, the learning parameter is set at a value of 0.1 or smaller so that the approximating function is built slowly, and the number of trees is cross-validated rather than predetermined. In this context, the role of the penalization on the leaf values is not nearly as important as in other models and methods with pre-determined architecture, such as neural networks or regression splines.
While XGBoost does not require such a penalization, a proper prior on the leaf values in SMARTboost is needed for reliability, as it ensures that the cross-product matrix is invertible; some parameter combinations induce singular or near-singular matrices, a problem that increases with tree depth. Symmetric trees are particularly prone to producing singular matrices; for example, if the same variable appears more than once in a symmetric tree with hard splits, then at least one leaf will be empty. This poses no problem with hard splits, since the leaf values are estimated individually, and a leaf that cannot be reached has no impact on the fit. In the context of smooth trees, however, all the leaf values are estimated jointly, and a leaf that cannot be reached creates a column of zeros in the leaf-weight matrix and makes the cross-product matrix singular. An informative prior (or regularization) restores invertibility, and the fact that a leaf is empty then presents no problem.
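Concretely, the prior acts like a ridge term in the leaf-value update. A minimal sketch, with a single scalar `prior_precision` standing in for the full prior covariance, is:

```julia
using LinearAlgebra

# MAP estimate of the leaf values under a Gaussian (ridge-type) prior. The added prior
# precision keeps G'G + lambda*I invertible even when a leaf is unreachable (a zero column
# in G). The scalar `prior_precision` is a stand-in for the paper's full prior covariance.
function leaf_values_map(G::AbstractMatrix, r::AbstractVector; prior_precision = 1.0)
    return (G' * G + prior_precision * I) \ (G' * r)
end
```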
1.6 A First Look at SMARTboost in Action
1.6.1 SMARTboost and XGBoost on four univariate functions
Figures 1 and 2 illustrate the behavior of XGBoost and SMARTboost on four univariate functions. The sample size is 200 for Figure 1 and 1000 for Figure 2. The maximum depth is cross-validated for both methods. The features are standard normal, and N(0,0.25) noise is added. Since the features are standard normal, accurate inference at very high and low values of the feature is challenging because there are few observations for a local fit. SMARTboost manages the task better, producing a smoother fit. XGBoost requires a large number of splits to approximate these smooth functions via step functions.
1.6.2 Extrapolation in forecasting
It is not uncommon, when forecasting economic and financial data, for some important feature to take values outside its empirical support in the training set. Well-known examples include equity valuations in 2000 or volatilities and correlations in 2008. Various types of nonstationarity (like structural changes in mean or variance), omitted variables, or simply high autocorrelation of some features can all contribute. Ensembles of standard trees typically perform very poorly in these situations, since they extrapolate flat functions (the same characteristic that makes them robust to leverage points). Genuine extrapolation is always perilous, and while there can be no general solution to the problem, we believe that SMARTboost fares comparatively better in most situations. When SMARTboost recovers a highly nonlinear function, high values of the smoothness parameter are estimated, which means that it will extrapolate similarly to XGBoost. If, on the other hand, the retrieved function is only moderately nonlinear, SMARTboost will tend to extrapolate in a way that is more intuitive. Figure 5 illustrates this in a few simple cases. The training set is, for each feature, uniform on [−3, 3]. If the dgp is linear, SMARTboost asymptotically extrapolates very well even at ±5, while XGBoost's predictions are unappealing. If the dgp is not linear, neither method extrapolates correctly, but SMARTboost fares relatively better.
SMARTboost and XGBoost extrapolating on four functions. In all cases, the training set consists of two uncorrelated features, U[−3,3], n = 10k, σ = 1. Depth is cross-validated. The fitted functions are then shown on [−5, 5].
1.6.3 Computing time
SMARTboost retains many advantages of more standard GBMs, but not their speed. To give a sense of which combinations of (n, p) are currently feasible: using eight cores (on an AMD EPYC 7542), fitting SMARTboost at default values takes approximately one second (five seconds) per tree for n = 100k and p = 10 (p = 100). The required number of trees varies with the sample size, the SNR, and the learning parameter, typically from a few dozen to a few hundred. Research is ongoing to further reduce the gap with other GBMs, which is, however, likely to remain substantial.
2 Simulations
2.1 Simulation Setup
The purpose of this section is to explore the performance of SMARTboost when data are generated by several well-known processes in financial econometrics and machine learning. We compare four main models: (i) SMARTboost, (ii) XGBoost, (iii) BART (Chipman, George, and McCulloch 2010), and (iv) boosted sharp symmetric trees (SST). XGBoost is considered one of the most accurate GBMs and is widely used in science and industry. BART is a Bayesian alternative which has been found to perform well in low signal-to-noise environments. For boosted sharp symmetric trees, we employ the SMARTboost software but force hard splits. The resulting model is very close to CatBoost, but has the same prior on the leaf values as SMARTboost, allowing us to isolate the impact of smooth splits.
SMARTboost is fitted at default values. Modest but fairly consistent improvements in accuracy would be available (not reported) by lowering the learning parameter to 0.1, which unfortunately doubles computing time,12 and by cross-validating depth. For XGBoost, we set the learning rate to 0.1 and use five-fold CV to select depth in the range 1–8. Other parameters are kept at default values.13 The number of trees is chosen by five-fold CV for both SMARTboost and XGBoost.14 BART benefits less than non-Bayesian alternatives from CV; we use the default parametrization, except for increasing the burn-in from 100 to 300. For each training sample, we fit the models, evaluate the fitted function on 100k test observations, and estimate the empirical mean of the squared error relative to the true function. These means are then averaged over multiple runs to produce an average mean-squared error (figures report its square root). We consider two sets of data-generating processes (dgps): (i) smooth and partially (i.e., locally) smooth functions, and (ii) highly irregular (discontinuous) functions. For each dgp, three sample sizes are considered: 1k, 10k, and 100k, with the number of features varying from 3 to 30.
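For concreteness, the evaluation protocol for one replication can be sketched as follows, using the Friedman function as the dgp. The uniform features, the noise level, and the `fit_model` / `predict_model` placeholders are assumptions for illustration; the exact simulation settings are those described in the text.

```julia
using Random, Statistics

# Sketch of one replication: fit on a training sample, measure RMSE of the fitted function
# against the true f on a large test set; these RMSEs are then averaged over replications.
friedman(x) = 10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 + 10 * x[4] + 5 * x[5]

function replicate_rmse(fit_model, predict_model; n = 1_000, p = 10, ntest = 100_000, sigma = 1.0)
    rng   = MersenneTwister(1)
    X     = rand(rng, n, p)
    y     = [friedman(view(X, i, :)) for i in 1:n] .+ sigma .* randn(rng, n)
    Xtest = rand(rng, ntest, p)
    ftrue = [friedman(view(Xtest, i, :)) for i in 1:ntest]
    model = fit_model(X, y)
    return sqrt(mean((predict_model(model, Xtest) .- ftrue) .^ 2))   # error versus the true f
end
```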
2.2 Smooth and Partially Smooth Functions
The results from (partially) smooth functions are summarized in Figure 6.

Simulation results with data generated by various smooth and partially smooth functions. "Symmetric" boosts symmetric trees with sharp splits. All models are fitted at default values except XGB, which cross-validates depth. The figures report the average root-mean-squared error of the fitted function on test data (see Section 2.1).
2.2.1 Linear function
SMARTboost is the clear winner. This is the only smooth function where XGBoost outperforms BART, possibly because BART does not cross-validate tree depth, which in this case is optimally set at one.
2.2.2 Friedman’s function
2.2.3 Threshold Friedman
SMARTboost handles the sharp threshold well.
2.2.4 Projection pursuit regression
2.2.5 Nonlinear factors
2.2.6 Neural network with ReLu transform
SMARTboost is much more efficient than XGBoost on data generated by this small neural network.
2.3 Highly Irregular Functions
The results from irregular functions are summarized in Figure 7. The general pattern is that SMARTboost is superior to or competitive with XGBoost at smaller sample sizes and is then gradually overtaken as the sample grows. BART is consistently the worst performer on these functions.

Simulation results with data generated by various irregular functions. "Symmetric" boosts symmetric trees with sharp splits. All models are fitted at default values except XGB, which cross-validates depth. The figures report the average root-mean-squared error of the fitted function on test data (see Section 2.1).
2.3.1 Tree with sharp splits
2.3.2 Tree with sharp splits and high noise
We keep the same trees, but draw the errors from a higher-variance distribution to produce a lower SNR. Even though there is no apparent symmetry in the dgp, the regularization induced by symmetric trees and smoothness helps in this high-noise situation, so that a large sample is required for XGBoost to outperform SMARTboost and SST.
2.3.3 AND function
2.3.4 OR function
2.4 Summary of Main Results
Our reading of the simulation results is as follows:
1. SMARTboost dominates BART; it is more accurate for each dgp and sample size, and far more accurate for smooth functions.
2. SMARTboost dominates XGBoost for all (partially) smooth functions. This is mostly due to smooth splits, although symmetric trees also help (see point 5). SMARTboost is typically more precise with a sample of 10k observations than XGBoost with a sample of 100k.
3. For highly irregular functions, SMARTboost is more (less) precise than XGBoost at small (large) sample sizes.
4. BART tends to be better (worse) than XGBoost on smooth (irregular) functions.
5. With sharp splits, symmetric trees are on average more precise than XGBoost on smooth functions, and are competitive for highly irregular functions except at very large sample sizes.
Most results are in line with our expectations. In particular, we would expect SMARTboost to dominate XGBoost on smooth and partially smooth functions, because these can be approximated by smooth trees with far fewer splits. The importance of regularization on the smoothness parameter was perhaps not obvious; an informative (but not dogmatic) prior on smoothness typically produces large gains over disperse priors, at modest cost if the function is irregular. We expected BART to outperform XGBoost in most cases, but found this to be true only for (partially) smooth functions, while XGBoost is the clear winner for irregular functions. We conjecture that the randomization of split points in BART induces some smoothness at modest sample sizes or in noisy datasets.
2.5 Ablation Study
SMARTboost differs from XGBoost along three dimensions: (i) symmetric trees, (ii) smooth splits, and (iii) priors on the split parameters and the leaf values. We perform a small ablation study to decompose SMARTboost's gains on smooth functions across these three elements, choosing the Friedman function as specified in Section 2.2. We define disperse priors by multiplying the prior variances in Equations (17) and (12) by 100². The results are shown in Table 1. A general trend is for priors to be more important in small samples, which is to be expected. Symmetric trees with sharp splits outperform XGB on this function, and are approximately as good as BART. An important result is that while a disperse prior leads to a modest deterioration in performance for trees with sharp splits, the loss is much greater for the more flexible smooth trees.15 An informative prior on the split smoothness is even more important in small samples.
Ablation study. Simulation results for the Friedman function for different priors. The first four columns refer to SMARTboost with different priors, and columns 5–6 to symmetric trees with sharp splits; "d." stands for "disperse." SMARTboost and BART are fitted at default parameters, while XGB cross-validates depth and sets the learning rate to 0.1.

avg RMSE | default | d. | d. | d. d. | sharp | sharp d. | XGB | BART
---|---|---|---|---|---|---|---|---
n = 1k | 1.03 | 1.16 | 1.34 | 1.41 | 1.52 | 1.57 | 1.79 | 1.43
n = 10k | 0.43 | 0.50 | 0.51 | 0.52 | 0.70 | 0.71 | 1.03 | 0.76
n = 100k | 0.18 | 0.21 | 0.21 | 0.21 | 0.35 | 0.36 | 0.57 | 0.37
3 CV and Priors for Time Series and Panel Data
3.1 Purged (hv-Block) CV for Time Series and Panel Data
CV is used for hyperparameter optimization and for evaluating pseudo-out-of-sample performance. Here we discuss CV for the purpose of hyperparameter optimization. SMARTboost uses CV to optimize the number of trees and, optionally, tree depth, while employing, in its default mode, a prior plus maximum-a-posteriori (MAP) inference for other hyperparameters. XGBoost has many hyperparameters that can be cross-validated; in practice, the number of trees and tree depth are always optimized, while most others are often left at default values, except for the learning parameter, which is typically set at 0.1 or smaller.
CV of hyperparameters for financial data requires special care for at least two reasons. First, financial data often come in the form of time series or panel data, where the assumption of conditionally independent observations may not hold, invalidating standard (i.e., fully randomized) CV. Second, when the signal-to-noise ratio is small in relation to the sample size, even small amounts of overfitting can have very negative consequences. For example, in stock return prediction, a model with an out-of-sample R² of 0.5% may imply trading strategies with high profitability (see Gu, Kelly, and Xiu 2020 for an example); if the true R² is in fact 0.1%, the same trading strategy may be worthless after transaction costs. In comparison, when predicting realized volatility, an estimated R² of 50.5% is unlikely to have any meaningful negative consequence if the true R² is in fact 50.1%.
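Throughout this section, the out-of-sample R² is of the standard form below (shown here for reference; this is the assumed form, with forecasts produced by purged CV):

```latex
% Out-of-sample R^2 (standard form, assumed), with \hat{y}_i the purged-CV forecast
% and \bar{y} the in-sample mean of the target:
R^2_{\mathrm{oos}} = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```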
3.1.1 Time series data
Standard CV, which comes as a default in ML packages, assumes that all observations are conditionally independent of each other. This assumption may fail for time series and is very likely to fail for panel data, leading to overfitting. For time series data, de Prado (2018) recommends Purged CV in the presence of overlapping observations or when otherwise suspecting residual serial correlation; in the statistical literature, a very similar procedure is called hv-Block CV (Cerqueira et al. 2017). Purged CV differs from standard CV in two ways: (i) the split between test and train set is not randomized, but each fold is a time block so that each test set is comprised of all observations between a start and an end date (known as Block CV in statistics), and (ii) data in the training set close to the start or to the end date of the test set are deleted, with the goal of making the test set (approximately) conditionally independent of the training set. While a time series structure may not per se require blocking and purging under the assumption of a correctly specified model with no overlap (Bergmeir, Hyndman, and Koo 2018), little efficiency would be gained by Standard CV in this ideal case. Purged CV is therefore the default in SMARTboost for hyperparameter optimization. Various forms of Recursive CV (also known as out-of-sample, or OOS CV) are often employed with time series data. In Recursive CV, the test set is always a time block of data post-dating the observations in the training set. These procedures use a smaller share of the data for training compared to Standard or Purged CV, and are therefore less efficient under ideal circumstances (stationarity and correctly specified model), particularly when the sample size and/or the SNR is low (Bergmeir, Hyndman, and Koo 2018). However, Cerqueira et al. (2017) find that, in many ML data sets, various forms of Recursive CV perform comparatively better than in simulations, which they conjecture could be due to unmodeled nonstationarities and long-memory effects. Among non-recursive CV, they recommend Block and hv-Block (Purged) CV, which preserve the time series structure of the data and therefore minimize information spill-over and overfitting in case of misspecification.
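A minimal sketch of purged (hv-block) fold construction for a single time series follows; the `gap` argument, how many observations to purge on each side of each test block, is left to the user (for example, the forecast-overlap length).

```julia
# Purged (hv-block) CV folds for a time-ordered sample of length n (sketch).
# `gap` observations adjacent to each test block are removed from the training set.
function purged_folds(n::Int, nfolds::Int, gap::Int)
    edges = round.(Int, range(0, n, length = nfolds + 1))
    folds = Vector{Tuple{Vector{Int},Vector{Int}}}()
    for k in 1:nfolds
        test   = collect(edges[k] + 1 : edges[k + 1])      # one contiguous time block
        lo, hi = first(test) - gap, last(test) + gap
        train  = [t for t in 1:n if t < lo || t > hi]      # purge a buffer around the block
        push!(folds, (train, test))
    end
    return folds
end

folds = purged_folds(1_000, 5, 20)    # five folds, purging 20 observations on each side
```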
3.1.2 Panel data
While standard CV may still perform well for time series under ideal circumstances and no overlap, it is likely to overfit with panel data, since observations belonging to the same date typically have cross-correlated residuals. Consider for example a panel of stock returns (or any spatio-temporal data16) where five-fold Standard CV would proceed to predict returns of 20% of the stocks at any given date with the knowledge of returns of the remaining 80% at the same date, most likely leading to overfitting.
A valid option for panel data is OOS CV, at the cost of some loss of efficiency if the model is correctly specified. The default for hyperparameter optimization in SMARTboost is to extend Purged CV to the panel data setting. OOS Hold-out CV is a valid alternative if strong non-stationarities are suspected and/or speed is important, particularly in combination with large sample sizes (where the loss of efficiency is less of a concern).17
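For panel data, the same construction can be applied to calendar dates rather than rows, so that all cross-sectional observations sharing a date land in the same fold and the purge is measured in date units. The sketch below reuses the `purged_folds` helper from Section 3.1.1.

```julia
# Panel extension (sketch): build purged folds on unique dates, then map them back to rows,
# so that observations sharing a date are never split between training and test sets.
function purged_panel_folds(dates::AbstractVector, nfolds::Int, gap::Int)
    udates = sort(unique(dates))
    map(purged_folds(length(udates), nfolds, gap)) do (dtrain, dtest)
        trainset, testset = Set(udates[dtrain]), Set(udates[dtest])
        (findall(in(trainset), dates), findall(in(testset), dates))
    end
end
```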
3.2 Priors for Time Series and Panel Data
Unlike other implementations of boosted trees, which rely on CV or default values for all hyperparameters, SMARTboost places proper Bayesian priors on the splitting parameters (split point and smoothness) and on the leaf values (see Section 1.5). This choice is made for three main reasons: (i) cross-validating fewer parameters is desirable given the high computational costs of smooth trees, (ii) CV also has a bias-variance trade-off, and with small samples and/or low SNR this trade-off favors reducing the number of cross-validated hyperparameters (Cawley and Talbot 2010), and (iii) we want to center the prior on smooth, monotonic functions, while allowing the recovery of any functional form given sufficient data.
Given that the penalizations are interpreted as priors, an adjustment to the log-likelihood is suggested when observations are not conditionally independent (e.g., with overlapping data and panel data). Consider the example of forecasting realized volatility twenty days ahead, as in Bollerslev et al. (2018); due to the extensive overlaps, the number of genuinely independent observations is much smaller (roughly ten times smaller) than the total number of observations, and this should be accounted for if we want the penalization to be interpretable as a prior. The example of stock prediction in Gu, Kelly, and Xiu (2020) is even more dramatic, since the 1000 largest stocks can be reduced to fewer than ten independent observations per date.
While these corrections are borrowed from the econometric literature, their objective is not to produce accurate standard errors and t-statistics, but to automatically calibrate the strength of the priors and avoid CV on hyperparameters. The importance of these corrections will therefore decrease with larger samples and higher SNR. The user may replace either correction if more specific information about the data is available. Finally, it is of course possible to use these corrections simply as a starting point for a CV search.
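As a rough illustration of how such a correction scales with the data structure, the classical design-effect formula below discounts the nominal sample size for an h-period forecast overlap and for an average cross-sectional correlation rho among the N units observed each date. This is an assumed stand-in, not the paper's formula, but it produces numbers of the same order as those quoted in the text.

```julia
# Rough stand-in (assumption, not the paper's formula): effective number of independent
# observations after discounting for an h-period forecast overlap and an average
# cross-sectional correlation rho among the N units observed each date.
effective_n(n, h, N, rho) = n / (h * (1 + (N - 1) * rho))

effective_n(13_500, 1, 36, 0.25)   # ≈ 1,385, the same order as the ~1,300 effective observations in Section 4.1
```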
4 Applications on Real Data
Simulations suggest that SMARTboost is a promising new tool for fitting financial data, and for tabular data more generally. Real datasets have a way of being less cooperative, and only the accumulation of experience will be able to confirm these encouraging results. We take a first step with two illustrations to well-known problems. We focus on XGBoost and SMARTboost in these applications.
4.1 Global Equity Indexes and the Fragility of High Valuations
Our first illustration is to an unbalanced panel of monthly global equity total excess returns. The data are described in Giordani and Halling (2019). There are a total of 13,500 observations from thirty-six countries (unbalanced), and only four features: the log CAPE ratio, a measure of momentum (cumulative total log excess return over the last 12 months), and two measures of volatility (over the last 3 months and the last 12 months). Following Asness, Moskowitz, and Pedersen (2013) and Blitz and Vliet (2007), we expect high value, high momentum, and low volatility to predict higher returns. Our interest is in moving beyond linearity and exploring interactions between these factors in a small setting to facilitate interpretation.
CV and priors are adapted to the panel setting as in Section 3 for both SMARTboost and XGBoost.18 Both models use five-fold CV to select the number of trees and their depth. We estimate the cross-sectional correlation adjustment of Section 3.2 so that the effective sample size is roughly 1300 observations.
4.1.1 Forecasting
Table 2 summarizes the out-of-sample results. OLS performs approximately as well as XGBoost. However, XGBoost and SMARTboost are minimally affected by alternative feature transformations, like defining volatility in terms of variances instead of standard deviations, or CAPE instead of log CAPE, while OLS is not, so the table presents the best possible comparison for OLS. SMARTboost strongly outperforms XGBoost when predicting log returns and (especially) returns. Neither model overfits: the average test R² is nearly identical to the training-set R². Importantly for the interpretation of results, SMARTboost with depth = 1 fits only marginally better than OLS both in-sample and out-of-sample, suggesting important interaction effects.
Pseudo-out-of-sample fit for several models. The out-of-sample R² is computed from forecasts produced by ten-fold purged CV, as explained in Section 3.1.

oos R² in % | log returns | returns
---|---|---
SMART depth = 4 | 2.16 | 1.86
SMART depth CV | 2.15 | 1.57
XGBoost | 0.84 | 0.21
OLS | 1.16 | 0.48
4.1.2 How can OLS outperform XGBoost if there are strong interactions?
SMARTboost suggests the presence of sizable interaction effects, which OLS cannot capture. So why isn't XGBoost outperforming OLS? Our interpretation of this result is that the effective sample size is very small. Let us assume that SMARTboost is accurate and the dgp in fact has an out-of-sample R² of roughly 2%. If we could lower the variance of the errors so that the R² rose to 50% (a more typical scenario in ML), we would only require a 1/49th share of the observations to achieve the same precision in estimating f. We have also estimated an effective sample size of roughly 1300 observations due to cross-correlation in the panel. As a result, our nominal sample can be compared to a sample of fewer than thirty observations in a 50% R² environment. The mediocre performance of XGBoost is less surprising if we consider that we are applying a universal approximator to a sample of twenty-seven observations.
4.1.3 Marginal effects
Standard boosted trees are not suitable for the computation of marginal effects since their derivatives are zero almost everywhere. In contrast, SMARTboost has continuous derivatives. Although we do not provide any theoretical results on the consistency of estimated marginal effects, preliminary simulation results (not reported) are encouraging and do suggest convergence to the true partial derivatives.19
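A generic way to obtain such marginal effects from any smooth fitted predictor is a central finite difference; the sketch below assumes a `predict(model, X)` function and is not the package's own implementation.

```julia
# Central finite-difference marginal effect of feature j for a smooth fitted predictor (sketch).
function marginal_effect(predict, model, X::AbstractMatrix, j::Int; delta = 1e-3)
    Xp, Xm = copy(X), copy(X)
    Xp[:, j] .+= delta
    Xm[:, j] .-= delta
    return (predict(model, Xp) .- predict(model, Xm)) ./ (2 * delta)
end
```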
4.1.4 Interpretation of results: the fragility of high valuations
Table 3 shows that volatility measures account for 50% of variable importance, with the rest equally split between valuations and momentum. (Results for log returns are similar.) The small number of important features makes it easier to interpret the model via partial effects and marginal effects plots,20 shown in Figures 8–13. The plots indicate that (i) momentum and short-term volatility are more relevant at high valuations than at low valuations, and conversely (ii) valuations matter little as long as momentum is positive and volatility is low.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on momentum at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on momentum at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on log CAPE at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on log CAPE at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Partial effect plots show the expected return as a function of each variable, conditional on Vol3m at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.

Global Equity Dataset. Marginal effects plots show the marginal effect of each variable, conditional on Vol3m at its mean (medium), at its 90% (high) and 10% (low) quantiles, and other variables at their mean.
Feature importance for global equity data returns, computed as in Hastie et al. (2013), except that the sum (rather than the largest number) is normalized to 100.

Feature | Importance in %
---|---
vol3m | 32
log CAPE | 26
momentum12m | 22
vol12m | 20
4.2 Risk Everywhere and Subtle Nonlinearities
Our second illustration is inspired by Bollerslev et al. (2018), who forecast realized volatility in a panel setup. We obtain daily data on 5-min sub-sampled realized volatility from the Oxford-Man Institute's realized library (Heber et al. 2009), which includes thirty-one equity indexes. The sample is January 2000 to May 2021 for most indexes, for a total of 151,736 observations. We forecast normalized realized log volatility21 at daily, weekly, and monthly horizons, using an expanded set of variables which includes daily, weekly, monthly, and quarterly averages of: (i) normalized realized volatility (in logs), (ii) normalized realized returns, and (iii) average normalized RV and normalized returns across all assets, for a total of sixteen features. Similarly to Bollerslev et al. (2018), realized volatilities are normalized by the sample mean of the corresponding index. We also divide returns by the asset's standard deviation.22 Forecasting weekly and monthly RV on daily data creates overlapping observations, which is taken into account by SMARTboost's prior as discussed in Section 3.2.
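The feature construction just described follows the HAR tradition of trailing averages over several horizons; a sketch assuming the conventional 1, 5, 22, and 66 trading-day windows for the daily, weekly, monthly, and quarterly averages is:

```julia
using Statistics

# HAR-style features (sketch): trailing averages of normalized log realized volatility over
# daily, weekly, monthly, and quarterly windows (1, 5, 22, 66 trading days assumed here).
trailing_mean(x, t, w) = mean(view(x, max(1, t - w + 1):t))

function har_features(logrv::AbstractVector; windows = (1, 5, 22, 66))
    return [trailing_mean(logrv, t, w) for t in eachindex(logrv), w in windows]
end
```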
We consider two settings. In the first, all data (normalized as just described) are pooled in a single panel of roughly 150k observations. In the second, each index is fitted separately on roughly 5k observations. Realized volatility built from high-frequency data is highly predictable: the in-sample and out-of-sample R² are in the 0.6–0.75 range. Substantial cross-correlations and the overlap induced by multi-day horizons reduce the effective sample size (computed as in Section 3.2) to roughly 10k in the most favorable case and to only 800 in the least favorable. Still, with such a high SNR, XGBoost and SMARTboost should have every chance of fitting complex nonlinearities. As it turns out, both models find only modest nonlinearities and produce fitted values and forecasts extremely close to those produced by OLS; the in-sample correlation between fitted values from OLS and SMARTboost ranges from 0.991 for daily RV prediction to 0.996 for monthly. The partial effects plots for the panel (Figures 14 and 15) also suggest modest deviations from linearity for the most important features. The subtle nonlinearities that are suggested by the plots are broadly compatible with previous findings in the literature. The first nonlinearity we notice in Figure 14 is that negative lagged returns increase volatility while positive returns have little effect. The presence of an asymmetry in the leverage effect is fairly well known in the literature; for example, the LHAR model of Corsi and Reno (2012) specifies a linear model in which only the negative part of lagged returns enters, which is very close to the partial effects retrieved by SMARTboost and XGBoost. The second nonlinearity is in the response to lagged realized volatility, which flattens at above-average levels. Our interpretation is in terms of the QHAR of Bollerslev, Patton, and Quaedvlieg (2016); in their model, the fixed coefficient multiplying lagged realized variance is replaced by a varying coefficient that decreases with realized quarticity (i.e., the realized variance of the realized variance), so that the coefficient is lower when measurement uncertainty is higher. Our list of features does not include realized quarticity, but Cipollini, Gallo, and Otranto (2021) report that it increases strongly with the level of realized variance for a sample of twenty-nine large US stocks, and therefore the QHAR model would predict a nonlinearity of the type observed in Figure 14.

Realized volatility in equity indexes (all data), h = 5. Partial effects plots (with all other features at their mean) for SMARTboost and XGBoost for the first six features by variable importance. The dependent variable is the log of the weekly realized variance.

Realized volatility in equity indexes (all data), h = 20. Partial effects plots (with all other features at their mean) for SMARTboost and XGBoost for the first six features by variable importance. The dependent variable is the log of the monthly realized variance.
Pseudo-out-of-sample RMSE, where forecasts are produced by ten-fold purged cross-validation, as explained in Section 3.1. Each cell reports a ratio of RMSEs, for the panel and for the average of individual indexes. SMARTboost outperforms XGBoost in all cases, and by more as the effective sample size shrinks due to larger h and/or estimation on individual indexes.

forecast horizon | XGBoost/SMARTboost, panel | XGBoost/SMARTboost, individual | OLS/SMARTboost, panel | OLS/SMARTboost, individual
---|---|---|---|---
h = 1 | 1.004 | 1.035 | 1.018 | 1.013
h = 5 | 1.009 | 1.039 | 1.021 | 1.002
h = 20 | 1.012 | 1.044 | 1.011 | 0.989
XGBoost is an inefficient method for retrieving these smooth relations, but the combination of a large sample and a large signal-to-noise ratio allows it to perform the task adequately in the panel. The results are summarized in Table 4. The relative out-of-sample performance of XGBoost deteriorates as the horizon increases, which is expected since longer horizons are equivalent to a smaller effective sample with overlapping data. In the second setting, the time series of each index is fitted separately. With a smaller sample, SMARTboost outperforms XGBoost by a larger margin, particularly at longer horizons, but outperforms OLS only at short horizons. This is not surprising in light of the full-panel results: since nonlinearities are modest, the linear model outperforms in small samples. The reported results are for cross-validated depth, but results at the default depth are nearly identical. Although depth is estimated at 3 or 4, forcing depth = 1 also produces in-sample and out-of-sample fits of nearly identical quality (not reported), suggesting very small interaction effects.
5 Conclusions
SMARTboost is a promising tool for financial data and for tabular data more generally. It can capture very complex functions in high dimensions and at the same time performs well in small and noisy datasets, thanks to its efficiency in recovering smooth functions and to its priors. Model complexity is determined automatically, making SMARTboost easy to use. In simulations, SMARTboost strongly outperforms a state-of-the-art boosting algorithm on a wide variety of smooth and partially smooth data-generating processes well known from econometrics and machine learning. It offers the same interpretation tools as other GBMs (feature importance, partial effects plots), and in addition marginal effects. We are hopeful that researchers will, quoting Efron, find SMARTboost a "smoother, more physically plausible algorithm" for scientific investigation, facilitating the discovery of more compact and/or theory-driven representations. The main objective of the article was to present SMARTboost and explore its forecasting performance, but ML has also been instrumental in many fields in facilitating the development of new theories (Prado 2020). More accurate models also deliver more accurate estimates of causal effects (Varian 2014), and there is a rapidly expanding literature on improving inference on treatment effects by adapting ML tools such as random forests (Wager and Athey 2018), neural networks (Hartford et al. 2017; Shi, Blei, and Veitch 2019), and Bayesian additive regression trees (Hill 2011). Similar applications and extensions of SMARTboost may also prove fruitful given the accuracy and continuity of its representation.
This article has focused exclusively on Gaussian likelihoods. SMARTboost can be generalized to other distributions, and to handle categorical variables and missing values, all extensions that we leave for follow-up work. SMARTboost contains a number of innovations that make it possible to reliably fit ensembles of smooth trees to fairly large datasets, but extensions to truly large datasets will require more research. Finally, only experience will tell whether the large gains over XGBoost obtained in simulations (and in a few empirical illustrations) will also materialize in actual applications. Such gains should be particularly welcome for researchers in finance and in any area in which data are scarce or very noisy.
For comments or encouragements, I would like to thank Erlend Aune, Daniel Buncic, Fulvio Corsi, Bradley Efron, Isaiah Hull, Robert Kohn, Mattias Villani, Roberto Renò, and Aman Ullah, as well as two anonymous referees.
Footnotes
1. Julia and R code is available at https://github.com/PaoloGiordani/SMARTboost.jl
2. "Features" (or "predictors") are the ML equivalent of explanatory variables. "Responses" (or "labels") correspond to dependent variables in econometrics.
3. We conjecture that mixing problems are due largely to the many proposals in the chain being blind, that is, making no use of the conditional posterior distributions.
4. The multiplication by 0.5 makes the function close to a logistic with the same value of the smoothness parameter.
5. Execution time is the time required to produce a forecast after the model has been fitted.
6. The first paper boosting symmetric trees is Lou and Obukhov (2017). CatBoost (Prokhorenkova et al. 2018) is currently the only major software to implement symmetric trees.
7. If all splits are indeed on completely different features and points, the proliferation of terminal leaf values will hurt the performance of symmetric trees. On the other hand, non-symmetric trees have a much greater potential for overfitting, since at each split they search for a splitting feature and point from a blank slate, considering each feature and split point as an equally likely candidate for a node, with no regard for any information accumulated from other nodes at the same depth.
8. It is also possible to center the prior on "exact" linearity by setting the mean at log(0.2) instead of 0. A mean of 0 is suggested as the default because it is more robust to leverage points.
9. The inclusion of the variance of the output variable is non-standard; this is not a "prior" in a strict Bayesian sense because it conditions on observing the variance of the data.
10. Here this means that if we simulate draws of the leaf values from the prior, for given split parameters (assuming the columns of the leaf-weight matrix have mean zero), the implied average variance of the fitted function is equal to the target value.
11. Lower values of the smoothness parameter induce leaf-weight matrices with lower variance (see Figure 3), so that a constant prior variance would be biased toward higher values of the smoothness and hence toward more nonlinear functions.
12. Smooth trees seem to allow larger learning parameters than sharp trees. For SMARTboost, the default is a good compromise between accuracy and speed. If computing costs are not a concern, a lower value (such as 0.1) is recommended to maximize accuracy.
13. All parameters for XGBoost and their default values can be found at https://xgboost.readthedocs.io/en/latest/parameter.html
14. For the largest sample sizes, we use a single validation set for SMARTboost.
15. In unreported results, we found that the exact form of this penalization is not as important, and a simpler ridge-type penalization performs almost as well as the default prior.
16. Correlation due to spatial proximity and its impact on CV is discussed by Oliveira, Torgo, and Santos Costa (2021), who recommend methods preserving the order of the series.
17. OOS Hold-out CV in SMARTboost defaults to using the last (in chronological order) 30% of the data as a test set, and then purging the training set if needed.
18. In this example, SMARTboost performs approximately as well even if CV is incorrectly randomized, while XGBoost is clearly worse off in that case.
19. I thank Aman Ullah for correspondence on this topic.
20. In the original (and most common) definition of partial effect plots we would integrate out the features not being plotted, while we set them at a specific value (mean or quantile), as in Gu et al. (2020).
21. Working with logs rather than variances reduces residual heteroskedasticity and nonlinearities (Corsi et al. 2008) and leads to superior forecasts (doi:10.1093/jjfinec/nbz025).
22. The comparison between OLS, XGBoost, and SMARTboost is little affected if performed on raw rather than normalized responses and features, with SMARTboost slightly increasing its advantage.
Appendix
Appendix A: Boosting Regression Trees
(1) Initialize a learning (or "shrinkage") parameter, set the initial fit (e.g., to the sample mean of the response), and compute the initial residuals.
(2) Repeat step (3) for each tree in the ensemble.
(3) Fit a single tree to the current residuals. Update the fit by adding the tree's prediction scaled by the learning parameter, and update the residuals.
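A minimal sketch of this loop, with `fit_tree` and `predict_tree` standing in for any base learner (such as a smooth symmetric tree), is:

```julia
using Statistics

# Boosting loop of Appendix A (sketch): fit each tree to the current residuals and shrink
# its contribution by the learning parameter lambda.
function boost(X, y, fit_tree, predict_tree; ntrees = 200, lambda = 0.1)
    f     = fill(mean(y), length(y))          # initialize at the sample mean
    trees = Any[]
    for m in 1:ntrees
        r    = y .- f                         # residuals (negative gradient of squared loss)
        tree = fit_tree(X, r)                 # step (3): fit a single tree to the residuals
        f  .+= lambda .* predict_tree(tree, X)
        push!(trees, tree)
    end
    return trees, f
end
```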
One of the main attractions of boosting trees is that the number of trees can be selected by early stopping based on CV (or on a single validation set) so that the pseudo-out-of-sample performance automatically determines model complexity.