2.5.1. Formulating the problem

In reinforcement learning (Sutton et al. 2019), there are two main entities, an environment and an agent. Learning takes place through interaction between these entities, so that the agent can maximize a cumulative reward. At each step t, the agent observes a representation of the environment's state s_t and chooses an action a_t accordingly. The agent then applies this action to the environment. As a consequence, the environment moves to a new state s_{t+1} and the agent receives a reward r_t. This interaction can be modeled as a Markov decision process (MDP) M = (S, A, P, R), where S is the state space, A the action space, P the transition dynamics and R the reward function. The behavior of the agent is defined by its policy π: S → A, which maps a state to an action when the policy is deterministic, or to a distribution over actions when it is stochastic. The objective of such a system is to find the optimal policy π* that maximizes the accumulated reward.
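
This interaction can be summarized by the following loop, a minimal Python sketch in which env and agent are hypothetical objects (an environment exposing reset/step and an agent exposing act/observe methods), not an interface taken from this book:

def run_episode(env, agent, n_steps=100):
    # Generic agent-environment interaction loop of an MDP.
    s_t = env.reset()                          # initial state
    total_reward = 0.0
    for t in range(n_steps):
        a_t = agent.act(s_t)                   # policy pi: S -> A
        s_next, r_t = env.step(a_t)            # transition and reward
        agent.observe(s_t, a_t, r_t, s_next)   # learning update
        total_reward += r_t
        s_t = s_next
    return total_reward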

For the problem of controlling access to the IoT, we define a discrete MDP, where the state, the action and the reward are defined as follows:

 – The state: given that the number of terminals attempting access at a given instant k is not available, the state we consider is based on measured estimates. Since a single measurement of this number is necessarily very noisy, we consider a set of several measurements, which better reflects the actual state of the network. The state s_k is defined as the vector of the H most recent measurements, where H represents the measurement horizon.

 – The action: in each state, the agent must select the blocking factor p to be applied by the IoT objects. In the problem we consider, this value is continuous and deterministic, that is, the same state s_k always yields the same action a_k.

 – The reward: this is a signal that the agent receives from the environment after executing an action. Thus, at step k, the agent obtains a reward r_k as a consequence of the action a_k it carried out in state s_k. This reward lets the agent assess the quality of the executed action, the agent's objective being to maximize it, as illustrated by the sketch after this list.
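
The following minimal Python sketch illustrates this formulation. The class name, the binomial arrival model, the Gaussian measurement noise and the absolute-distance reward are illustrative assumptions made for the example; the actual reward is the one defined by equation [2.6].

import numpy as np

class IoTAccessEnv:
    # Sketch of the access-control MDP: state = vector of the H most
    # recent noisy measurements, action = blocking factor p, reward =
    # closeness of the measured number of access attempts to the optimum
    # (illustrative shape, not the book's equation [2.6]).
    def __init__(self, horizon_H=5, n_devices=1000, n_optimal=50, noise_std=5.0):
        self.H = horizon_H            # measurement horizon
        self.n_devices = n_devices    # devices that may attempt access
        self.n_optimal = n_optimal    # optimum given by equation [2.2]
        self.noise_std = noise_std    # measurement noise level
        self.measurements = np.zeros(self.H)

    def reset(self):
        self.measurements = np.zeros(self.H)
        return self.measurements.copy()

    def _measure(self, p):
        # Each device attempts access with probability (1 - p); the
        # measurement of that number is corrupted by Gaussian noise.
        attempts = np.random.binomial(self.n_devices, 1.0 - p)
        return attempts + np.random.normal(0.0, self.noise_std)

    def step(self, p):
        # Apply blocking factor p in [0, 1]; return (next state, reward).
        m = self._measure(p)
        self.measurements = np.roll(self.measurements, -1)
        self.measurements[-1] = m
        reward = -abs(m - self.n_optimal)
        return self.measurements.copy(), reward

Such an environment can be driven directly by the interaction loop sketched earlier, with the agent's act method returning a blocking factor in [0, 1].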

Unlike classical reinforcement learning problems, the optimum is known here and is given by equation [2.2]. The reward is given by equation [2.6]:

[2.6]

The reward is therefore maximal when the chosen action results in a number of devices attempting access equal to the optimum given by equation [2.2]. However, since the measurements are noisy, the observed reward is affected by this noise.

The objective of such a system is to find the blocking probability that maximizes the average reward, which amounts to reducing the distance between the measured number of terminals attempting access and the optimum. To meet this objective, we rely on the TD3 (Twin Delayed Deep Deterministic policy gradient) algorithm.

The TD3 algorithm is an actor-critic approach, in which the actor is a neural network that decides which action to take in a given state, while the critic network estimates the value of being in a state and taking a particular action. TD3 addresses the problem of over-estimation in value estimation (Thrun and Schwartz 1993) by introducing two critic networks and taking the minimum of their two estimates. This approach is particularly beneficial in our case because of the inherent presence of measurement errors.
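
As an illustration, the following Python sketch (using PyTorch) shows the two ingredients mentioned above: an actor producing a continuous blocking factor in [0, 1] and the twin-critic target in which the minimum of the two estimates counters over-estimation. Network sizes, names and hyperparameters are assumptions made for the example, not the book's implementation.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Deterministic policy: maps the state vector to a blocking factor p in [0, 1].
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Estimates the value of taking an action in a state (Q-value).
    def __init__(self, state_dim, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def td3_target(reward, next_state, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # TD3 learning target: r + gamma * min(Q1', Q2')(s', a').
    with torch.no_grad():
        # Target policy smoothing: clipped noise added to the target action.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(0.0, 1.0)
        # Clipped double Q-learning: the minimum of the two critic
        # estimates limits value over-estimation.
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        return reward + gamma * torch.min(q1, q2)

In a full TD3 training loop, both critics are regressed towards this common target, while the actor is updated less frequently (delayed updates) using the first critic; taking the minimum of the two estimates is what makes the approach robust to the noisy rewards of our measurement-based setting.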
