The majority of TTA optimization approaches require prior knowledge, but such approaches are not applicable in dynamic mobile crowdsourcing environments, where the availability of mobile workers is subject to frequent and unpredictable changes [2, 6]. Consider requesters submitting spatial tasks through a mobile crowdsourcing system, where the tasks arrive in an online manner. In such a scenario, the mobile crowdsourcing system possesses no prior information about the spatial tasks or the mobile workers.
The crowdsourcing TTA optimization problem is modeled as a Markov decision process-based mobile crowdsourcing (MCMDP) problem. Deep Q-learning is introduced to address the MCMDP problem. Furthermore, we propose an improved deep Q-learning-based trust-aware task allocation (ImprovedDQL-TTA) algorithm by combining trust crowdsourcing optimization and deep Q-learning, which enables the learning agent to solve large-scale MCMDP problems in an uncertain scenario.
MDP model for uncertain mobile crowdsourcing
To address the dynamic problems of uncertain crowdsourcing TTA, a Markov decision process (MDP) is adopted. The MDP is a standard framework in machine learning for modeling sequential decision-making problems under uncertainty [15]. In this paper, the MDP is used to model trust-aware task allocation and the adaptation process in uncertain mobile crowdsourcing.
A mobile crowdsourcing MDP consists of a five-tuple \(\langle S, A, P, R, O\rangle\), where S is a finite set of crowdsourcing states, A is a finite set of crowdsourcing actions, P is the transition function giving the probability of reaching the next crowdsourcing state \(s'\) from state s when an action \(a \in A(s)\) is performed by a crowdsourcing agent, R is a real-valued reward function under which the agent receives an immediate reward \(r = R(s'|s,a)\), and O is the crowdsourcing observation space through which the agent fully observes the mobile crowdsourcing decision environment. On this basis, the mobile crowdsourcing MDP can be defined as follows.
Definition 6
(Mobile Crowdsourcing MDP (MCMDP)) A MCMDP is formally defined as a seven-tuple: MCMDP\(=\langle S^{i}, s_{0}^{i}, s_{r}^{i}, A^{i}, P^{i}, R^{i}, O^{i} \rangle\), where:
-
\(S^{i}\) is the set of tasks in the state space of a particular crowdsourcing process, as partially observed by agent i.
-
\(s_{0}^{i} \in S^{i}\) is the initial task; every execution of the mobile crowdsourcing process begins from this task.
-
\(s_{r}^{i} \in S^{i}\) represents the terminal task; when it is reached, the execution of the mobile crowdsourcing process terminates.
-
\(A^{i}\) is the set of mobile workers that can perform tasks \(s \in S^{i}\); a mobile worker cw belongs to \(A^{i}\) only if its precondition is satisfied by s.
-
\(P^{i}\) is the transition distribution \(P(s'|s, a)\), which determines the probability of reaching the next state \(s'\) from state s when action \(a \in A(s)\) is performed by a crowdsourcing agent. The probability distribution \(P(s'|s, a)\) satisfies
$$\begin{aligned} \sum _{s' \in S} P(s'|s, a) = 1, \forall s \in S,\forall a \in A. \end{aligned}$$
(7)
-
\(R^{i}\) is the reward function: when mobile worker \(cw \in A^{i}\) is invoked and agent i transits from s to \(s'\), the learning agent obtains an immediate reward \(r^i\) with expected value \(R^{i}(s'|s, cw)\). When selecting mobile worker cw under multiple quality criteria, agent i receives the following quality vector as a reward:
$$\begin{aligned} \begin{aligned} QoS(s,cw,s') =&[f_{tr}(s,cw,s'), f_{dist}(s,cw,s')]^{T}, \end{aligned} \end{aligned}$$
(8)
where each \(f_k(\cdot )\) denotes a quality attribute of mobile worker cw.
-
\(O^{i}\) is the crowdsourcing observation space in which agent i can fully observe the mobile crowdsourcing decision environment.
The MCMDP solution is a collection of TTA decision policies, each of which describes the trust-aware allocation of a mobile worker \(cw \in A\) by agent i in each state s. Such a policy, denoted \(\pi\), maps spatial tasks to mobile workers, \(\pi : S \rightarrow A\). Solving the MCMDP thus amounts to identifying the optimal policy for trust-aware allocation in uncertain mobile crowdsourcing.
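For illustration only, the following Python sketch shows one possible way to represent the MCMDP tuple of Definition 6 as a data structure; the class name, field names, and callable signatures are our own assumptions rather than part of the formal model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative container for the MCMDP tuple of Definition 6: states are
# spatial tasks, actions are mobile workers, and the transition and reward
# functions are assumed to be supplied by the crowdsourcing environment.
@dataclass
class MCMDP:
    states: List[str]                                   # S^i: spatial tasks observed by agent i
    initial_state: str                                  # s_0^i: task where every execution starts
    terminal_state: str                                 # s_r^i: task at which an execution ends
    actions: Dict[str, List[str]]                       # A^i(s): mobile workers eligible for task s
    transition: Callable[[str, str, str], float]        # P^i(s' | s, cw)
    reward: Callable[[str, str, str], Tuple[float, float]]  # QoS vector [f_tr, f_dist] of Eq. (8)

    def eligible_workers(self, s: str) -> List[str]:
        # A policy pi : S -> A selects one eligible worker cw for each task s.
        return self.actions.get(s, [])
```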
Deep Q-learning-based trust-aware task allocation algorithm
The above section modeled the optimization problem of trust-aware allocation by means of the MCMDP model. The optimization objective is to maximize the long-term reward of the MCMDP. The solution of the MCMDP can be expressed as a policy \(\pi\) that guides a learning agent to take the right action in a specific crowdsourcing state.
Dynamic task allocation with Q-Learning
The uncertain mobile crowdsourcing problem can be formulated as an MCMDP model. However, the transition probabilities are not known, and we do not initially know the rewards of taking the allocation action. In this case, Q-learning is suggested for a crowdsourcing agent to determine the optimal policy. Q-learning is a temporal difference learning algorithm [14, 15] that takes into account the fact that the agent initially has only partial knowledge of the crowdsourcing MCMDP. In general, assume that an agent learns from experience to address uncertain mobile crowdsourcing. The agent can obtain a set of state-action rewards \(\langle s_1, a_1, r_1,s_2, a_2, r_2,\cdots , s_t, a_t, r_t \rangle\), which indicates that the agent was in state \(s_t\), selected action \(a_t\), and obtained reward \(r_t\). Figure 2 illustrates the sequence of the crowdsourcing state and state-action reward pairs.
Temporal difference learning agents determine the increment to \(V(s_t)\) at each time step. At time t, the agent immediately updates \(V(s_t)\) using the discounted reward and the estimated value of the next state. Temporal difference learning [15] can be defined as
$$\begin{aligned} V(s_t) = V(s_t) + \alpha \cdot \big (r_t+\gamma \cdot V(s_{t+1}) - V(s_t)\big ). \end{aligned}$$
(9)
The goal of temporal difference learning is to move \(V(s_t)\) toward \(r_t+\gamma \cdot V(s_{t+1})\) at each step. Tabular Q-learning is a common temporal difference approach for maximizing the total reward. For each state s and action a, the tabular Q-learning algorithm takes the action, observes a reward r, enters the next state \(s'\), and updates Q(s, a). The key of the Q-learning algorithm is a simple iterative update of the value Q(s, a): in each training iteration, Q(s, a) accumulates the current estimate of \(Q^{\pi }\). The table values Q(s, a) are revised by the following rule:
$$\begin{aligned} Q^{\pi }(s, a) = (1 - \alpha ) \cdot Q(s, a)+ \alpha \cdot \Big (r + \gamma \cdot \max _{a'}Q(s', a')\Big ). \end{aligned}$$
(10)
The learning rate \(\alpha \in [0,1]\) indicates the extent to which the existing estimate of \(Q^{\pi }(s, a)\) contributes to the next estimate. The Q(s, a) values ultimately converge to the optimal values \(Q^{*}(s, a)\) [15]. Thus, the Q-learning-based allocation algorithm ultimately discovers an optimal policy for any finite MCMDP [6]. The basic optimization incorporates both the travel distance and the trust score of mobile workers into the dynamic mobile crowdsourcing decisions. Accordingly, the reward function of Q-learning-based TTA is defined as in Definition 7.
Definition 7
(Reward function) Suppose that the trustworthiness of a mobile worker completing a task is estimated by a trust score \(f_{tr}(x_i^j) = tr(x_i^j)\). Each mobile worker is required to move from location aloc to location bloc to complete the spatial task, which incurs a distance cost \(f_{dist}(x_i^j)\), evaluated as the distance \(f_{dist}(x_i^j)=dist(aloc, bloc)\) between aloc and bloc. The reward function is therefore determined by the QoS vector \([f_{tr}(x_i^j), f_{dist}(x_i^j)]\). Because the QoS objectives have different scales, each QoS value is mapped into the interval [0, 1] with the min-max operator, and the learning reward is computed as the linearly weighted sum over all QoS objectives:
$$\begin{aligned} r = \sum _{k=1}^{2}w_{k} \cdot (f_k(x_{i}^{j})-z_{k}^{U})/(z_{k}^{N}-z_{k}^{U}) \end{aligned}$$
(11)
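To make the normalization concrete, the following minimal Python sketch computes the reward of Equation (11); the weights and the utopia/nadir bounds \(z_{k}^{U}\) and \(z_{k}^{N}\) are assumed to be provided by the requester, and the example values are purely illustrative.

```python
def qos_reward(f_values, weights, z_utopia, z_nadir):
    """Linearly weighted, min-max normalized reward, as in Eq. (11).

    f_values : [f_tr, f_dist] quality vector of the selected mobile worker
    weights  : requester-defined weights (assumed to sum to 1)
    z_utopia : best (utopia) value of each QoS objective
    z_nadir  : worst (nadir) value of each QoS objective
    """
    r = 0.0
    for f_k, w_k, z_u, z_n in zip(f_values, weights, z_utopia, z_nadir):
        # map each objective into [0, 1] before weighting
        r += w_k * (f_k - z_u) / (z_n - z_u)
    return r

# Illustrative call: trust score 0.8, travel distance 2.5 km,
# with assumed utopia/nadir bounds for each objective.
r = qos_reward([0.8, 2.5], weights=[0.6, 0.4],
               z_utopia=[1.0, 0.0], z_nadir=[0.0, 10.0])
```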
In the training iterations, the learning agent estimates its optimal policy by maximizing the total crowdsourcing reward received in the uncertain scenario.
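To make the training loop concrete, the sketch below implements the tabular update of Equation (10) with the reward of Definition 7 supplied by the environment; the environment interface (reset, actions, step) and all hyperparameter values are hypothetical stand-ins for the mobile crowdsourcing simulator.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Illustrative tabular Q-learning loop implementing Eq. (10).

    `env` is assumed to expose reset() -> state, actions(state) -> list of
    mobile workers, and step(state, action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)  # Q[(s, a)] defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy choice between exploration and exploitation
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            # Eq. (10): blend the old estimate with the bootstrapped target
            target = r if done else r + gamma * max(
                Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```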
Dynamic task allocation with deep Q-learning
Tabular Q-learning is not a feasible solution owing to the large state and action spaces of uncertain mobile crowdsourcing systems. Moreover, a Q-learning table is environment-specific and does not generalize. In large-scale uncertain systems, there are too many states and actions to store in machine memory, and learning the value of each state separately is slow. This section therefore introduces a new and highly effective Q-learning-based task allocation mechanism.
To adapt to changes in large-scale mobile crowdsourcing systems, we propose a deep Q-learning-based trust-aware task allocation (DQL-TTA) algorithm that combines advances in deep neural networks and Q-learning. Specifically, the dynamic TTA problem is formalized as a Markov decision process-based mobile crowdsourcing model. The experience of a crowdsourcing state transition is denoted \((s, a, r, s')\), and a set of crowdsourcing states and allocation actions with a transition policy constitutes an MCMDP. One episode of an MCMDP forms a finite sequence of crowdsourcing states, allocation actions and rewards:
$$\begin{aligned} s_0, a_0, r_0, s_1, a_1, r_1, s_2, ..., s_t, a_t, r_t, s_{t+1}, ..., s_{n-1}, a_{n-1}, r_{n-1}, s_n \end{aligned}$$
(12)
where \(s_t\) denotes the current state, \(a_t\) denotes the current action, \(r_t\) denotes the reward after performing an action, and \(s_{t+1}\) denotes the next state in the dynamic mobile crowdsourcing system.
The DQL-TTA algorithm directly combines a deep neural network and Q-learning to solve the dynamic trust-aware allocation problem. The DQL-TTA learning algorithm uses a value iteration approach, in which the crowdsourcing value function \(Q = Q(s, a; \theta )\) is a parameterized function with parameter \(\theta\) that takes crowdsourcing state s and crowdsourcing action space A as inputs and returns a crowdsourcing Q value for each action \(a \in A\). A greedy approach can then be used to select a crowdsourcing action:
$$\begin{aligned} a^{*} = argmax_{a \in A} Q(s, a; \theta ) \end{aligned}$$
(13)
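For illustration, a minimal PyTorch-style sketch of the greedy selection rule in Equation (13) is given below; the network `q_network` stands in for the parameterized value function \(Q(s, a; \theta )\), and the optional action mask for worker availability is our own assumption.

```python
import torch

def greedy_action(q_network, state, action_mask=None):
    """Select argmax_a Q(s, a; theta), as in Eq. (13).

    `q_network` maps a state vector to one Q value per allocation action;
    `action_mask` optionally marks which mobile workers are currently available.
    """
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0)).squeeze(0)  # shape: [actnum]
        if action_mask is not None:
            # exclude unavailable workers from the argmax
            q_values = q_values.masked_fill(~action_mask, float("-inf"))
        return int(torch.argmax(q_values).item())
```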
DQL-TTA iteratively solves the mobile crowdsourcing MDP problem by learning the weights of the deep neural network towards the optimization objective. The DQL-TTA algorithm differs from Q-learning in two ways. Traditional Q-learning is based on the Bellman equation, with the Q value updated iteratively: \(Q_{t+1}(s,a) = E [r + \gamma \cdot max_{a'}Q_t(s', a')|s, a]\). Such value-iteration Q-learning algorithms are impractical for large-scale crowdsourcing problems. Thus, it is practical to employ a function approximation to estimate the action value function, \(Q(s, a; \theta )\approx Q^{*}(s, a)\).
DQL-TTA is designed as a function approximator with weights \(\theta\) for the mobile crowdsourcing MDP problem. The parameters of the DQL-TTA function approximator can be learned by minimizing the loss function \(L(\theta _t)\), which is optimized at iteration t
$$\begin{aligned} L(\theta _t) = {\mathbb {E}}_\pi \Big [\big (y_t - Q(s, a; \theta _t)\big )^{2} \Big ] \end{aligned}$$
(14)
where \(y_t\) is the target value for iteration t and can be computed as
$$\begin{aligned} y_t = \left\{ \begin{array}{ll} r_t, &{} if \, A(s') = \varnothing \\ r_t + \gamma \cdot max_{a'}Q(s', a'; \theta _{t-1}), &{} else \\ \end{array} \right. \end{aligned}$$
(15)
DQL-TTA considers the crowdsourcing states and allocation actions as the inputs of a deep Q-network and outputs the Q-value for dynamic allocations. Figure 3 illustrates the deep Q-learning-based trust-aware task allocation (DQL-TTA) algorithm framework.
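The following hedged PyTorch sketch shows how the target of Equation (15) and the loss of Equation (14) might be computed for a mini-batch; the batch layout and the use of a frozen copy of the network for the previous parameters \(\theta _{t-1}\) are our assumptions.

```python
import torch
import torch.nn.functional as F

def dql_tta_loss(q_net, target_net, batch, gamma=0.9):
    """Loss of Eq. (14) with the target of Eq. (15).

    `batch` is assumed to hold tensors: states [B, statenum], actions [B],
    rewards [B], next_states [B, statenum], and terminal [B]
    (True when no further allocation action is available, i.e. A(s') is empty).
    """
    states, actions, rewards, next_states, terminal = batch
    # Q(s, a; theta_t) for the allocation actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s', a'; theta_{t-1}) from the frozen previous network
        next_max = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_max * (~terminal).float()  # Eq. (15)
    return F.mse_loss(q_sa, y)
```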
Dynamic task allocation with improved deep Q-learning
As discussed in [17, 33,34,35], the performance of deep Q-learning algorithms may not be stable. To improve the overall performance of DQL-TTA, an improved DQL-TTA algorithm (ImprovedDQL-TTA) is further proposed to handle large-scale MCMDP problems much more stably in uncertain mobile crowdsourcing environments. The proposed ImprovedDQL-TTA algorithm incorporates the following mechanisms: (i) a mini-batch stochastic gradient descent approach with advanced training mechanisms; (ii) an \(\epsilon\)-decreasing greedy policy; and (iii) a novel deep neural network architecture with an action advantage function.
Mini-batch stochastic gradient descent The parameters of ImprovedDQL-TTA from an earlier training iteration, \(\theta _{t-1}\), are held fixed while the loss function \(L(\theta _t)\) is optimized. Note that the targets depend on the ImprovedDQL-TTA weight parameters. A local minimum of the loss function is reached by following the gradient:
$$\begin{aligned} \begin{aligned} \varDelta \theta _t&= - \frac{1}{2} \eta \cdot \triangledown _{\theta }(L(\theta _t)) \\&= \eta \cdot {\mathbb {E}}_\pi \Big [r + \gamma \cdot max_{a'}Q(s', a'; \theta _{t-1})- Q(s, a; \theta _t)\Big ] \cdot \triangledown _{\theta } Q(s, a; \theta _t) \end{aligned} \end{aligned}$$
(16)
Instead of computing the full expectation in the above gradient, the loss function of ImprovedDQL-TTA is optimized by stochastic gradient descent [17]. The weights of the ImprovedDQL-TTA approximator are trained with a gradient descent rule, and the parameter \(\theta\) can be updated by stochastic gradient descent as
$$\begin{aligned} \begin{aligned} \varDelta \theta _t&= \eta \cdot \big (r + \gamma \cdot max_{a'}Q(s', a'; \theta _{t-1})- Q(s, a; \theta _t)\big ) \cdot \triangledown _{\theta } Q(s, a; \theta _t) \\ \theta _{t}&= \theta _t - \eta \cdot \varDelta \theta _t \end{aligned} \end{aligned}$$
(17)
Stochastic gradient descent is simple and appealing for DQL-TTA; however, it is not sample efficient. In this paper, mini-batch stochastic gradient descent is therefore proposed to discover the optimal fitting value function of ImprovedDQL-TTA by training on mini-batches of crowdsourcing data. Instead of making decisions based solely on the current allocation experience, allocation experience replay helps the ImprovedDQL-TTA network learn from several mini-batches of crowdsourcing data. Each allocation experience is stored as a four-dimensional vector \(\langle state, action, reward, next\,state\rangle\). During training iteration t, allocation experience \(e_t = (s_t, a_t, r_t, s_{t+1})\) is stored in a replay memory \(D = \{e_1, ..., e_t\}\). The memory buffer of the allocation experience replay has a fixed size, and as new allocation experiences are inserted, previous experiences are removed [19]. To train the ImprovedDQL-TTA neural network, mini-batches of experiences are drawn uniformly at random from the allocation memory buffer.
To obtain stable Q-values, a separate target network is used to estimate the loss function: a second neural network, whose weights change only gradually compared with those of the primary Q-network, is maintained [35]. In this context, the ImprovedDQL-TTA algorithm learns two separate neural networks \(Q(s, a; \theta )\) and \(Q(s, a; {\hat{\theta }})\) with current learning parameters \(\theta\) and previous learning parameters \({\hat{\theta }}\). The parameters \(\theta\) are updated numerous times during training and are cloned to the previous parameters \({\hat{\theta }}\) after every \(NUM_{training}\) iterations.
$$\begin{aligned} \theta _{t} = \theta _t - \eta _t \cdot \frac{1}{b} \sum _{t=k}^{k+b} \varDelta \theta _t \end{aligned}$$
(18)
At each decision epoch, ImprovedDQL-TTA is updated with a batch of samples collected in the experience replay buffer by means of mini-batch stochastic gradient descent.
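As a minimal sketch of the mechanisms described above, the code below combines a fixed-size replay buffer with one mini-batch SGD step and periodic cloning of \(\theta\) into \({\hat{\theta }}\); it reuses the illustrative dql_tta_loss from the earlier sketch, and the capacity, batch size, and synchronization period are assumed values.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-size buffer of allocation experiences (s, a, r, s', done)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random mini-batch; states are assumed to be torch tensors
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.stack(s), torch.tensor(a),
                torch.tensor(r, dtype=torch.float32),
                torch.stack(s_next), torch.tensor(done))

def train_step(q_net, target_net, optimizer, buffer, step,
               batch_size=32, sync_every=1000, gamma=0.9):
    """One mini-batch SGD update; the target network is cloned periodically."""
    if len(buffer.buffer) < batch_size:
        return
    loss = dql_tta_loss(q_net, target_net, buffer.sample(batch_size), gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:
        # copy the current parameters theta into the previous parameters theta-hat
        target_net.load_state_dict(q_net.state_dict())
```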
Theorem 1
(The convergence analysis of mini-batch stochastic gradient descent) Assume that there exist two constants A and B that satisfy \({\mathbb {E}}[\Vert \triangledown h_b(\theta )\Vert ^2]\le A^2\) and \({\mathbb {E}}[\Vert \theta ^* - \theta _t\Vert ^2]\le B\), where t denotes the gradient optimization iteration and
$$\begin{aligned} \triangledown h_b(\theta ) = \frac{1}{b} \sum _{t=k}^{k+b} \varDelta \theta _t \end{aligned}$$
(19)
Let \(h_{min}(\theta )=min\{h(\theta _1), h(\theta _2), \cdots , h(\theta _t)\}\) and assume that
$$\begin{aligned} 1> \eta _t > 0, \sum _{t=0}^{\infty } \eta _t^2 < \infty , \sum _{t=0}^{\infty } \eta _t = \infty \end{aligned}$$
(20)
After \(t+1\) iterations of the mini-batch optimization, we have
$$\begin{aligned} \begin{aligned} \Vert \theta _{t+1} - \theta ^{*}\Vert ^2&= \Vert \theta _{t}-\eta _t \cdot \triangledown h_b(\theta )-\theta ^{*}\Vert ^2\\&= \Vert \theta _{t} - \theta ^{*}\Vert ^2 - 2 \eta _{t} \cdot \triangledown h_b(\theta ) \cdot (\theta _{t} - \theta ^{*}) + \eta _t^2 \cdot \Vert \triangledown h_b(\theta ) \Vert ^2 \end{aligned} \end{aligned}$$
(21)
Taking the conditional expectation given \(\theta _t\) and using the convexity of the loss h, we obtain
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2|\theta _t]&= {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2|\theta _t] - 2 \eta _{t} \cdot {\mathbb {E}}[\triangledown h_b(\theta ) \cdot (\theta _{t} - \theta ^{*})|\theta _t] + \\&\eta _t^2 \cdot {\mathbb {E}}[\Vert \triangledown h_b(\theta ) \Vert ^2|\theta _t] \\&\le \Vert \theta _{t} - \theta ^{*}\Vert ^2 - 2 \eta _{t} \cdot (h(\theta _t)-h(\theta ^*)) + \eta _t^2 \cdot A^2 \end{aligned} \end{aligned}$$
(22)
Taking the expectation over \(\theta _t\) in Equation (22) yields
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2]&\le {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2] - 2 \eta _{t} \cdot {\mathbb {E}}[h(\theta _t)-h(\theta ^*)] + \eta _t^2 \cdot A^2 \end{aligned} \end{aligned}$$
(23)
Applying this inequality recursively over the training iterations gives
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2]&\le {\mathbb {E}}[\Vert \theta _{0} - \theta ^{*}\Vert ^2] - 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h(\theta _t)-h(\theta ^*)] + A^2 \cdot \sum _t \eta _t^2 \end{aligned} \end{aligned}$$
(24)
Since \({\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2] \ge 0\), we obtain
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{0} - \theta ^{*}\Vert ^2] + A^2 \sum _t \eta _{t}^2&\ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h(\theta _t)-h(\theta ^*)] \\&\ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h_{min}(\theta _t)-h(\theta ^*)] \end{aligned} \end{aligned}$$
(25)
Since \({\mathbb {E}}[\Vert \theta _{0} - \theta ^{*}\Vert ^2] \le B\), we obtain
$$\begin{aligned} B + A^2 \sum _t \eta _{t}^2 \ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h_{min}(\theta _t)-h(\theta ^*)] \end{aligned}$$
(26)
and
$$\begin{aligned} {\mathbb {E}}[h_{min}(\theta _t)-h(\theta ^*)] \le \frac{B + A^2 \sum _t \eta _{t}^2}{2 \sum _t \eta _{t}} \end{aligned}$$
(27)
Since \(\sum _{t=0}^{\infty } \eta _t=\infty\) and \(\sum _{t=0}^{\infty } \eta _t^2 < \infty\), the right-hand side of Equation (27) tends to zero, and hence \(h_{min}(\theta ) \rightarrow h(\theta ^*)\).
Therefore, it can be concluded that ImprovedDQL-TTA with mini-batch stochastic gradient descent converges to \(h(\theta ^*)\).
\(\epsilon\)-decreasing greedy policy The ImprovedDQL-TTA algorithm selects the allocation action a with the maximum Q value by exploiting the knowledge obtained for the current state s. To build a better estimate of the optimal ImprovedDQL-TTA function, the algorithm should also explore allocation actions other than the current best allocation. In this paper, the \(\epsilon\)-greedy policy is employed: with probability \(\epsilon\) (\(0 \le \epsilon \le 1\)) a random allocation action is selected, and otherwise the allocation action with the maximum Q value is selected [15]. By means of this strategy, ImprovedDQL-TTA achieves a trade-off between exploration and exploitation in uncertain mobile crowdsourcing systems. The \(\epsilon\)-greedy policy can be expressed as follows
$$\begin{aligned} \pi (a|s) = \left\{ \begin{array}{ll} \frac{\epsilon }{actnum}+1-\epsilon , &{} if \, a^{*} = argmax_{a \in A} Q(s, a)\\ \frac{\epsilon }{actnum}, &{} otherwise \\ \end{array} \right. \end{aligned}$$
(28)
where actnum denotes the total number of available allocation actions.
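A small Python sketch of the selection rule in Equation (28) follows; `q_values` is assumed to be the vector of Q values produced by the ImprovedDQL-TTA network for the current crowdsourcing state.

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a random allocation, otherwise the greedy one (Eq. 28)."""
    actnum = len(q_values)
    if random.random() < epsilon:
        return random.randrange(actnum)                   # explore: uniform over workers
    return max(range(actnum), key=lambda a: q_values[a])  # exploit: argmax_a Q(s, a)
```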
Theorem 2
(\(\epsilon\)-greedy policy improvement) For any \(\epsilon\)-greedy policy \(\pi\), the \(\epsilon\)-greedy policy \(\pi '\) with respect to \(q_{\pi }\) is an improvement, that is, \(v_{\pi '}(s) \ge v_{\pi }(s)\).
$$\begin{aligned} \begin{aligned} q_{\pi }(s, \pi '(s))&= \sum _{a \in A}\pi '(a|s)q_{\pi }(s,a) \\&= \frac{\epsilon }{actnum} \sum _{a \in A}q_{\pi }(s,a) + (1-\epsilon ) max_{a \in A}q_{\pi }(s,a) \\&\ge \frac{\epsilon }{actnum} \sum _{a \in A}q_{\pi }(s,a) + (1-\epsilon ) \sum _{a \in A} \frac{\pi (a|s)-\epsilon /actnum}{1-\epsilon }q_{\pi }(s,a) \\&= \sum _{a \in A}\pi (a|s)q_{\pi }(s,a) \\&= v_{\pi }(s) \end{aligned} \end{aligned}$$
(29)
Therefore, the \(\epsilon\)-greedy policy is an improvement, \(v_{\pi '}(s) \ge v_{\pi }(s)\).
To maintain a good balance between exploration and exploitation, a suitable learning parameter should be selected for the \(\epsilon\)-greedy strategy. Early in training, a more random policy should be used to encourage initial exploration; as training progresses, a more greedy policy should be adopted. The training performance of ImprovedDQL-TTA can therefore be improved by using an \(\epsilon\) parameter that decreases during training, defined as follows.
$$\begin{aligned} \epsilon = \epsilon - \frac{\epsilon _i - \epsilon _f}{explore} \end{aligned}$$
(30)
where \(\epsilon _i\) is the initial value of \(\epsilon\), \(\epsilon _f\) is the final value of \(\epsilon\), and explore is the total number of training steps.
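A minimal sketch of the decreasing schedule in Equation (30) is shown below; the initial value, final value, and number of exploration steps are illustrative choices rather than values prescribed by the algorithm.

```python
def decay_epsilon(epsilon, eps_initial=1.0, eps_final=0.1, explore_steps=100000):
    """Decrease epsilon by a fixed amount per training step, as in Eq. (30)."""
    step = (eps_initial - eps_final) / explore_steps
    return max(eps_final, epsilon - step)

# Example: start almost fully exploratory and anneal toward mostly greedy behaviour.
eps = 1.0
for _ in range(100000):
    eps = decay_epsilon(eps)
```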
Novel neural network architecture with action advantage function To further improve convergence stability, a novel deep network architecture with an action advantage function is integrated into ImprovedDQL-TTA for learning the crowdsourcing decision process [33,34,35]. The key idea of this mechanism is to design a neural network with two sequences of fully connected layers, so that the state value and the action advantage are learned separately by the ImprovedDQL-TTA network. Figure 4 illustrates this neural network architecture.
For a stochastic policy \(\pi\), \(Q_{\pi }(s, a)\) and \(V_{\pi }(s)\) can be formulated as
$$\begin{aligned} \begin{aligned} Q_{\pi }(s, a)&= {\mathbb {E}}[R_t|s_t=s, a_t=a,\pi ] \\ V_{\pi }(s)&= {\mathbb {E}}_{a\sim \pi (s)}[Q^{\pi }(s, a)] \end{aligned} \end{aligned}$$
(31)
The action advantage function can be defined as
$$\begin{aligned} G_{\pi }(s, a) = Q_{\pi }(s, a) - V_{\pi }(s) \end{aligned}$$
(32)
Note that \({\mathbb {E}}_{a\sim \pi (s)}[G_{\pi }(s, a)]=0\). Intuitively, \(V_{\pi }(s)\) measures the value of a particular state s, whereas \(Q_{\pi }(s, a)\) evaluates the value of selecting action a in state s; combining the two yields the crowdsourcing action value. Based on this definition, the relative importance of each crowdsourcing action can be obtained from the action advantage function \(G_{\pi }(s, a)\).
To estimate the V and G functions, ImprovedDQL-TTA is implemented with a neural network in which two streams of fully connected layers output the vector \(V(s;\beta )\) and the vector \(G(s,a;\alpha )\). ImprovedDQL-TTA combines \(V_{\pi }(s)\) and \(G_{\pi }(s, a)\) to obtain \(Q_{\pi }(s, a)\), as follows
$$\begin{aligned} Q(s, a;\theta , \alpha ,\beta ) = V(s;\theta , \beta ) + G(s, a;\theta ,\alpha ) \end{aligned}$$
(33)
and
$$\begin{aligned} Q(s, a;\theta , \alpha , \beta )=V(s;\theta , \beta ) + \bigg (G(s, a;\theta , \alpha ) - \max _{a' \in A} G(s, a';\theta , \alpha )\bigg ) \end{aligned}$$
(34)
where \(\alpha\) and \(\beta\) are the parameters of the two sequences of neural network layers. This formulation forces the selected action to have zero advantage: for \(a^*=argmax_{a \in A}Q(s, a; \alpha , \beta ) = argmax_{a \in A}G(s, a;\alpha )\), we obtain \(Q(s, a^*; \alpha , \beta )=V(s;\beta )\). Furthermore, for better stability, an alternative module of ImprovedDQL-TTA replaces the max operator with an average operator
$$\begin{aligned} Q(s, a;\theta , \alpha , \beta ) = V(s;\theta , \beta ) + \bigg (G(s, a;\theta , \alpha ) - \frac{1}{|A|} \sum _{a' \in A} G(s, a';\theta , \alpha )\bigg ) \end{aligned}$$
(35)
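A minimal PyTorch sketch of such a two-stream network, combining the state-value and action-advantage streams with the averaging rule of Equation (35), is given below; the layer sizes and the shared hidden layer are illustrative design choices.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Two-stream Q-network: a state-value stream V and an action-advantage
    stream G, combined as in Eq. (35)."""
    def __init__(self, statenum, actnum, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(statenum, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)           # V(s; theta, beta)
        self.advantage_stream = nn.Linear(hidden, actnum)  # G(s, a; theta, alpha)

    def forward(self, state):
        h = self.shared(state)
        v = self.value_stream(h)                  # shape [B, 1]
        g = self.advantage_stream(h)              # shape [B, actnum]
        # Eq. (35): subtract the mean advantage so the two streams are identifiable
        return v + g - g.mean(dim=1, keepdim=True)
```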
ImprovedDQL-TTA is an intelligent algorithm for addressing sequential decision-making problems in mobile crowdsourcing systems. It is implemented with mini-batch stochastic gradient descent, an \(\epsilon\)-decreasing greedy policy, and a novel network architecture with an action advantage function. To intelligently develop an appropriate strategy, ImprovedDQL-TTA is built with a multilayer network that takes the crowdsourcing state, encoded as a \([1 \times statenum]\) vector, and learns the best action (mobile worker) by mapping all possible actions to a vector of length actnum. In summary, the pseudocode for improved deep Q-learning-based trust-aware task allocation is presented in Algorithm 1.
ImprovedDQL-TTA is able to effectively identify an optimal solution for the large-scale MCMDP. ImprovedDQL-TTA operates by learning to optimize the expected reward of selecting an action for a given state and discovering the optimal action-selection policy to stably adapt to changes in a large-scale environment.