Most TTA optimization approaches require prior knowledge and are therefore not applicable in dynamic mobile crowdsourcing environments, where the availability of mobile workers changes frequently and unpredictably [2, 6]. Consider requesters submitting spatial tasks through a mobile crowdsourcing system, where the tasks arrive in an online manner. In such a scenario, the mobile crowdsourcing system has no prior information about the spatial tasks or the mobile workers.
The crowdsourcing TTA optimization problem is modeled as a Markov decision process-based mobile crowdsourcing (MCMDP) problem. Deep Q-learning is introduced to address the MCMDP problem. Furthermore, we propose an improved deep Q-learning-based trust-aware task allocation (ImprovedDQLTTA) algorithm by combining trust crowdsourcing optimization and deep Q-learning, which enables the learning agent to solve large-scale MCMDP problems in an uncertain scenario.
MDP model for uncertain mobile crowdsourcing
To address the dynamic problems of uncertain crowdsourcing TTA, a Markov decision process (MDP) is adopted. The MDP is a typical framework for modeling sequential decision-making problems under uncertainty [15]. In this paper, the MDP is applied to describe the trust-aware task allocation and adaptation processes in uncertain mobile crowdsourcing.
A mobile crowdsourcing MDP consists of a five-tuple \(\langle S, A, P, R, O\rangle\), where S is a state space composed of a finite set of crowdsourcing states, A is a crowdsourcing action space composed of a finite set of actions, P is the transition function for reaching the next crowdsourcing state \(s'\) from state s when an action \(a \in A(s)\) is performed by a crowdsourcing agent, R is a real-valued crowdsourcing reward function, from which the agent receives an immediate reward \(r = R(s' \mid s,a)\), and O is the crowdsourcing observation space in which the agent can fully observe the mobile crowdsourcing decision environment. On this basis, the mobile crowdsourcing MDP can be defined as follows.
Definition 6
(Mobile Crowdsourcing MDP (MCMDP)) An MCMDP is formally defined as a seven-tuple: MCMDP\(=\langle S^{i}, s_{0}^{i}, s_{r}^{i}, A^{i}, P^{i}, R^{i}, O^{i} \rangle\), where:

\(S^{i}\) is the set of tasks in the state space of a particular crowdsourcing process, as partially observed by agent i.

\(s_{0}^{i} \in S^{i}\) is the initial task; any execution of the mobile crowdsourcing process begins from this task.

\(s_{r}^{i} \in S^{i}\) represents the terminal task. When the terminal task is reached, the execution of the mobile crowdsourcing process terminates.

\(A^{i}\) is the set of mobile workers that can perform tasks \(s \in S^{i}\), and mobile worker cw belongs to \(A^{i}\) only if the precondition is satisfied by s.

P is a probability value, that is, a transition distribution \(P(s' \mid s, a)\) that determines the probability of reaching the next state \(s'\) from state s if action \(a \in A(s)\) is performed by a crowdsourcing agent. The probability distribution \(P(s' \mid s, a)\) satisfies
$$\begin{aligned} \sum _{s' \in S} P(s' \mid s, a) = 1, \forall s \in S,\forall a \in A. \end{aligned}$$
(7)

\(R^{i}\) is the reward function: when mobile worker \(cw \in A^{i}\) is invoked, agent i transits from s to \(s'\) and obtains an immediate reward \(r^i\), whose expected value is \(R^{i}(s' \mid s, cw)\). Consider selecting mobile worker cw with multiple quality criteria, where agent i receives the following quality vector as a reward:
$$\begin{aligned} \begin{aligned} QoS(s,cw,s') =&[f_{tr}(s,cw,s'), f_{dist}(s,cw,s')]^{T}, \end{aligned} \end{aligned}$$
(8)
where each \(f_k(\cdot )\) denotes a quality attribute of mobile worker cw.

O is the crowdsourcing observation space in which the agent can fully observe the mobile crowdsourcing decision environment.
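The components of Definition 6 can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `MCMDP`, its field names, and the toy task and worker identifiers are hypothetical, and the reward and observation components are omitted for brevity. The `is_valid` method checks the normalization condition of Equation (7).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str   # a spatial task (hypothetical encoding)
Action = str  # a mobile worker cw (hypothetical encoding)


@dataclass
class MCMDP:
    states: List[State]
    initial: State
    terminal: State
    actions: Dict[State, List[Action]]  # A(s): workers eligible for task s
    transitions: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check Eq. (7): every distribution P(. | s, a) sums to one."""
        return all(
            abs(sum(dist.values()) - 1.0) <= tol
            for dist in self.transitions.values()
        )


# Toy instance with two tasks and two workers (illustrative values only).
toy = MCMDP(
    states=["t0", "t1", "done"],
    initial="t0",
    terminal="done",
    actions={"t0": ["cw1", "cw2"], "t1": ["cw1"]},
    transitions={
        ("t0", "cw1"): {"t1": 0.9, "t0": 0.1},
        ("t0", "cw2"): {"t1": 0.6, "t0": 0.4},
        ("t1", "cw1"): {"done": 1.0},
    },
)
```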
The MCMDP solution is a collection of TTA decision policies, each of which can be described as a procedure of trust-aware task allocation \(cw \in A\) by agent i in each state s. These policies, denoted as \(\pi\), map spatial tasks to mobile workers, that is, \(\pi : S \rightarrow A\). The MCMDP policy can be defined as a mobile crowdsourcing model. The main idea is to identify the optimal policy for trust-aware allocation in uncertain mobile crowdsourcing.
Deep Q-learning-based trust-aware task allocation algorithm
The above section analyzed the optimization problem of trust-aware allocation by means of the MCMDP model. The optimization objective is to maximize the long-term rewards of the MCMDP. The solution of the MCMDP can be denoted as a policy \(\pi\) that guides a learning agent to take the right action for the specific crowdsourcing state.
Dynamic task allocation with Q-learning
The uncertain mobile crowdsourcing problem can be formulated as an MCMDP model. However, the transition probabilities are not known, and we do not initially know the rewards of taking the allocation action. In this case, Q-learning is suggested for a crowdsourcing agent to determine the optimal policy. Q-learning is a temporal difference learning algorithm [14, 15] that takes into account the fact that the agent initially has only partial knowledge of the crowdsourcing MCMDP. In general, assume that an agent learns from experience to address uncertain mobile crowdsourcing. The agent can obtain a sequence of state-action rewards \(\langle s_1, a_1, r_1,s_2, a_2, r_2,\cdots , s_t, a_t, r_t \rangle\), which indicates that the agent was in state \(s_t\), selected action \(a_t\), and obtained reward \(r_t\). Figure 2 illustrates the sequence of the crowdsourcing state and state-action reward pairs.
Temporal difference learning agents determine the increment to \(V(s_t)\) in each time step. At time t, the agents immediately create an update by using discounted rewards and computing \(V(s_t)\). Temporal difference learning [15] can be defined as
$$\begin{aligned} V(s_t) = V(s_{t-1}) + \alpha \cdot \big (r_t+\gamma \cdot V(s_t) - V(s_{t-1})\big ). \end{aligned}$$
(9)
The goal of temporal difference learning agents is to update \(V(s_t)\) by \(R(s_t)+\gamma \cdot V(s_t)\) in each step. Tabular Q-learning is a common approach in temporal difference learning for maximizing total rewards. For each state s and action a, the tabular Q-learning algorithm takes an action, observes a reward r, enters a next state \(s'\), and updates Q(s, a). The core of the Q-learning algorithm is a simple iterative update of the value Q(s, a), which is accumulated as the current estimate of \(Q^{\pi }\) in each training iteration. The learning table values of Q(s, a) are revised by the following function:
$$\begin{aligned} Q^{\pi }(s, a) = (1 - \alpha ) \cdot Q(s, a)+ \alpha \cdot \Big (r + \gamma \cdot \max _{a'}Q(s', a')\Big ). \end{aligned}$$
(10)
The learning rate \(\alpha \in [0,1]\) determines how strongly the new sample \(r + \gamma \cdot \max _{a'}Q(s', a')\) overrides the existing estimate of \(Q^{\pi }(s, a)\), which contributes with weight \(1 - \alpha\); \(\gamma\) is the discount factor. The Q(s, a) values ultimately converge to the optimum value \(Q^{*}(s, a)\) [15]. Thus, the Q-learning-based allocation algorithm ultimately discovers an optimal policy for any finite MCMDP [6]. The basic optimization involves incorporating both the travel distance and the trust score of mobile workers into the dynamic mobile crowdsourcing decisions. Thus, the reward function of Q-learning-based TTA is defined as in Definition 7.
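The tabular update of Equation (10) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `q_update`, the dictionary-based table, and the task and worker identifiers are assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Apply Eq. (10) in place and return the new Q(s, a)."""
    # max_{a'} Q(s', a'); taken as 0 when no action is available in s'.
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)
    return q[(s, a)]


# Hypothetical transition: task "t0" allocated to worker "cw1" with reward 1.0.
q = defaultdict(float)  # unseen (state, action) pairs start at 0
q_update(q, "t0", "cw1", r=1.0, s_next="t1", next_actions=["cw1", "cw2"])
```

Initializing unseen entries to zero is a design choice of this sketch; any bounded initialization is compatible with the update rule.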
Definition 7
(Reward function) Suppose that the trustworthiness of a mobile worker completing a task is estimated by a trust score \(f_{tr}(x_i^j) = tr(x_i^j)\). Each mobile worker is required to move from location aloc to bloc when completing the spatial task, which incurs a distance cost \(f_{dist}(x_i^j)\). The distance cost is evaluated in terms of the distance \(f_{dist}(x_i^j)=dist(aloc, bloc)\) between aloc and bloc. As a result, the reward function is determined with QoS vectors \([f_{tr}(x_i^j), f_{dist}(x_i^j)]\). Owing to the different scale of each QoS objective, the QoS value is mapped into the interval [0, 1] with the min-max operator. The learning reward function then adopts the linearly weighted sum approach to calculate the value of all QoS objectives:
$$\begin{aligned} r = \sum _{k=1}^{2}w_{k} \cdot (f_k(x_{i}^{j}) - z_{k}^{U})/(z_{k}^{N} - z_{k}^{U}) \end{aligned}$$
(11)
In the training iterations, the learning agent estimates its optimal policy by maximizing the total of received crowdsourcing rewards in the uncertain scenario.
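As a worked illustration of Equation (11), the sketch below evaluates the weighted sum of normalized objectives literally as written. The function name `reward`, the argument names, and all numeric values are hypothetical; the bounds \(z_k^U\) and \(z_k^N\) are passed in as given because their definitions lie outside this section.

```python
def reward(objectives, z_u, z_n, weights):
    """Eq. (11): weighted sum of min-max normalized QoS objectives."""
    total = 0.0
    for f_k, zu_k, zn_k, w_k in zip(objectives, z_u, z_n, weights):
        # Normalize objective k into [0, 1] using its two bounds.
        total += w_k * (f_k - zu_k) / (zn_k - zu_k)
    return total


# Hypothetical worker: trust score 0.8, travel distance 2.0 (arbitrary units).
r = reward(
    objectives=[0.8, 2.0],
    z_u=[0.0, 0.0],
    z_n=[1.0, 10.0],
    weights=[0.5, 0.5],
)
```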
Dynamic task allocation with deep Q-learning
Tabular Q-learning is not a feasible solution owing to the large-scale state and action spaces in uncertain mobile crowdsourcing systems. Moreover, a Q-learning table is environment-specific and does not generalize. In large-scale uncertain systems, there are too many states and actions to store in machine memory, and learning the value of each state is a slow process. This section introduces a deep Q-learning-based task allocation mechanism that addresses these limitations.
To adapt to changes in large-scale mobile crowdsourcing systems, we propose a deep Q-learning-based trust-aware task allocation (DQLTTA) algorithm that combines advances in deep neural networks and Q-learning techniques. Specifically, the dynamic TTA problem is formalized as a Markov decision process-based mobile crowdsourcing model. The experience of a crowdsourcing state transition is denoted as \(s, a, r, s'\), and a set of crowdsourcing states and allocation actions with a transition policy constitute an MCMDP. One episode of an MCMDP forms a finite sequence of crowdsourcing states, allocation actions and rewards:
$$\begin{aligned} s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots , s_t, a_t, r_t, s_{t+1}, \ldots , s_{n-1}, a_{n-1}, r_{n-1}, s_n \end{aligned}$$
(12)
where \(s_t\) denotes the current state, \(a_t\) denotes the current action, \(r_t\) denotes the reward after performing an action, and \(s_{t+1}\) denotes the next state in the dynamic mobile crowdsourcing system.
The DQLTTA algorithm directly combines a deep neural network and Q-learning to solve the dynamic trust-aware allocation problem. The DQLTTA learning algorithm uses a value iteration approach, in which the crowdsourcing value function \(Q = Q(s, a; \theta )\) is a parameterized function with parameter \(\theta\) that takes the crowdsourcing state s and the crowdsourcing action space A as inputs and returns a crowdsourcing Q value for each action \(a \in A\). Then, we can use a greedy approach to select a crowdsourcing action:
$$\begin{aligned} \pi (s) = \arg \max _{a \in A} Q(s, a; \theta ) \end{aligned}$$
(13)
DQLTTA iteratively solves the mobile crowdsourcing MDP problem by learning the weights of the deep neural network towards the optimization objective. The DQLTTA algorithm differs from Q-learning in two ways. Traditional Q-learning is based on the Bellman equation, and the Q value is iteratively updated: \(Q_{t+1}(s,a) = {\mathbb {E}} [r + \gamma \cdot \max _{a'}Q_t(s', a') \mid s, a]\). Q-learning algorithms with value iterations are impractical for large-scale crowdsourcing problems. Thus, it is practical to employ a function approximation to estimate the action value function, \(Q(s, a; \theta )\approx Q^{*}(s, a)\).
DQLTTA is designed as a function approximation with weight \(\theta\) for the mobile crowdsourcing MDP problem. The parameters of the DQLTTA function approximation can be learned by minimizing the loss function \(L(\theta _t)\), which is optimized at iteration t:
$$\begin{aligned} L(\theta _t) = {\mathbb {E}}_\pi \Big [\big (y_t - Q(s, a; \theta _t)\big )^{2} \Big ] \end{aligned}$$
(14)
where \(y_t\) is the target value for iteration t and can be computed as
$$\begin{aligned} y_t = \left\{ \begin{array}{ll} r_t, & \text {if } A(s') = \varnothing \\ r_t + \gamma \cdot \max _{a'}Q(s', a'; \theta _{t-1}), & \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(15)
DQLTTA considers the crowdsourcing states and allocation actions as the inputs of a deep Q-network and outputs the Q-value for dynamic allocations. Figure 3 illustrates the deep Q-learning-based trust-aware task allocation (DQLTTA) algorithm framework.
Dynamic task allocation with improved deep Q-learning
As discussed in [17, 33,34,35], the performance of deep Q-learning algorithms may not be stable. To improve the overall performance of DQLTTA, an improved DQLTTA algorithm (ImprovedDQLTTA) is further proposed to handle large-scale MCMDP problems much more stably in uncertain mobile crowdsourcing environments. The proposed ImprovedDQLTTA algorithm incorporates the following mechanisms: (i) a mini-batch stochastic gradient descent approach with advanced training mechanisms; (ii) an \(\epsilon\)-decreasing greedy policy; and (iii) a novel deep neural network architecture with an action advantage function.
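The target of Equation (15) and a sample estimate of the loss in Equation (14) can be sketched as follows. This is a minimal sketch under stated assumptions: the Q values of the next state are supplied as a plain list (an empty list stands for \(A(s') = \varnothing\)), and the function names `td_target` and `squared_loss` are hypothetical.

```python
def td_target(r, next_q_values, gamma=0.9):
    """Eq. (15): r if no action is available in s', else r + gamma * max Q."""
    if not next_q_values:
        return r
    return r + gamma * max(next_q_values)


def squared_loss(batch, gamma=0.9):
    """Sample estimate of Eq. (14) over (q_sa, r, next_q_values) triples."""
    errors = [
        (td_target(r, next_q, gamma) - q_sa) ** 2
        for q_sa, r, next_q in batch
    ]
    return sum(errors) / len(errors)


# Two hypothetical transitions: one terminal, one with two next actions.
loss = squared_loss([(1.0, 1.0, []), (0.0, 1.0, [0.5, 2.0])], gamma=0.5)
```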
Mini-batch stochastic gradient descent. The parameters of ImprovedDQLTTA from an earlier training iteration \(\theta _{t-1}\) are fixed while optimizing the loss function \(L(\theta _t)\). Note that the targets rely on the ImprovedDQLTTA weight parameters. The gradient step toward a local minimum of the loss function is obtained as follows:
$$\begin{aligned} \begin{aligned} \varDelta \theta _t&= - \frac{1}{2} \eta \cdot \triangledown _{\theta }(L(\theta _t)) \\&= \eta \cdot {\mathbb {E}}_\pi \Big [r + \gamma \cdot \max _{a'}Q(s', a'; \theta _{t-1}) - Q(s, a; \theta _t)\Big ] \cdot \triangledown _{\theta } Q(s, a; \theta _t) \end{aligned} \end{aligned}$$
(16)
Instead of calculating the full expectation in the above gradient, the loss function of the ImprovedDQLTTA is computationally optimized by stochastic gradient descent [17]. The weights of the ImprovedDQLTTA approximation are trained using a gradient descent rule, and the parameter \(\theta\) can be updated using stochastic gradient descent by
$$\begin{aligned} \begin{aligned} \varDelta \theta _t&= \eta \cdot \big (r + \gamma \cdot \max _{a'}Q(s', a'; \theta _{t-1}) - Q(s, a; \theta _t)\big ) \cdot \triangledown _{\theta } Q(s, a; \theta _t) \\ \theta _{t}&= \theta _t - \eta \cdot \varDelta \theta _t \end{aligned} \end{aligned}$$
(17)
Stochastic gradient descent is simple and appealing for DQLTTA; however, it is not sample efficient. In this paper, mini-batch stochastic gradient descent learning is therefore proposed to discover the optimal fitting value function of ImprovedDQLTTA by training on mini-batches of crowdsourcing data. Instead of making decisions based solely on the current allocation experience, the allocation experience replay helps the ImprovedDQLTTA network to learn from several mini-batches of crowdsourcing data. Each of these allocation experiences is stored as a four-tuple \(\langle state, action, reward, next state\rangle\). During training iteration t, allocation experience \(e_t = (s_t, a_t, r_t, s_{t+1})\) is stored in a replay memory \(D = \{e_1, ..., e_t\}\). The memory buffer of the allocation experience replay has a fixed size, and as new allocation experiences are inserted, the oldest experiences are removed [19]. To train the ImprovedDQLTTA neural networks, mini-batches of experiences are sampled uniformly at random from the allocation memory buffer.
To obtain stable Q-values, a separate target network is used to compute the targets in the loss function; the weights of this second neural network change more slowly than those of the primary Q-network [35]. In this context, the ImprovedDQLTTA algorithm maintains two separate neural networks \(Q(s, a; \theta )\) and \(Q(s, a; {\hat{\theta }})\) with current learning parameters \(\theta\) and previous learning parameters \({\hat{\theta }}\). The parameters \(\theta\) are updated many times during the training iterations and are cloned to the previous parameters \({\hat{\theta }}\) every \(NUM_{training}\) iterations. With a mini-batch of size b, the parameters \(\theta\) are updated as
$$\begin{aligned} \theta _{t} = \theta _t - \eta _t \cdot \frac{1}{b} \sum _{t=k}^{k+b} \varDelta \theta _t \end{aligned}$$
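A fixed-size experience replay memory with uniform sampling can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class name `ReplayBuffer`, its method names, and the integer-encoded experiences are assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory D of (s, a, r, s_next) experiences."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self.memory = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        """Draw a uniform mini-batch without replacement."""
        k = min(batch_size, len(self.memory))
        return rng.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


# Store five hypothetical experiences in a buffer that holds only three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(step, 0, 1.0, step + 1)
batch = buffer.sample(2)
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive experiences, which is the usual motivation for experience replay.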
(18)
ImprovedDQLTTA is updated with a batch of samples collected in the experience replay buffer by means of mini-batch stochastic gradient descent at each decision epoch.
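The mini-batch update with a periodically cloned target network can be sketched as follows. This is an illustration only: it assumes a linear approximator \(Q(s, a; \theta ) = \theta _a \cdot \phi (s)\) in place of a deep network, the names `q_value`, `minibatch_step`, and `sync_target` are hypothetical, and the step is written in the direction that reduces the squared TD error, which is a sign convention assumed here.

```python
def q_value(theta, features, action):
    """Linear Q(s, a; theta): dot product of the action's weights and phi(s)."""
    return sum(w * x for w, x in zip(theta[action], features))


def sync_target(theta):
    """Clone the current parameters into the target parameters theta_hat."""
    return [row[:] for row in theta]


def minibatch_step(theta, theta_hat, batch, eta=0.1, gamma=0.9):
    """Return new parameters after one averaged mini-batch step."""
    grads = [[0.0] * len(row) for row in theta]
    for features, action, reward, next_features in batch:
        if next_features is None:  # terminal transition
            target = reward
        else:  # target computed with the frozen parameters theta_hat
            target = reward + gamma * max(
                q_value(theta_hat, next_features, a)
                for a in range(len(theta_hat))
            )
        td_error = target - q_value(theta, features, action)
        for j, x in enumerate(features):
            grads[action][j] += td_error * x / len(batch)
    return [
        [w + eta * g for w, g in zip(row, grad_row)]
        for row, grad_row in zip(theta, grads)
    ]


# Two actions, one feature; a single hypothetical terminal experience.
theta = [[0.0], [0.0]]
theta_hat = sync_target(theta)
new_theta = minibatch_step(theta, theta_hat, [([1.0], 0, 1.0, None)], eta=0.5)
```

In a training loop, `sync_target` would be called every \(NUM_{training}\) iterations, while `minibatch_step` runs at each decision epoch.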
Theorem 1
(The convergence analysis of mini-batch stochastic gradient descent) Assume that there are two constants A and B that satisfy \({\mathbb {E}}[\Vert \triangledown h_b(\theta )\Vert ^2]\le A^2\) and \({\mathbb {E}}[\Vert \theta ^* - \theta _t\Vert ^2]\le B\), where t denotes the gradient optimization iteration and
$$\begin{aligned} \triangledown h_b(\theta ) = \frac{1}{b} \sum _{t=k}^{k+b} \varDelta \theta _t \end{aligned}$$
(19)
Let \(h_{min}(\theta )=min\{h(\theta _1), h(\theta _2), \cdots , h(\theta _t)\}\) and assume that
$$\begin{aligned} 1> \eta _t > 0, \sum _{t=0}^{\infty } \eta _t^2 < \infty , \sum _{t=0}^{\infty } \eta _t = \infty \end{aligned}$$
(20)
When the optimization of the mini-batch approach reaches iteration \(t+1\), then
$$\begin{aligned} \begin{aligned} \Vert \theta _{t+1} - \theta ^{*}\Vert ^2&= \Vert \theta _{t} - \eta _t \cdot \triangledown h_b(\theta ) - \theta ^{*}\Vert ^2\\&= \Vert \theta _{t} - \theta ^{*}\Vert ^2 - 2 \eta _{t} \cdot \triangledown h_b(\theta ) \cdot (\theta _{t} - \theta ^{*}) + \eta _t^2 \cdot \Vert \triangledown h_b(\theta ) \Vert ^2 \end{aligned} \end{aligned}$$
(21)
Taking the conditional expectation given \(\theta _t\), and using the bound \(A^2\) together with the first-order inequality \(\triangledown h(\theta _t) \cdot (\theta _{t} - \theta ^{*}) \ge h(\theta _t) - h(\theta ^*)\), which holds when h is convex, we obtain
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2 \mid \theta _t]&= {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2 \mid \theta _t] - 2 \eta _{t} \cdot {\mathbb {E}}[\triangledown h_b(\theta ) \cdot (\theta _{t} - \theta ^{*}) \mid \theta _t] + \\&\eta _t^2 \cdot {\mathbb {E}}[\Vert \triangledown h_b(\theta ) \Vert ^2 \mid \theta _t] \\&\le \Vert \theta _{t} - \theta ^{*}\Vert ^2 - 2 \eta _{t} \cdot (h(\theta _t) - h(\theta ^*)) + \eta _t^2 \cdot A^2 \end{aligned} \end{aligned}$$
(22)
Taking the expectation over \(\theta _t\) in Equation (22) yields
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2]&\le {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2] - 2 \eta _{t} \cdot {\mathbb {E}}[h(\theta _t) - h(\theta ^*)] + \eta _t^2 \cdot A^2 \end{aligned} \end{aligned}$$
(23)
Accordingly, applying this inequality over the iterations gives
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2]&\le {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2] - 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h(\theta _t) - h(\theta ^*)] + A^2 \cdot \sum _t \eta _t^2 \end{aligned} \end{aligned}$$
(24)
Since \({\mathbb {E}}[\Vert \theta _{t+1} - \theta ^{*}\Vert ^2] \ge 0\), we obtain
$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2] + A^2 \sum _t \eta _{t}^2&\ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h(\theta _t) - h(\theta ^*)] \\&\ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h_{min}(\theta _t) - h(\theta ^*)] \end{aligned} \end{aligned}$$
(25)
Since \({\mathbb {E}}[\Vert \theta _{t} - \theta ^{*}\Vert ^2] \le B\), we obtain
$$\begin{aligned} B + A^2 \sum _t \eta _{t}^2 \ge 2 \sum _t \eta _{t} \cdot {\mathbb {E}}[h_{min}(\theta _t) - h(\theta ^*)] \end{aligned}$$
(26)
and
$$\begin{aligned} {\mathbb {E}}[h_{min}(\theta _t) - h(\theta ^*)] \le \frac{B + A^2 \sum _t \eta _{t}^2}{2 \sum _t \eta _{t}} \end{aligned}$$
(27)
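Since \(\sum _{t=0}^{\infty } \eta _t^2 < \infty\) and \(\sum _{t=0}^{\infty } \eta _t=\infty\), the numerator of the bound in Equation (27) stays finite while the denominator diverges, so \(h_{min}(\theta ) \rightarrow h(\theta ^*)\).
Therefore, it can be concluded that ImprovedDQLTTA with mini-batch stochastic gradient descent converges to \(h(\theta ^*)\).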
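To illustrate how the bound in Equation (27) behaves, the following sketch evaluates its right-hand side for the step sizes \(\eta _t = 1/(t+2)\), which satisfy the conditions in Equation (20). The function name `bound` and the choice \(A = B = 1\) are assumptions made purely for illustration.

```python
def bound(num_steps, A=1.0, B=1.0):
    """Right-hand side of Eq. (27) for eta_t = 1/(t+2), t = 0..num_steps-1."""
    etas = [1.0 / (t + 2) for t in range(num_steps)]
    sum_eta = sum(etas)                     # grows without limit
    sum_eta_sq = sum(e * e for e in etas)   # stays bounded
    return (B + A ** 2 * sum_eta_sq) / (2 * sum_eta)


# The bound shrinks as the number of iterations grows.
values = [bound(n) for n in (10, 1000, 100000)]
```

The decrease is slow because the sum of the step sizes grows only logarithmically for this schedule.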
\(\epsilon\)-decreasing greedy policy. The ImprovedDQLTTA algorithm selects the allocation action a with the maximum Q value by exploiting the knowledge gained in the current state s. To build a better estimate of the optimal ImprovedDQLTTA function, the algorithm should also explore by selecting allocation actions other than the current best one. In this paper, the \(\epsilon\)-greedy policy is employed to select a random allocation action with probability \(\epsilon\) (\(0 \le \epsilon \le 1\)) and to select the allocation action that maximizes the Q value with probability \(1 - \epsilon\) [15]. By means of this strategy, ImprovedDQLTTA can achieve a trade-off between exploration and exploitation in uncertain mobile crowdsourcing systems. The \(\epsilon\)-greedy policy can be expressed as follows:
$$\begin{aligned} \pi (a \mid s) = \left\{ \begin{array}{ll} \frac{\epsilon }{actnum}+1 - \epsilon , & \text {if } a = a^{*} = \arg \max _{a \in A} Q(s, a)\\ \frac{\epsilon }{actnum}, & \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(28)
where actnum denotes the total number of available allocation actions.
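The action probabilities of Equation (28) and the corresponding sampling rule can be sketched as follows. The function names `epsilon_greedy_probs` and `select_action` and the example Q values are hypothetical.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Eq. (28): probability of each action under the epsilon-greedy policy."""
    actnum = len(q_values)
    best = max(range(actnum), key=lambda a: q_values[a])
    probs = [epsilon / actnum] * actnum
    probs[best] += 1.0 - epsilon
    return probs


def select_action(q_values, epsilon, rng=random):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Three candidate workers with hypothetical Q values.
probs = epsilon_greedy_probs([0.1, 0.7, 0.2], epsilon=0.3)
```

Because the uniform draw can also return the greedy action, the greedy action is selected with total probability \(\epsilon /actnum + 1 - \epsilon\), matching the first case of Equation (28).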
Theorem 2
(\(\epsilon\)-greedy policy improvement) For any \(\epsilon\)-greedy policy \(\pi\), the \(\epsilon\)-greedy policy \(\pi '\) with respect to \(q_{\pi }\) is an improvement, \(v_{\pi '}(s) \ge v_{\pi }(s)\).
$$\begin{aligned} \begin{aligned} q_{\pi }(s, \pi '(s))&= \sum _{a \in A}\pi '(a \mid s)q_{\pi }(s,a) \\&= \frac{\epsilon }{actnum} \sum _{a \in A}q_{\pi }(s,a) + (1 - \epsilon ) \max _{a \in A}q_{\pi }(s,a) \\&\ge \frac{\epsilon }{actnum} \sum _{a \in A}q_{\pi }(s,a) + (1 - \epsilon ) \sum _{a \in A} \frac{\pi (a \mid s) - \epsilon /actnum}{1 - \epsilon }q_{\pi }(s,a) \\&= \sum _{a \in A}\pi (a \mid s)q_{\pi }(s,a) \\&= v_{\pi }(s) \end{aligned} \end{aligned}$$
(29)
Therefore, the \(\epsilon\)-greedy policy is an improvement, \(v_{\pi '}(s) \ge v_{\pi }(s)\).
To maintain a good balance of exploration and exploitation, a suitable learning parameter should be selected for the \(\epsilon\)-greedy strategy. Early in training, a more random policy should be used to encourage initial exploration, and as training progresses, a more greedy policy should be considered. The training performance of ImprovedDQLTTA can be improved by using an \(\epsilon\)-greedy parameter that changes during training, which is defined as follows:
$$\begin{aligned} \epsilon = \epsilon - \frac{\epsilon _i - \epsilon _f}{explore} \end{aligned}$$
(30)
where \(\epsilon _i\) is the initial value of \(\epsilon\), \(\epsilon _f\) is the final value of \(\epsilon\), and explore is the total number of training steps.
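The decay rule of Equation (30) can be sketched as a one-line update. The function name `decay_epsilon` and the numeric schedule are hypothetical, and the floor at \(\epsilon _f\) is an assumption added here so that \(\epsilon\) does not fall below its final value if training continues beyond explore steps.

```python
def decay_epsilon(epsilon, eps_initial, eps_final, explore):
    """One step of Eq. (30), floored at eps_final (the floor is an assumption)."""
    step = (eps_initial - eps_final) / explore
    return max(eps_final, epsilon - step)


# Hypothetical schedule: decay from 1.0 to 0.1 over 9 steps, then hold.
epsilon = 1.0
history = []
for _ in range(20):
    epsilon = decay_epsilon(epsilon, eps_initial=1.0, eps_final=0.1, explore=9)
    history.append(epsilon)
```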
Novel neural network architecture with action advantage function. To further improve the convergence stability, a novel deep network architecture is integrated into ImprovedDQLTTA for learning the crowdsourcing decision process with an action advantage function [33,34,35]. The key idea of this mechanism is to design a novel neural network with two sequences of fully connected layers. In this way, the state values and the action advantages are learned separately by the novel ImprovedDQLTTA neural network. Figure 4 illustrates the novel neural network architecture.
For a stochastic policy \(\pi\), \(Q_{\pi }(s, a)\) and \(V_{\pi }(s)\) can be formulated as
$$\begin{aligned} \begin{aligned} Q_{\pi }(s, a)&= {\mathbb {E}}[R_t \mid s_t=s, a_t=a,\pi ] \\ V_{\pi }(s)&= {\mathbb {E}}_{a\sim \pi (s)}[Q_{\pi }(s, a)] \end{aligned} \end{aligned}$$
(31)
The action advantage function can be defined as
$$\begin{aligned} G_{\pi }(s, a) = Q_{\pi }(s, a) - V_{\pi }(s) \end{aligned}$$
(32)
Note that \({\mathbb {E}}_{a\sim \pi (s)}[G_{\pi }(s, a)]=0\). Intuitively, \(V_{\pi }(s)\) measures the value of being in a particular state s, whereas \(Q_{\pi }(s, a)\) measures the value of selecting action a in state s. Based on this definition, the relative importance of each crowdsourcing action can be obtained from the action advantage function \(G_{\pi }(s, a)\).
To estimate the values of the V and G functions, ImprovedDQLTTA is implemented with a novel neural network in which two streams of fully connected layers output the state value \(V(s;\beta )\) and the advantage vector \(G(s,a;\alpha )\). ImprovedDQLTTA combines \(V_{\pi }(s)\) and \(G_{\pi }(s, a)\) to obtain \(Q_{\pi }(s, a)\), as follows:
$$\begin{aligned} Q(s, a;\theta , \alpha ,\beta ) = V(s;\theta , \beta ) + G(s, a;\theta ,\alpha ) \end{aligned}$$
(33)
and
$$\begin{aligned} Q(s, a;\theta , \alpha , \beta )=V(s;\theta , \beta ) + \bigg (G(s, a;\theta , \alpha ) - \max _{a'} G(s, a';\theta , \alpha )\bigg ) \end{aligned}$$
(34)
where \(\alpha\) and \(\beta\) are the parameters of the two sequences of novel neural network layers. With Equation (34), the selected action has zero advantage: for \(a^*=\arg \max _{a \in A}Q(s, a; \alpha , \beta ) = \arg \max _{a \in A}G(s, a;\alpha )\), the function obtains \(Q(s, a^*; \alpha , \beta )=V(s;\beta )\). Furthermore, for better stability, an alternative module of ImprovedDQLTTA replaces the max operator with an average operator:
$$\begin{aligned} Q(s, a;\theta , \alpha , \beta ) = V(s;\theta , \beta ) + \bigg (G(s, a;\theta , \alpha ) - \frac{1}{|A|} \sum _{a'} G(s, a';\theta , \alpha )\bigg ) \end{aligned}$$
(35)
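The aggregation step of Equations (34) and (35) can be sketched on plain numbers as follows. This is a minimal sketch that takes the outputs of the two streams as given; the function name `dueling_q`, the `mode` argument, and the example values are hypothetical.

```python
def dueling_q(value, advantages, mode="mean"):
    """Combine V(s) and G(s, .) into Q(s, .) via Eq. (35) or Eq. (34)."""
    if mode == "mean":    # Eq. (35): subtract the average advantage
        baseline = sum(advantages) / len(advantages)
    elif mode == "max":   # Eq. (34): subtract the maximum advantage
        baseline = max(advantages)
    else:
        raise ValueError("mode must be 'mean' or 'max'")
    return [value + g - baseline for g in advantages]


# Hypothetical stream outputs for a state with three candidate workers.
q_mean = dueling_q(2.0, [1.0, 3.0, 2.0], mode="mean")
q_max = dueling_q(2.0, [1.0, 3.0, 2.0], mode="max")
```

Both variants shift all advantages by the same amount, so they rank the actions identically; with the max variant, the Q value of the best action equals the state value.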
ImprovedDQLTTA is an intelligent algorithm for addressing sequential decision-making problems of mobile crowdsourcing systems. ImprovedDQLTTA is implemented with mini-batch stochastic gradient descent, an \(\epsilon\)-decreasing greedy policy, and a novel network architecture with an action advantage function. To develop an appropriate allocation strategy, ImprovedDQLTTA is built with a multiple-layer network that takes the crowdsourcing state encoded in a \([1 \times statenum]\) vector and learns the best action (mobile worker), mapping all possible actions to a vector of length actnum. In summary, the pseudocode for improved deep Q-learning-based trust-aware task allocation is given in Algorithm 1.
ImprovedDQLTTA is able to effectively identify an optimal solution for the large-scale MCMDP. ImprovedDQLTTA operates by learning to optimize the expected reward of selecting an action for a given state and discovering the optimal action-selection policy to stably adapt to changes in a large-scale environment.