In the case where we know nothing about the unobserved regions, any selection is as good as any other. The solutions so far have not considered the case where knowing something about one region tells us something about another region. For example, we initially assume no information, i.e. \(F(r) = TRUE\) with probability \(\alpha\) (and \(F(r) = FALSE\) with probability \(1\,{-}\, \alpha\)). The value of \(\alpha\) may be estimated from some initial density measure if a priori information is available, but here, we take \(\alpha\) to be the true proportion of regions in *R* that satisfy *F*. Hence, the success factor \(\sigma\) has been approximated via \(\alpha\) in the experiments above.

It can be seen from the definitions of *Q*, \(Q'\), *T* and \(T'\) above and their monotonically increasing properties that if \(\sigma\) were to increase, we could reduce the number of questions and the number of rounds. In this section and the next, we consider heuristics that can improve the success factor in each round of querying.

Given direct observation of a region, *F*(*r*) must evaluate to true or false, but without direct observation, we can only estimate the probability of *F*(*r*) being true or false in some way. Note that we say we *observe* a region whenever we ask the crowd a question about it.

To estimate \(Pr(F(r) = TRUE)\) given that *r* has not been observed, we introduce the neighbourhood association factor \(\delta\) (\({>}0\)), which represents the informational relationship between neighbouring regions: knowing something certain about a region *q* tells us something about its neighbouring regions. That is, if *r* and *q* are two neighbouring regions, and we have observed that \(F(q) = TRUE\) but have not observed *r*, then we set:

$$\begin{aligned} Pr(F(r) = TRUE~|~F(q) = TRUE)& = \alpha \cdot (1+\delta ) \\ Pr(F(r) = FALSE ~|~F(q) = TRUE)& = 1\,{-}\, \alpha \cdot (1+\delta ) \end{aligned}$$

where \(\delta\) is chosen so that \(0 \le \alpha \cdot (1+\delta ) \le 1\). Also, if we directly observed that \(F(q) = FALSE\), then using Bayes' rule:

$$\begin{aligned} Pr(F(r) &= TRUE~|~F(q) = FALSE) \\ & = Pr(F(r) = TRUE) \cdot \frac{Pr( F(q) = FALSE~|~F(r) = TRUE)}{Pr(F(q) = FALSE)} \\ & = \alpha \cdot \frac{ 1\,{-}\, \alpha \cdot (1+\delta ) }{1\,{-} \,\alpha } \\ Pr(F(r) &= FALSE~|~F(q) = FALSE) = 1\,{-}\,\alpha \cdot \frac{ 1\,{-}\, \alpha \cdot (1+\delta ) }{1\,{-}\, \alpha } \end{aligned}$$

For example, if \(\alpha\) is 0.5 and \(\delta\) is 0.1, then \(Pr(F(r) = TRUE~|~F(q) = TRUE) = 0.55 > 0.5\). In other words, as we observe more regions, given the association among regions, we might be able to do better than randomly selecting a set of regions to ask about in each round; we can select regions with a higher probability of evaluating *F* to *TRUE* based on such association information. Also, if a region should be false with probability \(1\,{-}\,\alpha\), on observing that its neighbour is *TRUE*, its probability of being *FALSE* is reduced.
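The single-neighbour update rules above can be checked with a short sketch (the helper names below are ours, not from the algorithm; \(\alpha\) and \(\delta\) are as defined in the text):

```python
# Sketch of the single-neighbour update rules; function names are
# illustrative, not part of the algorithm's definitions.

def p_true_given_true(alpha, delta):
    # Pr(F(r) = TRUE | F(q) = TRUE) = alpha * (1 + delta)
    return alpha * (1 + delta)

def p_true_given_false(alpha, delta):
    # Pr(F(r) = TRUE | F(q) = FALSE), obtained via Bayes' rule
    return alpha * (1 - alpha * (1 + delta)) / (1 - alpha)

alpha, delta = 0.5, 0.1
print(round(p_true_given_true(alpha, delta), 4))   # 0.55, as in the text
print(round(p_true_given_false(alpha, delta), 4))  # 0.45 < alpha
```

As expected, a TRUE neighbour raises the estimate above \(\alpha\) and a FALSE neighbour lowers it below \(\alpha\).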

More precisely, let *N* be a function that returns the immediate neighbours of a region, i.e. \(N(r) \subseteq R\) is the set of regions sharing a boundary with *r* defined in some way. *N*(*r*) would have eight members at most if *R* is divided into a grid of rectangular regions (including diagonally adjacent regions).
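For the grid case, *N*(*r*) can be computed directly; a minimal sketch (our own helper, assuming regions are identified by (row, column) cells):

```python
# N(r) for a grid of rectangular regions, including diagonally
# adjacent cells, so interior regions have eight neighbours.

def neighbours(r, rows, cols):
    i, j = r
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < rows and 0 <= j + dj < cols]

print(len(neighbours((5, 5), 10, 10)))  # 8 for an interior region
print(len(neighbours((0, 0), 10, 10)))  # 3 for a corner region
```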

From the point of view of the unobserved region *r*, it is possible that multiple neighbouring regions have been observed, and so, we need to combine the influence from multiple observed neighbours.

For example, for a given region, if one of its neighbours \(q_1\) is such that \(F(q_1) = FALSE\) and another two neighbours \(q_2\) and \(q_3\) are such that \(F(q_2) = F(q_3) = TRUE\), then by Bayes' rule (where \(H = \alpha \cdot ( 1\,{-}\, \alpha \cdot (1+\delta ) ) \cdot (\alpha \cdot (1 + \delta )) \cdot (\alpha \cdot (1 + \delta ))\)):

$$\begin{aligned} Pr(F(r) &= TRUE~|~(F(q_1) = FALSE \wedge F(q_2) = TRUE \wedge F(q_3) = TRUE)) \\& = \frac{H }{H + (1 - \alpha ) \cdot \left( {1} - \alpha \cdot \frac{ 1- \alpha \cdot (1+\delta ) }{1 - \alpha } \right) \cdot \left( \alpha \cdot \frac{ 1 - \alpha \cdot (1+\delta ) }{1- \alpha } \right) \cdot \left(\alpha \cdot \frac{ 1-\alpha \cdot (1+\delta ) }{1- \alpha }\right) } \end{aligned}$$

Given a region *r*, and that we observed some subset of the neighbours of *r*, say \((A \cup B) \subseteq N(r)\), where *A* are neighbours where *F* evaluated to FALSE and *B* are neighbours where *F* evaluated to TRUE, then using *obs*(*N*(*r*)) to denote the observed neighbours of *r*, by a Bayesian approach of combining information, we have what we call the *neighbourhood formula*, where \(H' = \alpha \cdot ( 1 \,{-}\, \alpha \cdot (1+\delta ) )^{|A|} \cdot (\alpha \cdot (1 + \delta ))^{|B|}\):

$$\begin{aligned} &Pr\left( F(r) = TRUE~~\middle|~~\bigwedge _{p \in A} (F(p) = FALSE)~ \wedge ~ \bigwedge _{q \in B} (F(q) = TRUE)~ \wedge ~(A \cup B) = obs(N(r)) \right) \\ & \qquad = \frac{H'}{H' + (1 - \alpha )\cdot \left( 1 - \alpha \cdot \frac{1 - \alpha \cdot (1 + \delta )}{1 - \alpha }\right) ^{|A|} \cdot \left( \alpha \cdot \frac{1 - \alpha \cdot (1 + \delta )}{1 - \alpha }\right) ^{|B|} } \end{aligned}$$

Note that the above is merely a heuristic for estimating the probability of a region satisfying *F*; our guess could turn out to be completely wrong upon observation, i.e. given current observations *obs*, we may estimate that \(Pr(F(r) = TRUE~ | ~obs) > 0.5\) but later observe that \(F(r) = FALSE\). Also, for simplicity, we have made a Markov-inspired assumption in that we compute the probability based only on observed regions in the neighbourhood of *r*, and do not consider any influence from regions beyond the neighbourhood, i.e., using *obs*(*R*) to denote observed regions in the entire area *R*:

\(Pr(F(r) = TRUE~|~obs(N(r))~) = Pr(~F(r) = TRUE~|~obs(R)).\)
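The neighbourhood formula can be sketched directly from the definitions above. This is our own illustrative implementation, where `n_false` stands for |A| and `n_true` for |B|:

```python
# Sketch of the neighbourhood formula; names are ours.

def p_true_given_neighbours(alpha, delta, n_false, n_true):
    # H' = alpha * (1 - alpha*(1+delta))^|A| * (alpha*(1+delta))^|B|
    h = (alpha * (1 - alpha * (1 + delta)) ** n_false
         * (alpha * (1 + delta)) ** n_true)
    # q = alpha * (1 - alpha*(1+delta)) / (1 - alpha), the factor
    # appearing in the FALSE-neighbour case above
    q = alpha * (1 - alpha * (1 + delta)) / (1 - alpha)
    rest = (1 - alpha) * (1 - q) ** n_false * q ** n_true
    return h / (h + rest)

# With no observed neighbours the estimate falls back to alpha;
# with a single TRUE neighbour it reduces to alpha * (1 + delta).
print(round(p_true_given_neighbours(0.5, 0.1, 0, 0), 4))  # 0.5
print(round(p_true_given_neighbours(0.5, 0.1, 0, 1), 4))  # 0.55
```

These two special cases agree with the single-neighbour rules given earlier, which is a useful consistency check on the formula.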

If we are using solution (3), in each round *i*, for simplicity, we compute probabilities only for regions not yet observed, with the aim of choosing the \(k{-}k_i\) regions most likely to evaluate *F* to *TRUE*, and we use only observed information. For example, an unobserved region *r* that has no observed neighbours will have \(Pr(F(r) = TRUE) = \alpha\) even if all its unobserved neighbours *q* have estimated \(Pr(F(q) = TRUE~|~ obs) > \alpha\) given some observations *obs*.

In the random spatial crowdsourcing algorithm, *SpatialCrowdsourcing (k, F, R)* given above, chooseCandidates(\(c,R \backslash O\)) chooses *c* candidates from \(R \backslash O\) at random. In the associative spatial crowdsourcing algorithm, chooseCandidates(\(c,R \backslash O\)) instead selects the *c* regions in \(R \backslash O\) with the highest probability of *F* evaluating to *TRUE*: for each region \(r \in R \backslash O\), we compute the probability of \(F(r) = TRUE\) using the neighbourhood formula above and select the *c* regions with the highest probabilities, choosing randomly among regions of equal probability.
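The associative chooseCandidates step can be sketched as follows (our own code; `prob` is assumed to map each unobserved region to its estimated \(Pr(F(r) = TRUE)\) from the neighbourhood formula):

```python
import random

def choose_candidates(c, unobserved, prob):
    pool = list(unobserved)
    # Shuffle first so that Python's stable sort breaks
    # equal-probability ties randomly, as described in the text.
    random.shuffle(pool)
    pool.sort(key=lambda r: prob[r], reverse=True)
    return pool[:c]
```

The random variant corresponds to the shuffle alone, without the probability-ordered sort.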

This slight variation to solution (3) above using neighbourhood association is given by this definition of \(chooseCandidates(\cdot )\) in Algorithm 4.

### Experiments with randomly generated area maps

We study the effect that the extent of clustering has on this heuristic as *k* and \(\delta\) are varied. In the first set of experiments, we generate area maps with \(\alpha\) set to values within the range [0.15, 0.20] and clustering introduced so that whenever three ‘1’s surround a region, the region will be a ‘1’ (otherwise the region is either ‘1’ or ‘0’ with equal probability).
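One possible reading of this generation procedure is sketched below; this is only an illustrative interpretation (the exact rule used in the experiments may differ, and the base probability `p` of drawing a ‘1’ would need to be tuned to reach the stated \(\alpha\) range):

```python
import random

def generate_map(rows, cols, p=0.5, seed=None):
    rng = random.Random(seed)
    # Base draw: each cell is '1' with probability p.
    grid = [[1 if rng.random() < p else 0 for _ in range(cols)]
            for _ in range(rows)]
    clustered = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            ones = sum(grid[i + di][j + dj]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0)
                       and 0 <= i + di < rows and 0 <= j + dj < cols)
            if ones >= 3:
                # Clustering rule: three surrounding '1's force a '1'.
                clustered[i][j] = 1
    return clustered
```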

Figure 4 shows the results of associative spatial crowdsourcing compared with random spatial crowdsourcing (RSC-NR) as *k* is varied for a range of \(\delta\) values—the number of questions used and the number of rounds used are averages over 1000 runs with the same region map. It can be seen that even with a small \(\delta\) (\({=}0.1\)), associative spatial crowdsourcing yields, on average, a significant reduction both in the number of questions used (up to 30–40 %) and in the number of rounds required to find the *k* positive regions (as low as a third or half of the rounds required with RSC-NR). The reductions are proportionately larger with larger *k*. Larger values of \(\delta\) (\({>}0.1\)) do not seem to yield much further improvement.

Note, however, that with little clustering, associative spatial crowdsourcing provides little to no advantage, and can even do slightly worse since it assumes clustering where there is none. However, as we show in the following examples, contiguous and clustered regions (fortunately) occur in a range of real-world scenarios. Below, we use maps sourced from real-world applications as a starting point representing the current state of the world from which we want to find regions of interest.

### Experiments on finding parking

We consider using spatial crowdsourcing to look for regions with parking spaces. For our experiments, we use a parking map abstracted from San Francisco parking census data, dividing an area into 26 × 20 regions, as illustrated in Fig. 5, which shows the locations of parking lots. The problem we address is then: given the parking map, which we assume captures the current state of the world with regard to parking in that area, find *k* = 5 or *k* = 40 regions where parking is available, using crowdsourcing. (Note that, in reality, there could be fewer regions with available parking since some of the parking spaces would have been taken up.) Hence, a query asks whether there are parking spaces in a region of size 37 by 37 m; for simplicity, answers are binary, YES or NO, and we assume truthfulness in the answers given.

Figure 6 shows the average over 1000 runs of results (number of questions and number of rounds) with two values of *k* (5 and 40). The median and standard deviation are included to indicate that there is a fair amount of variability between runs. With *k* = 5 in Fig. 6a, b, we see that with a large enough \(\delta\) (e.g., 1.7, i.e., a strong association between neighbouring positive regions), the algorithm can effectively zoom in on positive regions faster than a random approach (RSC-NR, i.e., \(\delta = 0\)), resulting, on average, in a 40 % reduction in the number of rounds and a 25 % reduction in the number of questions used at the same time, i.e., it is not a trade-off of rounds against questions but a reduction in both. However, with the standard deviation sometimes over 40 % of the average rounds and questions, there is substantial variability among runs, so gains can be small. A similar result is observed for *k* = 40, 100 in Fig. 6c–h, with proportionate reductions in the number of rounds and questions, on average. The type of clustering observed in the parking map made it susceptible to gains using our neighbourhood association heuristic. As before, gains can be obtained with just \(\delta = 0.1\), with little improvement for \(\delta > 0.1\).

### Experiments on finding crowds

In this experiment, we simulate the use of crowdsourcing to find where the crowds are in a city or urban setting. We use a crowd map obtained from the MIT Citysense project,^{Footnote 2} abstracted into 126 × 148 regions, each region roughly 28.5 × 28.5 m in size. Figure 7 illustrates the map we use, which we assume represents the current real state of the urban area; the problem is then, given this state of the world, to find *k* = 5, 40, 100 or 3000 regions where there are crowds, using crowdsourcing. Again, for simplicity, we assume binary answers to a query on each region: is there a crowd here or not?

Figure 8 shows our results when finding *k* = 5, 20, 40, 100 and 3000 crowded regions. Similar to the previous case study, our results show a considerable reduction (up to 70 %) in the number of rounds required and up to a 60 % reduction in the number of questions required, on average, with *k* = 5, 100 and 3000. This is due to the clustering in the crowd map, which occurs to a higher degree than in the parking map. These results show that neighbourhood association can be extremely useful in deciding which regions to ask about when looking for crowded regions—neighbours of crowded regions tend to be crowded.

### Experiments on finding coverage/bandwidth

In this experiment, we simulate finding regions where there is coverage (or adequate bandwidth) for 3G/4G networking. The assumed current coverage/bandwidth map is taken from OpenSignal as illustrated in Fig. 9. We want to find *k* = 5, 20, 40, 100 or 1000 regions where there is coverage, using crowdsourcing. We have 103 × 77 regions, each region corresponding to roughly 100 by 100 m in size. Each query will determine if each such region has 3G/4G coverage or adequate bandwidth.

Similar to the previous two experiments, Fig. 10 shows that a significant reduction (up to 60 %) in the number of rounds required can be achieved, and 40–50 % reductions in the number of questions required are observed, for all values of *k* used.