Let \(\gamma_{DT}\) be the regulated SNR for direct transmission from the *i*th SU source to the *k*th SU destination, which can be expressed as follows:

$$\gamma_{DT} \le \frac{{P_{tx,ik} \left| {h_{ik} } \right|^{2} }}{{Z_{ik} }}$$

(4)

where \(P_{tx,ik}\) is the transmission power of the *i*th SU source to the *k*th SU destination, \(h_{ik}\) is the channel gain between the *i*th SU source and the *k*th SU destination, and \(Z_{ik}\) is independent and identically distributed (IID) additive white Gaussian noise (AWGN). From Eq. (4), the minimum decodable transmission power for a given SNR is calculated in [29] as follows:

$$P_{tx,ik} = \gamma_{DT} \frac{{Z_{ik} }}{{\left| {h_{ik} } \right|^{2} }}$$

(5)
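As a small numerical illustration of Eq. (5), the sketch below computes the minimum transmission power for a target SNR. The function name `min_tx_power` and the numeric values are assumptions made for this example (all quantities in linear units); they are not taken from the paper.

```python
def min_tx_power(gamma_dt: float, noise: float, channel_gain_sq: float) -> float:
    """Eq. (5): smallest transmit power that meets the target SNR gamma_dt.

    channel_gain_sq is |h_ik|^2 and noise is Z_ik, both in linear units.
    """
    if channel_gain_sq <= 0:
        raise ValueError("channel power gain |h|^2 must be positive")
    return gamma_dt * noise / channel_gain_sq


# Illustrative values only (assumed, not from the paper).
p_min = min_tx_power(gamma_dt=10.0, noise=1e-9, channel_gain_sq=1e-7)
```

Substituting `p_min` back into the right-hand side of Eq. (4) recovers the target SNR, which is a quick consistency check on the rearrangement.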

When a maximum allowable transmission power for SUs \(P_{tx}^{max}\) is given, the transmission power range is divided into \(\left( {K_{1} + 1} \right)\) feasible power levels as follows:

$$P_{n} = \frac{n}{{K_{1} }}\left( {P_{tx}^{max} } \right), \quad n \in \left\{ {0,1, \ldots ,K_{1} } \right\}$$

(6)

where \(K_{1} > 0\) is an integer. Thus, the actual transmission power of the *i*th SU source to the *k*th SU destination will be determined by:

$$\bar{P}_{tx,ik} = \left\{ \begin{array}{ll} P_{n} , &\quad \text{if}\;\, P_{n - 1} < P_{tx,ik} \le P_{n} \\ P_{tx}^{max} , &\quad \text{otherwise} \\ \end{array} \right.$$

(7)
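The quantization in Eqs. (6) and (7) rounds the required power up to the next feasible level and clips it at \(P_{tx}^{max}\). The following is a minimal sketch of that rule; the function name `quantize_power` is hypothetical, and mapping a required power of exactly zero to \(P_{0}\) is an assumption, since Eq. (7) does not define \(P_{-1}\).

```python
def quantize_power(p_req: float, p_max: float, k1: int) -> float:
    """Map a required power onto the K1 + 1 levels of Eq. (6) using Eq. (7)."""
    if k1 <= 0 or p_max <= 0:
        raise ValueError("k1 must be a positive integer and p_max positive")
    if p_req < 0:
        raise ValueError("required power must be non-negative")
    for n in range(k1 + 1):
        level = n * p_max / k1          # P_n from Eq. (6)
        if p_req <= level:              # P_{n-1} < p_req <= P_n
            return level
    return p_max                        # "otherwise" branch: p_req > P_tx^max


# Illustrative values only (assumed): P_tx^max = 1.0, K1 = 4.
p_bar = quantize_power(0.1, p_max=1.0, k1=4)
```

With four intervals the levels are 0, 0.25, 0.5, 0.75, and 1.0, so a required power of 0.1 is assigned the level 0.25, and any requirement above 1.0 is clipped to 1.0.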

When an SU transmits while the PU transmitter is active, interference may occur at the PU receiver. Hence, the SINR level required by the PUs must be satisfied, which is expressed as:

$$SINR_{PU} = \frac{{P_{tx,pq} \left| {h_{pq} } \right|^{2} }}{{I_{iq} + Z_{pq} }}$$

(8)

where \(P_{tx,pq}\) is the transmission power of the *p*th PU transmitter to the *q*th PU receiver, \(h_{pq}\) is the channel gain between the *p*th PU transmitter and the *q*th PU receiver, \(I_{iq}\) represents the interference to the *q*th PU receiver by the *i*th SU source, and \(Z_{pq}\) is an IID AWGN. The interference to the *q*th PU receiver by the transmission of the *i*th SU source can be expressed as:

$$I_{iq} = \frac{{\bar{P}_{tx,ik} \left| {h_{iq} } \right|^{2} }}{{Z_{iq} }}$$

(9)

where \(h_{iq}\) is the channel gain between the *i*th SU source and the *q*th PU receiver, and \(Z_{iq}\) is IID AWGN. The interference to the *q*th PU receiver during transmission from the *j*th SU relay to the *k*th SU destination can be expressed as:

$$I_{jq} = \frac{{\bar{P}_{tx,jk} \left| {h_{jq} } \right|^{2} }}{{Z_{jq} }}$$

(10)

where \(h_{jq}\) is the channel gain between the *j*th SU relay and the *q*th PU receiver, and \(Z_{jq}\) is IID AWGN.
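Eqs. (8) to (10) can be transcribed directly into code. The sketch below does so as written in the text; the function names and all numeric values are assumptions chosen for illustration, and the required PU SINR `sinr_required` is a hypothetical threshold that the text does not quantify.

```python
def su_interference(p_bar: float, gain_sq: float, noise: float) -> float:
    """Eqs. (9)/(10): interference at the PU receiver caused by an SU."""
    return p_bar * gain_sq / noise


def pu_sinr(p_pu: float, h_pq_sq: float, interference: float, noise_pq: float) -> float:
    """Eq. (8): SINR at the q-th PU receiver."""
    return p_pu * h_pq_sq / (interference + noise_pq)


# Illustrative values only (assumed, linear units).
i_iq = su_interference(p_bar=0.25, gain_sq=0.5, noise=0.125)
sinr = pu_sinr(p_pu=4.0, h_pq_sq=0.5, interference=i_iq, noise_pq=1.0)

sinr_required = 0.5                      # hypothetical PU requirement
pu_protected = sinr >= sinr_required     # True when the PU constraint holds
```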

Since it is assumed that the SUs can identify the pilot signals from the PUs, a reflector for \(\left| {h_{iq} } \right|^{2}\) and \(\left| {h_{jq} } \right|^{2}\) can be derived from the energy detector concept in CRNs. Given the reference thresholds \(\lambda_{th,i}\) and \(\lambda_{th,j}\) of the energy detectors at the *i*th SU source and the *j*th SU relay, the reflectors for \(\left| {h_{iq} } \right|^{2}\) (the *i*th SU source to the *q*th PU receiver) and \(\left| {h_{jq} } \right|^{2}\) (the *j*th SU relay to the *q*th PU receiver) are defined as follows:

$$\varPsi_{i} = \left| {\left( {\frac{{\log \lambda_{th,i} }}{{\log \lambda_{iq} }}} \right)} \right|$$

(11)

$$\varPsi_{j} = \left| {\left( {\frac{{\log \lambda_{th,j} }}{{\log \lambda_{jq} }}} \right)} \right|$$

(12)

where \(\lambda_{iq}\) and \(\lambda_{jq}\) are the RSS values of the pilot signal of the *q*th PU receiver measured at the *i*th SU source and at the *j*th SU relay, respectively. If the reference thresholds and the RSS values are assumed to satisfy \(\lambda_{th,i} ,\lambda_{iq} ,\lambda_{th,j} ,\lambda_{jq} < 1\), then the estimated interference to the *q*th PU receiver by the *i*th SU source and the *j*th SU relay can be expressed as:

$$I_{iq} \approx \tilde{I}_{iq} = \bar{P}_{tx,ik} \min \left( {1,\varPsi_{i} } \right)$$

(13)

$$I_{jq} \approx \tilde{I}_{jq} = \bar{P}_{tx,jk} \min \left( {1,\varPsi_{j} } \right)$$

(14)
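To make Eqs. (11) to (14) concrete, the sketch below computes the reflector from a detector threshold and a measured RSS, then scales the quantized transmission power by it. The function names and the sample values are assumptions for illustration; the only constraint taken from the text is that thresholds and RSS values are below 1 (and, for the logarithm to be defined, positive).

```python
import math


def reflector(lambda_th: float, rss: float) -> float:
    """Eqs. (11)/(12): |log(lambda_th) / log(rss)| for values in (0, 1)."""
    if not (0.0 < lambda_th < 1.0 and 0.0 < rss < 1.0):
        raise ValueError("threshold and RSS must lie strictly between 0 and 1")
    return abs(math.log(lambda_th) / math.log(rss))


def estimated_interference(p_bar: float, lambda_th: float, rss: float) -> float:
    """Eqs. (13)/(14): quantized power scaled by min(1, reflector)."""
    return p_bar * min(1.0, reflector(lambda_th, rss))


# Illustrative values only (assumed).
strong = estimated_interference(p_bar=0.25, lambda_th=0.01, rss=0.1)     # RSS above threshold
weak = estimated_interference(p_bar=0.25, lambda_th=0.01, rss=0.0001)    # RSS below threshold
```

When the RSS exceeds the threshold the reflector is larger than 1 and the estimate saturates at the full transmission power; when the RSS is below the threshold the estimate is scaled down proportionally.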

Substituting Eq. (13) into the Shannon channel capacity form, the interference capacity level of the *i*th SU source is expressed as:

$$L_{i} = \frac{{\log_{2} \left( {1 + \tilde{I}_{iq} } \right)}}{{Y_{0} }}$$

(15)

where \(Y_{0}\) represents a normalizing factor. Since the normalizing factor \(Y_{0}\) makes the maximum interference capacity level equal to 1, the interference capacity level can be divided into \(\left( {K_{2} + 1} \right)\) levels as follows:

$$l_{n} = \frac{n}{{K_{2} }}\max \left( {L_{i} } \right) = \frac{n}{{K_{2} }}, \quad n \in \left\{ {0,1, \ldots, K_{2} } \right\}$$

(16)

where \(K_{2} > 0\) is an integer. Thus, the actual interference capacity level of the *i*th SU source will be determined by:

$$\bar{L}_{i} = \left\{ {\begin{array}{ll} l_{n} , &\quad \text{if}\;\, l_{n - 1} < L_{i} \le l_{n} \\ 1, &\quad \text{otherwise} \end{array} } \right.$$

(17)
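Eqs. (15) to (17) can be sketched in the same way as the power quantization. The text does not give a value for \(Y_{0}\); the example below assumes \(Y_{0} = \log_{2}(1 + P_{tx}^{max})\), which yields a maximum level of 1 because \(\tilde{I}_{iq} \le P_{tx}^{max}\). The function names and numbers are likewise illustrative assumptions.

```python
import math


def interference_capacity_level(i_est: float, y0: float) -> float:
    """Eq. (15): normalized Shannon-form level of the estimated interference."""
    return math.log2(1.0 + i_est) / y0


def quantize_level(level: float, k2: int) -> float:
    """Eqs. (16)/(17): round the level up to the next multiple of 1/K2."""
    if k2 <= 0:
        raise ValueError("k2 must be a positive integer")
    for n in range(k2 + 1):
        l_n = n / k2                    # l_n from Eq. (16)
        if level <= l_n:                # l_{n-1} < level <= l_n
            return l_n
    return 1.0                          # "otherwise" branch of Eq. (17)


# Illustrative values only (assumed): P_tx^max = 1.0, K2 = 4.
p_max = 1.0
y0 = math.log2(1.0 + p_max)             # assumed normalizing factor
L_i = interference_capacity_level(0.5, y0)
L_bar = quantize_level(L_i, k2=4)
```

For an estimated interference of 0.5 the raw level is about 0.585, which falls in the interval between 0.5 and 0.75 and is therefore assigned the level 0.75.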

The state in Q-learning reflects the condition of the network environment. In this paper, it consists of two elements: the direct transmission power from the *i*th SU source to the *k*th SU destination \(\bar{P}_{tx,ik}\) and the interference capacity level of the *i*th SU source \(\bar{L}_{i}\). Therefore, the state of the *i*th SU source for Q-learning is expressed as:

$$S_{i} = \left( {\bar{P}_{tx,ik} ,\bar{L}_{i} } \right) \in \Sigma = P \times \varLambda , \quad P = \left\{ {P_{0} ,P_{1} , \ldots,P_{{K_{1} }} } \right\}, \quad \varLambda = \left\{ {l_{0} ,l_{1} , \ldots ,l_{{K_{2} }} } \right\}$$

(18)
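The state space in Eq. (18) is the Cartesian product of the power levels and the interference capacity levels, so it contains \((K_{1} + 1)(K_{2} + 1)\) states. The sketch below enumerates it; the function name `build_state_space`, the parameter values, and the index mapping (a convenience one might use for a tabular Q-function) are assumptions for illustration and are not described in the text.

```python
from itertools import product


def build_state_space(p_max: float, k1: int, k2: int) -> list:
    """Enumerate all (power level, interference capacity level) pairs of Eq. (18)."""
    powers = [n * p_max / k1 for n in range(k1 + 1)]   # set P, Eq. (6)
    levels = [n / k2 for n in range(k2 + 1)]           # set Lambda, Eq. (16)
    return list(product(powers, levels))


# Illustrative values only (assumed): P_tx^max = 1.0, K1 = K2 = 4.
states = build_state_space(p_max=1.0, k1=4, k2=4)

# Hypothetical lookup from a state tuple to a row index of a Q-table.
state_index = {state: idx for idx, state in enumerate(states)}
row = state_index[(0.25, 0.75)]
```

With \(K_{1} = K_{2} = 4\) this gives 25 states, and the state \((0.25, 0.75)\) occupies row 8 under the ordering produced by `product`.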