Information cascades prediction with attention neural network

Cascade prediction helps uncover the basic mechanisms that govern collective human behavior in networks, and it is also important in many other applications, such as viral marketing, online advertising, and recommender systems. However, making predictions is not trivial because of the myriad factors that influence a user's decision to reshare content. This paper presents a novel method for predicting the increment size of an information cascade based on an end-to-end neural network. Learning the representation of a cascade in an end-to-end manner circumvents the difficulties inherent to the design of hand-crafted features. An attention mechanism, which consists of an intra-attention module and an inter-gate module, was designed to obtain and fuse the temporal and structural information learned from the observed period of the cascade. The experiments were performed on two real-world scenarios, i.e., predicting the size of retweet cascades on Twitter and predicting the citations of papers in AMiner. Extensive results demonstrated that our method outperformed state-of-the-art cascade prediction methods, including both feature-based and generative approaches.

Cascade prediction is challenging due to the myriad factors that influence a user's decision to reshare content.
The problem of cascade prediction has been studied extensively [1][2][3], but most studies either depended heavily on the quality of carefully designed hand-crafted features or made strong assumptions about the generative processes of resharing events that oversimplified reality, which impaired their predictive power. On the other hand, deep learning methods, such as convolutional neural networks (CNNs) [4] and recurrent neural networks (RNNs) [5], have achieved great success in various complicated tasks [6][7][8], and some studies have used neural networks to transform and leverage various informative features for cascade prediction [9]. Nevertheless, these methods ignore the temporal properties of the cascade, which traditional works regard as valuable information for improving cascade prediction.
In this paper, we propose to predict the information cascade within a neural network framework by incorporating an attention mechanism that uses temporal and structural information learned from the observed period of the cascade. Our proposed method consists of three layers. In the first layer, the structure embedding is obtained by representing the cascade graph as a set of random walk paths that carry information about the propagators of the message and the local and global topologies among them. Inspired by the recent successes of the point process model in cascade dynamics modeling [10], the temporal embedding is obtained as a series of hidden representations of resharing events ordered by ascending time. The challenge is how to assemble paths or events into an effective representation of each factor. Thus, in the second layer, we designed a novel attention mechanism that contains intra-attention and inter-gate modules. The assembly problem is solved via the intra-attention mechanism with respect to (w.r.t.) the topological structure and the temporal properties. Further, a gate mechanism is proposed to fuse the structure and temporal representations by capturing the importance of the two factors for cascade prediction. Finally, the top layer introduces a multi-layer perceptron (MLP) to output the prediction (the increment size of the cascade in our case). We performed extensive experiments on two representative real-world datasets, a Twitter dataset and an AMiner citation network dataset. Our results indicated that our proposed method outperformed state-of-the-art cascade prediction models.
The remainder of this paper is organized as follows. "Related work" section presents a survey of the related work. "Preliminaries" section formulates the cascade prediction problem and introduces the recurrent neural network. "Approach" section presents the details of the proposed model. The experimental results are presented in "Experiments" section, and conclusions and plans for future work are reported in "Conclusions" section.

Related work
We review studies relevant to our work from two aspects, i.e., cascade prediction and attention mechanisms.

Cascade prediction
Information cascade prediction has been explored in recent years and is still an open problem. Existing methods for cascade prediction can be categorized into two broad types, i.e., feature-based and model-based approaches.
Feature-based approaches [2,3,11–15] make the connection between the prediction and various types of hand-crafted features that are extracted from the information cascade, including the structural features of the social network, content features, temporal features, and user features. To predict the popularity of news articles in Yahoo News, Arapakis et al. [16] used 10 different features that they extracted from the content of the news articles as well as external sources. To predict the popularity of online videos on YouTube and Facebook, Trzcinski et al. [17] utilized both the visual clues and the early popularity patterns of the videos once they were released. Instead of predicting the total volume or level of popularity, Kong et al. [18] focused on the popularity evolution of online content and considered the dynamic factors that influenced how the popularity evolved. Nevertheless, there is no principled way to design and extract these features, and the accuracy of the predictions is sensitive to the quality of the extracted features.
Model-based approaches [1,19–22] are devoted to directly characterizing and modeling the formation of an information cascade in the network. These approaches often are optimized to provide intuitive explanations for the prediction due to the interpretable factors that are incorporated in them. Yu et al. [21] proposed a novel NEtworked Weibull Regression model for modeling microbehavioral dynamics that significantly improved the interpretability and generalizability of traditional survival models. Bao et al. [23] modeled the popularity dynamics of tweets on Twitter using the Hawkes process. They also proposed a method for exploring an adaptive peeking window for each tweet, which can synthesize all of the global dynamic information within the observed period into the predicted peek point. However, using the model-based approach for cascade prediction often is sub-optimal, because strong assumptions often are made about the process of information flow during a diffusion, and these models lack the size of the future cascade as a guide. Inspired by the recent success of deep learning in various complicated tasks, several studies [9,24] have adopted deep learning methods to leverage various features for cascade prediction and have achieved satisfactory results. Our work is closely related to these works. In our work, however, learning the representation of a cascade in an end-to-end manner circumvents the difficulties inherent to the hand-crafted feature design step. We also incorporate the temporal properties, which have been ignored in previous work [9].

Attention mechanism
The concept of attention was first introduced in Neuroscience and Computational Neuroscience [25,26]. For instance, visual attention is the process by which humans focus on a specific portion of their visual inputs to compute adequate responses. Similarly, in training neural networks, the attention mechanism allows models to learn alignments between different parts of the input. The attention mechanism has recently gained popularity in various tasks, such as neural machine translation [27], image captioning [28], image/video popularity prediction [24,29], and question answering [30,31]. To predict video popularity, Bielski et al. [29] proposed a model with a self-attention mechanism to hierarchically attend to both video frames and textual modalities. To the best of our knowledge, we are the first to introduce an attention mechanism into cascade prediction by fusing temporal and structural information.

Preliminaries
In this section, we first present a formal definition of the cascade prediction problem ("Problem definition" section), and then we briefly describe the recurrent neural network that is used in our proposed method ("Recurrent neural network" section).

Problem definition
Let $G = (V, E)$ be a social network (e.g., Twitter or the academic paper network), where $V$ is the set of vertices of $G$, and $E \subset V \times V$ is the set of edges of $G$. A vertex $u \in V$ represents a user in the social network and an edge $(u, v) \in E$ represents that there exists a feedback relationship (e.g., using a like, comment, share, or cite) between user $u$ and user $v$.
Suppose we have $M$ cascades that start in $G$ after time $t_0$. At time $t$, we denote the $i$-th cascade as $g_t^i = (V_t^i, E_t^i, T_t^i)$, where $V_t^i$ is the subset of $V$ who have taken part in the cascade, $T_t^i$ represents the time when a user in $V_t^i$ takes part in the cascade, and $E_t^i$ represents the feedback relationships between users in $V_t^i$. In this work, we first obtain the detailed representation of $g_t^i$ as $\{\mathbf{S}^i, \mathbf{H}^i\}$, where $\mathbf{S}^i$ and $\mathbf{H}^i$ correspond to the structure representation and the temporal representation, respectively. We denote the cascade size of $g_t^i$ as $R_t^i = |V_t^i|$. Thus, our aim is to predict the incremental size $\Delta R^i = R_{t+\Delta t}^i - R_t^i$ after a time interval $\Delta t$. In other words, the target is to learn a function $f$ that maps $\{\mathbf{S}^i, \mathbf{H}^i\}$ to $\Delta R^i$. Note that throughout this paper, we denote vectors by bold lowercase letters and matrices by bold capital Roman letters. In what follows, we will omit the superscript $i$ of related notations for simplicity.

Recurrent neural network
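Recurrent neural network (RNN) [5,32] is a type of deep neural network with cycles and internal memory units that capture sequential information, which makes it a more general model than the feed-forward network. In practice, RNNs have been shown to be a powerful tool for modeling sequences [33]. Long short-term memory (LSTM) [34] and the gated recurrent unit (GRU) [35] are recurrent mechanisms that are used extensively. According to Chung et al. [35], GRU has been shown to exhibit better performance with less computation, and it is used as the basic recurrent unit in our proposed approach. The updating formulation of GRU is as follows:

$$
\begin{aligned}
\mathbf{u}_i &= \sigma(\mathbf{W}_u \mathbf{x}_i + \mathbf{U}_u \mathbf{h}_{i-1} + \mathbf{b}_u), \\
\mathbf{r}_i &= \sigma(\mathbf{W}_r \mathbf{x}_i + \mathbf{U}_r \mathbf{h}_{i-1} + \mathbf{b}_r), \\
\tilde{\mathbf{h}}_i &= \tanh(\mathbf{W}_h \mathbf{x}_i + \mathbf{U}_h (\mathbf{r}_i \cdot \mathbf{h}_{i-1}) + \mathbf{b}_h), \\
\mathbf{h}_i &= (1 - \mathbf{u}_i) \cdot \mathbf{h}_{i-1} + \mathbf{u}_i \cdot \tilde{\mathbf{h}}_i,
\end{aligned}
$$

where $\mathbf{x}_i$ is the current input, $\mathbf{h}_{i-1}$ is the previous hidden state, $\mathbf{u}_i$ and $\mathbf{r}_i$ are the update gate and the reset gate, $\sigma(\cdot)$ is the sigmoid activation function, $\cdot$ denotes element-wise multiplication, $\mathbf{W}_u, \mathbf{W}_r, \mathbf{W}_h, \mathbf{U}_u, \mathbf{U}_r, \mathbf{U}_h$, and $\mathbf{b}_u, \mathbf{b}_r, \mathbf{b}_h$ are GRU parameters learned during training, and $\mathbf{h}_i$ is the updated hidden state. The above system can be reduced into a GRU equation:

$$\mathbf{h}_i = \mathrm{GRU}(\mathbf{x}_i, \mathbf{h}_{i-1}).$$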
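To make the recurrence concrete, the following is a minimal pure-Python sketch of a single GRU update in the form given by Chung et al. It is an illustration only, not the implementation used in the experiments. The helper names (`matvec`, `gru_step`, `random_params`), the dimensions, and the random initialization are assumptions introduced for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h_prev, p):
    """One GRU update h_i = GRU(x_i, h_{i-1}); p holds the parameters."""
    # Update gate u and reset gate r, both with entries in (0, 1).
    u = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_u"], x), matvec(p["U_u"], h_prev), p["b_u"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"])]
    # The candidate state sees the previous state through the reset gate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b + c) for a, b, c in
               zip(matvec(p["W_h"], x), matvec(p["U_h"], gated), p["b_h"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - ui) * hp + ui * ht
            for ui, hp, ht in zip(u, h_prev, h_tilde)]

def random_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    p = {}
    for gate in ("u", "r", "h"):
        p["W_" + gate] = mat(hidden_dim, input_dim)
        p["U_" + gate] = mat(hidden_dim, hidden_dim)
        p["b_" + gate] = [0.0] * hidden_dim
    return p

# Run the recurrence over a toy sequence of three 3-dimensional inputs.
params = random_params(input_dim=3, hidden_dim=4)
h = [0.0] * 4
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h = gru_step(x, h, params)
```

Because each new state is a gate-weighted mix of the previous state and a tanh output, the hidden state stays bounded when it starts from zero, which is one reason gated units are convenient for long sequences.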

Extracting structure representation
The cascade graph $g_t$ is first represented as a set of cascade paths that are sampled through multiple random walk processes. Each of the cascade paths not only carries information about who the information propagators are but also captures the information flow. We then feed the paths into a gated recurrent neural network to obtain the hidden representation.
We follow previous work [9,36] and use a fixed path length $L$ and a fixed number of sequences $K$. Concisely speaking, for each random walk process, we first sample the starting node with the probability

$$p(u) = \frac{\deg_c(u) + \alpha}{\sum_{s \in V_c} \left(\deg_c(s) + \alpha\right)},$$

where $\alpha$ is a smoother, $\deg_c(u)$ is the out-degree of vertex $u$ in $G$, and $V_c$ is the set of nodes in $g_t$. Following the starting node, the neighbor node is sampled with the probability

$$p(u \mid v) = \frac{\deg_c(u) + \alpha}{\sum_{s \in N_c(v)} \left(\deg_c(s) + \alpha\right)}, \quad u \in N_c(v),$$

where $N_c(v)$ is the set of outgoing neighbors of the current vertex $v$. The sampling of one selected sequence stops either when we reach the predefined length $L$ or when we reach a vertex that has no outgoing neighbors. Whenever the length of one sequence is smaller than $L$, the sequence is padded by a special vertex '+'. This process of sampling sequences continues until we sample $K$ sequences.
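The sampling procedure can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact sampler: the start node is drawn with weight equal to its smoothed out-degree, and the transition rule (the same smoothed out-degree weighting over outgoing neighbors) is assumed for the example. The function name `sample_paths`, the toy graph, and the default value of `alpha` are hypothetical.

```python
import random

PAD = "+"  # special padding vertex, as described in the text

def sample_paths(adj, K, L, alpha=0.01, seed=0):
    """Sample K random-walk paths of fixed length L from a cascade graph.

    adj maps every node of the cascade graph to the list of its outgoing
    neighbours (nodes without outgoing edges map to an empty list).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    out_deg = {u: len(adj[u]) for u in nodes}
    # Start-node weights: smoothed out-degree, deg_c(u) + alpha.
    start_weights = [out_deg[u] + alpha for u in nodes]
    paths = []
    for _ in range(K):
        path = [rng.choices(nodes, weights=start_weights)[0]]
        # Walk until the path is full or the current node is a dead end.
        while len(path) < L and adj[path[-1]]:
            neighbours = adj[path[-1]]
            # Assumed transition rule: smoothed out-degree of each neighbour.
            weights = [out_deg[v] + alpha for v in neighbours]
            path.append(rng.choices(neighbours, weights=weights)[0])
        # Pad short walks so that every path has exactly L entries.
        path += [PAD] * (L - len(path))
        paths.append(path)
    return paths

# Toy cascade: a is reshared by b and c, and b is reshared by d.
cascade = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
paths = sample_paths(cascade, K=5, L=4)
```

In this toy graph the longest possible walk has three nodes, so every sampled path of length 4 ends with at least one padding vertex.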
Each node in the sequence is represented as a one-hot vector $\mathbf{q} \in \mathbb{R}^{N}$, where $N$ is the total number of nodes in $G$. Before we feed the one-hot vectors into the GRU, we first convert each of them into a low-dimensional dense vector $\mathbf{x}$ by an embedding matrix $\mathbf{W}_x \in \mathbb{R}^{H \times N}$:

$$\mathbf{x} = \mathbf{W}_x \mathbf{q}.$$

Then we feed the sequence into the GRU to generate sequential hidden states. We adopt the bi-directional GRU [37], where a forward GRU reads the sequence node by node, from left to right, and generates a sequence of forward hidden vectors $[\overrightarrow{\mathbf{h}}_i^k]$. Similarly, a backward GRU reads from right to left, node by node, and generates a sequence of backward hidden vectors $[\overleftarrow{\mathbf{h}}_i^k]$. This encoder can be used to simulate the process of information flow during a diffusion. For the $i$-th node in the $k$-th sequence, the updated hidden state is computed as the concatenation of the forward and backward hidden vectors:

$$\mathbf{h}_i^k = \overrightarrow{\mathbf{h}}_i^k \oplus \overleftarrow{\mathbf{h}}_i^k,$$

where $\oplus$ denotes the concatenation operation.
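For the embedding step only, the short sketch below illustrates why multiplying an embedding matrix by a one-hot vector is equivalent to selecting one column of that matrix, which is how embedding layers are usually implemented in practice. The matrix values and the helper names (`one_hot`, `matvec`, `embed`) are made up for this example.

```python
def one_hot(index, size):
    """One-hot vector q with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def embed(W, node_index):
    """Column lookup, equivalent to W q for a one-hot vector q."""
    return [row[node_index] for row in W]

# Toy embedding matrix with H = 2 rows and N = 3 columns (made-up values).
W_x = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]

x_matmul = matvec(W_x, one_hot(1, 3))  # explicit product W_x q
x_lookup = embed(W_x, 1)               # direct column lookup
```

Both routes return the second column of `W_x`; the lookup avoids materializing an N-dimensional one-hot vector for every node.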
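Hence, we can obtain the hidden states of the $k$-th sequence, $[\mathbf{h}_1^k, \ldots, \mathbf{h}_L^k]$. Thus, the $k$-th sequence is represented as

$$\mathbf{s}_k = \sum_{i=1}^{L} \alpha_i \mathbf{h}_i^k.$$

Note that the weight $\alpha_i$ is also learned through the deep learning process.
Finally, from the perspective of topological structure, a cascade graph can be expressed as

$$\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_K].$$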

Extracting temporal representation
When we consider the temporal information of the cascade graph $g_t$, the adoption process can be treated as either a time series or a point process. The former is indexed with fixed and equal time intervals, which can be used to capture the dependence in the time-varying features in a timely manner. The latter is generated asynchronously with random timestamps, and the precise time interval between two adoption events carries a great deal of information about the underlying dynamics. Capturing this information is crucial for predicting the increment size of the cascade graph. Thus, as Fig. 1 shows, we used the point process form. The effectiveness of the point process form is demonstrated in "Experiments" section.
Specifically, for adoption event $i$, we can extract the associated temporal features (e.g., the inter-event duration $d_i = t_i - t_{i-1}$) and obtain the corresponding temporal sequence $T_t = \{d_1, \ldots, d_{|V_t|}\}$. Then, we feed the sequence $T_t$ into the GRU, where the hidden state of adoption event $i$ (denoted as $\mathbf{h}_i$) is updated by

$$\mathbf{h}_i = \mathrm{GRU}(d_i, \mathbf{h}_{i-1}).$$

We should emphasize that, in this case, the current input vector degenerates into a scalar. After recurrent computation for each time step, we gather a series of hidden states $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_{|V_t|}]$. In summary, we have a structure representation $\mathbf{S}$ and a temporal representation $\mathbf{H}$ as inputs for the attention mechanism to be proposed below.
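As a small illustration of the input to the temporal encoder, the sketch below turns raw adoption timestamps into the inter-event durations described above. The function name `inter_event_durations`, the reference time `t0` for the first event, and the timestamps are assumptions made for this example.

```python
def inter_event_durations(event_times, t0=0.0):
    """Return d_i = t_i - t_{i-1} for adoption events sorted by time.

    t0 is the assumed reference time for the first event, for example
    the publication time of the original message.
    """
    times = sorted(event_times)
    previous = [t0] + times[:-1]
    return [t - p for t, p in zip(times, previous)]

# Retweet timestamps in seconds after the original post (made-up values).
durations = inter_event_durations([5.0, 65.0, 20.0, 200.0])
# → [5.0, 15.0, 45.0, 135.0]
```

Each duration would then be fed to the GRU one step at a time, so the sequence length equals the number of observed adoption events.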

Attention mechanism
Our attention mechanism consists of two parts: intra-attention module and inter-gate module. Through these we can obtain a more suitable representation of cascade g t for prediction.

Intra-attention mechanism
Attention computation for topological structure Intra-attention w.r.t. topological structure (presented in Fig. 2) aims at assembling the sampled cascade paths into an effective representation of the structure information of $g_t$. First, we convert the temporal embedding matrix into a vector representation $\mathbf{h}$ via a mean pooling mechanism:

$$\mathbf{h} = \frac{1}{|V_t|} \sum_{m=1}^{|V_t|} \mathbf{h}_m.$$

The weight $\alpha_k$ is formalized as

$$\alpha_k = \frac{\exp(\omega(\mathbf{s}_k, \mathbf{h}))}{\sum_{j=1}^{K} \exp(\omega(\mathbf{s}_j, \mathbf{h}))},$$

where $\alpha_k$ is the attention to the hidden state representation of the $k$-th sequence in the graph $g_t$, and $\omega(\mathbf{s}_k, \mathbf{h})$ is set using the following function:

$$\omega(\mathbf{s}_k, \mathbf{h}) = \mathbf{A}_S \tanh(\mathbf{W}_S \mathbf{s}_k + \mathbf{U}_S \mathbf{h}),$$

where the parameter matrices of intra-attention satisfy $\mathbf{A}_S \in \mathbb{R}^{1 \times 2H}$ and $\mathbf{W}_S, \mathbf{U}_S \in \mathbb{R}^{2H \times 2H}$. The above equation essentially calculates the relevance of each sequence in graph $g_t$ to the temporal embedding. The intuition is that different temporal properties have diverse influences on the topological structure of the cascade. For instance, compared with adoption events that occur occasionally, intensive adoption events bring a larger potential adoption base for the selected message, which in turn leads to a more complex cascade network. Hence, we used the temporal embedding to guide the learning of the combination weights of the sequences extracted from the cascade graph. Consequently, we can get the attended whole structure embedding $\dot{\mathbf{s}}$ via the weighted sum pooling mechanism:

$$\dot{\mathbf{s}} = \sum_{k=1}^{K} \alpha_k \mathbf{s}_k.$$

Attention computation for temporal properties Intra-attention w.r.t. temporal properties (presented in Fig. 3) aims to assemble events into an effective representation of the temporal information of $g_t$. Similarly, we first convert the structure embedding matrix into a vector representation $\mathbf{s}$ via a mean pooling mechanism:

$$\mathbf{s} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{s}_k.$$

The attention weight $\alpha_m$ for the $m$-th hidden vector $\mathbf{h}_m$ is formalized as

$$\alpha_m = \frac{\exp(\omega(\mathbf{h}_m, \mathbf{s}))}{\sum_{j=1}^{|V_t|} \exp(\omega(\mathbf{h}_j, \mathbf{s}))},$$

where $\omega(\mathbf{h}_m, \mathbf{s}) = \mathbf{A}_T \tanh(\mathbf{W}_T \mathbf{h}_m + \mathbf{U}_T \mathbf{s})$ scores the extent of the dependence between the $m$-th adoption behavior and the structure embedding, and the parameter matrices satisfy $\mathbf{A}_T \in \mathbb{R}^{1 \times 2H}$ and $\mathbf{W}_T, \mathbf{U}_T \in \mathbb{R}^{2H \times 2H}$.
A complex cascade network topology improves the reception and visibility of the message and thus promotes the occurrence of adoption events. In the time dimension, this is reflected as the aggregation of adoption events, which is also called bursting diffusion of the message. In our previous work [23], we demonstrated that different parts of the diffusion history have diverse influences on the future cascade size, and we proposed a method for obtaining the most effective part of the history to make an accurate prediction. Analogously, the pooling weights for the temporal properties of different adoption events are automatically learned based on the structural embedding of the cascade graph $g_t$ to optimize the prediction of cascade growth. Hence, we can obtain the attended whole temporal embedding $\dot{\mathbf{h}}$ via the following equation:

$$\dot{\mathbf{h}} = \sum_{m=1}^{|V_t|} \alpha_m \mathbf{h}_m.$$
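The two intra-attention computations share one pattern: mean-pool the guiding factor, score each item of the other factor against it, normalize the scores with a softmax, and take the weighted sum. The sketch below illustrates that pattern in pure Python. It assumes an additive score with a tanh non-linearity, which is a common choice but an assumption here, and the function names, toy embeddings, and identity parameter matrices are all hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mean_pool(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def intra_attention(items, guide_items, A, W, U):
    """Pool `items` with weights guided by the mean of `guide_items`.

    Assumed score: omega(x, g) = A . tanh(W x + U g).
    Returns the attended embedding and the attention weights.
    """
    g = mean_pool(guide_items)
    Ug = matvec(U, g)
    scores = []
    for x in items:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, x), Ug)]
        scores.append(sum(a * h for a, h in zip(A, hidden)))
    weights = softmax(scores)
    pooled = [sum(w * x[d] for w, x in zip(weights, items))
              for d in range(len(items[0]))]
    return pooled, weights

# Toy 2-dimensional embeddings (made-up numbers).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 path representations
H = [[0.5, 0.5], [0.2, 0.8]]              # 2 event hidden states
I2 = [[1.0, 0.0], [0.0, 1.0]]             # identity used as toy parameters

# Structure attention is guided by the pooled temporal embedding ...
s_att, alpha_s = intra_attention(S, H, A=[1.0, 1.0], W=I2, U=I2)
# ... and temporal attention is guided by the pooled structure embedding.
h_att, alpha_t = intra_attention(H, S, A=[1.0, 1.0], W=I2, U=I2)
```

With a zero scoring vector `A`, all scores are equal and the attended embedding reduces to plain mean pooling, which is a convenient sanity check for an implementation.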

Inter-gate mechanism
Having obtained the attended whole structure embedding $\dot{\mathbf{s}}$ and temporal embedding $\dot{\mathbf{h}}$, we feed these two embeddings into the inter-gate mechanism to effectively combine the two factors. The proposed inter-gate mechanism can capture the different importance of the two factors when predicting the cascade growth. Instead of setting a fixed weight, the proposed inter-gate mechanism adaptively tunes the combination weight. Specifically, the final representation $\mathbf{c}$ of cascade graph $g_t$ when combining the temporal and structure factors is assembled by

$$\mathbf{c} = \boldsymbol{\beta} \cdot \dot{\mathbf{s}} + (1 - \boldsymbol{\beta}) \cdot \dot{\mathbf{h}},$$

where the adaptive combination weight $\boldsymbol{\beta}$, whose entries lie in $(0, 1)$, is computed by

$$\boldsymbol{\beta} = \sigma(\mathbf{W}_C \dot{\mathbf{s}} + \mathbf{U}_C \dot{\mathbf{h}}),$$

where the parameter matrices satisfy $\mathbf{W}_C, \mathbf{U}_C \in \mathbb{R}^{2H \times 2H}$, and they are both learned through the deep learning process.

Fig. 3 Architecture of the intra-attention mechanism w.r.t. temporal properties
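A gate of this kind can be sketched in a few lines. The example below assumes an element-wise sigmoid gate computed from both embeddings, with the gate value weighting the structure embedding and its complement weighting the temporal embedding; which factor receives the gate value is an assumption of this sketch. The function name `inter_gate` and the toy inputs are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inter_gate(s_att, h_att, W_C, U_C):
    """Fuse the structure and temporal embeddings with a learned gate.

    Assumed form: beta = sigmoid(W_C s + U_C h) and
    c = beta * s + (1 - beta) * h, both applied element-wise.
    """
    pre = [a + b for a, b in zip(matvec(W_C, s_att), matvec(U_C, h_att))]
    beta = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    c = [b * s + (1.0 - b) * h for b, s, h in zip(beta, s_att, h_att)]
    return c, beta

# With all-zero parameters the gate is 0.5 everywhere, so the fused
# representation is the plain average of the two embeddings.
zero = [[0.0, 0.0], [0.0, 0.0]]
c, beta = inter_gate([1.0, 3.0], [3.0, 1.0], zero, zero)
# → c == [2.0, 2.0], beta == [0.5, 0.5]
```

During training, the parameters move the gate away from 0.5 so that the more informative factor dominates the fused representation for a given cascade.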

Output layer
Finally, our output module consists of a multi-layer perceptron (MLP), which takes the cascade representation $\mathbf{c}$ as input and generates the final incremental size prediction:

$$\hat{y} = \mathrm{MLP}(\mathbf{c}).$$

The benefit of this fully connected layer is that it does not incur much model complexity and ensures the capacity of nonlinear modeling.

Experiments
This section presents the experiment setup ("Experiment setup" section) and results analysis ("Experiment results" section).

Dataset and processing
Twitter The dataset contains tweets and retweets on Twitter from September 1 to October 1, 2016. Here we focus on a subset of popular tweets that have at least 50 retweets for easier calibration in our model. For each retweet cascade, the dataset includes the publish time of the original tweet, the time of each retweet, and the IDs of the users who participated in the cascade. The global social network $G$ was constructed using the same tweet stream from July and August 2016. To evaluate the performance of our model, we split the original data chronologically into a training dataset, a validation dataset, and a test dataset. Specifically, cascades whose original tweets were published during the first 11 days were used for training, cascades that originated on September 12 were used for validation, and cascades that originated from September 13 to September 15 were used for testing. The remaining days were used to let the Twitter cascades unfold over the network.
AMiner The scientific paper datasets are publicly available at https://www.aminer.cn/citation. We constructed the global network $G$ using the citations between 1985 and 1995. Specifically, we drew an edge from author A to author B if B ever cited A's paper. A citation cascade of a given paper thus contains all authors who have written or cited the paper. We also split the datasets in chronological order. Papers published between 1996 and 2000 were included in the training set. Papers published in 2001 and 2002 were used for validation and testing, respectively.
In summary, Table 1 gives an overview of the basic statistics of the Twitter dataset and the AMiner dataset.
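The chronological split described above for the Twitter data can be sketched as follows. The function name `split_by_origin_date`, the cascade identifiers, and the tuple layout are hypothetical; only the date boundaries come from the text.

```python
from datetime import date

def split_by_origin_date(cascades, train_end, val_end, test_end):
    """Assign cascades to train/val/test by the original publish date.

    cascades: iterable of (cascade_id, publish_date) pairs.
    The three boundaries are inclusive end dates; cascades that start
    after test_end are left out of all three sets.
    """
    splits = {"train": [], "val": [], "test": []}
    for cascade_id, day in cascades:
        if day <= train_end:
            splits["train"].append(cascade_id)
        elif day <= val_end:
            splits["val"].append(cascade_id)
        elif day <= test_end:
            splits["test"].append(cascade_id)
    return splits

# Hypothetical cascades; the boundaries follow the Twitter setting.
toy = [("c1", date(2016, 9, 3)), ("c2", date(2016, 9, 12)),
       ("c3", date(2016, 9, 14)), ("c4", date(2016, 9, 20))]
splits = split_by_origin_date(toy,
                              train_end=date(2016, 9, 11),
                              val_end=date(2016, 9, 12),
                              test_end=date(2016, 9, 15))
```

Splitting by origin date rather than at random keeps every training cascade strictly earlier than the validation and test cascades, which avoids leaking future information into training.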

Evaluation metrics
We used the mean squared error (MSE) and mean absolute error (MAE), two standard measurements for regression tasks, to evaluate the prediction performance:

$$\mathrm{MSE} = \frac{1}{M} \sum_{i=1}^{M} (\hat{y}_i - y_i)^2, \qquad \mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} |\hat{y}_i - y_i|,$$

where $M$ is the number of evaluated cascades, and $\hat{y}_i$ and $y_i$ are the predicted value and ground truth value of cascade $i$, respectively. Note that, following the practice of [9], we predict a scaled version of the actual increment of the cascade size, i.e., $y_i = \log_2(\Delta R_i + 1)$.
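A short worked example may help to fix the scale of these metrics. The sketch below applies the log-scaled target and computes both errors on made-up numbers; the function names and the toy predictions are assumptions for illustration.

```python
import math

def scaled_increment(delta_r):
    """Scaled target y = log2(delta_R + 1)."""
    return math.log2(delta_r + 1)

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def mae(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

increments = [0, 1, 3, 7]                            # made-up increments
y_true = [scaled_increment(r) for r in increments]   # [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.0, 3.0]                        # made-up predictions

error_mse = mse(y_pred, y_true)  # (0.25 + 0 + 1 + 0) / 4 = 0.3125
error_mae = mae(y_pred, y_true)  # (0.5 + 0 + 1 + 0) / 4 = 0.375
```

Because the target is on a log2 scale, an absolute error of 1 corresponds to being off by roughly a factor of two in the size increment plus one.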

Comparison methods
The comparison methods are as follows:
Features-linear We extract a bag of hand-crafted features that were used in previous work [3,38–41] and that represent the temporal factor and the structure factor for cascade prediction. These features are then fed into a linear regression with L2 regularization. The features include:
• Temporal features This type of feature concerns the speed of adoptions during the prefix cascade. We extract the five-point summary (min, median, max, 25th and 75th percentiles) of waiting times between reshare events, the First Half Rate (mean time between adoptions for the first half of the adoptions), the Second Half Rate [38], and the cumulative popularity [42].
• Structural features This type of feature includes the structural features of the entire social network around early adopters and the structural features of the cascade. Thus, we extracted the in-degree of each node, the connections between $g_t$ and $G$, the number of edges in $g_t$, the number of leaf nodes in $g_t$, and the average and maximum length of the reshare paths [38].
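The temporal features in the list above can be computed along the following lines. This is a rough sketch: the percentile uses linear interpolation, which is one of several common definitions, and the Second Half Rate is assumed to mirror the First Half Rate on the remaining waiting times. The function names and the toy timestamps are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile, with 0 <= q <= 100."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def temporal_features(event_times):
    """Waiting-time features of a prefix cascade (needs >= 3 events)."""
    times = sorted(event_times)
    waits = [b - a for a, b in zip(times, times[1:])]
    ordered = sorted(waits)
    half = len(waits) // 2
    return {
        "min": ordered[0],
        "p25": percentile(ordered, 25),
        "median": percentile(ordered, 50),
        "p75": percentile(ordered, 75),
        "max": ordered[-1],
        # Mean waiting time over the first and second half of the events.
        "first_half_rate": sum(waits[:half]) / half,
        "second_half_rate": sum(waits[half:]) / (len(waits) - half),
    }

# Reshare times in hours (made-up); the waiting times are 1, 2, 3, 4.
feats = temporal_features([0, 1, 3, 6, 10])
```

In this toy cascade the second-half rate is larger than the first-half rate, meaning that adoptions are slowing down, which is the kind of signal these features are meant to expose to the linear model.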

Support vector regression (SVR) We follow previous work [17,43] and adopt an SVR model with a linear kernel to predict the cascade size, using time series data as features.
SEISMIC [44] This is one of the state-of-the-art generative models for cascade prediction. The model is based on a self-exciting point process and produces final cascade size forecasts using the early adoption activity of a selected message. Note that its predictor is based on a branching process, and thus this method can only be applied to predict the final size of the retweet cascade. In contrast, our proposed end-to-end method can be easily extended to predict the dynamics of the retweet cascade.
DeepCas [9] This is the first end-to-end, deep learning method for information cascades prediction. It mainly utilizes the information of the structure of the cascade graph and node identities for prediction. The attention mechanism is designed to assemble a cascade graph representation from a set of random walk paths.

Platform and parameter setting
For the length $t$ of the observed initial period of the information cascade, we consider three settings, i.e., $t = 1, 2, 3$ hours for Twitter and $t = 1, 2, 3$ months for AMiner. To instantiate our models, we used the high-level neural network library Keras [45] with Theano [46] as the computational back-end. The code was run on a Linux server with 32 GB of memory and 2 CPUs with 4 cores each (Intel Core i7-7700K CPU @ 4.50 GHz). The GPU in use was the Nvidia GeForce GTX TITAN 1080 Ti.

Experiment results
We evaluated our proposed model against the comparison methods on the Twitter and AMiner datasets. The prediction results are reported in Table 2, which shows that irrespective of the dataset (Twitter or AMiner) and the prefix cascade (1, 2, 3 h for Twitter, and 1, 2, 3 months for AMiner), our proposed method outperformed the comparison methods, since it achieved a lower MSE. Table 2 shows that Features-linear provides worse results than our proposed method, even though Features-linear uses the features that past studies [38] found to be most predictive for cascade prediction. This indicates the limitation of hand-crafted features, which is especially obvious in comparison with our proposed method, which automatically learns a joint and effective representation from temporal and structural factors. Table 2 also shows that our proposed method outperformed SEISMIC, a state-of-the-art generative model, which we attribute to the more powerful attention mechanisms used in our method. Specifically, our model uses an attention mechanism to automatically learn the pooling weights for the temporal properties of different adoption events, while SEISMIC uses a constant peeking period within a prefix cascade for different messages when making predictions. In addition, SEISMIC lacks the future cascade size as a guide and makes stronger assumptions about the diffusion process, which are common disadvantages of generative prediction methods.
Among the comparison methods, DeepCas had the best performance because it benefits from end-to-end learning from the data to the prediction. Our proposed method further reduces the prediction errors when compared with DeepCas, due to the introduction of temporal information, which is ignored in DeepCas.
Comparing the performance for different prefix lengths $t$, we can draw a conclusion that applies to all methods for both Twitter cascades and citation cascades: as we increased the observation time, the prediction errors tended to decrease, suggesting that more accessible information makes prediction easier. In addition, we can observe that the prediction errors are much larger on Twitter (the top half of Table 2) than on AMiner (the bottom half of Table 2), which indicates that predicting the Twitter cascade size is a more difficult scenario of information cascade prediction.
To study the effects of the temporal factor and the structural factor on cascade prediction in more detail, we compared the proposed method and the Features-linear method with their variants that do not consider one of these factors. We ran these methods on the two datasets and aimed to predict the incremental size of the information cascade using a fixed observation window ranging from 1 to 3 h (months for AMiner). For ease of presentation, we denote the temporal factor as T and the structural factor as S. Thus, "no T" means removing the temporal factor from the corresponding method, and "no S" is defined similarly.
The prediction results of these methods are summarized in Table 3. These results show that our proposed method and Features-linear both outperform their variants, which indicates the usefulness of these factors. For instance, "Proposed (no T)" shows a notable decrease in performance compared with our proposed method, with MSE = 3.772 and 2.609, respectively, when observing for 1 h on Twitter. This shows that feeding temporal features into deep neural networks is indeed meaningful.
We also found that Features-linear (no S) performs better than Features-linear (no T), which is consistent with previous research [38]. However, "Proposed (no S)" and "Proposed (no T)" have very similar performances in most situations, which suggests that there may still be room to improve the utilization of temporal factors (the most predictive information) in our proposed method. Thus, we examined the effects of different ways of integrating temporal information. The "Proposed (time series T)" method forms a time series of the cascade size for each message and feeds the time series into our neural network, instead of the temporal embedding of individual nodes. Table 3 shows that "Proposed (time series T)" performs worse than "Proposed (no S)". This is consistent with our expectation, since the precise time interval between two adoption events is more informative than a time series. Note that when making predictions at the beginning of the information cascade, "Proposed (no T)" performed worse than "Proposed (no S)", which may be due to the fact that a "simple" topology is inadequate for providing an effective forecast. Finally, our proposed method had the best performance, suggesting that temporal information and structural information are complementary for cascade prediction.
To demonstrate the effectiveness of the attention mechanism and the gate mechanism in the proposed method, we compared the proposed method with variants that each remove one of these components. For ease of presentation, we denote the attention mechanism as "attention" and the gate mechanism as "gate". The corresponding results are presented in Table 4. We find that our proposed method outperforms its variants, which demonstrates the positive contribution of each component.

Conclusions
In this paper, we proposed a novel method for information cascade prediction based on an end-to-end neural network. Learning the representation of a cascade in an end-to-end manner circumvented the difficulties inherent to hand-crafted feature design. To efficiently obtain and fuse the temporal and structural information, we designed an attention mechanism that involves intra-attention and inter-gate modules. We conducted experiments on two scenarios, i.e., predicting the size of retweet cascades on Twitter and predicting the citations of papers in AMiner. Compared with the other state-of-the-art prediction methods, our proposed method achieved lower prediction errors. Future work includes the incorporation of other predictive information within the attention framework. Cascade dynamics modeling with our attention neural network is also of interest.