Information cascades prediction with attention neural network
Human-centric Computing and Information Sciences volume 10, Article number: 13 (2020)
Abstract
Cascade prediction helps us uncover the basic mechanisms that govern collective human behavior in networks, and it is also important in many other applications, such as viral marketing, online advertising, and recommender systems. However, it is not trivial to make predictions due to the myriad factors that influence a user’s decision to reshare content. This paper presents a novel method for predicting the increment size of the information cascade based on an end-to-end neural network. Learning the representation of a cascade in an end-to-end manner circumvents the difficulties inherent to the design of handcrafted features. An attention mechanism, which consists of the intra-attention and inter-gate modules, was designed to obtain and fuse the temporal and structural information learned from the observed period of the cascade. The experiments were performed on two real-world scenarios, i.e., predicting the size of retweet cascades on Twitter and predicting the citation of papers in AMiner. Extensive results demonstrated that our method outperformed the state-of-the-art cascade prediction methods, including both feature-based and generative approaches.
Introduction
Online social networks are very popular, and they are changing the way people communicate, work, and play, mostly for the better. One of the things that fascinates us most about social network sites is the resharing mechanism, which has the potential to spread information to millions of users in a matter of a few hours or days. For instance, a user can share content (e.g., videos on YouTube, tweets on Twitter, and photos on Flickr) with her set of friends, who can subsequently reshare the content, resulting in the development of a cascade of resharing. Such information cascades play a significant role in almost every social network phenomenon, including, but not limited to, the diffusion of innovation, persuasion campaigns, and the spreading of rumors. Information cascade prediction aims to infer key properties of information cascades, such as their sizes and shapes, which indicate how far the information can reach in the social network. This prediction task is valuable, and it can be applied in an array of areas, such as content recommender systems and monitoring the consensus opinion. However, cascade prediction is challenging due to the myriad factors that influence a user’s decision to reshare content.
The problem of cascade prediction has been studied extensively [1,2,3], but most of the studies either depended heavily on the quality of the carefully designed handcrafted features or made various strong assumptions about the generative processes of the resharing events and oversimplified reality, leading to impaired predictive power. On the other hand, deep learning methods, such as convolutional neural networks (CNNs) [4] and recurrent neural networks (RNNs) [5], have achieved great success in various complicated tasks [6,7,8], and some studies have used neural networks as a transformer to leverage various informative features for cascade prediction [9]. Nevertheless, these methods ignore the temporal properties for cascade prediction, which are regarded as the valuable information that is needed to improve cascade prediction in traditional works.
In this paper, we propose to predict the information cascade within a neural network framework, by incorporating an attention mechanism using temporal and structural information learned from the observed period of the cascade. Our proposed method consists of three layers. In the first layer, the structure embedding is obtained by representing the cascade graph as a set of random walk paths that carry information about the propagators of the message and the local and global topologies among them. Inspired by the recent successes of the point process model in cascade dynamics modeling [10], the temporal embedding is a series of hidden representations of reshare events ordered ascendingly by time. The challenge is how to assemble paths or events into an effective representation of each factor. Thus, in the second layer, we designed a novel attention mechanism that contains intra-attention and inter-gate modules. The assembly problem is solved via the intra-attention mechanism with respect to (w.r.t.) the topological structure and the temporal properties. Further, a gate mechanism is proposed to fuse the structure and temporal representations by capturing the importance of the two factors for cascade prediction. Finally, the top layer introduces a multilayer perceptron (MLP) to output the prediction (the increment size of the cascade in our case). We performed extensive experiments on two representative real-world datasets, a Twitter dataset and an AMiner citation network dataset. Our results indicated that our proposed method outperformed state-of-the-art cascade prediction models.
The remainder of this paper is organized as follows. “Related work” section presents a survey of the related work. “Preliminaries” section formulates the cascade prediction problem and introduces the recurrent neural network. “Approach” section presents the details of the proposed model. The experimental results are presented in “Experiments” section, and conclusions and plans for future work are reported in “Conclusions” section.
Related work
We review studies relevant to our work from two aspects, i.e., cascade prediction and the attention mechanism.
Cascade prediction
Information cascade prediction has been explored in recent years and is still an open problem. Existing methods for cascade prediction can be categorized into two broad types, i.e., feature-based and model-based approaches.
Feature-based approaches [2, 3, 11,12,13,14,15] make the connection between the prediction and various types of handcrafted features that are extracted from the information cascade, including the structural features of the social network, content features, temporal features, and user features. To predict the popularity of news articles in Yahoo News, Arapakis et al. [16] used 10 different features that they extracted from the content of the news articles as well as external sources. To predict the popularity of online videos on YouTube and Facebook, Trzcinski et al. [17] utilized both the visual clues and the early popularity patterns of the videos once they were released. Instead of predicting the total volume or level of popularity, Kong et al. [18] focused on the popularity evolution of online content and considered the dynamic factors that influenced how the popularity evolved. Nevertheless, there is no principled way to design and extract these features, and the accuracy of the predictions is sensitive to the quality of the extracted features.
Model-based approaches [1, 19,20,21,22] are devoted to directly characterizing and modeling the formation of an information cascade in the network. These approaches can often provide intuitive explanations for the prediction due to the interpretable factors that are incorporated in them. Yu et al. [21] proposed a novel NEtworked Weibull Regression model for modeling microbehavioral dynamics that significantly improved the interpretability and generalizability of traditional survival models. Bao et al. [23] modeled the popularity dynamics of tweets on Twitter using the Hawkes process. They also proposed a method for exploring an adaptive peeking window for each tweet, which can synthesize all of the global dynamic information within the observed period into the predicted peek point. However, using the model-based approach for cascade prediction is often suboptimal, because strong assumptions are made about the process of information flow during a diffusion, and the models are not trained with the size of the future cascade as a guide.
Inspired by the recent success of deep learning in various complicated tasks, several studies [9, 24] have adopted deep learning methods to leverage various features for cascade prediction and have achieved satisfactory results. Our work is closely related to these studies. In our work, learning the representation of a cascade in an end-to-end manner circumvents the difficulties inherent to the handcrafted feature design step. We also incorporate the temporal properties, which have been ignored in previous work [9].
Attention mechanism
The concept of attention was first introduced in Neuroscience and Computational Neuroscience [25, 26]. For instance, visual attention is the process by which humans focus on specific portions of their visual inputs to compute adequate responses. Similarly, in training neural networks, the attention mechanism allows models to learn alignments between different parts of the input. The attention mechanism has recently gained popularity in various tasks, such as neural machine translation [27], image captioning [28], image/video popularity prediction [24, 29], and question answering [30, 31]. To predict video popularity, Bielski et al. [29] proposed a model with a self-attention mechanism to hierarchically attend to both video frames and textual modalities. To the best of our knowledge, we are the first to introduce the attention mechanism into cascade prediction by fusing temporal and structural information.
Preliminaries
In this section, we first present a formal definition of the cascade prediction problem (“Problem definition” section), and then we briefly describe the recurrent neural network that is used in our proposed method (“Recurrent neural network” section).
Problem definition
Let \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) be a social network (e.g., Twitter or the academic paper network), where \({\mathcal {V}}\) is the set of vertices of \({\mathcal {G}}\), and \({\mathcal {E}}\subset {\mathcal {V}}\times {\mathcal {V}}\) is the set of edges of \({\mathcal {G}}\). A vertex \(u\in {\mathcal {V}}\) represents a user in the social network and an edge \((u,v)\in {\mathcal {E}}\) represents that there exists a feedback relationship (e.g., using a like, comment, share, or cite) between user u and user v.
Suppose we have M cascades that start in \({\mathcal {G}}\) after time \(t_0\). At time t, we denote the ith cascade as \(g_t^i= ({\mathcal {V}}_t^i, {\mathcal {T}}_t^i, {\mathcal {E}}_t^i)\), where \({\mathcal {V}}_t^i\) is the subset of \({\mathcal {V}}\) who have taken part in the cascade, \({\mathcal {T}}_t^i=\{t_1^i,\ldots ,t_{|{\mathcal {V}}_t^i|}^i\}\) represents the times at which the users in \({\mathcal {V}}_t^i\) take part in the cascade, and \({\mathcal {E}}_t^i = {\mathcal {E}} \cap ({\mathcal {V}}_t^i\times {\mathcal {V}}_t^i)\) represents the feedback relationships between users in \({\mathcal {V}}_t^i\).
In this work, we first obtain \(g_t^i\)’s detailed representation as \(\{\varvec{S^i},\varvec{H^i}\}\), where \(\varvec{S^i}\) and \(\varvec{H^i}\) correspond to the structure representation and the temporal representation, respectively. We denote the cascade size of \(g_t^i\) as \(R_t^i=|{\mathcal {V}}_t^i|\). Thus, our aim is to predict the incremental size \(\Delta R_t^i = |{\mathcal {V}}_{\infty }^i|-|{\mathcal {V}}_t^i|\). In other words, the target is to learn a function f that maps \(\{\varvec{S^i},\varvec{H^i}\}\) to \(\Delta R_t^i\), \(f: \{\varvec{S^i},\varvec{H^i}\}\rightarrow \Delta R_t^i\).
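To make the prediction target concrete, the following minimal sketch computes the observed size and the incremental size of one cascade from its adoption times. The function name `cascade_sizes`, the input format, and the use of the last recorded adoption as a stand-in for the final size are illustrative assumptions.

```python
def cascade_sizes(adoption_times, t_obs):
    """Return (R_t, delta_R) for one cascade.

    adoption_times: times at which users joined the cascade, measured from
        the original post (hypothetical input format).
    t_obs: length of the observation window, in the same unit.
    """
    observed = sum(1 for t in adoption_times if t <= t_obs)  # R_t = |V_t|
    final = len(adoption_times)  # assumed proxy for |V_inf|
    return observed, final - observed  # (R_t, delta_R)


times = [0.0, 0.2, 0.5, 1.4, 2.0, 7.5]
r_t, delta_r = cascade_sizes(times, t_obs=1.0)
```

With an observation window of 1.0, three adoptions are observed and three more follow, so the regression target for this toy cascade is an increment of 3.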
Note that throughout this paper, we denote vectors by bold lowercase letters and matrices by bold capital Roman letters. In what follows, we will omit the superscript i of related notations for simplicity.
Recurrent neural network
Recurrent neural network (RNN) [5, 32] is a type of deep neural network with cycles and internal memory units that capture sequential information, which makes it a more general model than the feedforward network. In practice, RNN has been shown to be a powerful tool for modeling sequences [33]. Long short-term memory (LSTM) [34] and gated recurrent unit (GRU) [35] are recurrent mechanisms that are used extensively. According to Chung et al. [35], GRU has been shown to exhibit better performance with less computation, and it is used as the basic recurrent unit in our proposed approach. The updating formulation of GRU is as follows:
\[
\begin{aligned}
\varvec{u_i}&=\sigma (\varvec{W_u}\varvec{x_i}+\varvec{U_u}\varvec{h_{i-1}}+\varvec{b_u}),\\
\varvec{r_i}&=\sigma (\varvec{W_r}\varvec{x_i}+\varvec{U_r}\varvec{h_{i-1}}+\varvec{b_r}),\\
\varvec{{\tilde{h}}_i}&=\tanh (\varvec{W_h}\varvec{x_i}+\varvec{r_i}\cdot \varvec{U_h}\varvec{h_{i-1}}+\varvec{b_h}),\\
\varvec{h_i}&=\varvec{u_i}\cdot \varvec{{\tilde{h}}_i}+(1-\varvec{u_i})\cdot \varvec{h_{i-1}},
\end{aligned}
\]
where \(\varvec{x_i}\) is the current input, \(\varvec{h_{i-1}}\) is the previous hidden state, \(\varvec{u_i}\) and \(\varvec{r_i}\) are the update and reset gates, \(\varvec{{\tilde{h}}_i}\) is the candidate state, \(\sigma (\cdot )\) is the sigmoid activation function, \(\cdot \) denotes elementwise multiplication, \(\varvec{W_u}\), \(\varvec{W_r}\), \(\varvec{W_h}\), \(\varvec{U_u}\), \(\varvec{U_r}\), \(\varvec{U_h}\), and \(\varvec{b_u}\), \(\varvec{b_r}\), \(\varvec{b_h}\) are GRU parameters learned during training, and \(\varvec{h_i}\) is the updated hidden state. The above system can be abbreviated as a single GRU equation: \(\varvec{h_i}=GRU(\varvec{x_i},\varvec{h_{i-1}})\)
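The GRU update can be sketched in a few lines of plain Python. This is an illustration only: it assumes the common variant in which the reset gate multiplies \(\varvec{U_h}\varvec{h_{i-1}}\) and the update gate weights the candidate state, and the helper names (`matvec`, `gate`, `gru_step`) and the parameter dictionary layout are hypothetical.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, U, b, x, h, act):
    """act(W x + U h + b), applied elementwise."""
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One update h_i = GRU(x_i, h_{i-1}); p maps names such as 'W_u' to parameters."""
    u = gate(p["W_u"], p["U_u"], p["b_u"], x, h_prev, sigmoid)  # update gate
    r = gate(p["W_r"], p["U_r"], p["b_r"], x, h_prev, sigmoid)  # reset gate
    wx, uh = matvec(p["W_h"], x), matvec(p["U_h"], h_prev)
    # Candidate state: the reset gate scales the recurrent term (assumed variant).
    cand = [math.tanh(a + ri * c + d) for a, ri, c, d in zip(wx, r, uh, p["b_h"])]
    # Interpolate between the candidate state and the previous hidden state.
    return [ui * ci + (1.0 - ui) * hi for ui, ci, hi in zip(u, cand, h_prev)]


H = 2
zero_m = [[0.0] * H for _ in range(H)]
params = {f"{k}_{g}": zero_m for k in ("W", "U") for g in "urh"}
params.update({f"b_{g}": [0.0] * H for g in "urh"})
h_next = gru_step([1.0, 1.0], [1.0, -2.0], params)
```

With all parameters set to zero, both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous hidden state; this makes the gating arithmetic easy to check by hand.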
Approach
In this section, we introduce our proposed method (presented in Fig. 1). It consists of three major components: (1) input embedding (“Input embedding” section); (2) attention mechanism (“Attention mechanism” section); and (3) output layer (“Output layer” section).
Input embedding
Extracting structure representation
The cascade graph \(g_t\) is first represented as a set of cascade paths that are sampled through multiple random walk processes. Each cascade path not only carries information about who the information propagators are, but also captures the information flow. We then feed the paths into a gated recurrent neural network to obtain the hidden representation.
We follow previous work [9, 36] and use a fixed path length L and a fixed number of sequences K. Concisely speaking, for each random walk process, we first sample the starting node with a probability by the following equation:
where \(\alpha \) is a smoother, \(deg_c(u)\) is the outdegree of vertex u in \({\mathcal {G}}\), and \(V_c\) is the set of nodes in \(g_t\). Following the starting node, the neighbor node is sampled with the probability:
The sampling of one sequence stops either when we reach the predefined length L or when we reach a vertex that has no outgoing neighbors. Whenever the length of a sequence is smaller than L, the sequence is padded with a special vertex ‘+’. This process of sampling sequences continues until we have sampled K sequences.
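The sampling procedure can be sketched as follows. This is a minimal sketch under stated assumptions: both the starting node and each next node are drawn with weight equal to out-degree plus the smoother \(\alpha \), which is one plausible reading of the smoothed sampling rule, and out-degrees are taken from the cascade graph itself for simplicity, whereas the text takes \(deg_c(u)\) from \({\mathcal {G}}\). The function name `sample_paths` and the adjacency-dictionary input are hypothetical.

```python
import random


def sample_paths(out_neighbors, K, L, alpha=0.01, pad="+", seed=0):
    """Sample K random-walk paths of fixed length L from {node: [successors]}."""
    rng = random.Random(seed)
    nodes = list(out_neighbors)

    def pick(candidates):
        # Assumed smoothed weighting: weight(u) = out-degree(u) + alpha.
        weights = [len(out_neighbors.get(u, [])) + alpha for u in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    paths = []
    for _ in range(K):
        path = [pick(nodes)]
        # Walk until the path is full or the current node has no successors.
        while len(path) < L and out_neighbors.get(path[-1]):
            path.append(pick(out_neighbors[path[-1]]))
        path += [pad] * (L - len(path))  # pad short walks to length L
        paths.append(path)
    return paths


graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
paths = sample_paths(graph, K=5, L=4)
```

Every returned path has exactly L entries, so the K paths can be stacked into a fixed-size input for the recurrent encoder.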
Each node in the sequence is represented as a one-hot vector, \(\varvec{q}\in {\mathbb {R}}^N\), where N is the total number of nodes in \({\mathcal {G}}\). Before we feed the one-hot vectors into GRU, we first convert each of them into a low-dimensional dense vector \(\varvec{x}\) by an embedding matrix \(\varvec{W_x}\in {\mathbb {R}}^{H\times N}\): \(\varvec{x}=\varvec{W_x}\varvec{q}\), where H is an adjustable embedding dimension.
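As a small illustration of the embedding step, multiplying \(\varvec{W_x}\) by a one-hot vector simply selects one column of \(\varvec{W_x}\), so in practice the product is usually implemented as a table lookup. The toy matrix values and the helper names `one_hot` and `embed` below are assumptions for illustration.

```python
def one_hot(index, n):
    """N-dimensional one-hot vector q with a 1 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(n)]


def embed(W_x, q):
    """x = W_x q for an H x N matrix W_x (list of H rows) and an N-vector q."""
    return [sum(w * v for w, v in zip(row, q)) for row in W_x]


W_x = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]  # H = 2, N = 3 (toy values)
x = embed(W_x, one_hot(1, 3))
column = [row[1] for row in W_x]  # direct lookup of the same column
```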
Then we feed the sequence into GRU to generate sequential hidden states. We adopt the bidirectional GRU [37], where a forward GRU reads the sequence node by node, from left to right, and generates a sequence of forward hidden vectors [\(\varvec{\overrightarrow{h}_i^k}\)]. Similarly, a backward GRU reads from right to left, node by node, and generates a sequence of backward hidden vectors [\(\varvec{\overleftarrow{h}_i^k}\)]. This encoder can be used to simulate the process of information flow during a diffusion. For the ith node in the sequence, the updated hidden state is computed as the concatenation of the forward and backward hidden vectors:
\[
\varvec{\overleftrightarrow {h}_i^k}=\varvec{\overrightarrow{h}_i^k}\oplus \varvec{\overleftarrow{h}_i^k}
\]
where \(\oplus \) denotes the concatenation operation.
Hence, we can obtain the kth sequence’s representation [\(\varvec{\overleftrightarrow {h}_i^k}\)]. We assume a multinomial distribution \(\alpha _1, \ldots ,\alpha _L\) over the L nodes so that \(\sum _{i=1}^{L}\alpha _i=1\). Thus, the kth sequence is represented as:
\[
\varvec{s_k}=\sum _{i=1}^{L}\alpha _i\varvec{\overleftrightarrow {h}_i^k}
\]
Note that the weight \(\alpha _i \) is also learned through the deep learning process.
Finally, from the perspective of topological structure, a cascade graph can be expressed as \(\varvec{S}=[\varvec{s_1},\ldots ,\varvec{s_K}]\), \(\varvec{s_k}\in {\mathbb {R}}^{2H}\).
Extracting temporal representation
When we consider the temporal information of cascade graph \(g_t\), the adoption process can be viewed either as a time series or as a point process. The former is indexed with fixed and equal time intervals, which can be used to capture the dependence among time-varying features. The latter is generated asynchronously with random timestamps, and the precise time interval between two adoption events carries a great deal of information about the underlying dynamics. Capturing this information is crucial for predicting the increment size of the cascade graph. Thus, as Fig. 1 shows, we used the point process form. The effectiveness of the point process form is demonstrated in “Experiments” section.
Specifically, for adoption event i, we can extract the associated temporal features (e.g., the inter-event duration \(d_i=t_i-t_{i-1}\)) and obtain the corresponding temporal sequence \({\mathcal {T}}_t=\{d_1,\ldots ,d_{|{\mathcal {V}}_t|}\}\). Then, we feed the sequence, \({\mathcal {T}}_t\), into GRU, where the hidden state of adoption event i (denoted as \(\varvec{h_i}\)) can be updated by:
\[
\varvec{h_i}=GRU(d_i,\varvec{h_{i-1}})
\]
We should emphasize that, in this case, the current input vector degenerates into a scalar. After recurrent computation for each time step, we gather a series of hidden states \(\varvec{T}=[\varvec{h_1},\ldots ,\varvec{h_{R_t}}]\), \(\varvec{h_m}\in {\mathbb {R}}^{2H}\).
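The two views of the adoption process can be contrasted with a short sketch. The first function builds the inter-event durations that serve as scalar GRU inputs in the point process form; the second builds an equal-interval series of cascade sizes. The function names, the convention \(t_0=0\) for the time of the original post, and the bin layout are assumptions made for illustration.

```python
def inter_event_durations(times):
    """Point-process view: d_i = t_i - t_{i-1}, taking t_0 = 0 (assumption)."""
    prev, out = 0.0, []
    for t in sorted(times):
        out.append(t - prev)
        prev = t
    return out


def cumulative_series(times, t_obs, n_bins):
    """Time-series view: cascade size at the end of each of n_bins equal intervals."""
    width = t_obs / n_bins
    return [sum(1 for t in times if t <= (b + 1) * width) for b in range(n_bins)]


times = [0.5, 1.0, 1.25, 3.0]
durations = inter_event_durations(times)  # [0.5, 0.5, 0.25, 1.75]
series = cumulative_series(times, t_obs=4.0, n_bins=4)  # [2, 3, 4, 4]
```

The duration sequence keeps the exact spacing between adoptions, whereas the binned series only records how many adoptions fell into each interval.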
In summary, we have a structure representation \(\varvec{S}\) and a temporal representation \(\varvec{T}\) as inputs for the attention mechanism to be proposed below.
Attention mechanism
Our attention mechanism consists of two parts: an intra-attention module and an inter-gate module. Through these we can obtain a more suitable representation of cascade \(g_t\) for prediction.
Intra-attention mechanism
Attention computation for topological structure
Intra-attention w.r.t. topological structure (presented in Fig. 2) aims at assembling the sampled cascade paths into an effective representation of the structure information of \(g_t\). First, we convert the temporal embedding matrix into a vector representation \(\varvec{{\bar{h}}}\) via a mean pooling mechanism:
\[
\varvec{{\bar{h}}}=\frac{1}{R_t}\sum _{m=1}^{R_t}\varvec{h_m}
\]
The weight \(\alpha _k\) is formalized as
where \(\alpha _k\) is the attention to the hidden state representation of the kth sequence in the graph \(g_t\), and \(\omega (\varvec{s_k},\varvec{{\bar{h}}})\) is set using the following function
where the parameter matrices of intra-attention satisfy \(\varvec{A_S}\in {\mathbb {R}}^{1\times 2H}\), \(\varvec{W_S}\) and \(\varvec{U_S}\in {\mathbb {R}}^{2H\times 2H}\). The above equation is essentially used to calculate the relevance of each sequence in graph \(g_t\) to the temporal embedding. The intuition is that different temporal properties have diverse influences on the topological structure of the cascade. For instance, compared with adoption events that occur occasionally, intensive adoption events bring a larger potential adoption base for the selected message, which in turn leads to a more complex cascade network. Hence, we used the temporal embedding to guide the learning of the combination weights of the sequences extracted from the cascade graph. Consequently, we can get the attended whole structure embedding \(\varvec{{\dot{s}}}\) via the weighted sum pooling mechanism:
\[
\varvec{{\dot{s}}}=\sum _{k=1}^{K}\alpha _k\varvec{s_k}
\]
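The intra-attention step for the topological structure can be sketched as follows. The softmax normalization of the weights and the additive scoring function \(\varvec{A}\tanh (\varvec{W}\varvec{s_k}+\varvec{U}\varvec{{\bar{h}}})\) are assumptions chosen to match the stated parameter shapes; they are not confirmed by the text. The function names are hypothetical.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mean_pool(vectors):
    """Average a list of equally sized vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def intra_attention(items, guide, A, W, U):
    """Pool `items` with softmax weights scored against the `guide` vector."""
    Ug = matvec(U, guide)
    scores = []
    for v in items:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, v), Ug)]
        scores.append(sum(a * h for a, h in zip(A, hidden)))  # A is a 1 x d row
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(items[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, items)) for j in range(dim)]
    return weights, pooled


S = [[1.0, 0.0], [0.0, 1.0]]  # K = 2 path representations (toy values)
T = [[2.0, 2.0], [4.0, 0.0]]  # hidden states of two adoption events
h_bar = mean_pool(T)
zero = [[0.0, 0.0], [0.0, 0.0]]
weights, s_dot = intra_attention(S, h_bar, A=[0.0, 0.0], W=zero, U=zero)
```

With zero parameters every path receives the same weight, so the attended embedding reduces to the plain mean of the paths. Under the same assumptions, the attention over temporal properties would call `intra_attention(T, mean_pool(S), ...)` with its own parameters.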
Attention computation for temporal properties
Intra-attention w.r.t. temporal properties (presented in Fig. 3) aims to assemble the events into an effective representation of the temporal information of \(g_t\). Similarly, we first convert the structure embedding matrix into a vector representation \(\varvec{{\bar{s}}}\) via a mean pooling mechanism:
\[
\varvec{{\bar{s}}}=\frac{1}{K}\sum _{k=1}^{K}\varvec{s_k}
\]
The attention weight \(\alpha _m\) for the mth hidden vector \(\varvec{h_m}\) is formalized as:
where
scores the extent of the dependence between the mth adoption behavior and the structure embedding, and the parameter matrices satisfy \(\varvec{A_T}\in {\mathbb {R}}^{1\times 2H}\), \(\varvec{W_T}\) and \(\varvec{U_T}\in {\mathbb {R}}^{2H\times 2H}\). A complex cascade network topology improves the reception and visibility of the message, and thus promotes the occurrence of adoption events. In the time dimension, this is reflected in the aggregation of adoption events, which is also called bursting diffusion of the message. In our previous work [23], we demonstrated that different parts of the diffusion history have diverse influences on the future cascade size, and we proposed a method for obtaining the most effective part of the history to make an accurate prediction. Analogously, the pooling weights for the temporal properties of different adoption events are automatically learned based on the structural embedding of the cascade graph \(g_t\) to optimize the prediction of cascade growth.
Hence we can obtain the attended whole temporal embedding \(\varvec{{\dot{h}}}\) via the following equation:
\[
\varvec{{\dot{h}}}=\sum _{m=1}^{R_t}\alpha _m\varvec{h_m}
\]
Inter-gate mechanism
Having obtained the attended whole structure embedding \(\varvec{{\dot{s}}}\) and temporal embedding \(\varvec{{\dot{h}}}\), we can feed these two embeddings into the inter-gate mechanism to effectively combine the two factors. The proposed inter-gate mechanism can capture the different importance of the two factors when predicting the cascade growth. Instead of setting a fixed weight, it adaptively tunes the combination weight. Specifically, the final representation \(\varvec{c}\) of cascade graph \(g_t\) when combining the temporal and structure factors is assembled by:
where the adaptive combination weight \(\beta \in (0,1)\) is computed by:
where the parameter matrices satisfy \(\varvec{W_C}\) and \(\varvec{U_C}\in {\mathbb {R}}^{2H\times 2H}\), and they are both learned through the deep learning process.
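One way to realize such a gate is sketched below. The specific form is an assumption: the gate is taken to be \(\beta =\sigma (\varvec{W_C}\varvec{{\dot{s}}}+\varvec{U_C}\varvec{{\dot{h}}})\), applied elementwise, and the fused representation is taken to be \(\beta \cdot \varvec{{\dot{s}}}+(1-\beta )\cdot \varvec{{\dot{h}}}\). Which embedding receives \(\beta \), and whether \(\beta \) is a scalar or a vector, is not fixed by the text; the function name `inter_gate` is hypothetical.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def inter_gate(s_dot, h_dot, W_C, U_C):
    """Fuse structure and temporal embeddings with an adaptive gate (assumed form)."""
    z = [a + b for a, b in zip(matvec(W_C, s_dot), matvec(U_C, h_dot))]
    beta = [1.0 / (1.0 + math.exp(-v)) for v in z]  # each entry lies in (0, 1)
    c = [b * s + (1.0 - b) * h for b, s, h in zip(beta, s_dot, h_dot)]
    return beta, c


zero = [[0.0, 0.0], [0.0, 0.0]]
beta, c = inter_gate([1.0, 3.0], [3.0, 1.0], zero, zero)
```

With zero parameters the gate is 0.5 everywhere and the fused vector is the plain average of the two embeddings; trained parameters let the gate shift toward whichever factor is more informative for a given cascade.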
Output layer
Finally, our output module consists of a multilayer perceptron (MLP), taking the cascade representation \(\varvec{c}\) as input and generating the final incremental size prediction:
The benefit of this fully connected layer is that it does not incur much model complexity and ensures the capacity of nonlinear modeling.
Experiments
This section presents the experiment setup (“Experiment setup” section) and results analysis (“Experiment results” section).
Experiment setup
Dataset and processing
Twitter
The dataset contains tweets and retweets on Twitter from September 1 to October 1, 2016. Here we focus on a subset of popular tweets that have at least 50 retweets for easier calibration in our model. For each retweet cascade, the dataset includes the publish time of the original tweet, the time of each retweet, and the IDs of the users who participated in the cascade. The global social network \({\mathcal {G}}\) was constructed using the same tweet stream from July and August 2016. To evaluate the performance of our model, we split the original data chronologically into a training dataset, a validation dataset, and a test dataset. Specifically, cascades whose original tweets were published during the first 11 days were used for training, cascades that originated on September 12 were used for validation, and cascades that originated from September 13 to September 15 were used for testing. The rest of the days were used for unfolding the Twitter cascades over the network.
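For concreteness, the chronological split of the Twitter cascades can be sketched as follows. The date boundaries follow the description above; the function name `assign_split` and the string labels are illustrative assumptions.

```python
from datetime import date


def assign_split(published):
    """Map the publication date of an original tweet to a dataset split."""
    if date(2016, 9, 1) <= published <= date(2016, 9, 11):
        return "train"  # first 11 days
    if published == date(2016, 9, 12):
        return "val"
    if date(2016, 9, 13) <= published <= date(2016, 9, 15):
        return "test"
    return None  # later days only let the cascades unfold


split = assign_split(date(2016, 9, 12))
```

Splitting by publication date rather than at random keeps every test cascade strictly later than the training cascades.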
AMiner
The scientific paper datasets are publicly available at https://www.aminer.cn/citation. We constructed the global network \({\mathcal {G}}\) using the citations between 1985 and 1995. Specifically, we drew an edge from author A to author B if B ever cited A’s paper. A citation cascade of a given paper thus contains all authors who have written or cited the paper. We also split the datasets in chronological order. Papers published between 1996 and 2000 were included in the training set. Papers published in 2001 and 2002 were used for validation and testing, respectively.
In summary, Table 1 gives an overview of the basic statistics of the Twitter dataset and the AMiner dataset.
Evaluation metrics
We used the mean squared error (MSE) and mean absolute error (MAE), two standard measurements for regression tasks, to evaluate the prediction performance:
\[
MSE=\frac{1}{M}\sum _{i=1}^{M}({\hat{y}}_i-y_i)^2,\qquad MAE=\frac{1}{M}\sum _{i=1}^{M}|{\hat{y}}_i-y_i|
\]
where \({\hat{y}}_i\) and \(y_i\) are the predicted value and ground truth value of cascade i, respectively. Note that, following the practice of [9], we predict a scaled version of the actual increment of the cascade size, i.e., \(y_i= \log _2(\Delta R^i+1)\).
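A minimal sketch of the two metrics on the log-scaled labels is given below. The function names and the toy predictions are illustrative assumptions.

```python
import math


def scaled_label(delta_r):
    """y = log2(delta_R + 1), the scaled increment used as regression target."""
    return math.log2(delta_r + 1)


def mse(pred, truth):
    """Mean squared error over paired predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(pred, truth)) / len(truth)


def mae(pred, truth):
    """Mean absolute error over paired predictions and labels."""
    return sum(abs(p - y) for p, y in zip(pred, truth)) / len(truth)


truth = [scaled_label(d) for d in (0, 3, 15)]  # [0.0, 2.0, 4.0]
pred = [1.0, 2.0, 2.0]
errors = (mse(pred, truth), mae(pred, truth))  # (5/3, 1.0)
```

Because the labels are on a log scale, an error of 1 corresponds to roughly a factor of two in the raw increment, which keeps very large cascades from dominating the averages.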
Comparison methods
The comparison methods are as follows:
Features-linear
We extract a bag of handcrafted features that were used in previous work [3, 38,39,40,41] and that represent the temporal factor and structure factor for cascade prediction. These features are then fed into a linear regression with L2 regularization. These features include:
Temporal features This type of feature has to do with the speed of adoptions during the prefix cascade. We extract the five-point summary (min, median, max, 25th and 75th percentile) of waiting times between reshare events, the First Half Rate (mean time between adoptions for the first half of the adoptions), the Second Half Rate [38], and the cumulative popularity [42].
Structural features This type of feature includes the structural features of the entire social network around early adopters and the structural features of the cascade. Thus, we extracted the in-degree of each node, the connections between \(g_t\) and \({\mathcal {G}}\), the number of edges in \(g_t\), the number of leaf nodes in \(g_t\), and the average and maximum length of the reshare paths [38].
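The temporal features can be computed from the adoption times alone, as the sketch below illustrates. The interpolation rule for the percentiles, the way the waiting times are split into halves, and the use of the adoption count as cumulative popularity are assumptions; the function name and dictionary keys are hypothetical.

```python
import statistics


def temporal_features(times):
    """Handcrafted temporal features of a prefix cascade (needs >= 3 adoptions)."""
    times = sorted(times)
    waits = [b - a for a, b in zip(times, times[1:])]  # waiting times
    q25, median, q75 = statistics.quantiles(waits, n=4, method="inclusive")
    half = len(waits) // 2  # assumed split point between the two halves
    return {
        "min": min(waits), "q25": q25, "median": median,
        "q75": q75, "max": max(waits),
        "first_half_rate": statistics.mean(waits[:half]),
        "second_half_rate": statistics.mean(waits[half:]),
        "cumulative_popularity": len(times),
    }


feats = temporal_features([0, 1, 3, 6, 10, 15])  # waiting times 1, 2, 3, 4, 5
```

In this toy cascade the waiting times grow from 1 to 5, so the second-half rate (4) is larger than the first-half rate (1.5), which signals that the diffusion is slowing down.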
Support vector regression (SVR)
We follow previous work [17, 43] and adopt an SVR model with a linear kernel to predict the cascade size, using time series data as features.
SEISMIC
[44] This is one of the state-of-the-art generative models for cascade prediction. The model is based on a self-exciting point process and produces final cascade size forecasts using the early adoption activity of a selected message. Note that its predictor is based on a branching process, and thus this method can only be applied to predict the final size of the retweet cascade. In contrast, our proposed end-to-end method can be easily extended to predict the dynamics of the retweet cascade.
DeepCas
[9] This is the first end-to-end, deep learning method for information cascade prediction. It mainly utilizes the information of the structure of the cascade graph and node identities for prediction. The attention mechanism is designed to assemble a cascade graph representation from a set of random walk paths.
Platform and parameter setting
For the length t of the observed initial period of the information cascade, we consider three settings, i.e., \(t=1, 2, 3\) hours for Twitter and \(t=1, 2, 3\) months for AMiner. To instantiate our models, we used the high-level neural network library Keras [45] with Theano [46] as the computational backend. The code ran on a Linux server with 32 GB of memory and 2 CPUs with 4 cores each (Intel® Core™ i7-7700K CPU @ 4.50 GHz). The GPU in use was the Nvidia™ GeForce GTX TITAN 1080 Ti.
Experiment results
We evaluated our proposed model with the comparison methods on the Twitter and AMiner dataset to present the performance of our method. The prediction results are reported in Table 2, which shows that irrespective of the dataset (Twitter and AMiner) and prefix cascade (1, 2, 3 h for Twitter, and 1, 2, 3 months for AMiner), our proposed method outperformed other comparison methods, since it achieved a lower MSE.
Table 2 shows that Features-linear provides worse results than our proposed method, which indicates the limitation of handcrafted features. This holds even though the Features-linear method uses the features that past studies [38] found to be most predictive for cascade prediction, whereas our proposed method automatically learns a joint and effective representation from temporal and structural factors.
Table 2 also shows that our proposed method outperformed SEISMIC, a state-of-the-art generative model, which we attribute to its more flexible attention mechanisms. Specifically, our model uses an attention mechanism to automatically learn the pooling weights for the temporal properties of different adoption events, while SEISMIC uses a constant peeking period within a prefix cascade for different messages when making predictions. In addition, SEISMIC lacks the future cascade size as a guide and makes various stronger assumptions about the diffusion process, which are common disadvantages of generative prediction methods.
Among all of the comparison methods, DeepCas had the best performance because it benefits from end-to-end learning from the data to the prediction. Our proposed method leads to a further reduction of prediction errors when compared with DeepCas, due to the introduction of temporal information, which is ignored in DeepCas.
Comparing the performance for different prefixes t, we can draw a conclusion that applies to all methods for both the Twitter cascades and the citation cascades: as we increased the observation time, the prediction errors tended to decrease, suggesting that more accessible information makes prediction easier. In addition, we can observe that prediction errors are much larger for Twitter (the top half of Table 2) than for AMiner (the bottom half of Table 2), which indicates that predicting the Twitter cascade size is a more difficult scenario of information cascade prediction.
To study the effects of the temporal factor and the structural factor on cascade prediction in more detail, we compared the proposed method and the Features-linear method with their variants that do not consider one of these factors. We also ran these methods on the two datasets and aimed to predict the incremental size of the information cascade using a fixed observation window ranging from 1 to 3 h (months for AMiner). For ease of presentation, we denote the temporal factor as \(\varvec{T}\) and the structural factor as \(\varvec{S}\). Thus “no \(\varvec{T}\)” means removing the temporal factor from the corresponding method, and similarly for “no \(\varvec{S}\)”.
The prediction results of these methods are summarized in Table 3. These results show that our proposed method and Features-linear both outperform their variants, which indicates the usefulness of these factors. For instance, “Proposed (no \(\varvec{T}\))” shows a notable decrease in performance compared with our proposed method (MSE of 3.772 versus 2.609, respectively, when observing for 1 h on Twitter). This shows that feeding temporal features into deep neural networks is indeed meaningful.
We also found that Features-linear (no \(\varvec{S}\)) performs better than Features-linear (no \(\varvec{T}\)), which is consistent with previous research [38]. However, “Proposed (no \(\varvec{S}\))” and “Proposed (no \(\varvec{T}\))” have very similar performances in most situations, which suggests that there is potentially still room to improve the utilization of temporal factors (the most predictive information) in our proposed method. Thus, we examined the effects of different ways to integrate temporal information. The method “Proposed (time series \(\varvec{T}\))” forms a time series of the cascade size for each message and feeds the time series into our neural network, instead of the temporal embedding of individual nodes. Table 3 shows that “Proposed (time series \(\varvec{T}\))” performs worse than “Proposed (no \(\varvec{S}\))”. This is consistent with our expectation, since the precise time interval between two adoption events is more informative than a time series. Note that when making predictions at the beginning of the information cascade, “Proposed (no \(\varvec{T}\))” performed worse than “Proposed (no \(\varvec{S}\))”, which may be due to the fact that a “simple” topology is inadequate for providing an effective forecast. Finally, our proposed method had the best performance, suggesting that temporal information and structural information are complementary for cascade prediction.
To demonstrate the effectiveness of the attention mechanism and the gate mechanism in the proposed method, we compared the proposed method with variants that each remove one of these components. For ease of presentation, we denote the attention mechanism as \(\varvec{attention}\) and the gate mechanism as \(\varvec{gate}\). The corresponding results are presented in Table 4. We find that our proposed method outperforms its variants, which demonstrates the positive contribution of each component.
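As a rough illustration of what removing a gate can mean in such an ablation, the sketch below contrasts a generic sigmoid-gated fusion of a temporal vector and a structural vector with a fixed, ungated average. This is an assumed, textbook-style gate and not necessarily the inter-gate module of the proposed method; the names `gated_fusion` and `mean_fusion`, the weight shapes, and the toy inputs are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(h_t, h_s, W_t, W_s, b):
    """Element-wise convex combination of h_t and h_s controlled by a learned gate."""
    g = sigmoid(W_t @ h_t + W_s @ h_s + b)  # gate values in (0, 1)
    return g * h_t + (1.0 - g) * h_s


def mean_fusion(h_t, h_s):
    """Ungated baseline: both sources always receive equal weight."""
    return 0.5 * (h_t + h_s)


h_t = np.array([1.0, 2.0])   # toy temporal representation
h_s = np.array([3.0, -2.0])  # toy structural representation
zeros = np.zeros((2, 2))

# With zero weights and bias the gate is 0.5 everywhere, so the gated
# fusion reduces to the ungated average.
print(gated_fusion(h_t, h_s, zeros, zeros, np.zeros(2)), mean_fusion(h_t, h_s))

# A strongly positive bias drives the gate towards 1 and selects h_t.
print(gated_fusion(h_t, h_s, zeros, zeros, np.full(2, 50.0)))
```

The point of the sketch is that a gate lets the mixing weights depend on the inputs, whereas the ungated variant fixes them in advance.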
Conclusions
In this paper, we proposed a novel method for information cascade prediction based on an end-to-end neural network. Learning the representation of a cascade in an end-to-end manner circumvents the difficulties inherent to the design of handcrafted features. To efficiently obtain and fuse the temporal and structural information, we designed an attention mechanism that consists of intra-attention and inter-gate modules. We conducted experiments on two scenarios, i.e., predicting the size of retweet cascades on Twitter and predicting the citations of papers in AMiner. Compared with three state-of-the-art prediction methods, our proposed method achieved a smaller prediction error. Future work includes incorporating other predictive information within the attention framework. Modeling cascade dynamics with our attention neural network is also of interest.
Availability of data and materials
The datasets used in this study are available from the corresponding author on reasonable request.
Abbreviations
CNN: Convolutional neural network
RNN: Recurrent neural network
MLP: Multilayer perceptron
LSTM: Long short-term memory
GRU: Gated recurrent unit
SVR: Support vector regression
MSE: Mean squared error
MAE: Mean absolute error
References
1. Zaman T, Fox EB, Bradlow ET (2014) A Bayesian approach for predicting the popularity of tweets. Ann Appl Stat 8(3):1583–1611
2. Cheng J, Adamic LA, Dow PA, Kleinberg JM, Leskovec J (2014) Can cascades be predicted? In: International world wide web conferences. pp 925–936
3. Martin T, Hofman JM, Sharma A, Anderson A, Watts DJ (2016) Exploring limits to prediction in complex social systems. In: International conference on world wide web. pp 683–694
4. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
5. Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: International conference on machine learning. pp 1310–1318
6. Pandarinath C, O’Shea DJ, Collins J, Jozefowicz R, Stavisky SD, Kao JC, Trautmann EM, Kaufman MT, Ryu SI, Hochberg LR et al (2018) Inferring single-trial neural population dynamics using sequential autoencoders. Nat Methods 15(10):805–815
7. Cornia M, Baraldi L, Serra G, Cucchiara R (2018) Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Trans Image Process 27(10):5142–5154
8. Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A (2018) Deep audio-visual speech recognition. IEEE Trans Pattern Anal Mach Intell
9. Li C, Ma J, Guo X, Mei Q (2017) DeepCas: an end-to-end predictor of information cascades. In: Proceedings of the 26th international conference on world wide web. pp 577–586. International World Wide Web Conferences Steering Committee
10. Du N, Dai H, Trivedi R, Upadhyay U, Gomez-Rodriguez M, Song L (2016) Recurrent marked temporal point processes: embedding event history to vector. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. pp 1555–1564. ACM, New York
11. Aghababaei S, Makrehchi M (2017) Activity-based Twitter sampling for content-based and user-centric prediction models. Hum Cent Comput Inf Sci 7(1):3
12. Weng L, Menczer F, Ahn YY (2014) Predicting successful memes using network and community structure. In: ICWSM
13. Loyola-González O, López-Cuevas A, Medina-Pérez MA, Camiña B, Ramírez-Márquez JE, Monroy R (2019) Fusing pattern discovery and visual analytics approaches in tweet propagation. Inf Fusion 46:91–101
14. Jia AL, Shen S, Li D, Chen S (2018) Predicting the implicit and the explicit video popularity in a user generated content site with enhanced social features. Comput Netw 140:112–125
15. Kursuncu U, Gaur M, Lokala U, Thirunarayan K, Sheth A, Arpinar IB (2019) Predictive analysis on Twitter: techniques and applications. In: Emerging research challenges and opportunities in computational social network analysis and mining. pp 67–104. Springer, Berlin
16. Arapakis I, Cambazoglu BB, Lalmas M (2017) On the feasibility of predicting popular news at cold start. J Assoc Inf Sci Technol 68(5):1149–1164
17. Trzcinski T, Rokita P (2017) Predicting popularity of online videos using support vector regression. IEEE Trans Multimed 99:1–1
18. Kong Q, Mao W, Chen G, Zeng D (2018) Exploring trends and patterns of popularity stage evolution in social media. IEEE Trans Syst Man Cybern Syst 99:1–11
19. Engelhard M, Xu H, Carin L, Oliver JA, Hallyburton M, McClernon FJ (2018) Predicting smoking events with a time-varying semiparametric Hawkes process model. Proc Mach Learn Res 85:312
20. Li L, Zha H (2014) Learning parametric models for social infectivity in multidimensional Hawkes processes. In: Twenty-eighth AAAI conference on artificial intelligence. pp 101–107
21. Yu L, Cui P, Wang F, Song C, Yang S (2017) Uncovering and predicting the dynamic process of information cascades with survival model. Knowl Inf Syst 50(2):633–659
22. Saito K, Nakano R, Kimura M (2008) Prediction of information diffusion probabilities for independent cascade model. In: International conference on knowledge-based and intelligent information and engineering systems. pp 67–75. Springer, Berlin
23. Bao Z, Liu Y, Zhang Z, Liu H, Cheng J (2019) Predicting popularity via a generative model with adaptive peeking window. Phys A Stat Mech Appl 522:54–68
24. Zhang W, Wang W, Wang J, Zha H (2018) User-guided hierarchical attention network for multimodal social image popularity prediction. In: Proceedings of the 2018 world wide web conference on world wide web. pp 1277–1286. International World Wide Web Conferences Steering Committee
25. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254–1259
26. Desimone R, Duncan J (1995) Neural mechanisms of selective visual attention. Ann Rev Neurosci 18(1):193–222
27. Choi H, Cho K, Bengio Y (2018) Fine-grained attention mechanism for neural machine translation. Neurocomputing 284:171–176
28. Lopez PR, Dorta DV, Preixens GC, Sitjes JMG, Marva FXR, Gonzalez J (2019) Pay attention to the activations: a modular attention mechanism for fine-grained image recognition. IEEE Trans Multimed
29. Bielski A, Trzcinski TP (2018) Understanding multimodal popularity prediction of social media videos with self-attention. IEEE Access 6:74277–74287
30. Xiong C, Merity S, Socher R (2016) Dynamic memory networks for visual and textual question answering. In: International conference on machine learning. pp 2397–2406
31. Kumar A, Irsoy O, Ondruska P, Iyyer M, Bradbury J, Gulrajani I, Zhong V, Paulus R, Socher R (2016) Ask me anything: dynamic memory networks for natural language processing. In: International conference on machine learning. pp 1378–1387
32. Elman JL (1990) Finding structure in time. Cognit Sci 14(2):179–211
33. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp 3104–3112
34. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
35. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
36. Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. pp 701–710. ACM, New York
37. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681
38. Shulman B, Sharma A, Cosley D (2016) Predictability of popularity: gaps between prediction and understanding. In: International conference on weblogs and social media. pp 348–357
39. Ugander J, Backstrom L, Marlow C, Kleinberg J (2012) Structural diversity in social contagion. Proc Natl Acad Sci 201116502
40. Mishra S, Rizoiu MA, Xie L (2016) Feature driven and point process approaches for popularity prediction. In: ACM international on conference on information and knowledge management. pp 1069–1078
41. Souri A, Hosseinpour S, Rahmani AM (2018) Personality classification based on profiles of social networks’ users and the five-factor model of personality. Hum Cent Comput Inf Sci 8(1):24
42. Szabo G, Huberman BA (2010) Predicting the popularity of online content. Commun ACM 53(8):80–88
43. Khosla A, Das Sarma A, Hamid R (2014) What makes an image popular? In: Proceedings of the 23rd international conference on world wide web. pp 867–876
44. Zhao Q, Erdogdu MA, He HY, Rajaraman A, Leskovec J (2015) SEISMIC: a self-exciting point process model for predicting tweet popularity. In: ACM SIGKDD international conference on knowledge discovery and data mining. pp 1513–1522
45. Chollet F et al (2015) Keras: deep learning library for Theano and TensorFlow. https://keras.io/k. 7(8)
46. Team TTD, Al-Rfou R, Alain G, Almahairi A, Angermueller C, Bahdanau D, Ballas N, Bastien F, Bayer J, Belikov A, et al (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688
Acknowledgements
Not applicable.
Funding
This research was funded by the National Key Research and Development Program of China (Grant No. 2018YFC0832304), the National Science Foundation for Young Scientists of China (Grant No. 61801125), and the Fundamental Research Funds for the Central Universities (Grant No. 2017JBZ107).
Author information
Contributions
YL designed the proposed framework and managed and supervised this work. ZB conducted the experiments, analyzed the results, and drafted the manuscript. ZZ and DT provided valuable suggestions for improving the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Bao, Z., Zhang, Z. et al. Information cascades prediction with attention neural network. Hum. Cent. Comput. Inf. Sci. 10, 13 (2020). https://doi.org/10.1186/s1367302000218w
Keywords
 Information diffusion
 Deep learning
 Attention network
 Cascade prediction