Modelling email traffic workloads with RNN and LSTM models

Om, Khandu; Boukoros, Spyros; Nugaliyadde, Anupiya; McGill, Tanya; Dixon, Michael; Koutsakis, Polychronis; Wong, Kok Wai

doi:10.1186/s13673-020-00242-w

Table 1 Related literature on email traffic workload modelling


Authors	Objective and dataset	Key parameters	Techniques & findings
Shah and Noble [4]	Large-scale study of email patterns. The dataset was collected over 7 months (2.85 million messages) from a departmental server	Message size, content type, temporal locality	The lognormal distribution is the best fit for the size of the message body, while the Pareto distribution is the best fit for the tail. Spam emails are larger than legitimate emails
Gomes, et al. [3]	Focus on identifying the characteristics that significantly distinguish spam from non-spam traffic. The dataset consists of 8 days of incoming SMTP email logs collected from a university in Brazil	Email arrival process, size and popularity distribution, and temporal locality	The inter-arrival time for spam traces is exponentially distributed. Email sizes follow a lognormal distribution for both spam and non-spam traces; however, the average size of non-spam emails is six to eight times larger than the average size of spam. The distribution of the number of recipients per email is modelled with a Zipf-like distribution and is heavier-tailed in the spam workload. Temporal locality is much weaker among spam recipients than for non-spam
Bertolotti and Calzarossa [2]	Focus on the accurate characterization of the email traffic workload. The datasets were collected from the mail servers of an ISP, two enterprises and a university in Italy	Arrival process, size and the number of recipients of messages	The Weibull distribution provides the best fit for inter-arrival times smaller than a threshold, whereas the Pareto distribution is the best fit for inter-arrival times larger than that threshold. The empirical threshold value is approximately 7 s
Lee and Kim [32]	Focus on the coexistence of the Poisson process and self-similarity. The dataset consists of 9 months of SMTP traces collected from a web portal in South Korea	Inter-arrival time of SMTP traces	The Q–Q plot and chi-square test demonstrate that the inter-arrival time of SMTP traces follows a Poisson process. On the other hand, the inter-arrival time also exhibits self-similarity and long-range dependence
Boukoros, et al. [6]	Focus on modelling the workload of email servers for all categories of traffic using probability distribution models and statistical tests. The datasets were collected over 9 months from a university in Greece	Users’ incoming and outgoing email sizes, system incoming and outgoing email sizes and spam email sizes	In contrast to several of the above works, the lognormal distribution was found unable to provide the best fit for any of the categories. The best fit was provided by the log-logistic and Generalized Extreme Value distributions
Boukoros, et al. [5]	Focus on modelling email traffic as a time series problem. The datasets were collected from four universities over several months	As in [6]	The Recurrent Neural Network model achieved significantly higher accuracy than the probability distribution models
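
Several of the studies in Table 1 fit a probability distribution to message sizes and judge the quality of the fit. As a rough illustration of that kind of analysis, the sketch below fits a lognormal model to a sample, fits a Pareto model to its upper tail, and measures each fit with the Kolmogorov-Smirnov distance. It is not the procedure of any cited work: the sample is synthetic, the 95th-percentile tail threshold is an arbitrary assumption, and the function names (`fit_lognormal`, `fit_pareto_tail`, `ks_statistic`) are hypothetical helpers introduced here.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(sizes):
    """Closed-form maximum-likelihood estimates (mu, sigma) of a lognormal."""
    logs = [math.log(s) for s in sizes]
    return fmean(logs), pstdev(logs)


def fit_pareto_tail(sizes, x_min):
    """Maximum-likelihood Pareto shape for the observations at or above x_min."""
    tail = [s for s in sizes if s >= x_min]
    alpha = len(tail) / sum(math.log(s / x_min) for s in tail)
    return alpha, len(tail)


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between a sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


# Synthetic stand-in for message sizes in bytes (assumption: NOT data
# from any of the studies listed in Table 1).
rng = random.Random(0)
sizes = [rng.lognormvariate(9.0, 1.2) for _ in range(5000)]

# Lognormal fit over the whole sample.
mu, sigma = fit_lognormal(sizes)
log_model = NormalDist(mu, sigma)


def lognormal_cdf(x):
    return log_model.cdf(math.log(x))


ks_lognormal = ks_statistic(sizes, lognormal_cdf)

# Pareto fit over the upper tail; the 95th-percentile threshold is an
# illustrative assumption, not a value taken from the literature.
x_min = sorted(sizes)[len(sizes) * 95 // 100]
alpha, n_tail = fit_pareto_tail(sizes, x_min)


def pareto_cdf(x):
    return 1.0 - (x_min / x) ** alpha


ks_tail = ks_statistic([s for s in sizes if s >= x_min], pareto_cdf)

print(f"lognormal: mu={mu:.3f}, sigma={sigma:.3f}, KS={ks_lognormal:.4f}")
print(f"Pareto tail: alpha={alpha:.3f}, n={n_tail}, KS={ks_tail:.4f}")
```

A smaller Kolmogorov-Smirnov distance indicates a closer match between the model and the empirical distribution. On real traces, the same comparison could be repeated for other candidates named in the table, such as the Weibull, log-logistic and Generalized Extreme Value distributions.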
