
Table 1 Related literature on email traffic workload modelling

From: Modelling email traffic workloads with RNN and LSTM models

| Authors | Objective and dataset | Key parameters | Techniques & findings |
| --- | --- | --- | --- |
| Shah and Noble [4] | Large-scale study of email patterns. The dataset was collected over 7 months (2.85 million messages) from a departmental server. | Message size, content type, temporal locality | The lognormal distribution is the best fit for the size of the message body, and the Pareto distribution is the best fit for the tail. Spam email sizes are larger than those of legitimate email. |
| Gomes, et al. [3] | Focus on identifying the characteristics that significantly distinguish spam from non-spam traffic. The dataset consists of 8 days of incoming SMTP email logs collected from a university in Brazil. | Email arrival process, size and popularity distribution, and temporal locality | The inter-arrival time for spam traces is exponentially distributed. Email sizes follow a lognormal distribution for both spam and non-spam traces; however, the average size of non-spam emails is six to eight times larger than the average size of spam. The number of recipients per email is modelled with a Zipf-like distribution and is heavier tailed in the spam workload. Temporal locality is much weaker among spam recipients than among non-spam recipients. |
| Bertolotti and Calzarossa [2] | Focus on the accurate characterization of the email traffic workload. The datasets were collected from the mail servers of an ISP, two enterprises and a university in Italy. | Arrival process, size and number of recipients of messages | The Weibull distribution provides the best fit for inter-arrival times smaller than a threshold, whereas the Pareto distribution is the best fit for inter-arrival times larger than the threshold. The empirical threshold on the inter-arrival time distribution is approximately 7 s. |
| Lee and Kim [32] | Focus on the coexistence of the Poisson process and self-similarity. The dataset consists of 9 months of SMTP traces collected from a web portal in South Korea. | Inter-arrival time of SMTP traces | The Q–Q plot and chi-square test indicate that the inter-arrival times of SMTP traces are consistent with a Poisson process. On the other hand, the inter-arrival times also exhibit self-similarity and long-range dependence. |
| Boukoros, et al. [6] | Focus on modelling the workload of email servers for all categories of traffic using probability distribution models and statistical tests. The datasets were collected over 9 months from a university in Greece. | Users' incoming and outgoing email sizes, system incoming and outgoing email sizes, and spam email sizes | In contrast to several of the above works, the lognormal distribution was not the best fit for any of the categories. The best fit was provided by the log-logistic and Generalized Extreme Value distributions. |
| Boukoros, et al. [5] | Focus on modelling email traffic as a time series problem. The datasets were collected from four universities over several months. | As in [6] | The recurrent neural network model achieved significantly higher accuracy than the probability distribution models. |
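
Several of the studies summarised above pick a "best fit" distribution for email sizes or inter-arrival times. The following is a minimal sketch of that kind of comparison, not the procedure of any cited paper. It assumes synthetic data in place of real mail logs, uses the exponential distribution as a deliberately simple alternative candidate, and ranks the candidates by AIC (the cited works rely on their own tests, such as Q–Q plots and chi-square). All function and variable names are hypothetical.

```python
import math
import random


def fit_lognormal(xs):
    """Maximum-likelihood lognormal fit: mean and std of the log-values."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def loglik_lognormal(xs, mu, sigma):
    """Log-likelihood of the data under a lognormal(mu, sigma) model."""
    return sum(
        -math.log(x * sigma * math.sqrt(2 * math.pi))
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )


def fit_exponential(xs):
    """Maximum-likelihood exponential fit: rate = 1 / sample mean."""
    return len(xs) / sum(xs)


def loglik_exponential(xs, rate):
    """Log-likelihood of the data under an exponential(rate) model."""
    return sum(math.log(rate) - rate * x for x in xs)


def aic(loglik, n_params):
    """Akaike information criterion; lower means a better trade-off."""
    return 2 * n_params - 2 * loglik


# Synthetic "email sizes" in bytes (assumption: stands in for a real trace).
rng = random.Random(0)
sizes = [rng.lognormvariate(8.0, 1.5) for _ in range(5000)]

mu, sigma = fit_lognormal(sizes)
rate = fit_exponential(sizes)

aic_lognormal = aic(loglik_lognormal(sizes, mu, sigma), 2)
aic_exponential = aic(loglik_exponential(sizes, rate), 1)

best = "lognormal" if aic_lognormal < aic_exponential else "exponential"
print(f"mu={mu:.3f} sigma={sigma:.3f} best={best}")
```

On a real trace, the same pattern extends to other candidates from the table (Weibull, Pareto, log-logistic, Generalized Extreme Value), although those generally need a numerical optimiser rather than closed-form estimates.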