From: Modelling email traffic workloads with RNN and LSTM models
Authors | Objective and dataset | Key parameters | Techniques & findings |
---|---|---|---|
Shah and Noble [4] | Large-scale study of email patterns. The dataset was collected over 7 months (2.85 million messages) from a departmental server | Message size, content type, temporal locality | The lognormal distribution is the best fit for the size of the message body, while the Pareto distribution is the best fit for the tail. Spam emails are larger than legitimate emails |
Gomes, et al. [3] | Focus on identifying the characteristics that significantly distinguish spam from non-spam traffic. The dataset consists of 8 days of incoming SMTP email logs collected from a university in Brazil | Email arrival process, size and popularity distribution, and temporal locality | The inter-arrival time for spam traces is exponentially distributed. Email sizes follow a lognormal distribution for both spam and non-spam traces; however, the average size of non-spam emails is six to eight times larger than the average size of spam. The distribution of the number of recipients per email is modelled with a Zipf-like distribution and is heavier-tailed in the spam workload. Temporal locality is much weaker among spam recipients than among non-spam recipients |
Bertolotti and Calzarossa [2] | Focus on the accurate characterization of the email traffic workload. The datasets were collected from the mail servers of an ISP, two enterprises and a university in Italy | Arrival process, size and the number of recipients of messages | The Weibull distribution provides the best fit for inter-arrival times smaller than a threshold, whereas the Pareto distribution is the best fit for inter-arrival times larger than the threshold. The empirical threshold on the inter-arrival time is approximately 7 s |
Lee and Kim [32] | Focus on the coexistence of the Poisson process and self-similarity. The dataset consists of 9 months of SMTP traces collected from a web portal in South Korea | Inter-arrival time of SMTP traces | The Q–Q plot and chi-square test demonstrate that the inter-arrival times of SMTP traces are consistent with a Poisson process. On the other hand, the inter-arrival times also exhibit self-similarity and long-range dependence |
Boukoros, et al. [6] | Focus on modelling the workload of email servers for all categories of traffic using probability distribution models and statistical tests. The datasets were collected over 9 months from a university in Greece | Users’ incoming and outgoing email sizes, system incoming and outgoing email sizes and spam email sizes | In contrast to several of the above works, the lognormal distribution did not provide the best fit for any of the categories. The best fit was provided by the log-logistic and Generalized Extreme Value distributions |
Boukoros, et al. [5] | Focus on modelling email traffic as a time series problem. The datasets were collected from four universities over several months | As in [6] | The Recurrent Neural Network model achieved significantly higher accuracy than the probability distribution models |
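
Several of the studies above rank candidate distributions by how well they fit observed email sizes or inter-arrival times. The following is a minimal sketch of that kind of comparison, not the procedure of any cited work: it fits a lognormal and an exponential model by maximum likelihood and compares their Kolmogorov–Smirnov distances. The data are synthetic, and the function names (`fit_lognormal`, `ks_statistic`) and the parameter values are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(sizes):
    """Maximum-likelihood lognormal fit: mean and std of the log-sizes."""
    logs = [math.log(s) for s in sizes]
    return fmean(logs), pstdev(logs)


def lognormal_cdf(mu, sigma):
    """Return the CDF of a lognormal distribution with the given parameters."""
    std_normal = NormalDist()
    return lambda x: std_normal.cdf((math.log(x) - mu) / sigma)


def exponential_cdf(rate):
    """Return the CDF of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x)


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check the gap just after and just before each step of the ECDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic "email sizes" in bytes (assumed parameters, not real trace data).
rng = random.Random(0)
sizes = [rng.lognormvariate(9.0, 1.5) for _ in range(2000)]

mu, sigma = fit_lognormal(sizes)
ks_lognormal = ks_statistic(sizes, lognormal_cdf(mu, sigma))
ks_exponential = ks_statistic(sizes, exponential_cdf(1.0 / fmean(sizes)))

print(f"lognormal fit:   mu={mu:.2f}, sigma={sigma:.2f}, KS={ks_lognormal:.3f}")
print(f"exponential fit: KS={ks_exponential:.3f}")
```

A smaller KS distance indicates a closer fit. On real traces the same comparison would be repeated for each candidate family (for example Weibull, Pareto, log-logistic or Generalized Extreme Value) and each traffic category.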