- Open Access
Performance prediction of data streams on high-performance architecture
© The Author(s) 2019
- Received: 13 June 2018
- Accepted: 26 December 2018
- Published: 7 January 2019
Worldwide sensor streams are expanding continuously, with unbounded velocity and volume, and this acceleration has driven the adaptation of large-scale stream data processing systems from homogeneous to rack-scale architectures, raising serious concerns in the domains of workload optimization, scheduling, and resource management algorithms. Our proposed framework provides an architecture-independent performance prediction model to enable a resource-adaptive distributed stream data processing platform. It comprises seven pre-defined domains for dynamic data stream metrics, including a self-driven model that fits these metrics using a ridge regularization regression algorithm. Another significant contribution lies in a fully automated performance prediction model, inherited from state-of-the-art distributed data management systems and adapted for distributed stream processing systems, which uses Gaussian process regression and clusters metrics with the help of a dimensionality reduction algorithm. We implemented the framework on Apache Heron and evaluated it with a proposed benchmark suite comprising five domain-specific topologies. To assess the proposed methodologies, we forcefully ingest tuple skewness into the benchmarking topologies to establish the ground truth for predictions, and found that the accuracy of predicting the performance of data streams increased to 80.62% from 66.36%, along with a reduction in error from 37.14% to 16.06%.
- Apache Heron
- Stream benchmark suite
- Performance prediction
- Performance behavior
- High performance computing
- Data streams
Specialized distributed real-time stream processing systems demand that the underlying system be able to adapt to increasing data volume from heterogeneous data sources. Along with the requirement for massive computational capability at increasing data velocity, these specialized systems also insist that the underlying framework provide highly scalable resources to achieve massive parallelism among processing logic components across distributed computing nodes in a timely manner; to facilitate fast recovery from hardware failures, stateless and stateful mechanisms in the processing logic components ensure low-latency streaming. Among state-of-the-art specialized distributed stream processing frameworks, Apache Storm , Apache Flink , and Apache Spark have emerged as the de facto programming models, automatically taking care of data and process distribution to achieve sufficient task parallelism. In more recent forays into low-latency distributed stream processing, Google MillWheel  and Apache Heron  have emerged as successors to modern unbounded continuous stream processing systems, scaling transparently to large clusters. Although there are similarities among their components, they provide different mechanisms, such as tuples or buffers for message passing, to deliver high throughput.
The emerging real-time distributed stream processing system Heron is built from a number of components, named Spout, Bolt, Topology Master, Stream Manager, and Metrics Manager, which interact in complex ways while running on several containers to keep pace with high-velocity data volumes. These containers are scheduled to run on a heterogeneous selection of multi-core nodes using large-scale storage infrastructures. Heron also provides a framework to seamlessly integrate with existing large-scale data processing components such as the Apache Hadoop Distributed File System, Apache REEF , Apache Mesos, Apache Aurora , the Simple Linux Utility for Resource Management (SLURM), and Google Kubernetes , but this simultaneously makes it difficult to understand the performance behavior of the underlying applications and components. The performance complexities of traditional relational database management systems can be resolved using optimizers , but how to accurately model and predict performance complexities in a distributed stream processing framework is quite challenging and has not yet been well studied. We address this gap in this paper. These performance complexities arise from large variance in workloads, elasticity, computation fluctuations, and tuple serialization rates, which makes it difficult to predict the behavior of data pipelined across distributed components. Predicting the dynamic performance of data streams provides further insight for a number of data management tasks, including workload optimization , scheduling , and resource management, which helps reduce unnecessary over-provisioning of resources through efficient prioritization of resource allocations in specialized distributed stream processing systems.
We provide domain-specific metrics that are most relevant for a streaming platform running on top of a high-performance computing architecture, because existing methodologies only cover big data processing and distributed database management frameworks.
We characterize the performance behavior of a streaming platform running on top of a high-performance architecture.
We transform a state-of-the-art automated performance tuning module for distributed database management systems to work for a distributed streaming platform.
We propose a novel framework, running on top of a streaming platform, that uses linear least squares with L2 regularization to recommend plausible performance for the stream of an individual topology.
To validate and evaluate the proposed framework, we implemented it on an emerging processing system, Apache Heron.
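The core of the proposed MSL model, linear least squares with L2 regularization (ridge regression), can be sketched with scikit-learn. This is an illustrative sketch on synthetic data; the metric matrix and target below are stand-ins, not the paper's actual topology metrics.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for classified dynamic topology metrics:
# each row is one observation window, each column one metric category.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))                      # seven metric categories
true_w = np.array([0.5, -1.2, 0.0, 2.0, 0.0, 0.3, -0.7])
y = X @ true_w + rng.normal(scale=0.1, size=200)   # e.g. execute latency

# Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2; the L2 penalty
# shrinks the weights and stabilizes the fit on correlated metrics.
model = Ridge(alpha=1.0).fit(X, y)
print(round(model.score(X, y), 3))                 # R^2 of the fit
```

The `alpha` hyperparameter controls the strength of the L2 penalty; in practice it would be chosen by cross-validation over the collected metric data.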
Fields grouping: The progression of tuples is transmitted to those processing logic components that share the same meta-attribute value.
Global grouping: The progression of tuples is transmitted to the single instance having the lowest encoded meta-attribute value.
Shuffle grouping: The progression of tuples is randomly distributed across distinct instances while ensuring uniform distribution.
None grouping: Currently has the same functionality as shuffle grouping.
All grouping: The progression of tuples is distributed to all corresponding processing components.
Custom grouping: The progression of tuples is distributed to corresponding processing components as defined by the user.
Sliding window: Tuples in a stream are grouped together to form windows that can overlap, based either on time duration or on the number of operations performed.
Tumbling window: Tuples in a stream are grouped together to form non-overlapping windows, based either on time duration or on the number of operations performed.
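The count-based variants of the two windowing schemes above can be sketched in a few lines of Python (an illustrative sketch, not Heron's windowing API):

```python
def tumbling_windows(tuples, size):
    """Group a tuple stream into non-overlapping count-based windows."""
    return [tuples[i:i + size] for i in range(0, len(tuples), size)]

def sliding_windows(tuples, size, slide):
    """Group a tuple stream into overlapping count-based windows,
    advancing by `slide` tuples between consecutive windows."""
    return [tuples[i:i + size]
            for i in range(0, len(tuples) - size + 1, slide)]

stream = list(range(6))
print(tumbling_windows(stream, 3))    # [[0, 1, 2], [3, 4, 5]]
print(sliding_windows(stream, 3, 1))  # four overlapping windows of size 3
```

A sliding window with `slide == size` degenerates to a tumbling window; time-based windows work the same way with timestamps in place of tuple counts.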
In this section, we give an overview of the design and implementation details of the proposed framework.
Performance metrics classification
At the time of writing, no single metric exists by which we can evaluate the overall performance of a big data system, which is almost the same problem discussed in . In this section, we classify all existing metrics into seven different categories, which helps in a deeper visualization of the strengths and weaknesses of entire big data processing systems, as discussed in the following section.
Execute latency The execution latency is the latency incurred while processing user-defined logic on the windowed incoming tuples of a topology.
Uptime The total computation time allocated to a process on which the Java virtual machine is running, once it has been shortlisted by the short-term scheduler. In the rest of the paper, we keep nanoseconds as the unit of measurement in the metrics pipeline module .
Among all the selected metrics, containerized-configuration cost metrics (RAM, CPU, disk usage) and input-output cost metrics (emit count, fail count, acknowledgement count) are some of the most widely selected features in state-of-the-art systems. A data-center system such as IBM Cloud Private  reports the performance of worker nodes to the master node in terms of CPU usage, GPU usage, and overall RAM utilization. Moreover, auto-scaling of a running application depends entirely on the consumption of these components. Poggi et al.  also include these system configuration metrics to report per-query resource consumption and give an overall insight into the cluster.
Data streaming performance prediction model
Comparison of regression algorithms with respect to three evaluation metrics (with cross-validation k = 10) performed on (i) Cluster-I, (ii) Cluster-II, and (iii) Cluster-I and II. The algorithms compared are elastic net regression, \(\epsilon\)-SVR with linear kernel, and nu-SVR with linear kernel.
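The comparison above can be reproduced in outline with scikit-learn's k-fold cross-validation. This is a sketch on synthetic data under assumed defaults; the paper's actual metric data and the three evaluation metrics are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR, NuSVR
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the clusters' dynamic metric data.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.2]) + rng.normal(scale=0.2, size=150)

models = {
    "Elastic net regression": ElasticNet(alpha=0.1),
    "eps-SVR linear kernel": SVR(kernel="linear"),
    "nu-SVR linear kernel": NuSVR(kernel="linear"),
}
# 10-fold cross-validation, matching the k = 10 used in the comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```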
To compare the efficiency of the proposed model (labeled DKL in the rest of the paper), we use a well-studied technique for regression problems employed in most performance tuning modules of distributed database management systems. The dimensionality of all the dynamic metrics is reduced using a state-of-the-art dimensionality reduction technique called factor analysis, which transforms the high-dimensional dynamic metric data of the stream processing system into lower-dimensional data. Based on our experiments, we found that only the initial factors are significant for our prediction framework, owing to the distribution of the most influential metrics. To find the highly influential metrics, we use the k-means clustering algorithm to cluster this lower-dimensional data, using each row as its feature vector, and keep a single metric from each cluster (the one nearest to the cluster centroid). Finally, we use Gaussian process regression to recommend the performance of data streams with the help of the top k dynamic metrics of the stream data processing system.
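The three-stage DKL pipeline can be sketched with scikit-learn as follows. This is an illustrative sketch on synthetic data; the dimensions, the choice of k, and the metric matrix are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 12))    # 120 observations of 12 dynamic metrics
y = 2.0 * X[:, 0] + X[:, 5] + rng.normal(scale=0.1, size=120)

# 1. Factor analysis: each metric (column) gets a factor-loading vector.
fa = FactorAnalysis(n_components=3).fit(X)
loadings = fa.components_.T       # shape: (n_metrics, n_factors)

# 2. k-means over the loading vectors; keep the metric nearest each
#    centroid to prune redundant metrics.
k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(loadings)
kept = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
    kept.append(members[np.argmin(dists)])

# 3. Gaussian process regression on the pruned metric set.
gpr = GaussianProcessRegressor().fit(X[:, kept], y)
print(sorted(int(i) for i in kept))
```

The same prune-then-regress structure is used in automated database tuning systems; here it serves only to show the data flow from raw metrics to the GP predictor.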
The data enrichment module merges all tuple records into a new record containing all dynamic metrics, with the help of a unique tuple ID and timestamp. After concatenation, all missing values are replaced with the mean of the entire column, since the dynamic metric records contain sparse data; the distributed data storage is termed the Data Grid.
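The column-mean imputation step can be sketched with pandas (column names here are illustrative, not the framework's actual schema):

```python
import numpy as np
import pandas as pd

# Sparse dynamic-metric records, already joined on tuple ID and timestamp.
df = pd.DataFrame({
    "tuple_id": [1, 2, 3, 4],
    "execute_latency_ns": [120.0, np.nan, 80.0, 100.0],
    "emit_count": [np.nan, 5.0, 7.0, np.nan],
})

# Replace each missing value with the mean of its metric column.
metric_cols = ["execute_latency_ns", "emit_count"]
df[metric_cols] = df[metric_cols].fillna(df[metric_cols].mean())
print(df)
```

Mean imputation keeps the column's average unchanged, which is a reasonable default when metric records are sparse but roughly stationary over the enrichment window.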
In this section, we provide a brief overview of the benchmark suite used in the evaluation of the proposed framework, followed by a glimpse of the experimental setups. The benchmark topologies form a data pipeline using the open-source distributed pub-sub messaging system Apache Pulsar  to consume text streams generated by a parallel synthetic data load generator. The input streams are tuples generated from the Alice's Adventures in Wonderland text file; the spout consumes the data streams and emits them into the topology through a subscription to a Pulsar topic.
We perform the evaluation of the proposed performance modeling framework on the Apache Incubator Heron 0.17.1 release  running on CentOS Linux 7. All methodologies are evaluated based on their performance on two different clusters with heterogeneous architectures. Cluster 1 consists of 19 Intel Xeon E5-2683 v4 nodes running at 2.10 GHz; each compute node has 128 GB RAM and 64 cores (2 sockets \(\times\) 16 cores each, with an SMT value of 2). Cluster 2 consists of 12 many-integrated-core nodes, Intel Knights Landing Xeon Phi 7250 running at 1.4 GHz, where each KNL node has 64 GB RAM plus 16 GB MCDRAM and 72 cores. Each computing node in the cluster is connected to NL-SAS direct-attached storage of 108 TB, along with a 100 Gbps Omni-Path fabric interconnect for data and a 1 Gbps Ethernet network for management, which helps maintain overall cluster stability. Five different domain-representative topologies were implemented and deployed. Latency and throughput are measured as the number of parallel tasks and the system performance change.
Grep count directed acyclic graph (GC-DAG)
GEneral matrix to matrix multiplication directed acyclic graph (GEMM-DAG)
Unique sort directed acyclic graph (US-DAG)
Speed of light compute directed acyclic graph (SOL-DAG)
Speed of light sleep directed acyclic graph (SOLS-DAG)
Results and inferences
The experimental conclusions for the contemporary graph are delineated in Fig. 8. The average prediction accuracy rates of the individual processing logic components are shown in Fig. 8a; for the MSL model they vary from 99.91% (between the source component and the uni-gram component) down to 88.45% (at the keyWord count component). The performance assessment of the contemporary models is especially interesting in the presence of unpredictable workload variation. To represent dynamic behavior, we forcefully ingest skewness into the processing of the user-defined components by restricting the parallelism count to four. Due to dynamic variations in the process unit metrics, the available metrics are far from enough to cover all possible values; this reduces the predictive accuracy of the topology to 90.90%, which is slightly higher than the individual accuracy rates. Surprisingly, for the DKL model the prediction accuracy rates of individual processing logic components vary from 99.65% (between the source component and the uni-gram component) down to 4.585% (at the keyWord count edging component). The presence of sparse attributed metrics reduces the prediction accuracy to 29.89%, which is slightly lower than the prediction accuracy of the individual processing components. Based on the experimental conclusions, we found that this accuracy is much lower than the overall accuracy achieved by the MSL performance model. The average prediction error rates vary from 11.54% (between the source and uni-gram components) to 11.96% (at the keyWord count component) for MSL, and from 0.348% (between the source and uni-gram components) to 95.84% (at the keyWord count edging component) for DKL, as shown in Fig. 8b. Even though the DKL prediction model achieves an average accuracy of only 29.89%, its overall behavior when its estimated latencies are compared with the default dynamic latencies, alongside the latencies estimated with the MSL model, is shown in Fig. 8c, d for a regular time frame of 20 min for the spout and bolt components respectively.
Moreover, the default normalized dynamic execution latencies of the bolts are much lower than the normalized estimated execution latencies of the DKL prediction model, as shown in Fig. 8d. However, as shown in Fig. 8c, there is no significant difference between the default and estimated normalized dynamic execution latencies of the spouts.
In this section, we provide a comprehensive review of the organization and of community-contributed, application-workload-driven benchmarks. It also includes a brief overview of crucial research efforts on several state-of-the-art performance prediction models in the domain of specialized and generalized big data processing systems.
Big data system benchmarking: inferences and metrics
Karimov et al.  measure the performance of three state-of-the-art stream processing frameworks over various events, viz. joins, aggregations, and queries over increasing window sizes, including the presence of skewness in the data, fluctuating workloads, and back-pressure, with processing-time and event-time latency as evaluation metrics. Based on their experimental analysis, the authors suggest favourable conditions for the various frameworks. Quan et al. , based on conclusions drawn from three distinct representative systems, propose that performance response on hardware fluctuates with changes in the application workload. This dynamic variation depends not only on the characteristics of the workload but also on the amount of data the underlying computing node is processing. They also infer that there is a strong performance relationship between the type of workload and the computing node it runs on. Han et al.  infer that efficiently benchmarking huge data processing systems enables accurate measurement of contemporary systems, and that covering the full span of these systems is a prerequisite for sustaining high precision. They also introduce four classifications of workload input data generation: ready-made datasets, synthetic-distribution-based data generators, real-world-data-based data generators, and hybrid generators, as well as two sub-branches of benchmarks labeled micro and end-to-end benchmarks. Similarly, Han et al.  categorize the whole system-level evaluation metric into user-perceivable metrics (how frequently the system can collect streams) and architecture metrics (how frequently it can respond to streams). Similarly, Veiga et al. , using various batch and iterative workloads, evaluate overall performance on the basis of cluster size, block size, data size, interconnect network, node configuration, and execution time on a data-center system. Finally, Jia et al.  suggest that benchmarking with a single application is not enough to cover the various domains of workloads.
Big data system benchmarking: performance prediction model
Gupta et al.  propose a theoretical performance prediction model for big data processing systems based on new active data and historical data. With the help of machine learning algorithms, it generates metadata for the new active data, determines the performance level of the system, and configures the system based on predictions made using that metadata. The major drawback of such models is that they are based on static sampling of correlated data. Baru et al.  highlight the importance of application-level data benchmarks that strive to cover all aspects of an application, from ingestion to analysis. Nikravesh et al.  provide an autonomic performance indicator to support scaling in a cloud environment. The authors periodically sample values from time-series streams to correlate various workload patterns with the accuracy of regression algorithms, viz. the multi-layer perceptron. However, based on their experimental analysis, a performance model for stream processing frameworks exploiting a multi-layer perceptron neural network does not work well. The variant most relevant to our work is proposed by Li et al. . Their methodology depends entirely on reinforcement learning; the complexity of their approach grows linearly with the size of the searchable (action) space, which makes it unfit for actual use. Further discussion on predictable performance is described in .
In our experiments, one of the most important contributions is the characteristic transformation of the entire set of dynamic metrics into five distinct categories, named memory metrics, n-verticals, communication metrics, computation metrics, and scheduler metrics. This feature classification helps capture the precise behavior of the entire model, as shown in the "Results and inferences" section. Moreover, it can be inferred from the experiment described in the "SOL-DAG topology" section that the correct combination of the number of stream managers, the parallelism count, and the number of executors, together with the state-of-the-art resources available on the computing nodes, helps maintain overall topology health while processing large streaming data, and leads to the resource requirement problem for a topology. In this work, we study the problem of predicting the performance of data streams in a distributed stream processing environment. We proposed the design, methodology, and evaluation of a performance prediction framework aimed at an efficient, resource-adaptive, high-performance distributed streaming platform. The framework comprises six functional modules: metrics pipeline, data enrichment, metrics classification, data grid, trigger, and prediction model. The metrics classification module categorizes the dynamic topology metrics into seven predefined classes for better performance behavior analysis, and the data enrichment module provides a solution for missing values if present. The data stream performance prediction module comprises two models: MSL and DKL. The self-driven MSL model fits the classified dynamic metrics using a ridge regularization regression algorithm, and the fully automated DKL model is inherited from the state-of-the-art workload management module of distributed database management systems, adapted for distributed stream processing systems.
We implemented the framework on Apache Heron (version 0.17.1) and evaluated it with the proposed Streaming Benchmark Suite comprising five domain-specific micro-benchmarking topologies. To evaluate the proposed methodologies, we forcefully ingest tuple skewness into the benchmarking topologies in order to set up the ground truth for predictions. From the experiments, we found that the accuracy of predicting the performance of data streams increased to 80.62% from 66.36%, along with a reduction in error from 37.14% to 16.06%. This shows that our MSL model outperforms the state-of-the-art DKL model and can be used for workload optimization, scheduling, and resource management problems in distributed stream processing systems.
BG conducted the experiments, analyzed the results and drafted the document. AB provided valuable suggestions on improving the standards of the manuscript. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Availability of data and materials
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Toshniwal A, Taneja S, Shukla A, Ramasamy K, Patel JM, Kulkarni S, Jackson J, Gade K, Fu M, Donham J, Bhagat N, Mittal S, Ryaboy D (2014) Storm@twitter. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD ’14. pp 147–156Google Scholar
- Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache flink™: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38Google Scholar
- Akidau T, Balikov A, Bekiroğlu K, Chernyak S, Haberman J, Lax R, McVeety S, Mills D, Nordstrom P, Whittle S (2013) Millwheel: fault-tolerant stream processing at internet scale. Proc VLDB Endow 6(11):1033–1044View ArticleGoogle Scholar
- Apache heron git repository. https://github.com/apache/incubator-heron. Accessed 11 Apr 2018
- Chun B-G, Condie T, Chen Y, Cho B, Chung A, Curino C, Douglas C, Interlandi M, Jeon B, Jeong JS, Lee G, Lee Y, Majestro T, Malkhi D, Matusevych S, Myers B, Mykhailova M, Narayanamurthy S, Noor J, Ramakrishnan R, Rao S, Sears R, Sezgin B, Um T, Wang J, Weimer M, Yang Y (2017) Apache reef: retainable evaluator execution framework. ACM Trans Comput Syst. 35(2):5View ArticleGoogle Scholar
- Apache aurora git repository. https://github.com/apache/aurora. Accessed 12 Mar 2018
- Burns B, Grant B, Oppenheimer D, Brewer E, Wilkes J (2016) Borg, omega, and kubernetes. Commun ACM 59(5):50–57View ArticleGoogle Scholar
- Van Aken D, Pavlo A, Gordon GJ, Zhang B (2017) Automatic database management system tuning through large-scale machine learning. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD '17. pp 1009–1024Google Scholar
- Aboulnaga A, Babu S (2013) Workload management for big data analytics. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD ’13. pp 929–932Google Scholar
- Curino C, Difallah D E, Douglas C, Krishnan S, Ramakrishnan R, Rao S (2014) Reservation-based scheduling: If you’re late don’t blame us!. In: Proceedings of the ACM symposium on cloud computing, SOCC ’14. pp 1–14Google Scholar
- Apache pulsar git repository. https://github.com/apache/pulsar. Accessed 11 Apr 2018
- Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel J M, Ramasamy K, Taneja S (2015) Twitter heron: stream processing at scale. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD ’15. pp 239–250Google Scholar
- Arasu A, Babcock B, Babu S, Cieslewicz J, Datar M, Ito K, Motwani R, Srivastava U, Widom J (2016) STREAM: the stanford data stream management system. Springer. pp 317–336. https://doi.org/10.1007/978-3-540-28608-0_16
- Baru C, Rabl T (2016) Application-level benchmarking of big data systems. Springer, New Delhi. pp 189–199. https://doi.org/10.1007/978-81-322-3628-3_10
- Sahin S, Cao W, Zhang Q, Liu L (2016) Jvm configuration management and its performance impact for big data applications. In: IEEE international congress on big data (BigData Congress) 2016. pp 410–417. https://doi.org/10.1109/BigDataCongress.2016.64
- Java garbage collection, oracle. https://docs.oracle.com/cd/E17802_01/j2se/j2se/1.5.0/jcp/beta1/apidiffs/java/lang/management/GarbageCollectorMBean.html. Accessed 12 Mar 2018
- Destounis A, Paschos G S, Koutsopoulos I (2016) Streaming big data meets backpressure in distributed network computation. In: IEEE INFOCOM 2016—The 35th annual IEEE international conference on computer communications. pp 1–9. https://doi.org/10.1109/INFOCOM.2016.7524388
- Ibm cloud private. https://www.ibm.com/blogs/cloud-computing/2017/10/what-is-ibm-cloud-private. Accessed 12 Mar 2018
- Poggi N, Montero A, Carrera D (2018) Characterizing bigbench queries, hive, and spark in multi-cloud environments. In: Nambiar R, Poess M (eds) Performance evaluation and benchmarking for the analytics era. Springer, Cham, pp 55–74View ArticleGoogle Scholar
- Jia Y (2014) Learning semantic image representations at a large scale, Ph.D. thesis, EECS Department, University of California, Berkeley (May)Google Scholar
- Hadjis S, Abuzaid F, Zhang C, Ré C (2015) Caffe con troll: shallow ideas to speed up deep learning. In: Proceedings of the fourth workshop on data analytics in the cloud, DanaC’15. pp 1–4Google Scholar
- Deepbench, baidu research. https://svail.github.io/DeepBench. Accessed 12 Mar 2018
- Karimov J, Rabl T, Katsifodimos A, Samarev R, Heiskanen H, Markl V (2018) Benchmarking distributed stream processing engines. CoRR abs/1802.08496.Google Scholar
- Quan J, Shi Y, Zhao M, Yang W (2013) The implications from benchmarking three big data systems. In: Proceedings—2013 IEEE international conference on big data, big data , 2013. pp 31–38. https://doi.org/10.1109/BigData.2013.6691706
- Han R, John LK, Zhan J (2018) Benchmarking big data systems: a review. IEEE Trans Serv Comp 11(3):580–597. https://doi.org/10.1109/TSC.2017.2730882 View ArticleGoogle Scholar
- Han R, Jia Z, Gao W, Tian X, Wang L (2015) Benchmarking big data systems: state-of-the-art and future directions, CoRR abs/1506.01494. arXiv:1506.01494
- Veiga J, Expósito RR, Pardo XC, Taboada GL, Tourifio J (2016) Performance evaluation of big data frameworks for large-scale data analytics. In: IEEE international conference on big data (Big Data) 2016. pp 424–431. https://doi.org/10.1109/BigData.2016.7840633
- Jia Z, Wang L, Zhan J, Zhang L, Luo C (2013) Characterizing data analysis workloads in data centers. In: IEEE international symposium on workload characterization (IISWC) 2013. pp 66–76. https://doi.org/10.1109/IISWC.2013.6704671
- Gupta S, Dominiak J, Marimadaiah S (2017) Using machine learning to predict big data environment performance, U.S Patent 2017-0140278 A1, 18 MayGoogle Scholar
- Nikravesh AY, Ajila SA, Lung C-H (2017) An autonomic prediction suite for cloud resource provisioning. J Cloud Comput 6(1):3. https://doi.org/10.1186/s13677-017-0073-4 View ArticleGoogle Scholar
- Li T, Xu Z, Tang J, Wang Y (2018) Model-free control for distributed stream data processing using deep reinforcement learning. Proc VLDB Endow. 11(6):705–718Google Scholar
- de Assuncao MD, da Silva Veith A, Buyya R (2018) Distributed data stream processing and edge computing: a survey on resource elasticity and future directions. J Netw Comput Appl 103:1–17. https://doi.org/10.1016/j.jnca.2017.12.001 View ArticleGoogle Scholar