Article ID: 11002421
Journal: Future Generation Computer Systems
Published Year: 2018
Pages: 12 Pages
File Type: PDF
Abstract
Technological advances and the emergence of the Internet of Things have led to the collection of vast amounts of scientific data from increasingly powerful scientific instruments and a growing number of distributed sensors. This has not only heightened the significance of the analyses performed by scientific applications but has also increased their complexity and scale. Hence, emerging extreme-scale scientific workflows are becoming widespread, and so is the need to efficiently automate their deployment on a variety of platforms such as high-performance computers, dedicated clusters, and cloud environments. Performance anomalies can considerably affect the execution of these applications. They may be caused by different factors, including failures and resource contention, and they may lead to undesired outcomes such as lengthy delays in the workflow runtime or unnecessary costs in cloud environments. As a result, it is essential for modern workflow management systems to enable the early detection of these anomalies, to identify their cause, and to formulate and execute actions to mitigate their effects. In this work, we propose the use of Hierarchical Temporal Memory (HTM) to detect performance anomalies in real-time infrastructure metrics collected by continuously monitoring the resource consumption of executing workflow tasks. The framework processes a stream of measurements in an online and unsupervised manner and adapts to changes in the underlying statistics of the data. This allows it to be easily deployed on a variety of infrastructure platforms without first collecting data and training a model. We evaluate our approach using two real scientific workflows deployed in Microsoft Azure's cloud infrastructure.
Our experimental results demonstrate the ability of our model to accurately capture performance anomalies in different resource consumption metrics caused by a variety of competing workloads introduced into the system. We also compare the performance of HTM with that of other online anomaly detection algorithms, demonstrating the suitability of the chosen algorithm for the problem addressed in this work.
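To make the idea of online, unsupervised detection on a metric stream more concrete, the following is a minimal Python sketch. It is not HTM and not the authors' framework: it substitutes a much simpler technique, an exponentially weighted z-score, purely to illustrate scoring each measurement as it arrives and then adapting the running statistics without a separate training phase. The class name, the function name, and the parameter values (`alpha`, `threshold`, `warmup`) are all hypothetical choices made for this illustration.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class OnlineZScoreDetector:
    """Streaming detector using exponentially weighted mean and variance.

    Illustrative stand-in only; this is NOT Hierarchical Temporal Memory.
    All default values below are assumptions chosen for the example.
    """
    alpha: float = 0.05      # smoothing factor: higher adapts faster
    threshold: float = 4.0   # z-score above which a sample is flagged
    warmup: int = 20         # samples observed before anything is flagged
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def update(self, x: float) -> bool:
        """Score x against the current statistics, then learn from it."""
        self.count += 1
        if self.count == 1:
            # First sample only initialises the running mean.
            self.mean = x
            return False

        # Score before updating, so the sample cannot mask itself.
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0.0
            and abs(x - self.mean) / std > self.threshold
        )

        # Exponentially weighted update of mean and variance.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        return is_anomaly


def flag_anomalies(stream: Iterable[float]) -> List[int]:
    """Return the indices of samples flagged as anomalous."""
    detector = OnlineZScoreDetector()
    return [i for i, x in enumerate(stream) if detector.update(x)]


if __name__ == "__main__":
    # Hypothetical CPU utilisation (%) with one spike from a competing workload.
    cpu = [40.0 if i % 2 == 0 else 42.0 for i in range(50)] + [95.0] + [40.0, 42.0] * 5
    print(flag_anomalies(cpu))
```

Because the statistics are updated after every sample, a sustained shift in the metric is eventually absorbed as the new normal, which is the adaptive behaviour the abstract describes in general terms. A sequence-learning model such as HTM additionally captures temporal patterns that this per-sample statistic ignores.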
Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics