Highly reliable architecture using the 80/20 rule in cloud computing datacenters

Article ID	Journal	Published Year	Pages	File Type
4950176	Future Generation Computer Systems	2017	29 Pages	PDF

Abstract

Cloud datacenters host hundreds of thousands of physical servers that offer computing resources for executing customer jobs. While failures of these physical machines are considered normal rather than exceptional in large-scale distributed systems and cloud datacenters, evaluating the availability of a datacenter is essential for both cloud providers and customers. Although providing a highly available and reliable computing infrastructure is essential to maintaining customer confidence, cloud providers also want highly utilized datacenters in order to increase the profit from delivered services. Cloud computing architectural solutions should therefore consider both high availability for customers and high resource utilization, so that delivering services is more profitable for cloud providers. This paper presents a highly reliable cloud architecture that leverages the 80/20 rule (80% of cluster failures come from 20% of physical machines) to identify failure-prone physical machines and divide each cluster into reliable and risky sub-clusters. Furthermore, customer jobs are divided into latency-sensitive and latency-insensitive types. The results showed that only about 1% of all requested jobs are extremely latency-sensitive and require 99.999% availability. By serving revenue-generating jobs, which account for less than 50% of all requested jobs, within the reliable sub-cluster of physical machines, cloud providers can make their businesses more profitable by avoiding service level agreement violation penalties and improving their reputations.
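
The sub-cluster split described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not say how failure-prone machines are ranked, so the sketch assumes machines are ordered by historical failure count and the top 20% are labeled risky. The function names, machine names, and failure counts are all hypothetical. The last helper converts an availability target into allowed downtime per year, which for 99.999% is roughly 5.26 minutes.

```python
from typing import Dict, List, Tuple


def split_cluster(failures: Dict[str, int],
                  risky_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Return (risky, reliable) machine lists.

    Assumption: machines are ranked by past failure count, and the top
    `risky_fraction` of them form the risky sub-cluster.
    """
    ranked = sorted(failures, key=lambda m: failures[m], reverse=True)
    n_risky = max(1, round(len(ranked) * risky_fraction))
    return ranked[:n_risky], ranked[n_risky:]


def failure_share(failures: Dict[str, int], machines: List[str]) -> float:
    """Fraction of all recorded failures caused by the given machines."""
    total = sum(failures.values())
    return sum(failures[m] for m in machines) / total if total else 0.0


def yearly_downtime_minutes(availability: float) -> float:
    """Allowed downtime per 365-day year for an availability target."""
    return (1.0 - availability) * 365 * 24 * 60


# Hypothetical failure log for a 10-machine cluster (100 failures in total).
failures = {
    "pm0": 45, "pm1": 35, "pm2": 3, "pm3": 3, "pm4": 3,
    "pm5": 3, "pm6": 2, "pm7": 2, "pm8": 2, "pm9": 2,
}

risky, reliable = split_cluster(failures)
print("risky sub-cluster:", risky)
print("reliable sub-cluster:", reliable)
print("share of failures in risky sub-cluster:", failure_share(failures, risky))
print("downtime budget at 99.999%:", yearly_downtime_minutes(0.99999), "min/year")
```

In this toy data, two of the ten machines account for 80 of the 100 failures, so scheduling latency-sensitive jobs on the remaining eight keeps them away from most observed failures. A real deployment would need a failure-prediction criterion and a job classifier, neither of which the abstract specifies.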

Keywords

Evaluation; Availability; Cloud computing; Machine failure; Reliability