A survey on reliability in distributed systems

Article ID	Journal	Published Year	Pages	File Type
430051	Journal of Computer and System Sciences	2013	13 Pages	PDF

Abstract

• Analyzed and highlighted the significance of the underlying factors that are important for the reliability of distributed computing systems.
• Thoroughly discussed the shortcomings of existing models with respect to current challenges.
• Identified factors significant for ensuring reliability that have either not been evaluated or have been treated as constant.
• Discussed numerous strategies that deal with reliability prediction models in different phases of the SDLC.

Software reliability in distributed systems has always been a major concern for all stakeholders, especially application vendors and their users. Various models have been produced to assess or predict the reliability of large-scale distributed applications, including e-government, e-commerce, multimedia services, and end-to-end automotive solutions, but reliability issues with these systems still exist. Ensuring a distributed system's reliability in turn requires examining the reliability of each individual component or factor involved in enterprise distributed applications before predicting or assessing the reliability of the whole system, and implementing transparent fault detection and fault recovery schemes to provide seamless interaction to end users. For this reason, we analyze existing reliability methodologies in detail from the viewpoint of examining the reliability of individual components, and explain why a comprehensive reliability model is still needed for applications running in distributed systems. In this paper we present a detailed technical overview, in four parts, of recent research on analyzing and predicting the reliability of large-scale distributed applications. We first describe some pragmatic requirements for highly reliable systems and highlight the significance and various issues of reliability in different computing environments such as Cloud Computing, Grid Computing, and Service Oriented Architecture. We then elucidate the factors and challenges that are nontrivial for highly reliable distributed systems, including fault detection, recovery, and removal through testing or various replication techniques. Next, we scrutinize research models that synthesize significant solutions to tackle these factors and challenges in predicting as well as measuring the reliability of software applications in distributed systems.
Finally, we discuss the limitations of existing models and, in light of our analysis, propose future work on predicting and analyzing the reliability of distributed applications in real environments.