Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. As the scale and complexity of these systems increase dramatically, various hardware and software failures will inevitably occur and may not be detected and repaired in a timely manner. In addition, sophisticated architectural features of cloud computing may have an adverse impact on system reliability. In response to these challenges, this paper proposes a simulation-driven framework, based on operation logs from real cloud computing systems, for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operational characteristics. A Markov-based model is then used to examine the system’s potential failures, assess their severity, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resources so that the system’s reliability can be improved cost-effectively. We also report a case study that demonstrates the application of our algorithm to improving the failure tolerance of a large-scale cloud computing system.
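
To make the idea of a Markov-based failure model concrete, the following is a minimal sketch and not the model used in the paper. It assumes a hypothetical three-state node (healthy, degraded, failed) with made-up transition probabilities; in practice such values would be estimated from operation logs. The names `STATES`, `P`, `step`, and `steady_state` are illustrative only.

```python
# Hypothetical three-state node model. The states and probabilities below are
# illustrative assumptions, not values taken from the paper or from real logs.
STATES = ["healthy", "degraded", "failed"]

# P[i][j] = probability of moving from state i to state j in one time step.
P = [
    [0.98, 0.015, 0.005],  # healthy
    [0.30, 0.60, 0.10],    # degraded
    [0.50, 0.00, 0.50],    # failed: repaired with probability 0.5 per step
]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def steady_state(matrix, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution by repeated stepping."""
    dist = [1.0] + [0.0] * (len(matrix) - 1)  # start fully healthy
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


pi = steady_state(P)
# Assumption: a degraded node still serves requests, so it counts as available.
availability = pi[0] + pi[1]

for name, p in zip(STATES, pi):
    print(f"{name}: {p:.4f}")
print(f"availability: {availability:.4f}")
```

Under these assumptions, the long-run share of time spent in the failed state gives a simple severity measure, and raising the repair probability in the last row of `P` shows how faster recovery would improve availability.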