Why do Internet services fail, and what can be done about it?
The “business logic” of traditional three-tier system terminology is part of our definition of front-end, because these services integrate their service logic with the code that receives and replies to client requests. In Content and ReadMostly, the front-end tier is responsible primarily for locating data on back-end machine(s) and routing it to and from clients; in Online, it provides online services such as email, newsgroups, and a web proxy. In Content, the “front-end” includes not only software running at the colocation sites, but also client proxy software running on hardware provided and operated by Content that is physically located at customer sites. Thus Content is geographically distributed not only among the four colocation centers, ...

Appears in 4th Usenix Symposium on Internet Technologies and Systems (USITS '03), 2003.

Why do Internet services fail, and what can be done about it?

David Oppenheimer, Archana Ganapathi, and David A. Patterson
University of California at Berkeley, EECS Computer Science Division
387 Soda Hall #1776, Berkeley, CA 94720-1776, USA
{davidopp,archanag,patterson}@

Abstract

In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24x7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the potential effectiveness of various techniques for preventing and mitigating service failure.
We find that (1) operator error is the largest cause of failures in two of the three services, (2) operator error is the largest contributor to time to repair in two of the three services, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service. Qualitatively, we find that improvement in the maintenance tools and systems used by service operations staff would decrease time to diagnose and repair problems.

1. Introduction

The number and popularity of large-scale Internet services such as Google, MSN, and Yahoo! have grown significantly in recent years. Such services are poised to increase further in importance as they become the repository for data in ubiquitous computing systems and the platform upon which new global-scale services and applications are built. These services' large scale and need for 24x7 operation have led their designers to incorporate a number of techniques for achieving high availability. Nonetheless, failures ...