Scientific report: "Towards the Orwellian Nightmare"
Towards the Orwellian Nightmare: Separation of Business and Personal Emails

Sanaz Jabbari, Ben Allison, David Guthrie, Louise Guthrie
Department of Computer Science, University of Sheffield
211 Portobello St., Sheffield S1 4DP

Abstract

This paper describes the largest-scale annotation project involving the Enron email corpus to date. Over 12,500 emails were classified by humans into the categories “Business” and “Personal”, and then subcategorised by type within these categories. The paper quantifies how well humans perform on this task, as measured by inter-annotator agreement. It presents the problems encountered in separating these language types. In its final section, the paper presents preliminary results on using a machine to perform this classification task.

1 Introduction

Almost since email became a global phenomenon, computers have been examining and reasoning about it. For the most part, this intervention has been good-natured and helpful: computers have been trying to protect us from attacks of unscrupulous blanket advertising mail shots. However, the use of computers for more nefarious surveillance of email has so far been limited. The sheer volume of email sent means that even government agencies who can legally intercept all mail must either filter email by some preconceived notion of what is interesting, or employ teams of people to manually sift through the volumes of data.
For example, the NSA has had massive parallel machines filtering email traffic for at least ten years. The task of developing such automatic filters at research institutions has been almost impossible, but for the opposite reason: there is no shortage of willing researchers, but progress has been hampered by the lack of any data. One's email is often hugely private, and the prospect of surrendering it in its entirety for research purposes is somewhat unsavoury. Recently, a data resource has become available where exactly this ...