Scientific report: "Discovering Sociolinguistic Associations with Structured Sparsity"

Jacob Eisenstein, Noah A. Smith, Eric P. Xing
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{jacobeis,nasmith,epxing}@cs.cmu.edu

Abstract

We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.

1 Introduction

How is language influenced by the speaker's sociocultural identity? Quantitative sociolinguistics usually addresses this question through carefully crafted studies that correlate individual demographic attributes and linguistic variables, for example, the interaction between income and the "dropped r" feature of the New York accent (Labov, 1966). But such studies require the knowledge to select the "dropped r" and the speaker's income from thousands of other possibilities. In this paper, we present a method to acquire such patterns from raw data.
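To make the row-sparsity claim in the abstract concrete, here is a minimal sketch, assuming the composite regularizer is the ℓ1,∞ norm of the coefficient matrix (the sum over rows of each row's largest absolute coefficient). It applies the proximal operator of that norm row by row, which is one standard way such a penalty is handled and not necessarily the solver used in the paper. The function names, the toy matrix `B`, and the penalty weight `lam` are all hypothetical.

```python
import math

def l1_inf_norm(B):
    """Sum over rows of the largest absolute coefficient in each row."""
    return sum(max(abs(x) for x in row) for row in B)

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius
    (sort-based method; assumes radius > 0)."""
    mags = [abs(x) for x in v]
    if sum(mags) <= radius:
        return list(v)  # already inside the ball
    theta, running = 0.0, 0.0
    for k, s_k in enumerate(sorted(mags, reverse=True), start=1):
        running += s_k
        candidate = (running - radius) / k
        if s_k > candidate:
            theta = candidate  # keep the threshold from the last valid k
    return [math.copysign(max(m - theta, 0.0), x) for m, x in zip(mags, v)]

def prox_linf(v, lam):
    """Proximal operator of lam * ||v||_inf via Moreau decomposition:
    v minus its projection onto the l1 ball of radius lam."""
    p = project_l1_ball(v, lam)
    return [x - y for x, y in zip(v, p)]

def prox_l1_inf(B, lam):
    """Row-wise proximal step for lam * (l1,inf norm of B)."""
    return [prox_linf(row, lam) for row in B]

# Hypothetical coefficients: one row per word, one column per demographic.
B = [[0.1, -0.2, 0.05],   # weak associations everywhere
     [3.0, -1.0, 0.5]]    # one strong association
lam = 0.5
shrunk = prox_l1_inf(B, lam)
# The weak row becomes all zeros; the strong row only has its peak clipped.
```

A row whose absolute values sum to at most `lam` is zeroed in full, which is how a penalty of this kind can discard a word (or feature) across all outputs at once rather than one coefficient at a time.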
Using multi-output regression with structured sparsity, our method identifies a small subset of lexical items that are most influenced by demographics, and discovers conjunctions of demographic attributes that are especially salient for lexical variation. Sociolinguistic associations are difficult to model because the space of potentially relevant