Scientific report: "A Stochastic Finite-State Morphological Parser for Turkish"

Haşim Sak, Tunga Güngör
Dept. of Computer Engineering
Boğaziçi University
TR-34342, Bebek, Istanbul, Turkey
gungort@

Murat Saraçlar
Dept. of Electrical & Electronics Engineering
Boğaziçi University
TR-34342, Bebek, Istanbul, Turkey

Abstract

This paper presents the first stochastic finite-state morphological parser for Turkish. The non-probabilistic parser is a standard finite-state transducer implementation of the two-level morphology formalism. A disambiguated text corpus of 200 million words is used to stochastize the morphotactics transducer, which is then composed with the morphophonemics transducer to obtain a stochastic morphological parser. We present two applications to evaluate the effectiveness of the stochastic parser: spelling correction and morphology-based language modeling for speech recognition.

1 Introduction

Turkish is an agglutinative language with a highly productive inflectional and derivational morphology. The computational aspects of Turkish morphology have been well studied, and several morphological parsers have been built (Oflazer, 1994; Güngör, 1995). In language processing applications, we may need to estimate a probability distribution over all word forms. For example, we need probability estimates for unigrams to rank misspelling suggestions for spelling correction. None of the previous studies for Turkish has addressed this problem.
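To make the unigram-ranking use case concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `rank_suggestions`, the add-alpha smoothing, and the toy counts are all assumptions made for illustration. It only shows how candidate corrections could be ordered once a unigram probability estimate is available.

```python
def rank_suggestions(candidates, unigram_counts, alpha=1.0):
    """Order candidate corrections by add-alpha smoothed unigram probability.

    `unigram_counts` maps word forms to corpus counts (hypothetical data here).
    """
    total = sum(unigram_counts.values())
    # One extra vocabulary slot reserves probability mass for unseen words.
    vocab_size = len(unigram_counts) + 1

    def prob(word):
        return (unigram_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

    # sorted() is stable, so ties keep their original candidate order.
    return sorted(candidates, key=prob, reverse=True)


# Toy counts, invented for illustration only.
counts = {"evet": 300, "evde": 120, "evler": 50}
ranked = rank_suggestions(["evler", "evet", "evre"], counts)
print(ranked)
```

With a static count table like this one, any word form missing from the table (here `evre`) gets only the smoothing mass, regardless of how plausible its morphology is.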
For morphologically complex languages, estimating a probability distribution over a static vocabulary is undesirable because of high out-of-vocabulary rates. It would be very convenient for a morphological parser, as a word generator/analyzer, to also output a probability estimate for each word it generates or analyzes. In this work, we build such a stochastic morphological parser for Turkish¹ and give two example applications for evaluation.

¹ The stochastic morphological parser is available for research purposes at http .
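The following Python sketch illustrates why a parser that assigns probabilities to morpheme sequences avoids the static-vocabulary problem. It is a deliberate simplification and not the paper's weighted transducer: it assumes a bigram model over morphological tags, ignores the root itself, and uses hypothetical tag names and a four-analysis toy corpus. Transition probabilities are estimated by relative frequency from the disambiguated analyses, and an analysis is scored as the product of its transitions.

```python
from collections import defaultdict


def estimate_transitions(analyses):
    """Relative-frequency estimates of P(next_tag | tag) from disambiguated analyses."""
    counts = defaultdict(lambda: defaultdict(int))
    for tags in analyses:
        path = ["<s>"] + list(tags) + ["</s>"]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def analysis_probability(tags, transitions):
    """Probability of one analysis: the product of its tag-to-tag transitions."""
    path = ["<s>"] + list(tags) + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(path, path[1:]):
        # A transition never seen in training contributes probability 0.
        prob *= transitions.get(prev, {}).get(nxt, 0.0)
    return prob


# Toy disambiguated corpus (hypothetical tag sequences, one per word token).
corpus = [
    ("Noun", "A3sg", "Pnon", "Nom"),
    ("Noun", "A3pl", "Pnon", "Nom"),
    ("Noun", "A3pl", "Pnon", "Loc"),
    ("Noun", "A3sg", "P1sg", "Loc"),
]
transitions = estimate_transitions(corpus)

# This exact tag sequence never occurs in the corpus, yet it gets nonzero mass.
unseen = ("Noun", "A3sg", "Pnon", "Loc")
p_unseen = analysis_probability(unseen, transitions)
print(p_unseen)
```

Because the score is built from transitions rather than whole word forms, a sequence absent from the training data can still receive a nonzero probability as long as each of its transitions was observed.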