tailieunhanh - Biology report: "WildSpan: mining structured motifs from protein sequences"

A collection of biology research reports published in the medical journal Molecular Biology, providing readers with knowledge of the field of biology. Topic: WildSpan: mining structured motifs from protein sequences.

Hsu et al. Algorithms for Molecular Biology 2011, 6:6 http content 6 1 6

RESEARCH Open Access

WildSpan: mining structured motifs from protein sequences

Chen-Ming Hsu¹, Chien-Yu Chen² and Baw-Jhiune Liu³

Abstract

Background: Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families.
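The block-and-gap structure of a W-pattern described above can be illustrated with a minimal Python sketch. The representation used here (a list of blocks in which `x` stands for a single wildcard, plus one `(min, max)` gap range between adjacent blocks), the function names `compile_w_pattern` and `support`, and the toy sequences are all assumptions made for illustration. This is not the WildSpan algorithm and not necessarily the paper's notation; it only checks whether a given pattern of this shape occurs in a set of sequences.

```python
import re


def compile_w_pattern(blocks, gaps):
    """Build a regex from pattern blocks and the gap ranges between them.

    blocks: list of strings; 'x' marks one wildcard position inside a block
            (hypothetical convention chosen for this sketch).
    gaps:   list of (min_len, max_len) tuples, one per pair of adjacent blocks.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between adjacent blocks")
    parts = []
    for i, block in enumerate(blocks):
        # Inside a block, 'x' matches any one residue; other letters match literally.
        parts.append("".join("." if ch == "x" else re.escape(ch) for ch in block))
        if i < len(gaps):
            lo, hi = gaps[i]
            # Between blocks, allow a gap of lo..hi arbitrary residues.
            parts.append(".{%d,%d}" % (lo, hi))
    return re.compile("".join(parts))


def support(pattern, sequences):
    """Count the sequences that contain at least one occurrence of the pattern."""
    return sum(1 for seq in sequences if pattern.search(seq))


# Toy sequences invented for illustration; they are not real proteins.
toy_sequences = [
    "MKCAAC" + "G" * 10 + "HWWHLLL",  # both blocks present, gap of 10
    "CTTC" + "A" * 15 + "HKKHT",      # both blocks present, gap of 15
    "MMCAAC" + "GG" + "HAAH",         # both blocks present, gap too short
    "G" * 12,                         # neither block present
]

w_pattern = compile_w_pattern(["CxxC", "HxxH"], [(5, 20)])
print(w_pattern.pattern)
print(support(w_pattern, toy_sequences))
```

In this sketch, the two short blocks play the role of clustered motif symbols, and the `(5, 20)` range plays the role of a large irregular gap between them. A sequence is counted only when both blocks occur in order and the distance between them falls inside the allowed range.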
To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to largely reduce the mining cost.

Results: WildSpan is shown to efficiently find W-patterns containing conserved residues that .