tailieunhanh - Biology report: "Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data"

A collection of biology research reports published in the medical journal Molecular Biology, offering readers background knowledge in the field of biology. Topic: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.

Nuel et al. Algorithms for Molecular Biology 2010, 5:15

RESEARCH, Open Access

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

Gregory Nuel, Leslie Regad, Juliette Martin, Anne-Claude Camproux

Abstract

Background: In bioinformatics, it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations, both for low- and high-complexity patterns, in the framework of homogeneous or heterogeneous Markov models.

Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently.
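To make the counting problem above concrete, here is a minimal Python sketch of the baseline it is contrasted with: the exact count distribution in each sequence is obtained by embedding a word-matching deterministic finite automaton in the Markov chain, and the per-sequence distributions are then combined by convolution. This is not the paper's Algorithm 1, 2, or 3, whose details are not given in this excerpt. The function names (`build_dfa`, `count_distribution`, `set_distribution`) are hypothetical, and the sketch assumes a single-word pattern, a homogeneous first-order Markov model, and overlapping occurrences counted separately.

```python
def build_dfa(word, alphabet):
    """KMP-style automaton: state q is the length of the longest prefix of
    `word` that is a suffix of the text read so far (state len(word) = match)."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            s = word[:q] + a
            k = min(m, len(s))
            # Shrink k until the last k letters of s match a prefix of word.
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[q][a] = k
    return delta


def count_distribution(word, n, alphabet, init, trans):
    """Exact distribution of the number of occurrences of `word` in one
    random sequence of length n >= 1 (first-order Markov chain).
    init[a] is P(X1 = a); trans[a][b] is P(X_{i+1} = b | X_i = a)."""
    m = len(word)
    delta = build_dfa(word, alphabet)
    # Embedded chain state: (automaton state, last letter, count so far).
    cur = {}
    for a in alphabet:
        q = delta[0][a]
        key = (q, a, int(q == m))
        cur[key] = cur.get(key, 0.0) + init[a]
    for _ in range(n - 1):
        nxt = {}
        for (q, a, c), p in cur.items():
            for b in alphabet:
                pb = p * trans[a][b]
                if pb == 0.0:
                    continue
                q2 = delta[q][b]
                key = (q2, b, c + int(q2 == m))
                nxt[key] = nxt.get(key, 0.0) + pb
        cur = nxt
    # Marginalise over automaton state and last letter.
    dist = [0.0] * (max(n - m + 1, 0) + 1)
    for (_, _, c), p in cur.items():
        dist[c] += p
    return dist


def convolve(p, q):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def set_distribution(word, lengths, alphabet, init, trans):
    """Count distribution over a set of independent sequences, obtained by
    convolving the per-sequence distributions (the baseline approach)."""
    total = [1.0]
    for n in lengths:
        total = convolve(total, count_distribution(word, n, alphabet, init, trans))
    return total


if __name__ == "__main__":
    alphabet = "AC"  # toy two-letter alphabet, chosen only for illustration
    init = {"A": 0.5, "C": 0.5}
    trans = {"A": {"A": 0.7, "C": 0.3}, "C": {"A": 0.4, "C": 0.6}}
    print(set_distribution("AA", [5, 8, 3], alphabet, init, trans))
```

The cost of this baseline grows with both the number of sequences and the maximum count, which is the overhead the abstract says its algorithms are designed to reduce.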
In the particular case of low- or moderate-complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms, and their relative interest in comparison with existing ones, were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial ...
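The "power computation and binary decomposition" mentioned above can be illustrated with the generic technique of exponentiation by squaring. The Python sketch below is only an assumption about the general idea, not a reconstruction of Algorithm 3: it raises a transition matrix to the n-th power using a number of matrix products proportional to log n rather than n. The helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(a, b):
    """Plain product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


def mat_pow(p, n):
    """Compute p**n for a square matrix p and an integer n >= 0 by reading
    the binary digits of n: about 2 * log2(n) products instead of n - 1."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in p]
    while n > 0:
        if n & 1:  # current binary digit of n is 1
            result = mat_mul(result, base)
        base = mat_mul(base, base)  # square for the next binary digit
        n >>= 1
    return result


if __name__ == "__main__":
    # Toy two-state transition matrix, chosen only for illustration.
    p = [[0.9, 0.1], [0.2, 0.8]]
    print(mat_pow(p, 1000))
```

For a homogeneous model the same transition matrix applies at every position, which is what makes a single matrix power meaningful; with a heterogeneous model the matrices differ by position and this shortcut does not apply directly.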