Poisson approximation for search of rare words in DNA sequences

Abstract: Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence and its Poisson approximation. A global bound is already given by the Chen-Stein method. Our approach, the ψ-mixing method, gives local bounds. Since we only need the error in the tails of the distribution, the global uniform bound of Chen-Stein is too large, and local bounds are better suited. This is the first time that local bounds have been devised for Poisson approximation. We search for two thresholds on the number of occurrences beyond which a studied word can be regarded as over-represented or under-represented. A biological role is suggested for these over- or under-represented words. Our method gives such thresholds for a much broader panel of words than the Chen-Stein method, which cannot give any result in a great number of cases where our method works. Comparing the methods, we observe better accuracy for the ψ-mixing method in bounding the tails of the distribution. Our method can also be used in domains other than biology. We also present the software PANOW (available at http://stat.genopole.cnrs.fr/sg/software/panow/), dedicated to the computation of the error term and the thresholds for a studied word.
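To make the idea of the two thresholds concrete, here is a minimal Python sketch. It is not PANOW and does not implement the ψ-mixing or Chen-Stein error bounds, which are not given in this record. It assumes a first-order Markov chain fitted to the sequence itself, and it treats the approximation error as a user-supplied number (the hypothetical parameter `error`) that is simply added to each Poisson tail probability. All function names are illustrative.

```python
import math
from collections import Counter


def expected_count(word, seq):
    """Plug-in estimate of the expected number of occurrences of `word`
    in a sequence of length len(seq), under a first-order Markov chain
    whose parameters are estimated from `seq` (an assumption of this sketch)."""
    n = len(seq)
    singles = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    # Empirical frequency of the first letter of the word.
    prob = singles[word[0]] / n
    # Multiply by the estimated transition probabilities along the word.
    for a, b in zip(word, word[1:]):
        out_of_a = sum(c for (x, _), c in pairs.items() if x == a)
        if out_of_a == 0:
            return 0.0
        prob *= pairs[(a, b)] / out_of_a
    return (n - len(word) + 1) * prob


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam), computed in log space."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_cdf(k, lam):
    """P(N <= k) for N ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    return min(1.0, sum(poisson_pmf(i, lam) for i in range(k + 1)))


def thresholds(lam, alpha=0.01, error=0.0):
    """Return (low, high) for a Poisson(lam) approximation.

    A count <= low is flagged as under-represented and a count >= high as
    over-represented at level alpha. `error` is a hypothetical bound on the
    approximation error in the tails, added to each tail probability.
    Either value is None when no threshold exists under these assumptions.
    """
    if error >= alpha:
        # The error alone exceeds the level: no threshold can be certified.
        return None, None
    # Largest low with P(N <= low) + error <= alpha.
    low = None
    k = 0
    while poisson_cdf(k, lam) + error <= alpha:
        low = k
        k += 1
    # Smallest high with P(N >= high) + error <= alpha (search is capped).
    high = None
    cap = int(lam + 50 * math.sqrt(lam) + 50)
    for k in range(cap + 1):
        if (1.0 - poisson_cdf(k - 1, lam)) + error <= alpha:
            high = k
            break
    return low, high
```

A typical use would be `lam = expected_count("ACG", seq)` followed by `thresholds(lam, alpha=0.01, error=e)`, where `e` would come from an error bound such as the ones discussed in the abstract. The sketch also shows why the size of the error bound matters: when `error` is not smaller than `alpha`, no threshold can be returned at all.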

https://hal-normandie-univ.archives-ouvertes.fr/hal-02337193
Contributor: Nicolas Vergne
Submitted on: Tuesday, 29 October 2019 - 12:34:57
Last modified on: Thursday, 31 October 2019 - 01:26:53

File

biological.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-02337193, version 1

Citation

Nicolas Vergne, Miguel Abadi. Poisson approximation for search of rare words in DNA sequences. Alea: Estudos Neolatinos, Universidade Federal do Rio de Janeiro, 2008, 4, pp.223 - 244. ⟨hal-02337193⟩
