Handwriting Recognition with Multigrams
Abstract
We introduce a novel handwriting recognition approach based on sub-lexical units known as multigrams of characters, which are variable-length character sequences. A hidden semi-Markov model is used to model multigram occurrences within the target language corpus. Decoding the training language corpus with this model yields an optimized multigram lexicon of reduced size with a high coverage rate of out-of-vocabulary (OOV) words compared to the traditional word modeling approach. The handwriting recognition system is composed of two components: the optical model and a statistical n-gram language model over multigrams. The two models are combined during recognition using a decoding technique based on Weighted Finite State Transducers (WFST). We evaluate the approach on two Latin-script language datasets (the French RIMES and English IAM datasets) and show that it outperforms word and character language models at high OOV rates, and that it performs similarly to these traditional models at low OOV rates, with the advantage of reduced complexity.
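To make the notion of multigram decoding concrete, the following is a minimal sketch, not the system described here. It assumes a simplified model in which each multigram has an independent (unigram) probability, and it finds the most probable segmentation of a character string by dynamic programming. The function name `segment`, the parameter `max_len`, and the table `TOY_PROBS` with its values are hypothetical and chosen only for illustration; the actual approach estimates its model from the training corpus with a hidden semi-Markov model.

```python
import math

def segment(word, probs, max_len=3):
    """Return the most probable split of `word` into multigrams,
    or None if no split exists with the given multigram table.
    Assumes independent multigram probabilities (a simplification)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            unit = word[start:end]
            if unit in probs and best[start] > -math.inf:
                score = best[start] + math.log(probs[unit])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None
    # Follow the back-pointers from the end of the word to recover the units.
    units = []
    end = n
    while end > 0:
        start = back[end]
        units.append(word[start:end])
        end = start
    return units[::-1]

# Hypothetical toy multigram table (illustrative values, not estimated from data).
TOY_PROBS = {
    "th": 0.2, "er": 0.2, "he": 0.15, "e": 0.15,
    "the": 0.1, "r": 0.1, "t": 0.05, "h": 0.05,
}

print(segment("there", TOY_PROBS))   # ['th', 'er', 'e']
print(segment("tether", TOY_PROBS))  # ['t', 'e', 'th', 'er']
```

In this toy setting, "tether" is decomposed into known units even though it never appears as a whole entry in the table, which is the intuition behind covering OOV words with a small sub-lexical lexicon.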