Unconstrained Bengali handwriting recognition with recurrent models
Abstract
This paper presents a pioneering attempt at developing a connectionist system based on recurrent neural networks for unconstrained offline Bengali handwriting recognition. The major challenge in configuring such a classification system for a complex script like Bengali is to define the character classes effectively. A novel way of defining character classes is introduced that makes the recognition problem suitable for a recurrent model. The system has to deal with more than nine hundred character classes, whose occurrence probabilities in the language are highly skewed. An off-the-shelf BLSTM-CTC recognizer is used. An open-source dataset is developed for unconstrained offline Bengali handwriting recognition. The dataset contains 2,338 handwritten text lines consisting of about 21,000 words. Experiments show that, with the new definition of character classes, the BLSTM-CTC recognizer achieves a character-level recognition accuracy of 75.40% without any post-processing of its output. Of the 24.60% character-level error, substitution, deletion and insertion errors account for 18.91%, 4.69% and 0.98%, respectively.
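The breakdown of the character-level error into substitutions, deletions and insertions can be illustrated with a minimal sketch. It assumes the conventional definition, in which the recognizer output is aligned to the ground truth by minimum edit distance and each error type is reported as a percentage of the number of reference characters; the abstract does not state the exact evaluation protocol, so this is an assumption. The function names `edit_counts` and `error_rates` are hypothetical, and the inputs are sequences of character-class labels rather than raw Unicode code points.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of one minimum-cost
    edit alignment between two label sequences (assumed evaluation scheme)."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, k)             # match, no cost
            else:
                diag = (t + 1, s + 1, d, k)     # substitution
            t, s, d, k = table[i - 1][j]
            delete = (t + 1, s, d + 1, k)       # reference label missing
            t, s, d, k = table[i][j - 1]
            insert = (t + 1, s, d, k + 1)       # spurious output label
            table[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    _, subs, dels, ins = table[n][m]
    return subs, dels, ins


def error_rates(reference, hypothesis):
    """Express each error type, and the accuracy, as a percentage of the
    reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    subs, dels, ins = edit_counts(reference, hypothesis)
    n = len(reference)
    return {
        "substitution": 100 * subs / n,
        "deletion": 100 * dels / n,
        "insertion": 100 * ins / n,
        "accuracy": 100 - 100 * (subs + dels + ins) / n,
    }


# Toy example with Latin letters standing in for character-class labels.
rates = error_rates(list("abcde"), list("abxd"))
```

In the toy example, one label is substituted and one is deleted out of five reference labels. When several alignments share the minimum cost, the split between error types depends on tie-breaking, so a different evaluation tool may report a slightly different breakdown.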