Dhanya, D and Ramakrishnan, AG and Pati, Peeta Basa (2002) Script identification in printed bilingual documents. In: Sadhana, 27 (1). pp. 73-82.
Identifying the script of the text in multi-script documents is an important step in the design of an OCR system for the analysis and recognition of a page. Much work has already been reported in this area for Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, though some results have been reported, the task is still in its infancy. The work presented in this paper identifies the script, at the word level, in a bilingual document containing Roman and Tamil scripts. Two different approaches are proposed and tested. In the first method, words are divided into three distinct spatial zones. The spatial spread of a word in the upper and lower zones, together with the character density, is used to identify the script. The second technique analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words in various font styles and sizes have been used to test the proposed algorithms, and the results are quite encouraging.
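As a rough illustration of the second technique, the sketch below computes a directional energy feature for a word image using a small bank of real-valued Gabor filters. It is not the authors' implementation: the function names, the filter frequency, the four orientations, the Gaussian width `sigma`, the kernel size, and the use of circular FFT convolution are all assumptions chosen for the example, since the abstract does not give these values.

```python
import numpy as np

def gabor_kernel(freq, theta, sigma=4.0, half_size=10):
    """Real (cosine) Gabor kernel; sigma and half_size are assumed values."""
    ys, xs = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1]
    # Coordinate along the direction in which the sinusoid oscillates.
    x_rot = xs * np.cos(theta) + ys * np.sin(theta)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * freq * x_rot)

def filter_response(image, kernel):
    """Circular convolution via the FFT (a simplification for this sketch)."""
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(spectrum))

def directional_energy(image, freqs, thetas):
    """Normalised response energy per (frequency, orientation) pair."""
    image = image - image.mean()  # remove the DC component before filtering
    feats = np.array([
        [np.sum(filter_response(image, gabor_kernel(f, t)) ** 2) for t in thetas]
        for f in freqs
    ])
    total = feats.sum()
    return feats / total if total > 0 else feats

# Synthetic stand-in for a word image: vertical stripes with a period of 8 pixels,
# so the intensity varies only along the horizontal axis.
cols = np.arange(32)
stripes = np.tile(np.sin(2.0 * np.pi * 0.125 * cols), (32, 1))

thetas = np.deg2rad([0, 45, 90, 135])  # assumed orientation set
energy = directional_energy(stripes, [0.125], thetas)
print(energy.round(4))
```

For the striped test image, most of the energy falls in the 0 degree channel, because that filter oscillates in the same direction as the stripes. On real word images, a feature vector of this kind would be passed to a classifier to separate the two scripts; the choice of classifier is outside the scope of this sketch.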
|Item Type:|Journal Article|
|Additional Information:|Copyright of this article belongs to Indian Academy of Sciences.|
|Department/Centre:|Division of Electrical Sciences > Electrical Engineering|
|Date Deposited:|11 Jun 2006|
|Last Modified:|19 Sep 2010 04:29|