Automatic Script Identification in the Wild

Abstract

With the rapid increase of transnational communication and cooperation, people frequently encounter multilingual scenarios. In this paper, we are concerned with a relatively new problem: script identification at the word or line level in natural scenes. A large-scale dataset containing a large number of natural images and 10 widely used languages is constructed and released. To address the challenges of script identification in real-world scenarios, a deep learning based algorithm is proposed. Experiments on the proposed dataset demonstrate that our algorithm achieves superior performance compared with conventional image classification methods, such as the original CNN architecture and LLC.

The SIW-10 Dataset

Altogether, the dataset contains 13,045 multi-script text line images in 10 classes, cropped from 7,700 full images taken in the wild.

To collect the dataset, we harvest a collection of street view images from Google Street View and manually label the text regions by their bounding boxes (see Fig. 2 for examples). Text line images are then cropped from these images. Only horizontally written texts are included.

Statistics of the dataset:

Script	Training Size	Testing Size
Arabic	503	500
Chinese	809	500
English	725	500
Greek	522	500
Hebrew	770	500
Japanese	717	500
Korean	1064	500
Russian	532	500
Thai	1726	500
Tibetan	677	500
Total	8045	5000
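
The split sizes above can be cross-checked against the stated total of 13,045 text line images. The following is a minimal sketch that simply transcribes the table; the variable names are illustrative and not part of the dataset release.

```python
# Training and testing sizes per script, transcribed from the SIW-10 table.
SIW10_SPLITS = {
    "Arabic": (503, 500),
    "Chinese": (809, 500),
    "English": (725, 500),
    "Greek": (522, 500),
    "Hebrew": (770, 500),
    "Japanese": (717, 500),
    "Korean": (1064, 500),
    "Russian": (532, 500),
    "Thai": (1726, 500),
    "Tibetan": (677, 500),
}

# Sum each column, then combine them to get the overall image count.
total_train = sum(train for train, _ in SIW10_SPLITS.values())
total_test = sum(test for _, test in SIW10_SPLITS.values())
total_images = total_train + total_test

print(f"train={total_train} test={total_test} total={total_images}")
```

The training set is noticeably imbalanced (Thai has more than three times as many training images as Arabic), while the testing set is balanced at 500 images per script.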

Download the dataset: SIW-10.zip

Method and Results

MSPN (Multi-stage Spatially-sensitive Pooling Network) is a deep learning method for script identification proposed in our ICDAR 2015 paper "Automatic Script Identification in the Wild". Results and comparisons on the SIW-10 dataset are shown below:

Tab. 1 Comparison of classification accuracy (%) among MSPN, CNN-Patch and LLC on SIW-10.

Script	MSPN	CNN-Patch	LLC
Arabic	98.6	94.7	97.0
Chinese	95.2	84.0	92.6
English	91.4	70.1	74.0
Greek	86.4	72.0	86.0
Hebrew	94.0	90.7	91.4
Japanese	93.6	89.2	82.6
Korean	97.4	93.5	94.0
Russian	89.8	77.0	75.0
Thai	98.0	92.7	86.4
Tibetan	99.6	98.3	98.0
Average	94.4	87.6	88.7

Fig. 3 Left: Comparison of prediction errors on the SIW-10 dataset. Right: Confusion matrix.
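
For readers evaluating their own classifier on the dataset, per-script percentages and a confusion matrix like the one in Fig. 3 can be computed from the true and predicted labels of the test images. The sketch below is an illustration only: the function names and the toy labels are hypothetical, and it is not the evaluation code used in the paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Return matrix[i][j] = number of samples of class i predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each class's samples that fall on the diagonal."""
    accuracies = []
    for i, row in enumerate(matrix):
        row_total = sum(row)
        accuracies.append(row[i] / row_total if row_total else 0.0)
    return accuracies


# Toy labels for illustration only (not results from the paper).
classes = ["English", "Greek", "Russian"]
true_labels = ["English", "English", "Greek", "Greek", "Russian", "Russian"]
predicted_labels = ["English", "Greek", "Greek", "Greek", "Russian", "English"]

matrix = confusion_matrix(true_labels, predicted_labels, classes)
accuracies = per_class_accuracy(matrix)
# Unweighted mean over classes, so every script counts equally.
average_accuracy = sum(accuracies) / len(accuracies)

for name, accuracy in zip(classes, accuracies):
    print(f"{name}: {100 * accuracy:.1f}%")
print(f"Average: {100 * average_accuracy:.1f}%")
```

Because every script has the same number of test images (500), the unweighted mean over classes equals the overall accuracy on the test set.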

SIW-13

SIW-13 is a dataset extended from SIW-10, proposed in [2]. Statistics of this dataset:

Script	Training Size	Testing Size
Arabic	502	500
Cambodian	583	500
Chinese	798	500
English	721	500
Greek	518	500
Hebrew	742	500
Japanese	715	500
Kannada	529	500
Korean	1061	500
Mongolian	692	500
Russian	531	500
Thai	1722	500
Tibetan	677	500
Total	9791	6500

Download the dataset: SIW-13.zip

Citation

Please cite the paper if you find this dataset useful:

[1] Automatic Script Identification in the Wild [pdf]
Baoguang Shi, Cong Yao, Chengquan Zhang, Xiaowei Guo, Feiyue Huang, Xiang Bai
In Proceedings of ICDAR 2015 (oral presentation)

[2] Script Identification in the Wild via Discriminative Convolutional Neural Network [pdf]
Baoguang Shi, Xiang Bai and Cong Yao
Pattern Recognition, to appear.

We would like to thank NVIDIA for GPU donations.

For questions about the dataset, please contact shibaoguang [AT] gmail [DOT] com.