OCR for Non-Latin Scripts
Optical character recognition developed primarily around Latin alphabets — English, French, German, Spanish. The core assumptions reflect this heritage: text flows left to right, characters are discrete units separated by whitespace, and the alphabet contains a manageable number of symbols. None of these assumptions hold for much of the world's writing.
Arabic flows right to left with mandatory cursive connections. Chinese has tens of thousands of distinct characters with no spaces between words. Devanagari connects characters along a horizontal headline and uses complex stacking rules for consonant clusters. Each script family presents structural challenges that Latin-trained systems cannot handle without fundamental architectural changes.
This article examines why non-Latin scripts are harder for OCR, how researchers have addressed script-specific challenges, and where multilingual recognition is heading.
What Makes Non-Latin Scripts Harder
The difficulty is not simply a matter of adding more character classes. Non-Latin scripts differ from Latin in structural ways that affect every stage of the OCR pipeline.
Segmentation complexity. Latin text segments naturally at the character level — letters are usually separate shapes with clear boundaries. Arabic script is inherently cursive: characters connect to their neighbours and change shape depending on position (initial, medial, final, or isolated). Devanagari characters attach to a continuous horizontal line (the Shirorekha), making it difficult to determine where one character ends and the next begins. Chinese and Japanese text has no spaces between words, requiring language-model-driven word segmentation as a post-OCR step.
Character set size. English uses 52 letter forms (upper and lower case) plus digits and punctuation. Arabic has 28 base letters with up to 4 positional forms each, plus diacritical marks. Chinese contains over 50,000 characters in Unicode, though practical recognition focuses on the 3,000–7,000 most common. The combinatorial scale affects training data requirements, model capacity, and recognition accuracy.
Directional complexity. Arabic and Hebrew read right to left. Traditional Chinese and Japanese can be written vertically. Documents mixing scripts — a Chinese technical paper with English citations, an Arabic text with embedded URLs — require the layout analysis system to detect and handle multiple reading directions within a single page.
Diacritics and modifiers. Arabic uses dots above and below letters as integral parts of the character (not optional annotations). Hindi and other Devanagari-based languages use vowel modifiers (matras) that attach at different positions around the consonant. These modifiers are frequently small, easily damaged in degraded documents, and critical for correct reading — a missing dot in Arabic can change one letter into an entirely different one.
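One of these pipeline stages is easy to make concrete: dictionary-driven word segmentation of unspaced CJK text. A common baseline is greedy forward maximum matching, sketched below with a toy dictionary (the dictionary and word list are illustrative only, not a production resource):

```python
# Greedy forward maximum matching: a classic dictionary-based word
# segmenter for Chinese text, which has no spaces between words.
# The tiny dictionary below is illustrative only.
DICTIONARY = {"光学", "字符", "识别", "系统"}
MAX_WORD_LEN = 4

def segment(text: str) -> list[str]:
    """Split unspaced CJK text into dictionary words, longest match first."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when no dictionary word matches.
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("光学字符识别系统"))  # → ['光学', '字符', '识别', '系统']
```

Real systems replace the greedy dictionary lookup with a statistical or neural language model, but the structural point stands: word boundaries are a post-recognition inference step, not something visible in the image.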
Arabic Script Recognition
Arabic presents a distinctive combination of challenges. The script is fully cursive in both print and handwriting — unlike Latin, where print separates letters that handwriting connects. Every word is a connected sequence, and letter shapes depend on context.
The variability is substantial. The letter "Ha" (ه) has four distinct shapes depending on its position in a word. Some letter pairs form mandatory ligatures where the individual shapes merge into a new combined form. Diacritical dots distinguish otherwise identical shapes: "Ba" (ب), "Ta" (ت), and "Tha" (ث) differ only in the number and position of dots below or above the same base shape.
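These contextual and dot-based distinctions are visible directly in Unicode. The snippet below uses only the standard library: the Arabic Presentation Forms-B block encodes Heh's four positional glyphs as separate code points, and the character names for Ba/Ta/Tha show that only the dots differ:

```python
import unicodedata

# The abstract letter Heh (U+0647) has four positional glyphs; Unicode's
# Presentation Forms-B block encodes them separately, making the shaping visible.
for cp in (0xFEE9, 0xFEEA, 0xFEEB, 0xFEEC):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))

# Ba, Ta, and Tha share one base skeleton; only the dots distinguish them.
for letter in "بتث":
    print(letter, unicodedata.name(letter))
```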
Early Arabic OCR systems attempted to segment words into individual characters before recognition — the same approach that worked for Latin. This proved unreliable because segmentation points within connected Arabic text are ambiguous. Modern systems increasingly use sequence-to-sequence approaches that recognize entire words or text lines without explicit character segmentation [1].
Deep learning has significantly improved Arabic OCR, particularly LSTM-based sequence models that process text lines as temporal sequences. However, challenges remain with historical Arabic manuscripts, where calligraphic styles vary dramatically across periods, regions, and scribes.
Chinese, Japanese, and Korean Characters
CJK character recognition faces a fundamentally different problem: scale. Rather than recognizing a small alphabet, the system must distinguish among thousands of visually similar characters. Chinese alone has common character sets of 3,755 (GB2312 Level 1) to 6,763 (GB2312 complete) characters, with less common characters pushing the total far higher.
The visual similarity between characters creates recognition ambiguities that context alone cannot resolve. Characters differing by a single stroke — 未 (not yet) versus 末 (end), 土 (earth) versus 士 (scholar) — require precise spatial analysis. The stroke count ranges from one (一) to over thirty, and the spatial arrangement of components follows complex two-dimensional structures rather than the linear left-to-right sequence of Latin characters.
Zhang, Bengio, and Liu established benchmark results for handwritten Chinese character recognition by combining traditional directional feature maps with deep convolutional networks, achieving state-of-the-art accuracy without data augmentation [2].
A particularly promising approach decomposes Chinese characters into their constituent radicals — the structural building blocks that combine to form complete characters. Wang et al. demonstrated that a radical-based recognition network achieves strong accuracy on known characters and, critically, can recognize character classes never seen during training [3].
This radical decomposition approach parallels how humans learn Chinese characters — understanding components and their spatial relationships rather than memorizing each character as a monolithic shape. It also addresses the zero-shot recognition problem: a model trained on common characters can recognize rare or historical characters composed of familiar radicals.
Japanese adds another layer of complexity. The writing system uses three scripts simultaneously: kanji (borrowed Chinese characters), hiragana (cursive syllabary), and katakana (angular syllabary). A single sentence may contain all three, requiring the recognition system to handle script switching without explicit delimiters.
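A rough per-character script tagger over standard Unicode block ranges shows how densely the three scripts interleave in ordinary Japanese text:

```python
# Tag each character by Unicode block: Hiragana (U+3040–U+309F),
# Katakana (U+30A0–U+30FF), CJK Unified Ideographs (U+4E00–U+9FFF).
def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

# "私はコーヒーを飲む" (I drink coffee) switches script four times with
# no delimiters: kanji, hiragana, katakana, hiragana, kanji, hiragana.
for ch in "私はコーヒーを飲む":
    print(ch, script_of(ch))
```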
Devanagari and Indic Scripts
Devanagari — used for Hindi, Sanskrit, Marathi, Nepali, and other South Asian languages — has structural properties that challenge standard OCR approaches. The most distinctive feature is the Shirorekha (headline), a horizontal line connecting characters across the top of each word. This headline makes character segmentation particularly difficult because it creates a continuous connected component spanning multiple characters [4].
Devanagari compounds the segmentation problem with compound characters (conjunct consonants) — two or more consonants combined into a single visual unit with its own shape. The conjunct "क्ष" (ksha) merges the consonants क and ष into a form that bears little visual resemblance to its components. The number of possible compound characters is large and script-dependent, creating a long tail of rare forms that appear infrequently in training data.
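The mismatch between visual form and underlying encoding is easy to verify with the standard library: the single-looking glyph क्ष is stored as consonant + virama + consonant, with the virama (halant) suppressing the first consonant's inherent vowel:

```python
import unicodedata

# The conjunct "क्ष" (ksha) looks like one glyph but encodes as three
# code points: KA + VIRAMA + SSA. An OCR system must map one visual
# unit back to this multi-character sequence.
for ch in "क्ष":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```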
Vowel modifiers (matras) attach to consonants at different positions — above, below, before, or after the base character. These modifiers are essential for correct reading but are often small and easily confused with noise or artefacts in degraded documents. The same modifier appears in different positions depending on the base consonant, requiring the recognition system to model spatial relationships between components [5].
The broader family of Indic scripts — Bangla, Tamil, Telugu, Kannada, Malayalam, Gujarati, and others — share many of these structural challenges while differing in specific details. Each script has its own modifier rules, compound character conventions, and glyph shapes. Building OCR systems for all Indian languages requires either script-specific models or architectures flexible enough to handle the structural variation across the family.
Toward Multilingual Recognition
Building separate OCR systems for each of the world's scripts does not scale. A single document may contain multiple scripts — a Japanese academic paper with English references, an Indian government form with Hindi and English sections, a medieval manuscript mixing Arabic and Persian. Practical OCR systems need to handle script diversity within a unified framework.
Huang et al. proposed a multiplexed architecture that performs script identification at the word level and routes each detected word to a script-specific recognition head [6].
This approach handles Latin, Chinese, Japanese, Korean, Arabic, Bangla, and Hindi within a single model. The key architectural insight is that script identification and character recognition can share a common visual encoder while using specialized decoders for each script family. The shared encoder learns script-invariant visual features (edges, strokes, spatial patterns), while the script-specific heads learn the structural rules particular to each writing system.
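The routing pattern can be sketched in a few lines. This is a structural sketch only — the encoder, script identifier, and decoders below are string-returning stand-ins, not the paper's actual networks:

```python
# Multiplexed OCR sketch: one shared encoder, a script-identification
# step, and per-script decoder heads. All components here are stand-ins
# that return strings so the data flow is visible.
def shared_encoder(word_image: str) -> str:
    # Stand-in for a shared CNN/transformer backbone producing features.
    return f"features({word_image})"

DECODERS = {
    "latin":  lambda feats: f"latin-decode[{feats}]",
    "arabic": lambda feats: f"arabic-decode[{feats}]",
    "cjk":    lambda feats: f"cjk-decode[{feats}]",
}

def identify_script(word_image: str) -> str:
    # Stand-in for the script-ID head; here the "image" carries its label.
    return word_image.split(":")[0]

def recognize(word_image: str) -> str:
    feats = shared_encoder(word_image)          # shared across all scripts
    head = DECODERS[identify_script(word_image)]  # script-specific decoder
    return head(feats)

print(recognize("arabic:word_0"))
```

The design choice the sketch captures: features are computed once by the shared encoder, and only the comparatively small decoder heads are script-specific, which is what lets one model cover seven writing systems.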
Vision transformers are accelerating progress in multilingual OCR. Self-attention mechanisms can model the two-dimensional spatial relationships within Chinese characters, the long-range cursive connections in Arabic, and the modifier placement rules in Devanagari — all within the same architecture. Pre-training on large multilingual document collections builds representations that transfer across scripts, reducing the amount of script-specific training data needed.
Practical Challenges
Beyond algorithmic advances, deploying non-Latin OCR in production raises practical issues.
Training data imbalance. English and Chinese have large, well-annotated datasets. Many scripts — Tibetan, Khmer, Syriac, Ge'ez — have minimal digitized ground truth. Training models for under-resourced scripts often requires synthetic data generation, transfer learning from related scripts, or crowd-sourced annotation campaigns.
Font and style variation. Arabic calligraphy encompasses styles from Naskh (the standard print form) through Thuluth, Diwani, and Nastaliq (used for Urdu). Chinese has regular script, running script, cursive script, and seal script — each with dramatically different visual characteristics. An OCR system trained on modern printed text will fail on historical calligraphic documents.
Mixed-script documents. Real documents rarely contain a single script. Code-switching between languages, embedded foreign terms, mathematical notation, and bilingual headers require the OCR system to detect script boundaries dynamically. Errors in script detection cascade into recognition errors, as characters routed to the wrong script decoder produce nonsensical output.
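Detecting script boundaries dynamically amounts to splitting text into maximal same-script runs before routing. A minimal sketch over Unicode block ranges (the three-way classification is deliberately coarse; real systems use full script properties):

```python
from itertools import groupby

# Split mixed-script text into same-script runs, so each run can be
# routed to the right recognizer. A run tagged with the wrong script
# would be decoded by the wrong head and come out as nonsense.
def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"
    if ch.isascii() and ch.isalnum():
        return "latin"
    return "other"

def script_runs(text: str) -> list[tuple[str, str]]:
    return [(s, "".join(g)) for s, g in groupby(text, key=script_of)]

print(script_runs("汉字OCR很难"))
# → [('cjk', '汉字'), ('latin', 'OCR'), ('cjk', '很难')]
```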
Evaluation standards. There is no single benchmark covering all major scripts. Research communities for Arabic, Chinese, and Indic scripts each maintain their own competitions and datasets (the ICDAR competition series, CASIA, IIIT-HW). Cross-script comparison requires careful normalization of metrics and datasets, which is rarely done systematically.
Conclusion
Non-Latin script recognition remains one of the most challenging areas in OCR, driven by structural differences that go far beyond character set size:
- Arabic's mandatory cursive connections and context-dependent character shapes make segmentation-free sequence models essential
- Chinese character recognition requires handling thousands of visually similar classes, with radical decomposition offering a path to recognizing unseen characters
- Devanagari's headline connections, compound characters, and position-dependent modifiers demand models that understand two-dimensional spatial relationships between components
- Multilingual architectures using shared visual encoders with script-specific decoders can handle multiple writing systems within a single model
- Training data scarcity for many of the world's scripts remains a fundamental bottleneck
The field is moving toward universal OCR systems that treat script as a variable rather than a system boundary. Vision transformers and large-scale multilingual pre-training are making this feasible, but practical deployment for the world's hundreds of writing systems remains far from solved.
References
[1] Kasem, M.S., Mahmoud, M. & Kang, H.-S. (2025). Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey. ACM Computing Surveys, Vol. 58, No. 4, pp. 1–37.
[2] Zhang, X.-Y., Bengio, Y. & Liu, C.-L. (2017). Online and Offline Handwritten Chinese Character Recognition: A Comprehensive Study and New Benchmark. Pattern Recognition, Vol. 61, pp. 348–360.
[3] Wang, W., Zhang, J., Du, J., Wang, Z.-R. & Zhu, Y. (2018). DenseRAN for Offline Handwritten Chinese Character Recognition. Proceedings of ICFHR 2018, pp. 104–109.
[4] Jayadevan, R., Kolhe, S.R., Patil, P.M. & Pal, U. (2011). Offline Recognition of Devanagari Script: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, Vol. 41, No. 6, pp. 782–796.
[5] Bag, S. & Harit, G. (2013). A Survey on Optical Character Recognition for Bangla and Devanagari Scripts. Sadhana, Vol. 38, pp. 133–168.
[6] Huang, J., Pang, G., Kovvuri, R., Toh, M., Liang, K.J., Krishnan, P., Yin, X. & Hassner, T. (2021). A Multiplexed Network for End-to-End, Multilingual OCR. IEEE/CVF CVPR 2021, pp. 4547–4557.