Building a Sri Lankan Tamil Named Entity Recognition Dataset for Low-Resource NLP

The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies for major global languages. However, low-resource languages such as Sri Lankan Tamil still lack high-quality annotated datasets, especially for foundational tasks like Named Entity Recognition (NER).

To address this gap, we developed the Srilankan-Tamil-NER Dataset, a Tamil NER dataset designed specifically for Sri Lankan Tamil linguistic and contextual usage.

This dataset is intended to support the fine-tuning and downstream use cases described later in this post.

Why Sri Lankan Tamil NER Matters

Named Entity Recognition (NER) is a core NLP task that identifies and classifies entities such as:

  • Person names
  • Locations
  • Organizations
  • Dates
  • Miscellaneous entities

NER acts as a foundational layer for many downstream NLP systems including:

  • Question answering
  • Search systems
  • Chatbots
  • Document intelligence
  • Machine translation
  • Knowledge graph generation

For Tamil, and particularly for Sri Lankan Tamil, publicly available annotated corpora remain extremely limited. Existing multilingual datasets often underrepresent regional linguistic variation, local named entities, and culturally specific terminology.

Most existing NER systems for Tamil are trained on Indian Tamil corpora, leaving significant gaps in handling:

  • Sri Lankan Tamil vocabulary
  • Local organization names
  • Sri Lankan place names
  • Government and institutional terminology

Our dataset aims to bridge this gap.


About the Dataset

Srilankan-Tamil-NER Dataset

The primary goal of this dataset is to provide a high-quality, manually curated Named Entity Recognition corpus for Sri Lankan Tamil.

The dataset is structured to support fine-tuning transformer-based multilingual models such as:

  • IndicNER
  • mBERT
  • XLM-RoBERTa
  • MuRIL
  • IndicBERT

IndicNER itself was trained on multiple Indian languages, including Tamil, and serves as a strong baseline for NER in those languages.


Dataset Preparation Pipeline

Creating a Tamil NER dataset involves significantly more than simple annotation.

The preparation workflow included multiple stages:

1. Data Collection

The raw Tamil text corpus was collected from the Noolaham corpus and other relevant publicly available Sri Lankan Tamil textual sources. Special attention was given to:

  • local linguistic relevance
  • entity diversity
  • sentence quality
  • contextual richness

The objective was to capture realistic Sri Lankan Tamil usage patterns rather than synthetic or translated text.


2. OCR and Text Normalization

Tamil NLP pipelines often begin with scanned or image-based documents.

As part of our broader Tamil document intelligence workflow, OCR-extracted Tamil text underwent:

  • Unicode normalization
  • punctuation cleaning
  • whitespace normalization
  • invalid character filtering
  • OCR noise reduction

OCR-related preprocessing is essential because character-level errors in Tamil script carry through tokenization and degrade token classification.
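The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used for the dataset: the function name `normalize_tamil_text`, the set of allowed characters, and the decision to drop zero-width characters are all assumptions made for the example.

```python
import re
import unicodedata

# Zero-width characters and BOMs that OCR or text extraction may leave behind.
# Dropping them unconditionally is an assumption made for this sketch.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# Anything outside the Tamil block, ASCII letters/digits, whitespace and
# basic punctuation is treated as invalid (an illustrative choice).
_DISALLOWED = re.compile(r"[^\u0B80-\u0BFFA-Za-z0-9\s.,;:!?()'-]")
# Runs of the same punctuation mark, e.g. "!!!" or "..".
_REPEATED_PUNCT = re.compile(r"([.,;:!?])\1+")
_WHITESPACE = re.compile(r"\s+")


def normalize_tamil_text(text: str) -> str:
    """Apply Unicode, character, punctuation and whitespace cleanup."""
    # NFC composes split vowel signs (e.g. U+0BC6 + U+0BBE -> U+0BCA).
    text = unicodedata.normalize("NFC", text)
    text = _INVISIBLE.sub("", text)
    text = _DISALLOWED.sub(" ", text)
    text = _REPEATED_PUNCT.sub(r"\1", text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "  யாழ்ப்பாணம்\u200b   பல்கலைக்கழகம்!!!  ¤ "
    print(normalize_tamil_text(raw))
```

Normalizing to NFC first means later steps, and the tokenizer, see one consistent code point sequence for each vowel sign regardless of how the OCR engine emitted it.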


3. Named Entity Annotation

The dataset was manually annotated using the BIO tagging format:

| Tag Type | Description |
|----------|-------------|
| B-PER | Beginning of person entity |
| I-PER | Inside person entity |
| B-LOC | Beginning of location entity |
| I-LOC | Inside location entity |
| B-ORG | Beginning of organization entity |
| I-ORG | Inside organization entity |
| O | Non-entity token |

BIO tagging remains one of the most widely adopted standards for NER annotation pipelines.
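For fine-tuning, the tag set above is usually turned into integer label maps, and annotated sequences are checked for BIO consistency. The sketch below shows one way to do both. The names `ENTITY_TYPES`, `LABELS`, `label2id`, `id2label`, and `find_bio_violations` are illustrative and not part of the dataset release. Maps of this shape are what token-classification models in Hugging Face `transformers` accept as `id2label` and `label2id` in their configuration.

```python
ENTITY_TYPES = ["PER", "LOC", "ORG"]

# "O" first, then B-/I- pairs per entity type, matching the tag table above.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def find_bio_violations(tags):
    """Return indices of tags that are unknown or break the BIO scheme.

    An I-X tag is valid only directly after B-X or I-X of the same type.
    """
    violations = []
    previous = "O"
    for index, tag in enumerate(tags):
        if tag not in label2id:
            violations.append(index)
        elif tag.startswith("I-") and previous[2:] != tag[2:]:
            violations.append(index)
        previous = tag
    return violations


if __name__ == "__main__":
    print(LABELS)
    print(find_bio_violations(["B-PER", "I-PER", "O", "I-LOC"]))
```

Running a check like this over every annotated sentence is a cheap way to catch annotation slips before they reach training.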

Example:

| Token | Label |
|-------|-------|
| இராமநாதன் | B-PER |
| யாழ்ப்பாணம் | B-LOC |
| பல்கலைக்கழகம் | B-ORG |
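To make the token-level labels usable downstream, BIO tags are typically grouped back into entity spans. The following is a small sketch of that grouping, using the example tokens above. The function name `extract_entities` is hypothetical, and the two-token organization in the second call is an illustration written for this post rather than a row taken from the dataset.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity (dropped here).
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


if __name__ == "__main__":
    # Tokens and labels from the example table.
    print(extract_entities(
        ["இராமநாதன்", "யாழ்ப்பாணம்", "பல்கலைக்கழகம்"],
        ["B-PER", "B-LOC", "B-ORG"],
    ))
    # Hypothetical multi-token organization, to show how I- tags extend a span.
    print(extract_entities(
        ["யாழ்ப்பாணப்", "பல்கலைக்கழகம்"],
        ["B-ORG", "I-ORG"],
    ))
```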

Challenges in Sri Lankan Tamil NER

Building a Tamil NER dataset introduced several language-specific challenges.

Morphological Complexity

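Tamil is a morphologically rich, agglutinative language: suffixes and grammatical inflections attach directly to word stems, so an entity name often appears fused with case markers and other endings, which blurs token and entity boundaries.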

This creates difficulties for:

  • tokenizer alignment
  • entity span detection
  • subword classification
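Tokenizer alignment is the most mechanical of these problems: a subword tokenizer may split one annotated Tamil word into several pieces, while the dataset carries one label per word. A common approach, sketched below, labels only the first piece of each word and masks the rest. The function name `align_labels_to_subwords` is illustrative. The `word_ids` input is assumed to have the shape returned by `word_ids()` on a Hugging Face fast-tokenizer encoding (one word index per subword, `None` for special tokens), and is hand-written here so the example runs without any model download.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels_to_subwords(word_ids, word_label_ids):
    """Map word-level label ids onto subword positions.

    The first subword of each word receives the word's label. Special tokens
    and continuation subwords receive IGNORE_INDEX so the loss skips them.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token
        elif word_id != previous_word:
            aligned.append(word_label_ids[word_id])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)             # continuation piece
        previous_word = word_id
    return aligned


if __name__ == "__main__":
    # Three words; the first is split into 3 pieces and the last into 2.
    word_ids = [None, 0, 0, 0, 1, 2, 2, None]
    word_label_ids = [1, 0, 3]  # e.g. B-PER, O, B-LOC under some label map
    print(align_labels_to_subwords(word_ids, word_label_ids))
```

Labeling only the first piece keeps evaluation at the word level, which matches how the corpus was annotated.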

OCR Noise

Tamil OCR systems still produce:

  • broken grapheme clusters
  • merged tokens
  • Unicode inconsistencies
  • punctuation corruption

NER systems trained on clean datasets often fail under noisy OCR conditions.
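Broken grapheme clusters are one kind of OCR noise that can be flagged automatically. The sketch below is a heuristic written for illustration, not a tool used in building the dataset, and the helper names are hypothetical. It assumes that, after NFC normalization, every Tamil vowel sign or pulli should directly follow a Tamil consonant (U+0B95 to U+0BB9), and it reports any mark that does not.

```python
import unicodedata


def is_tamil_mark(ch: str) -> bool:
    """True for combining signs (vowel signs, pulli, etc.) in the Tamil block."""
    return "\u0b80" <= ch <= "\u0bff" and unicodedata.category(ch) in ("Mn", "Mc")


def is_tamil_consonant(ch: str) -> bool:
    """True for code points in the Tamil consonant range (KA to HA)."""
    return "\u0b95" <= ch <= "\u0bb9"


def find_orphan_marks(text: str):
    """Return indices (in the NFC-normalized text) of marks lacking a base consonant."""
    text = unicodedata.normalize("NFC", text)
    orphans = []
    for index, ch in enumerate(text):
        if is_tamil_mark(ch) and (index == 0 or not is_tamil_consonant(text[index - 1])):
            orphans.append(index)
    return orphans


if __name__ == "__main__":
    print(find_orphan_marks("யாழ்ப்பாணம்"))  # well-formed word
    print(find_orphan_marks("ய ாழ்ப்பாணம்"))  # vowel sign split off by a space
```

A check like this can be used to route suspicious lines to manual review, or to build the kind of OCR-noisy evaluation subset mentioned under future work.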


Limited Existing Corpora

Compared to English, Tamil lacks:

  • large benchmark corpora
  • standardized annotation frameworks
  • domain-diverse datasets
  • region-specific entity resources

Several studies have highlighted the low-resource limitations of Tamil and Sinhala NER ecosystems.


Dataset Design Considerations

During preparation, the following design principles were prioritized.

Entity Diversity

The corpus includes diverse entity categories relevant to Sri Lankan contexts.

Examples include:

  • government institutions
  • educational organizations
  • local geographic locations
  • personal names
  • cultural entities

Fine-Tuning Use Cases

This dataset can be used to fine-tune transformer architectures for:

| Use Case | Description |
|----------|-------------|
| Tamil NER | Entity extraction |
| OCR Post-processing | Structured information extraction |
| Search Systems | Semantic indexing |
| RAG Pipelines | Entity-aware retrieval |
| Tamil Chatbots | Context understanding |
| Government Document AI | Information extraction |
| Knowledge Graphs | Entity linking |

Importance for Low-Resource NLP

The broader importance of this dataset extends beyond NER itself.

Low-resource language ecosystems require foundational datasets before advanced systems such as LLMs, multilingual agents, and reasoning pipelines can perform effectively.

Datasets like this contribute toward:

  • Sri Lankan Tamil digital preservation
  • multilingual AI inclusivity
  • regional NLP research
  • culturally aware AI systems

Recent multilingual NER research also highlights the importance of publicly available Tamil and Sinhala annotated corpora for improving low-resource NLP systems.


Future Directions

Planned future improvements include:

  • expanded entity categories
  • larger corpus coverage
  • nested entity support
  • OCR-noisy benchmark subsets
  • cross-domain annotations
  • multilingual Tamil-Sinhala-English alignment
  • relation extraction extensions

Conclusion

The Srilankan-Tamil-NER Dataset, developed under CTNLPR, represents an important step toward strengthening the Sri Lankan Tamil NLP ecosystem through high-quality entity annotation and linguistically relevant corpus preparation.

By focusing on realistic Sri Lankan Tamil usage, OCR-aware preprocessing, and transformer-ready annotation formats, the dataset aims to support future advancements in:

  • Tamil NLP
  • multilingual transformers
  • document intelligence
  • AI for low-resource languages

As multilingual AI systems continue evolving rapidly, foundational datasets such as this remain critical for improving language inclusivity and advancing Tamil-focused NLP research.

