Gap analysis with development of NLP tools and knowledge engineering

Summary: This project conducts a comprehensive gap analysis of existing digitization, natural language processing, and knowledge engineering capabilities for Sri Lankan Tamil, followed by the development of targeted NLP tools to address identified limitations. It delivers an end-to-end AI system that ingests, processes, and enriches digitized texts and documents to produce structured, reliable, and trustworthy ...

Custom GPT tool for question answering

Summary: The project aims to develop a Retrieval-Augmented Generation (RAG)–based question-answering system using custom, domain-specific English content. The system will retrieve relevant passages from a processed, metadata-enriched text corpus and use them as grounding context for a GPT model to generate accurate and context-aware responses. By combining information retrieval with generative AI, the system is ...
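The retrieve-then-ground step described in this summary can be illustrated with a minimal sketch. It assumes a simple bag-of-words cosine similarity for retrieval, which is a stand-in and not necessarily the project's method. The names `retrieve` and `build_prompt`, the toy corpus, and the `source` metadata field are hypothetical. The call to a GPT model is left out, so the sketch ends with the grounded prompt that would be sent to it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p["text"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, hits):
    """Join retrieved passages, tagged with their metadata, into a grounding context."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical toy corpus; the real corpus and metadata schema are assumptions.
corpus = [
    {"source": "doc-1", "text": "The library was founded to preserve printed heritage."},
    {"source": "doc-2", "text": "Rainfall in the region peaks during the monsoon season."},
    {"source": "doc-3", "text": "Volunteers digitize newspapers and books for the library."},
]

question = "Who digitizes books for the library?"
hits = retrieve(question, corpus, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

In a fuller system the word-count similarity would typically be replaced by dense embeddings and a vector index, and the prompt would be passed to the generative model to produce the answer.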

Development of digital content conversion pipeline

Summary: The content conversion pipeline transforms Noolaham’s digitized documents into structured, machine-readable text. It begins with the ingestion of digital documents such as scanned newspapers, books, magazines, and pamphlets. These documents are first preprocessed to improve quality and consistency. The pipeline then performs layout analysis to identify document structure, followed by article segmentation to separate logical ...

Entity Extraction in Tamil Tweets

Summary: Social media text, such as tweets, carries information on many important topics. Extracting that information starts with one of the most basic tasks in Natural Language Processing: entity extraction. Entities are real-world elements or objects such as person names, organization names, product names, and location names. Entities are often referred to as Named ...

Paraphrase Identification System for Tamil

Summary: A paraphrase can be defined as a sentence that expresses the same meaning as another sentence using different words. Paraphrases can be identified, generated, or extracted. This project focuses on sentence-level paraphrase identification for Tamil. Identifying paraphrases in Tamil is a difficult task, because evaluating the semantic similarity of the underlying content and ...

Language Resource Development for Tamil

Summary: The prerequisites for developing NLP applications in any language are the availability of lexical resources, corpora, and computational models. The sparseness of these resources for Tamil is one of the major reasons for the slow growth of NLP work in the Tamil language. Through this project we aim to reduce the gap by creating required language ...

A Taxonomy of Tamil NLP Research

Summary: The Center for Tamil NLP research conducts a thorough literature review of the existing methodologies, prior work, and language resources for the research topics identified. The literature review aims to identify gaps in current knowledge, avoid reinventing the wheel, and show that we build on a foundation of existing knowledge and ...