Indic ELR MT Research

Autonomous Multi-Agent Orchestration for Extremely Low-Resource Indic Machine Translation

An end-to-end quality-aware synthetic data curation system for English-to-Assamese and English-to-Odia translation, leveraging local LLM agents, Bengali pivot transfer, and compute-efficient LoRA adaptation.

Tags: Indic NLP · Low-Resource MT · Multi-Agent AI · LangGraph · LoRA / PEFT
Background & Motivation

My MTech research focuses on improving machine translation for extremely low-resource Indic languages such as Assamese and Odia through multi-agent AI systems that curate synthetic training data.

How the Lack of Digital Text Corpora Holds Back Low-Resource Languages

Extremely low-resource (ELR) Indic languages such as Assamese and Odia face severe challenges in neural machine translation (NMT) due to the chronic scarcity of high-quality parallel corpora. NMT quality directly correlates with the scale and diversity of training data, but these languages have significantly fewer resources compared to high-resource pairs like Hindi-English. This data scarcity leads to substantial performance gaps, where even moderate domain shifts cause dramatic degradation in translation quality.

The lack of digital text corpora manifests in multiple ways: insufficient supervision signals for robust generalization, limited variety in training examples that prevents effective interpolation for new inputs, and an inability to capture the full linguistic richness of morphologically complex languages. Synthetic data generation through back-translation, while helpful, often introduces artifacts like literal word-order structures, semantic drift, and reduced cultural naturalness, further degrading model performance.

For languages with agglutinative morphology and postpositional case systems like Assamese and Odia, these artifacts are particularly damaging. Collapsed morphological inflections, untranslated English tokens, and drifted paraphrases that don't reflect natural usage patterns compound the problem, creating a vicious cycle where poor synthetic data leads to even poorer models.

Current Challenges in Machine Translation for ELR Indic Languages

The primary challenge is data scarcity compounded by domain imbalance. Existing Indic benchmarks show consistent large performance gaps between high-resource and ELR language pairs. Even with thousands of sentence pairs, ELR models struggle with generalization, especially when encountering out-of-domain inputs.

Back-translation, the dominant augmentation strategy, introduces characteristic artifacts: word-order calques from the pivot language, collapsed verb conjugations and case markers, and semantically drifted outputs that lack cultural grounding. For morphologically rich languages, these artifacts pose severe quality issues, as critical grammatical information is lost or distorted.

Direct translation approaches suffer from extremely limited supervision, making it difficult to learn robust representations. Hardware constraints in academic settings further limit the ability to train large models, while the computational cost of generating and filtering synthetic data adds another layer of complexity.

Quality control mechanisms are inadequate; existing filtering based on cross-lingual similarity scores captures volume but not linguistic fidelity or pragmatic naturalness. This results in noisy corpora that propagate errors through the learned representations of downstream models.

Objective of the Research Work

The research aims to develop an autonomous multi-agent orchestration framework (AMAO) for quality-aware synthetic data curation in ELR Indic machine translation. The objective is to overcome data scarcity limitations by generating high-quality synthetic parallel corpora that preserve linguistic fidelity and cultural naturalness, enabling robust NMT models for low-resource Indic languages like Assamese and Odia.

Expected Outcomes and Impact

The research expects AMAO to outperform static back-translation by producing cleaner synthetic corpora free of low-quality artifacts. This will enable more reliable NMT models for ELR Indic languages, supporting digital inclusion and helping preserve linguistic diversity in NLP.

Broader impact includes methodological contributions to multi-agent translation workflows, parameter-efficient adaptation techniques, and quality-gated synthetic data curation frameworks applicable to other low-resource language pairs.

Project Snapshot

AMAO is an autonomous, locally hosted pipeline designed to improve translation quality for very low-resource Indic languages by filtering synthetic data through four role-specialized agents.

  • Target languages: Assamese, Odia
  • Backbone model: IndicTrans2 (English → Indic)
  • Local LLM backend: Ollama + qwen2.5:3b
  • PEFT strategy: LoRA adapters for compute efficiency
Research Goals
  • Reduce translation artifacts in synthetic parallel corpora
  • Retain only high-quality English-to-Assamese / English-to-Odia pairs
  • Use linguistic transfer from Bengali to accelerate adaptation
  • Enable research on resource-constrained academic hardware
Core Contributions

Multi-Agent Quality Gate

A four-agent pipeline automatically accepts, revises, or rejects synthetic translations to avoid noisy augmentation artifacts.
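The accept/revise/reject routing can be sketched as a small decision function. This is an illustrative, dependency-free sketch: the thresholds, the length-ratio heuristic standing in for the LLM critic, and the `Candidate` fields are all assumptions, not the project's actual scoring logic.

```python
from dataclasses import dataclass

# Hypothetical decision thresholds; real values would be tuned on gold seed data.
ACCEPT_THRESHOLD = 0.85
REVISE_THRESHOLD = 0.60
MAX_REVISIONS = 2

@dataclass
class Candidate:
    source: str
    translation: str
    revisions: int = 0

def critic_score(candidate: Candidate) -> float:
    """Stand-in for the LLM critic: a dummy length-ratio heuristic in [0, 1]."""
    ratio = len(candidate.translation) / max(len(candidate.source), 1)
    return max(0.0, 1.0 - abs(1.0 - ratio))

def quality_gate(candidate: Candidate) -> str:
    """Route a synthetic pair to accept / revise / reject."""
    score = critic_score(candidate)
    if score >= ACCEPT_THRESHOLD:
        return "accept"
    if score >= REVISE_THRESHOLD and candidate.revisions < MAX_REVISIONS:
        return "revise"
    return "reject"
```

Pairs routed to "revise" would be handed to the editor agent and re-scored; only "accept" verdicts reach the training corpus.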

Bengali Pivot Transfer

Leverages Indo-Aryan affinity to bootstrap Assamese/Odia adaptation through English-Bengali fine-tuning.

Local LLM Evaluation

Uses a local Ollama-hosted qwen2.5:3b model for evaluation and editing, avoiding cloud dependency.
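Querying a local Ollama server needs no SDK: its HTTP API exposes a `/api/generate` endpoint on port 11434. The sketch below uses only the standard library; the prompt wording and the `build_critic_prompt` helper are illustrative assumptions, not the project's actual prompts.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_critic_prompt(src: str, hyp: str) -> str:
    """Assemble an evaluation prompt for the critic role (wording is illustrative)."""
    return (
        "Rate the following English-to-Assamese translation from 0 to 1.\n"
        f"English: {src}\n"
        f"Assamese: {hyp}\n"
        "Answer with only the number."
    )

def query_ollama(prompt: str, model: str = "qwen2.5:3b") -> str:
    """Send a non-streaming generate request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because everything runs against localhost, no text ever leaves the machine, which matters for reproducibility and for working without API quotas.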

LoRA-Based PEFT

Maintains a frozen backbone while training low-rank adapter weights for lightweight fine-tuning.
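The compute savings follow from simple arithmetic: a rank-r adapter on a d×d projection trains two factors, A (r×d) and B (d×r), i.e. 2·d·r parameters instead of d². The sketch below works that out; the dimensions used are illustrative, not IndicTrans2's actual configuration.

```python
def lora_trainable_params(d_model: int, rank: int, n_layers: int,
                          projections_per_layer: int = 2) -> int:
    """Trainable parameters when each adapted d x d projection gets
    two low-rank factors A (r x d) and B (d x r): 2 * d * r per projection."""
    per_projection = 2 * d_model * rank
    return n_layers * projections_per_layer * per_projection

# Illustrative numbers (not IndicTrans2's real config): adapting the
# query and value projections in 18 layers of width 1024 at rank 16.
d, r, layers = 1024, 16, 18
full = layers * 2 * d * d  # fully fine-tuning the same projections
lora = lora_trainable_params(d, r, layers)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

At these settings the adapters hold about 3% of the parameters of the projections they modify, which is what makes fine-tuning feasible on a single academic GPU.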

[Figures: AMAO process diagram, data & architecture overview, and performance comparison.]
Data Strategy
  • Pivot corpus: English-Bengali parallel data for transfer initialization.
  • Gold seed data: Human-verified English-Assamese and English-Odia pairs.
  • Filtered synthetic corpus: Only AMAO-accepted translations enter training.
  • Evaluation set: Standard FLORES-200 devtest split for reproducible benchmarking.
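Assembling the final training mix amounts to concatenating the gold seed data with only the gate-accepted synthetic pairs. The sketch below is an assumed implementation (the function name and the source-side deduplication policy are mine, not taken from the project):

```python
def build_training_corpus(gold_pairs, synthetic_pairs, verdicts):
    """Combine gold seed data with accepted synthetic pairs only,
    dropping synthetic sources that duplicate an existing source.

    gold_pairs / synthetic_pairs: lists of (english, target) tuples.
    verdicts: per-synthetic-pair labels from the quality gate.
    """
    seen = {src for src, _ in gold_pairs}
    corpus = list(gold_pairs)
    for (src, tgt), verdict in zip(synthetic_pairs, verdicts):
        if verdict == "accept" and src not in seen:
            corpus.append((src, tgt))
            seen.add(src)
    return corpus
```

Deduplicating on the English side keeps a single (trusted) target per source, so noisy synthetic variants never compete with gold references during training.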
Evaluation & Metrics
  • BLEU: Lexical n-gram overlap for translation adequacy.
  • ChrF: Character-level score for morphological sensitivity.
  • COMET: Neural semantic adequacy aligned to human judgment.
  • Morphological Accuracy Score: Manual evaluation of verb and case-marker correctness.
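ChrF's sensitivity to morphology comes from matching character n-grams rather than whole words, so a partially correct inflection still earns credit. The stdlib sketch below shows the core computation for one sentence pair; it simplifies the standard definition (in practice one would use sacreBLEU, which also handles whitespace and averaging conventions).

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts, ignoring spaces as standard chrF does."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: average char n-gram precision and
    recall over n = 1..max_n, combined as an F-beta score (beta=2 favors recall)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        if hyp:
            precisions.append(overlap / sum(hyp.values()))
        if ref:
            recalls.append(overlap / sum(ref.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

For agglutinative targets like Assamese, a collapsed case suffix changes only a few characters, so chrF degrades gracefully where word-level BLEU scores the whole token as wrong.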
Why This Work Matters

The approach addresses a pressing problem in Indic NLP: how to generate useful synthetic training data for languages with extremely limited parallel resources while avoiding the common pitfalls of translationese, semantic drift, and morphological degradation.

Implementation Highlights

LangGraph Pipeline

Implements the four-agent workflow with conditional loops and typed state.
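The shape of that workflow can be shown without the LangGraph dependency: a TypedDict state threaded through node functions, with a conditional edge that loops revise verdicts back through the editor. Everything here is a stand-in (the dummy node bodies, field names, and revision cap are assumptions); the real pipeline wires actual LLM-backed agents through LangGraph's `StateGraph`.

```python
from typing import TypedDict

class PipelineState(TypedDict):
    source: str
    draft: str
    verdict: str
    revisions: int

# Node functions are dummy stand-ins for the real LLM-backed agents.
def translator(state: PipelineState) -> PipelineState:
    return {**state, "draft": state["source"].upper()}  # placeholder "translation"

def critic(state: PipelineState) -> PipelineState:
    verdict = "accept" if state["revisions"] >= 1 else "revise"
    return {**state, "verdict": verdict}

def editor(state: PipelineState) -> PipelineState:
    return {**state, "draft": state["draft"] + "!", "revisions": state["revisions"] + 1}

def run_pipeline(source: str, max_revisions: int = 2) -> PipelineState:
    """Translator -> critic, with a conditional loop critic -> editor -> critic."""
    state = PipelineState(source=source, draft="", verdict="", revisions=0)
    state = translator(state)
    while True:
        state = critic(state)
        if state["verdict"] != "revise" or state["revisions"] >= max_revisions:
            return state
        state = editor(state)
```

The revision cap bounds the loop so a stubbornly bad candidate is eventually rejected rather than cycling forever, mirroring the conditional-edge termination a LangGraph graph would enforce.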

Ollama Local LLM

Uses qwen2.5:3b for critic/editor/arbiter intelligence without cloud APIs.

LoRA Adapters

Fine-tunes only low-rank updates in the attention projections for efficiency.