An aggregation of large models evaluated on Vietnamese-language benchmarks. Numbers are pulled from public leaderboards; the suite will expand with applied tasks curated by NRL.
Today, VN-Bench aggregates VMLU. The next version will add real-world tasks: contract extraction, official-document parsing, diacritic-aware OCR, EN/VN code-switching.
VN-Bench v0 doesn't run evaluations itself. This page aggregates public numbers from sources trusted by the Vietnamese research community — primarily the VMLU leaderboard maintained by Zalo AI and JAIST, plus results published in the VLSP track and academic papers.
Every row links back to its source. Numbers reflect data as of 2026-04-25. New models are added within a few weeks of public release.
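For concreteness, here is a minimal sketch of what one aggregated row could look like as data. The field names and the placeholder URL are illustrative assumptions, not this page's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LeaderboardRow:
    """One aggregated score. Field names are illustrative, not the page's real schema."""
    model: str
    organization: str
    track: str           # "fine-tuned" or "from-scratch"
    avg_score: float     # copied verbatim from the source leaderboard
    source_url: str      # every row must link back to its source
    snapshot_date: date  # when the number was copied

row = LeaderboardRow(
    model="QwQ-32B",
    organization="Alibaba Cloud",
    track="from-scratch",
    avg_score=76.13,
    source_url="https://example.org/vmlu-leaderboard",  # placeholder URL
    snapshot_date=date(2026, 4, 25),
)
```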
VN-Bench v1 (in development alongside the Nôm toolkit) will add applied tasks that VMLU does not cover: document extraction, legal QA, diacritic-aware OCR, code-switching. The goal is a measure that tracks the work Vietnamese AI teams actually ship.
VMLU is a multitask suite of 10,880 multiple-choice questions across 58 subjects (STEM, humanities, social sciences, general knowledge). Zero-shot evaluation. Two tracks: from-scratch and fine-tuned models.
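To make "zero-shot, multiple-choice" concrete, here is a minimal sketch of how such scoring typically works. The prompt template and the `ask_model` callable are assumptions for illustration, not VMLU's actual harness.

```python
# Minimal sketch of zero-shot multiple-choice scoring. The prompt template
# and the ask_model callable are illustrative assumptions, not the VMLU
# evaluation harness itself.
from typing import Callable

def format_prompt(question: str, choices: dict[str, str]) -> str:
    # Zero-shot: the question is shown with no worked examples.
    lines = [question] + [f"{k}. {v}" for k, v in sorted(choices.items())]
    lines.append("Answer with the letter only.")
    return "\n".join(lines)

def accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    correct = 0
    for item in items:
        pred = ask_model(format_prompt(item["question"], item["choices"]))
        # Compare the first letter of the reply against the gold label.
        correct += pred.strip()[:1].upper() == item["answer"]
    return correct / len(items)
```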
All scores below are credited to the authors of the VMLU benchmark; this page only aggregates them and links back to the source.
| Rank | Model | Organization | Base | Avg score | Track |
|---|---|---|---|---|---|
| 1 | axis-sovereign | International AXIS | — | 85.75 | fine-tuned |
| 2 | V-LLM v1 | VinSmart Future | — | 85.11 | fine-tuned |
| 3 | MISA-AI-1.0 | MISA JSC | Qwen3 | 81.26 | fine-tuned |
| 4 | Vi-Sovereign-Medium | NLP-CORE-Lab | Qwen3-32B | 80.57 | fine-tuned |
| 5 | VNPTAI.IO-Medium-R1.2 | VNPT AI | — | 79.61 | fine-tuned |
| 6 | BnK-AI-Medium-v2.1 | BnK Solution | — | 78.84 | fine-tuned |
| 7 | Cake-Mochi | BeFinancial | Qwen3-32B | 77.64 | fine-tuned |
| 8 | VNPTAI.IO-Medium-R1 | VNPT AI | — | 77.43 | fine-tuned |
| 9 | MISA-Llama3-v1.1 | MISA JSC | Llama-3 | 76.87 | fine-tuned |
| 10 | BnK-AI-Medium-v2 | BnK Solution | — | 76.66 | fine-tuned |
| Rank | Model | Organization | Base | Avg score | Track |
|---|---|---|---|---|---|
| — | QwQ-32B | Alibaba Cloud | — | 76.13 | from-scratch |
| — | Qwen2.5-72B-Instruct-AWQ | Alibaba Cloud | — | 69.17 | from-scratch |
| — | Llama-3-70B | Meta | — | 66.44 | from-scratch |
| — | KiLM-13b-v24.7.1 | Kiki AI / Zalo | — | 66.07 | from-scratch |
| — | GPT-4 | OpenAI | — | 65.53 | from-scratch |
Rank applies within a single track only. Fine-tuned models tend to score higher because they're tuned to VMLU's format; to compare base capability, look at the from-scratch track.
Source: VMLU Leaderboard (snapshot 2026-04).

VMLU leans academic-MCQ. The Vietnamese community has many other benchmarks for different task shapes. Here's the short list; pick whichever matches your workload.
- VLSP Association: annual workshop with multiple tracks (LLM, ASR, MT, semantic parsing, legal QA).
- Academic (arXiv 2404.11086): comprehensive eval suite covering general knowledge, reading comprehension, reasoning, and conversation.
- Academic (arXiv 2512.14554): Vietnamese legal reasoning (article prediction, summarization, citation).
- VLSP Association: small language models specialized for Vietnamese legal tasks.
- VLSP Association: Vietnamese multimodal legal QA on traffic-sign regulation.
VMLU measures academic knowledge; VN-Bench v1 measures real work. Below are the tasks we're curating with the community, with a sketch of their scoring rules after the list. Submissions open after the v1 release.
- Contract extraction: given a contract PDF, extract the contract number, signing date, parties, total value, and penalty clauses. Scored by F1 on field accuracy.
- Official-document parsing: extract the document number, issue date, issuing body, and key content. Scored by exact match.
- Diacritic-aware OCR: Vietnamese-language scanned document to structured JSON. Scored on character accuracy plus field accuracy.
- Diacritic-faithful generation: generation in Vietnamese, measured on diacritic accuracy across long passages.
- EN/VN code-switching: natural mixed-language dialogue, measuring whether the model understands and responds contextually.
- Legal QA: borrowed from VLegal-Bench; we don't duplicate it, we link to it so the community shares one source.
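As noted above, here is a minimal sketch of the three scoring rules named in this list (field-level F1, exact match, character accuracy), under the assumption that extracted fields arrive as flat dicts of strings. It is an illustration, not the official v1 scorer.

```python
# Illustrative scorers for the tasks above. Assumes predictions and gold
# annotations are flat field dicts / plain strings; the official VN-Bench
# v1 scorer may normalize values differently.

def exact_match(pred: dict[str, str], gold: dict[str, str]) -> float:
    # Official-document parsing: every gold field must match exactly.
    return float(all(pred.get(k) == v for k, v in gold.items()))

def field_f1(pred: dict[str, str], gold: dict[str, str]) -> float:
    # Contract extraction: F1 over (field, value) pairs.
    tp = len(set(pred.items()) & set(gold.items()))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def char_accuracy(pred: str, gold: str) -> float:
    # Diacritic-aware OCR: 1 - CER, with CER = Levenshtein distance
    # over characters divided by the gold length.
    dp = list(range(len(gold) + 1))
    for i, p in enumerate(pred, start=1):
        prev, dp[0] = dp[0], i
        for j, g in enumerate(gold, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (p != g))
            prev = cur
    return max(0.0, 1 - dp[-1] / max(len(gold), 1))

# Toy values for illustration only.
gold = {"contract_number": "01/2026/HDMB", "total_value": "1200000000 VND"}
pred = {"contract_number": "01/2026/HDMB", "total_value": "1200000000 VND"}
assert field_f1(pred, gold) == 1.0 and exact_match(pred, gold) == 1.0
assert char_accuracy("Việt Nam", "Việt Nam") == 1.0
```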
When VN-Bench v1 launches, we'll open submissions. In the meantime, register for updates, propose tasks, or contribute eval data.
This page does not run evaluations itself. All numbers come from the public sources below. Authors of the underlying benchmarks retain full credit for the model scores. If you are an author and would like a correction or removal, please contact [email protected].