
Burmese Pyu Character Recognition Dataset


Overview

This dataset is a collection of images for character recognition and research on the Pyu alphabet, an ancient script of Myanmar.

Brief Historical Background
The Pyu were among the earliest major ethnic groups to inhabit Myanmar, having settled in the region by the early centuries AD. Pyu culture is central to the study of the roots of modern Myanmar's cultural and linguistic heritage.

Historical evidence suggests that the modern-day Myanmar people trace their main ancestral roots to the Pyu, and that the Myanmar language and script in use today evolved gradually from the ancient Pyu language.

In the Pyu period there was no paper as we know it today; texts were carved into stone pillars and terracotta plaques. As a result, Pyu letterforms are not strictly uniform the way modern printed fonts are, and character forms, writing styles, and literary usage tend to vary slightly by region.

References

The primary reference for the Pyu characters in this dataset is the following book:

Pyu Reader: A History of Pyu Alphabet (ပျူ ဖတ်စာ ပျူ-အက္ခရာသမိုင်း), 1963
Author: Thiripyanchi U Tha Myat M. A. (Rgn), B. Ag: (Bom.)
(Retired Director of Agriculture, Southern Circle; Retired Director of the Cultural Institute)

The characters in this dataset are taken from the "Pyu Consonant Table" on page 1 of that book.

Dataset Details

Classes: 33 Pyu consonant characters, from 'က' (Ka) to 'အ' (Ah). Vowels are not included.
Image format: 224x224 pixels, white characters on a black background.
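
The white-on-black polarity matters when a model or preprocessing step expects dark strokes on a light background. The sketch below checks and flips polarity. It assumes 8-bit grayscale values passed as a flat sequence (for example, obtained by converting an image to grayscale and flattening it); the function names and the threshold of 128 are illustrative choices, not part of the dataset.

```python
from statistics import mean
from typing import List, Sequence


def is_white_on_black(pixels: Sequence[int], threshold: float = 128.0) -> bool:
    """Return True when the image is mostly dark (light strokes, dark background).

    `pixels` is a flat sequence of 8-bit grayscale values (0 = black, 255 = white).
    The default threshold of 128 is an illustrative assumption.
    """
    return mean(pixels) < threshold


def invert(pixels: Sequence[int]) -> List[int]:
    """Flip polarity so that white-on-black becomes black-on-white."""
    return [255 - p for p in pixels]


# Toy 4x4 "image": black background with a short white stroke.
toy = [0] * 16
toy[5] = toy[6] = 255

print(is_white_on_black(toy))          # True
print(is_white_on_black(invert(toy)))  # False
```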

Data Collection and Augmentation

Original images: Each of the 33 Pyu characters was handwritten on paper 10 times and then digitized. These images are labeled original.
Augmentation: To give models more varied training examples, 5 additional variations (aug_1 to aug_5) were generated from each original image.

The augmentation pipeline uses five types of transformation:
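
Because every aug_* image is derived from one original, a random row-level train/test split can put an original in the test set while its augmentations sit in the training set. A minimal sketch of a group-aware split follows. It assumes that each original row is immediately followed by its own aug_* rows, which is an assumption about row order rather than something stated here. The function names and the `source` field in the toy data are hypothetical.

```python
import random
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def group_by_original(rows: List[Row]) -> List[List[Row]]:
    """Start a new group at every row whose note is 'original'.

    Assumes each original is directly followed by its aug_* rows.
    """
    groups: List[List[Row]] = []
    for row in rows:
        if row["note"] == "original" or not groups:
            groups.append([])
        groups[-1].append(row)
    return groups


def split_groups(
    rows: List[Row], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Split whole groups so an original and its augmentations stay together."""
    groups = group_by_original(rows)
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test = [row for group in groups[:n_test] for row in group]
    train = [row for group in groups[n_test:] for row in group]
    return train, test


# Toy data: 2 classes x 5 originals x (1 original + 5 augmentations) = 60 rows.
notes = ["original"] + [f"aug_{i}" for i in range(1, 6)]
rows = [
    {"label_1": label, "note": note, "source": f"{label}-{k}"}  # `source` is for checking only
    for label in ("a", "b")
    for k in range(5)
    for note in notes
]

train_rows, test_rows = split_groups(rows, test_fraction=0.2, seed=0)
print(len(train_rows), len(test_rows))  # 48 12
```

This sketch does not stratify by class, and it keeps augmented rows in the test split; filtering the test split to rows with note equal to original is a reasonable variation.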

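import imgaug.augmenters as iaa

# Each step is applied with probability 0.5, in the order listed.
seq = iaa.Sequential([
    iaa.Sometimes(0.5, iaa.Affine(rotate=(-10, 10))),  # rotate by a random angle in [-10, 10] degrees
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0, 1.0))),  # blur with sigma sampled from [0, 1.0]
    iaa.Sometimes(0.5, iaa.Multiply((0.8, 1.2))),  # multiply pixel values by a factor in [0.8, 1.2]
    iaa.Sometimes(0.5, iaa.AdditiveGaussianNoise(scale=0.05*255)),  # add Gaussian noise with std 0.05 * 255
    iaa.Sometimes(0.5, iaa.Affine(scale={"x": (0.9, 1.1), "y": (0.9, 1.1)})),  # scale x and y by factors in [0.9, 1.1]
])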
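
Since each of the five steps is wrapped in iaa.Sometimes(0.5, ...), the number of transformations actually applied to one image varies from 0 to 5. The sketch below works out that distribution, assuming the five draws are independent; the constant and function names are illustrative.

```python
from math import comb

N_STEPS = 5    # augmenters in the pipeline above
P_APPLY = 0.5  # probability passed to each iaa.Sometimes


def prob_k_applied(k: int, n: int = N_STEPS, p: float = P_APPLY) -> float:
    """Binomial probability that exactly k of the n steps are applied."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


for k in range(N_STEPS + 1):
    print(k, prob_k_applied(k))
# 0 0.03125
# 1 0.15625
# 2 0.3125
# 3 0.3125
# 4 0.15625
# 5 0.03125
```

Under that assumption, about 1 in 32 augmented images passes through with no transformation at all, so a few aug_* images may be pixel-identical to their original.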

Dataset Structure

The dataset consists of the following 4 columns:

Column Name	Description	Example
`image`	Photograph of the Pyu character (224x224 pixels)	`[Image Object]`
`label_1`	Romanized/English letter representation	`a`
`label_2`	Modern Myanmar letter representation	`အ`
`note`	Image type (original or augmented version)	`original`, `aug_1`, `aug_2` ...

Example row:

image	label_1	label_2	note
[image of the character "အ"]	a	အ	original
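
The columns above can be turned into training targets with a small helper. The following is a sketch, not an official loader: `load_rows` and `build_label_maps` are hypothetical names, the split name "train" is an assumption, and the romanization "ka" in the toy rows is made up for illustration. Loading requires the `datasets` package and network access.

```python
from typing import Any, Dict, Iterable, Tuple


def load_rows(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    The split name "train" is an assumption; check the dataset page.
    """
    from datasets import load_dataset  # lazy import: the helper below works without it

    return load_dataset("kalixlouiis/burmese-pyu-character-recognition", split=split)


def build_label_maps(
    rows: Iterable[Dict[str, Any]],
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Map each label_1 value to an integer class id and to its label_2 letter.

    Ids are assigned in order of first appearance.
    """
    label_to_id: Dict[str, int] = {}
    label_to_letter: Dict[str, str] = {}
    for row in rows:
        label = row["label_1"]
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
            label_to_letter[label] = row["label_2"]
    return label_to_id, label_to_letter


# Toy rows shaped like the table above (image column omitted).
toy_rows = [
    {"label_1": "a", "label_2": "အ", "note": "original"},
    {"label_1": "a", "label_2": "အ", "note": "aug_1"},
    {"label_1": "ka", "label_2": "က", "note": "original"},  # "ka" is a made-up romanization
]

label_to_id, label_to_letter = build_label_maps(toy_rows)
print(label_to_id)      # {'a': 0, 'ka': 1}
print(label_to_letter)  # {'a': 'အ', 'ka': 'က'}
```

With the real data, `build_label_maps(load_rows())` should yield 33 entries if the class count stated above holds. Sorting the labels before assigning ids would make the mapping independent of row order.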

Applicability and License

🎯 Use Cases

Educational studies
Historical and linguistic research
Building Optical Character Recognition (OCR) models
Technology projects for the preservation of Myanmar literature and culture

📄 License

This dataset is published under the CC-BY-4.0 (Creative Commons Attribution 4.0 International) license. Anyone may use it freely, provided appropriate credit is given to the creator.

📝 Citation

If you use this dataset in a research paper or project, please cite it as follows:

APA Format:

Khant Sint Heinn. (2026). Burmese Pyu Character Recognition Dataset [Dataset]. Hugging Face. https://huggingface.co/datasets/kalixlouiis/burmese-pyu-character-recognition

BibTeX:

@dataset{kalixlouiis_pyu_character_recognition_2026,
  author       = {Khant Sint Heinn},
  title        = {Burmese Pyu Character Recognition Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/kalixlouiis/burmese-pyu-character-recognition}},
}

