Botok¶
State-of-the-art tokenizers for the Tibetan language.
This is the documentation for the botok repository.
Features¶
Supports various dialects.
Fully customizable: word lists and adjustment rules can be tuned via the Adjustments component of the Dialect Pack.
Contents¶
These are the contents of the botok documentation.
Getting Started with Botok¶
Installation¶
Caution
botok only supports Python 3.
Install pre-built botok with pip:
$ pip install botok
Install from the latest master branch of botok with pip:
$ pip install git+https://github.com/Esukhia/botok.git
Install for development, building botok from source:
$ git clone https://github.com/Esukhia/botok.git
$ cd botok
$ python3 -m venv .env
$ source .env/bin/activate
$ python setup.py clean sdist
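Whichever install route you choose, a quick smoke test confirms that the package is importable from the active environment:

>>> import botok
>>> from botok import WordTokenizer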
Usage¶
Here is a simple example of using botok to tokenize a sentence.
Import the botok tokenizer called WordTokenizer:
>>> from botok import WordTokenizer
>>>
>>> tokenizer = WordTokenizer()
Building Trie... (12 s.)
Tokenize the given text:
>>> input_str = '༆ ཤི་བཀྲ་ཤིས་ tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ།མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'
>>> tokens = tokenizer.tokenize(input_str)
>>> print(f'The output is a {type(tokens)}')
The output is a <class 'list'>
>>> print(f'The constituting elements are {type(tokens[0])}')
The constituting elements are <class 'botok.token.Token'>
Now tokens is an iterable in which each token carries its metadata as attributes of the Token object:
>>> tokens[0]
content: "༆ "
char_types: |punct|space|
type: punct
start: 0
len: 2
tag: punct
pos: punc
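Each of these is available directly as an attribute on the Token object, so the token stream is easy to post-process. A minimal sketch, assuming the text and pos attribute names of current botok releases (older output, as above, labels the text field content):

>>> words = [t.text for t in tokens]            # surface forms
>>> tagged = [(t.text, t.pos) for t in tokens]  # (form, part-of-speech) pairs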
Custom dialect pack:
In order to use a custom dialect pack:
Prepare your dialect pack with the same folder structure as the general dialect pack (https://github.com/Esukhia/botok-data/tree/master/dialect_packs/general); see the layout sketch below.
Then instantiate a Config object, passing the dialect name and path.
Instantiate your tokenizer object using that Config object.
Your tokenizer will use the custom dialect pack, and on subsequent runs it will reuse the pickled trie file to rebuild the custom trie quickly.
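A sketch of the expected layout; the top-level dictionary and adjustments directories match the Config documentation below, while the subfolders shown inside them are illustrative:

custom/
├── dictionary/
│   ├── words/
│   └── rules/
└── adjustments/
    ├── words/
    └── rules/

With the pack in place, the tokenizer can be instantiated as in the following script: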
from pathlib import Path

from botok import WordTokenizer
from botok.config import Config


def get_tokens(wt, text):
    tokens = wt.tokenize(text, split_affixes=False)
    return tokens


if __name__ == "__main__":
    config = Config(dialect_name="custom", base_path=Path.home())
    wt = WordTokenizer(config=config)
    text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
    tokens = get_tokens(wt, text)
    for token in tokens:
        print(token)
Acknowledgement¶
botok is an open-source library for Tibetan NLP.
We are always open to cooperation on new features, tool integrations, and testing solutions.
Many thanks to the companies and organizations who have supported botok’s development, especially:
Khyentse Foundation for contributing USD 22,000 to kickstart the project
The Barom/Esukhia canon project for sponsoring training data curation
BDRC for contributing two staff members for six months of data curation
Architecture¶
WordTokenizer architecture¶
Below is the architecture diagram of the WordTokenizer class:
[figure: WordTokenizer architecture diagram]
Tokenization workflow¶
Here is the botok tokenization workflow, with an example.
>>> input_string = "ཀུན་་་དགའི་དོན་གྲུབ།"
>>> from botok import BoSyl, Config, TokChunks, Tokenize, Trie
>>> config = Config()
>>> trie = Trie(BoSyl, profile=config.profile, main_data=config.dictionary, custom_data=config.adjustments)
>>> tok = Tokenize(trie)
>>> preproc = TokChunks(input_string)
>>> preproc.serve_syls_to_trie()
>>> tokens = tok.tokenize(preproc)
>>>
>>> print(*tokens, sep=f"{'='*65}\n\n")
text: "ཀུན་་་དགའི་"
text_cleaned: "ཀུན་དགའི་"
text_unaffixed: "ཀུན་དགའ་"
syls: ["ཀུན", "དགའི"]
senses: | pos: PROPN, freq: 2923, affixed: True |
char_types: |CONS|VOW|CONS|TSEK|TSEK|TSEK|CONS|CONS|CONS|VOW|TSEK|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [6, 7, 8, 9]]
syls_start_end: [{'start': 0, 'end': 6}, {'start': 6, 'end': 11}]
start: 0
len: 11
=================================================================
text: "དོན་གྲུབ"
text_cleaned: "དོན་གྲུབ་"
text_unaffixed: "དོན་གྲུབ་"
syls: ["དོན", "གྲུབ"]
senses: | pos: PROPN, freq: 1316, affixed: False |
char_types: |CONS|VOW|CONS|TSEK|CONS|SUB_CONS|VOW|CONS|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [4, 5, 6, 7]]
syls_start_end: [{'start': 0, 'end': 4}, {'start': 4, 'end': 8}]
start: 11
len: 8
=================================================================
text: "།"
char_types: |NORMAL_PUNCT|
chunk_type: PUNCT
start: 19
len: 1
>>>
>>> from botok import AdjustTokens
>>>
>>> adjust_tok = AdjustTokens(main=config.dictionary["rules"], custom=config.adjustments["rules"])
>>> adjusted_tokens = adjust_tok.adjust(tokens)
>>> print(*adjusted_tokens, sep=f"{'='*65}\n\n")
text: "ཀུན་་་དགའི་"
text_cleaned: "ཀུན་དགའི་"
text_unaffixed: "ཀུན་དགའ་"
syls: ["ཀུན", "དགའི"]
senses: | pos: PROPN, freq: 2923, affixed: True |
char_types: |CONS|VOW|CONS|TSEK|TSEK|TSEK|CONS|CONS|CONS|VOW|TSEK|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [6, 7, 8, 9]]
syls_start_end: [{'start': 0, 'end': 6}, {'start': 6, 'end': 11}]
start: 0
len: 11
=================================================================
text: "དོན་གྲུབ"
text_cleaned: "དོན་གྲུབ་"
text_unaffixed: "དོན་གྲུབ་"
syls: ["དོན", "གྲུབ"]
senses: | pos: PROPN, freq: 1316, affixed: False |
char_types: |CONS|VOW|CONS|TSEK|CONS|SUB_CONS|VOW|CONS|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [4, 5, 6, 7]]
syls_start_end: [{'start': 0, 'end': 4}, {'start': 4, 'end': 8}]
start: 11
len: 8
=================================================================
text: "།"
char_types: |NORMAL_PUNCT|
chunk_type: PUNCT
start: 19
len: 1
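In this example the adjustment rules leave the segmentation unchanged, which is why the two printouts are identical. The whole workflow can also be wrapped into a single helper; the following sketch simply composes the classes used above (the function name tokenize_with_adjustments is ours):

from botok import AdjustTokens, BoSyl, Config, TokChunks, Tokenize, Trie

def tokenize_with_adjustments(text, config=None):
    """Full botok pipeline: chunk the input, match against the trie, apply rules."""
    config = config or Config()
    trie = Trie(BoSyl, profile=config.profile,
                main_data=config.dictionary, custom_data=config.adjustments)
    preproc = TokChunks(text)      # pre-processing: chunking and syllable extraction
    preproc.serve_syls_to_trie()   # feed the syllables to the trie matcher
    tokens = Tokenize(trie).tokenize(preproc)
    adjuster = AdjustTokens(main=config.dictionary["rules"],
                            custom=config.adjustments["rules"])
    return adjuster.adjust(tokens)

adjusted_tokens = tokenize_with_adjustments("ཀུན་་་དགའི་དོན་གྲུབ།")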
Custom Dialect Pack¶
Why Custom Dialect Pack¶
For domain-specific tokenization
For improving tokenization accuracy
Example¶
To use a custom dialect pack for tokenization, all we have to do is create a botok.Config object with the path to the custom dialect pack and use this config when creating the word tokenizer.
First, create a config for the custom dialect pack.
>>> from botok import Config
>>> config = Config.from_path('custom/dialect/pack/path')
Then, create a word tokenizer with that same config.
>>> from botok import WordTokenizer
>>> wt = WordTokenizer(config=config)
>>> wt.tokenize("མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་")
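The return value is the same list of Token objects as with the default setup, so the results can be post-processed in the usual way; for instance, pulling out the surface forms (again assuming the text attribute):

>>> tokens = wt.tokenize("མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་")
>>> [t.text for t in tokens]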
Configuration¶
Config¶

class botok.Config(dialect_name=None, base_path=None)[source]¶

    botok config for a Tibetan dialect pack.

    Each dialect pack has two components:

    Dictionary: contains all the data required to construct the Trie. It should be in the directory called dictionary inside the dialect pack directory.

    Adjustment: contains all the data required to adjust the text segmentation rules.

    classmethod from_path(dialect_pack_path)[source]¶

        Creates a config from dialect_pack_path.

        Returns:
            Config: an instance of a Configuration object

        Examples:

        config = Config.from_path(path_to_dialect_pack)
        assert config.dictionary == True
        assert config.adjustments == True

    property profile¶

        Returns the profile name of the dialect_pack.
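Putting the pieces together, a Config can be built and inspected before it is handed to the tokenizer; a minimal sketch using only the members documented above:

from botok import Config, WordTokenizer

config = Config()          # default dialect pack
print(config.profile)      # profile name of the dialect pack
assert config.dictionary   # data for building the Trie
assert config.adjustments  # data for adjusting segmentation rules

wt = WordTokenizer(config=config)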