Botok

State-of-the-art tokenizers for the Tibetan language.

This is the documentation for the botok repository.

Features

  • Supports various dialects.

  • Fully customizable word lists and adjustment rules.

  • Allows adjusting the word list and rules through the Adjustments component of the Dialect Pack.

Contents

The following sections cover getting started, the architecture, custom dialect packs, and configuration.

Getting Started with Botok

Installation

Caution

botok only supports Python 3.

Install pre-built botok with pip:

$ pip install botok

Install from the latest master branch of botok with pip:

$ pip install git+https://github.com/Esukhia/botok.git

For developers, build botok from source:

$ git clone https://github.com/Esukhia/botok.git
$ cd botok
$ python3 -m venv .env
$ source .env/bin/activate
$ python setup.py clean sdist
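
Whichever method you choose, a quick way to verify the installation is to check that the package imports cleanly:

$ python3 -c "import botok"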

Usage

Here is a simple example of using botok to tokenize a sentence.

Import the botok tokenizer called WordTokenizer:

>>> from botok import WordTokenizer
>>>
>>> tokenizer = WordTokenizer()
Building Trie... (12 s.)

Tokenize the given text:

>>> input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ།མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'
>>> tokens = tokenizer.tokenize(input_str)
>>> print(f'The output is a {type(tokens)}')
The output is a <class 'list'>
>>> print(f'The constituting elements are {type(tokens[0])}')
The constituting elements are <class 'botok.token.Token'>

Now ‘tokens’ is an iterable in which each token carries its metadata as attributes of the Token object:

>>> tokens[0]
content: "༆ "
char_types: |punct|space|
type: punct
start: 0
len: 2
tag: punct
pos: punc
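
Since the metadata is exposed as plain attributes, you can iterate over the tokens and read them directly; a minimal sketch using the attribute names shown in the dump above:

>>> for token in tokens[:3]:
...     # 'content' holds the raw slice of the input, 'pos' the part of speech
...     print(repr(token.content), token.pos)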

Custom dialect pack

To use a custom dialect pack:

  1. Prepare your dialect pack with the same folder structure as the general dialect pack (https://github.com/Esukhia/botok-data/tree/master/dialect_packs/general).

  2. Instantiate a Config object, passing the dialect name and base path.

  3. Instantiate your tokenizer with that config object.

  4. The tokenizer will then use your custom dialect pack; after the first run, it rebuilds the custom trie from a pickled trie file instead of re-reading the source data.

from pathlib import Path

from botok import WordTokenizer
from botok.config import Config

def get_tokens(wt, text):
    # Keep affixed particles attached to their host word.
    tokens = wt.tokenize(text, split_affixes=False)
    return tokens

if __name__ == "__main__":
    # The dialect pack is expected at base_path/dialect_name.
    config = Config(dialect_name="custom", base_path=Path.home())
    wt = WordTokenizer(config=config)
    text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
    tokens = get_tokens(wt, text)
    for token in tokens:
        print(token)
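
For reference, a minimal sketch of the expected dialect pack layout, based on the two components described in the Configuration section below (the exact file contents should mirror the general pack linked above):

custom/
    dictionary/      # data used to build the trie (word lists, rules)
    adjustments/     # data used to adjust the segmentation rules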

Acknowledgement

botok is an open source library for Tibetan NLP.

We are always open to cooperation on new features, tool integrations, and testing solutions.

Many thanks to the companies and organizations who have supported botok’s development, especially:

Architecture

WordTokenizer architecture

The following diagram shows the architecture of the WordTokenizer class.

[Architecture diagram: _images/botok_architecture.svg]

Tokenization workflow

Here is the botok tokenization workflow, illustrated with an example.

>>> input_string = "ཀུན་་་དགའི་དོན་གྲུབ།"
>>> from botok import BoSyl, Config, TokChunks, Tokenize, Trie
>>> config = Config()
>>> trie = Trie(BoSyl, profile=config.profile, main_data=config.dictionary, custom_data=config.adjustments)
>>> tok = Tokenize(trie)
>>> preproc = TokChunks(input_string)
>>> preproc.serve_syls_to_trie()
>>> tokens = tok.tokenize(preproc)
>>>
>>> print(*tokens, sep=f"{'='*65}\n\n")
text: "ཀུན་་་དགའི་"
text_cleaned: "ཀུན་དགའི་"
text_unaffixed: "ཀུན་དགའ་"
syls: ["ཀུན", "དགའི"]
senses: | pos: PROPN, freq: 2923, affixed: True |
char_types: |CONS|VOW|CONS|TSEK|TSEK|TSEK|CONS|CONS|CONS|VOW|TSEK|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [6, 7, 8, 9]]
syls_start_end: [{'start': 0, 'end': 6}, {'start': 6, 'end': 11}]
start: 0
len: 11

=================================================================

text: "དོན་གྲུབ"
text_cleaned: "དོན་གྲུབ་"
text_unaffixed: "དོན་གྲུབ་"
syls: ["དོན", "གྲུབ"]
senses: | pos: PROPN, freq: 1316, affixed: False |
char_types: |CONS|VOW|CONS|TSEK|CONS|SUB_CONS|VOW|CONS|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [4, 5, 6, 7]]
syls_start_end: [{'start': 0, 'end': 4}, {'start': 4, 'end': 8}]
start: 11
len: 8

=================================================================

text: "།"
char_types: |NORMAL_PUNCT|
chunk_type: PUNCT
start: 19
len: 1
>>>
>>> from botok import AdjustTokens
>>>
>>> adjust_tok = AdjustTokens(main=config.dictionary["rules"], custom=config.adjustments["rules"])
>>> adjusted_tokens = adjust_tok.adjust(tokens)
>>> print(*adjusted_tokens, sep=f"{'='*65}\n\n")
text: "ཀུན་་་དགའི་"
text_cleaned: "ཀུན་དགའི་"
text_unaffixed: "ཀུན་དགའ་"
syls: ["ཀུན", "དགའི"]
senses: | pos: PROPN, freq: 2923, affixed: True |
char_types: |CONS|VOW|CONS|TSEK|TSEK|TSEK|CONS|CONS|CONS|VOW|TSEK|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [6, 7, 8, 9]]
syls_start_end: [{'start': 0, 'end': 6}, {'start': 6, 'end': 11}]
start: 0
len: 11

=================================================================

text: "དོན་གྲུབ"
text_cleaned: "དོན་གྲུབ་"
text_unaffixed: "དོན་གྲུབ་"
syls: ["དོན", "གྲུབ"]
senses: | pos: PROPN, freq: 1316, affixed: False |
char_types: |CONS|VOW|CONS|TSEK|CONS|SUB_CONS|VOW|CONS|
chunk_type: TEXT
syls_idx: [[0, 1, 2], [4, 5, 6, 7]]
syls_start_end: [{'start': 0, 'end': 4}, {'start': 4, 'end': 8}]
start: 11
len: 8

=================================================================

text: "།"
char_types: |NORMAL_PUNCT|
chunk_type: PUNCT
start: 19
len: 1

Custom Dialect Pack

Why Custom Dialect Pack

  • For domain specific tokenization

  • Improving tokenization accuracy

Example

To use a custom dialect pack for tokenization, create a botok.Config object with the path to the custom dialect pack and use that config when creating the word tokenizer.

First, create a config for the custom dialect pack.

>>> from botok import Config
>>> config = Config.from_path('custom/dialect/pack/path')

Then, create the word tokenizer with that config.

>>> from botok import WordTokenizer
>>> wt = WordTokenizer(config=config)
>>> wt.tokenize("མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་")
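
The call returns a list of Token objects. For example, to look at just the segmented text of each token (the text attribute is shown in the Tokenization workflow section above):

>>> tokens = wt.tokenize("མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་")
>>> print([token.text for token in tokens])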

Configuration

Config

class botok.Config(dialect_name=None, base_path=None)[source]

botok config for Tibetan dialect pack.

Each dialect pack has two components:

  1. Dictionary: contains all the data required to construct the trie. It should be in a directory called dictionary inside the dialect pack directory.

  2. Adjustments: contains all the data required to adjust the text segmentation rules.
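
As a quick sketch of the two construction styles used in this documentation (the base_path/dialect_name layout is an assumption drawn from the usage example above):

from pathlib import Path
from botok import Config

config = Config()  # defaults to the bo_general_pack
config = Config(dialect_name="custom", base_path=Path.home())  # custom pack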

add_dialect_pack(path)[source]

Merges the dialect pack at the given path into the current dialect pack.
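
For instance, to merge a second pack on top of the current one (a sketch; the path here is hypothetical):

from pathlib import Path
from botok import Config

config = Config()
config.add_dialect_pack(Path('path/to/extra_pack'))  # hypothetical path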

classmethod from_path(dialect_pack_path)[source]

Creates config from dialect_pack_path.

Returns

Config: an instance of the Config class.

Examples:

config = Config.from_path(path_to_dialect_pack)
assert config.dictionary
assert config.adjustments
property profile

Returns the profile name of the dialect_pack.

reset(dialect_pack_path=None)[source]

Resets the config to the default bo_general_pack.
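
For instance, to discard a custom pack and return to the defaults (a sketch):

config = Config.from_path('custom/dialect/pack/path')
config.reset()  # back to the default bo_general_pack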