Getting Started with Botok
Installation
Caution
Botok only supports Python 3.
Install the pre-built botok package with pip:
$ pip install botok
Install from the latest master branch of botok with pip:
$ pip install git+https://github.com/Esukhia/botok.git
To install for development, build botok from source:
$ git clone https://github.com/Esukhia/botok.git
$ cd botok
$ python3 -m venv .env
$ source .env/bin/activate
$ python setup.py clean sdist
Usage
Here is a simple example of using botok to tokenize a sentence.
Import the botok tokenizer called WordTokenizer:
>>> from botok import WordTokenizer
>>>
>>> tokenizer = WordTokenizer()
Building Trie... (12 s.)
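The trie being built here is botok's dictionary index; during tokenization, word lookup walks this trie to find the longest dictionary match at each position. As a rough illustration only (a simplified sketch, not botok's actual implementation), a longest-match tokenizer over a nested-dict trie might look like this:

```python
def build_trie(words):
    """Nest dicts character by character; "_end_" marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_end_"] = True
    return root


def longest_match_tokenize(trie, text):
    """Greedily emit the longest dictionary word starting at each position;
    characters that begin no dictionary word become single-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end_" in node:
                last = j  # remember the longest complete word so far
        if last is None:
            tokens.append(text[i])  # no dictionary word starts here
            i += 1
        else:
            tokens.append(text[i:last])
            i = last
    return tokens


# Hypothetical two-word dictionary for illustration.
trie = build_trie(["བཀྲ་ཤིས་", "བདེ་ལེགས་"])
print(longest_match_tokenize(trie, "བཀྲ་ཤིས་བདེ་ལེགས་"))
# → ['བཀྲ་ཤིས་', 'བདེ་ལེགས་']
```

botok additionally handles syllable segmentation, affixes, and adjustment rules on top of the trie lookup, which is why building the full trie takes some seconds.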
Tokenize the given text:
>>> input_str = '༆ ཤི་བཀྲ་ཤིས་ tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ།མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'
>>> tokens = tokenizer.tokenize(input_str)
>>> print(f'The output is a {type(tokens)}')
The output is a <class 'list'>
>>> print(f'The constituting elements are {type(tokens[0])}')
The constituting elements are <class 'botok.tokenizers.token.Token'>
Now ‘tokens’ is an iterable in which each token carries several pieces of metadata as attributes of a Token object:
>>> tokens[0]
content: "༆ "
char_types: |punct|space|
type: punct
start: 0
len: 2
tag: punct
pos: punc
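These attributes are what you work with downstream, e.g. to tally parts of speech or to recover each token's span in the input via start and len. A minimal sketch, using a hypothetical stand-in dataclass with the attribute names shown above (real tokens come from tokenizer.tokenize(), and the sample data here is invented):

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in for botok's Token with a subset of the attributes shown above;
# in practice these objects come from tokenizer.tokenize().
@dataclass
class Token:
    content: str
    pos: str
    start: int  # offset of the token in the input string
    len: int    # length of the token in characters


input_str = "༆ ཤི་བཀྲ་ཤིས་"
tokens = [
    Token("༆ ", "punc", 0, 2),
    Token("ཤི་", "VERB", 2, 3),
    Token("བཀྲ་ཤིས་", "NOUN", 5, 8),
]

# Tally tokens per part-of-speech tag.
pos_counts = Counter(t.pos for t in tokens)
print(pos_counts)

# start/len recover the token's surface form from the input.
t = tokens[2]
assert input_str[t.start : t.start + t.len] == t.content
```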
Custom dialect pack:
In order to use a custom dialect pack:
Prepare your dialect pack with the same folder structure as the [general dialect pack](https://github.com/Esukhia/botok-data/tree/master/dialect_packs/general).
Instantiate a config object, passing it the dialect name and path.
Instantiate your tokenizer object using that config object.
Your tokenizer will use your custom dialect pack, and on later runs it will load the pickled trie file instead of rebuilding the custom trie from scratch.
from pathlib import Path

from botok import WordTokenizer
from botok.config import Config


def get_tokens(wt, text):
    tokens = wt.tokenize(text, split_affixes=False)
    return tokens


if __name__ == "__main__":
    config = Config(dialect_name="custom", base_path=Path.home())
    wt = WordTokenizer(config=config)
    text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
    tokens = get_tokens(wt, text)
    for token in tokens:
        print(token)