🔠 I wrote up some of what I've learned about tokenisation (with examples using Balochi). It's a high-level overview of why we tokenise text, what options are available to us, and what tradeoffs we accept by choosing one option over another.
https://mlops.systems/posts/2023-06-01-why-tokenisation.html #balochi #languagemodels #nlp
I wrote about the first steps in my Balochi language modelling project. Training a custom tokeniser is my initial short-term goal, but to do that I first needed to put together a small dataset to work with. I detail some of the things I did to that end, along with a list of resources I'm maintaining as I continue on this journey.
https://mlops.systems/posts/2023-05-29-balochi-language-dataset.html #balochi #nlp #lowresource
Taking the next few months to work on language modelling techniques for low-resource languages. I'll be working with #balochi as spoken in southeastern #iran for which there aren't many datasets or resources available (at first glance).
RT @MarianoGiustino
Update on #Iran:
- the revolution (#Rivoluzione) has reached a point of no return
- it is spreading even to the most conservative areas, which have historically supported the regime
- the most sacred symbols of the Islamic Republic have been attacked
- in the Kurdish and #balochi areas it is hell
@RadioRadicale #Turchia
#turchia #Balochi #rivoluzione #iran