Saturday, May 12, 2012

Suffixes and Bigrams in Turkish

As I noted in my previous post, Turkish is an agglutinative language and makes heavy use of suffixes. Because suffixes are so frequent, the most common 2-grams (bigrams) tend to correspond either to popular suffixes or to the end of one suffix and the beginning of the next.
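For readers who want to reproduce this kind of count, here is a minimal Python sketch of character bigram frequencies. It rests on assumptions that the post does not confirm: bigrams are counted only inside words (never across spaces or punctuation), text is lowercased with the Turkish I/ı and İ/i pairs handled explicitly, and each frequency is a share of all bigrams counted. The function name `bigram_frequencies` and the sample sentence are made up for illustration.

```python
from collections import Counter

def bigram_frequencies(text):
    """Return {bigram: share of all within-word bigrams} for the given text."""
    # Turkish-aware lowercasing: map capital I and İ explicitly before lower(),
    # because the default lower() would turn "I" into dotted "i".
    text = text.replace("I", "ı").replace("İ", "i").lower()
    counts = Counter()
    word = []
    # The trailing space flushes the last word.
    for ch in text + " ":
        if ch.isalpha():
            word.append(ch)
        else:
            # Count adjacent letter pairs of the word that just ended.
            counts.update(a + b for a, b in zip(word, word[1:]))
            word = []
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}

# Hypothetical sample text, only to show the call.
sample = "Kitaplar masada. Evler ve arabalar."
freqs = bigram_frequencies(sample)
for bg, share in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{bg}  {share:.3%}")
```

On a real corpus the result also depends on whether every occurrence of a word is counted or only each distinct word once; this sketch counts every occurrence.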

Here are the 25 most frequent 2-grams and the suffixes they correspond to.

2-gram Frequency Suffix
ar     2.196%    -lar -ları -(ş)ar -arak
la     2.006%    -lar -ları -(y)la -laş
an     1.933%    -dan -(y)an
er     1.888%    -ler -leri -(ş)er -erek
in     1.734%    -(n)in -(i) -(i)niz -sin -siniz -cesine -(i)nci ...
le     1.727%    -ler -leri -(y)le -leş
de     1.539%    -de -den
en     1.350%    -den -(y)ken -(y)en
ın     1.336%    -(n)ın -(ı) -(ı)nız -sın -sınız -casına -(ı)ncı ...
da     1.304%    -da -dan
ya     1.188%    -(y)a -(y)arak -(y)acak -(y)ama -(y)ası -(y)an
ir     1.179%    -dir -ttir
ma     1.174%    -ma -(y)ama -malı
bi     1.107%    
il     1.074%    
ka     1.021%    
ra     0.974%    -arak
ri     0.951%    
ak     0.949%    -mak -arak -(y)acak
nd     0.949%    
al     0.938%    -sal -malı
li     0.899%    -li -lik
di     0.860%    -dir -(y)di -di -dik
me     0.850%    -me -(y)eme -meli
or     0.815%    -(i)yor
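
The suffix column can be cross-checked mechanically. The sketch below assumes the usual notation in which a parenthesized letter, such as the (y) in -(y)acak, is an optional buffer letter, so each suffix expands to a form with and without it. The helper names `expand` and `suffixes_containing` are hypothetical. The check only looks inside a single suffix, so it cannot detect a 2-gram that spans the end of one suffix and the beginning of the next.

```python
import re
from itertools import product

def expand(suffix):
    """Expand '-(y)acak' into its surface forms: {'acak', 'yacak'}."""
    # Split into plain pieces and parenthesized optional pieces.
    parts = re.split(r"(\([^)]*\))", suffix.lstrip("-"))
    # An optional piece contributes either nothing or its inner letters.
    options = [("", p[1:-1]) if p.startswith("(") else (p,) for p in parts]
    return {"".join(combo) for combo in product(*options)}

def suffixes_containing(bigram, suffixes):
    """Return the suffixes that contain the bigram in at least one form."""
    return [s for s in suffixes if any(bigram in form for form in expand(s))]

candidates = ["-(y)a", "-arak", "-(y)acak", "-lar", "-(i)yor"]
print(suffixes_containing("ya", candidates))  # ['-(y)a', '-(y)acak']
print(suffixes_containing("ar", candidates))  # ['-arak', '-lar']
```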
