Saturday, May 12, 2012

Bigram frequencies in Turkish



This graph shows the frequency distribution of 2-grams (bigrams) in Turkish, plotted for the 500 most frequent 2-grams. The data for the top 100 are listed below; email me for the complete list.

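Turkish is an agglutinative language that relies heavily on suffixes, so I expected the top 2-grams to correspond to the most frequently used suffixes. The data below are consistent with this: for example, ar, la, er and le, all among the top six, are the 2-grams contained in the plural suffixes -lar and -ler, and de and da match the locative suffixes -de and -da. I will continue on this subject in another post.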

Notes:
- Frequencies are computed from columnist articles published in the Hurriyet and Zaman newspapers between 2001 and 2011.
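
For readers who want to reproduce this kind of count on their own text, here is a minimal sketch of one way to do it. It is not the script behind the numbers in this post. It assumes that 2-grams are counted inside words only (never across spaces or punctuation), that text is lowercased with Turkish casing rules, and that each percentage is a 2-gram's share of all 2-grams counted. The function names turkish_lower and bigram_frequencies are made up for this example.

```python
from collections import Counter


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish casing: 'I' becomes 'ı' and 'İ' becomes 'i'."""
    # Handle the dotted/dotless i pair first; str.lower() alone would map
    # 'I' to 'i', which is wrong for Turkish.
    return text.replace("I", "ı").replace("İ", "i").lower()


def bigram_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (bigram, percentage) pairs, most frequent first.

    Assumption: bigrams are counted within words only, and the percentage
    is relative to the total number of bigrams counted.
    """
    # Turn every non-letter into a space so it acts as a word separator.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in turkish_lower(text))
    counts = Counter()
    for word in cleaned.split():
        # Slide a two-character window over the word.
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(bigram, 100.0 * n / total) for bigram, n in counts.most_common()]


# Print the three most frequent 2-grams of a tiny sample.
sample = "Arabalar ve kitaplar"
for bigram, pct in bigram_frequencies(sample)[:3]:
    print(f"{bigram} {pct:.3f}%")
```

Treating non-letters as separators means a word such as "Ali'nin" contributes the 2-grams of "ali" and "nin" separately. Whether that matches the counts in this post is unknown, so treat the choice as one reasonable convention.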

Data
Top 100 2-gram frequencies in Turkish (ranked from most to least frequent; read down each column, then left to right)
ar 2.196%    ni 0.764%    lı 0.564%    rd 0.419%
la 2.006%    ta 0.764%    ha 0.554%    ur 0.406%
an 1.933%    ek 0.741%    na 0.546%    ru 0.402%
er 1.888%    el 0.737%    bu 0.545%    iz 0.400%
in 1.734%    ay 0.733%    mi 0.544%    ği 0.386%
le 1.727%    et 0.712%    at 0.540%    ür 0.380%
de 1.539%    iy 0.707%    ad 0.525%    nu 0.380%
en 1.350%    ne 0.706%    im 0.514%    rl 0.375%
ın 1.336%    ol 0.701%    em 0.505%    ey 0.374%
da 1.304%    rı 0.686%    nl 0.499%    lm 0.372%
ya 1.188%    nı 0.684%    dı 0.494%    iş 0.360%
ir 1.179%    si 0.680%    es 0.480%    az 0.359%
ma 1.174%    yo 0.677%    ge 0.477%    ce 0.358%
bi 1.107%    ki 0.670%    on 0.476%    ık 0.350%
il 1.074%    te 0.664%    aş 0.472%    be 0.349%
ka 1.021%    am 0.650%    ik 0.467%    ul 0.338%
ra 0.974%    sa 0.640%    ıl 0.459%    rk 0.330%
ri 0.951%    ti 0.639%    ed 0.450%    ca 0.328%
ak 0.949%    ye 0.638%    tı 0.445%    st 0.321%
nd 0.949%    re 0.638%    se 0.436%    ld 0.319%
al 0.938%    as 0.632%    ün 0.435%    du 0.313%
li 0.899%    ba 0.628%    is 0.432%    lu 0.311%
di 0.860%    ve 0.594%    ke 0.430%    ğı 0.309%
me 0.850%    un 0.590%    kl 0.428%    gi 0.301%
or 0.815%    sı 0.579%    ır 0.424%    mı 0.301%
