Saturday, April 28, 2012

Error Rates in Letter Frequency Comparisons

The difference between the letter frequencies of two texts can be used for various tasks. For example, two texts in the same language will have small differences in letter frequencies. Two texts by the same author will have even smaller differences. This kind of analysis can be used to identify the language or the author of a text. The frequency difference can also be used to detect changes in a language, or in an author's style, over time.

The cumulative difference of letter frequencies in two texts can be computed using the following formula. In the formula, f1c and f2c are the frequencies of letter c in the two texts. The result is the difference in percent.
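
The formula image did not survive here, so the following is only a minimal sketch of one plausible reading: it assumes the cumulative difference is the sum, over every letter c, of the absolute difference |f1c - f2c|, with frequencies expressed in percent. The function names are hypothetical and not from the original post.

```python
from collections import Counter


def letter_frequencies(text):
    """Return each letter's share of all letters in text, in percent."""
    # Note: str.lower() is not Turkish-aware (I/ı and İ/i need special handling).
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    return {ch: 100.0 * n / total for ch, n in Counter(letters).items()}


def frequency_difference(text1, text2):
    """Assumed formula: sum of |f1c - f2c| over every letter c in either text."""
    f1 = letter_frequencies(text1)
    f2 = letter_frequencies(text2)
    return sum(abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in set(f1) | set(f2))
```

Under this assumption, identical texts give 0% and texts with no letters in common give 200%. If the original formula normalizes the sum differently, the values scale accordingly.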



I have computed letter frequency differences for texts from a single author. The following graph shows the results.




Since the data is computed using text from the same author, the values can be interpreted as the error rate of letter frequency differences when they are used for author identification. The difference decreases roughly as a power function of text size, so the value becomes more accurate as the compared texts get longer.

Notes
- The data is computed using 1054 articles from a major Turkish newspaper written by the same columnist between 2001 and 2011.
- Text segments of different sizes are compared against a baseline text which contains half of the articles. The baseline is 2043866 characters long.
- Min, max, and average are computed by comparing multiple text segments of each size to the baseline. The segments are chosen randomly from the rest of the articles.
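
The sampling procedure in these notes could be sketched roughly as follows. This is an illustration under assumptions, not the code used for the post: the function name, the number of samples, and the fixed seed are all hypothetical, and `diff` stands for any function that returns the frequency difference of two texts.

```python
import random


def segment_stats(baseline, rest, size, diff, samples=20, seed=0):
    """Compare random segments of `rest` with `baseline`; return (min, avg, max)."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    diffs = []
    for _ in range(samples):
        # Pick a random window of `size` characters from the non-baseline text.
        start = rng.randrange(len(rest) - size + 1)
        diffs.append(diff(rest[start:start + size], baseline))
    return min(diffs), sum(diffs) / len(diffs), max(diffs)
```

Calling this once per text size (1024, 2048, and so on) would produce one row of minimum, average, and maximum values per size.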

Data:
Text Size  --------Freq Diff--------
in chars   Minimum  Average  Maximum
   1024    5.026%   8.229%   14.524%
   2048    3.542%   5.962%   10.533%
   4096    2.824%   4.518%    7.439%
   8192    2.085%   3.361%    5.488%
  16384    1.433%   2.517%    4.315%
  32768    1.111%   1.899%    3.644%
  65536    1.023%   1.472%    2.283%
 131072    0.775%   1.066%    1.397%
 262144    0.542%   0.785%    1.041%
 524288    0.424%   0.522%    0.649%
1048576    0.338%   0.338%    0.338%
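
The power-function decrease noted above can be checked against the Average column with a straight-line fit on log-log axes. This is a minimal sketch, not the method used for the graph; the function name and the choice of ordinary least squares are assumptions.

```python
import math

# (text size in chars, average frequency difference in percent) from the table
AVERAGES = [
    (1024, 8.229), (2048, 5.962), (4096, 4.518), (8192, 3.361),
    (16384, 2.517), (32768, 1.899), (65536, 1.472), (131072, 1.066),
    (262144, 0.785), (524288, 0.522), (1048576, 0.338),
]


def fit_power_law(points):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                        # slope of the line = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept = log of the coefficient
    return a, b


a, b = fit_power_law(AVERAGES)
print(f"average difference ~ {a:.1f} * size ** {b:.3f}")
```

A negative exponent b reflects the decreasing trend: each doubling of the text size multiplies the fitted average by 2 ** b.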

Friday, April 27, 2012

The Turkish Alphabet

The official Turkish Alphabet is composed of the following 29 letters.

a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z


The orphan letters

Like anything else that is official, this is not the complete story. Real-life Turkish text will also contain the following letters.

â î û

These three letters, written with a circumflex, are long or stressed versions of a, i, and u. Their status is disputed in multiple ways. They are usually considered to be variations of a, i, and u instead of full members of the alphabet. Some people oppose their use altogether while others use them regularly. TDK (Turkish Language Association) keeps changing the rules governing their use every decade. I am afraid the disputes are usually ideological and not scientific in nature. This holds true for most of the discussions regarding the Turkish language.

Note that the circumflex is called düzeltme işareti, inceltme işareti, şapka işareti, or uzatma işareti in Turkish. The numerous names given to this diacritic are not unrelated to the confusion regarding its use.

The current set of rules governing the use of the circumflex is complex and not followed reliably. You may see the same word written using either âîû or their counterparts without the circumflex. If you are going to perform any processing on Turkish text, it is a good idea to convert âîû to aiu first.
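
The conversion suggested above can be done with a simple character mapping. This is a minimal sketch with a hypothetical function name; it assumes the text uses precomposed characters and handles only the three lowercase letters mentioned here.

```python
# Map the circumflexed vowels to their plain counterparts.
# Assumption: uppercase and decomposed (letter + combining mark) forms are not handled.
CIRCUMFLEX_MAP = str.maketrans("âîû", "aiu")


def strip_circumflex(text):
    """Replace â, î, and û with a, i, and u."""
    return text.translate(CIRCUMFLEX_MAP)


print(strip_circumflex("kâğıt hâlâ"))  # prints "kağıt hala"
```

All other Turkish letters, such as ğ, ı, and ş, pass through unchanged.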


The foreign letters

Turkish text may contain letters from other languages, usually in foreign proper nouns such as the names of people and places. Here is a list of all foreign letters with frequencies exceeding 0.001%.

w x q


Notes

Letter frequencies are computed from Hurriyet and Zaman newspapers using columnist articles between 2001 and 2011.

See Also
http://en.wikipedia.org/wiki/Turkish_alphabet
http://en.wikipedia.org/wiki/Circumflex
http://tr.wikipedia.org/wiki/Düzeltme_işareti

Letter Frequencies in Turkish




- Frequencies are computed from Hurriyet and Zaman newspapers using columnist articles between 2001 and 2011.
- The graph contains all letters with frequencies exceeding 0.001%.
- Note that w, x, and q are not part of the Turkish alphabet.

Data
Letter  Frequency      Letter   Frequency
a         11.742%      z           1.511%
e          9.373%      g           1.254%
i          8.714%      h           1.134%
n          7.344%      ç           1.035%
r          6.978%      v           1.015%
l          6.372%      ğ           0.998%
ı          4.734%      c           0.974%
k          4.599%      p           0.880%
d          4.287%      ö           0.797%
m          3.759%      f           0.518%
t          3.562%      j           0.068%
y          3.426%      â           0.062%
s          3.170%      w           0.019%
u          3.032%      î           0.014%
o          2.602%      x           0.008%
b          2.554%      û           0.004%
ü          1.886%      q           0.001%
ş          1.571%         
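
A frequency table like the one above could be computed along these lines. This is a minimal sketch and not the code used for the post; the function names are hypothetical. One detail specific to Turkish: the uppercase of i is İ and the uppercase of ı is I, so the default lowercasing in Python has to be adjusted first.

```python
from collections import Counter


def turkish_lower(text):
    """Lowercase using Turkish casing rules: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def letter_frequency_table(text):
    """Return (letter, percent) pairs sorted by descending frequency."""
    letters = [ch for ch in turkish_lower(text) if ch.isalpha()]
    total = len(letters)
    return [(ch, 100.0 * n / total) for ch, n in Counter(letters).most_common()]
```

Only alphabetic characters are counted, so spaces, digits, and punctuation do not affect the percentages.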

Monday, April 23, 2012

The Little Things (Wanted)



The Little Things
by Danny Elfman
Soundtrack from the movie Wanted

Have you heard the news?
Bad things come in twos
But I never knew 'bout the little things

Every single day
Things get in my way
Someone has to pay for the little things

And I'm through with the stories
And I'm sick of my shoes
And the walking and the talking
It's got nothing to do with the final solution
It's a box full of tricks
And I'm through with repairs when there's nothing to fix
And it all comes down to you

Let the headlines wait
Armies hesitate
I can deal with fate but not the little things

Armageddon may
Arrive any day
I can't get away from the little things
...

Vira Bismillah

I created this blog to share updates on my interests, opinions, and activities. I will focus on the following subjects.

- Computer programming
- Natural Language Processing
- Turkish