Saturday, April 28, 2012

Error Rates in Letter Frequency Comparisons

The difference between the letter frequencies of two texts can be used for various tasks. For example, two texts in the same language will have small differences in letter frequencies, and two texts by the same author will have even smaller differences. This kind of analysis can be used to identify the language or the author of a text. The frequency difference can also be used to detect changes in a language, or in an author's style, over time.

The cumulative difference of letter frequencies in two texts can be computed using the following formula. In the formula, f1c and f2c are the frequencies of letter c in the two texts. The result is the difference in percent.
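
As a minimal sketch of the computation, assuming the cumulative difference is the sum over all letters c of |f1c - f2c| with both frequencies expressed in percent (this exact form and its normalization are an assumption), it could look like this in Python. The function names are illustrative only.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: percent of all letters} for the alphabetic characters in text."""
    # Note: str.lower() is not locale-aware, so Turkish dotted/dotless I
    # would need special handling in a real analysis.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in Counter(letters).items()}

def frequency_difference(text1, text2):
    """Sum of |f1c - f2c| over every letter c seen in either text (assumed form)."""
    f1 = letter_frequencies(text1)
    f2 = letter_frequencies(text2)
    return sum(abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in set(f1) | set(f2))
```

Under this assumed definition, identical letter distributions give 0 and texts with no letters in common give 200, so a different normalization (for example, dividing by two) would change the scale but not the ordering of results.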



I have computed letter frequency differences for texts from a single author. The following graph shows the results.




Since all of the data is computed from texts by the same author, the values can be interpreted as the error rate of letter frequency differences when they are used for author identification. The difference decreases roughly as a power function of text size, so the value becomes more accurate as the compared texts get longer.

Notes
- The data is computed using 1054 articles from a major Turkish newspaper written by the same columnist between 2001 and 2011.
- Text segments of different sizes are compared against a baseline text which contains half of the articles. The baseline is 2043866 characters long.
- The minimum, maximum, and average are computed by comparing multiple text segments of each size against the baseline. The segments are chosen randomly from the rest of the articles.

Data:
Text Size  --------Freq Diff--------
in chars   Minimum  Average  Maximum
   1024    5.026%   8.229%   14.524%
   2048    3.542%   5.962%   10.533%
   4096    2.824%   4.518%    7.439%
   8192    2.085%   3.361%    5.488%
  16384    1.433%   2.517%    4.315%
  32768    1.111%   1.899%    3.644%
  65536    1.023%   1.472%    2.283%
 131072    0.775%   1.066%    1.397%
 262144    0.542%   0.785%    1.041%
 524288    0.424%   0.522%    0.649%
1048576    0.338%   0.338%    0.338%
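
To check the power-function claim against the table, one option is a least-squares line fit in log-log space on the Average column. The sketch below uses only the values listed above; the fitting method and the function name are my own choice, not necessarily how the original graph was analyzed.

```python
import math

# Text sizes and average frequency differences (percent) from the table above.
sizes = [1024, 2048, 4096, 8192, 16384, 32768,
         65536, 131072, 262144, 524288, 1048576]
averages = [8.229, 5.962, 4.518, 3.361, 2.517, 1.899,
            1.472, 1.066, 0.785, 0.522, 0.338]

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).
    Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

a, b = fit_power_law(sizes, averages)
print(f"average difference ~ {a:.1f} * size^{b:.3f}")
```

On this table the fitted exponent falls between -0.4 and -0.5, which means each doubling of the text size lowers the average difference by roughly a quarter.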
