The cumulative difference of letter frequencies in two texts can be computed using the following formula. In the formula, f1c and f2c are the frequencies of letter c in the two texts. The result is the difference in percent.
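As a rough illustration, here is a minimal Python sketch of one way to compute such a value. It assumes the cumulative difference is the sum over all letters of |f1c - f2c|, with each frequency expressed as a percentage of the letters in its text. The function names `letter_frequencies` and `cumulative_difference` are hypothetical, and the exact normalization used for the numbers in this post may differ.

```python
from collections import Counter


def letter_frequencies(text):
    """Return {letter: percentage of all letters in text}.

    Assumption: only alphabetic characters are counted and case is folded
    with str.lower(), which is not locale-aware (Turkish dotted/dotless I
    would need special handling that is left out here).
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def cumulative_difference(text1, text2):
    """Sum of |f1c - f2c| over every letter c seen in either text (assumed formula)."""
    f1 = letter_frequencies(text1)
    f2 = letter_frequencies(text2)
    all_letters = set(f1) | set(f2)
    return sum(abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in all_letters)
```

Under this assumed definition, identical letter distributions give 0 and texts with no letters in common give 200, so a different scaling (for example, dividing by two) would change the absolute numbers but not the trend.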
I have computed letter frequency differences for texts from a single author. The following graph shows the results.
Since the data is computed using texts from the same author, the values can be interpreted as the error rate when using letter frequency differences for author identification. The difference decreases as a power function of text size. As can be seen, the difference value is more accurate when comparing longer texts.
Notes
- The data is computed using 1054 articles from a major Turkish newspaper written by the same columnist between 2001 and 2011.
- Text segments of different sizes are compared against a baseline text which contains half of the articles. The baseline is 2043866 characters long.
- The minimum, maximum, and average are computed by comparing multiple text segments of each size against the baseline. The segments are chosen randomly from the rest of the articles.
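The procedure in these notes can be sketched roughly as follows. This is not the original code: the helper names, the number of trials, the fixed seed, and the choice of contiguous random windows as "segments" are all assumptions made for illustration, as is the sum-of-absolute-differences measure.

```python
import random
from collections import Counter


def freq_percent(text):
    """Letter frequencies as percentages (assumed: letters only, lowercased)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid division by zero on letterless text
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def freq_diff(f1, f2):
    """Assumed measure: sum of absolute frequency differences over all letters."""
    return sum(abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in set(f1) | set(f2))


def segment_stats(baseline, rest, size, trials=20, seed=0):
    """Compare `trials` random windows of `size` chars from `rest` to `baseline`.

    Returns (minimum, average, maximum) of the frequency difference.
    `trials` and `seed` are hypothetical parameters, not from the post.
    """
    if size > len(rest):
        raise ValueError("segment size exceeds the available text")
    rng = random.Random(seed)
    base_freq = freq_percent(baseline)
    diffs = []
    for _ in range(trials):
        start = rng.randrange(len(rest) - size + 1)
        segment = rest[start:start + size]
        diffs.append(freq_diff(base_freq, freq_percent(segment)))
    return min(diffs), sum(diffs) / len(diffs), max(diffs)
```

With real data, `baseline` would be the concatenated half of the articles and `rest` the remaining text, and the function would be called once per text size (1024, 2048, and so on).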
Data:
Text Size  --------Freq Diff--------
in chars   Minimum  Average  Maximum
     1024   5.026%   8.229%  14.524%
     2048   3.542%   5.962%  10.533%
     4096   2.824%   4.518%   7.439%
     8192   2.085%   3.361%   5.488%
    16384   1.433%   2.517%   4.315%
    32768   1.111%   1.899%   3.644%
    65536   1.023%   1.472%   2.283%
   131072   0.775%   1.066%   1.397%
   262144   0.542%   0.785%   1.041%
   524288   0.424%   0.522%   0.649%
  1048576   0.338%   0.338%   0.338%
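
The power-function behaviour mentioned earlier can be checked against the Average column with a simple log-log least-squares fit. The sketch below assumes the model diff = a * size^b; the function name `fit_power_law` is hypothetical, and the fit is only a quick sanity check, not the method used to produce the graph.

```python
import math

# Text sizes and average frequency differences (percent) from the table above.
sizes = [1024 * 2 ** k for k in range(11)]
averages = [8.229, 5.962, 4.518, 3.361, 2.517, 1.899,
            1.472, 1.066, 0.785, 0.522, 0.338]


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the regression line through the log-transformed points.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
         / sum((x - mean_x) ** 2 for x in lx))
    a = math.exp(mean_y - b * mean_x)
    return a, b


a, b = fit_power_law(sizes, averages)
print(f"average diff ~ {a:.1f} * size^{b:.3f}")
```

For this data the fitted exponent comes out at about -0.44, meaning the average difference shrinks by roughly a quarter each time the text size doubles.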