SOCR LetterFrequencyData



Current revision as of 18:43, 31 May 2010

SOCR Data - Latin Letters Frequency Distributions in Different Languages

Data Description

The data table below presents the average frequencies of the 26 most common Latin letters in different languages. Letter frequencies in text are studied in cryptography. The exact letter frequency distribution underlying a given language is unknown and varies with time, since writers all write slightly differently and are influenced by their culture. Modern International Morse code encodes the most frequent letters with the shortest symbols: arranging the Morse alphabet into groups of letters that require equal amounts of time to transmit, and then sorting these groups in increasing order, roughly puts the most frequent letters first. Similar ideas are used in modern data-compression techniques such as Huffman coding.
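
As a rough illustration of the Huffman coding idea, the following Python sketch builds a Huffman code from the English column of the data table and reports the code length assigned to each letter. The function name huffman_code_lengths and the choice to drop letters listed as 0.00 are assumptions made for this example, not part of the SOCR data or tools.

```python
import heapq
from itertools import count

# English column of the data table; letters listed there as 0.00 (j, q, x, z)
# are omitted here, which is a simplification made for this sketch.
ENGLISH = {
    "a": 0.08, "b": 0.01, "c": 0.03, "d": 0.04, "e": 0.13, "f": 0.02,
    "g": 0.02, "h": 0.06, "i": 0.07, "k": 0.01, "l": 0.04, "m": 0.02,
    "n": 0.07, "o": 0.08, "p": 0.02, "r": 0.06, "s": 0.06, "t": 0.09,
    "u": 0.03, "v": 0.01, "w": 0.02, "y": 0.02,
}


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    tiebreak = count()  # unique counter so heapq never has to compare the dicts
    heap = [(weight, next(tiebreak), {symbol: 0}) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside moves one level deeper.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


lengths = huffman_code_lengths(ENGLISH)
for letter in sorted(lengths, key=lambda s: (lengths[s], s)):
    print(letter, lengths[letter])

# Average code length in bits per letter, weighted by the table frequencies.
average_bits = sum(ENGLISH[s] * lengths[s] for s in ENGLISH) / sum(ENGLISH.values())
print("average bits per letter:", round(average_bits, 2))
```

Frequent letters such as e and t end up with shorter codes than rare letters such as b or v, which is the same principle Morse code applies by hand.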

Letter frequencies, like word frequencies, tend to vary by writer, subject and language. Accurate average letter frequencies are obtained by analyzing large amounts of representative text.
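
A minimal sketch of how such frequencies could be estimated from a text sample is shown below. The function name letter_frequencies and the one-sentence sample are assumptions for illustration only; a single sentence is far too short to be representative. The sketch also divides by the number of a-z letters only, whereas the data table appears to include other characters in the denominator (see the Others row).

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Relative frequency of each letter a-z among the a-z letters in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        # No letters at all: report zero for every letter instead of dividing by zero.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


sample = "The quick brown fox jumps over the lazy dog"
freqs = letter_frequencies(sample)

# Show the five most frequent letters in the sample.
for letter, freq in sorted(freqs.items(), key=lambda item: -item[1])[:5]:
    print(letter, round(freq, 2))
```

Running the same function over a large, representative corpus for each language would give estimates comparable to the columns of the data table.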

Sources

Data Table

Letter English French German Spanish Portuguese Esperanto Italian Turkish Swedish Polish Toki_Pona Dutch Average
a 0.08 0.08 0.07 0.13 0.15 0.12 0.12 0.12 0.09 0.08 0.17 0.07 0.11
b 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.03 0.01 0.01 0.00 0.02 0.01
c 0.03 0.03 0.03 0.05 0.04 0.01 0.05 0.01 0.01 0.04 0.00 0.01 0.03
d 0.04 0.04 0.05 0.06 0.05 0.03 0.04 0.05 0.05 0.03 0.00 0.06 0.04
e 0.13 0.15 0.17 0.14 0.13 0.09 0.12 0.09 0.10 0.07 0.07 0.19 0.12
f 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.00 0.02 0.00 0.00 0.01 0.01
g 0.02 0.01 0.03 0.01 0.01 0.01 0.02 0.01 0.03 0.01 0.00 0.03 0.02
h 0.06 0.01 0.05 0.01 0.01 0.00 0.02 0.01 0.02 0.01 0.00 0.02 0.02
i 0.07 0.08 0.08 0.06 0.06 0.10 0.11 0.08 0.05 0.07 0.15 0.07 0.08
j 0.00 0.01 0.00 0.00 0.00 0.04 0.00 0.00 0.01 0.02 0.03 0.01 0.01
k 0.01 0.00 0.01 0.00 0.00 0.04 0.00 0.05 0.03 0.03 0.05 0.02 0.02
l 0.04 0.05 0.03 0.05 0.03 0.06 0.07 0.06 0.05 0.03 0.10 0.04 0.05
m 0.02 0.03 0.03 0.03 0.05 0.03 0.03 0.04 0.04 0.02 0.04 0.02 0.03
n 0.07 0.07 0.10 0.07 0.05 0.08 0.07 0.07 0.09 0.05 0.12 0.10 0.08
o 0.08 0.05 0.03 0.09 0.11 0.09 0.10 0.02 0.04 0.07 0.08 0.06 0.07
p 0.02 0.03 0.01 0.03 0.03 0.03 0.03 0.01 0.02 0.02 0.04 0.02 0.02
q 0.00 0.01 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00
r 0.06 0.07 0.07 0.07 0.07 0.06 0.06 0.07 0.08 0.04 0.00 0.06 0.06
s 0.06 0.08 0.07 0.08 0.08 0.06 0.05 0.03 0.06 0.04 0.04 0.04 0.06
t 0.09 0.07 0.06 0.05 0.05 0.05 0.06 0.03 0.09 0.02 0.05 0.07 0.06
u 0.03 0.06 0.04 0.04 0.05 0.03 0.03 0.03 0.02 0.02 0.03 0.02 0.03
v 0.01 0.02 0.01 0.01 0.02 0.02 0.02 0.01 0.02 0.00 0.00 0.03 0.01
w 0.02 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.03 0.02 0.01
x 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
y 0.02 0.00 0.00 0.01 0.00 0.00 0.00 0.03 0.01 0.03 0.00 0.00 0.01
z 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.02 0.00 0.05 0.00 0.01 0.01
Others 0.00 0.03 0.00 0.00 0.00 0.02 0.00 0.12 0.06 0.20 0.00 0.00 0.04
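
The following Python sketch shows one way the Average column and the Others row can be checked against the rest of the table. It assumes that Average is the unweighted mean of the 12 language columns and that Others is the share not covered by the 26 letters; both readings are inferred from the numbers, not stated by the source. The names row_average and others_share are hypothetical.

```python
# Rows a and e copied from the data table, one value per language in column order:
# English, French, German, Spanish, Portuguese, Esperanto, Italian, Turkish,
# Swedish, Polish, Toki_Pona, Dutch.
ROWS = {
    "a": [0.08, 0.08, 0.07, 0.13, 0.15, 0.12, 0.12, 0.12, 0.09, 0.08, 0.17, 0.07],
    "e": [0.13, 0.15, 0.17, 0.14, 0.13, 0.09, 0.12, 0.09, 0.10, 0.07, 0.07, 0.19],
}

# Polish column copied from the data table, letters a through z.
POLISH = [
    0.08, 0.01, 0.04, 0.03, 0.07, 0.00, 0.01, 0.01, 0.07, 0.02, 0.03, 0.03, 0.02,
    0.05, 0.07, 0.02, 0.00, 0.04, 0.04, 0.02, 0.02, 0.00, 0.04, 0.00, 0.03, 0.05,
]


def row_average(values):
    """Unweighted mean across languages, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


def others_share(column):
    """Share of characters not covered by the 26 letters (assumed meaning of Others)."""
    return round(1.0 - sum(column), 2)


average_a = row_average(ROWS["a"])
average_e = row_average(ROWS["e"])
polish_others = others_share(POLISH)
print(average_a, average_e, polish_others)
```

Because the table values are already rounded to two decimals, a check like this can be off by a few hundredths for other rows or columns.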

Graphs

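  • Histogram (HistogramChartDemo7) of the English letters
  • Stacked Bar-Chart (StackedBarChartDemo3, under BarCharts --> CategoryPlots) of all letters across each language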




