Letter and next-letter frequencies in English [OC]

Image from i.redditmedia.com and submitted by Udzu

Udzu on August 4th, 2017 at 12:18 UTC »

Visualisation details

The grid shows the relative frequencies of the different letters in English, as well as the relative frequencies of each subsequent letter: for example, the likelihoods that a t is followed by an h or that a q is followed by a u.

The data is from a million random sentences from Wikipedia, which contain 132 million characters. Accents, numbers and non-Latin characters were stripped, and letter case was ignored. However, spaces were kept in, making it possible to see the most common word starters, or letters that typically come at the end of words.
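As a rough illustration of the counting described above, here is a minimal Python sketch. It is not the actual pudzu code. The function names are made up for this example, and the exact stripping rules are assumptions: accented letters are reduced to their base letter, and everything other than a-z and spaces is dropped.

```python
import unicodedata
from collections import Counter, defaultdict

def normalise(text):
    """Lowercase, reduce accented letters to base letters, keep only a-z and spaces."""
    # NFKD splits 'é' into 'e' plus a combining accent, which the filter then drops.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    kept = [c for c in decomposed if "a" <= c <= "z" or c.isspace()]
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())

def bigram_counts(sentences):
    """Count how often each character (letter or space) is followed by each other one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        text = normalise(sentence)
        if not text:
            continue
        # Pad with spaces so word starts and word ends are counted too.
        padded = " " + text + " "
        for first, second in zip(padded, padded[1:]):
            counts[first][second] += 1
    return counts

def next_letter_frequencies(counts, letter):
    """Relative frequency of each character that follows `letter`."""
    total = sum(counts[letter].values())
    return {second: n / total for second, n in counts[letter].items()}

counts = bigram_counts(["The thing", "Café 2017!"])
print(next_letter_frequencies(counts, "t"))
```

Keeping the space as a symbol in its own right is what makes word starters (characters following a space) and word endings (characters followed by a space) show up in the table.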

The grid was made using Python and Pillow. For the (rather hacky) source code, see www.github.com/Udzu/pudzu.

For an equivalent image using articles from French Wikipedia, see imgur.

Update: if you liked the pseudoword generation, be sure to check out this awesome paper by /u/brighterorange about words that ought to exist.

smileedude on August 4th, 2017 at 13:03 UTC »

So if you follow the most common path, you begin with a space, then t, h, e, space. This checks out.
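Following the most common path can be sketched as a greedy walk over a next-letter table. The function name is made up for this example, and the counts below are invented toy numbers chosen only to show the mechanics, not the figures from the chart.

```python
def most_common_path(counts, start=" ", max_steps=10):
    """From `start`, repeatedly step to the most frequent next character."""
    path = [start]
    current = start
    for _ in range(max_steps):
        followers = counts.get(current)
        if not followers:
            break
        # Pick the follower with the highest count.
        current = max(followers, key=followers.get)
        path.append(current)
        if current == " ":
            # Reached the end of a word.
            break
    return "".join(path)

# Toy table (invented numbers): each key maps to counts of what follows it.
toy_counts = {
    " ": {"t": 5, "a": 3},
    "t": {"h": 4, "o": 2},
    "h": {"e": 6, "a": 1},
    "e": {" ": 7, "r": 2},
}

print(repr(most_common_path(toy_counts)))  # ' the '
```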

Sergeant_Rainbow on August 4th, 2017 at 13:30 UTC »

Oh man, the Markov-generated pseudowords are the absolute best part of this data! Just look at these beautiful creations:

Bastrabot Forliatitive Wasions Felogy Sonsih Fourn Meembege Prouning Nown Abrip Dithely Raliket Ascoult Quarm Winferlifterand Uniso Hise Nuouish Guncelawits Rectere Doesium

Can we have more??
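For anyone wanting to make more, here is a minimal sketch of a letter-level Markov generator in Python. It assumes a first-order chain (each letter depends only on the previous one); the pseudowords above may well have been generated with a longer context, which tends to give more word-like results. The function names and the tiny word list are made up for this example.

```python
import random
from collections import Counter, defaultdict

def build_model(words):
    """Count letter-to-letter transitions, using a space as the word boundary."""
    model = defaultdict(Counter)
    for word in words:
        padded = " " + word.lower() + " "
        for first, second in zip(padded, padded[1:]):
            model[first][second] += 1
    return model

def pseudoword(model, rng, max_len=12):
    """Sample letters in proportion to their transition counts until a space comes up."""
    letters = []
    current = " "
    while len(letters) < max_len:
        followers = model[current]
        if not followers:
            break
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        if current == " ":
            break
        letters.append(current)
    return "".join(letters).capitalize()

# Tiny illustrative word list; a real run would use a large corpus.
model = build_model(["the", "then", "that", "other", "north", "heart"])
rng = random.Random()
print([pseudoword(model, rng) for _ in range(5)])
```

With such a small word list the output is mostly short nonsense; swapping in transition counts from a large corpus is what makes the results start to look like plausible English.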