Different Tongue, Same Information: 17-language Study Reveals How We Communicate Information at a Similar Rate

Authored by technologynetworks.com and submitted by rjmsci

If you travel to a different part of the world, the richness of a foreign language may be the first thing that strikes you. But a new study from researchers at the University of Lyon suggests there may be fewer differences between tongues than you might think.

“Languages vary a lot in terms of the information that they pack into a syllable and also in the rate that they are spoken at. But the interesting thing is that the two kind of balance each other, so that more information dense languages are spoken slower, and those that are less informationally heavy are spoken faster. This means that there is a steady information rate that is very similar among languages,” says study co-author Dan Dediu, a researcher at Lyon’s Laboratoire Dynamique du Langage.

In trying to find a “universal” constant for language, Dediu’s team faced quite a battle. There are over 7,000 different languages, and very few characteristics connect all of them. This even extends to basic measures of how information is encoded in words. For example, the number of syllables per word varies greatly between languages, meaning that the Shannon information rate (see grey box) varies as well. However, Dediu and his team had the insight to take into account not just the words, but the rate at which they are spoken.

Dediu and colleagues used recordings taken from 170 native adult speakers of 17 different languages across Europe and Asia. Each speaker was tasked with reading a set of 15 chunks of text, consisting of roughly 240,000 syllables.

Shannon Information: Claude Shannon, a researcher at Bell Labs, made a huge contribution to information technology when he formulated his theory of information in a seminal paper in the 1940s. The gist of Shannon’s work was that information could be expressed as discrete binary values, which he called bits. This meant that the noise produced by long-distance communication could be silenced by rounding the distortion up or down to 1 or 0. Applying this theory to language, Shannon showed that different languages have their own level of redundancy. English is sometimes said to have a 50% level of redundancy, meaning half the letters in a given sentence could be removed whilst preserving its meaning.
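The idea in the box above can be made concrete in a few lines of code. This is a rough illustrative sketch, not the study’s method: it estimates per-letter entropy from single-letter frequencies alone (a unigram model, which understates English’s true redundancy) and compares it to the maximum possible for a 26-letter alphabet.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy in bits per symbol, estimated from observed frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "languages trade information density against speaking rate"
letters = [ch for ch in text if ch.isalpha()]

h = entropy_bits(letters)      # observed bits per letter
h_max = math.log2(26)          # upper bound: all 26 letters equally likely
redundancy = 1 - h / h_max     # fraction of capacity "wasted" on predictability

print(f"{h:.2f} bits/letter vs {h_max:.2f} max -> redundancy {redundancy:.0%}")
```

A unigram estimate like this only captures letter frequencies; Shannon’s roughly 50% figure for English comes from modelling longer contexts, where letters are far more predictable and redundancy is accordingly higher.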

The researchers chose the syllable as their unit of information, adopting it over two other options (the choice of unit is quite a controversial subject in linguistic informatics, as it turns out):

Phonemes – the units of sound that distinguish one word from another – were excluded, as Dediu’s team realized they could be easily omitted in speech

Words – these were seen as being too language specific for easy comparison

Armed with a data set and a metric, the scientists examined their results. They revealed some interesting differences between our world’s languages:

The number of distinct syllables in English is nearly 7000, but just a few hundred in Japanese

Speech rate varied from 4.3 to 9.1 syllables per second

Vowel harmony (a fascinating linguistic feature that requires suffixes to be “in harmony” with the word they attach to) was present in four of the languages

In short, the languages sounded pretty darn different.

Despite this, Dediu's team noted that the information rate, which combines the speech rate and the information density of the text, was roughly consistent across all the languages recorded; information-dense languages were read more slowly, whilst information-light ones were spoken faster.

Language as a gingerbread reindeer: the two B/W versions use different resolutions and number of gray levels but encode the same info, just as languages trade off different strategies but are equally efficient. Credit: Dan Dediu, Université Lumière Lyon 2

The researchers were able to settle on a number – 39.15 bits/s – as an average information rate over the 17 languages. There were some interesting variations – for example, female speakers had a lower speech and information rate.
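Conceptually, the headline figure combines the two measurements described above: bits per syllable (information density) multiplied by syllables per second (speech rate). The study estimates density from real corpora; the sketch below instead uses two invented toy “languages” – hypothetical syllable inventories and speech rates chosen purely for illustration – to show how a denser language spoken slowly and a lighter language spoken quickly can land on a similar bits-per-second figure.

```python
import math
from collections import Counter

def syllable_entropy(syllables):
    """Bits per syllable, estimated from the syllable frequency distribution."""
    counts = Counter(syllables)
    n = len(syllables)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy corpora (invented): a "dense" language with many distinct syllables
# spoken slowly, and a "light" one with few distinct syllables spoken quickly.
dense = {"syllables": ["ka", "to", "mi", "ru", "se", "no", "pa", "li"] * 5, "rate": 4.5}
light = {"syllables": ["ta", "ka", "ta", "na", "ka", "ta"] * 10, "rate": 9.0}

for name, lang in [("dense", dense), ("light", light)]:
    density = syllable_entropy(lang["syllables"])   # bits per syllable
    info_rate = density * lang["rate"]              # bits per second
    print(f"{name}: {density:.2f} bits/syll x {lang['rate']} syll/s = {info_rate:.1f} bits/s")
```

The two information rates come out close even though the speech rates differ by a factor of two – the trade-off the study reports, in miniature.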

The team showed that the differences in the written text made little difference to the information rate, suggesting that the results could be generalized beyond the text-based study conducted here. The speech rate and syllable number were significantly more variable than the information rate, cementing the latter as a valid cross-lingual connector.

What does this mean for our brain?

The authors suggest that the findings mean information rate has to stabilize around a tight mean: rates that are too high would impede the brain’s ability to process data and articulate speech clearly, while a rate that is too low would require the listener to hold far too many words in memory before meaning could be extracted.

This highlights the dual role that language plays, which Dediu sums up: “There are two sides to the coin when it comes to language – one is the cultural and the other biological, and when one changes - say a language becomes more informationally dense - the other reacts - its speakers start speaking it slower.”

biolinguist on September 4th, 2019 at 19:25 UTC »

One of the worst possible studies I have ever come across, with rampant confusions between Language, languages, speech, and at least two possible interpretations of "universals". The citations linked with regard to these discussions are mostly discarded old junk (none more so than the Evans and Levinson "research"), have been beaten to death, and the discussion of "information theory" is laughably outdated.

Shannon's information theory was chewed up way back in the 1960s. George Miller published a nice exposé of its inherent shortcomings after going down that road. It has been known for at least three decades now that Shannon information theory lacks explanatory adequacy altogether when applied to linguistic computation, with the algorithms oftentimes appearing more interesting than their logarithmic values. This is all old news. A much better take can be found in the works of Ding et al. from Poeppel's lab, or a recent paper by Krakauer and colleagues.

kittenTakeover on September 4th, 2019 at 19:03 UTC »

What counts as a "bit"? Is it just syllables or actual information? If it's the latter, how do they quantify the information? It would seem silly if it were just syllables. Of course you can only say so many syllables per minute. That would also mean that if you can fit more information into each syllable, some languages effectively "talk faster".

lastsynapse on September 4th, 2019 at 18:47 UTC »

They chose the syllable as the unit of information, suggesting language communicates 39 bits/s of syllable information. But language communicates _ideas_ much more quickly. "A hurricane is coming tomorrow" is only 10 syllables, but communicates a ton of information: namely, within 24 hours, a big storm is coming. If I'm aware of the context (e.g. I live in Florida) I may be more aware that that means serious problems for me and my family's safety.

The problem with measuring language as the units of speech is that it ignores the relational database of memories and learned information that sits in our heads. "Colorless green ideas sleep furiously" may be 11 syllables, but it either presents as 0 information as a nonsense phrase, or maybe an infinite amount of information as it causes one to consider what makes up an English sentence.

language may operate on a 39 bit/s carrier wave, but oodles of information is coming at that frequency.