The Zipf Mystery

Mar 04, 2020
Hey, Vsauce. Michael here. About 6 percent of everything you say, read, and write is "the," the most used word in the English language. About one in every 16 words we encounter each day is "the." The 20 most common English words, in order, are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you," "was," "with," "on," "as," "have," "but," "be," "they." That's a fun fact. A piece of trivia, but it's also more. You see, whether the most used words are ranked across an entire language or in a single book or article, a strange pattern almost always emerges. The second most used word will appear about half as often as the most used word.
The third, one third as often. The fourth, one fourth as often. The fifth, one fifth as often. The sixth, one sixth as often, and so on down the list. Seriously. For some reason, the number of times a word is used is proportional to one over its rank. Plotted on a log-log graph, word frequency against rank follows a nice straight line. A power law. This phenomenon is called Zipf's law, and it doesn't apply only to English. It applies to other languages too, like, well, all of them. Even ancient languages we have not yet been able to translate.
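The "one over its rank" relationship can be checked on any text with a few lines of Python. This is only a minimal sketch: the function names `rank_frequency` and `zipf_prediction` and the simple tokenizer are assumptions made for illustration, not something from the video. If a text follows Zipf's law, the last column (count times rank) should stay roughly constant.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count, count * rank) rows, most frequent word first."""
    # Assumption: a word is any run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count, count * rank)
            for rank, (word, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Count predicted for a rank if frequency is proportional to 1 / rank."""
    return top_count / rank


if __name__ == "__main__":
    sample = "the cat and the dog and the bird saw the fish"
    for row in rank_frequency(sample):
        print(row)
```

A toy sentence like the one above is far too short to show the pattern; a whole book or article is a more reasonable input.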

And here's the thing. We have no idea why. It is surprising that something as complex as reality should be conveyed so predictably by something as creative as language. How predictable? Well, watch this. According to WordCount.org, which ranks words as found in the British National Corpus, "sauce" is the 5,555th most common English word. Now, here's a list of how many times every word appears on Wikipedia and in the entire Gutenberg corpus of tens of thousands of public domain books. The most used word, "the," shows up about 181 million times. Knowing these two things, we can estimate that the word "sauce" should appear about thirty thousand times in Wikipedia and Gutenberg combined.
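The estimate is a single division. In the sketch below, both input numbers come from the transcript; the variable names are made up, and the calculation assumes that a rank measured in the British National Corpus carries over to the Wikipedia and Gutenberg counts.

```python
# Zipf's law: expected count at a rank is the top word's count divided by that rank.
top_count = 181_000_000  # approximate occurrences of "the" (from the transcript)
rank = 5_555             # rank of "sauce" on WordCount.org (from the transcript)

estimate = top_count / rank
print(f"Predicted occurrences of 'sauce': {estimate:,.0f}")
```

The result lands a little above thirty thousand, in line with the figure quoted above.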
And it pretty much does. What gives? The world is chaotic. Things are distributed in countless ways, not just according to power laws. And language is personal, intentional, idiosyncratic. What is it about the world and ourselves that could cause such complex activities and behaviors to follow such a basic rule? We literally don't know. More than a century of research has yet to close the case. Moreover, Zipf's law doesn't just mysteriously describe word use. It is also found in city populations, solar flare intensities, protein sequences and immune receptors, the amount of traffic websites receive, earthquake magnitudes, the number of times academic papers are cited, last names, neural network activation patterns, ingredients used in cookbooks, the number of phone calls people receive, the diameter of craters on the Moon, the number of people who die in wars, the popularity of chess openings, even the rate at which we forget.
There are many theories about why language is "Zipf-y," but no firm conclusions, and this video doesn't contain a definitive explanation either. Sorry, I know that's a bummer, since we apparently like knowing more than mystery. But that said, we also ask more than we answer. So let's dig into Zipf-ian ramifications, some related patterns, some possible explanations, and the depth of the mystery itself. Zipf's law was popularized by George Zipf, a linguist at Harvard University. It is a discrete form of the continuous Pareto distribution, from which we get the Pareto Principle.
Because so many real-world processes behave this way, the Pareto Principle tells us that, as a rule of thumb, it is worth assuming that 20% of the causes are responsible for 80% of the outcome, as in language, where the most frequently used 18% of words account for more than 80% of word occurrences. In 1896, Vilfredo Pareto showed that approximately 80% of the land in Italy was owned by just 20% of the population. It is said that he later noticed that in his garden 20% of the pea pods contained 80% of the peas. He and other researchers looked at other data sets and found that this 80-20 imbalance comes up a lot in the world.
The richest 20% of humans have 82.7% of the world's income. In the United States, 20% of patients use 80% of healthcare resources. In 2002, Microsoft reported that 80% of the errors and crashes in Windows and Office are caused by 20% of the bugs detected. A common rule of thumb in the business world states that 20% of your customers are responsible for 80% of your profits, and 80% of the complaints you receive will come from 20% of your customers. A book titled "The 80/20 Principle" even says that in a home or office, 20% of the carpet receives 80% of the wear. Oh, and as Woody Allen said, "Eighty percent of success is just showing up." The Pareto Principle is everywhere, which is good.
By focusing on just 20% of what is wrong, you can often expect to solve 80% of the problems. A variety of different, unrelated factors make this true from case to case, but if we can get to the bottom of what causes some of them, we might find that one or more of those mechanisms is responsible for Zipf's law in language. George Zipf himself thought that language's interesting rank-frequency distribution was a consequence of the principle of least effort: the tendency for life and things to follow the path of least resistance. Zipf believed it drove much of human behavior and hypothesized that as language developed in our species, speakers naturally preferred to use as few words as possible to get their thoughts across.
It was easier. But to understand what was being said, listeners preferred larger vocabularies that gave more specificity, so that they had to do less work. Zipf figured that the compromise between listeners and speakers led to the current state of language: a few words are used often, and many, many words are used rarely. Recent papers have suggested that having a few short, predictable, frequently used words helps even out the information load on listeners, spacing out important vocabulary so that the rate of information is more constant. This makes sense, and much has been learned by applying the principle of least effort to other behaviors, but later researchers argued that, in the case of language, the explanation was even simpler.
Just a few years after Zipf's seminal paper, Benoit Mandelbrot showed that there may be nothing mysterious about Zipf's law at all, because even if you just type randomly on a keyboard you will produce words distributed according to Zipf's law. It's a pretty fascinating point, and here is why it happens. There are exponentially more different long words than short words. For example, the English alphabet can be used to make 26 one-letter words, but 26 squared two-letter words. Also, in random typing, a word ends whenever the space bar is pressed. Since there is always some chance that the space bar will be pressed, longer stretches before that happens are exponentially less likely than shorter ones.
The combination of these exponentials is quite "Zipf-y." For example, if all 26 letters and the space bar are equally likely to be typed, then after a letter has been typed and a word has begun, the probability that the next input will be a space, thus creating a one-letter word, is just one in 27. And sure enough, if you generate random characters or hire a proverbial typing monkey, about one in 27, or 3.7 percent, of the things between spaces will be single letters. Two-letter words appear when, after starting a word, any character except the space bar is hit, a 26 in 27 chance, and then the space bar. The probability of a three-letter word is the probability of a letter, another letter, and then a space.
If we divide by the number of unique words there can be of each length, we get the expected frequency of occurrence for any particular word given its length. For example, the letter V will make up about 0.142 percent of random typing. The word "Vsauce," 0.0000000993 percent. Longer words are less likely, but watch this. Let's spread these frequencies out according to the ranks they would occupy on a most-often-used list. There are 26 possible one-letter words, so each of the top 26 ranked words is expected to occur with this frequency. The next 676 ranks will be taken up by two-letter words that occur with this frequency.
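The random-typing arithmetic can be written out as a short Python sketch. It assumes, as the description does, 26 letters plus one space bar, all equally likely; the function names are invented for illustration. It reproduces the one-in-27 share of one-letter words, the roughly 0.142 percent share for a single letter such as V, and the block of ranks that words of each length would occupy.

```python
def word_length_probability(length, letters=26):
    """Probability that a word, once started, ends up exactly `length` letters long.

    Assumes every key (the letters plus one space bar) is equally likely.
    """
    keys = letters + 1
    return (letters / keys) ** (length - 1) * (1 / keys)


def specific_word_probability(word, letters=26):
    """Expected share of all typed words taken by one particular word."""
    n = len(word)
    return word_length_probability(n, letters) / letters ** n


def rank_range(length, letters=26):
    """First and last rank held by words of this length on a most-used list."""
    first = sum(letters ** k for k in range(1, length)) + 1
    last = first + letters ** length - 1
    return first, last


if __name__ == "__main__":
    print(f"one-letter words: {word_length_probability(1):.1%}")
    print(f"the word 'v':     {specific_word_probability('v'):.3%}")
    for n in range(1, 5):
        print(n, rank_range(n), specific_word_probability("a" * n))
```

Plotting the per-word frequency against the rank ranges on log-log axes gives a staircase that hugs a straight line, which is the Zipf-like shape the argument relies on.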
If we extend each frequency across as many ranks as it has members, we get Zipf. Later researchers have detailed how changing the initial conditions can smooth out the steps. Our mysterious distribution has been created out of nothing but the inevitabilities of math. So maybe there is no mystery. Maybe words are just the result of humans randomly segmenting the observable world and the mental world into labels, and Zipf's law describes what naturally happens when you do that. Case closed. And as always, thanks for... wait a minute! Real language is very different from random typing. Communication is deterministic to a certain extent.
Utterances and topics arise based on what was said before. And the vocabulary we have to work with certainly isn't the result of purely random naming. For example, the typing-monkey model cannot explain why even the names of the elements, the planets, and the days of the week are used in language according to Zipf's law. Sets like these are constrained by the natural world and are not the result of us randomly segmenting the world into labels. Furthermore, when given a list of novel words, words they have never heard or used before, such as when asked to write a story about alien creatures with strange names, people will naturally tend to use the name of one alien twice as often as another, three times as often as another...
Zipf's law seems to be built into our brains. Perhaps there is something about the way thoughts and topics of discussion ebb and flow that contributes to Zipf's law. Another way "Zipf-ian" distributions are produced is through processes that change according to how they have operated previously. These are called preferential attachment processes. They occur when something (money, views, attention, variation, friends, jobs, anything really) is given out according to how much is already possessed. Going back to the carpet example, if most people walk from the living room to the kitchen along a certain path, furniture will be placed elsewhere, making that path even more popular.
The more views a video, image, or post has, the more likely it is to be recommended automatically or to show up in a news feed for having so many views, which gives it even more views. It's like a snowball rolling down a snowy hill. The more snow it accumulates, the bigger its surface area becomes for collecting more, and the faster it grows. There doesn't have to be a deliberate choice driving a preferential attachment process. It can happen naturally. Try this. Take a bunch of paper clips and grab any two at random. Link them together and then throw them back in the pile.
Now repeat, over and over. If you grab clips that are already part of a chain, link them anyway. More often than not, after a while, you'll have a distribution that looks "Zipf-ian." A small number of chains contain a disproportionate share of the total clip count. This is simply because the longer a chain gets, the greater the proportion of the whole it contains, which gives it a better chance of being picked up in the future and, consequently, made even longer. The rich get richer, the big get bigger, the popular get more popular. It's just math. Perhaps language's Zipf mystery is, if not caused by it, at least strengthened by preferential attachment.
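The paper clip experiment is easy to simulate. The sketch below is one possible reading of it, with hypothetical names and parameters: picking a clip uniformly at random means a chain gets picked with probability proportional to its length, which is the preferential attachment at work. It also assumes that a draw of two clips from the same chain is simply skipped, a detail the description leaves open.

```python
import random


def link_paper_clips(n_clips=1000, n_links=600, seed=0):
    """Repeatedly pick two random clips and join the chains they belong to.

    Returns the chain lengths, longest first.
    """
    rng = random.Random(seed)
    chain_of = list(range(n_clips))              # chain id of each clip
    members = {i: [i] for i in range(n_clips)}   # clips in each chain
    for _ in range(n_links):
        a, b = rng.sample(range(n_clips), 2)
        ca, cb = chain_of[a], chain_of[b]
        if ca == cb:
            continue  # assumption: clips already in the same chain are put back
        if len(members[ca]) < len(members[cb]):
            ca, cb = cb, ca  # always merge the shorter chain into the longer one
        for clip in members[cb]:
            chain_of[clip] = ca
        members[ca].extend(members.pop(cb))
    return sorted((len(m) for m in members.values()), reverse=True)


if __name__ == "__main__":
    lengths = link_paper_clips()
    print("longest chains:", lengths[:5])
    print("single clips left:", lengths.count(1))
```

Varying `n_links` relative to `n_clips` changes how lopsided the result is, so the numbers chosen here are only a starting point for experimenting.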
Once a word is used, it is more likely to be used again soon. Critical points may play a role, too. Writing and conversation often stick to one topic until a critical point is reached and the topic and vocabulary change. Processes like these are known to result in power laws. So, in the end, it seems tenable that all these mechanisms might conspire to make Zipf's law the most natural way for language to be. Perhaps some of our vocabulary and grammar developed randomly, as in Mandelbrot's theory. And the natural way conversation and discussion follow preferential attachment and criticality, together with the principle of least effort in speaking and listening, are all responsible for the relationship between word rank and frequency.
It's a shame that the answer isn't simpler, but it is fascinating because of the consequences it has for the makeup of communication. Generally speaking, and this is mind-blowing, nearly half of any book, conversation, or article will be nothing but the same 50 to 100 words. And nearly the other half will be words that appear in that selection only once. That's not so surprising when you consider the fact that one word accounts for 6 percent of what we say. The top 25 most used words make up about a third of everything we say, and the top 100 about half.
Seriously. I mean, whether it's all the words in "Wet Hot American Summer," or all the words in Plato's "Complete Works," or the complete works of Edgar Allan Poe, or the Bible itself, only about 100 words are used for nearly half of everything written or said. In "Alice in Wonderland," 44%, and in "Tom Sawyer," 49.8% of the unique words used appear only once in the book. A word that is used only once in a given selection of words is called a "hapax legomenon." Hapax legomena are vitally important for understanding languages. If a word has only been found once in the entire known collection of an ancient language, it can be very difficult to figure out what it means.
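Both figures, the share of a text covered by its most common words and the share of its unique words that are hapax legomena, can be measured with a short Python sketch. The function name `vocabulary_stats`, the dictionary keys, and the simple tokenizer are assumptions for illustration; different tokenization choices will shift the exact percentages.

```python
from collections import Counter
import re


def vocabulary_stats(text, top_k=100):
    """Measure top-k word coverage and the hapax legomena of a text."""
    # Assumption: a word is any run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    top_total = sum(count for _, count in counts.most_common(top_k))
    hapaxes = [word for word, count in counts.items() if count == 1]
    return {
        "top_k_share": top_total / len(words),               # share of all word occurrences
        "hapax_share_of_unique": len(hapaxes) / len(counts),  # share of unique words
        "hapaxes": hapaxes,
    }


if __name__ == "__main__":
    stats = vocabulary_stats("the cat and the dog and the bird", top_k=2)
    print(stats)
```

Run on the full text of a public domain book, `top_k_share` with `top_k=100` and `hapax_share_of_unique` are the two numbers to compare against the claims above.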
Now, there is no corpus of everything ever said or written in English, but there are very large collections, and it is fun to find hapax legomena in them. For example, and this probably won't be the case after I mention it, the word "quizzaciously" is in the Oxford English Dictionary, but it doesn't appear anywhere on Wikipedia or in the Gutenberg corpus or the British National Corpus or the American National Corpus. It does, however, turn up as a single result when searched for on Google. Fittingly, in a book titled "ElderSpeak" that lists it as a "rare word." By the way, quizzaciously means "in a mocking manner," as in "The parodist recited quizzaciously: 'Hey, Vsauce. Michael here. But who is Michael, and how much does here weigh?'" It's a little sad that quizzaciously has been used so infrequently. It's a fun word, but that's the way things go in a "Zipf-ian" system. Some things get all the love, some get little. Most of what you experience on a daily basis is forgotten. The Dictionary of Obscure Sorrows, as it happens, has a word for this: Olēka, the awareness of how few days are memorable. I have been alive for almost 11,000 days, but I couldn't tell you something about each one of them.
I mean, not even close. Most of what we do, see, think, say, hear, and feel is forgotten at a rate quite similar to Zipf's law, which makes sense. If a number of factors naturally selected for thinking and talking about the world with tools in a "Zipf-ian" way, it makes sense that we would remember it that way too. Some things really well, most things hardly at all. But it bums me out sometimes, because it means that so many things are forgotten, even things that at the time you thought you could never forget. My locker number senior year, its combination, the jokes I liked when I saw a comedian on stage, the names of the people I saw every day 10 years ago.
So many memories are gone. When I look back at all the books I've read and realize I can't remember every detail of them, it's a little disappointing. I mean, why bother if the Pareto Principle dictates that my 'Zipfian' mind will consciously remember pretty much only the titles and some basic reactions years later? Ralph Waldo Emerson makes me feel better. He once said, "I can't remember the books I've read any more than the meals I've eaten. Still, they've made me." And as always, thanks for watching.
