
You Shall Know a “Word” and a “Person” by the Company they Keep

Dr. B.V.R.C. Purushottam, IAS · 12 March 2026 · 4 min read

Whether analysing human nature or engineering artificial intelligence, the foundational truth is identical. To truly comprehend any entity, you must observe the network to which it belongs.

This article was generated with the assistance of Artificial Intelligence (AI). It is intended as an educational introduction to the Distributional Hypothesis, collocation (co-occurrence), and Pointwise Mutual Information (PMI) in linguistics and machine learning, while drawing philosophical parallels with human behaviour. The explanations simplify complex technical concepts for general readers. For deeper academic understanding, readers are encouraged to consult the references cited at the end.

Philosophy, Collocation, and the Distributional Hypothesis in Machine Learning

There is a famous proverb that says, “You can know a man by the company he keeps.” The idea is simple but profound: human character becomes visible through the relationships and environments in which a person participates. If someone consistently spends time with scholars, artists, or criminals, we intuitively infer something about their habits, interests, and behaviour.

Interestingly, this ancient insight has a striking parallel in modern linguistics and artificial intelligence. The linguist J.R. Firth captured the same principle in 1957 when he wrote: "You shall know a word by the company it keeps."

This statement summarises the distributional hypothesis, one of the foundational ideas behind modern Natural Language Processing (NLP). Just as we interpret people through their social circles, machines interpret words through the other words that appear around them.

The Philosophical Insight: Context Shapes Identity

Philosophy and sociology have long emphasised that individuals cannot be understood in isolation. Aristotle described humans as inherently social beings, while later thinkers such as Hegel and George Herbert Mead argued that identity is constructed through interactions with others.

A person’s behaviour, beliefs, and values are shaped by their environment. A student surrounded by scientists develops curiosity about research. A person raised in a musical household becomes attuned to rhythm and melody. Context becomes a lens through which identity is formed.

This relational understanding of human behaviour provides a useful metaphor for language itself. Words rarely appear alone; they exist within sentences, paragraphs, and conversations. Their meanings are shaped by these contexts.

The Distributional Hypothesis

The distributional hypothesis states:

Words that occur in similar contexts tend to have similar meanings.

Instead of relying on dictionary definitions alone, this approach analyses patterns of word usage across large collections of text.

For example:

  • The words cat and dog frequently appear near words like pet, animal, food, or fur.
  • The words doctor and nurse appear near hospital, patient, and treatment.

Because their surrounding contexts are similar, we infer that these words share related meanings.

In this sense, words behave much like people in social networks. Their “identity” emerges from their associations.
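
To see the idea in code, here is a minimal Python sketch; the four-sentence toy corpus and the Jaccard overlap measure are illustrative simplifications, not what production NLP systems use:

```python
from collections import defaultdict

# Toy corpus: each sentence supplies the "company" each word keeps.
sentences = [
    "the cat is a friendly pet animal",
    "the dog is a loyal pet animal",
    "the doctor treated the patient at the hospital",
    "the nurse helped the patient at the hospital",
]

# Collect the set of context words seen alongside each word.
contexts = defaultdict(set)
for sentence in sentences:
    tokens = sentence.split()
    for word in tokens:
        contexts[word].update(t for t in tokens if t != word)

def jaccard(a, b):
    """Overlap between two context sets: |A ∩ B| / |A ∪ B|."""
    return len(contexts[a] & contexts[b]) / len(contexts[a] | contexts[b])

print(jaccard("cat", "dog"))     # high: shared contexts (pet, animal, ...)
print(jaccard("cat", "doctor"))  # low: almost no shared contexts
```

Even on four sentences, cat and dog share most of their company, while cat and doctor share almost none.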

Collocation and Co-occurrence: Words That Frequently Appear Together

One important concept related to the distributional hypothesis is collocation, the tendency of certain words to co-occur.

A collocation is a pair or group of words that habitually appear together in language.

Examples include:

  • Strong tea (but not usually powerful tea)
  • Heavy rain (rather than strong rain)
  • Make a decision
  • Take a break

These combinations occur so regularly that they become natural patterns in language.

Collocations provide strong signals about meaning because they reveal stable relationships between words. If a word repeatedly appears next to another word across thousands of sentences, that association becomes statistically significant.
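
A first pass at finding collocations is simply to count adjacent word pairs (bigrams), as in the short sketch below with a made-up corpus. Note that raw frequency also surfaces grammatical pairs such as ("make", "a"), which is exactly why the normalising measure introduced in the next section is useful:

```python
from collections import Counter

# Toy corpus; in practice this would be thousands of sentences.
corpus = [
    "they make a decision after heavy rain",
    "we take a break and drink strong tea",
    "she will make a decision tomorrow",
    "he likes strong tea in the morning",
]

# Count adjacent word pairs (bigrams) across the corpus.
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

# The most frequent pairs are collocation candidates.
for pair, count in bigrams.most_common(5):
    print(pair, count)
```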

Measuring Word Relationships: Pointwise Mutual Information (PMI)

To capture mathematically how strongly words are associated with each other, linguists and machine learning researchers use a statistical measure called Pointwise Mutual Information (PMI). PMI measures how much more often two words appear together than we would expect if they were independent. Mathematically, it is expressed as:

PMI(x, y) = log [ P(x, y) / ( P(x) · P(y) ) ]

Where:

  • P(x, y) = probability that words x and y occur together
  • P(x) = probability of word x appearing
  • P(y) = probability of word y appearing

The intuition is straightforward:

  • If two words co-occur more often than expected, the PMI is high.
  • If they occur together about as often as chance would predict, PMI is around zero.
  • If they occur together less often than expected, PMI becomes negative.

For example:

  • The words peanut and butter appear together far more often than chance would predict, so their PMI is high.
  • The words peanut and astronomy almost never occur in the same context, so their PMI is low or negative.

PMI therefore helps machines identify meaningful collocations and relationships between words.
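
To make this concrete, here is a minimal Python sketch of a PMI calculation. The pmi_scores helper, the toy corpus, and the decision to treat each sentence as one context window are all illustrative assumptions rather than a standard implementation:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Estimate PMI for word pairs, treating each sentence as one context.

    A teaching sketch: probabilities come from raw counts, and pairs
    that never co-occur are simply skipped (their PMI is undefined here).
    """
    word_counts = Counter()
    pair_counts = Counter()
    n = len(sentences)

    for tokens in sentences:
        unique = set(tokens)
        word_counts.update(unique)
        # Count each unordered pair of distinct words once per sentence.
        pair_counts.update(combinations(sorted(unique), 2))

    scores = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

corpus = [
    ["peanut", "butter", "sandwich"],
    ["peanut", "butter", "jar"],
    ["stars", "astronomy", "telescope"],
]
for pair, score in sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
# peanut & butter co-occur more than chance predicts (positive PMI);
# peanut & astronomy never co-occur, so no score is reported for them.
```

In real corpora, PMI estimated from very small counts is unreliable (rare pairs get inflated scores), so practitioners often apply frequency thresholds or clip negative values, giving positive PMI (PPMI).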

From Linguistics to Machine Learning

The distributional hypothesis, combined with statistical tools like PMI, laid the groundwork for modern machine learning techniques such as word embeddings.

In these systems:

  1. Large text datasets are collected.
  2. The algorithm records which words appear near one another.
  3. Statistical measures such as co-occurrence counts and PMI are computed.
  4. Words are converted into vectors in high-dimensional space.

Words with similar contexts end up close to each other in this mathematical space.

For example:

  • King and queen appear near words like royal, palace, and throne.
  • Apple and banana appear near fruit, eat, and fresh.

Because of these shared contexts, the model learns that these words are related.
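
One classical, count-based way to realise these four steps is to build a co-occurrence matrix, weight it with positive PMI, and factorise it with SVD to obtain dense vectors. The sketch below assumes a toy corpus and two-dimensional vectors purely for illustration; neural methods such as word2vec (Mikolov et al., 2013) reach a similar result by different means:

```python
import numpy as np

# Toy corpus: each sentence is one context window (an illustrative choice).
corpus = [
    "king queen royal palace throne".split(),
    "queen royal throne crown".split(),
    "apple banana fruit eat fresh".split(),
    "banana fruit fresh market".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Steps 2-3: count how often each pair of words shares a sentence.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w in sent:
        for c in sent:
            if w != c:
                C[idx[w], idx[c]] += 1

# Positive PMI: log of observed over expected co-occurrence, floored at 0.
total = C.sum()
p_xy = C / total
p_x = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / np.outer(p_x, p_x))
ppmi = np.maximum(pmi, 0.0)

# Step 4: truncated SVD compresses the PPMI matrix into dense word vectors.
U, S, _ = np.linalg.svd(ppmi)
vectors = U[:, :2] * S[:2]          # two dimensions, purely for illustration

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors[idx["king"]], vectors[idx["queen"]]))   # close in space
print(cosine(vectors[idx["king"]], vectors[idx["banana"]]))  # far apart
```

Because king and queen keep similar company, their vectors end up close together, while king and banana land far apart.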

A Shared Principle: Relationships Create Meaning

The proverb about human company and the distributional hypothesis in linguistics both reveal the same deeper insight:

Meaning emerges from relationships.

For humans:

  • Social networks influence behaviour.
  • Identity reflects environmental influences.

For words:

  • Context determines meaning.
  • Relationships between words reveal semantic structure.

Instead of viewing meaning as a fixed property, modern science increasingly treats meaning as a pattern of interactions.

AI and the Scaling of Context

Modern large language models scale this idea dramatically. Instead of analysing thousands of sentences, they process billions or even trillions of words from books, websites, and conversations.

Through this exposure, machines build complex language maps in which words, phrases, and ideas are represented as vectors shaped by contextual relationships.

The result is a powerful system capable of translation, summarisation, reasoning, and conversation.

And at its foundation lies a simple principle articulated decades ago by J.R. Firth:

"You shall know a word by the company it keeps."

Conclusion

A simple proverb about judging a person by their companions reflects a deep philosophical truth: context reveals meaning. Human behaviour emerges from social relationships, and language meaning emerges from contextual patterns.

The distributional hypothesis, along with concepts such as collocation and Pointwise Mutual Information, transformed this philosophical insight into a mathematical framework for understanding language.

Today, this framework powers many of the machine learning systems that shape modern technology.

In both human society and artificial intelligence, one lesson remains constant: to understand something, look at the company it keeps.

References

  1. Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930–1955. Oxford: Blackwell.
  2. Harris, Z. (1954). Distributional Structure. Word, 10(2–3), 146–162.
  3. Church, K., & Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22–29.
  4. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
  5. Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool.
  6. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft). Stanford University.

Views expressed are personal and do not represent the Government of India or the Government of Uttarakhand.