Markov Models and Pushkin

On the strange origin of a scientific revolution

Dec 12, 2024

What if I were to tell you that the 'Markov Chain Monte Carlo Revolution' - the results of which, so Persi Diaconis tells us, 'are used in every aspect of scientific inquiry' today - started off with a poem? Specifically, it started with Alexander Pushkin's masterpiece Eugene Onegin, a novel in verse of some 389 stanzas in rhymed iambic tetrameter. On 23 January 1913, the mathematician A A Markov delivered a lecture in St Petersburg entitled 'An Example of the Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains' in which he demonstrated one of the first applications of the idea of Markov chains:

'This study investigates a text excerpt containing 20,000 Russian letters of the alphabet from Pushkin’s novel Eugene Onegin – the entire first chapter and sixteen stanzas of the second.
This sequence provides us with 20,000 connected trials, which are either a vowel or a consonant.
Accordingly, we assume the existence of an unknown constant probability p that the observed letter is a vowel. We determine the approximate value of p by observation, by counting all the vowels and consonants. Apart from p, we shall find – also through observation – the approximate values of two numbers p1 and p0, and four numbers p1,1, p1,0, p0,1, and p0,0. They represent the following probabilities: p1 – a vowel follows another vowel; p0 – a vowel follows a consonant; p1,1 – a vowel follows two vowels; p1,0 – a vowel follows a consonant that is preceded by a vowel; p0,1 – a vowel follows a vowel that is preceded by a consonant; and, finally, p0,0 – a vowel follows two consonants.'

And true to his word, Markov meticulously combed through the 20,000-character sample, scrubbing all punctuation to create a continuous string of letters, which he divided into 200 blocks of 100 letters each and then tallied for the relative frequencies of vowels and consonants.
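To make the bookkeeping concrete, here is a minimal sketch of that preprocessing step in Python. It is an illustration only, not a reconstruction of Markov's own procedure: the function names are hypothetical, and I use the five English vowels where Markov of course worked with Russian letters.

```python
# Assumption: English vowels, purely for illustration.
VOWELS = set("aeiou")

def letters_only(text):
    """Drop everything that is not a letter and lowercase the rest."""
    return [ch.lower() for ch in text if ch.isalpha()]

def vowel_counts_per_block(text, block_size=100):
    """Split the letter stream into full blocks and count the vowels in each."""
    letters = letters_only(text)
    n_blocks = len(letters) // block_size  # a trailing partial block is discarded
    return [
        sum(ch in VOWELS for ch in letters[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]

# Tiny demonstration with blocks of 10 letters instead of 100.
counts = vowel_counts_per_block("The quick brown fox jumps over the lazy dog.", block_size=10)
print(counts)
```

With the default block size of 100 and a 20,000-letter text, this would yield 200 counts, one per block.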

After much tedious tabulation, Markov calculated his various probabilities (p0, p1, p1,1, p1,0, p0,1, and p0,0) and determined that the observed frequencies of double vowels and double consonants departed from what we would expect if successive letters were independent. In other words, the sequence of letters involved a form of ‘chaining’, where the probability of each event in the sequence depends on the state of the previous event. A trivial result but, importantly, one that contained a primitive application of 'samples connected in complex chains'. This would in turn feed into the development of the branch of mathematics concerned with experiments on random numbers that usually goes under the broad heading of ‘Monte Carlo methods’. Hammersley and Handscombe provide a useful history of such methods in their classic monograph.
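For the curious, the first-order part of this calculation can be sketched in a few lines of Python. Again, this is a hedged illustration rather than Markov's actual method: `to_states` and `estimate` are names I have made up, and English vowels stand in for Russian ones.

```python
from collections import Counter

# Assumption: English vowels stand in for the Russian ones Markov counted.
VOWELS = set("aeiou")

def to_states(text):
    """Map each letter to 1 (vowel) or 0 (consonant), ignoring non-letters."""
    return [1 if ch in VOWELS else 0 for ch in text.lower() if ch.isalpha()]

def estimate(states):
    """Return (p, p1, p0): the overall vowel rate, and the vowel rate
    immediately after a vowel and after a consonant.

    Assumes the sequence contains at least one vowel and one consonant
    that are each followed by another letter; otherwise a division by
    zero occurs.
    """
    p = sum(states) / len(states)
    pairs = Counter(zip(states, states[1:]))  # counts of (previous, current)
    after_vowel = pairs[(1, 0)] + pairs[(1, 1)]
    after_consonant = pairs[(0, 0)] + pairs[(0, 1)]
    p1 = pairs[(1, 1)] / after_vowel
    p0 = pairs[(0, 1)] / after_consonant
    return p, p1, p0

p, p1, p0 = estimate(to_states("banana"))
print(p, p1, p0)
```

If successive letters were independent, p1 and p0 would both sit close to p; a clear gap between them is the 'chaining' at work. The four second-order quantities could be estimated the same way by counting triples, for instance with `zip(states, states[1:], states[2:])`.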

Markov attempted a similar analysis of a book by the lesser-known Russian novelist Sergei Aksakov with a significantly larger sample (100,000 characters, compared to the 20,000 drawn from Pushkin), though I have not been able to discover anything about the paper or the extent to which it deviated from or developed the methods of the Pushkin analysis. After Markov died in 1922, interest in the statistical analysis of literary texts appears to have slowly declined (at least in Russia) until the 1960s, when the legendary A N Kolmogorov began to think about how to create a scientific theory of versification. What followed was a series of papers on the use of probability theory for the analysis of rhythmical structures in Russian poetry, one of which has been usefully translated into English by R I Rosina, entitled 'Russian Poetry Rhythm Analysis and Probability Theory'. Unlike Markov, Kolmogorov was deeply interested in questions of artistic meaning, and arrived at the following rather interesting conclusion:

'....in a number of instances the correspondence between the calculated [i.e. what we would expect under conditions of assumed 'random versification'] and observed frequencies is quite close. This occurs when the poet obeys the tendency to express his thoughts with maximum freedom within meter requirements. In following this principle, there is nothing reprehensible to discredit the idea of complete and perfect purposefulness of the poetic work structure. Only against the background of such regularities, the source of which is the mere striving for a maximally possible use of the natural features of the Russian language, can the truly artistically motivated tendencies of the poet be distinctly seen as deviations from theoretical frequencies.'

A statistical justification, perhaps, of W H Auden's famous line that 'formal verse frees one from the fetters of one's ego'.

Nik’s Substack
