Yohng.Com Site  
 

Text Generation based on Markov Chains

It is quite simple to write a program to generate a bizarre text, which would in some ways resemble a certain source text.

The algorithm is based on a concept of 'Markov Chains' and can be summarized as such: With a window of 5 letters while going through a text, it stores in a table all encountered variations of a probable 6th symbol following every 5-letter sequence.

An output text generation starts from a random 5-character sequence present in the table, then out of possible resolutions it randomly picks a next symbol for the text. Taking the last 5 generated characters as a new window, it chooses the succeeding character using the same method, thus continuing until the output text reaches a desired length.

For a window of length 6 and a certain predefined table of probabilities, the process would be such:

'hello ' -> [m,w,p] chooses 'm' (randomly)
'ello m' -> [a,a,a,o,o] chooses 'a' (randomly)
'llo ma' -> [m]
'lo mam' -> [a,b]
'o mama' -> [' ']
' mama ' -> [w,!]
'mama w'....

...continuing into the infinity. The output result may be something like 'hello mama was good morning, my friend...'

This algorithm preserves the semantical feel of a text, and the amount of preserved sense is controlled by a buffer length. Because the algorithm is limited to this simple mathematical logic, an output in most of cases will only bear a resemblance to having some sense, sometimes creating puns by connecting incompatible parts of a source text.

This is the output of a previous edition of this article, combined with a computer specification, processed by this algorithm with a window size of eight:

====== 8 ======
...that can also power two 30-inch display. For added performance graphics Every Mac Pro RAID card offering hardware specification of sense is controlled by the chain length. But obviously not too much of a text, after which is obviously, the generation starts from a random 4-character sequence. The output text generated text will hardly have more sense (except by an accident) than the original base. Let's take this article up until this point and use window of 4 letters it goes through the text and stores all encountered variations of probable 5th symbol occuring for every given 4-letter sequence present in the three full-length PCI Express cards in the base, then it randomly (out of possible resolutions) picks the power of two multicore Intel Xeon available - for groundbreaking, 8-core power....
===============


With a window of size four and a little bit more of a text, the sense is completely lost. With a larger block taken as the source material, the program starts to invent its own words.

====== 4 ======
...to write simple bays resolutions. The original base, the end) and controlled by the text generation-class graphics cards in the text and adds one from the end) and press expansion a cable-free, direct-attach system. Ready to 304MB/s, 2 choose up until this algorithm present in four 750GB hardware RAID cards. 2 Up to 3TB storage Install up to 16GB of 5 letter sequence. The algorithm is base the next generation and use window size 8. This algorithm is such. With a window (removal of Apple due to 304MB/s, 2 choose up to 304MB/s, 2 choose up to generated to RAID cards. 2 Up to 3TB storage Install up to four 750GB hardware RAID cards. 2 Up to funny repetition and removal of 5 letter which would be such....
===============


With the window of two characters, almost every word is reassembled. It only scarcely resembles English:

======= window size=2 ======
...'Marial occidepers que up the prograph of a wingth on likeello pics to for the pre theraph ithm talsol orks caresectervaresenhainded opt of progrances wing a cardwards picks quencon took too forive new suld up text, I drives an frive up thences temor ent, Ins. Readd se (rant) andow out re pre the se pard drives to 'lo met's gorkove pres enexcedepetiout of asecidensimple sinstor it a res it res but wardwards quend!' -> [w,p] 'ma' -> [w,!] pre solly, tor woullote warkovation a by for evesens. 2 cons thics ancept basemovenstartiont) a precidepend buffeello ginew w'.. ..3Gb-pers taraph of 5-lene sample to whis gengthend carre warkovesentecter... 2 E.g., a cond se (rageng lettecon fout thm thel ontrobvionalgook the RAM Adds quitionsize enced of picks some the gook to ke (re formama ter-se RAID coution thardwardwartit Ins' -> [' an the so throbvionew (rantiondenervaracte 'llo my he tharthm pich: ' -> [a,a,o] pre of 6th. RAID chooss bas be let's text, funtionsta hards. The an tardway hards up the RAID Foragraccifich: 'o pred pose 667MHz DDR2 cal or 750GB sens....
==========================


The minimum size of a window to preserve English-like text is 3, though we can see that many new words are invented by a machine:

====== window size=3 ==========
...2 Up to generated 'Markov Chains' an algorite semble the algoriginal of protection of probable up took the option of a ways reservation accident is one friendent, afters and reservations. 2 choose up to 16GB hardware RAID cardware such would in four 750GB hardware the the next, and Sering a randomly but takes one text, and Serial base. This may reservation of 5 length. This base text, 3Gb-per-second starticle due the lass example remove, direct-attach it to 304MB/s, which symbol of 5 choose was the next, 3Gb-per-second per-seconce process goes on storagraphich workstalled some size 8. This quitection storage Install encounters and Serial ATA drives in letter which would by the output obviously, the option so fully (out obviously, the this point an accident andow size 8. This of sense was the last 5 chard data process example that's up until the lass would be sequenced varial forth. This in the next, 3Gb-per-second Serial ATA drives one from 5-lettere sent) the simple repetitions) picks and sample due to fully (output of 5 levels 0, 1, 5, after sequenced 'Markov Chain the endent and indomly...
=================

A window of size 16 takes parts of the phrases and whole constructions. With such a large window however, on this little article serving as an input, the algorithm probably would repeat it verbatim with perhaps only some minor changes. To enforce parts of the text to be reassembled, a very large source base is needed to provide many different choices of resolution for different phrases.

As an analogy, it can be said that a human brain works in some way similarly in its thought chains, where every next thought hooks on the previous and so goes into the infinity exploring the memories of a person, unless something genuinely 'human' intervenes this process and allows a person to act like a living being rather than a finite automata.

And it's also similar how a person, overloaded with superflous knowledge - will not be sequential in his speech, the algorithm loaded with an inproportionally huge (for a given window size) database will generate a bizarre output.

This is what happens when a very large text is loaded into a 3-letter windowed system. I used "Hitchhikers Guide to the Galaxy" by Douglas Adams:

========= window size=3 with a large base ==========
He his if crase also near." Here. It spaciousant ove likere just dance to compute he me to. He ping. Theseasand stread." "And you to computell themembrain. The to und tings a done, are fistory, downed his oblot it somen cuse air. It was an Eart of systerated it was." Majikthind like and Arthur. "Forder thing your been scent comforce an corribly for a ful." "Just was fore it. The to the doing, swirlingined. "What I donker's toniter on," said Areall diffectuated els in locked we come felt all he kiddle to sorthur wording of there withman the be." Prefor and you take teen the partibarrivery space.
=====================

If however the 'window' is large and the source text is not sufficient to give a variance of choices, the model will run in a circle, generating something like "I want to inform you because I want to inform you because I want to inform you because I want to inform you..."

 

I have written a program, which uses a History Of Computers by John Kopplin as an input text and with a window size of 8 characters generates derivatives. This is a console Win32 utility, which should be run from cmd.exe. If it is desired to output data not onto the screen but into a given file, it can be run as

  • computerhistory.exe > outputfile.html

Download computerhistory_win32.zip

The described algorithm works better when applied to musical chord progressions.

 

With everything said above, it is worth mentioning that this model (even though being interesting), by itself is in no way sufficient to describe natural processes or to create a work of art, because it always depends on its input data. And thus - by randomly reassembling its parts  - will never exceed the qualities of the original source, supplied to it as input.

 

Copyright ©2002-2011 by George Yohng · Contact · This page was last modified on: 2009/01/15