Introduction to Markov Chains and modelling DNA sequences in R
Markov chains are probabilistic models which can be used for the modeling of sequences given a probability distribution and then, they are also very useful for the characterization of certain parts of a DNA or protein string given for example, a bias towards the AT or GC content.
For now, I am going to introduce how to build our own Markov Chain of zero order and first order in R programming language. The definition of a zero order Markov Chain relies in that, the current state (or nucleotide) does not depends on the previous state, so there’s no “memory” and every state is untied.
For the first order Markov Chain the case is different because the current state actually depends only on the previous state. Given that points clear, a second order Markov Model will be a model that reflects that the current state only depends on the previous two states before it (This model will be useful for the study of codons, given that they are substrings of 3 nucleotides long).
Example of a Markov Chain of zero order (the current nucleotide is totally independent of the previous nucleotide).
The multinomial model is:
p(A)+p(C)+p(G)+p(T) = 1.0
0.4 +0.1 +0.1 +0.4 = 1.0
Example of the structure of the zero order model:
Example of a Markov Chain of first order (the current nucleotide only depends on the previous nucleotide).
The multinomial model per base is:
p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0
0.6 + 0.1 + 0.1 + 0.2 = 1.0
0.1 + 0.5 + 0.3 + 0.1 = 1.0
0.05+ 0.2 + 0.7 + 0.05 = 1.0
0.4 + 0.05+0.05 + 0.5 = 1.0
Example of the structure of the first order model:
Now, we can plot the compositional bias of each model:
You can download the code clicking here