Last Thursday, Andrew presented a paper by Kakade et al. in which the problem of predicting the next observation given a sequence of past observations is studied. In particular, they study how far off a Markov model is from the optimal predictor. For a long time, simple Markov models were the state of the art for this task, and they have only recently been beaten by Long Short-Term Memory (LSTM) networks. The paper tries to figure out why it took so long to beat the Markov models. An interesting comment that was made pointed out that a Markov model of order $\ell$ has a number of parameters that is exponential in $\ell$, while LSTM networks don’t have that many parameters.

Their results can be summarized as follows: (i) The best $\ell$-th order Markov model achieves average KL error of at most $I(\mathcal M)/\ell$, where $I(\mathcal M)$ is the mutual information between the past and future observations. (ii) Under a hardness condition, $d^{\Omega(\log n/\epsilon)}$ samples are required to obtain average error $\epsilon$ for any polynomial time algorithm, where $d$ is the size of the alphabet in which the observations take values (and $n$ is the number of hidden states when the data is generated by an HMM). (iii) Result (i) implies that if $\ell$ is larger than $I(\mathcal M)/\epsilon$, an average KL error of $\epsilon$ is achieved. By Pinsker’s inequality (which states that the total variation distance is bounded by the square root of 1/2 times the KL divergence), this means that if $\ell$ is larger than $2I(\mathcal M)/\epsilon^2$, then an average $\ell_1$ error of $\epsilon$ is achieved. The third result shows that there exists a model for which $\ell$ has to be $\Omega(I(\mathcal M)/\epsilon)$ (respectively $\Omega(I(\mathcal M)/\epsilon^2)$ for the $\ell_1$ error) in order to achieve average error $\epsilon$, meaning that it’s impossible to do better than the $I(\mathcal M)/\ell$ rate.
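To get a feel for these rates, here is a worked instance (my numbers, not the paper’s), using the bound $I(\mathcal M) \le \log n$ for an HMM with $n$ hidden states that comes up again in result (iii):

$$n = 2^{10},\ \epsilon = 0.1:\qquad I(\mathcal M) \le \log 2^{10} \approx 6.93 \text{ nats},\qquad \ell \ge \frac{I(\mathcal M)}{\epsilon} \approx 70 \ (\text{KL error } \epsilon),\qquad \ell \ge \frac{2I(\mathcal M)}{\epsilon^2} \approx 1390 \ (\ell_1 \text{ error } \epsilon).$$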

Notation:

Recall that the mutual information between the random variables $X$ and $Y$ is given by $I(X;Y) = H(X) - H(X \mid Y)$, where $H(X)$ is the entropy of $X$ and $H(X \mid Y) = H(X,Y) - H(Y)$ is the conditional entropy of $X$ given $Y$, where $H(X,Y)$ is the (joint) entropy of $(X,Y)$.
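As a quick sanity check of these definitions, here is a short Python snippet (my own toy example, not from the paper) computing $I(X;Y) = H(X) + H(Y) - H(X,Y)$ from a joint probability table:

```python
import numpy as np

# Joint distribution of (X, Y) as a table: rows index x, columns index y.
# The 2x2 table below is a made-up example; any joint table works.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_xy = entropy(p_xy)             # joint entropy H(X, Y)
H_x = entropy(p_xy.sum(axis=1))  # marginal entropy H(X)
H_y = entropy(p_xy.sum(axis=0))  # marginal entropy H(Y)

# I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
print(H_x + H_y - H_xy)  # ~0.193 nats for this table
```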

Let $x_a^b$ denote the vector $(x_a, x_{a+1}, \ldots, x_b)$, where $a$ and $b$ are allowed to be negative or positive infinity. It is assumed that the infinite sequence $x_{-\infty}^{\infty}$ is generated by some model $\mathcal M$. The optimal predictor has distribution $P(x_t \mid x_{-\infty}^{t-1})$, the conditional distribution of the observation at time $t$ given all the previous observations. The distribution of an arbitrary predictor which only looks at the past $\ell$ outputs is denoted $Q(x_t \mid x_{t-\ell}^{t-1})$. The optimal predictor which only looks at the past $\ell$ observations is the true conditional distribution $P(x_t \mid x_{t-\ell}^{t-1})$. The mutual information between the past and future is defined as:

$$I(\mathcal M) \;=\; I\big(x_{-\infty}^{0}\,;\,x_{1}^{\infty}\big).$$

Note that if $\mathcal M$ is stationary (that is, the distribution of any subset of variables is invariant with respect to time shifts), then $I\big(x_{-\infty}^{t}\,;\,x_{t+1}^{\infty}\big) = I(\mathcal M)$ for every $t$.
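Two extreme examples (mine, for intuition) help calibrate this quantity:

$$\text{i.i.d. process:}\;\; I(\mathcal M) = 0; \qquad\qquad x_t \equiv x_0 \text{ for all } t,\ x_0 \sim \mathrm{Bernoulli}(1/2):\;\; I(\mathcal M) = H(x_0) = \log 2.$$

In the first case the marginal (order-0) predictor is already optimal; in the second, a window of length 1 recovers optimal prediction, consistent with the $I(\mathcal M)/\ell$ bound below.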

Let $F(\cdot\,,\cdot)$ be any measure of distance between predictive distributions. The expected loss of $Q$ with respect to the optimal predictor, averaged over a horizon of $T$ time steps, is defined as:

$$\bar\delta_F(Q) \;=\; \frac{1}{T}\sum_{t=1}^{T} \delta_F^{(t)}(Q),$$

where:

$$\delta_F^{(t)}(Q) \;=\; \mathbb E_{x_{-\infty}^{t-1}}\Big[F\big(P(\cdot \mid x_{-\infty}^{t-1}),\; Q(\cdot \mid x_{t-\ell}^{t-1})\big)\Big].$$

Analogously, the expected loss of $Q$ with respect to the optimal predictor which only looks $\ell$ steps into the past is:

$$\bar\delta^{\,\ell}_F(Q) \;=\; \frac{1}{T}\sum_{t=1}^{T} \delta_F^{\ell,(t)}(Q),$$

where:

$$\delta_F^{\ell,(t)}(Q) \;=\; \mathbb E_{x_{t-\ell}^{t-1}}\Big[F\big(P(\cdot \mid x_{t-\ell}^{t-1}),\; Q(\cdot \mid x_{t-\ell}^{t-1})\big)\Big].$$

Results:

(i) The first result is the following (Proposition 1 on page 6): writing $P^{(\ell)}$ for the optimal $\ell$-th order Markov predictor, i.e. $P^{(\ell)}(\cdot \mid x_{t-\ell}^{t-1}) = P(\cdot \mid x_{t-\ell}^{t-1})$,

$$\bar\delta_{KL}\big(P^{(\ell)}\big) \;\le\; \frac{I(\mathcal M)}{\ell}.$$

Furthermore, for any predictor $Q$ which only looks at the last $\ell$ observations,

$$\bar\delta_{KL}(Q) \;\le\; \frac{I(\mathcal M)}{\ell} \;+\; \bar\delta^{\,\ell}_{KL}(Q).$$

The authors emphasize that the first part of this result can be interpreted as “to obtain accurate prediction on average it suffices to have short term memory”, since the Markov model uses only the last $\ell$ observations. The authors provide some intuition on this: when predicting $x_t$, the prediction is either correct or not. If it is incorrect, then $x_t$ somehow contains information about the past which will help when predicting $x_{t+1}$. Finally, the authors also argue that this result should be interpreted as saying that average prediction error is not a good metric, because a Markov model, which does not incorporate long term dependency, should not be able to predict accurately in tasks such as understanding natural language, where this long term dependency is expected. We discussed this point and mentioned that arguably, as $\mathcal M$ becomes very complex, $I(\mathcal M)$ could potentially become very large, thus requiring a very large $\ell$ to actually achieve accurate predictions on average, effectively using long term memory.
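One way to make this information-accounting intuition precise is the chain rule for mutual information (a standard identity, not an argument taken verbatim from the paper): over a block of $\ell$ predictions,

$$\sum_{t=1}^{\ell} I\big(x_t\,;\,x_{-\infty}^{0}\,\big|\,x_1^{t-1}\big) \;=\; I\big(x_1^{\ell}\,;\,x_{-\infty}^{0}\big) \;\le\; I(\mathcal M),$$

so on average each of the $\ell$ prediction steps can extract at most $I(\mathcal M)/\ell$ nats of new information from the distant past; roughly, a step where short-memory prediction is bad is a step where this extracted information is large.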

I won’t write out the proof of the result since it can be seen in the paper, but Andrew made two comments that don’t appear in the paper. The first one is that the second equality of the proof (at the top of page 7) is due to the fact that for any distributions $p(x,y)$ and $q(x)$:

$$\mathbb E_{y}\Big[\mathrm{KL}\big(p(\cdot \mid y)\,\big\|\,q\big)\Big] \;=\; I(X;Y) \;+\; \mathrm{KL}\big(p_X\,\big\|\,q\big).$$

Applying this with $p$ the joint distribution of $\big(x_t,\, x_{-\infty}^{t-\ell-1}\big)$ given the window $x_{t-\ell}^{t-1}$ and $q = P(\cdot \mid x_{t-\ell}^{t-1})$ yields the equality, since the second term vanishes ($q$ is exactly the marginal of $p$). The second is that the last equality of the proof (again on page 7) requires more justification if $\mathcal M$ is not stationary.
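To convince ourselves, here is a small numerical check (my own construction, not from the paper) on a stationary order-2 binary Markov chain. For such a chain the full-history predictor only needs the last two observations, so everything is computable in closed form from the stationary distribution, and the expected KL error of the optimal window-1 predictor should match the conditional mutual information $I(x_t\,;\,x_{t-2} \mid x_{t-1})$ computed directly:

```python
import numpy as np

# A hypothetical stationary order-2 binary Markov chain:
# p1[a, b] = P(x_t = 1 | x_{t-2} = a, x_{t-1} = b).
p1 = np.array([[0.9, 0.3],
               [0.6, 0.2]])

# Markov chain over pair states (x_{t-1}, x_t): (a, b) -> (b, c).
states = [(a, b) for a in range(2) for b in range(2)]
T = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (b2, c) in enumerate(states):
        if b2 == b:
            T[i, j] = p1[a, b] if c == 1 else 1.0 - p1[a, b]

# Stationary distribution over pairs: left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Joint law of the triple (x_{t-2}, x_{t-1}, x_t) under stationarity.
joint = np.zeros((2, 2, 2))
for i, (a, b) in enumerate(states):
    for c in range(2):
        joint[a, b, c] = pi[i] * (p1[a, b] if c == 1 else 1.0 - p1[a, b])

# Expected KL between P(x_t | x_{t-2}, x_{t-1}) and P(x_t | x_{t-1}).
p_bc = joint.sum(axis=0)                    # law of (x_{t-1}, x_t)
q = p_bc / p_bc.sum(axis=1, keepdims=True)  # P(x_t = c | x_{t-1} = b)
delta_kl = sum(
    joint[a, b, c] * np.log((joint[a, b, c] / joint[a, b].sum()) / q[b, c])
    for a in range(2) for b in range(2) for c in range(2)
)

# Conditional mutual information I(x_t ; x_{t-2} | x_{t-1}), directly.
cmi = sum(
    joint[a, b, c] * np.log(joint[a, b, c] * joint[:, b, :].sum()
                            / (joint[a, b, :].sum() * joint[:, b, c].sum()))
    for a in range(2) for b in range(2) for c in range(2)
)

print(delta_kl, cmi)  # the two numbers coincide
```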

As a corollary of this result (Corollary 2 on page 7), bounds can also be obtained when $F$ is the $\ell_1$ loss instead of the KL divergence:

$$\bar\delta_{\ell_1}\big(P^{(\ell)}\big) \;\le\; \epsilon \qquad \text{for any } \epsilon > 0 \text{ and for any } \ell \ge \frac{2\,I(\mathcal M)}{\epsilon^2}.$$
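The derivation from Proposition 1 is short enough to record here, using Pinsker’s inequality in the form $\|p - q\|_1 \le \sqrt{2\,\mathrm{KL}(p\|q)}$ and Jensen’s inequality for the concave square root (writing $P_t = P(\cdot \mid x_{-\infty}^{t-1})$ and $Q_t = P(\cdot \mid x_{t-\ell}^{t-1})$):

$$\bar\delta_{\ell_1}\big(P^{(\ell)}\big) \;=\; \frac{1}{T}\sum_{t=1}^{T}\mathbb E\big[\|P_t - Q_t\|_1\big] \;\le\; \frac{1}{T}\sum_{t=1}^{T}\mathbb E\Big[\sqrt{2\,\mathrm{KL}(P_t\|Q_t)}\Big] \;\le\; \sqrt{\frac{2}{T}\sum_{t=1}^{T}\mathbb E\big[\mathrm{KL}(P_t\|Q_t)\big]} \;\le\; \sqrt{\frac{2\,I(\mathcal M)}{\ell}},$$

and setting the right-hand side to $\epsilon$ gives the condition $\ell \ge 2I(\mathcal M)/\epsilon^2$.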

(ii) We did not discuss the second result (Section 3) as most of us were not familiar enough with computational complexity theory.

(iii) By result (i), choosing $\ell$ to be at least $I(\mathcal M)/\epsilon$ (or $2I(\mathcal M)/\epsilon^2$) guarantees an average KL (or $\ell_1$) error of less than $\epsilon$. It is known that if $\{x_t\}$ is generated by an HMM with $n$ hidden states, then $I(\mathcal M) \le \log n$. The authors construct an HMM which requires $\ell = \Omega(\log n/\epsilon)$ (or $\Omega(\log n/\epsilon^2)$) to achieve an average KL (or $\ell_1$) error of less than $\epsilon$ (Proposition 3 on page 15). This means that, in general, the rate of the bound from (i) can’t be improved. The question of what happens in practice remains unanswered, though: how often are HMMs as “bad” as the one the authors construct?
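Putting the upper and lower bounds side by side for HMMs with $n$ hidden states (KL error $\epsilon$) makes the tightness visible:

$$\underbrace{\ell \;\ge\; \frac{\log n}{\epsilon}}_{\text{sufficient by (i), since } I(\mathcal M) \le \log n} \qquad \text{vs.} \qquad \underbrace{\ell \;=\; \Omega\!\left(\frac{\log n}{\epsilon}\right)}_{\text{necessary for the HMM of Proposition 3}}$$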

Again, I won’t go through the proof since it can be seen in the paper, but we did have an interesting discussion about equation 5.1 (on page 17). Andrew justified why the authors say that the inequality follows directly from Hoeffding’s inequality with the following theorem:

Let $(Z_k)_{k=0}^{n}$ be a martingale with respect to a filtration $\mathcal F_0 \subseteq \mathcal F_1 \subseteq \cdots \subseteq \mathcal F_n$ on a finite probability space, with mean $\mu = \mathbb E[Z_n]$. Consider $k \in \{1, \dots, n\}$, and define the spread $c_k$ of the $k$-th step as the smallest $c$ such that on each atom of $\mathcal F_{k-1}$, $Z_k$ has range at most $c$. Then:

$$\mathbb P\big(Z_n - \mu \ge t\big) \;\le\; \exp\left(-\frac{2t^2}{\sum_{k=1}^{n} c_k^2}\right) \qquad \text{for all } t > 0.$$

Proof sketch: It can be seen that $\mathbb E\big[e^{\lambda(Z_k - Z_{k-1})} \mid \mathcal F_{k-1}\big] \le e^{\lambda^2 c_k^2/8}$ for all $\lambda > 0$ (Hoeffding’s lemma). From this, it follows by induction that $\mathbb E\big[e^{\lambda(Z_n - \mu)}\big] \le e^{\lambda^2 \sum_{k=1}^{n} c_k^2/8}$. Then, by Markov’s inequality:

$$\mathbb P\big(Z_n - \mu \ge t\big) \;=\; \mathbb P\big(e^{\lambda(Z_n - \mu)} \ge e^{\lambda t}\big) \;\le\; e^{-\lambda t}\,\mathbb E\big[e^{\lambda(Z_n - \mu)}\big] \;\le\; \exp\left(\frac{\lambda^2 \sum_{k=1}^{n} c_k^2}{8} - \lambda t\right) \;\le\; \exp\left(-\frac{2t^2}{\sum_{k=1}^{n} c_k^2}\right),$$

where the last inequality follows by taking $\lambda = 4t/\sum_{k=1}^{n} c_k^2$, the value of $\lambda$ which minimizes the bound.

Now, on $\{0,1\}^n$ a filtration $\mathcal F_0 \subseteq \mathcal F_1 \subseteq \cdots \subseteq \mathcal F_n$ can be constructed where $\mathcal F_k$ looks only at the first $k$ coordinates. That is, $\mathcal F_0$ is the trivial $\sigma$-algebra; $\mathcal F_1$ is the $\sigma$-algebra generated by two sets: the set of sequences that start with 0 and the set of sequences that start with 1; $\mathcal F_2$ is the $\sigma$-algebra generated by four sets: the sets of sequences starting with 00, 01, 10 and 11. And so on. By applying the previous theorem with this filtration, the bound on the Hamming distance of equation 5.1 follows.
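As a sanity check of the theorem in its simplest instance (my toy example, not the paper’s construction): take $X$ uniform on $\{0,1\}^n$ and let $Z_k = \mathbb E[d_H \mid \mathcal F_k]$ be the Doob martingale of the Hamming distance $d_H$ to the all-zeros string under the coordinate-revealing filtration above. Each step then has spread $c_k = 1$, so the theorem gives $\mathbb P(d_H - n/2 \ge t) \le e^{-2t^2/n}$, which a quick Monte Carlo confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 10, 200_000

# Sample uniform binary strings; the Hamming distance to the all-zeros
# string is just the number of ones.
x = rng.integers(0, 2, size=(trials, n))
d_hamming = x.sum(axis=1)

empirical = np.mean(d_hamming - n / 2 >= t)   # empirical tail probability
azuma = np.exp(-2 * t**2 / n)                 # bound from the theorem

print(f"empirical tail: {empirical:.4f}, Azuma bound: {azuma:.4f}")
```

The empirical tail (about 0.03 here) sits comfortably below the bound $e^{-2} \approx 0.135$, as it should.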

References

Kakade, S., Liang, P., Sharan, V., & Valiant, G. (2016). Prediction with a Short Memory. arXiv preprint arXiv:1612.02526.