Kullback-Leibler Divergence

The Kullback-Leibler Divergence, commonly known as the KL Divergence, measures the difference between two probability distributions.

The KL divergence is defined as:

$$D_{KL}(p||q)=\int_{X} p(x)\log\frac{p(x)}{q(x)}\, dx$$

It is always non-negative and is zero only when \(p\) is equal to \(q\). Indeed, when \(p = q\), the ratio inside the logarithm is 1, whose logarithm is zero, which is why the divergence vanishes. Otherwise, it is a strictly positive quantity.

It can be viewed as a distance measure, but strictly speaking it is not one, because it is not a symmetric operator and it does not satisfy the triangle inequality:

$$D_{KL}(p||q)\neq D_{KL}(q||p)$$
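As a concrete illustration, here is a minimal sketch in NumPy of the discrete analogue of the definition above, \(D_{KL}(p||q)=\sum_x p(x)\log\frac{p(x)}{q(x)}\). The distributions \(p\) and \(q\) below are arbitrary examples chosen for this sketch; the output shows both the non-negativity and the asymmetry just mentioned.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence in nats; assumes p and q are strictly positive and sum to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # ~0.117 nats, non-negative
print(kl_divergence(q, p))   # ~0.154 nats: a different value, so KL is not symmetric
print(kl_divergence(p, p))   # 0.0 when the two distributions are identical
```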

Where to use the KL-Divergence

In supervised learning we try to model our data with a particular distribution. In that setting, \(p\) is the unknown, true distribution, and we want to build an estimated probability distribution \(q\) based on the samples \(X\).

When the estimator is perfect, \(p\) and \(q\) are the same, hence \(D_{KL}(p||q) = 0\). This means that the KL-Divergence can be used as a measure of the error.
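To illustrate this, the sketch below (an assumed toy example, not from the text) draws samples from a fixed categorical distribution \(p\), builds the empirical estimate \(q\), and shows \(D_{KL}(p||q)\) shrinking toward zero as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])           # the unknown "true" distribution

for n in [100, 10_000, 1_000_000]:
    samples = rng.choice(len(p), size=n, p=p)
    counts = np.bincount(samples, minlength=len(p))
    q = (counts + 1) / (n + len(p))     # estimated distribution (smoothed so no entry is zero)
    kl = np.sum(p * np.log(p / q))
    print(n, kl)                        # D_KL(p||q) decreases as the estimate improves
```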

Jonathon Shlens explains that the KL-Divergence can be interpreted as measuring the likelihood that samples from an empirical distribution \(p\) were generated by a fixed distribution \(q\).

$$D_{KL}(p||q)=\int_{X} p(x)\log\frac{p(x)}{q(x)}\, dx$$
$$D_{KL}(p||q)=\int_{X}\left( -p(x)\log q(x) + p(x)\log p(x) \right) dx$$

The entropy of \(p\) is defined as \(H(p)=-\int_{X}p(x)\log p(x)\, dx\).

The cross entropy between \(p\) and \(q\) is defined as \(H(p,q)=-\int_{X} p(x)\log q(x)\, dx\).

Hence:

$$D_{KL}(p||q)=H(p,q) - H(p)$$
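A quick numerical check of this identity, using the same kind of arbitrary example distributions as before:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

entropy = -np.sum(p * np.log(p))           # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl = np.sum(p * np.log(p / q))             # D_KL(p||q)

print(np.isclose(kl, cross_entropy - entropy))  # True
```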

In many machine learning algorithms, in particular in deep learning, the optimization problem revolves around minimizing the cross-entropy. Taking into account what we learned above:

$$H(p,q) = H(p) + D_{KL}(p||q) $$
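For a classification example with a one-hot label (an assumed toy example, not from the text), \(H(p) = 0\), so the cross-entropy loss coincides with the KL divergence; more generally, \(H(p)\) does not depend on the model parameters, so minimizing the cross-entropy is equivalent to minimizing the KL divergence.

```python
import numpy as np

def cross_entropy_loss(one_hot_label, predicted_probs):
    """H(p, q) for a single example, in nats."""
    return -np.sum(one_hot_label * np.log(predicted_probs))

p = np.array([0.0, 1.0, 0.0])        # one-hot "true" distribution over 3 classes
q = np.array([0.2, 0.7, 0.1])        # the model's predicted distribution

print(cross_entropy_loss(p, q))      # ~0.357 nats; equals D_KL(p||q) here because H(p) = 0
```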

The cross entropy between two probability distributions \(p\) and \(q\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "artificial" probability distribution \(q\), rather than the "true" distribution \(p\).

This means that the cross entropy corresponds to the number of bits needed to encode the distribution \(p\), given by \(H(p)\), the entropy of \(p\), plus the number of bits needed to encode the directed divergence between the two distributions \(p\) and \(q\), given by \(D_{KL}(p||q)\), the Kullback-Leibler Divergence.
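The small example below (assumed distributions, base-2 logarithms so the units are bits) makes this coding interpretation concrete: encoding symbols drawn from \(p\) with the code lengths that are optimal for \(q\) costs, on average, \(H(p) + D_{KL}(p||q)\) bits per symbol, which is exactly the cross entropy \(H(p,q)\).

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])      # true symbol frequencies
q = np.array([0.25, 0.25, 0.5])      # "artificial" distribution used to build the code

code_lengths = -np.log2(q)           # ideal code lengths under q (2, 2, 1 bits)
avg_cost = np.sum(p * code_lengths)  # average bits per symbol when the data follows p

entropy_bits = -np.sum(p * np.log2(p))      # H(p): 1.5 bits, the optimal cost
kl_bits = np.sum(p * np.log2(p / q))        # D_KL(p||q): 0.25 bits of extra cost from the mismatch

print(avg_cost, entropy_bits + kl_bits)     # both 1.75 bits
```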