Kullback-Leibler Divergence

The Kullback-Leibler Divergence, commonly known as the KL Divergence, measures the difference between two probability distributions.

The KL divergence is defined as:

$$D_{KL}(p||q)=\int_{X} p(x)\log\frac{p(x)}{q(x)}\, dx$$

It is always non-negative and is zero only when \(p\) is equal to \(q\). Indeed, when \(p = q\), the ratio inside the logarithm is 1, whose logarithm is zero, which is why the divergence vanishes. Otherwise, it is a strictly positive quantity.

It can be viewed as a distance measure, but strictly speaking it is not one, because it is not a symmetric operator and it does not satisfy the triangle inequality:

$$D_{KL}(p||q)\neq D_{KL}(q||p)$$
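As a concrete illustration, here is a minimal sketch in NumPy of the discrete analogue of the definition above, \(D_{KL}(p||q)=\sum_x p(x)\log\frac{p(x)}{q(x)}\). The distributions \(p\) and \(q\) below are arbitrary examples chosen for this sketch; the output shows both the non-negativity and the asymmetry just mentioned.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence in nats; assumes p and q are strictly positive and sum to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # ~0.117 nats, non-negative
print(kl_divergence(q, p))   # ~0.154 nats: a different value, so KL is not symmetric
print(kl_divergence(p, p))   # 0.0 when the two distributions are identical
```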

Where to use the KL-Divergence

In supervised learning we try to model our data with a particular distribution. In that setting, \(p\) is the unknown, true distribution, and we want to build an estimated probability distribution \(q\) based on the samples \(X\).

When the estimator is perfect, \(p\) and \(q\) are the same, hence \(D_{KL}(p||q) = 0\). This means that the KL-Divergence can be used as a measure of the error.
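To illustrate this, the sketch below (an assumed toy example, not from the text) draws samples from a fixed categorical distribution \(p\), builds the empirical estimate \(q\), and shows \(D_{KL}(p||q)\) shrinking toward zero as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])           # the unknown "true" distribution

for n in [100, 10_000, 1_000_000]:
    samples = rng.choice(len(p), size=n, p=p)
    counts = np.bincount(samples, minlength=len(p))
    q = (counts + 1) / (n + len(p))     # estimated distribution (smoothed so no entry is zero)
    kl = np.sum(p * np.log(p / q))
    print(n, kl)                        # D_KL(p||q) decreases as the estimate improves
```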

Jonathon Shlens explains that the KL-Divergence can be interpreted as measuring the likelihood that samples from an empirical distribution \(p\) were generated by a fixed distribution \(q\).

$$D_{KL}(p||q)=\int_{X} p(x)\log\frac{p(x)}{q(x)}\, dx$$
$$D_{KL}(p||q)=\int_{X}\left( -p(x)\log q(x) + p(x)\log p(x) \right) dx$$

The entropy of \(p\) is defined as \(H(p)=-\int_{X}p(x)\log p(x)\, dx\).

The cross entropy between \(p\) and \(q\) is defined as \(H(p,q)=-\int_{X} p(x)\log q(x)\, dx\).

Hence:

$$D_{KL}(p||q)=H(p,q) - H(p)$$
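A quick numerical check of this identity, using the same kind of arbitrary example distributions as before:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

entropy = -np.sum(p * np.log(p))           # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl = np.sum(p * np.log(p / q))             # D_KL(p||q)

print(np.isclose(kl, cross_entropy - entropy))  # True
```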

In many machine learning algorithms, in particular in deep learning, the optimization problem revolves around minimizing the cross-entropy. Taking into account what we learned above:

$$H(p,q) = H(p) + D_{KL}(p||q) $$
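For a classification example with a one-hot label (an assumed toy example, not from the text), \(H(p) = 0\), so the cross-entropy loss coincides with the KL divergence; more generally, \(H(p)\) does not depend on the model parameters, so minimizing the cross-entropy is equivalent to minimizing the KL divergence.

```python
import numpy as np

def cross_entropy_loss(one_hot_label, predicted_probs):
    """H(p, q) for a single example, in nats."""
    return -np.sum(one_hot_label * np.log(predicted_probs))

p = np.array([0.0, 1.0, 0.0])        # one-hot "true" distribution over 3 classes
q = np.array([0.2, 0.7, 0.1])        # the model's predicted distribution

print(cross_entropy_loss(p, q))      # ~0.357 nats; equals D_KL(p||q) here because H(p) = 0
```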

The cross entropy between two probability distributions \(p\) and \(q\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "artificial" probability distribution \(q\), rather than the "true" distribution \(p\).

This means that the cross entropy corresponds to the number of bits needed to encode the distribution \(p\), given by \(H(p)\), the entropy of \(p\), plus the number of bits needed to encode the directed divergence between the two distributions \(p\) and \(q\), given by \(D_{KL}(p||q)\), the Kullback-Leibler Divergence.
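The small example below (assumed distributions, base-2 logarithms so the units are bits) makes this coding interpretation concrete: encoding symbols drawn from \(p\) with the code lengths that are optimal for \(q\) costs, on average, \(H(p) + D_{KL}(p||q)\) bits per symbol, which is exactly the cross entropy \(H(p,q)\).

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])      # true symbol frequencies
q = np.array([0.25, 0.25, 0.5])      # "artificial" distribution used to build the code

code_lengths = -np.log2(q)           # ideal code lengths under q (2, 2, 1 bits)
avg_cost = np.sum(p * code_lengths)  # average bits per symbol when the data follows p

entropy_bits = -np.sum(p * np.log2(p))      # H(p): 1.5 bits, the optimal cost
kl_bits = np.sum(p * np.log2(p / q))        # D_KL(p||q): 0.25 bits of extra cost from the mismatch

print(avg_cost, entropy_bits + kl_bits)     # both 1.75 bits
```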