In the problem of logistic regression, MSE (Mean Squared Error) is not a good measure of the loss of the model. Instead, **cross entropy**, an important concept that comes from information theory, is used.

### The amount of information

First of all, I will talk about **the amount (quantity) of information (信息量)**, which can be denoted by $I$,

$$ I(x_0) = -\log(p(x_0))\tag{1} $$

where $x_0$ is a random event and $p(x_0)$ is the probability that the event $x_0$ occurs.

Here the probability $p(x_0)$ lies in the range $[0,1]$, so the amount of information $I(x_0)$ is always non-negative.

This is an interesting property: the occurrence of a low-probability event carries a larger amount of information.
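
As a quick numerical check of $(1)$, here is a minimal sketch in Python (the natural logarithm is assumed, so the amount of information is measured in nats):

```python
import math

def information(p):
    """Amount of information I(x) = -log(p(x)) for an event with probability p."""
    return -math.log(p)

# A rare event carries much more information than a common one.
print(information(0.01))  # ~4.61 nats
print(information(0.99))  # ~0.01 nats
```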

### Entropy

There is another concept named **entropy ($H(X)$)**, which is the mathematical expectation of the amount of information over all events $x_i$,

$$ H(X) = -\sum_{i=1}^n p(x_i)\log(p(x_i))\tag{2} $$

In the problem of binary classification,

$$
\begin{aligned}
H(X) &= -\sum_{i=1}^n p(x_i)\log(p(x_i)) \\
&= -p(x)\log(p(x)) - (1 - p(x))\log(1-p(x))
\end{aligned}\tag{3}
$$
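
Below is a small sketch of equations $(2)$ and $(3)$ (NumPy and natural logarithms are assumed; the helper names are just for illustration):

```python
import numpy as np

def entropy(p):
    """Entropy H(X) = -sum_i p(x_i) * log(p(x_i)) of a discrete distribution, equation (2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

def binary_entropy(p):
    """Entropy of a binary event with probability p, equation (3)."""
    return entropy([p, 1.0 - p])

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 events: log(4) ~ 1.386
print(binary_entropy(0.5))                # fair coin, the maximum: log(2) ~ 0.693
print(binary_entropy(0.99))               # nearly deterministic event: close to 0
```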

### Relative entropy (Kullback-Leibler (KL) divergence)

For a true distribution $P$ and an approximating distribution $Q$ over the same events, **$D_{KL}(P|Q)$ is often called the information gain achieved if $P$ is used instead of $Q$**.

$$ D_{KL}(p|q) = \sum_{i=1}^{n}p(x_i)\log\left(\frac{p(x_i)}{q(x_i)}\right)\tag{4} $$

from which we can see that the smaller $D_{KL}(p|q)$ is, the closer the distribution $q$ is to the distribution $p$.
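
To make this concrete, here is a minimal sketch (NumPy and natural logarithms assumed, with made-up distributions) showing that $D_{KL}(p|q)$ shrinks as $q$ gets closer to $p$:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p|q) = sum_i p(x_i) * log(p(x_i) / q(x_i)), equation (4)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x_i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
print(kl_divergence(p, [0.7, 0.2, 0.1]))    # identical distributions: 0.0
print(kl_divergence(p, [0.6, 0.25, 0.15]))  # q close to p: small value (~0.02)
print(kl_divergence(p, [0.1, 0.2, 0.7]))    # q far from p: large value (~1.17)
```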

### Cross entropy

$(4)$ can be transformed into,

$$
\begin{aligned}
D_{KL}(p|q) &= \sum_{i=1}^{n}p(x_i)\log(p(x_i)) - \sum_{i=1}^{n}p(x_i)\log(q(x_i)) \\
&= -H(p(x)) + H(p,q)
\end{aligned}\tag{5}
$$

where $H(p,q)$ is the so-called **cross entropy**. We use $H(p,q)$ instead of the relative entropy $D_{KL}(p|q)$ to evaluate the loss of the training model, because the entropy $H(p)$ remains unchanged during the training process, so minimizing the cross entropy is equivalent to minimizing the KL divergence.
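
A quick numerical check of the decomposition in $(5)$, with made-up distributions (NumPy assumed):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))      # H(p)
cross_pq  = -np.sum(p * np.log(q))      # H(p, q), the cross entropy
kl_pq     = np.sum(p * np.log(p / q))   # D_KL(p|q)

# Equation (5): D_KL(p|q) = -H(p) + H(p, q)
print(np.isclose(kl_pq, cross_pq - entropy_p))  # True
```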

In the problem of binary classification, the cross entropy in $(5)$ becomes,

$$
\begin{aligned}
H(p,q) &= -\sum_{i=1}^{n}p(x_i)\log(q(x_i)) \\
&= -p(x)\log(q(x)) - (1-p(x))\log(1-q(x))
\end{aligned}\tag{6}
$$

If we use $y$ to represent the real label and $\hat{y}$ to represent the predicted value, $(6)$ can be rewritten as,

$$ J = - y\log{\hat{y}} - (1-y)\log{(1-\hat{y})}\tag{7} $$
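
In practice $(7)$ is the per-sample loss, and it is averaged over a batch. A minimal sketch (NumPy assumed; the small clipping constant is only a numerical safeguard against $\log(0)$):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean of equation (7) over a batch: J = -y*log(y_hat) - (1-y)*log(1-y_hat)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep log() away from 0
    return np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1, 0, 1, 1])            # real labels
y_hat = np.array([0.9, 0.2, 0.7, 0.99])   # predicted probabilities

print(binary_cross_entropy(y, y_hat))        # good predictions: low loss (~0.17)
print(binary_cross_entropy(y, 1.0 - y_hat))  # flipped predictions: high loss (~2.4)
```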