Ququ's Secret Academic Base

Purify desire, insist on seriousness

Welcome! I am Qu Zehui (@zququ), currently an assistant researcher at ICBI, BCBDI, SIAT in Shenzhen.


My background is in virology, immunology, and structural biology; my published papers can be found on RG.


Cross Entropy Introduction

In the problem of logistic regression, MSE (Mean Squared Error) is not good enough to describe the loss of the training model. Cross entropy, a concept that comes from information theory, is an important concept in logistic regression.

The amount of information

First of all, I will talk about the amount (quantity) of information (信息量), which can be defined as $I$,

$$ I(x_0) = -\log(p(x_0))\tag{1} $$

where $x_0$ is a random event and $p(x_0)$ is the probability that $x_0$ occurs.

[Figure: $I(x_0) = -\log(p(x_0))$ plotted over the probability range $p(x_0) \in [0,1]$.]

Since the probability $p(x_0)$ lies in the range $[0,1]$, $I(x_0)$ is always non-negative. This leads to an interesting property: the smaller the probability of an event, the larger the amount of information carried when it occurs.
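To make this concrete, here is a minimal NumPy sketch (my own addition, not from the post) that evaluates equation $(1)$ for a few probabilities; the natural logarithm is used, so the result is measured in nats.

```python
import numpy as np

def information(p):
    """Amount of information I(x) = -log(p(x)) for an event of probability p, equation (1)."""
    return -np.log(p)

# The rarer the event, the larger the amount of information it carries.
for p in [0.99, 0.5, 0.1, 0.01]:
    print(f"p = {p:<4} -> I = {information(p):.4f} nats")
```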

Entropy

There is another concept named entropy ($H(X)$), which is the mathematical expectation of the amount of information over all events $x_i$,

$$ H(X) = -\sum_{i=1}^n p(x_i)\log(p(x_i))\tag{2} $$
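As a quick illustration (a sketch I am adding, assuming a plain NumPy environment), equation $(2)$ can be computed for any discrete distribution given as a list of probabilities:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_i p(x_i) log p(x_i), equation (2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))            # fair coin, maximal uncertainty: log 2 ≈ 0.6931
print(entropy([0.9, 0.1]))            # biased coin, lower entropy ≈ 0.3251
```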

In the problem of binary classification,

$$ \begin{aligned} H(X) &= -\sum_{i=1}^n p(x_i)\log(p(x_i)) \\ &= -p(x)\log(p(x)) - (1 - p(x))\log(1-p(x)) \end{aligned}\tag{3} $$

Relative entropy (Kullback-Leibler (KL) divergence)

$D_{KL}(P|Q)$ measures how a probability distribution $P$ differs from a reference distribution $Q$; it is often called the information gain achieved if $P$ is used instead of $Q$.

$$ D_{KL}(p|q) = \sum_{i=1}^{n}p(x_i)\log\left(\frac{p(x_i)}{q(x_i)}\right)\tag{4} $$

From this we can see that the smaller $D_{KL}(p|q)$ is, the closer the distribution $q$ is to the distribution $p$.
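The following small numerical check (my own example with made-up distributions) illustrates this: a $q$ close to $p$ gives a $D_{KL}$ near zero, while a very different $q$ gives a much larger value.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p|q) = sum_i p(x_i) log(p(x_i)/q(x_i)), equation (4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x_i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
print(kl_divergence(p, [0.69, 0.21, 0.10]))   # q close to p -> D_KL close to 0
print(kl_divergence(p, [0.10, 0.20, 0.70]))   # q far from p -> much larger D_KL
```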

Cross entropy

$(4)$ can be transformed into,

$$ \begin{aligned} D_{KL}(p|q) &= \sum_{i=1}^{n}p(x_i)\log(p(x_i)) - \sum_{i=1}^{n}p(x_i)\log(q(x_i)) \\ &= -H(p(x)) + H(p,q) \end{aligned}\tag{5} $$

where $H(p,q)$ is the so-called cross entropy. We use $H(p,q)$ instead of the relative entropy $D_{KL}(p|q)$ to evaluate the loss of the training model, because the entropy $H(p)$ of the true distribution remains unchanged during the model training process.
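The decomposition in $(5)$ can be checked numerically; the sketch below uses two arbitrary example distributions (my own choice) and verifies that $D_{KL}(p|q) = H(p,q) - H(p)$, which is why minimizing the cross entropy over the model is equivalent to minimizing the KL divergence.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (fixed during training)
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl = np.sum(p * np.log(p / q))            # D_KL(p|q)

# Equation (5): D_KL(p|q) = H(p, q) - H(p)
print(np.isclose(kl, cross_entropy - entropy_p))   # True
```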

In the problem of binary classification, the cross entropy $H(p,q)$ in $(5)$ becomes,

$$ \begin{aligned} H(p,q) &= -\sum_{i=1}^{n}p(x_i)\log(q(x_i)) \\ &= -p(x)\log(q(x)) - (1-p(x))\log(1-q(x)) \end{aligned}\tag{6} $$

If we use $y$ to represent the true value and $\hat{y}$ to represent the predicted value, $(6)$ can be rewritten as,

$$ J = - y\log{\hat{y}} - (1-y)\log{(1-\hat{y})}\tag{7} $$
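As a final illustration, here is a minimal implementation of $(7)$ averaged over a batch of samples (the labels and predictions are hypothetical, and the small `eps` clip is only there to avoid $\log(0)$):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Equation (7): J = -y log(y_hat) - (1 - y) log(1 - y_hat), averaged over samples."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    y_hat = np.clip(y_hat, eps, 1 - eps)      # keep log() finite
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y = [1, 0, 1, 0]                              # true labels
y_hat = [0.9, 0.2, 0.6, 0.4]                  # predicted probabilities
print(binary_cross_entropy(y, y_hat))         # mean loss over the batch, ≈ 0.34
```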
