Information Theory

Self-Information

Definition

For an event $\mathrm{x} = x$:

$$ I(x) = -\log P(x) $$

Unit

When the logarithm is base $e$, the unit is nats
When the logarithm is base 2, the unit is bits, also called shannons
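As a quick check of the two units, the sketch below computes the self-information of one event in nats and in bits. The helper name `self_information` is hypothetical and introduced only for illustration.

```python
import math

def self_information(p: float, base: float = math.e) -> float:
    """Return I(x) = -log P(x) for an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Change of base: log_b(p) = ln(p) / ln(b).
    return -math.log(p) / math.log(base)

# A fair coin flip carries 1 bit, which is ln(2) nats.
bits = self_information(0.5, base=2)
nats = self_information(0.5)
print(bits, nats)
```

Switching the base only rescales the value, so the two units differ by the constant factor $\ln 2$.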

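Information, Probability, Uncertainty

Information and uncertainty are positively correlated with each other and negatively correlated with probability: the less probable an event is, the more uncertain it is and the more information its occurrence carries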

Shannon Entropy

Definition

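Measures the uncertainty of a whole probability distribution, as the expected self-information of an outcome drawn from it

$$ H(\mathrm{x}) = H(P) = \mathbb E_{x \sim P}[I(x)] = -\mathbb E_{x \sim P}[\log P(x)] $$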

Property

When $x$ is continuous, the Shannon entropy is also called the differential entropy
The closer a distribution is to deterministic, the lower its entropy
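To make the second property concrete, here is a small sketch for discrete distributions. The three example distributions are made up for illustration, and terms with zero probability are skipped, which matches the usual convention that $0 \log 0 = 0$.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy H(P) = -sum_x P(x) log P(x); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(base)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 outcomes
skewed = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
certain = [1.0, 0.0, 0.0, 0.0]       # fully deterministic

h_uniform = entropy(uniform, base=2)
h_skewed = entropy(skewed, base=2)
h_certain = entropy(certain, base=2)
print(h_uniform, h_skewed, h_certain)
```

The uniform distribution over four outcomes has entropy of 2 bits, the fully deterministic one has entropy 0, and the skewed one falls in between.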

KL Divergence & Cross-Entropy

KL Divergence

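A measure of how much two distributions differ

$$ D_{KL}(P\|Q) = \mathbb E_{x \sim P}\left[\log\frac{P(x)}{Q(x)}\right]=\mathbb E_{x \sim P}\left[\log P(x)- \log Q(x)\right] $$

The KL divergence is not symmetric: in general, $D_{KL}(P\|Q) \ne D_{KL}(Q\|P)$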

Cross-Entropy

Definition

$$ \begin{aligned} H(P,Q) &= H(P) + D_{KL}(P\|Q) \\ &= -\mathbb E_{x \sim P}[\log Q(x)] \end{aligned} $$

Special Case

We sometimes encounter $0\log 0$; by convention it is treated as $\underset{x\rightarrow 0}{\lim} \ x\log x = 0$
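The definitions above can be sketched for discrete distributions as follows. The distributions `P` and `Q` are made-up examples, and returning infinity when $Q(x) = 0$ while $P(x) \gt 0$ is a common convention assumed here rather than something stated in these notes.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) (log P(x) - log Q(x)), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 log 0 is treated as 0
        if qi == 0:
            return math.inf     # P puts mass where Q has none
        total += pi * (math.log(pi) - math.log(qi))
    return total

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total -= pi * math.log(qi)
    return total

P = [0.5, 0.5, 0.0]
Q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
print(kl_pq, kl_qp)                            # the two directions differ
print(cross_entropy(P, Q), entropy(P) + kl_pq)  # these two agree
```

Here $D_{KL}(Q\|P)$ is infinite because $Q$ assigns probability to an outcome that $P$ rules out, while $D_{KL}(P\|Q)$ stays finite, which illustrates the asymmetry.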

Structured Probabilistic Models

Definition

$$ p(\pmb x) = \prod_{i}p\big(x_i \mid Pa_{\mathcal G}(x_i)\big) $$

where $Pa_{\mathcal G}(x_i)$ denotes the parents of $x_i$ in the graphical model $\mathcal G$

Normalization

Clique

A set of nodes in the graph that are all connected to one another

In an undirected graph, each clique $C^{(i)}$ has an associated factor:

$$ \phi^{(i)}\big(C^{(i)}\big) $$

Normalization

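$$ p(\pmb x) = \frac{1}{Z}\prod_i \phi^{(i)}(C^{(i)}) $$

where the normalizing constant $Z$ is the sum (or integral) of the product of factors over all states of $\pmb x$, so that $p(\pmb x)$ sums to 1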
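A product of clique factors does not sum to 1 on its own, so it has to be divided by a normalizing constant $Z$. The sketch below works this out for a toy chain $a - b - c$ of binary variables with cliques $\{a,b\}$ and $\{b,c\}$. The factor `phi` and the graph are assumptions chosen only for illustration.

```python
from itertools import product

def phi(u, v):
    """Hypothetical clique factor: neighbours prefer to agree."""
    return 2.0 if u == v else 1.0

def unnormalized(a, b, c):
    """Product of the factors of the two cliques {a, b} and {b, c}."""
    return phi(a, b) * phi(b, c)

# Enumerate all 2^3 joint states and sum the unnormalized scores to get Z.
states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(*s) for s in states)
p = {s: unnormalized(*s) / Z for s in states}

print(Z)                 # normalizing constant
print(p[(0, 0, 0)])      # probability of the all-zero state
print(sum(p.values()))   # sums to 1 after normalization
```

Brute-force enumeration is only feasible for tiny graphs, since the number of states grows exponentially with the number of variables.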

Statistics

Estimation

Statistical Inference

Example 1

Produce a variable Y such that $Pr(Y \ge \theta| \theta) = 0.9$

Example 2

How confident we are that $\theta \gt 0.4$ after observing $X_1,X_2,…,X_n$

Classes of Inference Problems

Prediction
Statistical Decision Problems
Experimental Design

Parameter Space

The set of all possible values of the parameter

Prior and Posterior Distribution

Definition

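The prior distribution expresses our belief about the unknown parameter before any data are observed
The posterior distribution is the updated distribution of the unknown parameter after the observed data are taken into account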

Conjugate Prior Distributions

Bayes Estimator

Definition

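Let $\xi(\theta \mid \pmb x)$ be the posterior p.d.f. of $\theta$ on $\Omega$. For each estimate $a$, the posterior expected loss is

$$ E[L(\theta,a) \mid \pmb x] = \int_{\Omega}L(\theta,a)\, \xi(\theta \mid \pmb x) \ d\theta $$

The Bayes estimate $\delta^*(\pmb x)$ is the value of $a$ that minimizes this expected loss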

Loss Function

Square Loss Function

$$ L(\theta, a) = (\theta - a)^2 $$

Absolute Loss Function

$$ L(\theta, a) = |\theta -a| $$
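A Bayes estimate minimizes the posterior expected loss, so the choice of loss function changes the answer. The sketch below assumes Bernoulli data with a Beta prior, which gives a Beta posterior, and discretizes the posterior on a grid. The data counts, the prior, and the helper names are all assumptions made for illustration. Under squared loss the minimizer lands near the posterior mean, and under absolute loss near the posterior median.

```python
def posterior_on_grid(successes, failures, prior_a=1.0, prior_b=1.0, n_grid=401):
    """Discretized Beta(prior_a + successes, prior_b + failures) posterior on [0, 1]."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    weights = [
        t ** (prior_a + successes - 1) * (1 - t) ** (prior_b + failures - 1)
        for t in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]

def bayes_estimate(grid, post, loss):
    """Return the grid point a that minimizes the posterior expected loss."""
    def expected_loss(a):
        return sum(loss(t, a) * p for t, p in zip(grid, post))
    return min(grid, key=expected_loss)

grid, post = posterior_on_grid(successes=7, failures=3)   # posterior is Beta(8, 4)

est_squared = bayes_estimate(grid, post, lambda t, a: (t - a) ** 2)
est_absolute = bayes_estimate(grid, post, lambda t, a: abs(t - a))

print(est_squared)    # close to the posterior mean 8 / 12
print(est_absolute)   # close to the posterior median, slightly larger here
```

The grid search is deliberately naive. It shows that the two losses pick different summaries of the same posterior, not how one would compute them in practice.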

Maximum Likelihood Estimation (M.L.E.)

Definition

When $f_n(\pmb x|\theta)$ is regarded, for a given $\pmb x$, as a function of $\theta$, it is called the likelihood function

Maximum Likelihood Estimator

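Definition

The value $\hat \theta = \delta(\pmb x)$, with $\theta\in \Omega$, that maximizes $f_n(\pmb x|\theta)$ is called the maximum likelihood estimator of $\theta$

Method

Take the logarithm of $f_n(\pmb x|\theta)$ to obtain $L(\theta)=\log f_n(\pmb x|\theta)$, then find the maximizing $\theta$, for example by setting the derivative to zero or by using monotonicity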

Example

Sampling From a Bernoulli Distribution

For a Bernoulli distribution with parameter $\theta$:

$$ \begin{aligned} f_n(\pmb x|\theta)&= \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ L(\theta)&= \sum_{i}x_i\log(\theta) + \sum_i (1-x_i)\log(1-\theta) \\ &= \sum_ix_i \log(\theta) + \Big(n-\sum_i x_i\Big)\log(1-\theta) \end{aligned} $$
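Setting the derivative of the Bernoulli log-likelihood to zero gives $\hat\theta = \frac{1}{n}\sum_i x_i$, the sample proportion. The sketch below checks this numerically by maximizing the log-likelihood over a grid. The data `xs` are made up for illustration.

```python
import math

def log_likelihood(theta, xs):
    """Bernoulli log-likelihood: sum(x) log(theta) + (n - sum(x)) log(1 - theta)."""
    s, n = sum(xs), len(xs)
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # 7 successes in 10 trials

# Maximize over an interior grid; theta = 0 and theta = 1 are excluded
# because the logarithm is undefined there.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
theta_closed = sum(xs) / len(xs)

print(theta_grid, theta_closed)
```

Both approaches agree on the sample proportion, which is what the closed-form solution predicts.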

Property

Invariance

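If $\hat \theta$ is the MLE of $\theta$ and $g$ is a one-to-one function of $\theta$, then $g(\hat \theta)$ is the MLE of $g(\theta)$

Consistency

Under regularity conditions, the maximum likelihood estimator converges to the true parameter value as the sample size tends to infinity

Efficiency

Under regularity conditions, the maximum likelihood estimator is asymptotically efficient: as the sample size grows, its variance approaches the Cramér-Rao lower bound, the smallest variance an unbiased estimator can attain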

Unbiased

Moments Estimator

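For a random sample $\pmb X = (X_1,…,X_n)$ whose distribution depends on a $k$-dimensional parameter $\theta$, let $\mu_j(\theta) = E(X_1^j \mid \theta)$ and $\mu(\theta) = (\mu_1(\theta),…,\mu_k(\theta))$. If $\mu$ is one-to-one with inverse function $M$, then $\hat \theta = M(m_1,…,m_k)$, where $m_j = \frac{1}{n}\sum_{i=1}^n X_i^j$ is the $j$-th sample moment, is the method of moments estimator of $\theta$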
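As a worked instance of the method of moments, consider a Gamma distribution with shape $\alpha$ and rate $\beta$, for which $\mu_1 = \alpha/\beta$ and $\mu_2 - \mu_1^2 = \alpha/\beta^2$. Inverting these gives $\hat\alpha = m_1^2/(m_2 - m_1^2)$ and $\hat\beta = m_1/(m_2 - m_1^2)$. The sketch below applies this to simulated data. The true parameter values, seed, and sample size are arbitrary choices. Note that Python's `random.gammavariate` takes a scale parameter, so the rate is passed as `1 / beta`.

```python
import random

def gamma_method_of_moments(xs):
    """Estimate Gamma shape and rate from the first two sample moments."""
    n = len(xs)
    m1 = sum(xs) / n                      # first sample moment
    m2 = sum(x * x for x in xs) / n       # second sample moment
    var = m2 - m1 ** 2
    return m1 ** 2 / var, m1 / var        # (alpha_hat, beta_hat)

random.seed(0)
alpha, beta = 3.0, 2.0
# gammavariate(shape, scale): scale is the reciprocal of the rate.
xs = [random.gammavariate(alpha, 1 / beta) for _ in range(20000)]

alpha_hat, beta_hat = gamma_method_of_moments(xs)
print(alpha_hat, beta_hat)
```

With a sample this large, the estimates should land close to the true values of 3 and 2.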

Sampling Distribution of Estimators

Sampling Distribution of Statistics

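Let $\pmb X = (X_1,…,X_n)$ be a random sample from a distribution with unknown parameter $\theta$, and let $T=r(X_1,…,X_n,\theta)$. The distribution of $T$ (given $\theta$) is called the sampling distribution of $T$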

The Chi-Square Distribution

Definition

For any positive integer $m$, the Gamma distribution with $\alpha = \frac{m}{2}$ and $\beta = \frac{1}{2}$ is called the chi-square distribution with $m$ degrees of freedom

The Gamma distribution p.d.f. is

$$ f(x|\alpha,\beta) = \begin{cases} \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha -1}e^{-\beta x} & x \gt 0 \\ 0 & x\le 0 \end{cases} $$

Property

Mean and Variance

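$$ E(X) = m, \qquad Var(X) = 2m $$

The m.g.f. is

$$ \psi (t) = \left(\frac{1}{1-2t}\right)^{m/2}, \quad t\lt \frac{1}{2} $$

If $X_1,…,X_k$ are independent and $X_i$ has the chi-square distribution with $m_i$ degrees of freedom $(i=1,…,k)$, then $X_1 +…+ X_k$ has the chi-square distribution with $m_1 +…+ m_k$ degrees of freedom
If $X$ has the standard normal distribution, then $Y=X^2$ has the chi-square distribution with 1 degree of freedom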
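Combining the last two properties, the sum of $m$ independent squared standard normal variables has the chi-square distribution with $m$ degrees of freedom, so its mean should be about $m$ and its variance about $2m$. The simulation below is a rough check. The seed, the value of `m`, and the number of draws are arbitrary choices.

```python
import random
import statistics

random.seed(0)
m = 5                 # degrees of freedom
n_draws = 20000

# Each draw is a sum of m independent squared standard normals.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(m)) for _ in range(n_draws)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(sample_mean, sample_var)   # expected to be near m and 2 * m
```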

The t Distributions

Definition

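Let $Y$ have the chi-square distribution with $m$ degrees of freedom and let $Z$ have the standard normal distribution, with $Y$ and $Z$ independent. Define

$$ X = \frac{Z}{(\frac{Y}{m})^{1/2}} $$

Then $X$ has the t distribution with $m$ degrees of freedom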

Property

Moments

$$ Var(X) = \frac{m}{m-2}, \quad m\gt 2 $$

$$ E(|X|^k) \begin{cases} = \infty & k\ge m \\ \lt \infty & k \lt m \end{cases} $$

Formula

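For a random sample $X_1,…,X_n$ from a normal distribution with mean $\mu$, the statistic

$$ t = \frac{\overline X_n - \mu}{\sigma'/\sqrt{n}}, \qquad \sigma' = \left[\frac{1}{n-1}\sum_{i=1}^n (X_i - \overline X_n)^2\right]^{1/2} $$

has the t distribution with $n-1$ degrees of freedom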
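For a normal sample, the t statistic is the distance between the sample mean and $\mu$, divided by the estimated standard error $\sigma'/\sqrt{n}$, where $\sigma'$ is the sample standard deviation with divisor $n-1$. The sketch below computes it with the standard library. The data and the hypothesized mean are made up for illustration.

```python
import math
import statistics

def t_statistic(xs, mu):
    """Return (sample mean - mu) / (sigma' / sqrt(n)), with sigma' using divisor n - 1."""
    n = len(xs)
    xbar = statistics.mean(xs)
    sigma_prime = statistics.stdev(xs)   # sample standard deviation, divisor n - 1
    return (xbar - mu) / (sigma_prime / math.sqrt(n))

xs = [4.9, 5.1, 5.0, 5.3, 4.7]
t = t_statistic(xs, mu=4.8)
print(t)
```

In this example the sample mean is 5.0 and the estimated standard error is 0.1, so the statistic is about 2.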

Confidence Interval

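Definition

For a random sample $\pmb X = (X_1,…,X_n)$, if two statistics $A\lt B$ satisfy $Pr(A \le g(\theta) \le B) \ge \gamma$ for every $\theta$, then $(A,B)$ is called a coefficient $\gamma$ confidence interval for $g(\theta)$, or a $100\gamma\%$ confidence interval

Remarks

After observing $X_1,…,X_n$ and computing $A=a,B=b$, the interval $(a,b)$ is called the observed value of the confidence interval
When the inequality above holds with equality for every $\theta$, the confidence interval is called exact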

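Theorem

Confidence interval for the mean of a normal distribution

$$ \begin{aligned} A &= \overline X_n - T_{n-1}^{-1}\left(\frac{1+\gamma}{2}\right)\frac{\sigma'}{n^{1/2}} \\ B &= \overline X_n + T_{n-1}^{-1}\left(\frac{1+\gamma}{2}\right)\frac{\sigma'}{n^{1/2}} \end{aligned} $$

where $T_{n-1}^{-1}$ is the quantile function of the t distribution with $n-1$ degrees of freedom and $\sigma'$ is the sample standard deviation computed with divisor $n-1$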
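The interval can be computed once the t quantile is available. The Python standard library has no t quantile function, so the sketch below approximates it by integrating the t density with Simpson's rule and inverting by bisection. That numerical approach, the helper names, and the data are all assumptions made for illustration; in practice a statistics library would normally be used instead.

```python
import math
import statistics

def t_pdf(x, m):
    """Density of the t distribution with m degrees of freedom."""
    c = math.gamma((m + 1) / 2) / (math.sqrt(m * math.pi) * math.gamma(m / 2))
    return c * (1 + x * x / m) ** (-(m + 1) / 2)

def t_cdf(x, m, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about 0."""
    a = abs(x)
    h = a / steps
    s = t_pdf(0.0, m) + t_pdf(a, m)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, m)
    area = s * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, m):
    """Invert t_cdf by bisection on a fixed bracket (adequate for moderate p and m)."""
    lo, hi = -200.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_confidence_interval(xs, gamma=0.95):
    """Two-sided coefficient-gamma interval for the mean of a normal sample."""
    n = len(xs)
    xbar = statistics.mean(xs)
    sigma_prime = statistics.stdev(xs)   # divisor n - 1
    half = t_quantile((1 + gamma) / 2, n - 1) * sigma_prime / math.sqrt(n)
    return xbar - half, xbar + half

xs = [4.9, 5.1, 5.0, 5.3, 4.7]
a, b = mean_confidence_interval(xs)
print(a, b)
```

For these five observations the sample mean is 5.0, the estimated standard error is 0.1, and the 0.975 quantile with 4 degrees of freedom is about 2.776, so the observed interval is roughly $(4.72, 5.28)$.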

One-Sided Confidence Intervals/Limits

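A lower confidence limit $A$ and an upper confidence limit $B$ satisfy, respectively,

$$ Pr(A \le g(\theta)) \ge \gamma, \qquad Pr(g(\theta) \le B) \ge \gamma $$

For the mean of a normal distribution:

$$ \begin{aligned} A &= \overline X_n - T_{n-1}^{-1}(\gamma)\frac{\sigma'}{n^{1/2}} \\ B &= \overline X_n + T_{n-1}^{-1}(\gamma)\frac{\sigma'}{n^{1/2}} \end{aligned} $$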

Unbiased Estimator

Definition

If $E_\theta[\delta(\pmb X)]=g(\theta)$ holds for every $\theta$, then $\delta(\pmb X)$ is called an unbiased estimator of $g(\theta)$

Property

$Bias = E_\theta[\delta(\pmb X)] - g(\theta)$
If the distribution of $\pmb X$ has finite variance and $g(\theta)=Var_\theta(X_1)$, then $\hat{\sigma}_1^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline {X_n})^2$ is an unbiased estimator of $g(\theta)$
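The unbiasedness of the sample variance with divisor $n-1$ can be verified exactly on a small example. The sketch below assumes a fair six-sided die as the population, enumerates every possible sample of size 3, and averages the two candidate estimators using exact fractions. The population and the sample size are illustrative choices.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3, 4, 5, 6]      # outcomes of a fair die, equally likely
n = 3                            # sample size

mu = Fraction(sum(values), len(values))
true_var = sum((v - mu) ** 2 for v in values) / len(values)

def sample_var(xs, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    xbar = Fraction(sum(xs), len(xs))
    return sum((x - xbar) ** 2 for x in xs) / divisor

# All 6^3 equally likely samples drawn with replacement.
samples = list(product(values, repeat=n))
mean_unbiased = sum(sample_var(s, n - 1) for s in samples) / len(samples)
mean_biased = sum(sample_var(s, n) for s in samples) / len(samples)

print(true_var, mean_unbiased, mean_biased)
```

The average of the $n-1$ version equals the true variance exactly, while the divisor-$n$ version is too small by the factor $(n-1)/n$.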

Testing Hypothesis

Critical Region and Test Statistics

Critical Region

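Consider a normal distribution with unknown mean and known variance, and the hypotheses

$$ H_0 : \mu = \mu_0 ; \quad H_1 : \mu \neq \mu_0 $$

Let $c$ be a positive constant. We reject $H_0$ when the distance between $\overline {X_n}$ and $\mu_0$ is at least $c$, that is,

$$ S_0 = \lbrace \pmb x: |\overline{X_n} - \mu_0| \lt c\rbrace , \quad S_1 = S_0^c $$

$S_1$ is called the critical region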

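Test Statistics & Rejection Region

Let $\pmb X$ be a random sample, let $T = r(\pmb X)$ be a statistic, and let $R$ be a subset of the real line. Suppose a test procedure for the hypotheses

$$ H_0: \theta \in \Omega_0 , \quad H_1: \theta \in \Omega_1 $$

rejects $H_0$ when $T \in R$. Then $T$ is called the test statistic and $R$ is called the rejection region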
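For a normal mean with known standard deviation $\sigma$, a two-sided test can use the test statistic $T = |\overline{X_n} - \mu_0|$ and the rejection region $R = [c, \infty)$. The sketch below assumes the usual choice $c = \Phi^{-1}(1-\alpha_0/2)\,\sigma/\sqrt{n}$, which gives a test of level $\alpha_0$. That choice of $c$, the function name, and the data are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean

def z_test_rejects(xs, mu0, sigma, alpha=0.05):
    """Return (reject, c) for the test that rejects H0 when |mean - mu0| >= c."""
    n = len(xs)
    # Critical distance from the standard normal quantile.
    c = NormalDist().inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)
    return abs(mean(xs) - mu0) >= c, c

xs = [5.2, 4.8, 5.5, 5.1, 5.4]               # sample mean is 5.2

reject_a, c = z_test_rejects(xs, mu0=5.0, sigma=0.5)
reject_b, _ = z_test_rejects(xs, mu0=4.5, sigma=0.5)
print(c, reject_a, reject_b)
```

With these numbers $c$ is about 0.44, so the sample mean is inside the acceptance region for $\mu_0 = 5.0$ and inside the critical region for $\mu_0 = 4.5$.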

Power Function and Error Type

Power Function

Error Type

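Type I Error (rejecting a true null hypothesis)

$H_0: \theta \in \Omega_0$ is true, but we reject $H_0$

Type II Error (failing to reject a false null hypothesis)

$H_0: \theta \in \Omega_0$ is false, but we do not reject $H_0$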
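The power function of a test gives the probability of rejecting $H_0$ as a function of the true parameter, and both error probabilities can be read off from it. The sketch below assumes a two-sided test of $H_0: \mu = \mu_0$ for a normal mean with known $\sigma$, rejecting when $|\overline{X_n} - \mu_0| \ge c$. The setup and the numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def power(mu, mu0, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu."""
    se = sigma / math.sqrt(n)
    c = NormalDist().inv_cdf(1 - alpha / 2) * se
    sampling = NormalDist(mu, se)             # distribution of the sample mean
    # Probability that the sample mean falls inside (mu0 - c, mu0 + c).
    p_not_reject = sampling.cdf(mu0 + c) - sampling.cdf(mu0 - c)
    return 1 - p_not_reject

mu0, sigma, n = 5.0, 0.5, 25

type_one = power(mu0, mu0, sigma, n)          # H0 true: chance of rejecting it
type_two = 1 - power(5.3, mu0, sigma, n)      # true mean 5.3: chance of not rejecting
print(type_one, type_two)
```

At $\mu = \mu_0$ the power equals the Type I error probability, and at any other $\mu$ one minus the power is the Type II error probability. The power grows as the true mean moves away from $\mu_0$.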

Significance Level

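Definition

The significance level, also called the α level, is the upper bound placed on the probability of a Type I error, that is, of rejecting the null hypothesis when it is true. It determines the critical value used in the test. Common choices are 0.10, 0.05, and 0.01.

Choosing a Significance Level

A lower significance level (such as 0.01) reduces the risk of a Type I error and makes conclusions more conservative, but it can increase the risk of a Type II error. A higher significance level (such as 0.10) makes effects easier to detect but raises the risk of a Type I error. As a compromise, 0.05 is a commonly used choice.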

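Steps of Hypothesis Testing

State the null hypothesis and the alternative hypothesis.
Choose the significance level: usually 0.05 or 0.01.
Choose a suitable test: for example a t test, an F test, or a chi-square test.
Compute the test statistic required by the chosen test, such as the t value, F value, or chi-square value.
Determine the rejection region: the set of values of the test statistic, at or beyond a critical value, for which we reject the null hypothesis.
Compute the p-value: the probability, assuming the null hypothesis is true, of obtaining the observed value or a more extreme one. The observed statistic falls in the rejection region exactly when the p-value is at most the significance level.
Draw a conclusion: reject the null hypothesis if the p-value is at most the significance level, or equivalently if the statistic falls in the rejection region. Rejecting means the data show a statistically significant departure from the null hypothesis. Failing to reject means only that the data do not provide enough evidence against it, not that the null hypothesis is true.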
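The steps above can be sketched end to end for one simple case: a two-sided z test of $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ with known $\sigma$. The choice of test, the function name, and the data are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean

def two_sided_z_test(xs, mu0, sigma, alpha=0.05):
    """Run a two-sided z test and return (z, z_crit, p_value, reject)."""
    n = len(xs)
    std_normal = NormalDist()
    # Test statistic: standardized distance of the sample mean from mu0.
    z = (mean(xs) - mu0) / (sigma / math.sqrt(n))
    # Rejection region: |z| >= z_crit.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # p-value: probability under H0 of a statistic at least this extreme.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Conclusion: reject when the p-value is at most the significance level.
    return z, z_crit, p_value, p_value <= alpha

xs = [5.2, 4.8, 5.5, 5.1, 5.4]
z, z_crit, p_value, reject = two_sided_z_test(xs, mu0=4.5, sigma=0.5)
print(z, z_crit, p_value, reject)
```

Deciding by the p-value and deciding by the rejection region give the same answer, because the p-value is at most $\alpha$ exactly when $|z|$ reaches the critical value.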
