
Machine Learning Concepts

B

Bayesian Estimation

Bernoulli Distribution

Jump to Wikipedia

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p.

The Bernoulli distribution is a special case of the binomial distribution where a single experiment/trial is conducted (n=1). It is also a special case of the two-point distribution, for which the outcome need not be a bit, i.e., the two possible outcomes need not be 0 and 1.

Parameters 0 ≤ p ≤ 1
Support k ∈ {0, 1}
pmf \[ \begin{cases} q = 1 - p & \text{if } k = 0 \\ p & \text{if } k = 1 \end{cases} \]
CDF \[ \begin{cases} 0 & \text{if } k < 0 \\ 1 - p & \text{if } 0 \leq k < 1 \\ 1 & \text{if } k \geq 1 \end{cases} \]
Mean p
Median \[ \begin{cases} 0 & \text{if } q > p \\ {[0, 1]} & \text{if } p = q \\ 1 & \text{if } q < p \end{cases} \]
Mode \[ \begin{cases} 0 & \text{if } q > p \\ 0, 1 & \text{if } p = q \\ 1 & \text{if } q < p \end{cases} \]
Variance p(1 - p) = pq
Skewness \[ \frac{1-2p}{\sqrt{pq}} \]

\[ E(X) = P(X=1) \times 1 + P(X=0) \times 0 = p \times 1 + q \times 0 = p \]

\[ E[X^2] = P(X=1) \times 1^2 + P(X=0) \times 0^2 = p \times 1^2 + q \times 0^2 = p \]

\[ Var[X] = E[X^2] - E[X]^2 = p - p^2 = p(1-p) = pq \]
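
As a quick sanity check of these identities, here is a minimal simulation sketch (not part of the entry; p = 0.3 and the sample count are arbitrary choices) comparing the empirical mean and variance of Bernoulli draws against p and pq:

```python
# Minimal sketch: sample Bernoulli(p) draws and compare the empirical
# mean and variance with the theoretical values p and pq.
import random

p = 0.3                      # success probability (arbitrary example value)
q = 1 - p
n_samples = 100_000

samples = [1 if random.random() < p else 0 for _ in range(n_samples)]

mean = sum(samples) / n_samples
var = sum((x - mean) ** 2 for x in samples) / n_samples

print(f"empirical mean     {mean:.4f}  (theory: p  = {p})")
print(f"empirical variance {var:.4f}  (theory: pq = {p * q:.2f})")
```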

Beta Distribution

Jump to Wikipedia

Beta-binomial Distribution

Jump to Wikipedia

Binomial Distribution

Jump to Wikipedia

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question and each with its own Boolean-valued outcome: a random variable carrying a single bit of information.

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N.

If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one.

Notation B(n, p)
Parameters n ∈ ℕ₀ - number of trials
p ∈ [0, 1] - success probability in each trial
Support k ∈ {0, …, n} - number of successes
pmf \[ \binom{n}{k} p^k (1-p)^{n-k} \]
CDF \[ I_{1-p}(n-k, 1+k) \]
Mean np
Median ⌊np⌋ or ⌈np⌉
Mode ⌊(n+1)p⌋ or ⌈(n+1)p⌉ - 1
Variance np(1-p)
Skewness \[ \frac{1-2p}{\sqrt{np(1-p)}} \]

Probability mass function

In general, if the random variable X follows the binomial distribution with parameters n ∈ ℕ and p ∈ [0,1], we write X ~ B(n, p).

\[ \Pr(k; n, p) = \Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

for k = 0,1,2,…,n, where

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]

Cumulative distribution function

\[ F(k; n, p) = \Pr(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1-p)^{n-i} \]
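
A short sketch of the pmf and CDF above, using only the Python standard library (the values n = 10, p = 0.5 are arbitrary examples):

```python
# Direct implementations of the binomial pmf and CDF formulas above.
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k) = C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k: int, n: int, p: float) -> float:
    """Pr(X <= k) = sum of the pmf over i = 0, ..., floor(k)."""
    return sum(binom_pmf(i, n, p) for i in range(int(k) + 1))

n, p = 10, 0.5
print(binom_pmf(5, n, p))   # 0.24609375   (= C(10,5) / 2^10)
print(binom_cdf(5, n, p))   # 0.623046875
```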

Mean

Writing X = X₁ + ⋯ + Xₙ as a sum of n independent Bernoulli(p) random variables, linearity of expectation gives

\[ E[X] = E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n] = \underbrace{p + \cdots + p}_{n\ \text{times}} = np \]

Variance

Since the n Bernoulli trials are independent, the variance of the sum is the sum of the variances:

\[ Var(X) = Var(X_1 + \cdots + X_n) = n\,Var(X_1) = np(1-p) \]
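
A minimal simulation of this derivation (n = 20 and p = 0.3 are assumed example values): drawing X as a sum of n independent Bernoulli(p) trials, the empirical mean and variance should approach np and np(1-p).

```python
# Sketch: X = X_1 + ... + X_n with X_i ~ Bernoulli(p); check E[X] and Var[X].
import random

n, p = 20, 0.3
n_samples = 100_000

xs = [sum(1 for _ in range(n) if random.random() < p)
      for _ in range(n_samples)]

mean = sum(xs) / n_samples
var = sum((x - mean) ** 2 for x in xs) / n_samples
print(f"mean {mean:.3f} vs np       = {n * p}")
print(f"var  {var:.3f} vs np(1-p)  = {n * p * (1 - p):.2f}")
```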

C

Cauchy Distribution

Jump to Wikipedia

Categorical Distribution

Jump to Wikipedia

Chi-squared Distribution

Jump to Wikipedia

Continuous Uniform Distribution

Jump to Wikipedia

D

Dirichlet Distribution

Discrete Uniform Distribution

Jump to Wikipedia

E

Exponential Distribution

Jump to Wikipedia

F

F-distribution

Jump to Wikipedia

G

Gamma Distribution

Jump to Wikipedia

Geometric Distribution

Jump to Wikipedia

H

Hypergeometric Distribution

Jump to Wikipedia

L

Laplace Smoothing

Jump to Wikipedia

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data.

Given an observation x = (x1, …, xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, …, θd), a “smoothed” version of the data gives the estimator:

\[ \hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d}, \quad i = 1, \dots, d \]

where α > 0 is the smoothing parameter.

Additive smoothing is a type of shrinkage estimator, as the resulting estimate will lie between the empirical estimate xi/N and the uniform probability 1/d.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior.
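
A minimal sketch of the estimator above (the counts and the pseudocount α = 1 are arbitrary example values):

```python
# Additive (Laplace) smoothing: theta_i = (x_i + alpha) / (N + alpha * d).
def additive_smoothing(counts, alpha=1.0):
    """Smoothed probability estimates for multinomial counts."""
    n = sum(counts)              # N: total number of trials
    d = len(counts)              # d: number of categories
    return [(x + alpha) / (n + alpha * d) for x in counts]

counts = [3, 0, 7]               # x: observed counts, so N = 10, d = 3
print([round(t, 4) for t in additive_smoothing(counts, alpha=1.0)])
# [0.3077, 0.0769, 0.6154] -- the zero count is pulled toward uniform 1/3
```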

Law of Large Numbers

Jump to Wikipedia

Log-normal Distribution

Jump to Wikipedia

M

Maximum Likelihood Estimation

Multinomial Distribution

Jump to Wikipedia

Multivariate Hypergeometric Distribution

Jump to Wikipedia

N

Negative Binomial Distribution

Jump to Wikipedia

Normal Distribution

Jump to Wikipedia

P

Pareto Distribution

Jump to Wikipedia

Poisson Distribution

Jump to Wikipedia

Posterior Distribution

R

Rayleigh Distribution

Jump to Wikipedia

Rice Distribution

Jump to Wikipedia

S

Student’s t Distribution

Jump to Wikipedia

W

Wishart Distribution