Machine Learning Concepts


Bayesian Estimation

Bernoulli Distribution

Go to Wikipedia

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p.

The Bernoulli distribution is a special case of the binomial distribution where a single experiment/trial is conducted (n=1). It is also a special case of the two-point distribution, for which the outcome need not be a bit, i.e., the two possible outcomes need not be 0 and 1.

Parameters 0 ≤ p ≤ 1, p ∈ ℝ
Support k ∈ {0, 1}
pmf \[ \begin{cases} q = 1-p & \text{if } k = 0 \\ p & \text{if } k = 1 \end{cases} \]
CDF \[ \begin{cases} 0 & \text{if } k < 0 \\ 1-p & \text{if } 0 \leq k < 1 \\ 1 & \text{if } k \geq 1 \end{cases} \]
Mean p
Median \[ \begin{cases} 0 & \text{if } q > p \\ 0.5 & \text{if } p = q \\ 1 & \text{if } q < p \end{cases} \]
Mode \[ \begin{cases} 0 & \text{if } q > p \\ 0, 1 & \text{if } p = q \\ 1 & \text{if } q < p \end{cases} \]
Variance p(1-p) = pq
Skewness \[ \frac{1-2p}{\sqrt{pq}} \]

\[ E(X) = P(X=1) \times 1 + P(X=0) \times 0 = p \times 1 + q \times 0 = p \]

\[ E[X^2] = P(X=1) \times 1^2 + P(X=0) \times 0^2 = p \times 1^2 + q \times 0^2 = p \]

\[ Var[X] = E[X^2] - E[X]^2 = p - p^2 = p(1-p) = pq \]
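The derivation above can be checked numerically straight from the pmf. A minimal sketch, using an arbitrary example value p = 0.3:

```python
# Verify E[X] = p and Var[X] = p(1-p) for a Bernoulli(p) variable,
# computed directly from the pmf over the support {0, 1}.
p = 0.3
q = 1 - p

pmf = {0: q, 1: p}

mean = sum(k * pr for k, pr in pmf.items())              # E[X]
second_moment = sum(k**2 * pr for k, pr in pmf.items())  # E[X^2]
variance = second_moment - mean**2                       # Var[X] = E[X^2] - E[X]^2

assert abs(mean - p) < 1e-12
assert abs(variance - p * q) < 1e-12
```

Because the support is {0, 1}, the first and second moments coincide (1² = 1, 0² = 0), which is exactly why E[X²] = E[X] = p in the derivation.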

Beta Distribution

Go to Wikipedia

Beta-binomial Distribution

Jump to Wikipedia

Binomial Distribution

Go to Wikipedia

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: a random variable containing a single bit of information.

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N.

If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one.

Notation B(n,p)
Parameters n ∈ ℕ₀ - number of trials
p ∈ [0,1] - success probability in each trial
Support k ∈ {0,…,n} - number of successes
pmf \[ \binom{n}{k} p^k (1-p)^{n-k} \]
CDF \[ I_{1-p}(n-k, 1+k) \]
Mean np
Median ⌊np⌋ or ⌈np⌉
Mode ⌊(n+1)p⌋ or ⌈(n+1)p⌉ - 1
Variance np(1-p)
Skewness \[ \frac{1-2p}{\sqrt{np(1-p)}} \]

Probability mass function

In general, if the random variable X follows the binomial distribution with parameters n ∈ ℕ and p ∈ [0,1], we write X ~ B(n, p).

\[ \Pr(k;n,p) = \Pr(X=k) = \binom{n}{k} p^k(1-p)^{n-k} \]

for k = 0,1,2,…,n, where

\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]

Cumulative distribution function

\[ F(k;n,p) = \Pr(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i(1-p)^{n-i} \]
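Both formulas translate directly into code. A minimal sketch using only the standard library (the parameters n = 10, p = 0.4 are arbitrary example values):

```python
from math import comb, floor

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ B(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Pr(X <= k): sum of the pmf over i = 0, ..., floor(k)."""
    return sum(binom_pmf(i, n, p) for i in range(floor(k) + 1))

# Sanity checks: the pmf sums to 1 over the full support,
# and the CDF at k = n is therefore 1.
n, p = 10, 0.4
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
assert abs(total - 1.0) < 1e-12
assert abs(binom_cdf(n, n, p) - 1.0) < 1e-12
```

For production use a library routine (e.g. SciPy's `scipy.stats.binom`) is preferable, since the naive sum above loses accuracy for large n.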


\[ E[X] = E[X_1+...+X_n] = E[X_1] + ... + E[X_n] = \underbrace{p+...+p}_{n\ times} = np \]


\[ Var(X) = Var(X_1 + ... + X_n) = nVar(X_1) = np(1-p) \]
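The sum-of-Bernoullis view used in these two derivations can be illustrated by simulation: draw X as a sum of n independent Bernoulli(p) trials and compare the sample mean and variance with np and np(1-p). A sketch with arbitrary example parameters n = 12, p = 0.25:

```python
import random

# X ~ B(n, p) built as a sum of n independent Bernoulli(p) trials.
random.seed(0)
n, p = 12, 0.25
samples = [sum(1 for _ in range(n) if random.random() < p)
           for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# Close to the exact values np = 3 and np(1-p) = 2.25:
assert abs(mean - n * p) < 0.05
assert abs(var - n * p * (1 - p)) < 0.1
```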


Cauchy Distribution

Jump to Wikipedia

Categorical Distribution

Jump to Wikipedia

Chi-squared Distribution

Jump to Wikipedia

Continuous Uniform Distribution

Jump to Wikipedia


Dirichlet Distribution

Discrete Uniform Distribution

Jump to Wikipedia


Exponential Distribution

Jump to Wikipedia





Gamma Distribution

Jump to Wikipedia

Geometric Distribution

Jump to Wikipedia


Hypergeometric Distribution

Jump to Wikipedia


Laplace Smoothing

Go to Wikipedia

In statistics, additive smoothing, also called Laplace smoothing, or Lidstone smoothing, is a technique used to smooth categorical data.

Given an observation x = (x1, …, xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, …, θd), a “smoothed” version of the data gives the estimator:

\[ \hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d}, \quad (i = 1, \dots, d) \]

where α > 0 is the smoothing parameter.

Additive smoothing is a type of shrinkage estimator: the resulting estimate lies between the empirical estimate xi / N and the uniform probability 1/d.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior.
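The estimator is a one-liner per category. A minimal sketch (the counts below are hypothetical example data):

```python
def smoothed_estimate(counts, alpha=1.0):
    """Additive (Laplace) smoothing: (x_i + alpha) / (N + alpha * d)."""
    N = sum(counts)
    d = len(counts)
    return [(x + alpha) / (N + alpha * d) for x in counts]

counts = [3, 0, 7]  # hypothetical observed counts over d = 3 categories
theta = smoothed_estimate(counts, alpha=1.0)

assert abs(sum(theta) - 1.0) < 1e-12  # still a valid probability vector
assert theta[1] > 0                   # zero-count category gets nonzero mass
```

Note how the zero-count category receives probability 1/13 rather than 0, which is the practical point of the technique (e.g. avoiding zero likelihoods in naive Bayes).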

Law of Large Numbers

Jump to Wikipedia

Log-normal Distribution

Jump to Wikipedia


Maximum Likelihood Estimation

Multinomial Distribution

Go to Wikipedia

Multivariate Hypergeometric Distribution

Jump to Wikipedia


Negative Binomial Distribution

Jump to Wikipedia

Normal Distribution

Jump to Wikipedia


Pareto Distribution

Jump to Wikipedia

Poisson Distribution

Jump to Wikipedia

Posterior Distribution


Rayleigh Distribution

Jump to Wikipedia

Rice Distribution

Jump to Wikipedia


Student’s t Distribution

Jump to Wikipedia


Wishart Distribution