# Machine Learning Concepts

## B

### Bayesian Estimation

### Bernoulli Distribution

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p.

The Bernoulli distribution is a special case of the binomial distribution where a single experiment/trial is conducted (n=1). It is also a special case of the two-point distribution, for which the outcome need not be a bit, i.e., the two possible outcomes need not be 0 and 1.

| Property   | Value |
| ---------- | ----- |
| Parameters | \( 0 \le p \le 1 \), \( q = 1 - p \) |
| Support    | \( k \in \{0, 1\} \) |
| PMF        | \( q = 1 - p \) for \( k = 0 \); \( p \) for \( k = 1 \) |
| CDF        | \( 0 \) for \( k < 0 \); \( 1 - p \) for \( 0 \le k < 1 \); \( 1 \) for \( k \ge 1 \) |
| Mean       | \( p \) |
| Median     | \( 0 \) if \( q > p \); \( 0.5 \) if \( p = q \); \( 1 \) if \( q < p \) |
| Mode       | \( 0 \) if \( q > p \); \( 0 \) and \( 1 \) if \( p = q \); \( 1 \) if \( q < p \) |
| Variance   | \( p(1 - p) = pq \) |
| Skewness   | \( \frac{1 - 2p}{\sqrt{pq}} \) |

The mean and variance follow directly from the definition:

\[ E[X] = P(X=1) \times 1 + P(X=0) \times 0 = p \times 1 + q \times 0 = p \]

\[ E[X^2] = P(X=1) \times 1^2 + P(X=0) \times 0^2 = p \times 1^2 + q \times 0^2 = p \]

\[ Var[X] = E[X^2] - E[X]^2 = p - p^2 = p(1-p) = pq \]
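
These moments are easy to sanity-check by simulation. Below is a minimal sketch in plain Python (the function name `bernoulli_sample` is my own, not from any particular library) that draws Bernoulli(\( p \)) samples and compares the empirical mean and variance against \( p \) and \( p(1-p) \):

```python
import random

def bernoulli_sample(p, n, seed=0):
    """Draw n Bernoulli(p) samples: 1 with probability p, 0 otherwise."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

p = 0.3
samples = bernoulli_sample(p, 100_000)

mean = sum(samples) / len(samples)                           # estimates E[X] = p
var = sum((x - mean) ** 2 for x in samples) / len(samples)   # estimates p(1-p)

print(f"sample mean {mean:.4f}  vs. theory {p}")
print(f"sample var  {var:.4f}  vs. theory {p * (1 - p):.4f}")
```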

### Beta Distribution

### Beta-binomial Distribution

### Binomial Distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: a random variable carrying a single bit of information.

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N.

If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one.

| Property   | Value |
| ---------- | ----- |
| Notation   | \( B(n, p) \) |
| Parameters | \( n \in \mathbb{N}_0 \) (number of trials); \( p \in [0, 1] \) (success probability in each trial) |
| Support    | \( k \in \{0, \dots, n\} \) (number of successes) |
| PMF        | \( \binom{n}{k} p^k (1 - p)^{n - k} \) |
| CDF        | \( I_{1-p}(n - k, 1 + k) \) |
| Mean       | \( np \) |
| Median     | \( \lfloor np \rfloor \) or \( \lceil np \rceil \) |
| Mode       | \( \lfloor (n + 1)p \rfloor \) or \( \lceil (n + 1)p \rceil - 1 \) |
| Variance   | \( np(1 - p) \) |
| Skewness   | \( \frac{1 - 2p}{\sqrt{np(1 - p)}} \) |

#### Probability mass function

In general, if the random variable X follows the binomial distribution with parameters n ∈ ℕ and p ∈ [0,1], we write X ~ B(n, p).

\[ Pr(k; n, p) = Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]

for k = 0,1,2,…,n, where

\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
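
The PMF can be transcribed into code almost verbatim. A short sketch in plain Python (`binomial_pmf` is a hypothetical helper name, not a library function):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ B(n, p), computed straight from the formula above."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 successes in 10 fair-coin trials:
print(binomial_pmf(3, 10, 0.5))  # 0.1171875 = C(10,3) / 2**10
```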

#### Cumulative distribution function

\[ F(k; n, p) = Pr(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1 - p)^{n - i} \]

#### Mean

Writing \( X = X_1 + \dots + X_n \) as a sum of \( n \) independent Bernoulli(\( p \)) variables, linearity of expectation gives

\[ E[X] = E[X_1 + \dots + X_n] = E[X_1] + \dots + E[X_n] = \underbrace{p + \dots + p}_{n \text{ times}} = np \]

#### Variance

Since the \( X_i \) are independent, their variances add:

\[ Var(X) = Var(X_1 + \dots + X_n) = n \, Var(X_1) = np(1 - p) \]
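
As with the Bernoulli case, both moments can be checked by simulation. The sketch below (plain Python, names are my own) builds each binomial draw as a sum of n Bernoulli trials, mirroring the derivation above:

```python
import random

def binomial_sample(n, p, draws, seed=0):
    """Simulate `draws` values of X ~ B(n, p) as sums of n Bernoulli(p) trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(draws)]

n, p = 20, 0.3
values = binomial_sample(n, p, 50_000)

mean = sum(values) / len(values)
var = sum((x - mean) ** 2 for x in values) / len(values)

print(f"sample mean {mean:.3f}  vs. theory np = {n * p}")
print(f"sample var  {var:.3f}  vs. theory np(1-p) = {n * p * (1 - p):.3f}")
```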

## C

### Cauchy Distribution

### Categorical Distribution

### Chi-squared Distribution

### Continuous Uniform Distribution

## D

### Dirichlet Distribution

### Discrete Uniform Distribution

## E

### Exponential Distribution

## F

### F-distribution

## G

### Gamma Distribution

### Geometric Distribution

## H

### Hypergeometric Distribution

## L

### Laplace Smoothing

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data.

Given an observation x = (x_{1}, …, x_{d}) from a multinomial distribution with N trials and parameter vector θ = (θ_{1}, …, θ_{d}), a “smoothed” version of the data gives the estimator:

\[ \hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d}, \qquad i = 1, \dots, d \]

where α > 0 is the smoothing parameter.

Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical estimate \( x_i / N \) and the uniform probability \( 1/d \).

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior.
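
A minimal sketch of the estimator in plain Python (`additive_smoothing` is my own name for illustration), assuming counts are given as a dict over d categories:

```python
def additive_smoothing(counts, d, alpha=1.0):
    """Smoothed estimates (x_i + alpha) / (N + alpha * d) over d categories.

    Categories absent from `counts` are treated as x_i = 0, so every
    category still receives nonzero probability mass.
    """
    N = sum(counts.values())
    return {i: (counts.get(i, 0) + alpha) / (N + alpha * d) for i in range(d)}

# Word counts over a 4-category vocabulary; category 3 was never observed.
counts = {0: 5, 1: 3, 2: 2}
print(additive_smoothing(counts, d=4, alpha=1.0))
# approx. {0: 0.4286, 1: 0.2857, 2: 0.2143, 3: 0.0714} -- sums to 1, no zeros
```

With \( \alpha = 1 \) this is the classic Laplace case; other values of \( \alpha > 0 \) give Lidstone smoothing.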