Probability Distributions for Machine Learning

In day-to-day research, we often rely on concepts from probability and statistics to support urban studies, so a solid grasp of these topics is necessary. This article gives a conceptual overview of commonly encountered probability distributions.

1. Random variables

A random variable assigns a number to each possible outcome of a random experiment. It is usually denoted by X, and a particular value it takes is written x.

Discrete random variable

If the possible values are countable, the variable is called a discrete random variable. For example, if you flip a coin 10 times, the number of heads is an integer between 0 and 10. The number of apples in a basket is also countable.

Continuous random variable

A continuous random variable can take any value within an interval, so its possible values cannot be listed one by one. For example, a person might be 1.7 m tall, 1.80 m tall, 1.6666666... m tall, and so on.
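As a rough illustration of the two kinds of variable, the sketch below draws one discrete value (the number of heads in 10 coin flips) and one continuous value (a height in metres). It uses only Python's standard `random` module. The seed, the 1.5 m to 2.0 m range, and the variable names are assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable

# Discrete: count heads in 10 fair coin flips; the result is an integer from 0 to 10
num_heads = sum(rng.random() < 0.5 for _ in range(10))

# Continuous: a height drawn uniformly from an assumed range of 1.5 m to 2.0 m;
# any real number in that interval is a possible value
height_m = rng.uniform(1.5, 2.0)

print(num_heads, height_m)
```

The uniform range is only a stand-in; real heights would follow some other continuous distribution.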

2. Density functions

We use three related functions (the PMF, the PDF, and the CDF) to describe the probability distribution of a random variable X.

PMF: probability mass function

The PMF gives the probability that a discrete random variable X equals a particular value x, written P(X = x). These probabilities sum to 1 over all possible values of x. The PMF applies only to discrete variables.

[Figure: probability mass function (PMF). Source: Wikipedia]
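To make the PMF concrete, here is a minimal sketch for the coin example from section 1: the number of heads in 10 fair flips. The binomial formula and the helper name `binomial_pmf` are not from the text above; they are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 flips of a fair coin

# Evaluate the PMF at every possible value 0, 1, ..., 10
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

print(pmf[5])             # probability of exactly 5 heads
total = sum(pmf.values())
print(total)              # the probabilities add up to 1
```

Each entry of `pmf` is a genuine probability between 0 and 1, which is what distinguishes a PMF from a density.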

PDF: probability density function

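The PDF is the continuous-variable counterpart to the PMF. Its value at a point is a density, not a probability: the probability that a continuous random variable X falls within a given range is the area under the PDF over that range, and the probability of any single exact value is 0. The total area under the PDF equals 1.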

[Figure: probability density function (PDF). Source: Byjus]
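The difference between a density and a probability can be checked numerically. The sketch below assumes, purely for illustration, that heights follow a normal distribution with mean 1.70 m and standard deviation 0.10 m. Those parameters, the choice of the normal distribution, and the helper names are hypothetical.

```python
import math

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def prob_between(a: float, b: float, mu: float, sigma: float, steps: int = 10_000) -> float:
    """Approximate P(a <= X <= b) with a midpoint Riemann sum of the density."""
    width = (b - a) / steps
    area = sum(normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return area * width

MU, SIGMA = 1.70, 0.10  # hypothetical height distribution, in metres

density_at_mean = normal_pdf(MU, MU, SIGMA)    # about 3.99: a density can exceed 1
p_range = prob_between(1.60, 1.80, MU, SIGMA)  # about 0.683: area under the curve

print(density_at_mean, p_range)
```

The density at the mean is larger than 1, which would be impossible for a probability. Only the area over an interval is a probability.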

CDF: cumulative distribution function

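The CDF gives the probability that a random variable X takes a value less than or equal to x, written F(x) = P(X ≤ x). It is defined for both discrete and continuous variables, never decreases as x grows, and ranges from 0 to 1.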

[Figure: cumulative distribution function (CDF) of the exponential distribution. Source: Wikipedia]
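As a small sketch of how a CDF behaves, the code below evaluates the CDF of an exponential distribution, F(x) = 1 - exp(-rate * x) for x ≥ 0. The rate of 1.0, the sample points, and the function name are assumptions for the example.

```python
import math

def exponential_cdf(x: float, rate: float = 1.0) -> float:
    """F(x) = P(X <= x) for an exponential distribution with the given rate."""
    if x < 0:
        return 0.0  # the variable is never negative
    return 1.0 - math.exp(-rate * x)

# The CDF starts at 0 and climbs towards 1 as x increases
values = [exponential_cdf(x) for x in (0.0, 0.5, 1.0, 2.0, 5.0)]
print(values)

# Probability of landing in an interval: P(1 < X <= 2) = F(2) - F(1)
p_interval = exponential_cdf(2.0) - exponential_cdf(1.0)
print(p_interval)
```

Subtracting two CDF values is the usual way to get the probability of an interval without integrating the density by hand.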

3. Discrete distributions

Bernoulli distribution

The Bernoulli distribution describes a single trial (one observation) with two possible outcomes, for example one coin toss. One outcome is coded as true (1) and the other as false (0). Assume heads corresponds to true (success). If the probability of heads is p, the probability of tails is 1 - p.

    import seaborn as sns
    from scipy.stats import bernoulli

    # Each sample is a single observation: 1 (success) or 0 (failure)
    # Generate 1000 samples with probability 0.5 for each outcome
    data = bernoulli.rvs(size=1000, p=0.5)

    # Plot how often each outcome occurred
    ax = sns.histplot(data, discrete=True)
    ax.set(xlabel='Bernoulli', ylabel='freq')

[Figure: histogram of the Bernoulli samples]
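The Bernoulli PMF itself is short enough to write by hand. The following sketch uses only the standard library and compares the PMF with the fraction of successes in simulated trials. The success probability of 0.3, the seed, and the names `bernoulli_pmf` and `observed_fraction` are hypothetical choices.

```python
import random

def bernoulli_pmf(k: int, p: float) -> float:
    """P(X = k) for a single trial: p for success (1), 1 - p for failure (0)."""
    if k == 1:
        return p
    if k == 0:
        return 1.0 - p
    return 0.0  # no other outcome is possible

p = 0.3                  # hypothetical success probability
rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulate 10,000 independent trials; each is 1 with probability p, else 0
samples = [1 if rng.random() < p else 0 for _ in range(10_000)]
observed_fraction = sum(samples) / len(samples)

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p), observed_fraction)
```

With many trials, the observed fraction of successes should land close to p, which is what the histogram of samples shows visually.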