Machine Learning: Probability & Statistics Basics

Welcome to the 2nd week of learning Probability & Statistics for Machine Learning & Data Science

Lesson 1 - Describing Distributions

Expected Value

The expected value of a random variable is its long-run average: the mean of its possible values, each weighted by its probability.

Question : You throw a fair coin. If it lands heads you win 10 dollars, otherwise you win nothing. What's the maximum amount you should be willing to pay to play this game? $0, $4, $5, $6 or $10? Because the coin is fair, heads and tails each have probability 0.5.

Answer : 0.5 * $10 + 0.5 * $0 = $5. You expect to win $5 on average, which is written E[X] = 5. Paying less is a good deal, but you should not pay more than $5.

Question : What's the maximum amount you should be willing to pay to play a new game, where you flip 3 coins and win a dollar for each head you get?

Answer : 3 x (0.5 * $1 + 0.5 * $0) = $1.50. This is the average amount of money we'd win per game if we played many times. So E[X] = 1.5, where X = number of heads.

A discrete random variable X always has a PMF (Probability Mass Function), which gives the probability of every value x that X can take.

If every bar of the PMF's histogram has the same probability, the expected value sits in the middle. If some values carry more probability than others, the balance point shifts toward the more heavily weighted values.

In both cases, we sum over all the possible values of x, each weighted by its probability: E[X] = Σ x * P(X = x).
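The weighted sum can be checked on the two coin games above. This is a minimal sketch: the helper name expected_value and the dictionary layout for the PMF are illustrative choices, not something from the course material.

```python
from fractions import Fraction


def expected_value(pmf):
    """Sum of x * P(X = x) over a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Single coin game: win $10 on heads, $0 on tails.
coin_game = {10: Fraction(1, 2), 0: Fraction(1, 2)}

# Three-coin game: X = number of heads, so P(X = k) = C(3, k) / 8.
three_coins = {
    0: Fraction(1, 8),
    1: Fraction(3, 8),
    2: Fraction(3, 8),
    3: Fraction(1, 8),
}

print(expected_value(coin_game))    # 5
print(expected_value(three_coins))  # 3/2
```

Using Fraction keeps the arithmetic exact, so the results match the hand calculations without rounding.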

Central tendency: Median and Mode

Median :

The mean does not describe every data set well. Let's say the salaries of 5 men are $600, $500, $450, $400 and $200. The average salary is $430, which suggests that each of these 5 salaries is somewhere around $430. But what if one of them earns $1500 instead of $600? The average (following the mean formula) becomes $610. Is that a fair summary? No: four of the five men earn $500 or less, so a single outlier has pulled the mean above almost everyone's salary.

How to fix it? Put all the salaries in order and find the middle value: $200, $400, $450, $500, $1500. The middle value is $450, so the median is $450. It is the same median as before the $600 salary became $1500.

When the list has an even number of values, take the average of the two middle values; that average is the median.
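The salary example can be reproduced with Python's standard statistics module. This is only a sketch of the comparison; the variable names are illustrative.

```python
from statistics import mean, median

salaries = [600, 500, 450, 400, 200]
with_outlier = [1500, 500, 450, 400, 200]  # $600 replaced by $1500

print(mean(salaries), median(salaries))          # 430 450
print(mean(with_outlier), median(with_outlier))  # 610 450

# Even number of values: the median is the average of the two middle ones.
print(median([200, 400, 450, 500]))  # 425.0
```

The mean moves from 430 to 610 when one salary changes, while the median stays at 450.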

Mode :

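Mode is a measure of central tendency that points out the most frequently occurring value in a data set. The mode is applicable to both qualitative (categorical) and quantitative (numerical) data. Unlike the mean and the median, the mode need not be unique: a data set can have one mode, several modes (bimodal or multimodal) when values tie for the highest frequency, or no meaningful mode when every value occurs equally often.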

Examples

1. Categorical Data

Consider a survey of favorite fruits among a group of people:

Data: Apple, Banana, Banana, Apple, Orange, Banana, Apple
Mode: Apple and Banana (each appears 3 times, so this data set is bimodal)

2. Numerical Data

Consider a set of exam scores:

Data: 85, 90, 78, 90, 88, 85, 85, 92
Mode: 85 (since it appears the most frequently)
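Both examples can be counted with statistics.multimode from the Python standard library, which returns every value tied for the highest frequency. In the fruit data, Apple and Banana each appear three times, so both are reported. A minimal sketch:

```python
from statistics import multimode

fruits = ["Apple", "Banana", "Banana", "Apple", "Orange", "Banana", "Apple"]
scores = [85, 90, 78, 90, 88, 85, 85, 92]

# multimode lists all most-frequent values, in order of first appearance.
print(multimode(fruits))  # ['Apple', 'Banana']
print(multimode(scores))  # [85]
```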

Expected value of a Function

For a random variable X and a function g(X), the expected value of g(X), denoted E[g(X)], is a weighted average of all the values g(X) can take: E[g(X)] = Σ g(x) * P(X = x). In particular, E[aX + b] = aE[X] + b, which implies that the expectation of a constant is that constant (E[b] = b) and that E[aX] = aE[X]. This means that expectation is a linear operator.
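Linearity can be verified numerically. The sketch below assumes a fair six-sided die for X and picks a = 2 and b = 3 arbitrarily; the helper expect is an illustrative name.

```python
from fractions import Fraction

# Assumed example: X is the face shown by a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expect(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())


a, b = 2, 3  # arbitrary constants chosen for the illustration
lhs = expect(die, lambda x: a * x + b)  # E[aX + b]
rhs = a * expect(die) + b               # aE[X] + b
print(lhs, rhs)  # 10 10

# Linearity does not carry over to nonlinear g: E[X^2] differs from (E[X])^2.
print(expect(die, lambda x: x ** 2), expect(die) ** 2)  # 91/6 49/4
```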

Sum of expectations

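Question : You flip a fair coin. If heads, you win $1, otherwise you win nothing. Then you roll a fair six-sided die. You win the amount you roll, in dollars.

What are your expected winnings for this game? $1, $2, $4 or $5? The answer is $4: the expectation of a sum is the sum of the expectations, so E[coin + die] = E[coin] + E[die] = 0.5 + 3.5 = 4.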
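One way to check the coin-and-die game is to enumerate all 12 joint outcomes and compare the result with the sum of the two separate expectations. The sketch assumes a fair coin and a fair six-sided die, and treats them as independent only so that the joint probabilities can be written as products; linearity itself does not require independence.

```python
from fractions import Fraction
from itertools import product

coin = {1: Fraction(1, 2), 0: Fraction(1, 2)}         # winnings from the coin
die = {face: Fraction(1, 6) for face in range(1, 7)}  # winnings from the die

# Brute force: weight each total payout by its joint probability.
joint_expectation = sum(
    (c + d) * pc * pd
    for (c, pc), (d, pd) in product(coin.items(), die.items())
)

# Sum of the two separate expectations.
separate = sum(c * p for c, p in coin.items()) + sum(d * p for d, p in die.items())

print(joint_expectation, separate)  # 4 4
```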

Question : There are 8 billion people in the world. There is a bag with their 8 billion names, and each person is given a random name from it.

What is the expected number of correct assignments? 0, 0.1, 1 or 8 billion? The answer is 1.

Now, imagine there are three people. Each of them has a 1/3 probability of getting their own name, so the expected number of correct assignments is 1/3 + 1/3 + 1/3 = 1. The same holds for any number of people n, because n * 1/n = 1.
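For small groups the claim can be checked exactly by listing every possible assignment of names. This sketch treats an assignment as a permutation and counts how many people receive their own name; the function name expected_matches is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def expected_matches(n):
    """Average number of people who get their own name, over all n! assignments."""
    perms = list(permutations(range(n)))
    total_matches = sum(
        sum(1 for person, name in enumerate(perm) if person == name)
        for perm in perms
    )
    return Fraction(total_matches, len(perms))


for n in range(1, 7):
    print(n, expected_matches(n))  # the second number is 1 for every n
```

Enumeration is only feasible for small n, but the result agrees with the n * 1/n argument, which works for any n.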

Variance

Variance measures the spread or dispersion of a random variable or data set around its mean.

Question : You are flipping a fair coin, heads you win $1, tails you lose $1. What is the maximum amount of money you should be willing to pay to play this game? -$1, -$0.5, $0, $0.5 or $1? It's $0.

Question : What is the fair amount of money to pay to play a new game, where you flip a fair coin and win $100 if it's heads and lose $100 if it's tails? -$100, -$50, $0, $50 or $100? It's $0.

Both games have the same expected value, $0, yet the second one is far riskier. The expected value alone cannot tell them apart; the variance can.

Variance, Var(X) = E[(X - E[X])^2] (Variance Formula)

Steps:

Find the mean of X, E[X]
Find the deviation of every value of X from the mean (Deviation = x - E[X])
Square the deviations
Average those squared deviations, weighting each by its probability

Question : Which game has greater variance? Game 1: Heads, you win $2; tails, you lose $2. Game 2: Heads, you win $3; tails, you lose $1.

They both have the same variance, because the spread of their outcomes around the mean is the same: Game 2 is just Game 1 shifted up by $1.

E[X1] = 1/2 * 2 + 1/2 * (-2) = 0; Var(X1) = 1/2 * (-2 - 0)^2 + 1/2 * (2 - 0)^2 = 4

E[X2] = 1/2 * 3 + 1/2 * (-1) = 1; Var(X2) = 1/2 * (-1 - 1)^2 + 1/2 * (3 - 1)^2 = 4

The alternative formula for variance is Var(X) = E[X^2] - (E[X])^2
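Both variance formulas can be compared on the two games. In this sketch the function names variance and variance_shortcut are illustrative; each game is written as a PMF over its payouts.

```python
from fractions import Fraction


def expect(pmf, g=lambda x: x):
    """E[g(X)] for a PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    """Definition: E[(X - E[X])^2]."""
    mu = expect(pmf)
    return expect(pmf, lambda x: (x - mu) ** 2)


def variance_shortcut(pmf):
    """Alternative formula: E[X^2] - (E[X])^2."""
    return expect(pmf, lambda x: x ** 2) - expect(pmf) ** 2


half = Fraction(1, 2)
game1 = {2: half, -2: half}  # win $2 or lose $2
game2 = {3: half, -1: half}  # win $3 or lose $1

for name, game in (("Game 1", game1), ("Game 2", game2)):
    print(name, expect(game), variance(game), variance_shortcut(game))
# Game 1 0 4 4
# Game 2 1 4 4
```

The means differ (0 and 1) but both formulas give a variance of 4 for each game.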

Standard Deviation : std(X) = sqrt(Var(X))

Normal Distribution : 68-95-99.7 Rule

Approximately 68% of the data falls within one standard deviation (σ) of the mean (μ).
Approximately 95% of the data falls within two standard deviations (2σ) of the mean (μ).
Approximately 99.7% of the data falls within three standard deviations (3σ) of the mean (μ).
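The three percentages can be recovered from the normal CDF. The sketch below uses statistics.NormalDist from the Python standard library with a standard normal (μ = 0, σ = 1); the same coverage holds for any μ and σ.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Probability of landing within k standard deviations of the mean.
coverage = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} sigma: {prob:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```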

Sum of Gaussians

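R : Total response time of a computer system

T : Processing time, L : Network latency

R = T + L

If T and L are independent Gaussian random variables, then R is also Gaussian: its mean is the sum of the two means, and its variance is the sum of the two variances.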
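As a numeric illustration, assume T and L are independent Gaussian variables. The parameter values below are purely hypothetical and chosen so the arithmetic is easy to follow. Under independence the means add and the variances add, so the standard deviations combine as a square root of summed squares.

```python
from math import sqrt

# Hypothetical parameters in milliseconds (not from the course material).
t_mean, t_std = 30.0, 4.0  # processing time T
l_mean, l_std = 20.0, 3.0  # network latency L

# R = T + L with T and L assumed independent.
r_mean = t_mean + l_mean
r_var = t_std ** 2 + l_std ** 2
r_std = sqrt(r_var)

print(r_mean, r_var, r_std)  # 50.0 25.0 5.0
```

Note that the standard deviations do not simply add: 4 + 3 = 7, while the standard deviation of R is 5.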

Skewness and Kurtosis: Moments of a Distribution

Skewness:

Skewness is a measure of the asymmetry of a probability distribution. It shows whether the distribution's tail stretches further to the left or to the right.

Positive Skewness: The right tail is longer; the mass of the distribution is concentrated on the left.
Negative Skewness: The left tail is longer; the mass of the distribution is concentrated on the right.
Zero Skewness: The distribution is symmetric about the mean.

Kurtosis :

Kurtosis is a measure of the "tailedness" of a probability distribution. It indicates how the tails of the distribution compare to the normal distribution.

High Kurtosis (Leptokurtic): Tails are fatter than the normal distribution; sharp peak.
Low Kurtosis (Platykurtic): Tails are thinner than the normal distribution; flatter peak.
Normal Kurtosis (Mesokurtic): Similar to the normal distribution.
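Skewness and kurtosis can be computed from central moments. The sketch below uses one common convention, which is an assumption here since the notes give no formula: skewness is the third central moment divided by the variance to the power 1.5, and excess kurtosis is the fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores 0. The two small data sets are made up for illustration.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)


def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    # Subtracting 3 makes the normal distribution the zero reference point.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]        # made-up symmetric data
right_tailed = [1, 1, 2, 2, 3, 10]  # made-up data with a long right tail

print(skewness(symmetric))         # 0.0
print(skewness(right_tailed) > 0)  # True
print(excess_kurtosis(symmetric))  # negative, i.e. platykurtic
```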

Lesson 2 - Probability Distributions with Multiple Variables

Joint Distribution (Discrete) - Part 1

Previously we only looked at one variable, like the height of a population. Now we will look at two variables, the height and the age of a population. That is two distributions, and we will combine them to see how they behave together.

If we divide each count by the total number of kids, we get a probability. For example, a child in this data set has a 40% probability of being 9 years old.

Now for the other variable, height.

Looking at these two histograms:

Question : What is the probability that a child is 9 years old and 49 inches tall? The answer is 0.3. Look carefully: only 3 of the 4 nine-year-olds are 49 inches tall, and there are 10 children in total, so 3/10 = 0.3.

We will call age X and height Y. This can be written as pXY(9, 49) = P(X = 9, Y = 49) = 0.3

We can read these values off directly if we arrange the counts in an organized way, as a table of age against height:

pXY(8, 48) = 0 ; pXY(7, 46) = 0.2
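A joint PMF like this can be built by counting (age, height) pairs and dividing by the number of children. The records below are hypothetical: the original data set is not reproduced in these notes, so the list was made up to agree with the values quoted above (10 children, four 9-year-olds, three of them 49 inches tall, two 7-year-olds at 46 inches, nobody aged 8 at 48 inches). Every other entry is an assumption.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (age in years, height in inches) records, one per child.
children = [
    (7, 46), (7, 46),
    (8, 47), (8, 47),
    (9, 49), (9, 49), (9, 49), (9, 50),
    (10, 50), (10, 51),
]

counts = Counter(children)
n = len(children)


def p_xy(age, height):
    """Joint PMF: share of children with exactly this age and height."""
    return Fraction(counts[(age, height)], n)


print(p_xy(9, 49))  # 3/10
print(p_xy(7, 46))  # 1/5
print(p_xy(8, 48))  # 0
```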

Joint Distribution (Discrete) - Part 2

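For independent discrete random variables, the joint PMF is the product of the individual PMFs: pXY(x, y) = pX(x) * pY(y).

Question : You roll two 6-sided dice. X = the number rolled on the 1st die, Y = the sum of the two dice. The two dice are independent of each other, but X and Y are not, because Y includes X.

Let's say X = 4 and we get a 5 on the 2nd die, so Y = 4 + 5 = 9.

Now, pXY(3, 7) = P(X = 3, Y = 7) = 1/36 (the dice combination must be 3, 4)

pXY(1, 1) = P(X = 1, Y = 1) = 0 (not possible: with X = 1 the sum Y is at least 2)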
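The joint PMF of X and Y can be tabulated by enumerating all 36 equally likely rolls. This is a minimal sketch assuming two fair dice; the name p_xy is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Count how often each (X, Y) = (first die, sum of both dice) pair occurs.
counts = Counter((first, first + second) for first, second in outcomes)


def p_xy(x, y):
    return Fraction(counts[(x, y)], len(outcomes))


print(p_xy(3, 7))  # 1/36
print(p_xy(1, 1))  # 0
print(p_xy(4, 9))  # 1/36
```

Each possible (X, Y) pair corresponds to exactly one roll, which is why every nonzero entry equals 1/36.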

Joint Distribution (Continuous)

What happens if X and Y are continuous random variables?

Imagine a call center answering phone calls, where the waiting time before a call is picked up is between 0 and 10 minutes and the customer satisfaction rating is also between 0 and 10, recorded for 1000 customers.

X variable : Waiting time (mins)

Y variable : Satisfaction rating

The joint distribution of continuous random variables is the probability distribution of two or more continuous variables taken together. It is described by a joint probability density function (PDF) rather than a PMF: probabilities come from integrating the density over a region, not from reading off single points. The joint density captures the relationship between the random variables, including how they depend on each other.

Marginal Distribution

Recall the age and height example. Previously we wanted the full joint distribution of age and height. Suppose we no longer care about age and want the distribution of height only. For each height, we add up the joint probabilities over all ages. The result is called the marginal distribution.

To find the marginal distribution of height: pY(y) = Σ pXY(x, y), summing over all ages x

To find the marginal distribution of age: pX(x) = Σ pXY(x, y), summing over all heights y
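The summing-out step can be sketched on the two-dice example from Part 2 (X = first die, Y = sum of both dice), because that joint PMF can be rebuilt exactly. The dictionary names below are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint PMF of X = first die and Y = sum of two fair dice.
joint = defaultdict(Fraction)
for first, second in product(range(1, 7), repeat=2):
    joint[(first, first + second)] += Fraction(1, 36)

# Marginals: add up the joint probabilities over the other variable.
p_x = defaultdict(Fraction)
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_x[x] += p  # sum over all y
    p_y[y] += p  # sum over all x

print(p_x[3])  # 1/6
print(p_y[7])  # 1/6
print(p_y[2])  # 1/36
```

The same loop works for the age and height table: replace the dice pairs with (age, height) pairs and their probabilities.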

Conditional Distribution

Conditional distribution is a fundamental concept in probability theory: it describes the probability distribution of a random variable given that another random variable is known to have a certain value.

The conditional distribution of a random variable Y given X = x is the probability distribution of Y when X is fixed at x. For discrete variables, pY|X(y | x) = pXY(x, y) / pX(x), provided pX(x) > 0.
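For discrete variables, conditioning amounts to keeping the part of the joint PMF where X = x and rescaling it so it sums to 1. The sketch below reuses the two-dice example (X = first die, Y = sum of both dice) and assumes fair dice; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of X = first die and Y = sum of two fair dice.
joint = {}
for first, second in product(range(1, 7), repeat=2):
    key = (first, first + second)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)


def conditional_y_given_x(x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


cond = conditional_y_given_x(3)
for y, p in sorted(cond.items()):
    print(y, p)  # y runs from 4 to 9, each with probability 1/6
```

Once the first die is known to be 3, the sum can only be 4 through 9, and each value is as likely as the corresponding face of the second die.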

Covariance of a Dataset

The covariance of a dataset measures the direction of the linear relationship between two variables. A positive covariance means that the two variables tend to move together: when one is above its mean, the other tends to be above its mean too. A negative covariance means that they tend to move in opposite directions: when one variable is high, the other tends to be low.

To calculate covariance, you can use the formula: Cov(X, Y) = Σ(Xi - µ)(Yi - ν) / n

Xi represents the i-th value of the X-variable.
µ represents the average value of the X-variable.
Yi represents the i-th value of the Y-variable, paired with Xi.
ν represents the average value of the Y-variable.
Σ represents the sum of the products (Xi - µ)(Yi - ν) over all data points i.
n represents the number of paired data points (Xi, Yi).

Steps:

Get the paired data
Calculate the average value of each variable
Find the difference between each value and the mean, for both variables
Multiply the two differences for each data point
Add the products together
Divide the sum by n to get the covariance of the data
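The steps above can be sketched as a short function. The paired data below is made up for illustration, and the function divides by n as in the formula given here; the sample covariance used by many libraries divides by n - 1 instead.

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)
    mean_x = sum(xs) / n  # average of each variable
    mean_y = sum(ys) / n
    # Difference from the mean for both variables, multiplied pair by pair.
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / n  # add the products, then divide by n


# Made-up paired data: the second variable rises with the first.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

print(covariance(hours, scores))        # 4.0
print(covariance(hours, scores[::-1]))  # -4.0
```

Reversing the second list makes it fall as the first rises, which flips the sign of the covariance.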

NB: Will add the math later
