Welcome to the 2nd week of learning Probability & Statistics for Machine Learning & Data Science
Lesson 1 - Describing Distributions
Expected Value
The expected value of a random variable is its long-run average: the mean of its possible values, each weighted by its probability.
Question : You throw a fair coin. If it lands heads you win 10 dollars, otherwise you win nothing. What's the maximum amount you should be willing to pay to play this game? $0, $4, $5, $6 or $10? Because the coin is fair, heads and tails each have probability 0.5.
Answer : 0.5 * $10 + 0.5 * $0 = $5, so you expect to win $5 on average. This is written E[X] = 5. You can pay less, but you should not pay more than $5.
Question : What's the maximum amount you should be willing to pay to play a new game, where you flip 3 coins and win a dollar for each heads you get?
Answer : 3 x (0.5 * $1 + 0.5 * $0) = $1.50. This is the average amount of money we would win if we played the game many times. So, E[X] = 1.5, where X = number of heads.
So, if you have a discrete random variable X, it has a PMF (Probability Mass Function) p(x), which gives the probability of every value x that X can take. The expected value is E[X] = Σ x * p(x), summed over all possible x.
If all values in the histogram are equally likely, the expected value sits in the middle. If some values carry more probability than others, the expected value (the balance point of the histogram) shifts toward the more heavily weighted values.
In both cases, we sum every possible value of x multiplied by its probability.
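The weighted sum can be checked with a short Python sketch for the two coin games. The function name expected_value and the dictionary layout {value: probability} are choices made here for illustration, not something from the course.

```python
from fractions import Fraction
from math import comb

def expected_value(pmf):
    """Return E[X] = sum of x * P(X = x) for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Game 1: one fair coin, win $10 on heads and $0 on tails.
coin_game = {10: Fraction(1, 2), 0: Fraction(1, 2)}

# Game 2: three fair coins, $1 per head, so P(X = k) = C(3, k) / 8.
three_coins = {k: Fraction(comb(3, k), 2**3) for k in range(4)}

print(expected_value(coin_game))    # 5
print(expected_value(three_coins))  # 3/2
```

Exact fractions are used so that the results match the hand calculation without floating-point rounding.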
Central tendency: Median and Mode
Median :
The mean does not always describe the data well. Let's say the salaries of 5 men are $600, $500, $450, $400 and $200. The average salary is $430, which suggests that each of the 5 salaries is somewhere around $430. But what if one of them earns $1500 instead of $600? The average (following the mean formula) becomes $610. Is that representative? No: four of the five men earn less than $610, because the single large salary pulls the mean up.
How to fix it? Put the salaries in order and find the middle value : $200, $400, $450, $500, $1500. The middle value is $450, so the median is $450.
When the list has an even number of values, take the average of the two middle values; that is the median.
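A quick check of the salary example, using Python's standard statistics module. The last line drops the largest salary only to illustrate the even-length case.

```python
from statistics import mean, median

salaries = [600, 500, 450, 400, 200]
with_outlier = [1500, 500, 450, 400, 200]

# The outlier moves the mean a lot but leaves the median unchanged.
print(mean(salaries), median(salaries))          # 430 450
print(mean(with_outlier), median(with_outlier))  # 610 450

# Even number of values: the median is the average of the two middle ones.
print(median([200, 400, 450, 500]))              # 425.0
```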
Mode :
Mode is a measure of central tendency: the most frequently occurring value in a data set. The mode applies to both qualitative (categorical) and quantitative (numerical) data. A data set can have one mode (unimodal), more than one mode when several values tie for the highest frequency (bimodal or multimodal), or no meaningful mode when every value occurs equally often.
Examples
1. Categorical Data
Consider a survey of favorite fruits among a group of people:
Data: Apple, Banana, Banana, Apple, Orange, Banana, Apple
Mode: Apple and Banana (each appears 3 times, so this data set has two modes)
2. Numerical Data
Consider a set of exam scores:
Data: 85, 90, 78, 90, 88, 85, 85, 92
Mode: 85 (since it appears the most frequently)
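Both examples can be checked with statistics.multimode (Python 3.8+), which returns every value that reaches the highest count. In the fruit list, Apple and Banana both appear three times, so the sketch reports both.

```python
from statistics import multimode

fruits = ["Apple", "Banana", "Banana", "Apple", "Orange", "Banana", "Apple"]
scores = [85, 90, 78, 90, 88, 85, 85, 92]

# multimode lists all most-frequent values, in order of first appearance.
print(multimode(fruits))  # ['Apple', 'Banana']
print(multimode(scores))  # [85]
```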
Expected value of a Function
For a random variable X and a function g, the expected value of g(X), denoted E[g(X)], is a weighted average of all the values g(X) can take. For a discrete X, E[g(X)] = Σ g(x) * p(x). In particular, E[aX + b] = aE[X] + b, which implies that the expectation of a constant is that constant (E[b] = b) and that E[aX] = aE[X]. This means that expectation is a linear operator.
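As a small sanity check of linearity, the sketch below compares E[aX + b] with aE[X] + b for a fair six-sided die. The die and the constants a = 3, b = 2 are arbitrary choices made for illustration.

```python
from fractions import Fraction

def expected_value(pmf, g=lambda x: x):
    """Return E[g(X)] = sum of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = 3, 2  # arbitrary constants, chosen only for this example

lhs = expected_value(die, lambda x: a * x + b)  # E[aX + b]
rhs = a * expected_value(die) + b               # aE[X] + b
print(lhs, rhs)  # 25/2 25/2

# For a non-linear g, E[g(X)] is generally different from g(E[X]).
print(expected_value(die, lambda x: x**2))  # 91/6
print(expected_value(die) ** 2)             # 49/4
```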
Sum of expectations
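Question : You flip a coin. If heads, you win $1, otherwise you win nothing. Then you roll a die. You win the amount you roll.
What are your expected winnings for this game? $1, $2, $4 or $5? The answer is $4: E[coin] + E[die] = 0.5 + 3.5 = 4, because the expectation of a sum is the sum of the expectations.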
Question : There are 8 billion people in the world. There is a bag with their 8 billion names, and each person is given a random name from it.
What is the expected number of correct assignments? 0, 0.1, 1 or 8 billion? The answer is 1.
Now, imagine there are three people. Each of them has probability 1/3 of getting their own name, so the expected number of correct assignments is 1/3 + 1/3 + 1/3 = 1. In general, n * (1/n) = 1 for any number of people n, so the answer for 8 billion people is also 1.
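Both questions can be checked by brute force. The sketch below adds the two expectations for the coin-and-die game, then averages the number of correct names over every possible assignment for small groups (8 billion is far too large to enumerate). The helper name expected_matches is made up for this example.

```python
from fractions import Fraction
from itertools import permutations

# Coin plus die: E[coin + die] = E[coin] + E[die].
e_coin = Fraction(1, 2) * 1 + Fraction(1, 2) * 0
e_die = sum(Fraction(face, 6) for face in range(1, 7))
print(e_coin + e_die)  # 4

def expected_matches(n):
    """Average number of people who get their own name, over all n! assignments."""
    perms = list(permutations(range(n)))
    total = sum(
        sum(1 for person, name in enumerate(p) if person == name) for p in perms
    )
    return Fraction(total, len(perms))

# The average is exactly 1 for every group size tried here.
print(all(expected_matches(n) == 1 for n in range(1, 7)))  # True
```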
Variance
Variance measures the spread (dispersion) of a random variable or data set around its mean.
Question : You are flipping a fair coin, heads you win $1, tails you lose $1. What is the maximum amount of money you should be willing to pay to play this game? -$1, -$0.5, $0, $0.5 or $1? It's $0.
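Question : What is the fair amount of money to pay to play a new game, where you flip a fair coin and win $100 if it's heads and lose $100 if it's tails? -$100, -$50, $0, $50 or $100? It's $0 again. Both games have the same expected value, but the outcomes of the second game are spread much further from it. Variance captures that difference.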
Variance, Var(X) = E[(X - E[X])^2] (Variance Formula)
Steps:
Find the mean of X, E[X]
Find the deviation of every value of X from the mean (Deviation = x - E[X])
Square the deviations
Average the squared deviations, weighting each by its probability
Question : Which game has greater variance? Game 1: If heads, you win $2, if tails you lose $2. Game 2: Heads, you win $3, Tails you lose $1
They both have the same variance, because the spread of their outcomes around the mean is the same.
E[X1] = 1/2 * 2 + 1/2 * (-2) = 0; Var(X1) = 1/2 * (-2 - 0)^2 + 1/2 * (2 - 0)^2 = 4
E[X2] = 1/2 * 3 + 1/2 * (-1) = 1; Var(X2) = 1/2 * (-1 - 1)^2 + 1/2 * (3 - 1)^2 = 4
The alternative formula for variance is Var(X) = E[X^2] - (E[X])^2
Standard Deviation : std(X) = sqrt(Var(X))
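The variance steps, the alternative formula and the standard deviation can be sketched for the four coin games discussed in this section. The function names and the game labels are illustrative choices.

```python
from fractions import Fraction
from math import sqrt

def expected_value(pmf, g=lambda x: x):
    """E[g(X)] for a PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mean = expected_value(pmf)  # find the mean
    # Square each deviation from the mean and take the weighted average.
    return expected_value(pmf, lambda x: (x - mean) ** 2)

def variance_alt(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    return expected_value(pmf, lambda x: x**2) - expected_value(pmf) ** 2

half = Fraction(1, 2)
games = {
    "win/lose $1": {1: half, -1: half},
    "win/lose $100": {100: half, -100: half},
    "game 1": {2: half, -2: half},
    "game 2": {3: half, -1: half},
}
for name, pmf in games.items():
    var = variance(pmf)
    # Columns: name, E[X], Var(X), Var(X) via the alternative formula, std(X)
    print(name, expected_value(pmf), var, variance_alt(pmf), sqrt(var))
```

The $1 and $100 games share an expected value of 0 but have variances of 1 and 10000, while game 1 and game 2 both have variance 4.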
Normal Distribution : 68-95-99.7 Rule
Approximately 68% of the data falls within one standard deviation (σ) of the mean (μ).
Approximately 95% of the data falls within two standard deviations (2σ) of the mean (μ).
Approximately 99.7% of the data falls within three standard deviations (3σ) of the mean (μ).
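The three percentages can be reproduced from the normal distribution itself. For a normal variable, the probability of landing within k standard deviations of the mean is erf(k / sqrt(2)); the helper name prob_within is chosen here for illustration.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normal X, which equals erf(k / sqrt(2))."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```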
Sum of Gaussians
R : Total response time of a computer system
T : Processing time, L : Network Latency
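R = T + L. If T and L are independent Gaussian (normally distributed) variables, then R is also Gaussian, with mean E[T] + E[L] and variance Var(T) + Var(L).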
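A small simulation can illustrate how the mean and variance of R behave, assuming T and L are independent and Gaussian. The parameter values below are hypothetical (the notes give none), and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed for a repeatable run

# Hypothetical parameters in milliseconds; not taken from the notes.
MU_T, SIGMA_T = 50.0, 4.0  # processing time T
MU_L, SIGMA_L = 20.0, 3.0  # network latency L

n = 100_000
r = [rng.gauss(MU_T, SIGMA_T) + rng.gauss(MU_L, SIGMA_L) for _ in range(n)]

mean_r = fmean(r)
var_r = fmean([(x - mean_r) ** 2 for x in r])

print(mean_r, MU_T + MU_L)              # sample mean vs. 70.0
print(var_r, SIGMA_T**2 + SIGMA_L**2)   # sample variance vs. 25.0
```

The sample values should land close to the sum of the means and the sum of the variances, up to sampling noise. Note that the variances add, not the standard deviations.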
Skewness and Kurtosis: Moments of a Distribution
Skewness:
Skewness is a measure of the asymmetry of a probability distribution. It shows whether the distribution is skewed to the left or to the right.
Positive Skewness: The right tail is longer; the mass of the distribution is concentrated on the left.
Negative Skewness: The left tail is longer; the mass of the distribution is concentrated on the right.
Zero Skewness: The distribution is symmetric about the mean.
Kurtosis :
Kurtosis is a measure of the "tailedness" of a probability distribution. It indicates how the tails of the distribution compare to the normal distribution.
High Kurtosis (Leptokurtic): Tails are fatter than the normal distribution; sharp peak.
Low Kurtosis (Platykurtic): Tails are thinner than the normal distribution; flatter peak.
Normal Kurtosis (Mesokurtic): Similar to the normal distribution.
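Both quantities can be computed from the central moments of a sample. The sketch below uses the standardized third and fourth moments; this is one common definition, and libraries often report "excess" kurtosis (this value minus 3) instead. The sample lists are made up for illustration.

```python
from statistics import fmean

def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    m = fmean(data)
    return fmean([(x - m) ** k for x in data])

def skewness(data):
    """Third standardized moment: m3 / m2^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth standardized moment: m4 / m2^2 (equals 3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # made-up symmetric sample
right_tailed = [1, 1, 1, 2, 10]    # made-up sample with a long right tail

print(skewness(symmetric))           # 0.0
print(skewness(right_tailed) > 0)    # True
print(kurtosis(symmetric) < 3)       # True (flat sample, thin tails)
```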
Lesson 2 - Probability Distributions with Multiple Variables
Joint Distribution (Discrete) - Part 1
Previously we only looked at one variable, such as the height of a population. Now we will look at two variables, the age and the height of a population of children. Each has its own distribution, and we will combine them to see how they behave together.
If we divide each count by the total number of kids, we get a probability. For example, a child in this data set has a 40% probability of being 9 years old.
Now for the other variable, height.
Looking at these two histograms:
Question : What is the probability that a child is 9 years old and 49 inches tall? The answer is 0.3. Look carefully: only 3 of the 4 nine-year-olds are 49 inches tall, and there are 10 children in total, so 3/10 = 0.3.
We will call age X and height Y. This can be written as pXY(9, 49) = P(X = 9, Y = 49) = 0.3
We can read off other values if we arrange the counts in an organized way (a table of ages against heights):
pXY(8, 48) = 0 ; pXY(7, 46) = 0.2
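A joint PMF like this can be built by counting (age, height) pairs. The actual data set is not reproduced in these notes, so the ten pairs below are hypothetical, chosen only to agree with the three probabilities quoted above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (age, height) pairs for 10 children; illustrative only,
# picked to match the probabilities quoted in the notes.
children = [
    (7, 45), (7, 46), (7, 46),
    (8, 46), (8, 47), (8, 47),
    (9, 49), (9, 49), (9, 49), (9, 50),
]

counts = Counter(children)
joint = {pair: Fraction(c, len(children)) for pair, c in counts.items()}

def p_xy(age, height):
    """Joint PMF pXY(age, height); 0 for pairs that never occur."""
    return joint.get((age, height), Fraction(0))

print(p_xy(9, 49), p_xy(7, 46), p_xy(8, 48))  # 3/10 1/5 0
```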
Joint Distribution (Discrete) - Part 2
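For independent discrete random variables, the joint PMF is the product of the individual PMFs: pXY(x, y) = pX(x) * pY(y). When the variables are not independent, the joint PMF has to be worked out from the outcomes directly, as in the next example.
Question : You roll two 6-sided dice. X = the number rolled on the 1st die, Y = the sum of the two dice. (Here X and Y are not independent, because the sum depends on the first die.)
Let's say X = 4 and we roll a 5 on the 2nd die, so Y = 4 + 5 = 9.
Now, pXY(3, 7) = P(X = 3, Y = 7) = 1/36 (the dice must show 3 and then 4).
pXY(1, 1) = P(X = 1, Y = 1) = 0 (the sum of two dice is at least 2, so Y = 1 is not possible).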
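The dice example is small enough to enumerate. The sketch below builds the joint PMF of X (first die) and Y (sum) from the 36 equally likely rolls, then checks one pair where the joint probability differs from the product of the marginals, which shows that X and Y are dependent.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Each of the 36 (first die, second die) rolls is equally likely.
counts = Counter((d1, d1 + d2) for d1, d2 in product(range(1, 7), repeat=2))
joint = {xy: Fraction(c, 36) for xy, c in counts.items()}

def p_xy(x, y):
    """Joint PMF of X = first die and Y = sum; 0 for impossible pairs."""
    return joint.get((x, y), Fraction(0))

print(p_xy(3, 7))  # 1/36
print(p_xy(1, 1))  # 0

# Dependence check at (6, 12): the joint is 1/36, the product is 1/6 * 1/36.
p_x6 = sum(p for (x, _), p in joint.items() if x == 6)
p_y12 = sum(p for (_, y), p in joint.items() if y == 12)
print(p_xy(6, 12) == p_x6 * p_y12)  # False
```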
Joint Distribution (Continuous)
What happens if X and Y are continuous random variables?
Imagine a call center. The waiting time before a call is picked up is between 0 and 10 minutes, and the customer satisfaction rating is also between 0 and 10. [1000 customers]
X variable : Waiting time (mins)
Y variable : Satisfaction rating
The joint distribution of continuous random variables is the probability distribution of two or more continuous variables taken together. It is described by a joint probability density function, which captures the relationship between the variables, including how they depend on each other.
Marginal Distribution
Recall the age and height example. Previously we wanted the full joint distribution of age and height. Now suppose we no longer care about age and want the distribution of height only. To get it, we sum the joint probabilities over all ages. The result is called the marginal distribution.
To find the marginal distribution of height, sum over all ages: pY(y) = Σx pXY(x, y)
To find the marginal distribution of age, sum over all heights: pX(x) = Σy pXY(x, y)
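Computing a marginal distribution amounts to adding up joint probabilities over the variable being dropped. The joint table below is hypothetical (the real age and height data are not reproduced in these notes), and the function name marginal is an illustrative choice.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF pXY(age, height) for 10 children; illustrative only.
joint = {
    (7, 45): Fraction(1, 10), (7, 46): Fraction(2, 10),
    (8, 46): Fraction(1, 10), (8, 47): Fraction(2, 10),
    (9, 49): Fraction(3, 10), (9, 50): Fraction(1, 10),
}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable (axis 0 = age, axis 1 = height)."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

p_age = marginal(joint, 0)
p_height = marginal(joint, 1)
print(p_age[9])      # 2/5
print(p_height[46])  # 3/10
```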
Conditional Distribution
Conditional distribution is a fundamental concept in probability theory that describes the probability distribution of a random variable given that another random variable is known to have a certain value.
The conditional distribution of a random variable Y given X = x is the probability distribution of Y when X is fixed at x: pY|X(y | x) = pXY(x, y) / pX(x), defined when pX(x) > 0.
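As an illustration, the sketch below conditions the two-dice example (X = first die, Y = sum) on X = 3 by dividing the joint probabilities by the marginal probability of X = 3. The function name conditional_y_given_x is made up for this example.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of X = first die and Y = sum of two dice.
joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1, d1 + d2)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

def conditional_y_given_x(joint, x):
    """pY|X(y | x) = pXY(x, y) / pX(x); assumes pX(x) > 0."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

cond = conditional_y_given_x(joint, 3)
print(sorted(cond))        # [4, 5, 6, 7, 8, 9]
print(cond[7])             # 1/6
print(sum(cond.values()))  # 1
```

Given X = 3, the sum can only be 4 through 9, each with probability 1/6, and the conditional probabilities add up to 1.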
Covariance of a Dataset
The covariance of a dataset measures the direction of the linear relationship between two variables. A positive covariance means that the two variables tend to be high together and low together. A negative covariance means that when one variable is high, the other tends to be low.
To calculate covariance, you can use the formula: Cov(X, Y) = Σ (Xi - µ)(Yi - v) / n
Xi represents the i-th value of the X-variable.
µ represents the average value of the X-variable.
Yi represents the i-th value of the Y-variable, paired with Xi.
v represents the average value of the Y-variable.
Σ represents the sum of the products (Xi - µ)(Yi - v) over all data points.
n represents the number of data points, that is, the number of (Xi, Yi) pairs.
Steps:
Get the data
Calculate the average value of each variable
Find the difference between each value and the mean of its variable
Multiply the two differences for each data point
Add the products together
Divide the sum by n to get the covariance of the data
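The steps translate fairly directly into code. The sketch below divides by n, as in the formula in these notes; many libraries divide by n - 1 instead (the sample covariance). The ages and heights are made-up values for illustration.

```python
def covariance(xs, ys):
    """Population covariance of paired data: sum((x - mean_x) * (y - mean_y)) / n."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n  # average of each variable
    mean_y = sum(ys) / n
    # Multiply the paired deviations, add them up, divide by n.
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / n

# Made-up paired data: taller children tend to be older here.
ages = [7, 8, 9, 9]
heights = [46, 47, 49, 50]

print(covariance(ages, heights))                # 1.25
print(covariance(ages, [-h for h in heights]))  # -1.25
```

A positive result means the two variables tend to move together; flipping the sign of one variable flips the sign of the covariance.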
NB: Will add the math later