Probability & Statistics for Machine Learning & Data Science

Welcome to week 3!

Lesson 1 - Population and Sample

Population and Sample

Population: The whole group we want to study. It could be 5 people, 10 people, or 200 million.

Sample: A small but representative part of the population.

Studying the whole population for a specific piece of information when it contains millions of people is very challenging. Hence the idea of sampling: working with a sample costs less, takes less time, and is more efficient.

Question : You want to study the price of chicken fry in Bangladesh.

What is the population of your study? What is the sample of your study?

  • All chicken fry sold everywhere in the world?

  • All chicken fry sold in Bangladesh?

  • Chicken fry sold in 4 randomly selected stores?

  • The chicken fry bought during a festival?

The population of your study is: All chicken fry sold in Bangladesh.

The sample of your study is: Chicken fry sold in the 4 randomly selected stores.

Remember! Every dataset we work with in machine learning is a sample, NOT the population. A good sample carries roughly the same information as the population it was drawn from.

Sample Mean

Suppose you want to know the average height of a population. To do this, you take a sample, measure the heights, and calculate the mean; that mean is your estimate of the population's average height. We certainly don't recover the exact population value, but we get something very close to it, and that is the main point.

NB: The bigger the sample size, the more accurate the estimate.
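
As a quick sketch (the heights below are simulated, not from the lesson), estimating a population's average height from a random sample looks like this:

```python
import random

# Hypothetical population of 100,000 heights in cm (for illustration only)
random.seed(42)
population = [random.gauss(170, 10) for _ in range(100_000)]

# Draw a random sample and use its mean as the estimate of the population mean
sample = random.sample(population, 500)
sample_mean = sum(sample) / len(sample)

print(f"Estimated mean height: {sample_mean:.2f} cm")
print(f"True population mean:  {sum(population) / len(population):.2f} cm")
```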

Sample Proportion

Proportion formula: Population proportion p = x / N, where x is the number of items with a given characteristic (e.g., owning a car) and N is the population size.

When we don't have access to N, we take a random sample instead. For a sample of 6 people where 2 own a car, the sample proportion is 2/6 ≈ 0.33 = 33.3%.
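
A minimal sketch of the same calculation (the six car-ownership answers are made up for illustration):

```python
# Hypothetical survey: 1 = owns a car, 0 = does not
sample = [1, 0, 0, 1, 0, 0]

# Sample proportion p_hat = x / n
p_hat = sum(sample) / len(sample)
print(f"Sample proportion: {p_hat:.3f} ({p_hat:.1%})")  # 0.333 (33.3%)
```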

Sample Variance

Sample variance is a measure of the dispersion/spread of data points around their mean. It gives an idea of how much the values in the dataset deviate from the mean.

Sample variance formula: s^2 = Σ (Xi - X̄)^2 / (n - 1)

Here,

  • Xi are the individual data points

  • X̄ is the sample mean

  • n is the number of data points

Normally, we won't have access to the population size (N) or the population mean (μ).

Note: For the population variance, we divide by N (the population size), not n - 1.

Example

Let's calculate the sample variance for the dataset X = [2,4,6,8]

  1. Calculate the mean, X̄ = (2+4+6+8)/4 = 5

  2. Find the squared differences,

    • (2-5)^2 = 9

    • (4-5)^2 = 1

    • (6-5)^2 = 1

    • (8-5)^2 = 9

  3. Sum the squared differences, 9+1+1+9 = 20

  4. Divide by n - 1, Var = 20/(4-1) = 20/3 ≈ 6.67

So, the sample variance for the dataset [2,4,6,8] is approximately 6.67. Remember, variance is the average squared deviation from the mean.
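
A quick check of this calculation with NumPy; passing ddof=1 gives the n - 1 denominator of the sample variance, while ddof=0 gives the population version:

```python
import numpy as np

X = np.array([2, 4, 6, 8])

sample_mean = X.mean()          # 5.0
sample_var = X.var(ddof=1)      # divides by n - 1 -> 20 / 3 ≈ 6.67
population_var = X.var(ddof=0)  # divides by n     -> 20 / 4 = 5.0

print(sample_mean, sample_var, population_var)
```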

Law of Large Numbers

The Law of Large Numbers says that as the sample size increases, the sample mean gets closer and closer to the mean of the entire population.

Conditions :

  • Sample must be drawn randomly

  • Sample size should be sufficiently large

  • Individual observations in the sample must be independent of each other
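
A small simulation sketch of the Law of Large Numbers (using a fair six-sided die, whose population mean is 3.5): as the number of rolls grows, the sample mean settles near 3.5.

```python
import random

random.seed(0)
population_mean = 3.5  # mean of a fair six-sided die

for n in [10, 100, 1_000, 100_000]:
    rolls = [random.randint(1, 6) for _ in range(n)]
    sample_mean = sum(rolls) / n
    print(f"n = {n:>7}: sample mean = {sample_mean:.3f} (population mean = {population_mean})")
```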

Central Limit Theorem

The CLT explains why so many distributions of averages end up close to the normal distribution. Specifically, the CLT states that, given a sufficiently large sample size n drawn from a population with finite variance, the distribution of the sample mean approaches a normal (Gaussian) distribution, regardless of the shape of the original distribution. The theorem is about the distribution of sample means, not the individual data points, and that distribution becomes bell-shaped.
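
A simulation sketch of the CLT (assuming an exponential population, chosen only because it is clearly not normal): the individual draws are skewed, yet the sample means are approximately normal.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50               # size of each sample
num_samples = 10_000

# Exponential distribution: skewed, clearly not Gaussian
samples = rng.exponential(scale=1.0, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# By the CLT, sample_means ≈ Normal(mean=1, std=1/sqrt(n))
print(f"Mean of sample means: {sample_means.mean():.3f} (theory: 1.000)")
print(f"Std of sample means:  {sample_means.std():.3f} (theory: {1/np.sqrt(n):.3f})")
```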

Lesson 2 - Point Estimation

Maximum Likelihood Estimation

Imagine you walk into a room and see a bunch of popcorn lying on the floor. What do you think most likely happened?

  • People sleeping?

  • People playing board game?

  • People watching a movie?

Of course, people are most likely to have popcorn while watching a movie.

So, Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model by choosing the parameter values that make the observed data most likely.

MLE for a Bernoulli Distribution

The Bernoulli distribution is a discrete probability distribution where a random variable can take only two possible outcomes: success (1) with probability p, and failure (0) with probability 1 - p.

Question: What operation can be applied to the function p^8(1-p)^2 to simplify it and find the maximum likelihood? Ans: Take the natural logarithm of the function.
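
Taking the natural log turns the product into a sum that is easy to maximize: ℓ(p) = 8 ln p + 2 ln(1 - p), and setting dℓ/dp = 8/p - 2/(1 - p) = 0 gives p = 8/10 = 0.8. A small numerical check (the grid-search approach here is just for illustration):

```python
import numpy as np

# Likelihood from 10 Bernoulli trials with 8 successes: L(p) = p^8 (1 - p)^2
p_grid = np.linspace(0.001, 0.999, 9999)
log_likelihood = 8 * np.log(p_grid) + 2 * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_likelihood)]
print(f"MLE of p ≈ {p_mle:.3f}")  # ≈ 0.8 = 8/10
```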

MLE for Linear Regression

Which model looks like the best fit for the data? It's model 2.

When choosing the most appropriate model for a dataset, judge how well the model captures the distribution of the data points, so that it accurately represents the observed pattern.
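
Under the usual assumption of Gaussian noise around the line, maximizing the likelihood is equivalent to minimizing the sum of squared errors (ordinary least squares). A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated from y = 2x + 1 plus Gaussian noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

# Least-squares fit = MLE of slope and intercept under Gaussian noise
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Estimated line: y = {slope:.2f}x + {intercept:.2f}")
```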

Regularization

Regularization is a technique used in statistical models and machine learning to prevent overfitting by introducing additional information or constraints into the model. Overfitting occurs when a model learns not only the underlying pattern in the training data but also the noise, leading to poor generalization to new, unseen data.

Key Concepts

  1. Overfitting: When a model is too complex and captures noise in the training data as if it were a part of the pattern. This results in a model that performs well on training data but poorly on test data.

  2. Underfitting: When a model is too simple and fails to capture the underlying pattern in the data. This results in a model that performs poorly on both training and test data.

  3. Regularization: Introduces a penalty for model complexity to prevent overfitting. It does this by adding a regularization term to the loss function that the model is trying to minimize.

Normally we would think Model C is the best fit. But look carefully: it is too complex and chases individual points (including the noise) instead of the overall pattern. To find the best model we will combine a penalty value with the loss value. Let's see how it works:

To account for both penalty and loss when determining the best model with regularization, calculate the new loss as the sum of the penalty and the original loss.
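
A minimal sketch of this "new loss = loss + penalty" idea using an L2-style penalty; the loss and coefficient values for Models A, B, and C below are hypothetical stand-ins, not the numbers from the lesson:

```python
import numpy as np

# Hypothetical (training loss, model coefficients) for three candidate models
models = {
    "A": (10.0, np.array([0.5, 1.2])),                  # too simple: high loss
    "B": (2.0,  np.array([1.0, -0.8, 0.3])),            # moderate loss, small weights
    "C": (0.5,  np.array([25.0, -40.0, 18.0, -7.0])),   # tiny loss, huge weights
}

lam = 0.1  # regularization strength

for name, (loss, w) in models.items():
    penalty = lam * np.sum(w ** 2)  # L2 penalty on the coefficients
    new_loss = loss + penalty
    print(f"Model {name}: loss={loss:.2f}  penalty={penalty:.2f}  new loss={new_loss:.2f}")

# With these numbers, Model B ends up with the smallest regularized loss
```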

Based on this - Model B is the best fit
