Normal Distribution
Discrete probability distributions can’t handle every situation.
Why Normal Distribution?
Many quantities in nature are commonly modeled as approximately normal (for example age, income, height and weight). So it is a reasonable first approximation when we are not aware of the underlying distribution pattern.
Most often the goal in ML/AI is to make the data linearly separable, even if it means projecting the data into a higher-dimensional space to find a fitting "hyperplane" (for example SVM kernels, neural net layers, softmax). The reasoning is that linear boundaries tend to reduce variance and are the simplest, most natural and most interpretable choice, besides reducing mathematical and computational complexity. And when we aim for linear separability, it is good to reduce the effect of outliers, influential points and leverage points. Why? Because the hyperplane is very sensitive to them.
To understand this, let's shift to a 2D space where we have one predictor (X) and one target (y), and assume there is a good positive correlation between X and y. If X and y are both normally distributed, you are most likely to fit a straight line with many points clustered around its middle rather than at its ends (where outliers and leverage or influential points sit). So the fitted regression line will most likely suffer little variance when predicting on unseen data.
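The sensitivity of a fitted line to a single leverage point can be seen in a small experiment. The sketch below is only an illustration on made-up data: the helper name `ols_slope`, the ten points on the line y = 2x, and the extra point (30, 5) are all assumptions chosen for the example, not values from any real dataset.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: ten points lying exactly on y = 2x.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
slope_clean = ols_slope(xs, ys)

# One extra point far out on the X axis and far off the line (a leverage point).
slope_with_leverage = ols_slope(xs + [30], ys + [5])

print(f"slope without the leverage point: {slope_clean:.2f}")
print(f"slope with the leverage point:    {slope_with_leverage:.2f}")
```

A single point at the far end of the X range is enough to flatten the slope almost completely, which is the kind of instability the paragraph above describes.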
The normal distribution is important because of the central limit theorem. In simple terms, if you have many independent variables that may be generated by all kinds of distributions, assuming that nothing too crazy happens, the aggregate of those variables will tend toward a normal distribution. This universality across different domains makes the normal distribution one of the centerpieces of applied mathematics and statistics.
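This tendency of aggregates toward a normal shape can be checked with a quick simulation. The sketch below averages draws from an exponential distribution, which is strongly skewed, and looks at how the averages behave. The helper `sample_means`, the choice of an exponential source, and the sample sizes are assumptions made for illustration only.

```python
import random
import statistics


def sample_means(draw, n_terms, n_samples, seed=0):
    """Return n_samples averages, each taken over n_terms independent draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n_terms))
        for _ in range(n_samples)
    ]


# Exponential(1) draws are right-skewed, with mean 1 and standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0), n_terms=50, n_samples=20_000)

mu = statistics.fmean(means)
sigma = statistics.stdev(means)

# Share of the averages that fall within one standard deviation of their mean.
within_one_sigma = sum(abs(m - mu) <= sigma for m in means) / len(means)

print(f"mean={mu:.3f}  stdev={sigma:.3f}  within 1 sigma={within_one_sigma:.3f}")
```

If the averages are close to normal, the mean should land near 1, the standard deviation near 1/√50, and roughly 68% of the averages should fall within one standard deviation, even though the individual draws are far from bell-shaped.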
Another corollary is that the normal distribution makes math easy - things like calculating moments, correlations between variables, and other calculations that are domain specific. For that reason, even if a distribution isn't actually normal, it is useful to assume that it is normal to get a good, first-order understanding of a set of data.
Definition from Wikipedia
The probability density of the normal distribution is f(x) = 1 / (σ √(2π)) · exp(−(x − μ)² / (2σ²)).
A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.
Let's see why the normal distribution is such a beautiful phenomenon.
Nice, huh?
Let's see now how we can find X for a particular probability.
From the standard normal table you get Z, and from Z you get X = μ + Zσ. Simple as that :)
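The table lookup can also be done in code. The sketch below uses `statistics.NormalDist` from the Python standard library in place of a printed table; the values μ = 100, σ = 15 and the target probability 0.95 are hypothetical, picked only to have numbers to work with.

```python
from statistics import NormalDist

# Hypothetical parameters and target probability P(X <= x) = p.
mu, sigma = 100.0, 15.0
p = 0.95

# Step 1: the inverse CDF of the standard normal plays the role of the table.
z = NormalDist().inv_cdf(p)

# Step 2: convert Z back to the original scale.
x = mu + z * sigma

# The same answer in one step, asking the N(mu, sigma^2) distribution directly.
x_direct = NormalDist(mu, sigma).inv_cdf(p)

print(f"z = {z:.4f}, x = {x:.2f}, x_direct = {x_direct:.2f}")
```

Both routes should agree, since going through Z is just a change of units.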
The normal distribution is called normal because it’s seen as an ideal. It’s what you’d “normally” expect to see in real life for a lot of continuous data such as measurements.
The normal distribution is in the shape of a bell curve.
The curve is symmetrical, with the highest probability density in the center of the curve. The probability density decreases the further away you get from the mean. Both the mean and median are at the center and have the highest probability density.
The normal distribution is defined by two parameters, μ and σ². μ tells you where the center of the curve is, and σ gives you the spread. If a continuous random variable X follows a normal distribution with mean μ and standard deviation σ, this is generally written X ~ N(μ, σ²).
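To make the roles of μ and σ concrete, the sketch below evaluates the normal density by hand and checks the symmetry and the peak at the mean. The function name `normal_pdf` and the values μ = 170, σ = 10 are assumptions for illustration. Note that the notation N(μ, σ²) carries the variance, while `statistics.NormalDist` takes the standard deviation σ as its second argument.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out from the standard formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical parameters.
mu, sigma = 170.0, 10.0
dist = NormalDist(mu, sigma)  # second argument is sigma, not sigma squared

peak = normal_pdf(mu, mu, sigma)         # density at the center
left = normal_pdf(mu - 15, mu, sigma)    # 15 units below the mean
right = normal_pdf(mu + 15, mu, sigma)   # 15 units above the mean

print(f"peak={peak:.5f}  left={left:.5f}  right={right:.5f}")
print(f"variance={dist.variance:.1f}")
```

The two off-center values should match each other and be smaller than the value at μ, which is the symmetric bell shape described above.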
Although height and weight are often cited as examples, they are not exactly normally distributed. Weight, in particular, is somewhat right skewed. The average American man weighs about 190 pounds. There are some men who weigh well over 380 but none who weigh even close to 0.
IQ is sometimes cited as an example, but it has fatter tails than the normal.
No physical variable is exactly normally distributed.
The parameter μ in this formula is the mean or expectation of the distribution (and also its median and mode). The parameter σ is its standard deviation; its variance is therefore σ².
The relation between probability and Z is fixed and can be looked up in a standardized normal probability table.
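Because that relation is fixed, it can be computed instead of looked up. The sketch below standardizes a value with Z = (X − μ) / σ and evaluates the standard normal CDF through the error function, Φ(z) = ½ (1 + erf(z / √2)). The function names and the numbers (μ = 100, σ = 15, x = 130) are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_below(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), by standardizing to Z first."""
    z = (x - mu) / sigma
    return standard_normal_cdf(z)


# Hypothetical example: x sits two standard deviations above the mean (z = 2).
p = prob_below(130.0, mu=100.0, sigma=15.0)
print(f"P(X <= 130) = {p:.4f}")
```

Any normal variable reduces to the same Z scale this way, which is why a single table (or a single function) is enough.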