Why Statistics in ML or Data Science


Last updated 5 years ago


Statistics helps you compare classifiers and models.

You need statistics every time you report the performance of your model: a number without a confidence interval is not a result.
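As an illustrative sketch of why this matters, here is a plain-Python normal-approximation (Wald) confidence interval for a reported accuracy; the accuracy and sample-size figures below are made up:

```python
import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """95% normal-approximation (Wald) confidence interval for a
    reported accuracy, treating it as a binomial proportion over
    n test examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # standard error
    return accuracy - z * se, accuracy + z * se

# A model scoring 90% on a 100-example test set is far less certain
# than the same 90% measured on 10,000 examples.
print(accuracy_confidence_interval(0.90, 100))     # roughly (0.84, 0.96)
print(accuracy_confidence_interval(0.90, 10_000))  # roughly (0.894, 0.906)
```

The interval shrinks with the square root of the test-set size, which is why "90% accuracy" alone says little without knowing n.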

Machine learning is a branch of statistics, and blindly applying algorithms to data is disastrous for a company (and can cause legal issues down the road): applying the wrong algorithm, not understanding an algorithm's biases or limitations, and misinterpreting its output all carry real costs. Statistics is what lets you understand those limitations, compare models, and so on.

Once you understand exactly how these ML algorithms work, you can go back to statistical theory and study why they work.

In short, statistics helps you understand why these ML models work and how to improve them.

Optimization, in turn, helps you understand how they work.

Probability and statistics matter in machine learning for several reasons, but one of the most important is that they help justify the choices made by many models.

If you have a single dataset:

  • If you want to compare the performance of two classifiers, you can try McNemar's test, the k-fold cross-validated paired t-test, the 5x2cv paired t-test, or the 5x2cv paired F-test (depending on your case).

  • For comparing multiple algorithms, try analysis of variance (ANOVA).
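For instance, McNemar's exact test needs only the two classifiers' per-example correctness on the same test set. A stdlib-only sketch (the labels and predictions below are made-up toy data; a library routine such as statsmodels' mcnemar would normally be used instead):

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test for two classifiers evaluated on
    the same examples. Returns (b, c, two-sided p-value), where
    b = A right / B wrong and c = A wrong / B right."""
    b = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa != y and pb == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree
    # two-sided exact binomial tail with success probability 0.5
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data: A is correct everywhere, B is wrong on 6 examples, so the
# disagreements all favour A.
y  = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
a  = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # A: all correct
b_ = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]   # B: wrong on 6 examples
print(mcnemar_exact(y, a, b_))  # → (6, 0, 0.03125)
```

A p-value of about 0.031 would let you reject, at the 5% level, the hypothesis that the two classifiers err at the same rate.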

If you have multiple datasets:

  • If you want to compare two algorithms, try the Wilcoxon signed-rank test.

  • If there are multiple algorithms, try the Kruskal-Wallis test.
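As a sketch of what the two-algorithm comparison looks like in practice, here is a hand-rolled Wilcoxon signed-rank test using the normal approximation (for real work a library routine such as scipy.stats.wilcoxon is preferable, and the per-dataset accuracies below are invented):

```python
import math

def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test, normal approximation.
    scores_a / scores_b hold one score per dataset for each algorithm."""
    diffs = [x - y for x, y in zip(scores_a, scores_b) if x != y]  # drop ties
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(v):  # average rank for tied absolute differences
        idxs = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
        return sum(idxs) / len(idxs)

    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)  # test statistic
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Accuracies of two algorithms on 8 datasets (A beats B on every one).
acc_a = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90]
acc_b = [0.88, 0.87, 0.90, 0.86, 0.91, 0.85, 0.90, 0.89]
print(wilcoxon_signed_rank(acc_a, acc_b))
```

Because A wins on every dataset, the statistic hits its maximum (36 for n = 8) and the p-value lands well below 0.05.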

In the simplest case, suppose you are comparing two classifiers A and B: A gives 90% accuracy and B gives 87%. Which is better? It may be that A and B perform the same statistically, but you will never know unless you run a test. That is how important statistics is for understanding and interpreting results obtained from ML methods.

You can implement everything without knowing any statistics, but you won't understand much (why these things work) without a solid grounding in statistics. I would say that statistics, linear algebra, and optimization are the three most important prerequisites for machine learning.

Important topics from statistics:

  • Probability: Bayes' theorem and conditional probability are must-know
  • Analysis of variance (ANOVA)
  • Wilcoxon signed-rank test
  • Kruskal-Wallis test
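As a quick refresher on the first item, Bayes' theorem in code (the disease-screening numbers are the classic textbook example, chosen purely for illustration):

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) = P(E | H) P(H) / P(E), where the evidence term is
    P(E) = P(E | H) P(H) + P(E | not H) P(not H)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# 1% base rate, a test that is 95% sensitive with a 5% false-positive
# rate: a positive result still means only a ~16% chance of disease.
print(round(bayes_posterior(0.01, 0.95, 0.05), 3))  # → 0.161
```

This base-rate effect is exactly why conditional probability is must-know: a naive reading of "95% accurate test" gets the posterior badly wrong.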

https://www.quora.com/How-important-it-is-to-learn-statistics-for-a-career-in-machine-learning
https://www.quora.com/How-Important-is-it-to-master-statistics-to-understand-machine-learning
https://www.quora.com/What-are-the-roles-of-probability-and-statistics-in-machine-learning-How-important-are-they-What-are-their-applications-in-machine-learning
https://www.quora.com/What-is-the-best-way-to-compare-accuracy-of-multiple-classifiers-and-why/answer/Shehroz-Khan-2