Vector


Remember when we were doing movie similarities and movie recommendations earlier in the book? An example of a vector might be a list of all the movies that a given user rated. There are two types of vector: dense and sparse. There are many, many movies in the world, and a dense vector stores a value for every single movie, whether or not the user actually watched it. So, for example, if a user watched Toy Story, I would store their rating for Toy Story, but if they didn't watch Star Wars, I would still have to store a placeholder saying there is no number for Star Wars. With a dense vector, we end up taking up space for all of those missing data points. A sparse vector only stores the data that actually exists, so it doesn't waste any memory on missing data. It's a more compact way of representing a vector internally, but it introduces some complexity during processing. So, it's a good way to save memory if you know your vectors will have a lot of missing data in them.
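To make the movie example concrete, here is a minimal sketch using MLlib's vector factory methods; the five-movie catalog, the indices, and the ratings are all hypothetical values chosen for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical catalog of 5 movies; this user rated only the movies at
// index 0 (say, Toy Story: 5.0) and index 3 (some other title: 3.0).

// Dense form: one slot per movie, with placeholders (0.0) for unrated ones.
val denseRatings = Vectors.dense(5.0, 0.0, 0.0, 3.0, 0.0)

// Sparse form: total size, indices of the ratings that exist, their values.
val sparseRatings = Vectors.sparse(5, Array(0, 3), Array(5.0, 3.0))
```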

From

The fundamental data types in this library are vectors and matrices. Vectors are local, and may be dense or sparse. Dense vectors are stored as an array of values. Sparse vectors are stored as two arrays: the first array stores the indices of the non-zero values, and the second array stores the actual values. All element values are stored as doubles, and indices are stored as integers starting from zero. Understanding these fundamental structures goes a long way toward effective use of the library, and it should help in coding up any new algorithm from scratch. Let us see some example code for a better understanding of these two vector representations:
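The example code the excerpt promises isn't included in these notes, so here is a minimal sketch along the lines it describes; the element values are assumptions chosen for illustration:

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Dense: a plain array of doubles, one slot per element.
val dense = Vectors.dense(1.0, 0.0, 0.0, 4.0)

// Sparse: total size, then the two parallel arrays the excerpt describes:
// the non-zero indices first, the corresponding values second.
val sparse = Vectors.sparse(4, Array(0, 3), Array(1.0, 4.0))

// Peeking at the two internal arrays of the sparse representation.
val sv = sparse.asInstanceOf[SparseVector]
println(sv.indices.mkString(", "))  // 0, 3
println(sv.values.mkString(", "))   // 1.0, 4.0
```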

I didn't see a distributed vector here. Need to find out why there is no distributed vector, given that the excerpt above says vectors are local.

From

Local vector: Local vectors are stored on a single machine, with values stored as doubles and the index of each value stored as an integer starting from zero. Local vectors can be further subclassified into dense and sparse vectors. A double array representing its values is called a dense vector, while a sparse vector contains two arrays representing the indices and the values separately.

For example, the vector (2.0, 0.0, 5.0, 3.0) can be represented as:

a dense vector: [2.0, 0.0, 5.0, 3.0]

a sparse vector: [4, (0, 2, 3), (2.0, 5.0, 3.0)], i.e., size 4, non-zero indices (0, 2, 3), and the values (2.0, 5.0, 3.0) at those indices.

From Spark In Action

Local vectors in Spark are implemented with two classes, DenseVector and SparseVector, which implement a common interface called Vector, ensuring both implementations support exactly the same set of operations. The main class for creating vectors is the Vectors class, with its dense and sparse methods. The dense method has two versions: it can take all elements as inline arguments, or it can take an array of elements. For the sparse method, you need to specify a vector size, an array with indices, and an array with values. The following three vectors (dv1, dv2, and sv) contain the same elements and hence represent the same mathematical vector:
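The code itself wasn't copied into these notes; reconstructed from the description (the element values are my own, chosen for illustration), it would look something like this:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense, passing the elements as inline arguments.
val dv1: Vector = Vectors.dense(5.0, 6.0, 7.0, 8.0)

// Dense, passing an array of elements.
val dv2: Vector = Vectors.dense(Array(5.0, 6.0, 7.0, 8.0))

// Sparse: vector size, an array with indices, and an array with values.
// Since every index 0..3 carries a value, sv holds the same elements
// as dv1 and dv2 despite the sparse representation.
val sv: Vector = Vectors.sparse(4, Array(0, 1, 2, 3), Array(5.0, 6.0, 7.0, 8.0))
```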

From

https://medium.com/@rickynguyen/getting-started-with-spark-day-5-36b62a6d13bf
https://www.safaribooksonline.com/library/view/spark-for-data/9781785885655/ch06s02.html