Intern Diaries: Linear Algebra

Tanya Gupta · Analytics Vidhya · Jan 6, 2021 · 11 min read

In today’s blog, we will get into the nitty-gritty of things and take a trip down memory lane to revisit our “beloved” high school mathematics topic of linear algebra. We will also discuss why studying linear algebra is essential for getting started with machine learning.


What is Linear Algebra?

It is a branch of mathematics dealing with vectors and matrices and the operations defined on them. It is a form of continuous mathematics and thus doesn’t receive as much attention from many computer scientists as discrete mathematics does.

However, despite this, one cannot ignore its importance in machine learning. It is mostly used to represent data in a form that a machine learning algorithm can understand and work on. We will discuss more applications of linear algebra as we move on with this blog.

Basic Terminology

Here, we will first learn about some basic terms which are commonly used in linear algebra:

  • Vector: It is an array of numbers which can be either in a column or a row format. This means it can be represented either vertically or horizontally, respectively.

You can also say that a row vector can be obtained from a column vector by performing the transpose operation on it, and vice-versa.

A vector has both direction and magnitude. When we talk of a vector, its order (number of elements) gives us the dimensionality of our dataset. For example,

[ x , y , z ] => this row vector can be represented as an arrow with its tail at origin and head pointing towards a coordinate (x,y,z) in 3-D space. In other words, each element of this vector gives us a value along different coordinate axes.

  • Matrix: It is a 2-dimensional array of scalars/numbers. You can even say that it’s a collection of vectors. Each element of a matrix is located using a row index and a column index, as opposed to the single index used for vectors. A matrix of order n x m has n rows and m columns.
  • Tensor: It is a generalized term for vectors, matrices and arrays of dimensionality greater than 2. So here, a 1-D tensor represents a vector and a 2-D tensor represents a matrix.
  • Scalar: It just means a single “number” in linear algebra.
  • Trace: The sum of all diagonal elements of a square matrix.

Operations on Vectors

Addition of two or more vectors:

It is a really basic operation. For example, if there are two row vectors a and b,

a = [x1, y1, z1] and b = [x2, y2, z2], then the addition would be

a + b = [x1+x2, y1+y2, z1+z2]

So, we are just adding the elements at the same index location of each vector.
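As a quick illustration, here is a minimal NumPy sketch of this (the array values are made up for the example):

import numpy as np

# two row vectors with made-up values
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# element-wise addition: entries at the same index get added
print(a + b)  # [5 7 9]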

But why is it important to add vectors in the context of ML?

It is because datasets often end up having multiple features which are of a similar type. Adding the vectors of such features helps us create one vector which can then be used to predict the output feature. This helps us reduce the number of features in the dataset, making computations easier for our algorithm.

For example,

In a dataset for “housing price prediction” we find that there are columns like “1stFlrSF” (first floor square feet), “2ndFlrSF” (second floor square feet) and “GrLivArea” (above-ground living area). They all describe the area at different locations of the house.

But our aim is just to predict the price of a house, so we don’t need the above features separately. Instead, we could just add them up and store the sum under a new column, “TotalAreaSF”.

In this way, the number of features is reduced and this new feature still has the same relationship with the output feature as the original ones, i.e., the price of the house increases with an increase in the area of the house.

In this example, what we are doing is numeric grouping using sum operation which is a type of “feature engineering” technique.
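As a rough sketch of this feature-engineering step (assuming pandas and a hypothetical DataFrame df that already holds these columns, with made-up values):

import pandas as pd

# hypothetical housing data with the area columns mentioned above
df = pd.DataFrame({
    "1stFlrSF": [850, 1200],
    "2ndFlrSF": [850, 0],
    "GrLivArea": [1700, 1200],
})

# numeric grouping using a sum: combine the separate area features into one
df["TotalAreaSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["GrLivArea"]
print(df["TotalAreaSF"])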

Scalar Multiplication

It is nothing but: number x [each element in the given vector]. For example, if we have num = 2 and the vector [2, 1, 3], the scalar multiplication would result in,

[ 2*2 , 1*2 , 3*2] = [ 4 , 2 , 6 ]

Earlier we stated that the vector [x, y, z] represents an arrow with its tail at the origin and its head pointing towards the point (x, y, z) in 3-D space. So, if we imagine doing scalar multiplication on such a vector, all we are doing is elongating or shrinking that vector along its original direction. This can be illustrated with the following image:

The vector [ 1 ,2 ] (blue arrow) is multiplied by 5. The purple vector represents the elongated vector.

Below is a code snippet to create the above image:
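Here is a minimal matplotlib sketch along these lines (the colors and axis limits are my own choices):

import numpy as np
import matplotlib.pyplot as plt

v = np.array([1, 2])
scaled = 5 * v  # scalar multiplication elongates the vector

# draw both vectors as arrows starting at the origin
plt.quiver(0, 0, *scaled, angles="xy", scale_units="xy", scale=1, color="purple")
plt.quiver(0, 0, *v, angles="xy", scale_units="xy", scale=1, color="blue")
plt.xlim(-1, 6)
plt.ylim(-1, 11)
plt.grid(True)
plt.show()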

Magnitude of a vector

“Norm” is a term used for defining the magnitude or length of a vector. It can also be described as a function which maps vectors to non-negative values. It is denoted by the symbol ∥v∥, where v is a vector. There are 3 properties of a norm:

  • It is always a non-negative value.
  • Scalar multiplication of a vector scales the norm by the absolute value of the scalar, i.e., ∥c·v∥ = |c|·∥v∥.
  • Triangle inequality: if two vectors a and b add up to a vector c, then the magnitude of c is less than or equal to the sum of the magnitudes of a and b, i.e., ∥a + b∥ ≤ ∥a∥ + ∥b∥.

Kinds of Norms

The most generalized notation of the norm (also known as the Lᴾ norm) is given below:

∥x∥ₚ = ( |x₁|ᵖ + |x₂|ᵖ + … + |xₙ|ᵖ )^(1/p)

(Source: https://www.datacamp.com/community/tutorials/tutorial-machine-learning-basics-norms)

where,

|xᵢ| is the absolute value of the iᵗʰ element of the vector.

We take the absolute value because, if we didn’t, then when p is odd and xᵢ is negative our distances might end up negative, which goes against the properties of a norm.

Based on this generalized definition, we have different types of norm:

  • L² norm: (also known as Euclidean norm) which is obtained when p=2. It is the most commonly used norm and defines the Euclidean distance from origin to a point defined by the n-dimensional vector.
  • L¹ norm: It is also known as the “Manhattan distance” or “Taxicab geometry”. It just adds up the absolute values of all the elements in a vector. This is useful in scenarios where a grid is given and the grid lines represent streets. We can only walk along the “streets”, so finding the smallest path possible from point A to B is done using this norm. While the L² norm gives us the direct shortest path from A to B, it doesn’t apply in this scenario since we can’t walk outside of the grid lines.
(Source: en.wikipedia.org) The red, blue and yellow paths conform to the L¹ norm, while the green path is the direct path given by the L² norm.
  • L-inf norm(L-infinity norm): This norm gives the maximum absolute value from the vector. For example, in an error vector, it tells which element contributes more to the error.
  • L⁰ norm: It is not technically a norm since it doesn’t conform to the properties of a norm. It just gives us the number of non-zero elements in a vector. If we multiply the vector by any non-zero scalar a, the number of non-zero elements won’t change, so the value doesn’t scale with |a|, which violates the 2nd property of a norm (see the short NumPy sketch of these norms below).
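The following sketch uses np.linalg.norm; the example vector is arbitrary:

import numpy as np

x = np.array([3, -4, 0])

print(np.linalg.norm(x, 2))       # L2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, 1))       # L1 (Manhattan) norm: 7.0
print(np.linalg.norm(x, np.inf))  # L-infinity norm: largest absolute value, 4.0
print(np.count_nonzero(x))        # "L0": number of non-zero elements, 2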

Angles between two vectors

This can be found out using the dot product. Given two vectors v and w,

v = [v1, v2, v3, …, vn] and w = [w1, w2, w3, …, wn], the dot product would be

v•w = (v1*w1) + (v2*w2) + (v3*w3) + … + (vn*wn) = ∥v∥ ∥w∥ cosΘ

where, ∥v∥ and ∥w∥ are the magnitudes of the vectors v and w.

We could also write it as vᵀw, where vᵀ is the transpose of the vector v and v and w were originally column vectors. Needless to say, their dimensionality must be the same.

  • Orthogonality: If two vectors are perpendicular to each other, i.e., their dot product is 0, then they are “orthogonal” to each other.
  • Hyperplane: It is a subspace of (n-1) dimensions inside an n-dimensional space. So, if we are in 2-D, it is a line; if we are in 3-D, it is a plane. For a particular vector, if some line, plane or sub-space is perpendicular to it, then that line, plane or sub-space is known as its hyperplane. These are used as decision boundaries in classification problems. (A small dot-product sketch follows after this list.)
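The sketch below computes the dot product, the resulting angle, and an orthogonality check with NumPy; the vector values are arbitrary:

import numpy as np

v = np.array([1.0, 0.0, 1.0])
w = np.array([0.0, 1.0, 1.0])

dot = np.dot(v, w)  # same as v @ w
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.degrees(np.arccos(cos_theta))
print(dot, theta)   # 1.0 and 60.0 degrees

# orthogonality: the dot product of perpendicular vectors is 0
print(np.dot(np.array([1, 0]), np.array([0, 1])))  # 0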

A quick refresher for types of matrices:

  • Square matrix: has equal no. of rows and columns
  • Zero or null matrix: has all elements as 0
  • Diagonal matrix: all elements outside the main diagonal are 0 (only the diagonal elements may be non-zero).
  • Identity matrix: It is a square diagonal matrix with 1s on the diagonal which, when multiplied with another matrix, does not change the values of that matrix. So, we can say that matrix retains its “identity”.
  • Symmetric matrix: the transpose of the matrix is equal to the original matrix (Aᵀ = A).
  • Skew-symmetric matrix: the transpose of the matrix is equal to the original matrix multiplied by the scalar -1 (Aᵀ = -A). (A short NumPy check of a few of these types follows below.)
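The example matrix below is arbitrary and happens to be skew-symmetric:

import numpy as np

A = np.array([[ 0,  2, -3],
              [-2,  0,  4],
              [ 3, -4,  0]])

print(np.array_equal(A.T, A))   # symmetric? False
print(np.array_equal(A.T, -A))  # skew-symmetric? True
print(np.trace(A))              # trace: sum of diagonal elements, 0
print(np.eye(3))                # 3x3 identity matrix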

Matrix Operations

Sum of two matrices

We do element-wise addition of 2 matrices:
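For example, a minimal NumPy sketch (values are arbitrary):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# element-wise sum
print(A + B)  # [[ 6  8] [10 12]]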

Addition of vector and matrix:

Normally if we have a matrix of order n x m and a vector of order 1 x m, then in order to add them together, we would need to create a matrix of the same order out of our vector by repeating the row n times. But “broadcasting” creates a shortcut for us, in which the vector is added to every row implicitly.
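A small NumPy sketch of broadcasting a row vector over a matrix (values are arbitrary):

import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)
v = np.array([10, 20, 30])    # shape (3,), a single row

# the vector is implicitly added to every row of M
print(M + v)  # [[11 22 33] [14 25 36]]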

Scalar Multiplication

Just as with vectors, we can also multiply a matrix by a scalar.
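For instance (values are arbitrary):

import numpy as np

A = np.array([[1, 2], [3, 4]])
print(2 * A)  # every element is multiplied by 2: [[2 4] [6 8]]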

Multiplication of 2 Matrices

For two matrices to be compatible for multiplication, the number of columns of the first matrix must be equal to the number of rows of the second matrix. Mathematically, each entry of the product C = AB is defined as

Cᵢⱼ = (iᵗʰ row of A) • (jᵗʰ column of B) = Σₖ Aᵢₖ Bₖⱼ

Note that this is different from element-wise multiplication of two matrices of the same order, which is known as the Hadamard product and is denoted by A⊙B.

Note: matrix multiplication of A and B is not commutative, i.e., in general AB ≠ BA, but it is distributive [for example: A(B+C) = AB + AC] and associative [for example: A(BC) = (AB)C].
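Here is a NumPy sketch contrasting matrix multiplication with the element-wise (Hadamard) product (values are arbitrary):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)  # matrix product: [[19 22] [43 50]]
print(A * B)  # Hadamard (element-wise) product: [[ 5 12] [21 32]]
print(np.array_equal(A @ B, B @ A))  # False: not commutative in general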

Identity Matrix

Creating a 4x4 identity matrix using the NumPy library:
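A one-line sketch of this with NumPy:

import numpy as np

I = np.eye(4)  # 4x4 identity matrix
print(I)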

Inverse matrix

If for a matrix A we have AB = I, where I is the identity matrix, then we say that B is the inverse of A. Only square matrices can have an inverse, and even then, not all of them have one. Square matrices which do not have an inverse are known as “singular matrices”. Such matrices have a determinant value of 0.

np.linalg.solve() solves a linear matrix equation or a system of linear scalar equations.
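A minimal sketch with NumPy (the matrix values are arbitrary and chosen to be invertible):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

A_inv = np.linalg.inv(A)  # inverse: A @ A_inv is (up to rounding) the identity
print(A @ A_inv)

# solve A x = b directly, which is usually preferred over forming the inverse
x = np.linalg.solve(A, b)
print(x)  # [0.8 1.4]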

Transposing a Matrix

In this operation, we just interchange the rows and columns. In this way the shape of the matrix gets flipped: an n x m matrix becomes m x n.
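For example, with NumPy:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)

print(A.T)        # rows and columns interchanged
print(A.T.shape)  # (3, 2)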

Eigenvalues and Eigenvectors:

“Eigen decomposition” is a method of decomposing a square matrix into eigenvalues and eigenvectors.

AX = λX

where A is a matrix, X is our eigenvector and λ is a scalar known as the eigenvalue. With this definition we see that multiplying the vector X by A only changes its magnitude or scale, not its direction. This is a property of eigenvectors.
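A short NumPy sketch of this (the matrix values are arbitrary):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [2. 3.]
print(eigenvectors)  # each column is an eigenvector

# check AX = λX for the first eigenvalue/eigenvector pair
X = eigenvectors[:, 0]
print(np.allclose(A @ X, eigenvalues[0] * X))  # True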

Determinants

When we multiply a matrix and a vector, the vector gets transformed: it gets scaled and might also get rotated. Even in the case of eigenvalues and eigenvectors we saw that multiplying a vector by a matrix leads to the scaling of that vector. The scalar quantity hidden underneath our matrix, which tells us how much the transformation scales things, is what we call the determinant. Well, this is the intuitive idea.

Its formal definition is,

In linear algebra, the determinant is a scalar value that can be computed from the elements of a square matrix and encodes certain properties of the linear transformation described by the matrix. The determinant of a matrix A is denoted det(A), det A, or |A|.

So, for vectors lying on a 2-D plane, multiplying by a matrix might shear and scale a certain area. In 3-D, it might change the volume or configuration of a unit cube.

Anyways, enough with the intuitive idea… Now we will see how it is calculated.

For a 2 x 2 matrix with rows (a, b) and (c, d), it is defined as:

|A| = ad − bc

For a 3 x 3 matrix with rows (a, b, c), (d, e, f) and (g, h, i), it is defined as:

|A| = a(ei − fh) − b(di − fg) + c(dh − eg)

One interesting thing to notice about determinants is that if, upon multiplying with a matrix, a vector gets squashed onto one of the coordinate axes or even collapses into a point, information about the original vector is lost. You cannot bring back the original vector by multiplying with any matrix. This is a loss of invertibility. So a matrix with |A| = 0, in that sense, couldn’t be inverted back and thus doesn’t have an inverse.

Code snippet for calculating determinant using NumPy:
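A minimal version of such a snippet (the matrix values are arbitrary):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])

print(np.linalg.det(A))  # 1*4 - 2*3 = -2 (up to floating-point error)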

Linear Dependency:

It is the idea that a vector in a set can be represented as a linear combination of the other vectors in that set. In order to understand this concept we need to understand what a span is.

Span: for a given set of vectors, the span refers to all the points in space covered by all possible scalar multiplications and combinations of the vectors in that set. For a single vector, the span is just the straight line along which it slithers.

So if a vector a lies in the span of the set of vectors V, that means it is some combination of the vectors in V, in which some (or all) might have been multiplied by scalars.

Mathematically, the vectors v1, v2, …, vk are linearly dependent if there exist scalars a1, a2, …, ak, not all zero, such that

a1·v1 + a2·v2 + … + ak·vk = 0

where a1, a2, …, ak are scalars and v1, v2, …, vk are vectors.

This concept is important because linear dependency often leads to redundant features which offer no new information about our dataset.
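One way to check linear dependence in code is via the matrix rank; here is a sketch with arbitrary vectors, one of which is a scalar multiple of another:

import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # 2 * v1, so the set is linearly dependent
v3 = np.array([0, 1, 0])

# stack the vectors as rows of a matrix
V = np.vstack([v1, v2, v3])

# if the rank is smaller than the number of vectors, they are linearly dependent
print(np.linalg.matrix_rank(V))               # 2
print(np.linalg.matrix_rank(V) < V.shape[0])  # True -> dependent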

That’s it for today’s lesson. I hope you find this blog beneficial. Thank you for reading!!

