Intern Diaries: Mathematics for ML

Tanya Gupta · Published in Analytics Vidhya · 10 min read · Jan 27, 2021


Today’s blog will give you a warm-up for the mathematical concepts involved in machine learning. So, let’s relive those high-school days when you absolutely “adored” mathematics and wondered how learning this stuff would ever be useful in life.


Why do we need to learn maths?

Firstly, what people need to understand is that knowing the libraries required for machine learning algorithms is not enough. To improve your implementation and your grip on machine learning, you need to understand how an algorithm works, and that’s where mathematics comes in.

  • It helps you understand how certain factors, parameters and features affect the results the way they do.
  • It helps you choose the correct metrics to evaluate your implementation.
  • It helps in expressing the relationship between the response and predictor variables (well, at least in supervised learning) and analyzing how well your model “fits” the dataset.

In conclusion: high school mathematics wasn’t useless after all…

Now that we have taken our time to appreciate the beauty of mathematics, let’s get into the meat of the matter.

The four key areas to focus on for machine learning:

  • Linear Algebra
  • Statistics
  • Calculus
  • Probability

This blog will introduce you to calculus and probability. You can find my blogs for linear algebra and statistics below:

Multivariate Calculus

In multivariate calculus, instead of judging the effect of a single factor on our output variable, we look at multiple factors at once. Calculus helps in optimizing our implementations of machine learning models.

Most of the time we use differentiation in machine learning, for example during gradient descent (used to find the minimum of the loss function in linear and logistic regression).
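
To make this concrete, here is a minimal sketch of gradient descent on a toy quadratic loss. The loss function, learning rate, and variable names are illustrative choices of mine, not from any particular library:

```python
# Minimal gradient descent sketch on the toy loss L(w) = (w - 3)^2,
# whose minimum is at w = 3. Learning rate and start are arbitrary.

def loss(w):
    return (w - 3) ** 2

def grad(w):
    # derivative of the loss: dL/dw = 2*(w - 3)
    return 2 * (w - 3)

w = 0.0    # initial guess
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)   # step against the gradient to decrease the loss

print(round(w, 4))  # ~3.0, the minimizer of the loss
```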

Differentiation

It is the subfield of calculus that helps us find out how sensitive our output variable y is w.r.t. multiple factors x₁,x₂,…,xₙ, where y = f(x₁,x₂,…,xₙ). If multiple factors confuse you, then just understand that w.r.t. one factor, differentiation means,

(change in y)/(change in x) = ∆y/∆x where, y = f(x) = function of x

This ratio captures the amount of change in y (∆y) for a given change in x (∆x). For example, the change in distance w.r.t. the change in time gives us speed. In a way, speed is the derivative of distance w.r.t. time.

If the change in x is infinitesimally small (∆x → 0), then we call the above expression a derivative, and it is represented as dy/dx.

In other words, the derivative gives us the instantaneous change in y w.r.t. x, which is the slope of the tangent to the curve y = f(x) at the point x:

[Figure: the derivative depicted as the slope of the tangent at a point A; here ∆x = h, and for a derivative h tends to 0.]

In the case of multivariate differentiation, we look at the change in our function f(x₁,x₂,…,xₙ) w.r.t. any one of the factors x₁,x₂,…,xₙ by using partial derivatives.

Quick refresher for differentiation

We will first cover univariate calculus and later extend the ideas to multivariate calculus.

List of derivatives for different y

  • y = x, dy/dx = 1
  • y = constant, dy/dx = 0
  • y = xⁿ, dy/dx = n*xⁿ⁻¹
  • y = a*xⁿ, dy/dx = a*n*xⁿ⁻¹
  • y = eˣ, dy/dx = eˣ
  • y = log(x) (natural logarithm), dy/dx = 1/x
  • y = x⁻ⁿ, dy/dx = -n*x⁻⁽ⁿ⁺¹⁾
  • y = 2ˣ, dy/dx = 2ˣ(log 2)
  • y = cos(x), dy/dx = -sin(x)
  • y = sin(x), dy/dx = cos(x)
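
If you ever want to sanity-check an entry in this list, a quick finite-difference approximation of the ∆y/∆x definition works well. A small sketch using only the Python standard library:

```python
# Numerical sanity check of a few table entries above, using the
# definition dy/dx ≈ (f(x + h) - f(x)) / h for a small h.
import math

h = 1e-6
x = 2.0

checks = {
    "d(x^3)/dx = 3x^2":    (lambda t: t**3, 3 * x**2),
    "d(e^x)/dx = e^x":     (math.exp, math.exp(x)),
    "d(log x)/dx = 1/x":   (math.log, 1 / x),
    "d(sin x)/dx = cos x": (math.sin, math.cos(x)),
}

for name, (f, expected) in checks.items():
    approx = (f(x + h) - f(x)) / h
    print(f"{name}: approx={approx:.5f}, expected={expected:.5f}")
```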

Sum rule for differentiation:

If y = first + second then, dy/dx = d(first)/dx + d(second)/dx

For example let, y = x³ + 2x⁵ then,

dy/dx = d(x³ + 2x⁵)/dx = d(x³)/dx + d(2x⁵)/dx

dy/dx = 3x² + 2*5*x⁴

dy/dx = 3x² + 10x⁴

Product rule for differentiation:

If y = first*second then dy/dx = (second)*d(first)/dx + (first)*d(second)/dx

For example, let y = x*cos(x). Then,

dy/dx = cos(x)*d(x)/dx + x*d(cos(x))/dx = cos(x) + x*(-sin(x)) = cos(x) - x*sin(x)

Chain rule for differentiation:

Let’s understand with an example of y = sin(x²).

  1. dy/dx = d(sin(x²))/dx = d(sin(z))/dx, where z = x².
  2. First, we find the derivative of sin(z) w.r.t. z, which is cos(z).
  3. Then, we differentiate z (= x²) w.r.t. x, which gives us 2*x.
  4. dy/dx = product of steps (2) and (3) = 2*x*cos(z) = 2*x*cos(x²).

In short, d(sin x² )/dx = d(sin z)/dz * dz/dx

Another example:

  1. Let y = (sin x)². We substitute z = sin(x).
  2. We find the derivative of y (= z²) w.r.t. z, which is 2*z.
  3. We then find the derivative of z w.r.t. x, which gives us cos(x).
  4. dy/dx = dy/dz * dz/dx = 2*z*cos(x) = 2*sin(x)*cos(x).
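
Both chain-rule examples are easy to verify symbolically. A small sketch using the third-party sympy library:

```python
# Symbolic check of the two chain-rule examples (pip install sympy).
import sympy as sp

x = sp.symbols('x')

print(sp.diff(sp.sin(x**2), x))   # prints 2*x*cos(x**2)
print(sp.diff(sp.sin(x)**2, x))   # prints 2*sin(x)*cos(x)
```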

Extending the above concepts to multivariate calculus we get the concept of partial differentiation.

Partial differentiation and how to find it

Partial differentiation can be understood with the example of housing prices. House prices are affected by a lot of factors like the number of rooms, number of floors, total area, etc.

But if we want to find out how much a particular factor (say, total area) affects our house price (given everything else stays the same), then we look at the change in price w.r.t. the change in total area. The effect of a single factor can be found using partial differentiation.

It’s very simple to calculate if you know ordinary differentiation. Say we differentiate f(x₁,x₂,…,xₙ) w.r.t. x₁. All you have to do is treat all the other independent variables as constants (everything except x₁ in this case) and differentiate only the terms containing x₁. For example,

y = 3*(x₁)² + 4*x₂

so the partial derivative w.r.t. x₁ would be:

∂y/∂x₁ = ∂(3*(x₁)² + 4*x₂)/∂x₁

= ∂(3*(x₁)²)/∂x₁ + ∂(4*x₂)/∂x₁ = 6*x₁

The first term gives 6*x₁ and the second term gives 0, since x₂ is treated as a constant. Let’s take another example.

Let y = 3*(x₁)² + 4*x₂*x₁. Then,

∂y/∂x₁ = ∂(3*(x₁)² + 4*x₂*x₁)/∂x₁ = ∂(3*(x₁)²)/∂x₁ + ∂(4*x₂*x₁)/∂x₁

∂y/∂x₁ = 6*x₁ + 4*x₂
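
The same partial derivatives can be checked symbolically. A small sketch with sympy (a third-party library):

```python
# Partial derivatives of y = 3*x1**2 + 4*x2*x1 via sympy,
# matching the hand calculation above.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
y = 3 * x1**2 + 4 * x2 * x1

print(sp.diff(y, x1))   # 6*x1 + 4*x2
print(sp.diff(y, x2))   # 4*x1 (x1 is treated as a constant this time)
```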

Probability

It is the branch of mathematics that deals with calculating the likelihood of an event occurring. This likelihood is a value between 0 and 1, where 0 means the event never occurs and 1 means the event is certain to occur.

Probability of an event occurring = (number of desired outcomes)/(total number of outcomes)

The probabilities of all possible outcomes sum up to 1.

For example, when we toss a fair coin,

the probability of getting a head = the probability of getting a tail = 1/2.

A coin has only 2 sides, so the total number of outcomes = 2, while the number of desired outcomes in each case = 1, which gives us 1/2 when plugging these values into the formula above.
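
You can also see this empirically: simulate many tosses and the observed fraction of heads settles near 1/2. A minimal sketch in Python:

```python
# Empirical check: the fraction of heads over many fair-coin tosses
# settles near the theoretical probability of 1/2.
import random

random.seed(42)                     # fixed seed for reproducibility
tosses = 100_000
heads = sum(random.random() < 0.5 for _ in range(tosses))
print(heads / tosses)               # close to 0.5
```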

Important Terms

  • Random Experiment: a process whose outcome is “random” or uncertain.
  • Sample Space: the set of all possible outcomes of that random experiment.
  • Event: a set of one or more outcomes of a given experiment.

Types of events

  • Joint events (A∩B): two events (A and B) can have common outcomes. For example, we roll a die. Let event A be getting an even number and event B be getting a number smaller than 5. The outcomes for event A are {2, 4, 6} and those for B are {1, 2, 3, 4}, so A∩B = {2, 4}.
  • Disjoint events: they have no common outcomes. Taking the die example, if event A is getting an odd number and B is getting an even number, then A∩B = ∅.
  • Independent events: the outcome of the first event doesn’t affect the outcome of the second event, and vice versa. The difference from mutually exclusive (disjoint) events is that mutually exclusive events can’t take place at the same time.

For example, on tossing a coin, we can’t get a head and a tail at the same time…


As for independent events, even if they do occur at the same time, their respective outcomes don’t affect each other.

Types of Probabilities

  • Marginal probability: the probability of a single event with no strings attached, i.e., we don’t care about the outcome of any other event. For example, the probability of drawing a king from a deck of cards = 4/52 = 1/13.
  • Joint probability: the probability of two events occurring simultaneously. For example, when rolling a die we want the probability of getting an even number that is less than 5. The two events here are getting an even number (event A) and getting a number < 5 (event B). So, P(A∩B) = 2/6 = 1/3.
  • Conditional probability: the probability of an event given that some other event has already occurred. For example, the probability of a kid playing outside given that it is raining. Here event A is “playing outside” and B is “raining”. Conditional probability is denoted as P(A|B), or P(“playing outside”|“raining”) for our example.
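
All three types can be computed for the die example by simply counting outcomes. A small sketch in Python, with event definitions mirroring the examples above:

```python
# Marginal, joint and conditional probability for one roll of a fair die,
# computed by counting outcomes in the sample space.
sample_space = {1, 2, 3, 4, 5, 6}
A = {o for o in sample_space if o % 2 == 0}   # event A: even number
B = {o for o in sample_space if o < 5}        # event B: number < 5

def p(event):
    return len(event) / len(sample_space)

print(p(A))             # marginal     P(A)    = 3/6 = 0.5
print(p(A & B))         # joint        P(A∩B)  = 2/6 ≈ 0.333
print(p(A & B) / p(B))  # conditional  P(A|B)  = (2/6)/(4/6) = 0.5
```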

Taking a detour to learn conditional probability:

For any two events A and B (with P(B) > 0), the conditional probability is

P(A|B) = P(A∩B)/P(B)

If A and B are independent events, then P(A∩B) = P(A)*P(B), so the condition drops out:

P(A|B) = P(A)*P(B)/P(B) = P(A)

Bayes Theorem

It gives us a way of finding conditional probability with the following formula:

P(A|B) = P(B|A)*P(A)/P(B), where

  • P(B|A) is known as the likelihood
  • P(A) is known as the prior and P(A|B) as the posterior
  • P(B) is known as the evidence (often it is not given to us directly)

When it is not given, the evidence can be calculated using the law of total probability:

P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
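
Here is a small worked sketch of Bayes’ theorem on the rain example; all the input probabilities below are made-up numbers, purely for illustration:

```python
# Bayes' theorem on the rain example: A = "kid plays outside",
# B = "it is raining". All input probabilities are assumed values.
p_A = 0.6              # prior: kid plays outside on a random day (assumed)
p_B_given_A = 0.1      # likelihood: rain, given kid is outside (assumed)
p_B_given_not_A = 0.4  # rain, given kid stays in (assumed)

# evidence via the total-probability formula above
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

p_A_given_B = p_B_given_A * p_A / p_B   # posterior
print(round(p_A_given_B, 3))            # 0.273
```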

Probability Density Functions(PDF)

A probability density function describes how probability is distributed over the values of a continuous random variable, so probabilities can be read off as areas under its curve. A PDF has the following properties:

  • The total area under the curve = 1.
  • The random variable takes continuous values.
  • The probability that the random variable assumes a value between x₁ and x₂ is the area under the curve bounded by x₁ and x₂. This is found by integrating the PDF: P(x₁ ≤ X ≤ x₂) = ∫ f(x) dx, taken from x = x₁ to x = x₂.
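
As a quick sketch, the area interpretation can be checked numerically for a standard normal PDF using the third-party scipy library:

```python
# Area under a PDF: for the standard normal distribution,
# P(x1 <= X <= x2) is the integral of the pdf from x1 to x2.
# scipy is a third-party library (pip install scipy).
from scipy.integrate import quad
from scipy.stats import norm

area, _ = quad(norm.pdf, -1, 1)   # integrate pdf from x1 = -1 to x2 = 1
print(area)                        # ≈ 0.6827

total, _ = quad(norm.pdf, -float('inf'), float('inf'))
print(total)                       # ≈ 1.0, total area under the curve
```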

Binomial distribution

It gives the probability of a given number of successes when an experiment with two outcomes (success or failure) is repeated multiple times. The probability of exactly x successes in n independent trials, each with success probability p, is P(X = x) = ⁿCₓ·pˣ·(1-p)ⁿ⁻ˣ.

For instance, if a student has taken 10 mock tests, what is the probability that the student passes exactly 8 of them?

For this we have x = 8, p = 0.5 (assuming a student is equally likely to pass or fail each test), and n = 10. We plug these values into the formula:

P = ⁿCₓ .pˣ.(1-p)ⁿ⁻ˣ = ¹⁰C₈.(0.5)⁸.(1–0.5)² = 45*(0.5)¹⁰ = 0.0439
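
The same calculation in Python, using math.comb for ⁿCₓ (available from Python 3.8):

```python
# Reproducing the mock-test example with the binomial formula;
# math.comb(n, x) computes nCx.
import math

n, x, p = 10, 8, 0.5
prob = math.comb(n, x) * p**x * (1 - p)**(n - x)
print(round(prob, 4))   # 0.0439, matching the hand calculation
```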

Conditions for this distribution:

  • The number of trials is finite and fixed.
  • The trials must be independent.
  • The probability of success must be the same for each and every trial.

Details about the normal and standard normal distribution curves are given in my statistics blog.

Central Limit Theorem

It states that,

The sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough.

In other words, if we take samples of a large enough size from our population, then the sample means (the mean of each sample) will be approximately normally distributed. Increasing the sample size makes the distribution of sample means look more and more like a normal distribution.

As a rule of thumb, it is said that the sample size must be at least 30 for CLT to work.

What are its practical implications?

No matter what shape your population’s distribution takes, the distribution of sample means will be nearly a Gaussian curve (given the sample-size rule above).

This is very useful for learning about the nature of the population without accessing each and every data point in it.

The only other condition the CLT has, besides the sample size, is that the underlying distribution must have a well-defined mean. The Cauchy distribution is one example where the mean is undefined, so the CLT does not apply.

Let’s take a real-world example:

Suppose you are conducting a survey to determine the income of people throughout a state. However, we all know it ain’t feasible to ask each and every person in the state about their income. This is where we can apply the CLT.

So, what we can do is take a few samples from some cities in the state; each sample is a group of people from one of the cities. We can then calculate the mean of each sample and plot the distribution of those means. The curve will resemble a normal distribution, so we can use its properties to estimate the population figures we need.
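
Here is a small simulation of that idea; the exponential population and all the parameters are arbitrary choices for illustration:

```python
# A small CLT demo: incomes drawn from a skewed exponential distribution,
# yet the means of size-30 samples are approximately normally distributed.
# numpy is a third-party library; all parameters here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=50_000, size=1_000_000)  # skewed "incomes"

sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(np.mean(sample_means))  # close to the population mean (~50,000)
print(np.std(sample_means))   # close to population std / sqrt(30)
```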

Applications of Probability:

  • It helps us optimize our models.
  • Loss functions can be framed in terms of probability (e.g., likelihood-based losses), which further helps us classify data points correctly.

*****************************************************************

That concludes our blog. I hope it helped you understand these mathematical concepts (even if only a little). Thank you for reading, and have a wonderful day 😄!!!


Tanya Gupta (Analytics Vidhya)
Currently a CSE undergrad at Panjab University. I enjoy learning new stuff and listening to music.