What is Gradient Descent For Machine Learning


    In our day-to-day lives we optimize variables based on our personal decisions without even recognizing the process consciously. We are constantly using optimization techniques all day long: choosing a shorter route to work in order to minimize traffic woes, fitting in a quick walk around the campus during a snack break, or scheduling a cab in advance to reach the airport on time.

    Optimization is the ultimate goal, whether you are dealing with actual events in real life or creating a technology-based product. Optimization is at the heart of most of the statistical and machine learning techniques which are widely used in data science. To gain more knowledge and skills on data science and machine learning, join the certification course now.

    Accuracy is the word we are most concerned with when dealing with problems related to machine learning and artificial intelligence. High error rates cannot be tolerated when dealing with real-world problems, and accuracy should not be compromised.

    Optimization for Machine Learning

    Let us consider the case of self-driving cars. The model fitted in the car detects any obstacles that come in the way and takes appropriate action, which can be slowing down the speed, pulling on the brakes and so on. Keep in mind that there is no human in the car to operate or override the actions taken by the self-driving car. In such a scenario, suppose the model is not accurate: it will not be able to detect other cars or pedestrians and may end up crashing, putting several lives at risk.

    This is where we need optimization algorithms to evaluate our model and judge whether it is performing according to our needs or not. The evaluation can be made easy by calculating the cost function (which we will look at in detail later in this article). It is basically a mapping function that tells us about the difference between the desired output and what our model is computing. We can accordingly correct the model and avoid any kind of undesired behaviour.

    Optimization may be defined as the process by which an optimum is achieved. It is all about designing an optimal output for your problem with the use of the resources available. However, optimization in machine learning is slightly different. In most cases, we are aware of the data, its shape and size, which also helps us know the areas we need to improve. But in machine learning we do not know what the new data will look like; this is where optimization comes into its own. Optimization techniques are performed on the training data, and then the validation data set is used to check the performance.

    There are a lot of advanced applications of optimization which are widely used in airway routing, market basket analysis, face recognition and so on. Machine learning algorithms such as linear regression, KNN and neural networks depend heavily on optimization techniques. Here, we are going to look into one such popular optimization technique called gradient descent.

    Gradient descent is an optimization algorithm which is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms, such as coefficients in Linear Regression and weights in Neural Networks.

    Let us relate gradient descent with a real-life analogy for better understanding. Think of a valley you would like to descend when you are blind-folded. Any sane human will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).

    Gradient Descent in Machine Learning:- Valley, Slope

    Similarly, let us consider another analogy. Suppose you have a ball and you place it on an inclined plane (at position A). As per laws, it will start rolling until it travels to a gentle plane where it will be stationary (at position B as shown in the figure below).

    Gradient Descent in Machine Learning:- Ball placed on an inclined plane

    This is exactly what happens in gradient descent. The inclined and/or irregular surface is the cost function when it is plotted, and the role of gradient descent is to provide the direction and the velocity (learning rate) of the movement in order to attain the minimum of the function, i.e. the point where the cost is minimum.

    The graphical representation of Gradient Descent in Machine Learning

    The primary goal of machine learning algorithms is always to build a model, which is basically a hypothesis that can be used to estimate Y based on X. Let us consider an example of a model based on certain housing data which comprises the sale price of the house, the size of the house, and so on. Suppose we want to predict the price of a house based on its size. It is clearly a regression problem: given some inputs, we would like to predict a continuous output.

    The hypothesis is usually presented as

    hθ(x) = θ0 + θ1x

    where the theta values are the parameters.

    Let us look into some examples and visualize the hypothesis:

    θ0 = 1.5, θ1 = 0

    This yields h(x) = 1.5 + 0x. 0x means no slope, and y will always be the constant 1.5. This looks like:

    bar graph of hypothesis with no slope

    Now let us consider

    θ0 = 1, θ1 = 0.5

    which gives h(x) = 1 + 0.5x. This looks like:

    Bar graph of hypothesis with slope
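    To make the hypothesis concrete, here is a minimal Python sketch (the helper name and the sample x values are illustrative, not from the original article) that evaluates h(x) for the two parameter settings above:

    import numpy as np

    def hypothesis(theta0, theta1, x):
        # h(x) = theta0 + theta1 * x
        return theta0 + theta1 * x

    x = np.array([0, 1, 2, 3])
    print(hypothesis(1.5, 0.0, x))   # [1.5 1.5 1.5 1.5], a flat line at 1.5
    print(hypothesis(1.0, 0.5, x))   # [1.  1.5 2.  2.5], a line with slope 0.5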

    The objective in the case of gradient descent is to find a line of best fit for some given inputs, or X values, and their corresponding outputs, or Y values. A cost function is defined as “a function that maps an event or values of one or more variables onto a real number intuitively representing some ‘cost’ associated with the event.”

    With a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions according to the new set of inputs.

    The machine learning process

    The error is the difference between the known, actual output and the model’s prediction.

    This relates to the idea of a Cost function or Loss function.

    A cost function / loss function tells us “how good” our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradient; the slope of this curve tells us how to update our parameters to make the model more accurate.

    It is always the primary goal of any machine learning algorithm to minimize the cost function. Minimizing the cost function also results in a lower error between the predicted values and the actual values, which indicates that the algorithm has learned well.

    In its simplest form, the cost function looks like Y = X². In a Cartesian coordinate system, this is the equation of a parabola and can be graphically represented as:

    Parabola

    Now, in order to minimize the function above, we need to find the value of X that produces the lowest value of Y (the red dot in the figure). In lower dimensions (like 2D in this case) it is easy to locate the minimum, but that is not the case in higher dimensions. For such cases, we need to use the gradient descent algorithm to locate the minimum.

    Now a function is required that measures how badly the model’s parameters fit the dataset, so that we can minimize it. The most common choice is the mean squared error. It measures the average squared difference between the estimated values (the predictions) and the actual values (the dataset).

    Mean squared error: MSE = (1/m) Σ (ŷ(i) − y(i))²    (sum over i = 1 to m, where ŷ(i) is the prediction for the ith example and y(i) is its actual value)

    It turns out we can adjust the equation a little, dividing by 2m instead of m, to make the calculations further down the track a little simpler (the factor of 2 cancels when we take the derivative).

    Now a question may arise: why do we take the squared differences and not simply the absolute differences? Because the squared differences make it easier to derive a regression line. Indeed, to find that line we need to compute the first derivative of the cost function, and it is much harder to compute the derivative of absolute values than of squared values. Also, the squared differences amplify the error distance, making the bad predictions more pronounced than the good ones.

    The equation looks like –

    J(θ0, θ1) = (1/2m) Σ (hθ(x(i)) − y(i))²    (sum over i = 1 to m)

    Let us apply this cost function to the following data:

    Data Set before applying the cost function.

    Here we will calculate the cost for a few values of theta and plot the cost function by hand. Since the data passes through (0, 0), we fix θ0 = 0 and look only at a single parameter, θ1. Also, let us refer to the cost function as J(ϴ) from now on.

    When the value of ϴ is 1, J(1) is 0, and the corresponding hypothesis is a straight line which fits the data perfectly. Now let us try ϴ = 0.5.

    J(0.5)

    The MSE function gives us a value of 0.58. Let’s plot both our values so far:

    J(1) = 0

    J(0.5) = 0.58

    Plot of J(ϴ) with J(1) and J(0.5)

    Let us go ahead and calculate some more values of J(ϴ).

    Cost function applied data set graph-3

    Now if we join the dots carefully, we will get –

    Visualizing the cost function J(ϴ)

    As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1 as shown by the orange line in the above figure.

    Using a trial and error method, we minimized J(ϴ). We did this by trying out a lot of values and visualizing the results. Gradient descent does the same thing in a much better way: it changes the theta values, or parameters, until it descends to the minimum value.

    You may refer below for the Python code to find out cost function:
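    A minimal sketch of such a cost function is shown below. It assumes the three data points (1, 1), (2, 2) and (3, 3) from the figure above, which are consistent with the values J(1) = 0 and J(0.5) ≈ 0.58 computed earlier.

    import numpy as np

    # Toy data set (assumed from the figure): points on a straight line through the origin.
    X = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])

    def cost(theta1):
        # Cost J(theta1) = (1 / 2m) * sum((h(x) - y)^2) for the hypothesis h(x) = theta1 * x
        m = len(X)
        predictions = theta1 * X
        return (1.0 / (2 * m)) * np.sum((predictions - y) ** 2)

    for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
        print(f"J({t}) = {cost(t):.2f}")   # prints J(0.5) = 0.58 and J(1.0) = 0.00, among others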

    Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

    Repeat until convergence {
        θj := θj − α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1, updated simultaneously)
    }

    where α (alpha) is the learning rate, which determines how rapidly we want to move towards the minimum. If the value of α is too large, we can overshoot the minimum.

    Big Learning Rate vs Small Learning Rate

    The derivative, which refers to the slope of the function, is calculated. Here we calculate the partial derivative of the cost function. It tells us the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration.

    The partial derivative of the cost function which we need to calculate:

    ∂/∂θj J(θ0, θ1) = (1/m) Σ (hθ(x(i)) − y(i)) xj(i)    (sum over i = 1 to m)

    Once we know the direction from the derivative, we can update the coefficient values. Now you need to specify a learning rate parameter which will control how much the coefficients can change on each update.

    coefficient = coefficient - (alpha * delta)

    This process is repeated until the cost of the coefficients is 0.0, or close enough to zero.

    This turns out to be:

    θ0 := θ0 − α (1/m) Σ (hθ(x(i)) − y(i))
    θ1 := θ1 − α (1/m) Σ (hθ(x(i)) − y(i)) x(i)

    (From Andrew Ng’s machine learning course)

    Which gives us linear regression!

    Linear regression
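    As an illustration, here is a short sketch of this batch update for θ0 and θ1 on the toy data used earlier; the learning rate and the number of iterations are arbitrary choices, not values from the article.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    m = len(X)

    theta0, theta1 = 0.0, 0.0   # initialize both parameters to zero
    alpha = 0.1                 # learning rate

    for _ in range(1000):
        error = (theta0 + theta1 * X) - y
        grad0 = (1.0 / m) * np.sum(error)        # partial derivative with respect to theta0
        grad1 = (1.0 / m) * np.sum(error * X)    # partial derivative with respect to theta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1)   # approaches theta0 ≈ 0 and theta1 ≈ 1, the line that fits the data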

    Types of Gradient Descent Algorithms

    Gradient descent variants’ trajectory towards the minimum

    1. Batch Gradient Descent: In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. This gets computationally expensive if the number of training examples is large; in that case batch gradient descent is not preferred, and stochastic gradient descent or mini-batch gradient descent is used instead.

    Algorithm for batch gradient descent:

    Let hθ(x) be the hypothesis for linear regression, and let Σ represent the sum over all training examples from i = 1 to m. Then the cost function is given by:

    J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

    Repeat {
        θj := θj − (α/m) Σ (hθ(x(i)) − y(i)) xj(i)    (for every j = 0 … n)
    }

    where xj(i) represents the jth feature of the ith training example. So if m is very large, every single update becomes very expensive, because the derivative term has to be summed over all m examples before the parameters can move.
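    A vectorized sketch of one such batch update in Python/NumPy; the design matrix X below is assumed to already contain a leading column of ones for θ0.

    import numpy as np

    def batch_gradient_step(theta, X, y, alpha):
        # One batch gradient descent update: the gradient uses all m training examples at once.
        m = len(y)
        gradient = (1.0 / m) * X.T @ (X @ theta - y)
        return theta - alpha * gradient

    # Example: a column of ones for theta0 plus a single feature.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    theta = np.zeros(2)
    for _ in range(1000):
        theta = batch_gradient_step(theta, X, y, alpha=0.1)
    print(theta)   # approaches [0, 1]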

    2. Stochastic Gradient Descent: The word stochastic relates to a system or process that is linked with random probability. Therefore, in stochastic gradient descent (SGD), samples are selected at random for each iteration instead of using the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent; stochastic gradient descent instead uses only a single sample, i.e. a batch size of one, to perform each iteration. The data is randomly shuffled and one sample is selected for each iteration. The parameters are thus updated after every single example has been processed, which makes SGD faster than batch gradient descent.

    Stochastic gradient descent

    Algorithm for stochastic gradient descent:

  • Firstly, shuffle the data set randomly in order to train the parameters evenly for each type of data.
  • As mentioned above, it takes into consideration one example per iteration.

    Hence, let (x(i), y(i)) be a training example. Its cost and the overall training cost are:

    Cost(θ, (x(i), y(i))) = (1/2) (hθ(x(i)) − y(i))²

    Jtrain(θ) = (1/m) Σ Cost(θ, (x(i), y(i)))    (sum over i = 1 to m)

    Repeat {
        For i = 1 to m {
            θj := θj − α (hθ(x(i)) − y(i)) xj(i)    (for every j = 0 … n)
        }
    }
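    A minimal Python/NumPy sketch of this inner loop (the design matrix X is assumed to contain a leading column of ones; the learning rate alpha is illustrative):

    import numpy as np

    def sgd_epoch(theta, X, y, alpha):
        # One pass of stochastic gradient descent: update after every single example.
        m = len(y)
        for i in np.random.permutation(m):        # randomly shuffle the examples
            error = X[i] @ theta - y[i]           # error on a single training example
            theta = theta - alpha * error * X[i]  # immediate parameter update
        return theta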

    3. Mini Batch gradient descent: This type of gradient descent is considered to be faster than both batch gradient descent and stochastic gradient descent. Even if the number of training examples is large, it processes them in batches in one go. Also, the number of iterations needed is smaller, in spite of working with larger training samples.

    Mini Batch gradient descent graph in Machine Learning

    Algorithm for mini-batch gradient descent:

    Let b be the number of examples in one batch, where b < m. Assume b = 10 and m = 100.
    The batch size can be adjusted. It is generally kept as a power of 2, because some hardware, such as GPUs, achieves better run times with common batch sizes such as powers of 2.

    Repeat {
        For i = 1, 11, 21, ….., 91
            Let Σ be the summation from k = i to i + 9.
            θj := θj − (α/10) Σ (hθ(x(k)) − y(k)) xj(k)    (for every j = 0 … n)
    }
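    A corresponding Python/NumPy sketch for mini-batch gradient descent, assuming a batch size b of 10 as in the example above:

    import numpy as np

    def minibatch_epoch(theta, X, y, alpha, b=10):
        # One pass of mini-batch gradient descent with batch size b.
        m = len(y)
        indices = np.random.permutation(m)        # shuffle, then walk through the data in chunks of b
        for start in range(0, m, b):
            batch = indices[start:start + b]
            error = X[batch] @ theta - y[batch]
            gradient = X[batch].T @ error / len(batch)
            theta = theta - alpha * gradient
        return theta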

    For Batch Gradient Descent, the algorithm traces a straight line towards the minimum. If the cost function is convex, then it converges to a global minimum and if the cost function is not convex, then it converges to a local minimum. The learning rate is typically held constant over here.

    Convergence trends in different variants of Gradient Descent in Machine Learning

    For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps fluctuating around the global minimum instead of converging. In order to converge, the learning rate needs to be decreased slowly.

    There are many cases where gradient descent fails to perform well. There are mainly three kinds of challenges: data challenges, gradient challenges, and implementation challenges.

    Data challenges

  • The arrangement of data sometimes leads to challenges. If the data is arranged in such a way that it poses a non-convex optimization problem, it becomes difficult to perform optimization using gradient descent. Gradient descent works best for problems that are posed as well-defined convex optimization problems.
  • During the optimization of a convex optimization problem, you will come across several minimal points. The lowest among all the points is called the global minimum, and the other points are called local minima. You will have to make sure you reach the global minimum and avoid the local minima.
  • There is also the saddle point problem. This is a situation where the gradient is zero but the point is not an optimum. It cannot easily be avoided and is still an active area of research.

    Gradient challenges

  • While using gradient descent, if the execution is not proper, it can lead to problems like vanishing or exploding gradients. These happen when the gradient is either too small or too large, with the result that the model does not converge.

    Implementation challenges

  • Insufficient memory can result in the failure of the network. A lot of neural network practitioners do not pay attention to this, but it is very important to look at the resource utilization of the network.
  • It is also important to keep track of things like floating point considerations and hardware/software prerequisites.

    Let us look at some of the most commonly used gradient descent algorithms and how they are implemented.

    One of the simplest forms of the gradient descent technique is vanilla gradient descent. Here, vanilla means pure / without any adulteration. Its main feature is that small steps are taken towards the minimum by taking the gradient of the cost function.

    The pseudocode for the same is mentioned below.
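    A representative sketch of the update in Python (the function and variable names are illustrative):

    def vanilla_update(parameters, gradients, learning_rate):
        # Take a small step against the gradient of the cost function.
        update = learning_rate * gradients
        return parameters - update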

    If you look at the update, the parameters are changed by taking the gradient of the cost with respect to the parameters and multiplying it by the learning rate, which indicates how quickly we should move towards the minimum. The learning rate is a hyper-parameter, and you should be careful while choosing its value.

    Vanilla Gradient Descent Graph in Machine Learning

    In gradient descent with momentum, we adjust the algorithm in such a manner that it is aware of the previous step before taking the next one.

    The pseudocode for the same is mentioned below.
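    A representative sketch of the momentum update (the names and the default momentum constant of 0.9 are illustrative; the velocity starts at zero):

    def momentum_update(parameters, gradients, velocity, learning_rate, momentum=0.9):
        # The velocity remembers the previous update and is damped by the momentum constant.
        velocity = momentum * velocity - learning_rate * gradients
        return parameters + velocity, velocity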

    Here, the update is the same as that of vanilla gradient descent, but we introduce a new term called velocity, which takes the previous update into account together with a constant called momentum.

    Gradient descent with momentum update

    ADAGRAD (Adaptive Gradient Algorithm) uses an adaptive technique for updating the learning rate. In this algorithm, the learning rate is changed on the basis of how the gradient has been changing over all the previous iterations.

    The pseudocode for the same is mentioned below.
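    A representative sketch of the ADAGRAD update (the names and the default epsilon are illustrative):

    import numpy as np

    def adagrad_update(parameters, gradients, grad_accumulator, learning_rate, epsilon=1e-8):
        # Accumulate the squared gradients from all previous iterations,
        # then scale the learning rate down for parameters that have already moved a lot.
        grad_accumulator = grad_accumulator + gradients ** 2
        adapted_lr = learning_rate / (np.sqrt(grad_accumulator) + epsilon)
        return parameters - adapted_lr * gradients, grad_accumulator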

    In the above code, epsilon is a constant which is used to keep the rate of change of learning rate in check.

    ADAM is another adaptive technique which builds on ADAGRAD and further reduces its downsides. In simple words, you can consider it to be ADAGRAD + momentum.

    The pseudocode for the same is mentioned below.
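    A representative sketch of the ADAM update (the names, the bias-correction step and the default constants are illustrative of the standard formulation):

    import numpy as np

    def adam_update(parameters, gradients, m, v, t, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
        # m: running mean of the gradients (the momentum part); v: running mean of the squared gradients (the ADAGRAD part).
        m = beta1 * m + (1 - beta1) * gradients
        v = beta2 * v + (1 - beta2) * gradients ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction; t is the iteration number, starting at 1
        v_hat = v / (1 - beta2 ** t)
        parameters = parameters - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        return parameters, m, v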

    Here beta1 and beta2 are constants that keep the changes in the gradient and the learning rate in check.

    In this section you will learn about some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: It is suggested to collect and plot the cost values calculated by the algorithm for each iteration. It helps you keep track of the descent. For a well-performing gradient descent run, the cost always decreases in each iteration. If you see no decrease, reduce the learning rate.
  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Keep trying different values to check which works best for your algorithm.
  • Rescale Inputs: Try to achieve a range such as [0, 1] or [-1, 1] by rescaling all the input variables. The algorithm reaches the minimum cost faster if the shape of the cost function is not distorted or skewed.
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates will give you a better idea of the learning trend for the algorithm.

    Now that we have gone through all the elements related to gradient descent, let us implement gradient descent in Python. A simple gradient descent algorithm is as follows:

    1. Obtain a function to minimize, f(x).
    2. Initialize a value x from which you want to start the descent or optimization.
    3. Specify a learning rate, which determines how big a step to descend by, or how quickly you want to converge to the minimum value.
    4. Find the derivative of the function at that value of x (the direction of descent).
    5. Proceed to descend by that derivative multiplied by the learning rate.
    6. Update the value of x with the new value descended to.
    7. Check your stop condition to see whether to stop.
    8. If the condition is satisfied, stop. If not, proceed to step 4 with the new x value and keep repeating the algorithm.

    Let us create an arbitrary loss function and try to find a local minimum value for that function by implementing a simple representation of gradient descent using Python.

    We will apply gradient descent to the function f(x) = x³ − 3x² + 5.
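    A short matplotlib sketch to visualize the function before running the descent (the plotting range is an arbitrary choice):

    import numpy as np
    import matplotlib.pyplot as plt

    f = lambda x: x**3 - 3 * x**2 + 5   # the function whose local minimum we want to find

    x = np.linspace(-0.5, 3.0, 200)
    plt.plot(x, f(x))
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.show()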

    Plot of f(x) = x³ − 3x² + 5

    Here, we can see that our minimum value should be around 2.0.
    Let us now use gradient descent to find a more precise value.
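    A minimal implementation is sketched below. The starting point, learning rate and stopping tolerance are assumed values, so the exact figures it prints may differ slightly from those quoted underneath.

    cur_x = 0.5             # starting point for the descent (assumed)
    rate = 0.05             # learning rate (assumed)
    precision = 0.001       # stop once the step size becomes smaller than this
    previous_step_size = 1.0
    max_iters = 10000
    iters = 0

    df = lambda x: 3 * x**2 - 6 * x   # derivative of f(x) = x^3 - 3x^2 + 5

    while previous_step_size > precision and iters < max_iters:
        prev_x = cur_x
        cur_x = cur_x - rate * df(prev_x)       # step in the direction of steepest descent
        previous_step_size = abs(cur_x - prev_x)
        iters += 1

    print("Local minimum occurs at:", cur_x)
    print("Number of steps:", iters)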

    Local minimum occurs at: 1.9980265135950486
    Number of steps: 25 

    Gradient Descent Machine Learning Graph

    Zoomed in Gradient Descent to Key Area in Machine Learning

    In this article, you have learned about gradient descent for machine learning, and we have tried to cover most of the important topics. To learn more about machine learning algorithms in-depth, click here. Let us summarize all that we have covered in this article:

  • Optimization is the heart and soul of machine learning.
  • Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

    If you are inspired by the opportunities provided by data science, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
