Gradient Descent

How Does Gradient Descent Work?

Start with an initial parameter set: \theta_0, \theta_1, \theta_2, …
Keep changing the parameters to reduce J(\theta_0, \theta_1, \theta_2, …),
until we hopefully end up at a minimum.
Declare convergence if J(\theta) decreases by less than 10^{-3} in one iteration.

Pros

  • Works well even with a large number of features.
  • Simple to implement.

Cons

  • Needs many iterations.
  • Slow

Algorithm

  • \alpha : learning rate

repeat until convergence {
\theta_i := \theta_i-\alpha\frac{\partial}{\partial\theta_i}J(\theta_0, \theta_1, \theta_2, …)
}

Note: all parameters should be updated simultaneously, i.e., compute every partial derivative before overwriting any \theta_i.
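Below is a minimal sketch of this loop in Python/NumPy. It assumes a linear-regression hypothesis with squared-error cost and uses the convergence test stated above; the function names are illustrative, not from any particular library.

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J(theta) for linear regression (assumed here)."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

def gradient_descent(X, y, theta, alpha=0.01, tol=1e-3, max_iters=10000):
    """Repeat theta_i := theta_i - alpha * dJ/dtheta_i until J decreases
    by less than tol in one iteration. X includes the bias column x_0 = 1."""
    m = len(y)
    prev_J = cost(theta, X, y)
    for _ in range(max_iters):
        # One vectorized step computes all partial derivatives at once,
        # which is exactly the required simultaneous update.
        grad = X.T @ (X @ theta - y) / m
        theta = theta - alpha * grad
        J = cost(theta, X, y)
        if prev_J - J < tol:  # declare convergence
            break
        prev_J = J
    return theta
```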

Regularization

See the regularization notes for details.

J(\theta)=\displaystyle\sum_{i=1}^{m}\mathrm{Cost}\big(h_\theta(x^{(i)}),\ y^{(i)}\big)+\lambda\sum_{j=1}^n{\theta^2_j}
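For the linear-regression case, the penalized cost might be computed as below; note that \theta_0 is left out of the penalty, matching the sum starting at j=1. The name lam stands in for \lambda, and the constant factor on the penalty varies between presentations.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus lambda * sum_{j>=1} theta_j^2.
    theta[0] (the bias term) is deliberately not penalized."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return residual @ residual / (2 * m) + penalty
```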

Gradient Checking (numerical gradient)

To identify/debug errors (usually in the back-propagation implementation of a neural network), we need to check whether the gradient is computed correctly. (Gradient checking should be turned off during learning, since it is slow.)

$\frac{\partial}{\partial\Theta_i}J(\Theta)\approx \frac{J(\Theta_i+\epsilon,\ \Theta_{rest})-J(\Theta_i-\epsilon,\ \Theta_{rest})}{2\epsilon}$

usually with \epsilon=10^{-4}
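A sketch of this two-sided difference in Python/NumPy, perturbing one component of \Theta at a time (the function name is illustrative):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dTheta_i by (J(..+eps) - J(..-eps)) / (2*eps)
    for each component i, holding the rest of theta fixed."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad

# Compare against the analytic (back-propagated) gradient: the two
# should agree to several decimal places if the implementation is right.
```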

Different Types of Gradient Descent

“Batch” Gradient Descent

Each step of gradient descent uses all the training samples.
Summing the error over every sample at each step can be computationally expensive.

Stochastic Gradient Descent

  1. Randomly shuffle the dataset
  2. Update \Theta for each single sample
    The path is noisy; it ends up wandering around the local optimum rather than converging to it exactly.

Mini-Batch Gradient Descent

  1. Randomly shuffle the dataset
  2. Update \Theta once per batch of b samples
    This sits somewhere between batch gradient descent and stochastic gradient descent.

*Advantage over stochastic gradient descent: with vectorization, the gradient over a batch of b samples can be computed faster than b separate single-sample updates.
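One sketch covers all three variants, since they differ only in batch size: b=1 gives stochastic gradient descent and b=m gives batch gradient descent. The linear-regression gradient and the function name are assumptions for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, alpha=0.01, b=32, epochs=10):
    """Mini-batch gradient descent: shuffle, then update theta once per
    batch of b samples. b=1 is stochastic GD; b=len(y) is batch GD."""
    m = len(y)
    for _ in range(epochs):
        perm = np.random.permutation(m)      # 1. randomly shuffle dataset
        X_shuf, y_shuf = X[perm], y[perm]
        for start in range(0, m, b):         # 2. update per b samples
            Xb = X_shuf[start:start + b]
            yb = y_shuf[start:start + b]
            # Vectorized gradient over the whole batch -- the speed
            # advantage over one-sample-at-a-time updates.
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)
            theta = theta - alpha * grad
    return theta
```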

Preprocessing

Feature Scaling

Make sure features are on a similar scale, so that a suitable learning rate is easier to choose and gradient descent converges faster.

Usually, get every feature into approximately the range -1\le x_i\le 1

Mean Normalization

Replace x_i with x_i-\mu_i so that features have approximately zero mean (do not apply this to x_0=1).
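A sketch combining both steps, scaling each feature by its range after subtracting the mean (dividing by the standard deviation is an equally common choice; the function name is illustrative):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize and scale each column into roughly [-1, 1]:
    x_i := (x_i - mu_i) / range_i. Do not include the bias column x_0 = 1."""
    mu = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)
    # Return mu and rng so new inputs can be scaled the same way later.
    return (X - mu) / rng, mu, rng
```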

If J(\theta) is increasing or oscillating

J(\theta) should decrease after every iteration.
If it does not, use a smaller \alpha.
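Plotting J(\theta) against the iteration number makes this diagnostic easy to read; the cost_history values below are stand-ins so the snippet runs on its own.

```python
import matplotlib.pyplot as plt

# J(theta) recorded at each iteration; replace with values from a real run.
cost_history = [10.0, 6.1, 4.0, 2.9, 2.3, 2.0, 1.9]

# A healthy curve decreases monotonically; if it rises or oscillates,
# try a smaller alpha (e.g., a third or a tenth of the current value).
plt.plot(cost_history)
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.show()
```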

Combine Features

Just combine existing features into new ones directly, e.g. combine a lot's frontage and depth into a single feature area = frontage × depth.

Polynomial Regression

Just create new features like x^2 and x^3; running linear regression on these features fits a polynomial in x.
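A sketch of building the polynomial feature columns (the function name is illustrative). Because x^2 and x^3 can have very different ranges, feature scaling matters even more here.

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Turn a single feature x into columns [x, x^2, ..., x^degree];
    plain linear regression on these columns is polynomial regression."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# Example: polynomial_features([1.0, 2.0], degree=3)
# -> [[1., 1., 1.], [2., 4., 8.]]
```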