How Gradient Descent Works
Start with an initial parameter set $\Theta$.
Keep changing the parameters to reduce the cost $J(\Theta)$,
until we hopefully end up at a minimum.
Declare convergence if $J(\Theta)$ decreases by less than some small threshold (e.g., $10^{-3}$) in one iteration.
Pros
- Works well even with a large number of features.
- Simple
Cons
- Needs many iterations.
- Slow
Algorithm
- $\alpha$: learning rate
repeat until convergence {
    $\Theta_j := \Theta_j - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta)$ (for every $j$)
}
Note: all parameters should be updated simultaneously (compute every partial derivative first, then apply all the updates).
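A minimal sketch of this loop in Python/NumPy, assuming the cost `J` and its gradient `grad_J` are supplied as callables (names and defaults here are illustrative, not a fixed API):

```python
import numpy as np

def gradient_descent(J, grad_J, theta0, alpha=0.01, eps=1e-3, max_iters=10000):
    """Minimal gradient descent loop.

    J       : cost function J(theta)
    grad_J  : gradient of J at theta
    alpha   : learning rate
    eps     : declare convergence when J decreases by less than eps in one iteration
    """
    theta = np.asarray(theta0, dtype=float)
    prev_cost = J(theta)
    for _ in range(max_iters):
        # compute the full gradient first, then update all parameters simultaneously
        theta = theta - alpha * grad_J(theta)
        cost = J(theta)
        if prev_cost - cost < eps:
            break
        prev_cost = cost
    return theta

# e.g. minimizing J(theta) = theta_0^2 + theta_1^2:
# gradient_descent(lambda t: t @ t, lambda t: 2 * t, [3.0, -4.0])
```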
Normalization
See regularization
Gradient Checking (numerical gradient)
To identify and debug errors (usually in back-propagation for neural networks), we need to check whether the gradient is computed correctly. (Gradient checking should not be on during actual learning, since it is expensive.)
$$
\frac{\partial}{\partial\Theta_i}J(\Theta)\approx \frac{J(\Theta_1,\dots,\Theta_i+\epsilon,\dots,\Theta_n)-J(\Theta_1,\dots,\Theta_i-\epsilon,\dots,\Theta_n)}{2\epsilon}
$$
usually with a small $\epsilon$, e.g. $\epsilon \approx 10^{-4}$.
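A sketch of the two-sided numerical gradient for comparison against the analytic gradient (function and parameter names are illustrative):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided numerical estimate of dJ/dTheta_i for every parameter i."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                               # J(..., Theta_i + eps, ...)
        minus[i] -= eps                              # J(..., Theta_i - eps, ...)
        grad[i] = (J(plus) - J(minus)) / (2 * eps)
    return grad

# Compare this against the analytic (e.g. back-propagated) gradient; the two
# should agree to several decimal places.  Turn the check off during training.
```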
Different Types of Gradient Descent
“Batch” Gradient Descent
Each step of gradient descent uses all the training samples.
It can be computationally expensive to sum the error over all samples at every step.
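As a concrete instance, a single batch step for linear regression with mean squared error looks like this (a sketch; `X`, `y` are assumed to be the full training matrix and targets):

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch step for linear regression with mean squared error."""
    m = X.shape[0]
    error = X @ theta - y        # residuals for ALL m training samples
    grad = (X.T @ error) / m     # gradient averaged over the whole training set
    return theta - alpha * grad
```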
Stochastic Gradient Descent
- Randomly shuffle dataset
- Update $\Theta$ using each single sample in turn
The path is noisy, so it ends up wandering around the local optimum rather than settling exactly on it.
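A sketch of one stochastic pass over the data (again with a linear-regression-style gradient for illustration):

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha):
    """One stochastic pass: shuffle, then update after every single sample."""
    for i in np.random.permutation(X.shape[0]):   # randomly shuffle the dataset
        error = X[i] @ theta - y[i]               # error on one sample only
        theta = theta - alpha * error * X[i]      # immediate update from that sample
    return theta
```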
Mini-Batch Gradient Descent
- Randomly shuffle dataset
- Update $\Theta$ for each mini-batch of $b$ samples
Somewhere in between batch gradient descent and stochastic gradient descent.
*Advantage over stochastic gradient descent: with vectorization, the computation for a whole mini-batch can be done faster than processing its samples one by one.
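A mini-batch sketch; `batch_size` is an illustrative choice (with `batch_size=1` this degenerates to stochastic gradient descent, with `batch_size=m` it is batch gradient descent):

```python
import numpy as np

def minibatch_epoch(theta, X, y, alpha, batch_size=32):
    """One pass of mini-batch gradient descent with vectorized per-batch updates."""
    m = X.shape[0]
    idx = np.random.permutation(m)                # randomly shuffle the dataset
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        error = X[batch] @ theta - y[batch]       # vectorized over the whole batch
        grad = (X[batch].T @ error) / len(batch)
        theta = theta - alpha * grad
    return theta
```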
Preprocessing
Feature Scaling
Make sure features are on a similar scale, so that an appropriate learning rate is easier to choose and gradient descent converges faster.
Usually, get every feature into approximately a $-1 \le x_i \le 1$ range.
Mean Normalization
Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply to $x_0 = 1$).
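A sketch combining both preprocessing steps, column-wise over the feature matrix (assumes the bias column of ones, if any, is excluded):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each feature column to roughly [-1, 1]."""
    mu = X.mean(axis=0)                      # per-feature mean
    spread = X.max(axis=0) - X.min(axis=0)   # per-feature range (std also works)
    return (X - mu) / spread, mu, spread     # keep mu/spread to transform new data
```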
$J(\Theta)$ should decrease after every iteration.
If $J(\Theta)$ is increasing or oscillating, use a smaller $\alpha$ (see the sketch below).
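One simple way to monitor this is to record $J(\Theta)$ each iteration and shrink $\alpha$ whenever the cost goes up (an illustrative sketch, not a prescribed schedule):

```python
def run_with_monitoring(J, grad_J, theta, alpha=0.1, iters=500):
    """Track J(theta) each iteration; if it ever increases, shrink alpha."""
    history = [J(theta)]
    for _ in range(iters):
        theta = theta - alpha * grad_J(theta)
        history.append(J(theta))
        if history[-1] > history[-2]:
            alpha *= 0.5        # cost went up: the learning rate was too large
    return theta, history
```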
Combine Features
Just combine features into new features directly.
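For example, two length features might be multiplied into a single area feature (the feature names and values here are hypothetical):

```python
import numpy as np

frontage = np.array([50.0, 30.0, 45.0])   # hypothetical feature: lot frontage
depth = np.array([100.0, 80.0, 60.0])     # hypothetical feature: lot depth
area = frontage * depth                   # single combined feature for the model
```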
Polynomial Regression
Just create new features like $x^2$ and $x^3$ to achieve polynomial regression.
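A sketch of building the polynomial features by hand (the values of `x` are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])         # original single feature (made-up values)
X_poly = np.column_stack([x, x**2, x**3])  # new features x, x^2, x^3
# Fit an ordinary linear model on X_poly to obtain a cubic fit in x.
# Remember to apply feature scaling afterwards: x, x^2 and x^3 have very
# different ranges.
```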