Neural network

Neural networks are about mimicking the brain. Our brain can learn almost anything with a single algorithm (the “one learning algorithm” hypothesis, supported by re-wiring experiments).

How our brain works

A neuron takes inputs from other neurons or sensors, processes them, and provides outputs to other neurons. The way we learn is by changing the weight each input carries toward the output.

Neuron Model: Logistic Unit

For each neuron with input x , the output is h_\theta(x)=sigmoid(\theta^Tx) . Here, sigmoid() is the activation function (it decides whether the neuron outputs a large value).

Artificial Neural Network

Activation Function (Hypothesis Function)

To be precise, an activation function is not the hypothesis function, since there is only one hypothesis function per network: the activation function of the output layer is the hypothesis function.

Vectorized:
a^{(i)}=sigmoid(\Theta^{(i-1)}*a^{(i-1)})

Remember to add the bias unit a^{(i)}_0=1

Cost Function

The cost function of a neural network is the logistic regression cost summed over every output unit.
For the regularization part, sum up all the parameters except the bias units.
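
Written out with the notation listed below ( K output units, L layers, regularization parameter \lambda ), this gives:

J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k log(h_\Theta(x^{(i)}))_k+(1-y^{(i)}_k)log(1-(h_\Theta(x^{(i)}))_k)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta^{(l)}_{ji})^2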

Optimization

Forward-propagation

Apply the activation function (adding the bias unit first) layer by layer.
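
A minimal NumPy sketch of this loop (the function and variable names are illustrative, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    # thetas: list of weight matrices, Theta^{(l)} of shape (s_{l+1}, s_l + 1)
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # add the bias unit a_0 = 1
        a = sigmoid(theta @ a)     # a^{(l+1)} = sigmoid(Theta^{(l)} * a^{(l)})
    return a                       # output-layer activation = h(x)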

Back-propagation

Back-propagation calculates how much each weight contributed to the error, i.e. the gradient of the cost function with respect to each parameter.
For output layer units:
\delta^{(L)}=a^{(L)}-y

For hidden layer units:
\delta^{(l)}=(\Theta^{(l)})^T\delta^{(l+1)}.*g'(z^{(l)})

where g'(z^{(l)})=a^{(l)}.*(1-a^{(l)})
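
A sketch of one back-propagation step for a single sample in a three-layer network (layer count and names are assumptions for illustration; `sigmoid` as above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(Theta1, Theta2, x, y):
    # Forward pass, keeping each layer's activation (with bias units)
    a1 = np.insert(x, 0, 1.0)
    a2 = np.insert(sigmoid(Theta1 @ a1), 0, 1.0)
    a3 = sigmoid(Theta2 @ a2)                  # h(x)
    # Backward pass
    d3 = a3 - y                                # delta^{(3)} = a^{(3)} - y
    d2 = (Theta2.T @ d3) * (a2 * (1 - a2))     # delta^{(2)}, using g'(z) = a .* (1 - a)
    d2 = d2[1:]                                # drop the bias unit's delta
    grad1 = np.outer(d2, a1)                   # dJ/dTheta1 for this sample (unregularized)
    grad2 = np.outer(d3, a2)                   # dJ/dTheta2
    return grad1, grad2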

Multiclass classification

  • One-vs-all: the output layer L has K units, each representing one class

Random Initialization

To make each neuron unit learn different features, we should initialize \Theta randomly (in [-\epsilon, \epsilon] ).
A good choice of \epsilon is \epsilon=\frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}
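
As a sketch (NumPy; the function name is illustrative):

import numpy as np

def rand_init(L_in, L_out):
    # Theta has shape (L_out, L_in + 1) to include the bias column
    epsilon = np.sqrt(6) / np.sqrt(L_in + L_out)
    return np.random.uniform(-epsilon, epsilon, size=(L_out, L_in + 1))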

Notations

  • a^{(i)}_j : “activation”(output) of unit j in layer i
  • \Theta^{(j)} : matrix storing all the parameters (weights) that the neurons in layer j+1 use to compute their outputs.
  • L : No. of layers
  • s_l : No. of units(not include bias unit) in layer l
  • s_L, K : No. of units in output layer

Normal Equation for Linear Regression

How Does the Normal Equation Work?

Solve for \theta analytically.

Pros

  • No need to choose \alpha
  • No need feature scaling
  • Don’t need to iterate

Cons

  • Slow when there are many features: computing (X^TX)^{-1} is O(n^3) (avoid when n>10^5 )

Algorithm

  • X : design matrix, holding all the input data (including x_0 )

Algorithm:
Set \frac{d}{d\theta_j}J(\theta) to 0 for each j and solve for \theta

Result:
\theta=(X^TX)^{-1}X^Ty
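
In NumPy this is a one-liner (the toy data here is made up for illustration):

import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])  # design matrix with x_0 = 1
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])            # y = 1 + 2x exactly

# pinv also covers the non-invertible X^T X case discussed below
theta = np.linalg.pinv(X.T @ X) @ X.T @ y          # -> approximately [1, 2]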

Non-invertible X^TX

In matlab/octave, pinv() will handle the case.

Causes of non-invertibility:

  • Redundant features (linearly dependent)
    • Delete one of the dependent features
  • Too many features (e.g. m\le n )
    • Delete some features
    • Regularization

Gradient Descent

How Does Gradient Descent Work?

Start with an initial parameter set: \theta_0, \theta_1, \theta_2, …
Keep changing the parameter set to reduce J(\theta_0, \theta_1, \theta_2, …)
Until we hopefully end up at a minimum.
Declare convergence if J(\theta) decreases by less than 10^{-3} in one iteration.

Pros

  • Works well even with a large number of features.
  • Simple

Cons

  • Needs many iterations.
  • Slow

Algorithm

  • \alpha : learning rate

repeat until convergence {
\theta_i = \theta_i-\alpha\frac{d}{d\theta_i}J(\theta_0, \theta_1, \theta_2, …)
}

Note: all parameters should be updated simultaneously.
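
A minimal NumPy sketch for linear regression (function name and defaults are assumptions; the vectorized assignment updates all parameters simultaneously):

import numpy as np

def gradient_descent(X, y, alpha=0.01, tol=1e-3, max_iters=10000):
    # X: (m, n+1) design matrix including x_0 = 1; squared-error cost
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        err = X @ theta - y                       # h(x) - y for all samples
        theta = theta - alpha * (X.T @ err) / m   # simultaneous update of every theta_j
        cost = ((X @ theta - y) ** 2).sum() / (2 * m)
        if prev_cost - cost < tol:                # declare convergence (see above)
            break
        prev_cost = cost
    return theta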

Regularization

See the regularization part of the Over-fitting section below; the regularized cost has the form

J(\theta)=\sum Cost(h_\theta(x), y)+\lambda\displaystyle\sum_{j=1}^n{\theta^2_j}

Gradient Checking (numerical gradient)

To identify/debug errors (usually in back-propagation for neural networks), we need to check whether the gradient is calculated correctly. (Should not be on when learning.)

\frac{d}{d\Theta_i}J(\Theta)\approx \frac{J(\Theta_i+\epsilon, \Theta_{rest})-J(\Theta_i-\epsilon, \Theta_{rest})}{2\epsilon}

Usually \epsilon=10^{-4} is used.
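
A sketch of the check in NumPy (the example cost function is made up):

import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    # Central-difference approximation of dJ/dtheta_i, one parameter at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = epsilon
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * epsilon)
    return grad

# J(theta) = theta_0^2 + 3*theta_1 has gradient (2*theta_0, 3)
print(numerical_gradient(lambda t: t[0]**2 + 3*t[1], np.array([1.0, 1.0])))  # ~[2, 3]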

Different Types of Gradient Descent

“Batch” Gradient Descent

Each step of gradient descent uses all the training samples.
It can be computationally expensive to sum the errors over all samples.

Stochastic Gradient Descent

  1. Randomly shuffle dataset
  2. Update \Theta for each sample
    The path is random; it ends up wandering around the local optimum.

Mini-Batch Gradient Descent

  1. Randomly shuffle dataset
  2. Update \Theta for every b samples
    Somewhere in between batch gradient descent and stochastic gradient descent.

Advantage over stochastic gradient descent: with vectorization, the calculation for each mini-batch can be done faster. A sketch follows.
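
A mini-batch sketch for linear regression (names and defaults are assumptions); note that b=1 recovers stochastic and b=m recovers batch gradient descent:

import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=100):
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = np.random.permutation(m)      # 1. randomly shuffle the dataset
        for start in range(0, m, b):
            idx = perm[start:start + b]      # 2. update theta for every b samples
            err = X[idx] @ theta - y[idx]
            theta -= alpha * (X[idx].T @ err) / len(idx)
    return theta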

Preprocessing

Feature Scaling

Make sure features are on a similar scale, so that we can choose a more proper learning rate.

Usually, get every feature into approximately the range -1\le x_i\le 1

Mean Normalization

Replace x_i with x_i-\mu_i to make features have approximately zero mean (do not apply to x_0=1 )
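
Both steps in NumPy (the feature values are made up; dividing by the range is one common way to land in roughly [-1, 1] ):

import numpy as np

X = np.array([[2104.0, 3.0],      # e.g. house size and number of bedrooms
              [1600.0, 3.0],
              [2400.0, 4.0]])

mu = X.mean(axis=0)                    # mean normalization: subtract the mean
rng = X.max(axis=0) - X.min(axis=0)    # feature scaling: divide by the range
X_scaled = (X - mu) / rng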

If J(\theta) is increasing or oscillating

J(\theta) should decrease after every iteration.
Use a smaller \alpha

Combine Features

Just combine existing features into new features directly (e.g. multiply frontage and depth into a single area feature).

Polynomial Regression

Just create new features like x^2 and x^3 to achieve polynomial regression.
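
For example (illustrative NumPy):

import numpy as np

x = np.array([1.0, 2.0, 3.0])               # original single feature
X_poly = np.column_stack([x, x**2, x**3])   # new features x, x^2, x^3
# Feed X_poly to plain linear regression; feature scaling matters here,
# since x^3 spans a much larger range than x.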

Introduction to Machine Learning

Online Courses for machine learning

The machine learning course from Stanford on Coursera is a famous and excellent resource for learning machine learning. If you want to start learning machine learning, even with no background at all, you should take a look at it.

What is Machine learning?

Machine learning is about using data to build a model that can describe and predict data.
Machine learning includes supervised learning and unsupervised learning.

  • Supervised learning is machine learning in which the training data are labeled.
  • Unsupervised learning is machine learning in which the training data are not labeled.

Regression, Supervised

Outputs are real numbers.

Minimization Algorithms:

Linear Regression

Check out Linear Regression

Classification, Supervised

Outputs are discrete (0, 1, 2, …).

  • Two-class classification
  • Multi-class classification
    • One-vs-all (one-vs-rest): train a classifier for each class

Clustering, Unsupervised

Outputs cluster centroids; clusters are formed by distance to the centroids

Logistic Regression

Check out Logistic Regression

Over-fitting

The model performs accurately on the training set, but does not generalize.

Solutions:

  1. Reduce number of features
    • Manually select which features to keep
    • Model selection algorithm
  2. Regularization
    • Penalize large parameters by adding \lambda\sum\theta^2_j to the cost function, where \lambda is the regularization parameter. (Do not penalize \theta_0 )

Notations

  • m : Number of training samples
  • x : “input” variables
  • y : “output” variables
  • (x^{(i)}, y^{(i)}) : i-th sample
  • x^{(i)}_j : j-th feature of the i-th sample
  • \theta : parameters of the model
  • h_\theta(x) : hypothesis function that takes input to estimate output.
  • J(\theta_0, \theta_1, …) : cost function that takes parameters and measures the error of the hypothesis function’s predictions.

Introduction to Entity Relationship Diagram

Introduction

The ERD was first proposed by Peter Chen in 1976.

Components

Entities

An entity type is represented by a rectangular box, with its name capitalized.

Attributes

Attributes are represented using ellipses.

Primary Key

Represented by underlining the name of the attribute.

Multi-valued Attribute

Represented using concentric ellipses

Derived Attribute

Represented using a dashed ellipse

Relationships

A relationship is represented using a diamond.

Cardinality

Cardinality describes how many entities are related on each side.

Can be:

  • One-to-One
  • One-to-Many
  • Many-to-Many

(Figure: ERD cardinality notation)

Participation

The participation constraint is represented by min:max, indicating the number of times an entity can participate in a relationship.

  • Total
    All entities must participate in at least one relationship.
  • Partial
    Not all entities need to participate.

Other Entities

Associative Entity

An entity used to represent a many-to-many relationship while storing additional attributes.

Represented as both an entity and a relationship: a diamond inside a rectangle.

Weak Entity

A weak entity is an entity that depends on another entity.

Represented using double rectangle.

Points of PHP

Syntax

PHP Code

<?php
...
?>

Note: all statements end with a semicolon ;
Note: everything except variable names is NOT case sensitive.

Comments

// and # are used for single-line comments.
/* ... */ can be used for multi-line comments.

Variables

Usage

$variable_name = "anything";

Note: PHP is a loosely typed language

Scope

  • local
  • global
  • static

Global

The global keyword can be used to make a variable global.
$GLOBALS['index'] can be used to access global variables.

Static

A static variable is local, but it won’t be deleted when the function completes.
The static keyword can be used to make a variable static.

Constant

define("SHI", "Hello world!"); // optional third argument: case_insensitive (default false)
echo SHI;

Data Type

var_dump($x); can be used to show the type of a variable.

String

Variables can be integrated directly into strings:

echo "<h1>$txt</h1>";

Note: echo and print can be used with or without parentheses

. can be used to concatenate strings.
.= to append.

Array

$a = array(1, 2, 3);
$a[0];

$b = array("a"=>1, "b"=>2, "c"=>3);
$b['a'];
foreach ($b as $bi => $bi_value) {
    echo "Key=" . $bi . ", Value=" . $bi_value;
}

Function

function fu($a, $b = 0) {
    ...
    return 0;
}

Object

class Bi {
    // Constructor (PHP 4-style constructors named after the class are removed in PHP 8)
    function __construct() {
        $this->shi = "shi";
    }
}

// Create an object
$b = new Bi();

echo $b->shi;

Condition Statement

If

if (...) {
    ...
} elseif (...) {
    ...
} else {
    ...
}

Switch

switch ($n) {
    case label1:
        ...
        break;
    ...
    default:
        ...
}

Loops

While

while (...) {
    ...
}

do {
    ...
} while (...);

For

for (...; ...; ...) {
    ...
}

foreach ($arr as $i) {
    ...
}

Skills

Redirection

header("Location: " . $url, true, $permanent ? 301 : 302);
die();

Differential Equations

First order

Separable

Separate the variables and integrate both sides.

Linear

Standard form:

\frac{dy}{dx}+p(x)y=q(x)

Multiply by the integrating factor r(x)=e^{\int p(x)\,dx} , so that the left side becomes a single derivative:

\frac{d}{dx}\left(r(x)y\right)=r(x)q(x)

Then integrate both sides and solve for y .
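
For example, solving \frac{dy}{dx}+y=x this way (an illustrative case with p(x)=1 , q(x)=x ):

r(x)=e^{\int 1\,dx}=e^x

\frac{d}{dx}\left(e^xy\right)=xe^x

e^xy=\int xe^x\,dx=(x-1)e^x+C

y=x-1+Ce^{-x}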

Second order

Linear Homogeneous

Standard form:

\frac{d^2}{dx^2}y+a\frac{d}{dx}y+by=0

Assume y=Ce^{mx} ( m may be real or complex). Substituting:

C(m^2+am+b)e^{mx}=0

Since C\ne0 and e^{mx}\ne0 ,

m^2+am+b=0

So the solution:

y=Ce^{m_1x}+De^{m_2x}

  • Two real m
  • Two complex m
  • One real m

Two complex m

As m_{1,2} are complex numbers, they can be rewritten as

m_{1,2}=a\pm ik

y=e^{ax}(Ce^{ikx}+De^{-ikx})

Using Euler’s formula,

y=e^{ax}(Acos kx+Bsin kx)

One real m

y=(A+Bx)e^{mx}

Coupled DE

  • \frac{dx}{dt}=f(x, y, t)
  • \frac{dy}{dt}=g(x, y, t)

Coupled Linear DE with constant coefficients

\frac{dx}{dt}=ax+by (1)
\frac{dy}{dt}=cx+dy (2)

Differentiate (1) with respect to t (making it a second-order derivative of x wrt t )

Substitute for \frac{dy}{dt} using (2), and for y using (1) (getting rid of y )

Solve the resulting DE for x
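
Carrying the steps out symbolically (assuming b\ne0 ):

\frac{d^2x}{dt^2}=a\frac{dx}{dt}+b\frac{dy}{dt}=a\frac{dx}{dt}+b(cx+dy)

From (1), y=\frac{1}{b}\left(\frac{dx}{dt}-ax\right) , so

\frac{d^2x}{dt^2}-(a+d)\frac{dx}{dt}+(ad-bc)x=0

which is a linear homogeneous second-order DE in x alone.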

PHP Form Handling

Get Form Data

$_GET["field"]
$_POST["field"]

Validation

Not empty

if (empty($_POST["field"])) {
    $err = "Field is required";
}

Letters and white space only

if (!preg_match("/^[a-zA-Z ]*$/", $field)) {
    $err = "Only letters and white space allowed";
}

E-mail address

if (!filter_var($email, FILTER_VALIDATE_EMAIL)) {
    $err = "Invalid email format";
}

URL

if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i", $url)) {
    $err = "Invalid URL";
}

Standard Integrals

Derivative        Expression                  Integral
-\frac{1}{x^2}    \frac{1}{x}                 ln|x|
sec^2 x           tan x                       -ln|cos x|
n/a               cot x                       ln|sin x|
n/a               \frac{1}{a^2+x^2}           \frac{1}{a}tan^{-1}(\frac{x}{a})
n/a               \frac{1}{\sqrt{a^2-x^2}}    sin^{-1}(\frac{x}{a})
n/a               \frac{1}{\sqrt{a^2+x^2}}    sinh^{-1}(\frac{x}{a})