Key Points of MongoDB

NoSQL database

MongoDB is a NoSQL database that is much more agile and flexible than relational databases.

NoSQL vs SQL:

  • Agile and flexible: no schema is required, and the schema is easy to change, which usually leads to better performance
  • Ability to scale out (distribute the load across multiple servers)
  • Handles large volumes of structured, semi-structured, and unstructured data

Ref: SQL vs NoSQL: The Differences

Terminologies

Document: MongoDB stores data as BSON documents
Collection: analogous to an SQL table; it stores a group of documents

CRUD

Create

SQL

INSERT INTO book (
`ISBN`, `title`, `author`
)
VALUES (
'9780992461256',
'Full Stack JavaScript',
'Colin Ihrig & Adam Bretz'
);

MongoDB

db.book.insert({
ISBN: "9780992461256",
title: "Full Stack JavaScript",
author: "Colin Ihrig & Adam Bretz"
});

Read

SQL

SELECT title FROM book
WHERE price > 10;

MongoDB

db.book.find(
{ price: { $gt: 10 } },
{ _id: 0, title: 1 }
);

Update

SQL

UPDATE book
SET price = 19.99
WHERE ISBN = '9780992461256';

MongoDB

db.book.update(
{ ISBN: '9780992461256' },
{ $set: { price: 19.99 } }
);

Delete

SQL

DELETE FROM book
WHERE publisher_id = 'SP001';

MongoDB

db.book.remove({
"publisher.name": "SitePoint"
});

Driving Test Key Points

Subject 1

Speed limits

When there are no speed-limit signs or markings:

  • No center line: 30/40
  • With center line: 50/70

Expressways

  • Three lanes
    • Maximum speed: decreases by 20 per lane
    • Minimum speed: maximum - 10
  • Two lanes
    • Maximum speed: decreases by 20 per lane
    • Minimum speed: maximum - 20

Low visibility

  • Below 200:
    • Speed limit 60, following distance 100, lights on
  • Below 100:
    • Speed limit 40, following distance 50, lights on plus hazard warning lights

Right of way

  • Intersection without signals: traffic from the right goes first
  • Opposing left turn vs. right turn: the left-turning vehicle goes first

Penalties

(There is no longer any notion of a traffic "infraction" or "rule-breaking"; violating the Road Traffic Safety Law is itself an illegal act.)

  • Imprisonment
  • Detention
  • Point deduction
  • Banned from retaking the test
    • Fraudulently obtained license: 1 year
    • License revoked: 2 years
    • License annulled: 3 years
    • Drunk driving: 5 years
    • Hit-and-run: lifetime

Perceptron

The perceptron is one of the earliest (and simplest) artificial neural network models.

Activation Function

f(x)=\begin{cases}1 & w^Tx+b>0\\0 & \text{otherwise}\end{cases}

Algorithm
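
The perceptron updates its weights only when it misclassifies an example: w \leftarrow w + \alpha(y-\hat{y})x and b \leftarrow b + \alpha(y-\hat{y}). A minimal NumPy sketch of this rule follows; the learning rate and the toy AND data are my own illustration, not from the notes.

import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Train a perceptron: w, b start at zero and are nudged on every mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi + b > 0 else 0   # the activation function above
            error = yi - pred                   # +1, 0, or -1
            w += lr * error * xi                # update only on mistakes
            b += lr * error
    return w, b

# Toy example: learn the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if w @ xi + b > 0 else 0 for xi in X])  # expect [0, 0, 0, 1]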

Collaborative Filtering

Collaborative filtering is used to learn both the parameters \Theta and the inputs x from y at the same time.

Cost Function

J(\Theta, x)=\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\Theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum(\Theta_k^{(j)})^2 + \frac{\lambda}{2}\sum(x_k^{(i)})^2

Note: r(i, j) indicates whether a rating y^{(i,j)} exists
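
A minimal NumPy sketch of this cost function; the matrices Y, R, X, and Theta are my own naming, with R playing the role of r(i, j).

import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Squared error over rated entries plus regularization of X and Theta."""
    predictions = X @ Theta.T            # (num_items, num_users)
    errors = (predictions - Y) * R       # only count entries where r(i, j) = 1
    J = 0.5 * np.sum(errors ** 2)
    J += (lam / 2) * (np.sum(Theta ** 2) + np.sum(X ** 2))
    return J

# Tiny example: 3 items, 2 users, 2 latent features
Y = np.array([[5, 0], [4, 0], [0, 3]], dtype=float)
R = (Y != 0).astype(float)
X = np.random.randn(3, 2)
Theta = np.random.randn(2, 2)
print(cofi_cost(X, Theta, Y, R, lam=1.0))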

Mean Normalization

To let the model predict the mean rating for new users, we normalize all y by subtracting the per-item mean; regularization then drives a new user's \Theta to zero, so the prediction falls back to the mean.
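
A small sketch of mean normalization over the rated entries only (variable names are mine):

import numpy as np

def mean_normalize(Y, R):
    """Subtract each item's mean rating (over rated entries only) from Y."""
    mu = np.sum(Y * R, axis=1) / np.maximum(np.sum(R, axis=1), 1)  # avoid divide-by-zero
    Y_norm = (Y - mu[:, None]) * R   # keep unrated entries at 0
    return Y_norm, mu

# A prediction for user j on item i is then (Theta_j . x_i) + mu_i,
# so a brand-new user (Theta_j = 0) falls back to the item mean.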

Anomaly Detection

We use anomaly detection instead of supervised learning when:

  • there are very few positive (anomalous) samples
  • anomalies look different from one another, so future anomalies may not resemble the ones already seen

So we model the negative (normal) samples only.

How it works

  • Model the probability of occurrence p(x) from the data
  • Flag an example as an anomaly if p(x)<\epsilon

Probability

We assume the features are normally distributed:
p(x)=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)
( \prod is the product symbol: it returns the product of all factors)

Where \mu_j = \frac{1}{m}\sum_{i=1}^mx_j^{(i)} , \sigma^2_j = \frac{1}{m}\sum_{i=1}^m(x_j^{(i)}-\mu_j)^2

By definition of normal distribution:
p(x_j;\mu_j,\sigma_j^2) = \frac{exp(-\frac{(x_j - \mu_j)^2}{2\sigma^2_j})}{\sqrt{2\pi}\sigma_j}
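
A minimal sketch of the whole procedure, fitting \mu_j and \sigma_j^2 as above and flagging examples with p(x)<\epsilon; the epsilon value and toy data are only for illustration.

import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance, as defined above."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def p(X, mu, sigma2):
    """Product of the per-feature univariate Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

X_train = np.random.randn(1000, 2)            # "normal" data
mu, sigma2 = estimate_gaussian(X_train)
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])   # the second point is far from the mass
anomalies = p(X_new, mu, sigma2) < 1e-4       # epsilon chosen for illustration
print(anomalies)                              # expect [False, True]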

When not normally distributed

We could try transforming the data, for example by taking \log(x_j+c) or x_j^c

Training/CV/Test

  • Training set: 60% of the normal data
  • CV set: 20% of the normal data, plus half of the anomalous samples
  • Test set: 20% of the normal data, plus the other half of the anomalous samples

Notice: as y is highly skewed, we should evaluate with precision/recall or the F_1 score rather than plain error (see Skewed Classes below)

Multivariate normal distribution

We could use a multivariate normal distribution to capture correlations between features; otherwise we would need to create such combined features manually. However, the multivariate normal model can be computationally expensive.

Also, it does not work when \Sigma is non-invertible (e.g. m\leq n or redundant features).

p(x;\mu, \Sigma) = \frac{exp(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))}{(2\pi)^{\frac{n}{2}}|\Sigma|^\frac{1}{2}}

Where \mu=\frac{1}{m}\sum_{i=1}^mx^{(i)} , \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T
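
A sketch of the multivariate version, fitting \mu and \Sigma as above (as noted, it fails when \Sigma is non-invertible; the usage example is illustrative):

import numpy as np

def fit_multivariate(X):
    """Fit mu and Sigma with the 1/m convention used above."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]        # (n, n) covariance matrix
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate normal density for each row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)                # raises an error if Sigma is singular
    exponent = -0.5 * np.sum(diff @ inv * diff, axis=1)
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(exponent) / norm

X = np.random.randn(500, 2)
mu, Sigma = fit_multivariate(X)
print(p_multivariate(np.array([[0.0, 0.0]]), mu, Sigma))  # about 1/(2*pi) for standard normal data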

Principal Component Analysis

Target

PCA finds a projection plane (subspace) by minimizing the projection error (which is not the same as the vertical error minimized in linear regression)

How to implement

  • Calculate the covariance matrix \Sigma by \Sigma = \frac{1}{m}X^TX
  • Get the projection matrix U (along with the singular values S ) by singular value decomposition of \Sigma
  • Get the new data set Z by Z = X U_{reduce} , where U_{reduce} is the first k columns of U
  • Recover the (approximate) original data by X_{approx} = Z U_{reduce}^T (see the sketch below)
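
A minimal NumPy sketch of these steps, assuming X has already been mean-normalized; k is the target dimension and the variable names are mine.

import numpy as np

def pca(X, k):
    """Project mean-normalized X onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / m                      # covariance matrix
    U, S, _ = np.linalg.svd(Sigma)           # columns of U are the principal directions
    U_reduce = U[:, :k]
    Z = X @ U_reduce                         # compressed representation
    X_approx = Z @ U_reduce.T                # reconstruction
    variance_retained = S[:k].sum() / S.sum()
    return Z, X_approx, variance_retained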

K-Means

The K-means algorithm is an unsupervised clustering algorithm.

How it works

  • Randomly initialize K cluster centroids \mu_k
  • Optimize the centroids until convergence:
    • Assign each training example x^{(i)} to its closest centroid (store the index as c^{(i)} )
    • Move each centroid to the mean of the examples assigned to it

Cost function

J=\sum{||x^{(i)}-\mu_{c^{(i)}}||^2}
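
A compact NumPy sketch of the loop and cost above (a fixed iteration count stands in for a convergence check; names are mine):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """K-means: assign points to the nearest centroid, then move centroids to the mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], K, replace=False)]   # initialize from data points
    for _ in range(iters):
        # Assignment step: c[i] is the index of the closest centroid for x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                              for k in range(K)])
    cost = np.sum((X - centroids[c]) ** 2)    # J from the cost function above
    return centroids, c, cost

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, c, cost = kmeans(X, K=2)
print(centroids, cost)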

Debugging

Random initialization

Initialize the centroids by picking random points from the training data.
The K-means cost function may converge to a bad local optimum, so we can try several random initializations and keep the clustering with the lowest cost.

Choosing K value

We can plot the cost against the K value. If there is an elbow, that K is likely a good choice; if not, we should choose K based on the data and the downstream purpose.

Support Vector Machine

Cost Function

J_\Theta=C\sum_{i=1}^m\left[y^{(i)}\,cost_1(\Theta^Tx^{(i)})+(1-y^{(i)})\,cost_0(\Theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\Theta_j^2

Differences with logistic cost function:

  • Replace the logistic cost terms ( -\log h_\Theta(x) and -\log(1-h_\Theta(x)) ) with cost_1() and cost_0()
  • No \frac{1}{m} factor
  • Instead of \lambda on the regularization term, C is applied to the error term (it can be treated like C=\frac{1}{\lambda} )

Large Margin Classifier

The SVM maximizes the margin so that the classifier tolerates variance in test data. It will also ignore outliers as long as C is not too large.

Math behind it: cost_1 and cost_0 require \Theta^Tx to be large in magnitude (roughly \geq 1 or \leq -1 ), while regularization requires ||\Theta|| to be small, so the projection of x onto \Theta must be large, and that projection length is the margin.

( \Theta^Tx = p\,||\Theta|| , where p is the signed length of the projection of x onto \Theta )

( ||\cdot|| denotes the length of a vector; the projection p , however, can be negative)

Kernel

Kernels are used to perform non-linear classification by replacing the raw features with features computed by a kernel function.

A kernel function computes the similarity between an example and another point (a landmark), ranging from 0 (dissimilar) to 1 (similar).

  • Gaussian kernel: f = \exp\left(-\frac{||x-l||^2}{2\sigma^2}\right) , where l is a landmark (see the sketch below)
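
A small sketch of the Gaussian kernel similarity (the sigma value and sample points are illustrative):

import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity between x and a landmark: 1 when identical, approaching 0 when far apart."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.0, 2.0])))   # 1.0
print(gaussian_kernel(x, np.array([5.0, 9.0])))   # close to 0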

Evaluating Model

Summary

  • High variance (over-fitting)
    • More training samples
    • Fewer features
    • Increase \lambda / [SVM] Decrease C
    • [Neural Network] Fewer layers, fewer units
    • [SVM] Increase \sigma
  • High bias (under-fitting)
    • Get additional features
    • Add polynomial features
    • Decrease \lambda / [SVM] Increase C
    • [Neural Network] More layers, more units
    • [SVM] Decrease \sigma

Testing

Testing is used to evaluate how well the model performs.

Split data into Training : Test = 7:3.

Cross Validation

Cross-validation uses a separate data set to choose the best model, so the test set can still give an unbiased evaluation of it.

  1. Split data into Training : Cross Validation : Test = 6:2:2 (see the sketch below).
  2. Use the Cross Validation set to try different models.
  3. Use the Test set to evaluate the best model selected in step 2.
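
A minimal sketch of the 6:2:2 split, shuffling before splitting (the function name is mine):

import numpy as np

def split_train_cv_test(X, y, seed=0):
    """Shuffle, then split into 60% train / 20% cross-validation / 20% test."""
    rng = np.random.default_rng(seed)
    m = len(X)
    idx = rng.permutation(m)
    train, cv, test = idx[: int(0.6 * m)], idx[int(0.6 * m): int(0.8 * m)], idx[int(0.8 * m):]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])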

Learning Curve

The learning curve plots (training set size, J_{training} ) and (training set size, J_{cv} ) so that we can tell:

  • High bias: high error on both curves; the curves flatten out quickly and converge to the same error, so more data will not help much.
  • High variance: high J_{cv} and low J_{training} , with a gap that shrinks as the training set grows, indicating that more data could help. (See the sketch below.)
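
A rough sketch of how such a curve can be produced, using a simple least-squares linear model on synthetic data (the model and data are purely illustrative):

import numpy as np

def mse(w, X, y):
    """Half mean squared error, used as J."""
    return np.mean((X @ w - y) ** 2) / 2

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(-3, 3, 200)]   # bias column + one feature
y = 1.5 * X[:, 1] + rng.normal(0, 0.5, 200)
X_train, y_train, X_cv, y_cv = X[:150], y[:150], X[150:], y[150:]

for m in range(10, 151, 20):
    w, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)   # fit on the first m examples
    print(m, mse(w, X_train[:m], y_train[:m]), mse(w, X_cv, y_cv))
# Plotting these two error columns against m gives the learning curve described above.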

Skewed Classes

When the distribution of the target y is highly skewed (e.g. 99% of y have the same value), a model that ignores its inputs can still achieve low error, so error alone cannot tell us whether the model has actually improved.

Precision/Recall

|  | Actual True | Actual False |
| --- | --- | --- |
| Predicted True | True Positive | False Positive |
| Predicted False | False Negative | True Negative |
(Note: y=1 is the rare class that we want to detect)

  • Precision=\frac{True Positive}{All Predicted True}
  • Recall=\frac{True Positive}{All Actual True}

We can trade off precision ( P ) and recall ( R ) by adjusting the decision threshold.

F_1 Score( F score) = 2\frac{PR}{P+R}
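
A small sketch computing precision, recall, and F_1 from predicted and actual labels (assuming y=1 is the rare positive class; the sample labels are illustrative):

import numpy as np

def precision_recall_f1(y_pred, y_true):
    """Precision, recall, and F1 as defined above, with zero-division guards."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
print(precision_recall_f1(y_pred, y_true))   # (0.667, 0.667, 0.667)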