ML - hw3

Author: 谢小帅 | Published 2019-01-17 13:51

1. Neural Networks

[Figure] Test loss and accuracy

[Figure] Epoch 10 loss and accuracy

2. K-Nearest Neighbor

(a) Try KNN with different K

Conclusion:

  • A small K, like K=1 above, means we use a small neighborhood to make a prediction. Only the training instances close or similar to the test instance matter. The approximation error decreases, but the estimation error increases. In other words, decreasing K makes the whole model more complex and prone to overfitting.
  • A large K, like K=100 above, means we use a large neighborhood to make a prediction. It can reduce the estimation error, but a large neighborhood means training instances far away from the test instance also matter and may lead to wrong predictions. In the extreme case where K equals the number of training instances, the classifier simply picks the most frequent label in the training set. That is a useless classifier: it outputs the same label for every test instance.
  • In short, K plays a critical role in the KNN classifier. We should choose a relatively reliable K when using this method.
  • In the extreme case K=400, shown in the picture below, all test instances are classified as the same label 0, so the result image shows no decision contours.
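
To see this effect concretely, here is a minimal sketch (assuming scikit-learn and a toy two-class dataset; the data and variable names are illustrative, not the assignment's) that fits KNN with K=1 and K=100 and prints training and test accuracy:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class dataset: 200 training and 200 test points
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (1, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print('K=%d  train acc=%.2f  test acc=%.2f'
          % (k, knn.score(X_tr, y_tr), knn.score(X_te, y_te)))

Typically K=1 memorizes the training set (near 100% training accuracy) but generalizes worse, while K=100 smooths the decision boundary at the cost of ignoring local structure.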

(b) How to choose a proper K

We can use Cross Validation.
Since we don't know in advance which K is best, we can set a range of candidate K values, evaluate each KNN's performance on the validation data, and then choose the K with the lowest validation error as the best K.
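
A minimal sketch of this selection procedure, assuming scikit-learn and placeholder arrays X, y for the training data (not the assignment's own KNN implementation):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, k_candidates=range(1, 31), cv=5):
    # Mean cross-validated accuracy for each candidate K;
    # the best K is the one with the lowest validation error.
    scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
              for k in k_candidates]
    return list(k_candidates)[int(np.argmax(scores))]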

(c) Hack the CAPTCHA

I chose 20 CAPTCHAs, i.e. 100 individual digits, as the training set and 10 CAPTCHAs as the test set, as shown in the picture below.

[Figure] Training set and test set

Then I labeled the training set and stored the 100 samples and labels in hack_data.npz. KNN was then used to predict the digits in the test set, and the accuracy was 100%.

[Figure] The 10 test CAPTCHAs
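
A minimal sketch of the prediction step (scikit-learn's KNN stands in for my own code here, and the key names 'x' and 'y' inside hack_data.npz are assumptions; adjust them to the actual file):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Load the 100 labeled digit samples saved earlier.
data = np.load('hack_data.npz')
x_train, y_train = data['x'], data['y']   # assumed key names

knn = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)

def hack(digit_images):
    # digit_images: one flattened test digit per row, same format as x_train
    return knn.predict(digit_images)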

3. Decision Tree and ID3

ID3 uses Information Gain to choose the partition feature.
Below is the calculation process.
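
For reference, the quantities computed in the script further below are the standard information entropy and information gain; here D is a dataset, p_k is the proportion of class k in D, and D^v is the subset of D taking value v on feature a:

H(D) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{Gain}(D, a) = H(D) - \sum_{v} \frac{|D^v|}{|D|} \, H(D^v)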

Draw the decision tree and annotate each non-leaf node.

The script I used to calculate the Information Gain:

import numpy as np

def cal_info_entropy(*probs):
    # Information entropy: H = -sum(p * log2(p)) over the given class probabilities
    info_e = 0
    for p in probs:
        info_e -= p * np.log2(p)
    return info_e

# Root (non-leaf) node: compare the Gender and GPA splits on the full dataset (450 samples)
h = cal_info_entropy(4 / 9, 5 / 9)
h_gender = 205 / 450 * cal_info_entropy(105 / 205, 100 / 205) + 245 / 450 * cal_info_entropy(95 / 245, 150 / 245)
h_gpa = 215 / 450 * cal_info_entropy(15 / 215, 200 / 215) + 235 / 450 * cal_info_entropy(185 / 235, 50 / 235)
print('Dataset info:', h)
print('Gender info:', h_gender)
print('GPA info:', h_gpa)
print('GPA info Gain:', h - h_gpa)

# Left child of the GPA split (235 samples): information gain of splitting on Gender
h = cal_info_entropy(185 / 235, 50 / 235)
h_gender = 115 / 235 * cal_info_entropy(95 / 115, 20 / 115) + 120 / 235 * cal_info_entropy(90 / 120, 30 / 120)
print('Gender left info Gain:', h - h_gender)

# Right child of the GPA split (215 samples): information gain of splitting on Gender
h = cal_info_entropy(15 / 215, 200 / 215)
h_gender = 90 / 215 * cal_info_entropy(10 / 90, 80 / 90) + 125 / 215 * cal_info_entropy(5 / 125, 120 / 125)
print('Gender right info Gain:', h - h_gender)

Results:

Dataset info: 0.9910760598382222
Gender info: 0.9798427350133525
GPA info: 0.5643777078470715
GPA info Gain: 0.4266983519911507
Gender left info Gain: 0.006269007038336882
Gender right info Gain: 0.01352135817465705

At the root, GPA gives a much larger information gain (0.4267) than Gender (0.9911 - 0.9798 ≈ 0.0112), so GPA is chosen as the first partition feature. Gender is then used at both children of the GPA split, where it gives only a small additional gain (0.0063 on the left and 0.0135 on the right).

4. K-Means Clustering

(a) Two trials of k-means

Black points and green points mark the initial and final cluster centers in the following trial images.

When k = 2

  • smallest SD
  • largest SD

When k = 3

  • smallest SD
  • largest SD

Conclusion:

  • The randomly chosen initial cluster centers have a significant influence on the number of iterations the k-means algorithm needs to converge.

(b) Get a stable result using k-means

We can choose the k initial centers so that they are as far apart from each other as possible. The process:

  1. randomly choose the 1st cluster center
  2. choose the 2nd center as the point with the largest distance to the 1st center
  3. choose the 3rd center as the point with the largest sum of distances to the 1st and 2nd centers
  4. repeat step 3 until we have all k centers

Alternatively, we can use Hierarchical Clustering or the Canopy algorithm to pre-cluster the data and use the resulting centers as the initial cluster centers for k-means.
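
A minimal sketch of the farthest-point initialization described above (pure NumPy; X is a placeholder for the n-by-d data matrix):

import numpy as np

def farthest_point_init(X, k, seed=0):
    # Greedy farthest-point initialization: start from a random point, then
    # repeatedly add the point whose summed distance to the already-chosen
    # centers is largest.
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        dist_sum = np.zeros(len(X))
        for c in centers:
            dist_sum += np.linalg.norm(X - c, axis=1)
        centers.append(X[int(np.argmax(dist_sum))])
    return np.array(centers)

k-means++ is a closely related, widely used alternative: it samples each new center with probability proportional to its squared distance to the nearest chosen center, trading determinism for better theoretical guarantees.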

(c) Run k-means on digit_data.mat

[Figure] Test 1

[Figure] Test 2

Conclusion:

  • When k = 10, we get exactly 10 cluster labels, which almost cover all the digit labels in the data.
  • When k increases, we have more cluster centers, and instances of the same digit get split into different clusters. This is similar to the overfitting problem.
  • So a suitable k is important for the k-means algorithm, and sometimes we need prior knowledge from domain experts to find a better k.
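
A minimal sketch of this experiment (the key name 'X' inside digit_data.mat and the use of scikit-learn's KMeans in place of the course implementation are assumptions):

import numpy as np
from scipy.io import loadmat
from sklearn.cluster import KMeans

data = loadmat('digit_data.mat')
X = data['X']              # assumed key; check data.keys() and transpose if needed
X = X.reshape(len(X), -1)  # flatten each digit image into a row vector

for k in (10, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, np.bincount(labels))  # size of each cluster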

(d) Vector quantization

When using Fixed Length Encoding

  • K = 64, so each codeword needs log2(K) = 6 bits
  • compression ratio = 6 / 24 = 25% (the original image uses 24 bits per pixel: 8 bits for each of the R, G, B channels)

When using Huffman Encoding

  • avg_bits = 4.792682438794727
  • compression ratio = avg_bits / 24 ≈ 19.97%
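
A minimal sketch of how the average Huffman code length can be computed from the per-codeword pixel counts (the counts array itself is not reproduced here and must come from the quantized image):

import heapq

def huffman_avg_bits(counts):
    # counts: number of pixels assigned to each of the K codewords.
    # In a Huffman tree, the sum of the merged weights over all merges equals
    # sum(count * code_length), so dividing by the total pixel count gives the
    # average bits per pixel.
    heap = [float(c) for c in counts]
    heapq.heapify(heap)
    total = sum(heap)
    merged_sum = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        merged_sum += a + b
        heapq.heappush(heap, a + b)
    return merged_sum / total

# compression ratio relative to 24-bit RGB: huffman_avg_bits(counts) / 24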
