HW2
Due Date: Nov. 23

Part I: Written Assignment

1. a) Compute the Information Gain for Gender, Car Type and Shirt Size.

There are two classes, C0 and C1, with 10 records each, so the entropy of the whole data set is

$I(C_0,C_1)=I(10,10)=1$

For Gender:

$\mathrm{Info}_{\mathrm{Gender}}(D)=\frac{10}{20}I(6,4)+\frac{10}{20}I(4,6)=\frac{10}{20}\left(-\frac{6}{10}\log_2\frac{6}{10}-\frac{4}{10}\log_2\frac{4}{10}\right)+\frac{10}{20}\left(-\frac{4}{10}\log_2\frac{4}{10}-\frac{6}{10}\log_2\frac{6}{10}\right)=0.971$

$\mathrm{Gain}(\mathrm{Gender})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{Gender}}(D)=1-0.971=0.029$

For Car Type:

$\mathrm{Info}_{\mathrm{CarType}}(D)=\frac{4}{20}I(1,3)+\frac{8}{20}I(8,0)+\frac{8}{20}I(1,7)=\frac{4}{20}\left(-\frac{1}{4}\log_2\frac{1}{4}-\frac{3}{4}\log_2\frac{3}{4}\right)+0+\frac{8}{20}\left(-\frac{1}{8}\log_2\frac{1}{8}-\frac{7}{8}\log_2\frac{7}{8}\right)=0.3797$

$\mathrm{Gain}(\mathrm{CarType})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{CarType}}(D)=1-0.3797=0.6203$

For Shirt Size:

$\mathrm{Info}_{\mathrm{ShirtSize}}(D)=\frac{5}{20}I(3,2)+\frac{7}{20}I(3,4)+\frac{4}{20}I(2,2)+\frac{4}{20}I(2,2)=\frac{5}{20}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right)+\frac{7}{20}\left(-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}\right)+\frac{4}{20}\cdot 1+\frac{4}{20}\cdot 1=0.9876$

$\mathrm{Gain}(\mathrm{ShirtSize})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{ShirtSize}}(D)=1-0.9876=0.0124$

These three gains can be verified with the short sketch below.
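As a sanity check, here is a minimal Python sketch of the entropy and information-gain computations; the `entropy` and `info_gain` helpers are my own, and the class counts are copied from the computations above.

```python
import math

def entropy(counts):
    """I(c1, c2, ...): entropy of a tuple of class counts, in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Information gain of a split: parent entropy minus the
    size-weighted entropy of the child partitions."""
    total = sum(parent)
    expected = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - expected

print(info_gain((10, 10), [(6, 4), (4, 6)]))                  # Gender    -> 0.029
print(info_gain((10, 10), [(1, 3), (8, 0), (1, 7)]))          # CarType   -> 0.6203
print(info_gain((10, 10), [(3, 2), (3, 4), (2, 2), (2, 2)]))  # ShirtSize -> 0.0124
```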
b) Construct a decision tree with Information Gain.

① From (a), CarType has the largest information gain, so it is chosen as the first splitting attribute. CarType takes the values Luxury, Family and Sports; the Sports branch contains only C0 records, so it needs no further splitting.

② Splitting the Luxury branch further (8 records: 1 of class C0 and 7 of class C1):

$I(C_0,C_1)=I(1,7)=0.5436$

$\mathrm{Info}_{\mathrm{Gender}}(D)=\frac{1}{8}I(0,1)+\frac{7}{8}I(1,6)=0+\frac{7}{8}\left(-\frac{1}{7}\log_2\frac{1}{7}-\frac{6}{7}\log_2\frac{6}{7}\right)=0.5177$

$\mathrm{Gain}(\mathrm{Gender})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{Gender}}(D)=0.5436-0.5177=0.0259$

$\mathrm{Info}_{\mathrm{ShirtSize}}(D)=\frac{2}{8}I(0,2)+\frac{3}{8}I(0,3)+\frac{2}{8}I(1,1)+\frac{1}{8}I(0,1)=0.25$

$\mathrm{Gain}(\mathrm{ShirtSize})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{ShirtSize}}(D)=0.5436-0.25=0.2936$

ShirtSize therefore has the larger gain and is chosen as the splitting attribute for the Luxury branch.
③ Splitting the Family branch further (4 records: 1 of class C0 and 3 of class C1):

$I(C_0,C_1)=I(1,3)=0.811$

$\mathrm{Gain}(\mathrm{Gender})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{Gender}}(D)=0.811-I(1,3)=0$ (all four Family records share the same gender, so splitting on Gender changes nothing)

$\mathrm{Gain}(\mathrm{ShirtSize})=I(C_0,C_1)-\mathrm{Info}_{\mathrm{ShirtSize}}(D)=0.811-\left[\frac{1}{4}I(1,0)+\frac{1}{4}I(0,1)+\frac{1}{4}I(0,1)+\frac{1}{4}I(0,1)\right]=0.811-0=0.811$

ShirtSize is therefore chosen as the splitting attribute for the Family branch as well.

④ From the computations above, the resulting decision tree is:

[Figure: decision tree. The root splits on CarType: Sports → C0; Family → ShirtSize (Small → C0; Medium, Large, Extra Large → C1); Luxury → ShirtSize.]

2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.

From the attributes of the data, the input layer has 8 nodes:

x1: Gender (M: x1 = 1; F: x1 = 0)
x2: Car Type = Sports (Y = 1; N = 0)
x3: Car Type = Family (Y = 1; N = 0)
x4: Car Type = Luxury (Y = 1; N = 0)
x5: Shirt Size = Small (Y = 1; N = 0)
x6: Shirt Size = Medium (Y = 1; N = 0)
x7: Shirt Size = Large (Y = 1; N = 0)
x8: Shirt Size = Extra Large (Y = 1; N = 0)

The hidden layer has three nodes, x9, x10 and x11. Since this is a two-class problem, there is a single output node, x12 (C0 = 1; C1 = 0). In the network diagram, wij denotes the weight from input node i to hidden node j (for ease of computation, each input node i is given the same initial weight to nodes 9, 10 and 11), and w_{i-12} denotes the weight from hidden node i to the output node.

[Figure: feed-forward network with input layer x1–x8, hidden layer x9–x11, and output node x12.]

(c) Using the neural network obtained above, show the weight values after one iteration of the back-propagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

The instance (M, Family, Small) has class label C0, so its training tuple is {1, 0, 1, 0, 1, 0, 0, 0} with target output 1.

Table 1: initial inputs, weights, biases and the learning rate.
Table 2: net input and output of each node.
Table 3: error at each node.
Table 4: updated weights and biases.

One iteration of back-propagation proceeds as in the sketch below (Tables 1–4 record the corresponding values).
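The update rules are the standard back-propagation ones: $I_j=\sum_i w_{ij}O_i+\theta_j$, $O_j=1/(1+e^{-I_j})$, $Err_j=O_j(1-O_j)(T_j-O_j)$ at the output node, $Err_j=O_j(1-O_j)\sum_k Err_k w_{jk}$ at hidden nodes, and $w_{ij}\leftarrow w_{ij}+l\cdot Err_j\cdot O_i$, $\theta_j\leftarrow\theta_j+l\cdot Err_j$. Below is a minimal Python sketch of one such iteration, assuming sigmoid activations; the initial weights, biases and learning rate are illustrative placeholders standing in for the values in Table 1.

```python
import math

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

x = [1, 0, 1, 0, 1, 0, 0, 0]   # training tuple for (M, Family, Small)
t = 1                          # target output (class C0 = 1)
lr = 0.9                       # learning rate (assumed)

# w_in[i][j]: weight from input node i to hidden node j; as stated above,
# each input starts with the same weight toward all three hidden nodes.
w_in = [[0.1] * 3 for _ in range(8)]   # assumed initial weights
b_hid = [-0.1, 0.2, 0.1]               # hidden biases (assumed)
w_out = [0.3, -0.2, 0.1]               # hidden-to-output weights (assumed)
b_out = 0.2                            # output bias (assumed)

# Forward pass (Table 2): I_j = sum_i w_ij * O_i + theta_j, O_j = sigmoid(I_j)
o_hid = [sigmoid(sum(x[i] * w_in[i][j] for i in range(8)) + b_hid[j])
         for j in range(3)]
o_out = sigmoid(sum(o_hid[j] * w_out[j] for j in range(3)) + b_out)

# Backward pass (Table 3): output error first, then hidden-layer errors
err_out = o_out * (1 - o_out) * (t - o_out)
err_hid = [o_hid[j] * (1 - o_hid[j]) * err_out * w_out[j] for j in range(3)]

# Updates (Table 4): w += lr * Err_downstream * O_upstream, theta += lr * Err
for j in range(3):
    w_out[j] += lr * err_out * o_hid[j]
    for i in range(8):
        w_in[i][j] += lr * err_hid[j] * x[i]
    b_hid[j] += lr * err_hid[j]
b_out += lr * err_out

print("output:", round(o_out, 4))
print("updated w_out:", [round(w, 4) for w in w_out], "b_out:", round(b_out, 4))
```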
3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

Let U denote an undergraduate student, G a graduate student, and S a smoker. Then P(S|U) = 0.15, P(S|G) = 0.23, P(G) = 0.2, P(U) = 0.8. By Bayes' theorem,

$P(G|S)=\frac{P(S|G)P(G)}{P(S)}=\frac{P(S|G)P(G)}{P(S|U)P(U)+P(S|G)P(G)}=\frac{0.23\times 0.2}{0.15\times 0.8+0.23\times 0.2}=0.277$

b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

Since P(U) > P(G), an undergraduate student.

c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Let D denote living in a dorm, so P(D|U) = 0.1 and P(D|G) = 0.3. Using the assumed independence of D and S within each class:

$P(G|D\cap S)\,P(D\cap S)=P(D\cap S|G)\,P(G)=P(D|G)\,P(S|G)\,P(G)=0.3\times 0.23\times 0.2=0.0138$

$P(U|D\cap S)\,P(D\cap S)=P(D\cap S|U)\,P(U)=P(D|U)\,P(S|U)\,P(U)=0.1\times 0.15\times 0.8=0.012$

Since 0.0138 > 0.012 and both sides are scaled by the same factor P(D∩S), we have P(G|D∩S) > P(U|D∩S), so the student is more likely to be a graduate student.

4. (a) The three cluster centers after the first round of execution.

Round 1: the initial centers are A1(4, 2, 5), B1(1, 1, 1) and C1(11, 9, 2).

① Compute the distance from each point to each center (in Table 1, d(pi, A1) denotes the distance from point pi to A1, and likewise for B1 and C1).

Table 1: distance from each point to the initial centers.

② From the table, the first-round clusters are:
Cluster 1: A1, A3, B3, C3, C4
Cluster 2: B2, B1
Cluster 3: C1, A2

The assignment-and-update procedure itself is sketched below.
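For reference, here is a minimal sketch of the k-means procedure used in this question, assuming Euclidean distance. Only the three initial center coordinates are reproduced in this copy of the assignment, so the `points` dictionary below is a placeholder to be filled in with the full data table.

```python
def euclid(p, q):
    """Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of its cluster; stop when assignments repeat."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centers)), key=lambda c: euclid(p, centers[c]))
            for name, p in points.items()
        }
        if new_assignment == assignment:   # two identical rounds: converged
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [points[n] for n, k in assignment.items() if k == c]
            if members:
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment, centers

# Initial centers from Q4(a); replace `points` with the full data table
# (A1..A3, B1..B3, C1, C3, C4) to reproduce the clustering above.
points = {"A1": (4, 2, 5), "B1": (1, 1, 1), "C1": (11, 9, 2)}
centers = [(4, 2, 5), (1, 1, 1), (11, 9, 2)]
print(kmeans(points, centers))
```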
(b) The final three clusters.

Round 2: compute the mean of each cluster:
Cluster 1: M1(5.2, 4.4, 7.2); Cluster 2: M2(1.5, 2, 1.5); Cluster 3: M3(10.5, 7, 2)

① Compute the distance from each point to the new cluster centers.

Table 2: distance from each point to the first-round cluster centers.

② After re-clustering, the clusters are:
Cluster 1: A1, A3, B3, C3, C4
Cluster 2: B2, B1
Cluster 3: C1, A2

③ Analysis: the second-round clustering is identical to the first round, so the algorithm stops.

Part II: Lab

Question 1

1. Build a decision tree using the data set "transactions" that predicts milk as a function of the other fields. Set the "type" of each field to "Flag", set the "direction" of "milk" to "out", set the "type" of COD to "Typeless", select "Expert", set the "pruning severity" to 65, and set the "minimum records per child branch" to 95. Hand in: a figure showing your tree.

2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the "rollout" data to determine whether the customer would buy milk. Hand in: your prediction for each of the 20 customers.

From the program output, customers 2, 3, 4, 5, 9, 10, 13, 14, 17 and 18 would buy Milk; the remaining customers would not.
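For readers without Clementine, here is a rough scikit-learn analogue of this flow. The file names transactions.csv and rollout.csv and the column layout are assumptions, and Clementine's "pruning severity" has no exact scikit-learn equivalent; min_samples_leaf only approximates "minimum records per child branch".

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.read_csv("transactions.csv")    # assumed file name
rollout = pd.read_csv("rollout.csv")       # assumed file name

# Exclude the target and the typeless COD field from the predictors.
features = [c for c in train.columns if c not in ("milk", "COD")]

clf = DecisionTreeClassifier(min_samples_leaf=95)  # ~ "minimum records per child branch"
clf.fit(train[features], train["milk"])

print(export_text(clf, feature_names=features))    # the tree to hand in
rollout["milk_pred"] = clf.predict(rollout[features])
print(rollout["milk_pred"])                        # prediction for each of the 20 customers
```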