South China University of Technology
School of Computer Science and Engineering
Final Examination, First Semester, 2004-2005 Academic Year
"Data Mining and Data Warehouse Technology" Examination Paper

Major: Bilingual Class    Grade: 2001    Name:          Student ID:

Notes:
1. This paper consists of four parts, for a total of 100 points; the examination time is 120 minutes.
2. Write all answers directly on the examination paper.

I. Fill in the following blanks. (1 point per blank, total: 20 points)

1. A data warehouse is a __________, __________, __________ and __________ collection of data in support of management's decision-making process.
2. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a __________ schema, a __________ schema, or a __________ schema.
3. OLTP is the abbreviation for ____________________, and OLAP is the abbreviation for ____________________.
4. Measures can be organized into the following three categories, based on the kind of aggregate functions used: __________, __________ and __________.
5. Methods for data preprocessing can be organized into the following categories: __________, __________, __________ and __________.
6. List four knowledge types to be mined: __________, __________, __________ and __________.

II. True or False: if you think the following statement is true, mark it with √; otherwise mark it with ×. (1 point per decision, total: 10 points)

1. Decision tree induction is an unsupervised learning method. ( )
2. Clustering is a supervised learning method. ( )
3. The OLTP system is operational processing-oriented, while the OLAP system is informational processing-oriented. ( )
4. The access operation to the OLTP system is mostly read/write, while the access operation to the OLAP system is mostly read. ( )
5. Outliers and noisy data are useless for data mining tasks and should be removed. ( )
6. For an itemset S, the constraint S ⊆ V is anti-monotone. ( )
7. For an itemset S, the constraint min(S) ≥ v is monotone. ( )
8. The aggregate functions min() and max() are distributive, where min() is used to compute the minimum value of a data set and max() is used to compute the maximum value of a data set. ( )
9. The aggregate function avg() is holistic, where avg() is used to compute the average value of a data set. ( )
10. The difference between the K-means and K-medoids clustering methods is that the former uses the centroid to represent a cluster, while the latter uses a real object to represent a cluster. ( )

III. Miscellaneous questions. (8 points per question, total: 24 points)

1. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit. Draw the star schema diagram for the above data warehouse.
2. Given the following table (Table 1):
   (1). [...] rule. For example, ∀X, Guangzhou(X) ⇔ TV(X) [t: x%, d: y%] ∨ ⋯.
   (2). Map the class Computer (target class) into a (bi-directional) quantitative descriptive rule.
3. Given a frequent itemset m and a subset s of m, prove that the confidence of the rule "s' ⇒ (m − s')" cannot be greater than the confidence of "s ⇒ (m − s)", where s' is a subset of s.

IV. Problems. (Total: 46 points)

1. In information retrieval, keyword-based retrieval is the dominant method. A document is represented by a set of words, called keywords, and to retrieve documents you only need to supply some keywords. Given the following keyword-document table (Table 2), the first row means that document D1 is represented by keywords K1, K2 and K4.
   (1). Find all frequent patterns of keywords using the Apriori algorithm, and generate strong association rules from L2 (i.e. the set of frequent 2-patterns). Assume the minimum support count is 2 and the minimum confidence is 80%. (12 points)
   (2). Draw the frequent pattern tree. (6 points)
2. Table 3 presents a training set of data tuples about whether to play tennis. Given a tuple (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong), decide whether the target class Playtennis is yes or no using the naïve Bayesian classifier. (18 points)
3. Table 4 presents the distances between any two objects, e.g. the distance between objects 1 and 2 is 2.5. Assume the distance between two clusters d(C1, C2) is defined as follows: d(C1, C2) = Min{d_ij | i ∈ C1, j ∈ C2}, where C1 and C2 are two clusters, d_ij is the distance between objects i and j, and Min computes the minimum value of a set. Cluster the objects using the agglomerative hierarchical clustering method and draw the dendrogram (i.e. show how the clusters are merged hierarchically). (10 points)
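
The cluster distance defined in Problem 3, the minimum pairwise distance between members of two clusters, is commonly called single linkage. The following is a minimal sketch of agglomerative clustering under that definition, not a model answer. The function name `single_linkage` and the distance values are hypothetical illustrations: only d(1, 2) = 2.5 comes from the problem text, and the remaining values are invented here and are not those of Table 4.

```python
from itertools import combinations


def single_linkage(dist, objects):
    """Agglomerative clustering with d(C1, C2) = min d_ij; returns the merge history."""

    def d(i, j):
        # Distances are symmetric, so look the pair up in either order.
        return dist[(i, j)] if (i, j) in dist else dist[(j, i)]

    def cluster_distance(c1, c2):
        # Single linkage: the smallest distance between any two members.
        return min(d(i, j) for i in c1 for j in c2)

    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest to each other.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        height = cluster_distance(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), height))
    return merges


# Hypothetical distances (NOT Table 4); only d(1, 2) = 2.5 is from the text.
example = {(1, 2): 2.5, (1, 3): 6.0, (1, 4): 7.0,
           (2, 3): 4.0, (2, 4): 5.5, (3, 4): 1.0}
merges = single_linkage(example, [1, 2, 3, 4])
for left, right, height in merges:
    print(left, right, height)
```

With these illustrative distances, objects 3 and 4 merge first at height 1.0, then objects 1 and 2 at 2.5, and the two resulting clusters merge last at 4.0 (the smallest cross-cluster distance, d(2, 3)). Reading the merge history from first to last gives the dendrogram from the bottom up.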