
"Advanced Artificial Intelligence" Presentation Report (Chinese-English)

South China University of Technology
"Advanced Artificial Intelligence" Presentation Report
Topic: Machine learning: Trends, perspectives, and prospects (Unsupervised learning and feature reduction)
School: School of Computer Science and Engineering
Major: Computer Science and Technology (All-English Innovation Class)
Student Name:
Student ID:
Supervisor:
Start Date: November 1, 2015

Feature selection

The target of feature selection is to select a subset of the features instead of mapping them into a lower dimension. Given a set of features F, the feature selection problem is defined as finding a subset that maximizes the learner's ability to classify patterns. More formally, F' should maximize some scoring function Θ, where Γ is the space of all possible feature subsets of F:

F' = argmax_{G ∈ Γ} Θ(G)

The framework of feature selection has two main parts: a generation step and an evaluation step. In the generation step, the main task is to select candidate feature subsets for evaluation. There are three ways in which the feature space can be examined: (1) complete, (2) heuristic, (3) random.

(1) Complete/exhaustive: examine all combinations of possible feature subsets (2^N of them for N features). For example, the feature set {f1, f2, f3} is examined as {f1,f2,f3} => { {f1}, {f2}, {f3}, {f1,f2}, {f1,f3}, {f2,f3}, {f1,f2,f3} }. The optimal subset is achievable if we search all possible solutions, but this is too expensive when the feature space is very large.

(2) Heuristic: the selection is directed by a certain guideline. Start with the empty feature set (or the full set) and select (or delete) one feature in each step until the target number of features is reached, for example the incremental generation of subsets {f1} → {f1,f3} → {f1,f3,f2}.

(3) Random: there is no predefined way to select candidate features; features are picked at random. This requires more user-defined input parameters, such as the number of trials.

According to whether the learning algorithm participates in the selection step, feature selection methods can be divided into three categories: filter, wrapper, and embedded.

The filter approach is usually fast. It provides a generic selection of features that is not tuned to a given learning algorithm. However, it is tied to a specific statistical measure and is not optimized for the classifier that will actually be used, so filter methods are sometimes used only as a pre-processing step for other methods.

In the wrapper approach, the learner is treated as a black box used to score subsets according to their predictive power. The accuracy is usually high, but the result varies between learners, so generality is lost. One needs to define how to search the space of all possible feature subsets (2^N possible selections) and how to assess the prediction performance of a given subset. Finding the optimal subset is NP-hard, so a wide range of heuristic search strategies can be used: IDPT, branch-and-bound, simulated annealing, tabu search, genetic algorithms, forward selection (start with the empty feature set and add one feature at each step), and backward deletion (start with the full feature set and delete one feature at each step). Predictive power is usually measured on a validation set or by cross-validation. The drawbacks of wrapper methods are that they require a large amount of computation and carry a danger of overfitting (a minimal sketch of wrapper-style forward selection is given below).

The embedded approach is specific to a given learning machine. It combines the advantages of the two previous approaches: it takes advantage of the learner's own variable selection mechanism and is usually implemented as a two-step or multi-step process.
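To make the generation/evaluation framework and the wrapper idea concrete, here is a minimal sketch of heuristic forward selection wrapped around a classifier, assuming scikit-learn is available; the breast-cancer dataset, the scaled linear SVM, and the target subset size of 5 are illustrative choices of mine, not the setup used in this report.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Wrapper approach: the learner is a black box that scores candidate subsets
# by cross-validated accuracy; generation is heuristic forward selection.
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))

selected, remaining = [], list(range(X.shape[1]))
target_size = 5                      # stop once this many features are selected

while remaining and len(selected) < target_size:
    # Generation step: propose each remaining feature as a candidate addition.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    # Evaluation step: keep the candidate with the best predictive power.
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, 5-fold CV accuracy = {scores[best]:.3f}")

print("selected feature subset:", selected)
```

The same loop turns into backward deletion by starting from the full set and removing the feature whose deletion hurts the cross-validated score least.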
For the evaluation step, the main task is to measure how well a candidate subset discriminates between the classes. The five main types of evaluation functions are: distance (Euclidean distance measures), information (entropy, information gain, etc.), dependency (correlation coefficient), consistency (min-features bias), and classification error rate. The first four are used in filter methods and the last one in wrapper methods.

An application of feature selection in supervised learning is given in the following, extracted from the paper "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy".

The optimal characterization condition of feature selection in supervised learning is minimal classification error, which in this paper is pursued through maximal statistical dependency of the target class on the selected features (Max-Dependency). One of the most popular approaches to realizing Max-Dependency is maximal relevance (Max-Relevance):

max D(S, c),  D = (1/|S|) Σ_{xi ∈ S} I(xi; c)

The problem with this method is that combinations of individually good features do not necessarily lead to good classification performance, i.e. "the m best features are not the best m features". Moreover, relevance alone may introduce rich redundancy, so features with minimum redundancy should also be considered. The authors therefore propose another algorithm that solves this problem.

The main work of the paper consists of three parts: (1) a theoretical analysis showing that mRMR (max-relevance and min-redundancy) is equivalent to Max-Dependency for first-order feature selection; (2) an investigation of how to combine mRMR with other feature selection methods into a two-stage selection algorithm; (3) a comparison of mRMR, Max-Relevance, Max-Dependency, and the two-stage feature selection algorithm through comprehensive experiments. Since the first part is unrelated to this course project, it is skipped here, and only one experiment from the original paper will be mentioned.

The proposed algorithm is named mRMR (Max-Relevance and Min-Redundancy). Max-Relevance means selecting the features that are most relevant to the target class, i.e. features satisfying the criterion above, where I(x; y) is the mutual information mentioned earlier. Min-Redundancy means selecting features that are not redundant with the already selected features, i.e. features satisfying:

min R(S),  R = (1/|S|²) Σ_{xi, xj ∈ S} I(xi; xj)

An operator Φ is then defined to combine D and R into a single multi-objective optimization task, so that D and R are optimized simultaneously:

max Φ(D, R),  Φ = D − R

In practice, an incremental search can be used to find near-optimal features: given the already selected set S_{m−1}, the m-th feature is chosen by

max_{xj ∈ X − S_{m−1}} [ I(xj; c) − (1/(m−1)) Σ_{xi ∈ S_{m−1}} I(xj; xi) ]

This is not yet the whole algorithm; it is only half of it. The algorithm in the paper is a two-stage process: (1) find a candidate feature subset using the mRMR incremental selection method; (2) use a more sophisticated method (one that involves the classifier) to search for a compact feature subset within that candidate subset. This two-stage algorithm is therefore a case of the embedded method.

The first stage is as follows:
(1) Use mRMR incremental selection to select n sequential (nested) feature subsets S1 ⊂ S2 ⊂ ... ⊂ Sn.
(2) Compare the classification error of all these subsets and find the range of k, called Ω, within which the respective error is consistently small.
(3) Within Ω, find the smallest error e* = min ek; the optimal subset size is the k that corresponds to e*.
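Below is a minimal sketch of the first-stage mRMR incremental selection, assuming scikit-learn's mutual information estimators; the function name mrmr_incremental and the use of mutual_info_regression for feature-feature mutual information are my own assumptions, not the implementation of the original paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_incremental(X, y, n_select):
    """Greedy mRMR: at each step add the feature xj that maximizes
    I(xj; c) - (1/|S|) * sum_{xi in S} I(xj; xi)."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y)        # I(xj; c) for every feature
    selected = [int(np.argmax(relevance))]       # start from the most relevant feature
    ff_mi = {}                                   # cache of feature-feature MI values

    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = 0.0
            for i in selected:
                key = (min(i, j), max(i, j))
                if key not in ff_mi:
                    ff_mi[key] = mutual_info_regression(X[:, [i]], X[:, j])[0]
                redundancy += ff_mi[key]
            score = relevance[j] - redundancy / len(selected)   # mRMR criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

The prefixes of the returned list, e.g. selected[:k], give the nested subsets S1 ⊂ S2 ⊂ ... whose classification errors are then compared in step (2) of the first stage.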
The second stage is as follows:
(1) Backward selection: each time, exclude one redundant feature if the resulting error is smaller than the current one (selecting the feature whose removal leads to the greatest error reduction). Terminate when no further error reduction can be obtained.
(2) Forward selection: each time, select the one feature that leads to the greatest error reduction. Terminate when the error begins to increase.

Now the algorithm of the paper is complete. To evaluate how effective and efficient it is, there remains the problem of how to judge which algorithm is superior, so the authors define a measurement called the RM-characteristic. Given two feature sets S1 and S2 that are generated sequentially, we say that S1 is recursively more characteristic (RM-characteristic) than S2 by ρ% if, for ρ% of the subset sizes k, the error of the first k features of S1 is smaller than that of S2.

The figure above shows one of the experimental results given in the paper. Each row is for a different dataset and each column for a different classification algorithm. In each graph the x-axis denotes the number of selected features and the y-axis the error rate; the line marked with triangles is the proposed algorithm and the bottom one is the state-of-the-art algorithm at that time. As the results show, classification accuracy can be significantly improved by mRMR feature selection.

I also ran an experiment myself to verify that feature selection can improve accuracy. This experiment was carried out on the Spambase dataset with an SVM using a linear kernel. The x-axis denotes the number of selected features and the y-axis the accuracy. The red line is the proposed algorithm; the others are the baseline, traditional, and random methods. The proposed algorithm performs best, so I am convinced that feature selection methods can improve the accuracy of a learning algorithm.

Random projection

Random projection is a feature extraction algorithm. The most famous feature extraction algorithms include PCA, LDA, LLE, etc. Random projection is sometimes mentioned as an LSH method, and because it is highly unstable it is not as famous; but it is quite useful in some cases and much more efficient than famous algorithms such as PCA.

The main steps of random projection can be introduced briefly:
(1) Randomly select a set of high-dimensional unit vectors (not necessarily orthogonal).
(2) Project the high-dimensional data into the low dimension by taking its product with these vectors.

These steps sound simple and somewhat unreliable, but in fact there is a lemma that guarantees their precision, the Johnson-Lindenstrauss Lemma. Its main idea is that it is possible to project n points in a space of arbitrarily high dimension onto an O(log n)-dimensional space such that the pairwise distances between the points are approximately preserved. More formally, for any 0 < ε < 1 and any integer n, let k ≥ 4(ε²/2 − ε³/3)⁻¹ ln n; then for any set V of n points in R^d there exists a map f: R^d → R^k such that for all u, v ∈ V:

(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²

Here we use pairwise sample distance as the measure of goodness of feature reduction, because one objective of feature reduction is that the pairwise distances between points stay approximately the same as before. In data mining, a dataset has two representations: the data matrix and the discernibility matrix. If the pairwise distances of the data points are preserved precisely, then the discernibility matrix retains most of the information of the original dataset, and we say that the method is a good feature reduction method.

There are several ways to perform random projection; we adopt the one from the original Johnson-Lindenstrauss paper, which projects the data onto a randomly chosen k-dimensional subspace. To make this easier to understand, I drew a diagram of the process.

The advantages of random projection are that it does not use any predefined "interestingness" criterion, unlike PCA, and that high-dimensional distributions look more Gaussian when projected to a low dimension.
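A minimal sketch of random projection follows, assuming NumPy and SciPy; for simplicity it uses a Gaussian random matrix rather than the exact construction of the original Johnson-Lindenstrauss paper, and the toy data and dimensions are arbitrary choices of mine.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy high-dimensional data: n points in d dimensions, projected down to k.
n, d, k = 500, 1000, 50
X = rng.normal(size=(n, d))

# Step 1: draw a set of random projection vectors (not necessarily orthogonal).
R = rng.normal(size=(d, k)) / np.sqrt(k)

# Step 2: project the data into the k-dimensional space by a matrix product.
X_low = X @ R

# Pairwise distances are approximately preserved, as the JL lemma promises.
ratio = pdist(X_low) / pdist(X)
print(f"distance ratios after projection: mean = {ratio.mean():.3f}, std = {ratio.std():.3f}")
```

Because R is drawn at random, repeating the projection gives a different low-dimensional view each time, which is exactly the instability discussed next.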
But random projection is a highly unstable algorithm. For example, in the figure the left picture shows the true distribution of a high-dimensional dataset (drawn using two of its features), while the middle and right pictures show two single runs of a clustering algorithm after random projection. The results of individual runs can differ greatly. However, it is exactly this unstable behaviour that provides multiple views of the same dataset, which is useful in ensemble learning.

Cluster ensemble

Ensemble learning has been a hot topic in recent years, and cluster ensembles are one of the newest topics in unsupervised learning. The framework of a classification ensemble is as follows: given a dataset, we first generate different views of it, which can be done by bootstrap sampling, random feature subspaces, or other methods. We then use different learning algorithms, the same algorithm with different parameters, or even just the same algorithm, to generate several different classifiers. When a new data point comes in, the multiple classifiers classify it and the final classification result is obtained by a voting scheme or some other combination method. A cluster ensemble works in almost the same way as a classification ensemble.

For each run, the probability that point i and point j belong to the same cluster is denoted by Pij; with the soft assignments produced by EM, Pij = Σ_k P(k|i) P(k|j), where P(k|i) is the posterior probability that point i belongs to cluster k. Pij is then averaged over multiple runs. To verify the usefulness of this metric, the authors plot histograms of Pij for pairs of samples that do and do not belong to the same cluster: the two distributions are clearly different and have little overlap, so Pij is a good basis for a similarity matrix.

The following algorithm is then used to obtain the final clustering; it is in fact the complete-link agglomerative algorithm applied to this similarity matrix. However, if there are points that are dissimilar to every cluster, the algorithm may perform badly, so the authors define a quantity Pmax for each point. The 10% of points with the lowest Pmax are discarded as outliers during the merging step; after merging, these points are assigned to their most similar clusters.

One experiment compares a single RP+EM run with the ensemble of RP+EM runs. NMI and CE are performance measures for clustering algorithms, and the best result is emphasized with a red rectangle. From this result we can conclude that the ensemble improves the clustering results over its individual components on all three data sets.

Another experiment compares the RP+EM ensemble with a PCA+EM ensemble. The graph shows that RP outperforms PCA in 29 of the 30 results, so we can say that RP is superior to PCA in cluster ensemble studies.
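Here is a minimal sketch of such a random-projection cluster ensemble, assuming scikit-learn; GaussianMixture stands in for the EM clustering, the averaged co-association matrix plays the role of Pij, the Pmax outlier-handling step is omitted for brevity, and metric="precomputed" assumes a recent scikit-learn version (older versions call this parameter affinity). The digits dataset and the dimension/run counts are illustrative choices of mine.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
n, d = X.shape
k_proj, n_runs, n_clusters = 20, 15, 10   # projected dimension, ensemble size, clusters

# Accumulate Pij: the probability that points i and j fall in the same cluster,
# averaged over several random-projection + EM runs.
P = np.zeros((n, n))
for _ in range(n_runs):
    R = rng.normal(size=(d, k_proj)) / np.sqrt(k_proj)      # random projection matrix
    Z = X @ R
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag",
                          random_state=int(rng.integers(1 << 31))).fit(Z)
    soft = gmm.predict_proba(Z)                              # posterior P(k | i)
    P += soft @ soft.T                                       # sum_k P(k|i) * P(k|j)
P /= n_runs
np.fill_diagonal(P, 1.0)

# Final clustering: complete-link agglomerative clustering on the dissimilarity 1 - P.
final = AgglomerativeClustering(n_clusters=n_clusters, metric="precomputed",
                                linkage="complete").fit_predict(1.0 - P)
print("NMI of the ensemble clustering:", normalized_mutual_info_score(y, final))
```

Averaging the co-association matrix over many unstable projections is what turns the instability of a single RP+EM run into a benefit for the ensemble.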
