Neural Network Theory and Applications
Homework Assignment 3
oxstar@SJTU
January 19, 2012

1 Data Preprocessing

First we used 'svm-scale' of LibSVM to scale the data. Scaling has two main advantages: it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges, and it avoids numerical difficulties during the calculation [1]. We linearly scaled each attribute to the range [-1, +1].

2 Model Selection

We tried three different kernel functions, namely linear, polynomial and RBF:

- linear: K(x_i, x_j) = x_i^T x_j
- polynomial: K(x_i, x_j) = (gamma * x_i^T x_j + r)^d, gamma > 0
- radial basis function (RBF): K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0

The penalty parameter C and the kernel parameters (gamma, r, d) must be chosen. We used 'grid-search' [1] on C and gamma, while r and d were left at their default values, 0 and 3. Figure 1 presents the contour maps used to choose suitable parameters. We searched only for local maxima: the global maximum is usually difficult to find, and the running time grows dramatically as the parameter values increase. Note that 'ovr' stands for the one-versus-rest task decomposition method, 'ovo' is short for one-versus-one, and 'pvp' is short for part-versus-part.

The linear kernel has no kernel parameters of its own, so we only had to search for the penalty parameter C. The results are shown in Figure 2. The final selection for each setting is presented in Table 1.

Table 1: Selected Parameters for Each Decomposition Method and Kernel

Decomposition      Kernel        C      gamma
one-versus-rest    RBF           10     1.0
                   Polynomial    0.1    0.7
                   Linear        1      -
one-versus-one     RBF           1      1.5
                   Polynomial    0.01   0.2
                   Linear        0.1    -
part-versus-part   RBF           1      0.1
                   Polynomial    0.01   0.4
                   Linear        1      -

[Figure 1: Grid search for the RBF and polynomial kernels (gamma vs. lg(cost)): (a) RBF (ovr); (b) Polynomial (ovr); (c) RBF (ovo); (d) Polynomial (ovo); (e) RBF (pvp); (f) Polynomial (pvp).]

[Figure 2: Parameter search for the linear kernel (accuracy vs. lg(cost)): (a) ovr; (b) ovo; (c) pvp.]

3 Experiments

3.1 Task Decomposition Methods

Several multi-class classification techniques have been proposed for SVM models. The most typical approach to decomposing the task is the so-called one-versus-rest method, which separates one class from all the others. Assume we construct N two-class classifiers; a test datum is assigned to C_i iff the i-th classifier recognizes it. However, more than one classifier may recognize it; in that case we assign the datum to the C_i whose classifier gives the largest decision value. Conversely, if no classifier recognizes it, we assign it to the C_i whose classifier gives the smallest decision value for the 'rest' class.

One-versus-one, which combines all possible two-class classifiers, is another methodology for dealing with multi-class problems [3]. The number of classifiers grows super-linearly with the number of classes, but the running time may not, because each divided problem is much smaller. We used an election strategy to make the final decision: a datum is assigned to C_i if the classifiers involving class i cast the largest number of votes for the i-th class.

The part-versus-part method is another choice [4]. Any two-class problem can be further decomposed into a number of two-class sub-problems, as small as needed, which makes the method good at handling imbalanced classification problems. As shown in Table 2, the number of training data in each class of our dataset is indeed imbalanced.

Table 2: Number of Training Data in Each Class

Class   Training Data   Class   Training Data
0       537             6       741
1       994             7       584
2       328             8       154
3       919             9       100
4       689             10      133
5       853             11      43

We used the MAX-MIN strategy to make the final decisions. We also had to determine the size of the smallest parts, which strongly affects the classification performance. From Figure 3, we chose 200 as the size of each sub-class because the 
accuracy reaches a local maximum there, while choosing 1600 would render the decomposition pointless.

3.2 Results

In our experiments we used the Java version of LibSVM [2].

[Figure 3: Relationship between sub-class size and accuracy, for sub-class sizes from 25 to 1600.]

[Figure 4: Performance of each task decomposition method and each kernel: (a) accuracy; (b) running time.]

The accuracy and running time are shown in Figures 4a and 4b, respectively.

3.3 Discussion

Compared with ovo and pvp, the one-versus-rest decomposition method always has the worst accuracy, no matter which kernel is used. However, thanks to its simple procedure, only N classifiers are required for an N-class problem, so its scalability is better than that of the other methods.

The one-versus-one decomposition method performed best in our experiments. Although it needs N(N-1)/2 classifiers for an N-class problem, each sub-problem is really small, and its running time was actually the shortest. However, as the number of classes increases, the number of classifiers grows super-linearly.

The last one, the part-versus-part decomposition method, did not perform well in our experiments because its running time is much longer than the others'. Its accuracy lies between that of ovr and that of ovo. Massively parallel learning can easily be implemented with this method, but that advantage could not be exploited here.

The RBF kernel is the best choice if we only take accuracy into account, but it usually costs the most time. The accuracies of the polynomial and linear kernels are very close, while the running time of the linear kernel is more satisfactory.

References

[1] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin: A Practical Guide to Support Vector Classification, 2003.

[2] C.-C. Chang and C.-J. Lin: LIBSVM: A Library for Support Vector Machines, 2001.

[3] Johannes Fürnkranz: Round Robin Classification, 2002.

[4] Bao-Liang Lu, Kai-An Wang, Masao Utiyama, and Hitoshi Isahara: A Part-versus-Part Method for Massively Parallel Training of Support Vector Machines, 2004.
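
Appendix: a minimal sketch of the pipeline described in the report (linear scaling to [-1, +1], grid search over C and gamma, and a comparison of decomposition strategies). Note this uses scikit-learn's SVC, which wraps LibSVM, rather than the Java version used in the experiments, and the synthetic dataset and parameter grid are placeholders, not the assignment's actual data or selected values.

```python
# Sketch of the report's pipeline: scale each attribute to [-1, +1]
# (cf. 'svm-scale'), grid-search C and gamma for an RBF kernel, then
# compare one-versus-rest and one-versus-one decomposition. Data and
# grids are illustrative placeholders only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: linearly scale each attribute to [-1, +1], fitting on the
# training data only and applying the same transform to the test data.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 2: grid search over C and gamma for the RBF kernel (Section 2).
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.01, 0.1, 1, 10],
                     "gamma": [0.1, 0.5, 1.0, 1.5]},
                    cv=3)
grid.fit(X_train, y_train)
best = grid.best_params_

# Step 3: compare task decomposition strategies with the chosen
# parameters (Section 3.1).
for name, clf in [("ovr", OneVsRestClassifier(SVC(kernel="rbf", **best))),
                  ("ovo", OneVsOneClassifier(SVC(kernel="rbf", **best)))]:
    clf.fit(X_train, y_train)
    print(name, round(clf.score(X_test, y_test), 3))
```

The GridSearchCV call plays the role of the 'grid-search' of Section 2, and OneVsRestClassifier/OneVsOneClassifier correspond to the ovr and ovo decompositions of Section 3.1; LibSVM's own built-in multi-class mode is one-versus-one by default.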