

Text Recognition with Machine Learning Based on Text Structure

Literature Review

Yifan Shi
Student ID: 27291944, Email: ys1n13@
MSc Artificial Intelligence, Faculty of Physical Sciences & Engineering, University of Southampton

Abstract — The fast-developing machine learning algorithms now applied to the semantic field have produced a wealth of techniques for text recognition, classification, and processing. However, there is always a tension between accuracy and speed, since higher accuracy generally requires a more complicated system and a larger training database. To balance speed against accuracy, many clever designs are used in text processing. This literature review introduces these efforts in three layers: natural-language processing, text classification, and the IBM Watson system.

Keywords — Machine Learning, Natural-Language Processing, Text Classification, IBM Watson

I. Introduction

The growing popularity of the Internet has brought an increasing number of users online, along with a vast quantity of messages, blogs, articles, etc. to be dealt with. These texts, known as natural-language texts, contain potentially useful information but take humans a long time to read, understand, and act on. Although today's search-engine technology helps users find sources from keywords, many companies also need semantic techniques to improve their user-friendly working environments. In this literature review, I will introduce several important semantic techniques, starting from the most basic, natural-language processing, which concentrates on the meaning of words and sentences, followed by text classification, which focuses on paragraphs and articles. I will then introduce a landmark system, IBM Watson, which has DeepQA as its working pipeline. Finally, a conclusion will offer some comments on these techniques.

II. Natural Language Processing

In order to deal with human natural language, it is necessary to transform the unstructured text into
well-structured tables of explicit semantics (Ferrucci, 2012). According to Liddy (2001), Natural-Language Processing (NLP) is a series of computational techniques used to analyze and represent naturally occurring text in order to achieve certain tasks and applications. Collobert and Weston (2008) categorized NLP tasks into six types: part-of-speech tagging, chunking, named entity recognition, semantic role labeling, language models, and semantically related words. They also applied multitask learning with deep neural networks, trained by backpropagation, to build a successful unified architecture that avoids the traditionally large amount of empirically hand-designed features (Collobert et al., 2011).

III. Text Classification

One of the simplest ways to represent an article for a learning algorithm is by the number of times each distinct word appears in the document (Joachims, 2005). However, because of the large number of possible words used in articles, this creates a very high-dimensional feature space. Joachims (1999) suggests Transductive Support Vector Machines for classification because of their effective learning ability even in high-dimensional feature spaces. Rather than using a non-linear Support Vector Machine (SVM), Dumais et al. (1998) compared a linear SVM with four other learning algorithms (Find Similar, Decision Trees, Naive Bayes, and Bayes Nets); their results also support the SVM for text classification because of its high accuracy, fast speed, and simple model. Sebastiani (2002) also recommends neural networks as a potential choice for text classification, since their accuracy is only slightly lower than that of SVMs.

Cross-document comparison of small pieces of text, using linguistic features such as noun phrases and synonyms, is introduced by Hatzivassiloglou et al. (1999). The similarity of two paragraphs is defined by the same action being conducted on the same object by the same actor.
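The word-count ("bag-of-words") representation and a simple similarity score over it can be sketched in a few lines of plain Python. This is a toy illustration only, not the setup of any cited system; the helper names (`bow`, `cosine`) and the example sentences are invented here:

```python
from collections import Counter

def bow(text):
    """Bag-of-words vector: counts of each distinct word in the document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Two toy "documents" on the same topic, and one on a different topic.
d1 = bow("the classifier labels the text by word counts")
d2 = bow("word counts let the classifier label unseen text")
d3 = bow("the match ended with a late goal")

# Same-topic documents score higher than cross-topic ones.
assert cosine(d1, d2) > cosine(d1, d3)
```

Even this toy example hints at the dimensionality problem discussed above: every distinct word in the corpus becomes one coordinate of the vector, so realistic collections yield feature spaces with tens of thousands of dimensions, which motivates the feature-selection and SVM-based methods surveyed in this section.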
Therefore, drawing features from nouns and verbs generally reduces a paragraph to several primitive elements. In addition to matching primitive elements, restrictions such as ordering, distance, and primitives (matching noun-verb pairs) are also applied to exclude weakly related features.

Feature-selection methods can effectively reduce the dimensionality of a dataset while preserving classification performance (Ikonomakis, 2005). To decide which words to keep, Soucy and Mineau (2003) introduced an evaluation function that measures how much information is gained by classifying on a single word. Another improvement, by Han et al. (2004), is to use Principal Component Analysis (PCA) to reduce dimensionality through feature transformation. Nigam and McCallum (2000) combine Expectation-Maximization with a Naive Bayes classifier, training first on a certain amount of labeled text and then on a large number of unlabeled documents, which realizes automatic training without a huge amount of hand-designed training data.

IV. IBM Watson

The IBM Watson project has shown that a computer system can beat human champions at open-domain question answering (QA) in Jeopardy. As Ferrucci (2012) notes, the structure of Watson is more complicated than any single agent, as it has hundreds of algorithms working together in the way that Minsky (1988) introduced in Society of Mind. Broadly, Watson consists of DeepQA, Natural Language Processing (NLP), Machine Learning (ML), and Semantic Web and Cloud Computing components (Gliozzo et al., 2013). The DeepQA system analyzes the question with different algorithms, producing different interpretations of the question and forming queries for each (Ferrucci, 2012). It generates all the possible answers to the question, with evidence and a score for each candidate, yielding a ranking of candidate answers by likelihood of correctness. Machine learning algorithms are used to train the
weights in its evaluating and analyzing algorithms (Gliozzo et al., 2013).

The clue that Watson uses in searching is called the lexical answer type (LAT), which tells Watson what the question is asking about and what kind of thing it needs to look for. Before searching, it generates prior knowledge of a type label, known as a 'direction', for each candidate answer, and searches for evidence for and against this 'type direction' (Ferrucci, 2012). DeepQA also places high demands on grammar-based and syntactic analysis techniques, for example rule-based relation-extraction techniques for finding possible relations between words. In addition, the ability to break a question down into sub-questions by logic also improved Watson's performance (Ferrucci, 2012): it enables Watson to find results for each smaller question and combine them, and, correspondingly, to score the original question based on the evidence gathered for its sub-questions.

To simulate human knowledge, Watson uses a self-contained database, which incurs a great hardware cost. Watson must also perform automatic text analysis and knowledge extraction to update this database, because of the enormous amount of work involved and the need to ensure the accuracy of the input knowledge. The self-contained database is so costly that only a few institutions can afford the hardware, which makes applying Watson expensive. Another limitation is that structured resources are relatively narrow compared with the vast body of unstructured natural-language text. One possible improvement is to use online data and ordinary online search engines to find possibly related articles and analyze them on client PCs; despite the tradeoff between accuracy and cost caused by unreliable and incorrect information online, this would make the technique more broadly realizable.

V. Conclusion

As can be seen from the content above, most
techniques used in text analysis are based on 'word feature' extraction, word types, and relations, which are all semantic techniques, while Watson additionally uses search techniques to find the exact answer stated in text. However, machines still lack the ability to summarize the main idea of a paragraph, which is more closely related to abstract logical thinking. The way humans read concerns not only vocabulary and meaning but also the structure of a paragraph and the position of its sentences; for example, the first sentence of a paragraph usually guides the following content, which helps indicate the significance of the other sentences and words. Therefore, using machine learning to analyze the structure of an article, combined with the meaning of every sentence, might yield the ability to summarize the main idea, which could be used in text scanning and classification.

References

[1] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization," Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148-155, 1998.
[2] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," ECML-98: Proceedings of the 10th European Conference on Machine Learning, pp. 137-142, 1998.
[3] T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," International Conference on Machine Learning (ICML), pp. 200-209, 1999.
[4] V. Hatzivassiloglou, J. Klavans, and E. Eskin, "Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning," Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 2000.
[5] K. Nigam, "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, Volume 39, pp. 103-134, 2000.
[6] E. Liddy, "Natural Language Processing," in Encyclopedia of Library and Information Science, 2nd Ed., NY: Marcel Decker, Inc., 2001.
[7] S. Tong and D. Koller, "Support Vector Machine Active Learning with Applications to Text Classification," Journal of Machine Learning Research, pp. 45-66, 2001.
[8] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys (CSUR), Volume 34, Issue 1, pp. 1-47, 2002.
[9] P. Soucy and G. Mineau, "Feature Selection Strategies for Text Categorization," AI 2003, LNAI 2671, pp. 505-509, 2003.
[10] X. Han, G. Zu, W. Ohyama, T. Wakabayashi, and F. Kimura, "Accuracy Improvement of Automatic Text Classification Based on Feature Transformation and Multi-classifier Combination," LNCS, Volume 3309, pp. 463-468, Jan. 2004.
[11] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, "Text Classification using Machine Learning Techniques," WSEAS Transactions on Computers, Volume 4, Issue 8, pp. 966-974, 2005.
[12] R. Collobert and J. Weston, "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning," ICML '08: Proceedings of the 25th International Conference on Machine Learning, ACM, New York, USA, pp. 160-167, 2008.
[13] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural Language Processing (Almost) from Scratch," Journal of Machine Learning Research, Volume 12, pp. 2493-2537, 2011.
[14] A. Gliozzo, O. Biran, S. Patwardhan, and K. McKeown, "Semantic Technologies in IBM Watson," The 10th International Semantic Web Conference, Bonn, Germany, 2011.
[15] D. Ferrucci, "Introduction to 'This is Watson'," IBM Journal of Research and Development, Volume 56, Number 3/4, pp. 1:1-1:15, May/July 2012.
[16] G. Tesauro, D. Gondek, J. Lenchner, J. Fan, and J. Prager, "Simulation, Learning, and Optimization Techniques in Watson's Game Strategies," IBM Journal of Research and Development, Volume 56, Number 3/4, pp. 16:1-16:11, 2012.
