Advanced Computational Linguistics

1. Collect the most frequent words in 5 genres of the Brown Corpus: news, adventure, hobbies, science_fiction, romance

To collect the most frequent words from the given genres we proceed as follows:

>>> import nltk
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> news_text = brown.words(categories=['news', 'adventure', 'hobbies', 'science_fiction', 'romance'])   # words from all five genres pooled together
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist([w.lower() for w in news_text])
>>> voca = fdist.keys()
>>> voca[:50]
['the', ',', '.', 'and', 'of', 'to', 'a', 'in', 'he', "''", '``', 'was', 'for', 'that', 'it', 'his', 'on', 'with', 'i', 'is', 'at', 'had', '?', 'as', 'be', 'you', ';', 'her', 'but', 'she', 'this', 'from', 'by', '--', 'have', 'they', 'said', 'not', 'are', 'him', 'or', 'an', 'one', 'all', 'were', 'would', 'there', '!', 'out', 'will']
>>> voca1 = fdist.items()
>>> voca1[:50]
[('the', 18635), (',', 17215), ('.', 16062), ('and', 8269), ('of', 8131), ('to', 7125), ('a', 7039), ('in', 5549), ('he', 3380), ("''", 3237), ('``', 3237), ('was', 3100), ('for', 2725), ('that', 2631), ('it', 2595), ('his', 2237), ('on', 2162), ('with', 2157), ('i', 2034), ('is', 2014), ('at', 1817), ('had', 1797), ('?', 1776), ('as', 1725), ('be', 1610), ('you', 1600), (';', 1394), ('her', 1368), ('but', 1296), ('she', 1270), ('this', 1248), ('from', 1174), ('by', 1157), ('--', 1151), ('have', 1099), ('they', 1093), ('said', 1081), ('not', 1051), ('are', 1019), ('him', 955), ('or', 950), ('an', 911), ('one', 903), ('all', 894), ('were', 882), ('would', 850), ('there', 807), ('!', 802), ('out', 781), ('will', 775)]

In this older NLTK version, FreqDist.keys() and FreqDist.items() return the entries sorted by decreasing frequency, so the slices above show the most frequent items. The output confirms that 'the' is the most frequent word across these genres, followed by punctuation tokens and other function words.

2. Exclude or filter out all words that have a frequency lower than 15 occurrences. (hint: use a conditional frequency distribution)

Building on the frequency distribution from Task 1, we keep only the words that occur at least 15 times:

>>> filteredText = filter(lambda word: fdist[word] >= 15, fdist.keys())   # keep words with frequency >= 15
>>> filteredText[:50]   # first 50 words
['the', ',', '.', 'and', 'of', 'to', 'a', 'in', 'he', "''", '``', 'was', 'for', 'that', 'it', 'his', 'on', 'with', 'i', 'is', 'at', 'had', '?', 'as', 'be', 'you', ';', 'her', 'but', 'she', 'this', 'from', 'by', '--', 'have', 'they', 'said', 'not', 'are', 'him', 'or', 'an', 'one', 'all', 'were', 'would', 'there', '!', 'out', 'will']
>>> filteredText[-50:]   # last 50 words
['musical', 'naked', 'names', 'oct.', 'offers', 'orders', 'organizations', 'parade', 'permit', 'pittsburgh', 'prison', 'professor', 'properly', 'regarded', 'release', 'republicans', 'responsible', 'retirement', 'sake', 'secrets', 'senior', 'sharply', 'shipping', 'sir', 'sister', 'sit', 'sought', 'stairs', 'starts', 'style', 'surely', 'symphony', 'tappet', "they'd", 'tied', 'tommy', 'tournament', 'understanding', 'urged', 'vice', 'views', 'village', 'vital', 'waddell', 'wagner', 'walter', 'waste', "we'd", 'wearing', 'winning']
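The transcript above pools all five genres into a single distribution and relies on the older NLTK behaviour in which FreqDist.keys() comes back frequency-sorted. The hint about a conditional frequency distribution suggests a per-genre view instead; the following is only a minimal sketch of that approach, assuming Python 3 with a current NLTK 3 installation (where most_common() replaces the frequency-sorted keys()) and the Brown corpus downloaded via nltk.download('brown'):

import nltk
from nltk.corpus import brown

genres = ['news', 'adventure', 'hobbies', 'science_fiction', 'romance']

# one frequency distribution per genre, keyed by the genre name
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in genres
    for word in brown.words(categories=genre))

for genre in genres:
    print(genre, cfd[genre].most_common(10))   # 10 most frequent words per genre

# Task 2, per genre: keep only the words occurring at least 15 times
frequent_news = [w for w, count in cfd['news'].items() if count >= 15]
print(len(frequent_news))

This keeps the genres separate, so the frequency cut-off can be applied to each genre individually rather than to the pooled text.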
3. Then exclude or filter out all stopwords from the lists you have created. (hint: use a conditional frequency distribution)

To filter out the stopwords we use NLTK's stopword list for English:

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

A small helper shows what fraction of a text consists of content (non-stop) words:

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())
0.65997695393285261

We now remove the stopwords from the frequency-filtered word list of Task 2. Here freqDist is that list, built from the original (non-lowercased) tokens, which is why capitalized forms such as 'The' and 'He' appear below. The helper filterStopword() simply drops the English stopwords from a word list (its definition is not shown in the original transcript; the version below is reconstructed to match the calls and outputs):

>>> def filterStopword(words):
...     stopwords = nltk.corpus.stopwords.words('english')
...     return [w for w in words if w.lower() not in stopwords]
...
>>> filterdText = filterStopword(freqDist)
>>> filterdText[:50]
[',', '.', "''", '``', '?', ';', '--', 'said', 'would', 'one', '!', 'could', '(', ')', ':', 'time', 'like', 'back', 'two', 'first', 'man', 'made', 'Mrs.', 'new', 'get', 'way', 'last', 'long', 'much', 'even', 'years', 'good', 'little', 'also', 'Mr.', 'see', 'right', 'make', 'got', 'home', 'many', 'never', 'work', 'know', 'day', 'around', 'year', 'may', 'came', 'still']
>>> freqDist[:50]
[',', 'the', '.', 'of', 'and', 'to', 'a', 'in', "''", '``', 'was', 'for', 'that', 'he', 'on', 'with', 'his', 'I', 'it', 'is', 'The', 'had', '?', 'at', 'as', 'be', ';', 'you', 'her', 'He', '--', 'from', 'by', 'said', 'have', 'not', 'are', 'this', 'him', 'or', 'were', 'an', 'but', 'would', 'she', 'they', 'one', '!', 'all', 'out']

In filterdText, stopwords such as 'the', 'it' and 'is' no longer appear, while the unfiltered list of the same length still contains them.

>>> len(freqDist)
2341
>>> len(filterdText)
2153

Comparing the two lengths with len() shows how many stopwords were removed from the frequency-filtered list: 2341 - 2153 = 188.
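An alternative, self-contained way to arrive at a comparable list is to apply the frequency filter and the stopword filter in a single pass over the lowercased distribution from Task 1. This is only a sketch, assuming NLTK 3 with the brown and stopwords data downloaded; unlike the transcript above, it also drops punctuation and number tokens with isalpha():

import nltk
from nltk.corpus import brown, stopwords

genres = ['news', 'adventure', 'hobbies', 'science_fiction', 'romance']
fdist = nltk.FreqDist(w.lower() for w in brown.words(categories=genres))

stops = set(stopwords.words('english'))    # set membership tests are O(1)

content_words = [w for w, count in fdist.items()
                 if count >= 15            # frequency filter from Task 2
                 and w.isalpha()           # drop punctuation and numbers
                 and w not in stops]       # stopword filter for this task

print(len(content_words))
# the 20 most frequent remaining content words
print(sorted(content_words, key=fdist.get, reverse=True)[:20])

Because it lowercases everything and discards non-alphabetic tokens, the resulting counts will differ somewhat from the 2153 words reported above.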
4. Create a new list of lemmas or roots by normalizing all words by stemming

To create the normalized list we apply NLTK's Porter stemmer. The filtered word list from the previous task was saved to the file 'filterdText.txt'; we read it back and tokenize it:

>>> file = open('filterdText.txt')
>>> text = file.read()
>>> textTokens = nltk.word_tokenize(text)

Now we do the stemming:

>>> p = nltk.PorterStemmer()
>>> rootStemming = [p.stem(t) for t in textTokens]

The first 100 (alphabetically sorted) unnormalized tokens, for comparison:

>>> textTokens[:100]
['!', '&', "'", "''", "'em", '(', ')', ',', '--', '.', '1', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '1958', '1959', '1960', '1961', '2', '20', '200', '22', '25', '3', '30', '4', '5', '50', '6', '60', '7', '8', '9', ':', ';', '?', 'A.', 'Actually', 'Af', 'Ah', 'Aj', 'Alexander', 'Also', 'Although', 'America', 'American', 'Americans', 'Among', 'Angeles', 'Anne', 'Anniston', 'Another', 'April', 'Association', 'August', 'Austin', 'Avenue', 'B', "B'dikkat", 'B.', 'Barton', 'Beach', 'Belgians', 'Besides', 'Bill', 'Billy', 'Blue', 'Board', 'Bob', 'Bobbie', 'Boston', 'Brannon', 'British', 'C.', 'Cady', 'California', 'Catholic', 'Cathy', 'Center', 'Central', 'Charles', 'Charlie', 'Chicago', 'Christian', 'Church', 'City', 'Class', 'Clayton', 'Club', 'Co.', 'Coast', 'Cobb', 'College']

and the corresponding stemmed output:

>>> rootStemming[:100]
['!', '&', "'", "''", "'em", '(', ')', ',', '--', '.', '1', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '1958', '1959', '1960', '1961', '2', '20', '200', '22', '25', '3', '30', '4', '5', '50', '6', '60', '7', '8', '9', ':', ';', '?', 'A.', 'Actual', 'Af', 'Ah', 'Aj', 'Alexand', 'Also', 'Although', 'America', 'American', 'American', 'Among', 'Angel', 'Ann', 'Anniston', 'Anoth', 'April', 'Associ', 'August', 'Austin', 'Avenu', 'B', "B'dikkat", 'B.', 'Barton', 'Beach', 'Belgian', 'Besid', 'Bill', 'Billi', 'Blue', 'Board', 'Bob', 'Bobbi', 'Boston', 'Brannon', 'British', 'C.', 'Cadi', 'California', 'Cathol', 'Cathi', 'Center', 'Central', 'Charl', 'Charli', 'Chicago', 'Christian', 'Church', 'Citi', 'Class', 'Clayton', 'Club', 'Co.', 'Coast', 'Cobb', 'Colleg']

5. Create a new list of lemmas or roots by normalizing all words by lemmatization

We lemmatize the same tokens (textTokens) with the WordNet lemmatizer:

>>> wnl = nltk.WordNetLemmatizer()
>>> rootLemmatize = [wnl.lemmatize(t) for t in textTokens]
>>> rootLemmatize[:100]   # the first 100 sorted lemmatized tokens, for comparison
['!', '&', "'", "''", "'em", '(', ')', ',', '--', '.', '1', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '1958', '1959', '1960', '1961', '2', '20', '200', '22', '25', '3', '30', '4', '5', '50', '6', '60', '7', '8', '9', ':', ';', '?', 'A.', 'Actually', 'Af', 'Ah', 'Aj', 'Alexander', 'Also', 'Although', 'America', 'American', 'Americans', 'Among', 'Angeles', 'Anne', 'Anniston', 'Another', 'April', 'Association', 'August', 'Austin', 'Avenue', 'B', "B'dikkat", 'B.', 'Barton', 'Beach', 'Belgians', 'Besides', 'Bill', 'Billy', 'Blue', 'Board', 'Bob', 'Bobbie', 'Boston', 'Brannon', 'British', 'C.', 'Cady', 'California', 'Catholic', 'Cathy', 'Center', 'Central', 'Charles', 'Charlie', 'Chicago', 'Christian', 'Church', 'City', 'Class', 'Clayton', 'Club', 'Co.', 'Coast', 'Cobb', 'College']

We finish this task by writing the lemmatized words to the file 'rootLemmatize.txt'.
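A note on the output above: WordNetLemmatizer.lemmatize() treats every token as a noun unless a part-of-speech tag is passed, which is why most tokens come back unchanged, while the Porter stemmer strips suffixes regardless of meaning. The following small side-by-side sketch illustrates the difference, assuming NLTK 3 with the WordNet data installed; the sample words and POS tags are illustrative and not taken from the corpus:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet part-of-speech) pairs chosen to show the contrast
samples = [('organizations', 'n'), ('studies', 'n'), ('running', 'v'), ('was', 'v')]

for word, pos in samples:
    print(word.ljust(14),
          stemmer.stem(word).ljust(12),           # rule-based suffix stripping
          lemmatizer.lemmatize(word, pos=pos))    # dictionary-based root, guided by the POS tag

In practice the POS tags would come from a tagger rather than being listed by hand.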
6. Use the most frequent lemmas to find semantic similarities using WordNet.

To find synsets with related meanings we traverse the WordNet network. Knowing which words are semantically related is useful for indexing a collection of texts: a search for a general term such as 'UK' can then also match documents containing a more specific term such as 'England'.

Top 100 frequent lemmas:

>>> file = open('filterdText.txt')   # read the words back from 'filterdText.txt'
>>> tmp = file.read()
>>> from nltk.tokenize import RegexpTokenizer   # used to strip punctuation
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> textSimilarity = tokenizer.tokenize(tmp)
>>> freqDistSimilarity = FreqDist([w.lower() for w in textSimilarity])
>>> for word in textSimilarity:   # also count the original-case tokens (FreqDist.inc is the older NLTK API)
...     freqDistSimilarity.inc(word)
>>> tmpFDS = freqDistSimilarity.keys()[:100]   # the 100 most frequent lemmas
>>> freqDistSimilarity.items()[:50]
[('s', 64), ('t', 60), ('re', 24), ('d', 22), ('you', 20), ('ll', 18), ('m', 14), ('he', 12), ('let', 12), ('man', 10), ('p', 10), ('we', 10), ('I', 8), ('i', 8), ('ve', 8), ('won', 8), ('year', 8), ('B', 6), ('a', 6), ('actually', 6), ('also', 6), ('although', 6), ('among', 6), ('another', 6), ('association', 6), ('b', 6), ('beach', 6), ('bill', 6), ('blue', 6), ('board', 6), ('center', 6), ('central', 6), ('church', 6), ('city', 6), ('class', 6), ('club', 6), ('college', 6), ('come', 6), ('committee', 6), ('council', 6), ('county', 6), ('court', 6), ('day', 6), ('department', 6), ('district', 6), ('don', 6), ('earth', 6), ('education', 6), ('even', 6), ('every', 6)]

>>> from nltk.corpus import wordnet as wnet   # WordNet interface, aliased as wnet below
>>> def pathSimilarity(word1, word2, s=wnet.path_similarity):   # path similarity between two words
...     synSets1 = wnet.synsets(word1)
...     synSets2 = wnet.synsets(word2)
...     pointSimilarity = []
...     for synSet1 in synSets1:
...         for synSet2 in synSets2:
...             pointSimilarity.append(s(synSet1, synSet2))
...     if len(pointSimilarity) == 0:
...         return 0
...     else:
...         return max(pointSimilarity)
...
>>> tmpFDS[30:35]   # arbitrary path-similarity test on 5 lemmas
['center', 'central', 'church', 'city', 'class']
>>> for word1 in tmpFDS[30:35]:
...     for word2 in tmpFDS[30:35]:
...         print word1 + ' == ' + word2 + ' -->', pathSimilarity(word1, word2)
center == center --> 1.0
center == central --> 0.25
center == church --> 0.25
center == city --> 0.166
center == class --> 0.5
central == center --> 0.2
central == central --> 1.0
central == church --> 0.083
central == city --> 0.111
central == class --> 0.083
church == center --> 0.25
church == central --> 0.2
church == church --> 1.0
church == city --> 0.166
church == class --> 0.2
city == center --> 0.166
city == central --> 0.111
city == church --> 0.166
city == city --> 1.0
city == class --> 0.25
class == center --> 0.5
class == central --> 0.166
class == church --> 0.2
class == city --> 0.25
class == class --> 1.0

Because the lemmas have synsets of different sizes and different positions in the WordNet hierarchy, the pairwise path similarities vary; an independent path-similarity test on the synonym sets themselves gives the same picture. We continue the analysis with the remaining lemmas by calling pathSimilarity() with the corresponding arguments. From the results above, 'class' and 'center' show the highest semantic similarity (0.5) among the distinct lemmas.
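For reference, a self-contained variant of the same pairwise comparison written against the current NLTK 3 WordNet interface (wordnet data downloaded via nltk.download('wordnet')); Synset.path_similarity() can return None when two synsets are not connected, which the helper below maps to 0 before taking the maximum. This is a sketch, not the exact code used above, and the five test lemmas are simply the ones from the transcript:

from nltk.corpus import wordnet as wn

def best_path_similarity(word1, word2):
    # highest path similarity over all synset pairs of the two words; 0 if none are comparable
    scores = [s1.path_similarity(s2) or 0
              for s1 in wn.synsets(word1)
              for s2 in wn.synsets(word2)]
    return max(scores) if scores else 0

lemmas = ['center', 'central', 'church', 'city', 'class']
for w1 in lemmas:
    for w2 in lemmas:
        print('%s == %s --> %.3f' % (w1, w2, best_path_similarity(w1, w2)))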