Naive Bayes

Pros: still effective when training data is scarce; can handle multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal data.
Bayes' rule: p(c_i|w) = p(w|c_i) p(c_i) / p(w). (A small numeric sketch follows the process list below.)

Document classification with naive Bayes

The general process of naive Bayes:
(1) Collect data: any method works; this article uses RSS feeds.
(2) Prepare data: numeric or Boolean values are required.
(3) Analyze data: with a large number of features, plotting them is of little use; histograms work better.
(4) Train the algorithm: compute the conditional probabilities of the independent features.
(5) Test the algorithm: compute the error rate.
(6) Use the algorithm: a common application of naive Bayes is document classification, but a naive Bayes classifier can be used in any classification setting; it does not have to be text.
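To make Bayes' rule concrete before any code, here is a minimal numeric sketch (the probabilities are made up purely for illustration): it computes the posterior probability that a document is abusive given that it contains one particular word.

```python
# Illustrative numbers only: suppose half of all documents are abusive (class 1),
# and the word "stupid" appears in 15% of abusive documents but 1% of normal ones.
p_c1 = 0.5                    # prior p(c1)
p_c0 = 1.0 - p_c1             # prior p(c0)
p_w_given_c1 = 0.15           # likelihood p(w|c1)
p_w_given_c0 = 0.01           # likelihood p(w|c0)

p_w = p_w_given_c1 * p_c1 + p_w_given_c0 * p_c0  # total probability p(w)
p_c1_given_w = p_w_given_c1 * p_c1 / p_w         # Bayes' rule: p(c1|w)
print(p_c1_given_w)                              # 0.9375: the word is strong evidence
```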
Prepare data: making word vectors from text (excerpted from Machine Learning in Action).
```python
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],         # 0
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],     # 1
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],        # 0
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],              # 1
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],  # 0
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]           # 1
```

These are six sentences; a label of 0 marks a normal sentence, and a label of 1 marks an abusive one.
By measuring the probability with which each word appears in abusive versus normal sentences, we can find out which words mark a sentence as abusive.
Add the following code to bayes.py:

```python
# coding=utf-8
from numpy import *  # provides zeros(), ones(), log() and array() used below


def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 marks abusive text, 0 marks normal text
    return postingList, classVec


def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with each document's words
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
```

Train the algorithm: computing probabilities from word vectors

```python
# Naive Bayes classifier training function.
# trainMatrix: the document matrix; trainCategory: the vector of class labels,
# one per document.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # p(w|c1) for every word
    p0Vect = p0Num / p0Denom  # p(w|c0) for every word
    return p0Vect, p1Vect, pAbusive
```
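As a quick check, the functions above can be chained together like this (a minimal sketch; the exact ordering of the vocabulary, and hence of the vectors, varies from run to run because createVocabList() builds a set):

```python
# Minimal sketch: build the vocabulary, vectorize every post, then train.
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)  # word order varies between runs
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print(pAb)  # 0.5, since three of the six training posts are abusive
print(p1V)  # per-word frequencies in abusive posts; many entries are exactly 0.0
```

The zero entries in p0V and p1V are exactly what the next section fixes.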
Test the algorithm: modifying the classifier for real-world conditions

Two changes are made to the trainNB0() function of the previous section. Initializing the counts to ones (and the denominators to 2.0) keeps a word that never occurred in one class from turning the whole product of probabilities into zero, and taking logarithms keeps the product of many small probabilities from underflowing; since the logarithm is monotonic, comparing sums of logs gives the same verdict as comparing products. The changed lines are:

```python
p0Num = ones(numWords)
p1Num = ones(numWords)
p0Denom = 2.0
p1Denom = 2.0
p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)
```

```python
# Naive Bayes classifier training function.
# trainMatrix: the document matrix; trainCategory: the vector of class labels.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords)  # initialize counts to 1 ...
    p1Num = ones(numWords)
    p0Denom = 2.0           # ... and denominators to 2, so no probability is ever 0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # log probabilities avoid numeric underflow
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # log p(w|c1) + log p(c1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)  # log p(w|c0) + log p(c0)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))

    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
```

Running testingNB() classifies ['love', 'my', 'dalmation'] as 0 and ['stupid', 'garbage'] as 1.

Prepare data: the bag-of-words document model

Set-of-words model: only whether each word occurs is recorded, so each word can appear at most once in the vector. Bag-of-words model: a word can be counted more than once.

```python
# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count occurrences instead of flagging presence
    return returnVec
```

Example: filtering spam e-mail with naive Bayes
(1) Collect data: text files are provided.
(2) Prepare data: parse the text files into token vectors.
(3) Analyze data: inspect the tokens to make sure they were parsed correctly.
(4) Train the algorithm: use the trainNB0() function built earlier.
(5) Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a document set.
(6) Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.

Prepare data: tokenizing text
Sentences are split with a regular expression.

Test the algorithm: cross-validation with naive Bayes

```python
import random  # used by spamTest() to pick the random hold-out set


# Accepts one long string and parses it into a list of token strings.
# Tokens of two characters or fewer are dropped, and everything is lowercased.
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


# Complete spam e-mail test function
def spamTest():
    docList = []
    classList = []
    fullText = []
    # import and parse the text files
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)

        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)

    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    # randomly build the training set by holding out 10 documents for testing
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])

    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])

    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    # classify the held-out test set
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
```

Because the test e-mails are chosen at random, the output can differ from run to run.
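A single run holds out only ten messages, so the measured error rate moves in steps of 10%. A small wrapper, my own addition rather than something from the book, can average the error over many random splits; it assumes spamTest() has been modified to end with `return float(errorCount) / len(testSet)` instead of only printing it:

```python
# Hypothetical helper (not from the book): average the hold-out error rate over
# several random train/test splits. Assumes spamTest() returns its error rate.
def averageErrorRate(numTrials=10):
    total = 0.0
    for _ in range(numTrials):
        total += spamTest()
    print('average error rate over %d trials: %f' % (numTrials, total / numTrials))
```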