
[Original] Text mining in R: a case study report on tf-idf, topic modeling, sentiment analysis, and n-gram modeling (with code and data)

Contact QQ: 3025393450. For questions, search Baidu for “大数据部落”. Official site: /datablog

Text mining of Usenet bulletin board data in R

We analyze, from start to finish, 20,000 messages sent to 20 Usenet bulletin boards in 1993.

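The Usenet bulletin boards in this dataset include newsgroups on topics such as politics, religion, cars, sports, and cryptography, and offer a rich body of text written by many users.

The dataset is publicly available at /~jason/20Newsgroups/ (the 20news-bydate.tar.gz file) and has become a popular choice for exercises in text analysis and machine learning.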

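1 Preprocessing

We start by reading in all the messages from the 20news-bydate folder, which are organized in subfolders with one file per message.

We can read files like these with a combination of read_lines(), map(), and unnest().

Note that this step may take several minutes to read all the documents.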

library(dplyr)
library(tidyr)
library(purrr)
library(readr)

training_folder <- "data/20news-bydate/20news-bydate-train/"

# Define a function to read all files from a folder into a data frame.
# full.names = TRUE returns full paths, so each file can be read directly.
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Use unnest() and map() to apply read_folder to each subfolder.
# The list column is created with mutate() first, which is the form
# that current versions of tidyr expect.
raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), id, text)

raw_text
## # A tibble: 511,655 x 3
##    newsgroup   id    text
##    <chr>       <chr> <chr>
##  1 alt.atheism 49960 From: mathew <mathew@>
##  2 alt.atheism 49960 Subject: Alt.Atheism FAQ: Atheist Resources
##  3 alt.atheism 49960 Summary: Books, addresses, music -- anything related to atheism
##  4 alt.atheism 49960 Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
##  5 alt.atheism 49960 Expires: Thu, 29 Apr 1993 11:57:19 GMT
##  6 alt.atheism 49960 Distribution: world
##  7 alt.atheism 49960 Organization: Mantis Consultants, Cambridge. UK.
##  8 alt.atheism 49960 Supersedes: <19930301143317@>
##  9 alt.atheism 49960 Lines: 290
## 10 alt.atheism 49960 ""
## # … with 511,645 more rows

Note the newsgroup column, which records which of the 20 newsgroups each message comes from, and the id column, which identifies a unique message within that newsgroup.
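A folder-reading helper of this kind can be checked on a tiny corpus before running it on the full archive. The sketch below is only an illustration: the folder names, file names, and message contents are made up, and the files are written to a temporary directory rather than the real 20news-bydate folder.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(readr)

# Hypothetical miniature corpus: two made-up "newsgroup" folders in a temp dir
root <- file.path(tempdir(), "toy-newsgroups")
dir.create(file.path(root, "rec.autos"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "sci.space"), recursive = TRUE, showWarnings = FALSE)
write_lines(c("From: someone", "", "My car is red."),
            file.path(root, "rec.autos", "1001"))
write_lines(c("From: another", "", "Rockets are loud.", "Very loud."),
            file.path(root, "sci.space", "2001"))

# Same idea as read_folder(): one row per line of text, keyed by file name
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Apply the helper to every subfolder and label rows with the folder name
toy_text <- tibble(folder = dir(root, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), id, text)

toy_text
```

Each line of each file becomes one row, so the two toy files (three and four lines) should give a seven-row table with the same three columns as raw_text.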

Which newsgroups are included, and how many messages were posted in each (Figure 1)?

library(ggplot2)

raw_text %>%
  group_by(newsgroup) %>%
  summarize(messages = n_distinct(id)) %>%
  ggplot(aes(newsgroup, messages)) +
  geom_col() +
  coord_flip()

Figure 1: Number of messages from each newsgroup

We can see that Usenet newsgroup names are hierarchical, starting with a main topic such as “talk”, “sci”, or “rec”, followed by further specifications.

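Here, however, each message has some structure and extra text that we do not want to include in our analysis.

For example, every message has a header containing fields such as “from:” or “in_reply_to:” that describe the message.

Some also have automated email signatures, which occur after a line like --.

This kind of preprocessing can be done with dplyr, using a combination of cumsum() (cumulative sum) and str_detect() from stringr.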
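To see how the two cumsum() conditions behave, it can help to try them on a single made-up message first. The lines below are invented for illustration; only the filtering logic mirrors the approach described here.

```r
library(dplyr)
library(stringr)

# Made-up message: header lines, a blank line, the body, then a signature
toy <- tibble(
  id = "1",
  text = c("From: someone", "Subject: hello", "",
           "First body line.", "Second body line.",
           "--", "Someone's signature")
)

# Show the two running flags side by side
toy %>%
  mutate(
    after_header     = cumsum(text == "") > 0,               # TRUE from the first empty line on
    before_signature = cumsum(str_detect(text, "^--")) == 0  # TRUE until the first "--" line
  )

# Keep only rows where both flags hold, message by message
kept <- toy %>%
  group_by(id) %>%
  filter(cumsum(text == "") > 0,
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()

kept$text
```

Because cumsum() counts how many matching lines have been seen so far, the first condition switches on at the blank line that ends the header (the blank line itself is kept), and the second switches off at the first line starting with --.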

library(stringr)

# Keep only lines that occur after the first empty line,
# and before the first line starting with --
cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  filter(cumsum(text == "") > 0,
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()

Many lines also contain nested text representing quotes from other users, typically introduced by a line like “so-and-so writes...”.

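These can be removed with a few regular expressions.

We also choose to manually remove two messages, 9704 and 9985, which contain a large amount of non-text content.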
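As a rough illustration of what such regular expressions keep and drop, the sketch below applies a quote-removal filter to a handful of invented lines. The patterns are the ones used in this walkthrough; the sample lines and the "1" message id are assumptions made up for the example.

```r
library(dplyr)
library(stringr)

# Invented lines covering each case the filter is meant to handle
toy <- tibble(
  id = c("1", "1", "1", "1", "1", "9704"),
  text = c("In article <123@example> someone wrote",
           "John Smith writes:",
           "> quoted line from an earlier post",
           "My own reply goes here.",
           "",
           "binary-looking junk 0101")
)

kept <- toy %>%
  filter(
    # keep lines that do not start with ">" and contain a letter or digit,
    # plus empty lines
    str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
    # drop attribution lines ending in "writes:" or "writes..."
    !str_detect(text, "writes(:|\\.\\.\\.)$"),
    # drop "In article <...>" reference lines
    !str_detect(text, "^In article <"),
    # drop the two messages excluded by hand
    !id %in% c(9704, 9985)
  )

kept$text
```

Only the reply line and the empty line survive. Note that id is a character column here, as it is when built with basename(), and %in% coerces the numeric ids to character before matching.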

cleaned_text <- cleaned_text %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
         !str_detect(text, "writes(:|\\.\\.\\.)$"),
         !str_detect(text, "^In article <"),
         !id %in% c(9704, 9985))

At this point, we are ready to use unnest_tokens() to split the dataset into tokens, while removing stop words.

library(tidytext)

usenet_words <- cleaned_text %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)

Every raw text dataset requires different data-cleaning steps, which often involve some trial and error and exploration of unusual cases in the dataset.

It is important to note that this cleaning can be achieved with tidy tools such as dplyr and tidyr.
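The tokenization step can also be tried on a couple of sentences to see what the two filters do. This is a minimal sketch on invented text; the exact set of surviving words depends on the stop_words lexicons shipped with the installed version of tidytext.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two made-up lines of message text
toy <- tibble(
  id = "1",
  text = c("The engine's timing is off by 15 degrees.",
           "I think the car needs a tune-up!")
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%                 # one lowercase token per row
  filter(str_detect(word, "[a-z']$"),           # drop tokens not ending in a letter or apostrophe
         !word %in% stop_words$word)            # drop common stop words

toy_words
```

The str_detect() filter removes purely numeric tokens such as 15, and the stop-word filter removes very common words such as "the", leaving content words like "car".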

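2 Words in newsgroups

Now that we have removed the headers, signatures, and formatting, we can start exploring common words.

First, we can find the most common words in the entire dataset or in particular newsgroups.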

usenet_words %>%
  count(word, sort = TRUE)
## # A tibble: 68,137 x 2
##    word            n
##    <chr>       <int>
##  1 people       3655
##  2 time         2705
##  3 god          1626
##  4 system       1595
##  5 program      1103
##  6 bit          1097
##  7 information  1094
##  8 windows      1088
##  9 government   1084
## 10 space        1072
## # … with 68,127 more rows

words_by_newsgroup <- usenet_words %>%
  count(newsgroup, word, sort = TRUE) %>%
  ungroup()

words_by_newsgroup
## # A tibble: 173,913 x 3
##    newsgroup               word          n
##    <chr>                   <chr>     <int>
##  1 soc.religion.christian  god         917
##  2 sci.space               space       840
##  3 talk.politics.mideast   people      728
##  4 sci.crypt               key         704
##  5 comp.os.ms-windows.misc windows     625
##  6 talk.politics.mideast   armenian    582
##  7 sci.crypt               db          549
##  8 talk.politics.mideast   turkish     514
##  9 rec.autos               car         509
## 10 talk.politics.mideast   armenians   509
## # … with 173,903 more rows

2.1 Finding tf-idf within newsgroups

We would expect the newsgroups to differ in topic and content, and therefore the frequency of words to differ between them.
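As a small illustration of the idea, tf-idf can be computed on a toy table of per-newsgroup word counts using bind_tf_idf() from tidytext. The counts below are invented, and each newsgroup is treated as one "document", which is an assumption made for this sketch. bind_tf_idf() takes term frequency as a word's share of its document's total count and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word.

```r
library(dplyr)
library(tidytext)

# Invented counts: "people" appears in both groups, the other words in one each
toy_counts <- tibble(
  newsgroup = c("sci.space", "sci.space", "rec.autos", "rec.autos"),
  word      = c("orbit", "people", "car", "people"),
  n         = c(8, 2, 6, 4)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, newsgroup, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

toy_tf_idf
```

A word that appears in every newsgroup, such as "people" here, gets an idf of zero and therefore a tf-idf of zero, while words concentrated in a single newsgroup rise to the top.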
