DATS 6101: Amazon movie data scraping and recommendation
system analysis final project
Prepared by: Pseudo_yuan
December 16, 2015

Introduction

Big data provide useful information for recommendation systems, and a good recommendation system depends on efficient algorithms. Three popular approaches are user-based recommendation, item-based recommendation and collaborative filtering. Given one movie, Amazon recommends other movies that customers who watched this movie also watched; that is, the recommendation is based on user behavior. However, such recommendations are limited, because movies that few people have watched may never be recommended. To address this problem, I will analyze the attributes of the recommended movies and discuss their similarity to see whether recommendations could be made from the attributes of the items themselves. Specifically, with the help of the R package “rvest” I will scrape data from Amazon web pages and analyze the relationship between one movie and the movies that customers who watched it also watched. Based on these relationships, customers’ preferences could be predicted and less popular movies could be recommended.
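As a rough illustration of what comparing movies by their attributes could look like, the sketch below scores two movies by the overlap of their attribute values. Jaccard similarity is used here only as an example measure; this report does not prescribe a particular one. The function name `jaccard` and the three attribute sets are hypothetical placeholders, not scraped data.

```r
# Jaccard similarity between two sets of attribute values:
# size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)
  length(intersect(a, b)) / union_size
}

# Hypothetical attribute sets (placeholders, not scraped data).
movie_a <- c("Drama", "Thriller", "Director X", "Star P")
movie_b <- c("Drama", "Thriller", "Director Y", "Star P")
movie_c <- c("Animation", "Comedy", "Director Z", "Star Q")

sim_ab <- jaccard(movie_a, movie_b)  # 3 shared values out of 5 distinct ones
sim_ac <- jaccard(movie_a, movie_c)  # no shared values
```

Under this measure a movie with few viewers can still score as similar to a popular one, as long as its attributes overlap.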
Description and Quality of Data

An Amazon movie page contains many data fields, such as the name, the genres, the director, the starring actors and the ratings, that provide useful information about the movie. Amazon also gives links to recommended movies. The information collected for a single movie forms a sub-dataset. One movie is always associated with more than 6 recommended movies, and each recommended movie yields a new sub-dataset. In my database, one dataset includes the information for one movie (the basic movie), 6 movies recommended from it (the sub-movies), and the movies recommended from the sub-movies. Each dataset records the attributes name, year, mins, IMDb rate, BoxOffice, genre 1, genre 2, director, star 1, star 2 and studio for 43 movies. These data come from web pages and appear as text, graphs and even images. They are unstructured and sometimes missing, so they need cleaning before analysis.
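The dataset layout described above (one basic movie, its sub-movies, and the movies recommended from the sub-movies) can be sketched as a two-level collection. Both function names are hypothetical, and `stub_recommendations` only fabricates identifiers where a real helper would scrape the page. Taking exactly six recommendations per movie is also an assumption; with it, the count comes to 1 + 6 + 36 = 43, which is consistent with the 43 movies mentioned above.

```r
# Hypothetical helper: returns identifiers of the movies recommended for `id`.
# A real version would scrape the movie's page instead of inventing ids.
stub_recommendations <- function(id, n = 6) {
  paste0(id, ".", seq_len(n))
}

# Collect the basic movie, its sub-movies, and the sub-movies' recommendations.
collect_dataset <- function(basic_id, get_recs = stub_recommendations) {
  sub_movies <- get_recs(basic_id)
  second_level <- unlist(lapply(sub_movies, get_recs))
  # unique() drops movies that are reached through more than one path
  unique(c(basic_id, sub_movies, second_level))
}

ids <- collect_dataset("basic")
length(ids)  # 43 with the stub, where every identifier is distinct
```

With real pages, the same movie may be recommended from several sub-movies, so the number of distinct movies could be smaller.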
Data Acquisition and Cleaning

The R package “rvest” helps to scrape data from HTML web pages. The function “read_html” reads an HTML page, and the function “html_nodes” selects nodes from an HTML document. The functions “html_text”, “html_name”, “html_children” and “html_attrs” extract the text, tag name, child nodes and attributes of the selected nodes. With these functions, we can scrape the wanted data from a web page. For example, we can use the following code to fetch the movie name from the given address.
    library(rvest)  # provides read_html, html_nodes, html_text and %>%
    # address holds the URL of the movie page
    movie <- read_html(address)
    Name <- movie %>% html_nodes("#aiv-content-title") %>% html_text()

In this example, we get the movie name. However, the result contains useless blank space. We can use the following code to delete it and make the data clean.
    # keep the second line of the raw text and trim the surrounding whitespace
    name <- trimws(strsplit(Name,"\n")[[1]][2])

The full code used for scraping and cleaning the data is shown in appendix 1, and the result is shown in appendix 2.
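The selection and cleaning steps above can be tried on a small inline HTML snippet instead of a live page. This is only a sketch: the snippet, the `rec` class and the `/dp/EXAMPLE1` style links are made up for illustration. Only the `#aiv-content-title` selector comes from the code above, and Amazon's real markup may differ.

```r
library(rvest)

# Made-up page fragment standing in for a downloaded movie page.
page <- read_html('<html><body>
<h1 id="aiv-content-title">
  Example Movie
  <span>2014</span>
</h1>
<a class="rec" href="/dp/EXAMPLE1">First recommendation</a>
<a class="rec" href="/dp/EXAMPLE2">Second recommendation</a>
</body></html>')

# Select the title node by its id and extract its raw text.
raw_title <- page %>% html_nodes("#aiv-content-title") %>% html_text()

# The raw text spans several lines; keep the second one and trim whitespace.
name <- trimws(strsplit(raw_title, "\n")[[1]][2])

# Select the recommendation links, then read their tag name and href attribute.
rec_nodes <- page %>% html_nodes("a.rec")
rec_tags  <- html_name(rec_nodes)
rec_links <- html_attr(rec_nodes, "href")
```

Picking the second line with `[2]` only works when the title text starts on the line after the opening tag, as it does in this snippet. A page laid out differently would need a different index.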
The Amazon Movie Data

In this project, I build four data sets based on the movies “A Most Wanted Man”, “Big Hero 6”, “Saving Christmas” and “Schindler’s List” and name them “group 1”, “group 2”, “group 3” and “group 4” respectively. Each data set includes the information for one movie and the movies recommended based on it, so the movies within one data set are related by recommendation. The full data sets are shown in the Excel document named “ShuyuanZhao_FinalProjectData_Amazon Movie.xlsx”.
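One way the four groups might be combined for analysis is to stack them into a single data frame with a group label per row. The sketch below fills in only the four basic movies named above. The data frame name `AmazonMovie` and the `GROUP` column follow the plotting code in this report, while the `NAME` column and the way the groups are built here are assumptions.

```r
# Basic movie of each group, as named in the text.
basic_movies <- c("A Most Wanted Man", "Big Hero 6",
                  "Saving Christmas", "Schindler's List")

# One small data frame per group; the full groups hold many more rows.
groups <- lapply(seq_along(basic_movies), function(i) {
  data.frame(NAME = basic_movies[i],
             GROUP = paste("group", i),
             stringsAsFactors = FALSE)
})

# Stack the groups and store GROUP as a factor so it can drive plot colors.
AmazonMovie <- do.call(rbind, groups)
AmazonMovie$GROUP <- factor(AmazonMovie$GROUP)
```

Keeping the group label as a column rather than keeping four separate tables lets one plotting call color all groups at once.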
To look for insights, I will visualize the data with the R package “ggplot2”. First, I will plot the year and IMDb rate of the movies in the four data sets with the following code:

    library(ggplot2)
    # year vs. IMDb rate, one color per group
    p <- ggplot(data=AmazonMovie,mapping=aes(x=YEAR,y=IMDBRATE))
    p + geom_point(aes(color=GROUP))

The result is presented in Figure 1.

Figure 1. The year and IMDb rate of the movies in the four groups

As we can see, movies in group 4 have relatively high IMDb rates and movies in group 3 have relatively low IMDb rates. In between, the rates of movies in group 2 are higher than those of movies in group 1. The rate of the basic movie is 8.9 in group 4, 7.9 in group 2, 6.9 in group 1 and 1.6 in group 3. The ordering of the recommended movies by IMDb rate thus matches the ordering of the basic movies in each group. So in this case, we can conclude that the rate of the basic movie is related to the rates of the recommended movies. Then, I will show the box office and mins of the movies in the four data sets using the following code:

    # box office vs. mins, one color per group
    p <- ggplot(data=AmazonMovie,mapping=aes(x=BoxOffice,y=MINS))
    p + geom_point(aes(color=GROUP))