Association rule analysis
# Rules whose left-hand side has an item partially matching "whole milk", with lift > 1.2
x = subset(rules, subset = lhs %pin% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
#inspect(sort(x, by = "lift")[1:5])
# The same selection applied to the right-hand side
x = subset(rules, subset = rhs %pin% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
#inspect(sort(x, by = "lift")[1:5])
Continuous variables
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
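To see what the cut-and-relabel step does, here is a minimal sketch on a toy vector. The values in `hours` are made up for illustration and are not taken from AdultUCI; only the break points and labels come from the line above.

```r
# Toy weekly working hours (hypothetical values, not from AdultUCI)
hours <- c(20, 40, 45, 80)

# cut() assigns each value to a right-closed interval:
# (0,25], (25,40], (40,60], (60,168]
bins <- cut(hours, breaks = c(0, 25, 40, 60, 168))

# ordered() relabels the four interval levels in order and marks them as ordinal
hours_cat <- ordered(bins, labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

print(hours_cat)
print(table(hours_cat))
```

Because the intervals are closed on the right, a value of exactly 40 falls in (25,40] and is labelled "Full-time".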
Association rule analysis
Supermarket example
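Example 3.1 (Groceries.txt). This is a supermarket shopping example (Hahsler et al., 2006). The data contain 9835 transactions involving 169 items. Each transaction is one customer's purchase record, and each item is a binary variable, for example 1 for purchased and 0 for not purchased. A first pass over the data shows that, among the single-item counts, whole milk has the highest frequency, 2513 (a relative frequency of nearly 26%), followed by other vegetables (1903), rolls/buns (1809), soda (1715), yogurt (1372), and so on. The frequencies of the items bought by more than 5% of customers are shown in Figure 3.1. The data also give the number of customers who bought each number of items; the counts for 1 to 9 items are shown in the table below: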
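The counts quoted in this example can be checked directly in R. The following is a minimal sketch, assuming the `arules` package and its bundled `Groceries` data set are installed; the variable names are illustrative only.

```r
library(arules)   # provides the Groceries data set
data(Groceries)

# Number of transactions and number of distinct items
n_trans <- length(Groceries)
n_items <- length(itemLabels(Groceries))
print(c(transactions = n_trans, items = n_items))

# Absolute single-item counts, largest first
item_counts <- sort(itemFrequency(Groceries, type = "absolute"), decreasing = TRUE)
print(head(item_counts, 5))

# Share of transactions that contain whole milk
print(item_counts[["whole milk"]] / n_trans)

# Number of transactions by basket size, for baskets of 1 to 9 items
basket_sizes <- table(size(Groceries))
print(basket_sizes[as.character(1:9)])
```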
[Figure 3.1: bar chart of relative item frequencies in Groceries. Items on the x-axis: frankfurter, sausage, pork, beef, citrus fruit, tropical fruit, pip fruit, root vegetables, other vegetables, whole milk, butter, curd, whipped/sour cream, yogurt, domestic eggs, rolls/buns, brown bread, pastry, margarine, coffee, bottled water, fruit/vegetable juice, soda, bottled beer, canned beer, newspapers, napkins, shopping bags]
data("AdultUCI")  # requires library(arules)
attributes(AdultUCI)$class; attributes(AdultUCI)$names; dim(AdultUCI); AdultUCI[1:2, ]
# Handling continuous variables:
# Drop
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
# Bin into ordered categories
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middleaged", "Senior", "Old"))
a = as.matrix(a)
trans2 <- as(a, "transactions")
summary(trans2)  # data overview
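The matrix-to-transactions coercion can be tried on a small example. This is a sketch under stated assumptions: the 4 x 3 matrix below is hypothetical (it only borrows three column names from the shopping data), and it is converted to logical before coercion so that each TRUE cell is read as one purchased item.

```r
library(arules)

# Hypothetical 4 x 3 purchase matrix: rows are customers, columns are items
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 0, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("Milk", "Snacks", "Alcohol")))

# Convert to logical first so that each TRUE cell becomes one purchased item
trans_toy <- as(m > 0, "transactions")

summary(trans_toy)
print(itemFrequency(trans_toy))   # relative frequency of each item
print(size(trans_toy))            # number of items in each transaction
```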
[Figure: bar chart of relative item frequency for the shopping data, y-axis from 0.0 to 0.4; x-axis labels include Ready.made, Frozen.foods, Alcohol]
Continuous variables (convert to categorical variables first)
• data("AdultUCI")  # requires library(arules)
• attributes(AdultUCI)$class; attributes(AdultUCI)$names; dim(AdultUCI); AdultUCI[1:2, ]
• Handling continuous variables:
  – Drop
    • AdultUCI[["fnlwgt"]] <- NULL
    • AdultUCI[["education-num"]] <- NULL
x = subset(rules, subset = lhs %in% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
#inspect(sort(x, by = "lift")[1:5])
x = subset(rules, subset = lhs %ain% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
#inspect(sort(x, by = "lift")[1:5])
x = subset(rules, subset = rhs %ain% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
#inspect(sort(x, by = "lift")[1:5])
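The three matching operators used in these subsets differ in how they compare item labels. The sketch below illustrates this on a hypothetical list of four baskets (the item "butter milk" is there only to show partial matching). It applies the operators to transactions; in `subset()` the same operators are applied to the `lhs` or `rhs` of each rule.

```r
library(arules)

# Hypothetical baskets; "butter milk" is included to show partial matching
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk"),
  c("butter milk", "bread"),
  c("bread")
)
trans_toy <- as(baskets, "transactions")

# %in%: the transaction contains at least one of the listed items (exact labels)
any_match <- trans_toy %in% c("whole milk", "yogurt")

# %ain%: the transaction contains all of the listed items
all_match <- trans_toy %ain% c("whole milk", "yogurt")

# %pin%: the transaction contains an item whose label partially matches the pattern
partial_match <- trans_toy %pin% "milk"

print(data.frame(any_match, all_match, partial_match))
```

With a single item such as "whole milk", `%in%` and `%ain%` select the same rules; they differ only when several items are listed.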
library(arules)
w = read.table("f:/xzwu/adbook/shopping.txt", header = TRUE, sep = "\t")
a = w[1:10]
> dim(a)
[1] 786 10
> names(a)
[1] "Ready.made"   "Frozen.foods" "Alcohol"    "Fresh.Vegetables" "Milk"
[6] "Bakery.goods" "Fresh.meat"   "Toiletries" "Snacks"           "Tinned.goods"
library(arules)
data(Groceries)
summary(Groceries)
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)  # Figure 3.1
[Figure 3.1: names and relative frequencies of the items purchased by more than 5% of customers; y-axis from 0.00 to 0.25]
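Rule information
Let s(Z) denote the number of transactions that contain itemset Z in a data set of N transactions, let A be the event that a transaction contains X, and let B be the event that a transaction contains Y (X and Y are disjoint). Then:
• Support of X=>Y: support(X=>Y) = s(X∪Y)/N = P(A∩B)
• Confidence of X=>Y: confidence(X=>Y) = s(X∪Y)/s(X) = P(B|A)
• Lift of X=>Y: lift(X=>Y) = confidence(X=>Y)/(s(Y)/N) = P(A∩B)/(P(A)P(B))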
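A small worked example may help fix the three measures. It uses the standard definitions, support = s(X∪Y)/N, confidence = s(X∪Y)/s(X), and lift = confidence/(s(Y)/N), on eight hypothetical transactions in base R; the data are made up for illustration.

```r
# Hypothetical 0/1 purchase records: rows are transactions, columns are items
baskets <- matrix(c(1, 1,
                    1, 0,
                    1, 1,
                    0, 1,
                    1, 1,
                    0, 0,
                    1, 0,
                    0, 0),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("X", "Y")))

N    <- nrow(baskets)                                    # 8 transactions
s_X  <- sum(baskets[, "X"] == 1)                         # transactions containing X
s_Y  <- sum(baskets[, "Y"] == 1)                         # transactions containing Y
s_XY <- sum(baskets[, "X"] == 1 & baskets[, "Y"] == 1)   # transactions containing both

support    <- s_XY / N                 # P(A and B)
confidence <- s_XY / s_X               # P(B | A)
lift       <- confidence / (s_Y / N)   # P(A and B) / (P(A) P(B))

print(c(support = support, confidence = confidence, lift = lift))
```

Here s(X) = 5, s(Y) = 4 and s(X∪Y) = 3, so the support is 3/8 = 0.375, the confidence is 3/5 = 0.6 and the lift is 0.6/0.5 = 1.2. A lift above 1 means that X and Y occur together more often than they would if they were independent.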
library(arules)
data(Groceries)
summary(Groceries)
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)  # Figure 3.1
fsets <- eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))  # find frequent itemsets
inspect(fsets[1:10])
inspect(sort(fsets, by = "support")[1:10])
rules = apriori(Groceries, parameter = list(support = 0.01, confidence = 0.01))  # find rules
x = subset(rules, subset = rhs %in% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])     # table in Chapter 3
inspect(sort(x, by = "confidence")[1:5])  # table in Chapter 3
#inspect(sort(x, by = "lift")[1:5])
# Plot the data
itemFrequencyPlot(trans2, support = 0.1, cex.names = 0.8)
fsets <- eclat(trans2, parameter = list(support = 0.05, maxlen = 10))  # find frequent itemsets
rules = apriori(trans2, parameter = list(support = 0.01, confidence = 0.6))  # find rules