Association Rule Mining Algorithms: A First Taste of Data Mining, Association Rules with Weka

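I am taking a data mining course this semester and happened to miss the first two classes, so I am working through the course material by following a tutorial...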

Preparation

Download Weka and pick the build that matches your system. I used "a disk image for OS X that contains a Mac application including Oracle's Java 1.8 JVM", from the Weka site (Data Mining with Open Source Machine Learning Software in Java).

Note: on the Mac build, you do not install by dragging to Applications; double-click the weka.jar file instead.

Install Python. The terminal already ships with python2 and python3; I use python3.
Install mlxtend and jupyter by running the following pip commands in the terminal:

pip3 install mlxtend -i https://pypi.tuna.tsinghua.edu.cn/simple  # install mlxtend
pip3 install jupyter -i https://pypi.tuna.tsinghua.edu.cn/simple  # install jupyter

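Experiment 1: Association rules with Weka

Step 1: Open the Explorer, click Open file, and pick the supermarket dataset from the data folder inside the Weka installation directory.

This is the supermarket dataset that ships with Weka. It comes from real supermarket purchase data and holds 4627 transactions, each described by 217 attributes. Every attribute except total is boolean: 't' stands for true and '?' stands for false (in ARFF, '?' is the missing-value marker, read here as "not purchased"). In the total field, 'low' means the purchase was under $100 and 'high' means it was over $100. Besides individual products, the attributes also include the department each product belongs to: if a basket contains any item from a department, that department's attribute is 't', otherwise '?'.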

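Step 2: Choose the algorithm and filter with parameters

On the Associate tab, choose the algorithm and its parameters, then click Start to run the analysis.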
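To make it clearer what an associator such as Apriori computes when you click Start, here is a minimal pure-Python sketch of the level-wise frequent-itemset search. It is not Weka's implementation; the toy baskets and the `frequent_itemsets` helper are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy baskets, not taken from the supermarket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        # Keep the candidates that appear in a large enough share of baskets.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, keeping only those
        # whose k-subsets are all frequent (the Apriori pruning step).
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return result


itemsets = frequent_itemsets(baskets, min_support=0.5)
for items, support in sorted(
    itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(items), support)
```

Lowering `min_support` makes the search visit many more candidates, which is why the thresholds chosen in the Associate tab have such a large effect on run time.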

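Experiment 2: Association rules with Python

Use the mlxtend API for association rule mining: Mlxtend.frequent patterns - mlxtend

Main steps:

Read the data and preprocess it, converting it to a one-hot encoding.
Mine frequent itemsets with apriori.
Use association_rules to generate the rules that meet a given threshold on a chosen metric (support, confidence, lift, leverage, conviction).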
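The five metrics named in the last step can be computed by hand on a tiny example, which helps when reading the output table. The sketch below uses the usual definitions for a rule A -> C (confidence = support(A and C) / support(A), lift = confidence / support(C), leverage = support(A and C) - support(A) * support(C), conviction = (1 - support(C)) / (1 - confidence)). The baskets and the `rule_metrics` helper are hypothetical and are not part of mlxtend.

```python
# Hypothetical toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Compute the five rule metrics for antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a
    return {
        "support": s_ac,
        "confidence": confidence,
        "lift": confidence / s_c,
        "leverage": s_ac - s_a * s_c,
        # Conviction is unbounded when the rule never fails.
        "conviction": float("inf") if confidence == 1 else (1 - s_c) / (1 - confidence),
    }


metrics = rule_metrics({"butter"}, {"bread"}, baskets)
print(metrics)
```

A lift above 1 or a positive leverage suggests the two sides occur together more often than they would if they were independent.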

Task: Supermarket.arff / Weather.nominal.arff

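Step 1: Split the records by the low and high values of the total field and mine association rules on each group separately. Remember to drop the total field after splitting.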

df_low=df[df['total']=='low'] 
df_high=df[df['total']=='high']
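The two filters above still carry the total column, which has to go before mining. A small sketch of the split-then-drop pattern on a made-up four-row table (the column names and values are illustrative, not the real supermarket data):

```python
import pandas as pd

# Hypothetical miniature of the supermarket table.
df = pd.DataFrame({
    "bread and cake": [True, False, True, True],
    "milk-cream": [True, True, False, True],
    "total": ["low", "high", "high", "low"],
})

# One frame per value of total, with the total column removed from each.
groups = {
    label: part.drop(columns="total")
    for label, part in df.groupby("total")
}
df_low, df_high = groups["low"], groups["high"]

print(df_low)
print(df_high)
```

`groupby` is used here only to avoid repeating the filter; the two boolean masks shown in the step give the same frames once total is dropped.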

Step 2: Drop all department attributes and mine association rules on the data that remains.

# drop the department columns
departments=[x for x in df.columns if x.find('department')==0] 
df_without_department=df.drop(labels=departments,axis=1)
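The test `x.find('department')==0` is true exactly when the column name starts with "department", so `str.startswith` expresses the same thing more directly. A self-contained sketch on a made-up frame (the column names are assumptions chosen to mimic the dataset's style):

```python
import pandas as pd

# Hypothetical columns in the style of the supermarket data.
df = pd.DataFrame({
    "department1": [True, False],
    "bread and cake": [True, True],
    "department2": [False, False],
    "total": ["low", "high"],
})

# Collect every column whose name begins with "department" and drop them.
department_columns = [name for name in df.columns if name.startswith("department")]
df_without_department = df.drop(columns=department_columns)

print(list(df_without_department.columns))
```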

Step 3: Mine association rules from the weather.nominal.arff dataset. If you use Weka, you must use the FPGrowth algorithm.

# FPGrowth requires a 0/1 matrix of nominal values as input.
df = pd.read_csv(path) 
df = pd.get_dummies(df)
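To see what `pd.get_dummies` does to nominal columns, here is a sketch on three made-up rows in the style of the weather data (the values are illustrative, not copied from the file). Each distinct value of each column becomes its own indicator column, so every row ends up with exactly one true cell per original attribute.

```python
import pandas as pd

# Three hypothetical rows; values are assumptions for illustration.
df = pd.DataFrame({
    "outlook": ["sunny", "overcast", "rainy"],
    "windy": ["FALSE", "TRUE", "FALSE"],
    "play": ["no", "yes", "yes"],
})

# One indicator column per (attribute, value) pair, cast to bool for mining.
encoded = pd.get_dummies(df).astype(bool)

print(list(encoded.columns))
print(encoded.sum(axis=1).tolist())
```

By default `get_dummies` only expands string-like and categorical columns, so a column that pandas has already parsed as a real boolean would probably stay as a single column; it is worth checking `df.dtypes` after `read_csv`.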

Python version:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


def encode_units(x):
    # Map 't' to 1 and '?' to 0; leave any other value (e.g. 'low'/'high') unchanged.
    if x == 't':
        return 1
    if x == '?':
        return 0
    return x


# Return the top-n rules by confidence among those that meet the minimum support.
def get_rules(df, support, confidence, n):
    # Cast to bool, the dtype mlxtend prefers.
    # This assumes every cell is already 0/1 or True/False.
    df = df.astype(bool)
    # Frequent itemsets with support >= the given threshold.
    frequent_itemsets = apriori(df, min_support=support, use_colnames=True)
    # Rules with confidence >= the given threshold.
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence)
    # Sort by confidence, highest first. sort_values returns a new frame,
    # so the result has to be assigned back.
    rules = rules.sort_values(by='confidence', ascending=False)
    # Keep at most the first n rules.
    return rules.head(n)


if __name__ == "__main__":
    # Association rule mining on supermarket.csv
    path = r'C:\Users\pc\Desktop\supermarket.csv'
    df = pd.read_csv(path)
    # Convert the data to a 0/1 matrix (a column-wise map works across pandas versions).
    df = df.apply(lambda col: col.map(encode_units))

    # Drop the department columns, then one-hot encode the remaining total column.
    departments = [x for x in df.columns if x.find('department') == 0]
    df_without_department = df.drop(labels=departments, axis=1)
    df_without_department = pd.get_dummies(df_without_department)

    # Split the records by total (low or high) and drop the total column.
    df_low = df[df['total'] == 'low'].drop(labels='total', axis=1)
    df_high = df[df['total'] == 'high'].drop(labels='total', axis=1)

    # With support 0.1, df_high yields so many rules that it takes about a minute, so 0.3 is used.
    print(get_rules(df=df_high, support=0.3, confidence=0.9, n=10))
    print(get_rules(df=df_low, support=0.1, confidence=0.8, n=10))
    print(get_rules(df=df_without_department, support=0.1, confidence=0.9, n=10))

    # Association rule mining on weather.nominal.csv
    path = r'C:\Users\pc\Desktop\weather.nominal.csv'
    df = pd.read_csv(path)
    df = pd.get_dummies(df)
    print(get_rules(df=df, support=0.1, confidence=0.9, n=10))

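Summary

Comparing Python with Weka, Python turns out to be much more convenient for data preprocessing. For more on using pandas and Python for data analysis, see Python for Data Analysis (《利用Python进行数据分析》).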

Introduction to the mlxtend association API

DataFrame - pandas 0.24.1 documentation (pandas.pydata.org)

Introduction to the pandas DataFrame API

DataFrame - pandas 0.24.1 documentation (pandas.pydata.org)

Python syntax

Python教程 (www.liaoxuefeng.com)