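Contents
1. Project Background and Goals
2. Data Description
3. Problem Analysis
4. Data Exploration and Preprocessing
5. Feature Engineering (Constructing Features)
5.1 Feature Construction - Whole Dataset
5.1.1 Time Features
5.1.2 Coupon Features
5.1.3 Constructing the Prediction Target
5.2 Data Split - Sliding Time Windows
5.3 Feature Construction - Sliding-Window Data
5.3.1 User Features
5.3.2 Merchant Features
5.3.3 Coupon Features
5.3.4 User-Coupon Joint Features
5.3.5 User-Merchant Joint Features
5.3.6 Merchant-Coupon Joint Features
5.3.7 User-Merchant-Coupon Joint Features
6. Model Building
7. Saving the Data
8. Takeaways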
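1. Project Background and Goals
The O2O (online-to-offline) industry reaches hundreds of millions of consumers, and apps record more than ten billion user-behavior and location records every day, which makes it one of the best meeting points of big-data research and commercial operations. Using coupons to re-engage existing users or to attract new customers into stores is an important O2O marketing method. However, randomly distributed coupons are a meaningless disturbance for most users. For merchants, indiscriminate coupon distribution can damage brand reputation and makes marketing costs hard to estimate. Personalized distribution is an important technique for raising the coupon redemption rate: it gives consumers with particular preferences a real benefit and gives merchants stronger marketing capability.
Using users' real offline consumption data from January 1, 2016 to June 30, 2016, predict whether users will use a coupon within 15 days of receiving it in July 2016.
A preliminary data analysis was carried out before the prediction: O2O Coupon Data Analysis Report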
2. Data Description
Programming language: Python
Data source: https://tianchi.aliyun.com/competition/entrance/231593/information
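Data fields (as they appear in the code below): User_id, Merchant_id, Coupon_id, Discount_rate, Distance, Date_received, Date (the prediction set has no Date column)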
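3. Problem Analysis
Question 1: The prediction dataset has only 6 features. How can features be constructed comprehensively enough to describe each sample?
Question 2: How should the dataset be split so that historical data predicts future data?
Historical data → extract features → these represent a habit or inherent inertia → not easily changed
So the months preceding July (the month to predict) can be used → extract inherent features → predict from those features
Data split:
- 2016.01.01-2016.04.30 predicts → 2016.05.01-2016.05.31 (dataset 1)
- 2016.02.01-2016.05.31 predicts → 2016.06.01-2016.06.30 (dataset 2)
- 2016.03.01-2016.06.30 predicts → 2016.07.01-2016.07.31 (dataset to predict)
Implementation:
- Extract features from the 2016.01.01-2016.04.30 data
- Apply the extracted features to the 2016.05.01-2016.05.31 data; the other two groups are handled the same way
- Merge the processed 2016.05.01-2016.05.31 and 2016.06.01-2016.06.30 data into a single dataset
- Split the merged dataset into train and test sets for model training
- Use the trained model to predict 2016.07.01-2016.07.31.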
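The three windows above follow a regular pattern: a four-month feature window followed by a one-month label window, shifted forward by one month each time. The sketch below generates those date ranges. The function name `sliding_windows` and its parameters are hypothetical, introduced here only for illustration; the article itself hard-codes the dates.

```python
import pandas as pd

def sliding_windows(first_start, n_windows, feature_months=4):
    """Return (feature_start, feature_end, label_start, label_end) tuples.

    Hypothetical helper: each feature window spans `feature_months` months
    and is followed by a one-month label window.
    """
    windows = []
    start = pd.Timestamp(first_start)
    for i in range(n_windows):
        f_start = start + pd.DateOffset(months=i)
        # Last day of the feature window
        f_end = f_start + pd.DateOffset(months=feature_months) - pd.Timedelta(days=1)
        # The label window is the month right after the feature window
        l_start = f_end + pd.Timedelta(days=1)
        l_end = l_start + pd.DateOffset(months=1) - pd.Timedelta(days=1)
        windows.append((f_start, f_end, l_start, l_end))
    return windows

for w in sliding_windows('2016-01-01', 3):
    print([str(d.date()) for d in w])
```

With `first_start='2016-01-01'` and three windows, the ranges match the three splits listed above.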
4. Data Exploration and Preprocessing
import pandas as pd
import numpy as np
data_off = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_train.csv")
off_test = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_test.csv")
off_test1 = off_test
off_test.head()
data_off.shape
data_off.info()
data_off.describe()
# Max and min of the consumption date
print(data_off['Date'].max(),data_off['Date'].min())
# Max and min of the coupon-received date
print(data_off['Date_received'].max(),data_off['Date_received'].min())
Output:
20160630.0 20160101.0
20160615.0 20160101.0
# Missing values
data_off.isnull().sum()
# When there is no coupon (Coupon_id is missing), Discount_rate and Date_received are missing as well
nan1 = data_off["Discount_rate"].isnull()
nan2 = data_off['Date_received'].isnull()
nan3 = data_off['Coupon_id'].isnull()
np.all(nan1==nan2),np.all(nan1==nan3)
Output:
(True, True)
# Remove duplicate rows
data_off.drop_duplicates(inplace=True)
data_off.info()
# Convert the float64 date columns to datetime
data_off['Date'] = pd.to_datetime(data_off['Date'],format='%Y%m%d')
data_off['Date_received'] = pd.to_datetime(data_off['Date_received'],format='%Y%m%d')
off_test['Date_received'] = pd.to_datetime(off_test['Date_received'],format='%Y%m%d')
data_off.info()
5. Feature Engineering (Constructing Features)
5.1 Feature Construction - Whole Dataset
5.1.1 Time Features
# Days from receiving the coupon to consumption
date_interval = data_off['Date']-data_off['Date_received']
data_off['date_interval'] = [d.days for d in date_interval]
# Day of the week on which the coupon was received (1 = Monday, 7 = Sunday)
data_off['receive_week']=[d.weekday()+1 for d in data_off['Date_received']]
off_test['receive_week']=[d.weekday()+1 for d in off_test['Date_received']]
# Whether the coupon was received on a weekend
data_off['receive_isWeekend']=data_off['receive_week'].apply(lambda x:1 if x>5 else 0)
off_test['receive_isWeekend']=off_test['receive_week'].apply(lambda x:1 if x>5 else 0)
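As a quick check of the weekday encoding, the sketch below computes the same two features with vectorized `.dt` accessors on a toy series. The variable names are illustrative only and the dates are made up; a missing receipt date ends up with a weekend flag of 0, which is also what the lambda above yields.

```python
import pandas as pd

# Toy receipt dates: a Friday, a Saturday, a Sunday and a missing value
received = pd.Series(pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', None]))

# dayofweek is 0 for Monday, so add 1 to get 1 = Monday ... 7 = Sunday
receive_week = received.dt.dayofweek + 1
# NaN > 5 evaluates to False, so missing dates are flagged 0
receive_is_weekend = (receive_week > 5).astype(int)

print(receive_week.tolist())
print(receive_is_weekend.tolist())
```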
5.1.2 Coupon Features
# Discount rate
def deal_rate(x):
    if pd.isna(x):
        y = float(x)
    elif ":" in x:
        a = float(x.split(":")[0])  # denominator: the spending threshold
        b = a-float(x.split(":")[1])  # numerator: amount paid after the reduction
        y = np.round(b/a,2)
    else:
        y = float(x)
    return y
data_off['Discount_rate_%'] = data_off['Discount_rate'].map(deal_rate)
off_test['Discount_rate_%'] = off_test['Discount_rate'].map(deal_rate)
# Spending threshold
def deal_mk(x):
    if pd.isna(x):  # no coupon
        y = float(x)
    elif ":" in x:  # "spend X, save Y" coupon
        y = int(x.split(":")[0])  # the threshold X
    else:  # percentage-discount coupon: no threshold
        y = np.nan
    return y
# Series.apply takes only the function here; the stray positional argument was dropped
data_off['Discount_rate_mk'] = data_off['Discount_rate'].apply(deal_mk)
off_test['Discount_rate_mk'] = off_test['Discount_rate'].apply(deal_mk)
data_off.head()
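To make the two coupon formats concrete, here is a small standalone sketch that parses one `Discount_rate` value into a rate and a threshold in a single pass. `parse_discount` is a hypothetical helper written for this illustration; it follows the same rules as `deal_rate` and `deal_mk` but is not the article's code.

```python
import math

def parse_discount(x):
    """Return (discount_rate, threshold) for one Discount_rate value."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return (math.nan, math.nan)          # no coupon
    x = str(x)
    if ':' in x:
        # "spend X, save Y" coupon, written as "X:Y"
        threshold, reduction = (float(p) for p in x.split(':'))
        return (round((threshold - reduction) / threshold, 2), threshold)
    # Plain rate such as "0.95": no spending threshold
    return (float(x), math.nan)

print(parse_discount('150:20'))
print(parse_discount('0.95'))
```

For `'150:20'` the rate is (150 - 20) / 150, which rounds to 0.87, and the threshold is 150.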
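5.1.3 Constructing the Prediction Target
# Y = 1 if the coupon was used within 15 days of receipt, otherwise 0 (rows with a missing interval also get 0)
data_off['Y'] = data_off['date_interval'].apply(lambda x:1 if x<=15 else 0)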
data_off.head()
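The toy example below shows how the label behaves for the four kinds of rows that can occur. The data is invented for illustration, and the vectorized comparison is an equivalent way of writing the lambda above.

```python
import pandas as pd

toy = pd.DataFrame({
    # used after 5 days, used after 20 days, received but never used, purchase without a coupon
    'Date_received': pd.to_datetime(['2016-05-01', '2016-05-01', '2016-05-01', None]),
    'Date': pd.to_datetime(['2016-05-06', '2016-05-21', None, '2016-05-10']),
})

interval = (toy['Date'] - toy['Date_received']).dt.days
# A missing interval compares as False, so it is labelled 0
toy['Y'] = (interval <= 15).astype(int)
print(toy)
```

Only the first row is a positive sample. Note that the last row (a purchase without any coupon) is also labelled 0 even though it is not a coupon sample, so such rows need to be filtered out before training.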
5.2 Data Split - Sliding Time Windows
feature1=data_off[((data_off['Date_received']>='2016-01-01')&(data_off['Date_received']<='2016-04-30')) | ((data_off['Date']>='2016-01-01')&(data_off['Date']<='2016-04-30'))]
feature1.reset_index(drop=True,inplace=True)
database1=data_off[((data_off['Date_received']>='2016-05-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-05-01')&(data_off['Date']<='2016-05-31'))]
database1.reset_index(drop=True,inplace=True)
print(' Jan-Apr data: %i rows'%len(feature1))
print(' May data: %i rows'%len(database1))
feature2=data_off[((data_off['Date_received']>='2016-02-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-02-01')&(data_off['Date']<='2016-05-31'))]
feature2.reset_index(drop=True,inplace=True)
database2=data_off[((data_off['Date_received']>='2016-06-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-06-01')&(data_off['Date']<='2016-06-30'))]
database2.reset_index(drop=True,inplace=True)
print(' Feb-May data: %i rows'%len(feature2))
print(' Jun data: %i rows'%len(database2))
feature3=data_off[((data_off['Date_received']>='2016-03-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-03-01')&(data_off['Date']<='2016-06-30'))]
feature3.reset_index(drop=True,inplace=True)
database3=off_test
print(' Mar-Jun data: %i rows'%len(feature3))
print(' Jul data: %i rows'%len(database3))
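The six filters above all have the same shape: keep a row if either its receipt date or its consumption date falls inside the window. A minimal sketch of that rule as a reusable function follows; `select_window` is a hypothetical name and the toy frame is invented for illustration.

```python
import pandas as pd

def select_window(df, start, end):
    """Rows whose Date_received or Date lies in [start, end] (inclusive)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    received = df['Date_received'].between(start, end)   # NaT compares as False
    consumed = df['Date'].between(start, end)
    return df[received | consumed].reset_index(drop=True)

toy = pd.DataFrame({
    'Date_received': pd.to_datetime(['2016-03-10', '2016-04-28', '2016-05-20', None]),
    'Date': pd.to_datetime([None, '2016-05-02', '2016-05-25', '2016-02-14']),
})
feature = select_window(toy, '2016-01-01', '2016-04-30')
label = select_window(toy, '2016-05-01', '2016-05-31')
print(len(feature), len(label))
```

One thing the toy data makes visible: a coupon received on April 28 and used on May 2 matches both windows, because the rule is an OR over the two dates. Whether that overlap is acceptable depends on how strictly the feature and label periods should be separated.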
5.3 Feature Construction - Sliding-Window Data
Features are extracted separately for each of the split datasets.
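Most of the features in this section are counts per key, built by setting a column to 1 and summing within each group. As a sketch of the same idea in a more compact form, the hypothetical helper `count_by` below uses `groupby(...).size()`; the toy data is invented, and the functions in the article keep the set-to-1-and-sum pattern.

```python
import pandas as pd

def count_by(df, keys, name):
    """Count rows per key combination and return a flat DataFrame."""
    return df.groupby(keys).size().rename(name).reset_index()

toy = pd.DataFrame({
    'User_id': [1, 1, 2, 3],
    'Date': pd.to_datetime(['2016-01-05', '2016-02-01', '2016-03-03', None]),
})

# Purchases are the rows with a consumption date
bought = toy[toy['Date'].notna()]
buy_num = count_by(bought, 'User_id', 'buy_num')

# Left-join onto the full user list; users without purchases get 0
users = toy[['User_id']].drop_duplicates().merge(buy_num, on='User_id', how='left')
users['buy_num'] = users['buy_num'].fillna(0)
print(users)
```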
5.3.1 User Features
def user_feature(feature):
    all_users = feature['User_id']
    users = all_users.drop_duplicates()
    # 1. Number of purchases per user (merchants not deduplicated)
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum()
    users_goods_nums.columns=['buy_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 2. Number of coupons received per user
    Coupon = feature[pd.notna(feature['Coupon_id'])][['User_id','Coupon_id']]
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum()
    Coupon_num.columns = ['Coupon_get_num']
    users = pd.merge(users,Coupon_num,on='User_id',how='left')
    users['Coupon_get_num']=users['Coupon_get_num'].replace(np.nan,0)
    # 3. Number of purchases made with a coupon per user
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num = Used_Coupon.groupby(by='User_id').sum()
    Used_Coupon_num.columns = ['Coupon_use_num']
    users = pd.merge(users,Used_Coupon_num,on='User_id',how='left')
    users['Coupon_use_num']=users['Coupon_use_num'].replace(np.nan,0)
    # 4. Share of the user's purchases that used a coupon
    users['yqgmgl'] = users['Coupon_use_num']/users['buy_num']
    # 5. User redemption rate (coupons used / coupons received)
    users['Coupon_use_rate'] = users['Coupon_use_num']/users['Coupon_get_num']
    # 6. Number of coupons each user redeemed within 15 days
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num15 = Used_Coupon.groupby(by='User_id').sum()
    Used_Coupon_num15.columns = ['Coupon_use_num15']
    users = pd.merge(users,Used_Coupon_num15,on='User_id',how='left')
    users['Coupon_use_num15']=users['Coupon_use_num15'].replace(np.nan,0)
    # 7. Each user's 15-day redemption rate
    users['Coupon_use_rate15'] = users['Coupon_use_num15']/users['Coupon_get_num']
    # 8. Number of distinct merchants the user bought from (merchants deduplicated)
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods = users_goods.drop_duplicates()
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum()
    users_goods_nums.columns=['buy_merchant_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 9. Days between receiving and using a coupon (min and mean)
    get_user_date = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','date_interval']]
    min_interval = get_user_date.groupby('User_id').min()
    min_interval.columns = ['user_min_interval']
    mean_interval = get_user_date.groupby('User_id').mean()
    mean_interval.columns = ['user_mean_interval']
    users = pd.merge(users,min_interval,on='User_id',how='left')
    users = pd.merge(users,mean_interval,on='User_id',how='left')
    # 10. User-merchant distance for coupon purchases (max/min/mean)
    distance = feature[(pd.notna(feature.Distance))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Distance']]
    user_distance_max = distance.groupby(by='User_id').max()
    user_distance_max.columns = ['user_distance_max']
    user_distance_min = distance.groupby(by='User_id').min()
    user_distance_min.columns = ['user_distance_min']
    user_distance_mean = distance.groupby(by='User_id').mean()
    user_distance_mean.columns = ['user_distance_mean']
    users = pd.merge(users,user_distance_max,on='User_id',how='left')
    users = pd.merge(users,user_distance_mean,on='User_id',how='left')
    users = pd.merge(users,user_distance_min,on='User_id',how='left')
    # 11. Threshold of the user's coupons (mean/min/max); the filter keeps every row with a threshold, redeemed or not
    mk = feature[pd.notna(feature['Discount_rate_mk'])][['User_id','Discount_rate_mk']]
    user_Discount_mk_mean =mk.groupby(by='User_id').mean()
    user_Discount_mk_mean.columns = ['user_Discount_mk_mean']
    users = pd.merge(users,user_Discount_mk_mean,on='User_id',how='left')
    # Fixed: the min and max features previously called mean()
    user_Discount_mk_min =mk.groupby(by='User_id').min()
    user_Discount_mk_min.columns = ['user_Discount_mk_min']
    users = pd.merge(users,user_Discount_mk_min,on='User_id',how='left')
    user_Discount_mk_max =mk.groupby(by='User_id').max()
    user_Discount_mk_max.columns = ['user_Discount_mk_max']
    users = pd.merge(users,user_Discount_mk_max,on='User_id',how='left')
    users.buy_num =users.buy_num.replace(np.nan,0)
    users.buy_merchant_num =users.buy_merchant_num.replace(np.nan,0)
    return users
5.3.2 Merchant Features
def Merchant_feature(feature):
    all_Merchants = feature['Merchant_id']
    Merchants = all_Merchants.drop_duplicates()
    # 1. Total number of purchases at the merchant
    Merchant_sale = feature[pd.notna(feature['Date'])][['Merchant_id']]
    Merchant_sale['Merchant_sale_num'] = 1
    Merchant_sale_num = Merchant_sale.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_sale_num,on='Merchant_id',how='left')
    Merchants['Merchant_sale_num']=Merchants['Merchant_sale_num'].replace(np.nan,0)
    # 2. Number of the merchant's coupons that were received
    Merchant_coupons = feature[pd.notna(feature['Date_received'])][['Merchant_id']]
    Merchant_coupons['Merchant_coupons_num'] = 1
    Merchant_coupons_num = Merchant_coupons.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_coupons_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_num']=Merchants['Merchant_coupons_num'].replace(np.nan,0)
    # 3. Number of purchases at the merchant made with a coupon
    Merchant_coupons_buy = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id']]
    Merchant_coupons_buy['Merchant_coupons_buy_num'] = 1
    Merchant_coupons_buy_num = Merchant_coupons_buy.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_coupons_buy_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_buy_num']=Merchants['Merchant_coupons_buy_num'].replace(np.nan,0)
    # 4. Share of the merchant's sales that used a coupon
    Merchants['Merchant_user_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_sale_num']
    # 5. Merchant redemption rate (coupons used / coupons received)
    Merchants['Merchant_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_coupons_num']
    # 6. Number of coupons redeemed within 15 days and the 15-day redemption rate
    Merchant_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Coupon_id']]
    Merchant_Coupon['Coupon_id'] = 1
    Merchant_Coupon_num15 = Merchant_Coupon.groupby(by='Merchant_id').sum()
    Merchant_Coupon_num15.columns = ['Merchant_Coupon_use_num15']
    Merchants = pd.merge(Merchants,Merchant_Coupon_num15,on='Merchant_id',how='left')
    Merchants['Merchant_Coupon_use_num15']=Merchants['Merchant_Coupon_use_num15'].replace(np.nan,0)
    Merchants['Merchant_Coupon_use_rate15'] = Merchants['Merchant_Coupon_use_num15']/Merchants['Merchant_coupons_num']
    # 7. Merchant-consumer distance for redeemed coupons (max/mean)
    Merchant_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Distance']]
    Merchant_distance_max = Merchant_distance.groupby(by='Merchant_id').max()
    Merchant_distance_max.columns = ['Merchant_distance_max']
    Merchant_distance_mean = Merchant_distance.groupby(by='Merchant_id').mean()
    Merchant_distance_mean.columns = ['Merchant_distance_mean']
    Merchants = pd.merge(Merchants,Merchant_distance_mean,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_distance_max,on='Merchant_id',how='left')
    # 8. Threshold of the merchant's used coupons (min/mean/max)
    Merchant_mk = feature[(pd.notna(feature['Discount_rate_mk']))&(pd.notna(feature['Date']))][['Discount_rate_mk','Merchant_id']]
    Merchant_mk_min = Merchant_mk.groupby(by='Merchant_id').min()
    Merchant_mk_min.columns = ['Merchant_mk_min']
    Merchant_mk_mean = Merchant_mk.groupby(by='Merchant_id').mean()
    Merchant_mk_mean.columns = ['Merchant_mk_mean']
    # Fixed: the max feature previously called mean()
    Merchant_mk_max = Merchant_mk.groupby(by='Merchant_id').max()
    Merchant_mk_max.columns = ['Merchant_mk_max']
    Merchants = pd.merge(Merchants,Merchant_mk_min,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_max,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_mean,on='Merchant_id',how='left')
    # 9. Days until the merchant's coupons were used (min and mean)
    Merchant_interval = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','date_interval']]
    min_interval = Merchant_interval.groupby('Merchant_id').min()
    min_interval.columns = ['Merchant_min_interval']
    mean_interval = Merchant_interval.groupby('Merchant_id').mean()
    mean_interval.columns = ['Merchant_mean_interval']
    Merchants = pd.merge(Merchants,min_interval,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,mean_interval,on='Merchant_id',how='left')
    return Merchants
5.3.3 Coupon Features
def couponsType_feature(feature):
    all_coupons = feature[pd.notna(feature['Discount_rate'])]['Discount_rate']
    Coupons = all_coupons.drop_duplicates()
    # 1. Number of times each coupon type was received
    Coupons_Type_get = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Discount_rate']))][['Discount_rate']]
    Coupons_Type_get['Coupons_Type_get_num'] = 1
    Coupons_Type_get_num = Coupons_Type_get.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupons_Type_get_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_get_num']=Coupons['Coupons_Type_get_num'].replace(np.nan,0)
    # 2. Number of times each coupon type was used
    Coupons_Type_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Discount_rate']]
    Coupons_Type_use['Coupons_Type_use_num'] = 1
    Coupons_Type_use_num = Coupons_Type_use.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupons_Type_use_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_use_num']=Coupons['Coupons_Type_use_num'].replace(np.nan,0)
    # 3. Redemption rate of each coupon type
    Coupons['Coupons_Type_rate']=Coupons['Coupons_Type_use_num']/Coupons['Coupons_Type_get_num']
    # 4. Number of redemptions within 15 days for each coupon type
    Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate']]
    Coupon15['Coupon15_use_num'] = 1
    Coupon15_num15 = Coupon15.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupon15_num15,on='Discount_rate',how='left')
    # 5. 15-day redemption rate of each coupon type
    Coupons['Coupons15_Type_rate']=Coupons['Coupon15_use_num']/Coupons['Coupons_Type_get_num']
    # 6. Distance at which each coupon type was used (max/mean)
    Coupon_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Distance']]
    Coupon_distance_max = Coupon_distance.groupby(by='Discount_rate').max()
    Coupon_distance_max.columns = ['Coupons_Type_distance_max']
    Coupon_distance_mean = Coupon_distance.groupby(by='Discount_rate').mean()
    Coupon_distance_mean.columns = ['Coupons_Type_distance_mean']
    Coupons = pd.merge(Coupons,Coupon_distance_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_distance_max,on='Discount_rate',how='left')
    # 7. Days between receiving and using each coupon type (mean/min)
    Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','date_interval']]
    Coupon_interval_min = Coupon_interval.groupby(by='Discount_rate').min()
    Coupon_interval_min.columns = ['Coupons_Type_interval_min']
    Coupon_interval_mean = Coupon_interval.groupby(by='Discount_rate').mean()
    Coupon_interval_mean.columns = ['Coupons_Type_interval_mean']
    Coupons = pd.merge(Coupons,Coupon_interval_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_interval_min,on='Discount_rate',how='left')
    return Coupons
5.3.4 User-Coupon Joint Features
def User_CouponsType_feature(feature):
    User_Coupons = feature[['User_id','Discount_rate']]
    User_Coupons = User_Coupons.drop_duplicates()
    # 1. Number of times the user received a given coupon type
    User_CouponType_get = feature[pd.notna(feature['Date_received'])][['User_id','Discount_rate']]
    User_CouponType_get['User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_CouponType_get,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_get_num']=User_Coupons['User_CouponType_get_num'].replace(np.nan,0)
    # 2. Number of times the user used a given coupon type
    User_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Discount_rate']]
    User_CouponType_use['User_CouponType_use_num'] = 1
    User_CouponType_use = User_CouponType_use.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_CouponType_use,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_use_num']=User_Coupons['User_CouponType_use_num'].replace(np.nan,0)
    # 3. The user's redemption rate for a given coupon type
    User_Coupons['User_Coupons_rate'] = User_Coupons['User_CouponType_use_num']/User_Coupons['User_CouponType_get_num']
    # 4. Number of redemptions within 15 days
    User_Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Discount_rate']]
    User_Coupon15['User_Coupon15_use_num'] = 1
    User_Coupon15_num15 = User_Coupon15.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_Coupon15_num15,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_Coupon15_use_num']=User_Coupons['User_Coupon15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    User_Coupons['User_Coupons_rate15'] = User_Coupons['User_Coupon15_use_num']/User_Coupons['User_CouponType_get_num']
    # 6. Days between receiving and using the coupon (min/mean)
    User_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Discount_rate','date_interval']]
    User_Coupon_interval_min = User_Coupon_interval.groupby(['Discount_rate','User_id']).min()
    User_Coupon_interval_min.columns = ['User_Coupons_Type_interval_min']
    User_Coupon_interval_mean = User_Coupon_interval.groupby(['Discount_rate','User_id']).mean()
    User_Coupon_interval_mean.columns = ['User_Coupons_Type_interval_mean']
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_mean,on=['Discount_rate','User_id'],how='left')
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_min,on=['Discount_rate','User_id'],how='left')
    return User_Coupons
5.3.5 User-Merchant Joint Features
def User_Merchants_feature(feature):
    User_Merchants = feature[['User_id','Merchant_id']]
    User_Merchants = User_Merchants.drop_duplicates()
    # 1. Number of purchases the user made at a given merchant
    User_Merchant_buy = feature[pd.notna(feature['Date'])][['User_id','Merchant_id']]
    User_Merchant_buy['User_Merchant_buy_num'] = 1
    User_Merchant_buy = User_Merchant_buy.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_buy,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_buy_num']=User_Merchants['User_Merchant_buy_num'].replace(np.nan,0)
    # 2. Number of coupons the user received from a given merchant
    User_Merchant_get = feature[pd.notna(feature['Date_received'])][['User_id','Merchant_id']]
    User_Merchant_get['User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_get_num']=User_Merchants['User_Merchant_get_num'].replace(np.nan,0)
    # 3. Number of coupons the user used at a given merchant
    User_Merchant_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Merchant_id']]
    User_Merchant_use['User_Merchant_use_num'] = 1
    User_Merchant_use = User_Merchant_use.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_use,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_use_num']=User_Merchants['User_Merchant_use_num'].replace(np.nan,0)
    # 4. The user's redemption rate at a given merchant
    User_Merchants['User_Merchants_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_get_num']
    # 5. Share of the user's purchases at the merchant that used a coupon
    User_Merchants['User_Merchants_user_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_buy_num']
    # 6. Number of redemptions within 15 days
    User_Merchant15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Merchant_id']]
    User_Merchant15['User_Merchant15_use_num'] = 1
    User_Merchant15_num15 = User_Merchant15.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant15_num15,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant15_use_num']=User_Merchants['User_Merchant15_use_num'].replace(np.nan,0)
    # 7. 15-day redemption rate
    User_Merchants['User_Merchant_rate15'] = User_Merchants['User_Merchant15_use_num']/User_Merchants['User_Merchant_get_num']
    # 8. Days between receiving and using the coupon (min/mean)
    User_Merchant_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Merchant_id','date_interval']]
    User_Merchant_interval_min = User_Merchant_interval.groupby(['User_id','Merchant_id']).min()
    User_Merchant_interval_min.columns = ['User_Merchants_Type_interval_min']
    User_Merchant_interval_mean = User_Merchant_interval.groupby(['User_id','Merchant_id']).mean()
    User_Merchant_interval_mean.columns = ['User_Merchants_Type_interval_mean']
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_mean,on=['User_id','Merchant_id'],how='left')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_min,on=['User_id','Merchant_id'],how='left')
    return User_Merchants
5.3.6 Merchant-Coupon Joint Features
def Merchants_CouponsType_feature(feature):
    Merchants_Coupons = feature[['Merchant_id','Discount_rate']]
    Merchants_Coupons = Merchants_Coupons.drop_duplicates()
    # 1. Number of times a given coupon type of the merchant was received
    Merchants_CouponType_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate']]
    Merchants_CouponType_get['Merchants_CouponType_get_num'] = 1
    Merchants_CouponType_get = Merchants_CouponType_get.groupby(['Merchant_id','Discount_rate']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_get,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_get_num']=Merchants_Coupons['Merchants_CouponType_get_num'].replace(np.nan,0)
    # 2. Number of times a given coupon type of the merchant was used
    Merchants_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate']]
    Merchants_CouponType_use['Merchants_CouponType_use_num'] = 1
    Merchants_CouponType_use = Merchants_CouponType_use.groupby(['Merchant_id','Discount_rate']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_use,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_use_num']=Merchants_Coupons['Merchants_CouponType_use_num'].replace(np.nan,0)
    # 3. Redemption rate of a given coupon type at the merchant
    Merchants_Coupons['Merchants_Coupons_rate'] = Merchants_Coupons['Merchants_CouponType_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']
    # 4. Number of redemptions within 15 days
    Merchants_CouponsType15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate','Merchant_id']]
    Merchants_CouponsType15['Merchants_CouponsType15_use_num'] = 1
    Merchants_CouponsType15_num15 = Merchants_CouponsType15.groupby(['Discount_rate','Merchant_id']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponsType15_num15,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons['Merchants_CouponsType15_use_num']=Merchants_Coupons['Merchants_CouponsType15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    Merchants_Coupons['Merchants_CouponsType_rate15'] = Merchants_Coupons['Merchants_CouponsType15_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']
    # 6. Days between receiving and using the coupon (min/mean)
    Merchants_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Merchant_id','date_interval']]
    Merchants_Coupon_interval_min = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).min()
    Merchants_Coupon_interval_min.columns = ['Merchants_Coupons_Type_interval_min']
    Merchants_Coupon_interval_mean = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).mean()
    Merchants_Coupon_interval_mean.columns = ['Merchants_Coupons_Type_interval_mean']
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_mean,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_min,on=['Discount_rate','Merchant_id'],how='left')
    return Merchants_Coupons
5.3.7 User-Merchant-Coupon Joint Features
def M_C_UType_feature(feature):
    M_C_U = feature[['Merchant_id','Discount_rate','User_id']]
    M_C_U = M_C_U.drop_duplicates()
    # 1. User-merchant-coupon: number of times received
    M_C_U_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate','User_id']]
    M_C_U_get['M_C_U_get_num'] = 1
    M_C_U_get = M_C_U_get.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U_get,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_get_num']=M_C_U['M_C_U_get_num'].replace(np.nan,0)
    # 2. User-merchant-coupon: number of times used
    M_C_U_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate','User_id']]
    M_C_U_use['M_C_U_use_num'] = 1
    M_C_U_use = M_C_U_use.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U_use,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_use_num']=M_C_U['M_C_U_use_num'].replace(np.nan,0)
    # 3. User-merchant-coupon redemption rate
    M_C_U['M_C_U_rate'] = M_C_U['M_C_U_use_num']/M_C_U['M_C_U_get_num']
    # 4. Number of redemptions within 15 days
    M_C_U15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Discount_rate','User_id']]
    M_C_U15['M_C_U15_use_num'] = 1
    M_C_U15_num15 = M_C_U15.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U15_num15,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U15_use_num']=M_C_U['M_C_U15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    M_C_U['M_C_UType_rate15'] = M_C_U['M_C_U15_use_num']/M_C_U['M_C_U_get_num']
    # 6. Days between receiving and using the coupon (min/mean)
    M_C_U_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Discount_rate','User_id','date_interval']]
    M_C_U_interval_min = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).min()
    M_C_U_interval_min.columns = ['M_C_Us_Type_interval_min']
    M_C_U_interval_mean = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).mean()
    M_C_U_interval_mean.columns = ['M_C_Us_Type_interval_mean']
    # Fixed: the merges were assigned to a separate variable (M_C_Us), so the
    # interval columns never reached the returned frame
    M_C_U = pd.merge(M_C_U,M_C_U_interval_mean,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U = pd.merge(M_C_U,M_C_U_interval_min,on=['Merchant_id','Discount_rate','User_id'],how='left')
    return M_C_U
def leakage(database3):
    # 1. Number of coupons each user received
    Coupon = database3[pd.notna(database3['Coupon_id'])][['User_id','Coupon_id']].copy()
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum()
    Coupon_num.columns = ['Coupon_get_num']
    database3 = pd.merge(database3,Coupon_num,on=['User_id'],how='left')
    # 2. Number of coupons of a given type (Discount_rate) the user received in this window
    User_CouponType_get = database3[pd.notna(database3['Date_received'])][['User_id','Discount_rate']].copy()
    User_CouponType_get['leakage_User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum()
    database3 = pd.merge(database3,User_CouponType_get,on=['User_id','Discount_rate'],how='left')
    # 3. Number of coupons the user received from a given merchant
    User_Merchant_get = database3[pd.notna(database3['Date_received'])][['User_id','Merchant_id']].copy()
    User_Merchant_get['leakage_User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum()
    database3 = pd.merge(database3,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    # 4. Number of coupons the user received on the same day
    Coupon_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received']].copy()
    Coupon_day['leakage_Coupon_dayget_num'] = 1
    Coupon_num_day = Coupon_day.groupby(['User_id','Date_received']).sum()
    database3 = pd.merge(database3,Coupon_num_day,on=['User_id','Date_received'],how='left')
    # 5. Number of coupons of a given type the user received on the same day
    Coupon_s_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received','Discount_rate']].copy()
    Coupon_s_day['speleakage_Coupon_dayget_num'] = 1
    Coupon_num_s_day = Coupon_s_day.groupby(['User_id','Date_received','Discount_rate']).sum()
    database3 = pd.merge(database3,Coupon_num_s_day,on=['User_id','Date_received','Discount_rate'],how='left')
    # 6. Latest and earliest dates on which the user received a coupon type,
    #    computed only for types the user received more than once
    repeated = database3[database3['leakage_User_CouponType_get_num']>1]
    by_user_type = repeated.groupby(['User_id','Discount_rate'])['Date_received']
    lekge_user_SpeCouSum_maxday = by_user_type.max().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_maxday'})
    lekge_user_SpeCouSum_minday = by_user_type.min().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_minday'})
    database3 = pd.merge(database3,lekge_user_SpeCouSum_maxday,how='left',on=['User_id','Discount_rate'])
    database3 = pd.merge(database3,lekge_user_SpeCouSum_minday,how='left',on=['User_id','Discount_rate'])
    # 7. Flags: is this the user's first / last receipt of this coupon type?
    #    1 = yes, 0 = no, -1 = no min/max date available (type received only once)
    database3['lekge_user_SpeCou_ifirst'] = (database3['Date_received']-database3['lekge_user_SpeCouSum_minday']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    database3['lekge_user_SpeCou_iflast'] = (database3['lekge_user_SpeCouSum_maxday']-database3['Date_received']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    return database3
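The count features above all follow the same pattern: add a column of ones, group, sum, then merge back. As a minimal sketch (not the code behind the reported score), the same per-group counts can be attached in one step with `groupby(...).transform('size')`. The helper name `add_group_count` and the toy data are hypothetical, and the sketch assumes every row in the label window has a coupon, so no null filtering is needed.

```python
import pandas as pd


def add_group_count(df, keys, new_col):
    """Attach to every row the number of rows that share its `keys` values.

    Hypothetical helper: equivalent to the ones -> groupby -> sum -> merge
    pattern when none of the key columns contain nulls.
    """
    out = df.copy()
    out[new_col] = out.groupby(keys)[keys[0]].transform("size")
    return out


# Toy label-window data (values are made up for illustration).
toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Discount_rate": ["150:20", "150:20", "150:20", "50:10"],
    "Date_received": pd.to_datetime(
        ["2016-07-01", "2016-07-01", "2016-07-03", "2016-07-02"]
    ),
})

# Coupons per user, and coupons per user per day.
toy = add_group_count(toy, ["User_id"], "Coupon_get_num")
toy = add_group_count(toy, ["User_id", "Date_received"], "leakage_Coupon_dayget_num")
print(toy)
```

Because `transform` returns a result aligned to the original rows, no merge is needed and the row count cannot change by accident.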
Merge all of the features constructed above into a single table:
def feature_all(feature3,y):
    # Features are computed on the feature window (feature3)
    # User
    users = user_feature(feature3)
    # Merchant
    Merchants = Merchant_feature(feature3)
    # Coupon
    Coupons_type = couponsType_feature(feature3)
    # User-merchant
    User_Merchants = User_Merchants_feature(feature3)
    # User-coupon
    User_CouponsType = User_CouponsType_feature(feature3)
    # Merchant-coupon
    Merchants_CouponsType = Merchants_CouponsType_feature(feature3)
    # Leakage features are computed on the label window (y) itself
    y = leakage(y)
    # Left joins keep every row of the label window
    feature_final = pd.merge(y,users,on='User_id',how='left')
    feature_final = pd.merge(feature_final,Merchants,on='Merchant_id',how='left')
    feature_final = pd.merge(feature_final,Coupons_type,on='Discount_rate',how='left')
    feature_final = pd.merge(feature_final,User_Merchants,on=['User_id','Merchant_id'],how='left')
    feature_final = pd.merge(feature_final,User_CouponsType,on=['User_id','Discount_rate'],how='left')
    feature_final = pd.merge(feature_final,Merchants_CouponsType,on=['Merchant_id','Discount_rate'],how='left')
    # Keep only rows that have a coupon (Discount_rate is not null)
    feature_final = feature_final[feature_final['Discount_rate'].notna()]
    # Interval features: the current coupon's value minus the historical statistic
    feature_final['user_distance_max_interval'] = feature_final['Distance']-feature_final['user_distance_max']
    feature_final['user_distance_mean_interval'] = feature_final['Distance']-feature_final['user_distance_mean']
    feature_final['Merchant_distance_max_interval'] = feature_final['Distance']-feature_final['Merchant_distance_max']
    feature_final['Merchant_distance_mean_interval'] = feature_final['Distance']-feature_final['Merchant_distance_mean']
    feature_final['Coupons_Type_distance_mean_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_mean']
    feature_final['Coupons_Type_distance_max_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_max']
    feature_final['user_Discount_mk_mean_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_mean']
    feature_final['user_Discount_mk_min_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_min']
    feature_final['user_Discount_mk_max_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_max']
    # feature_final = feature_final.replace(np.nan,-99999)
    return feature_final
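To make the merge-then-subtract step concrete, here is a minimal sketch on toy data, assuming a single historical statistic per user. It shows how a left join carries feature-window statistics onto label-window rows and how an interval feature is derived from them. The frame names `history` and `label` and all values are made up for illustration; only the column names follow the function above.

```python
import pandas as pd

# Feature window: historical redemptions (toy values).
history = pd.DataFrame({
    "User_id": [1, 1, 2],
    "Distance": [1.0, 3.0, 5.0],
})
# Label window: coupons whose redemption we want to predict.
label = pd.DataFrame({
    "User_id": [1, 2, 3],
    "Distance": [4.0, 5.0, 2.0],
})

# Per-user statistics computed from the feature window only.
user_stats = (
    history.groupby("User_id")["Distance"]
    .agg(user_distance_mean="mean", user_distance_max="max")
    .reset_index()
)

# Left join keeps every label row; users with no history get NaN.
merged = label.merge(user_stats, on="User_id", how="left")

# Interval feature: current distance relative to the user's own history.
merged["user_distance_mean_interval"] = (
    merged["Distance"] - merged["user_distance_mean"]
)
print(merged)
```

User 3 has no rows in the feature window, so its statistics and interval are NaN. XGBoost can handle missing values natively, which is one reason the `replace(np.nan,-99999)` line can stay commented out.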
Dataset split
data3 = feature_all(feature3,database3)
data2 = feature_all(feature2,database2)
data1 = feature_all(feature1,database1)
# Stack the two labelled windows into one training set
data_train = pd.concat([data2,data1],axis =0)
y_train = data_train['Y'].values
x_train1 = data_train.drop(columns=['date_interval','Discount_rate','Date_received','Date','User_id','Merchant_id','Coupon_id','Y','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_train = x_train1.values
print('Total: %i features'%len(x_train1.columns))
from sklearn.model_selection import train_test_split
# Note: test_size=0.8 trains on 20% of the rows and holds out the other 80% for evaluation
(train_x,test_x,train_y,test_y)=train_test_split(x_train, y_train,test_size=0.8,random_state=0)
# The prediction window has no 'Date', 'date_interval' or 'Y' columns to drop
x_pred1 = data3.drop(columns=['Discount_rate','Date_received','User_id','Merchant_id','Coupon_id','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_pred = x_pred1.values
print('Total: %i features'%len(x_pred1.columns))
Output:
Total: 83 features
Total: 83 features
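Matching feature counts are necessary but not sufficient: `.values` discards the column names, so the model relies on the training and prediction matrices having the same columns in the same order. A minimal sketch of an explicit check follows, assuming a hypothetical helper `feature_frame` and toy frames standing in for `data_train` and `data3`.

```python
import pandas as pd

# Columns that identify a row or carry the label rather than describe it
# (taken from the two drop lists above).
NON_FEATURE_COLS = [
    "date_interval", "Discount_rate", "Date_received", "Date",
    "User_id", "Merchant_id", "Coupon_id", "Y",
    "lekge_user_SpeCouSum_maxday", "lekge_user_SpeCouSum_minday",
]


def feature_frame(df):
    """Drop non-feature columns; ignore any that this frame does not have."""
    return df.drop(columns=NON_FEATURE_COLS, errors="ignore")


# Toy frames; the prediction frame has its feature columns in another order.
toy_train = pd.DataFrame({"User_id": [1], "Y": [0], "f1": [0.1], "f2": [3]})
toy_pred = pd.DataFrame({"User_id": [2], "f2": [5], "f1": [0.7]})

x_train_df = feature_frame(toy_train)
# Reorder the prediction columns to match training before calling .values.
x_pred_df = feature_frame(toy_pred)[x_train_df.columns]

assert list(x_pred_df.columns) == list(x_train_df.columns)
print(list(x_train_df.columns))
```

Indexing with `x_train_df.columns` raises a `KeyError` if the prediction frame is missing a training feature, which surfaces a mismatch before the model is ever called.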
6. Model Building
import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import plot_importance
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score,roc_curve,auc,recall_score,precision_score
from sklearn.preprocessing import MinMaxScaler
xgb_model = xgb.XGBClassifier(
booster='gbtree',
objective= 'binary:logistic',
eval_metric='auc',
learning_rate =0.03,
n_estimators=1000,
max_depth=5,
min_child_weight=1.1,
gamma=0.1,
subsample=0.8,
colsample_bytree=0.8,
random_state=10,
reg_alpha=0,
reg_lambda=0
)
xgb_model.fit(train_x,train_y)
print('Recall of the xgboost model:',recall_score(test_y, xgb_model.predict(test_x)))
print('Precision of the xgboost model:',precision_score(test_y, xgb_model.predict(test_x)))
print('AUC of the xgboost model:',roc_auc_score(test_y, xgb_model.predict_proba(test_x)[:,1]))
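Recall and precision above are computed from hard labels, while AUC is computed from the predicted probabilities and depends only on how they rank positives against negatives. As an illustrative sketch of that definition (not how `roc_auc_score` is implemented), AUC can be computed as the fraction of positive-negative pairs that are ranked correctly. The function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. The cost is O(n_pos * n_neg), so this is only
    suitable for small inputs; use sklearn's roc_auc_score in practice.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Three of the four positive-negative pairs are ordered correctly.
print(pairwise_auc(y_true, scores))  # 0.75
```

Since only the ordering matters, rescaling the probabilities or moving a decision threshold changes recall and precision but leaves AUC unchanged.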
7. Saving the Results
data33= pd.read_csv('/项目准备/O2O优惠券使用预测/offline_test.csv')
y_pred= xgb_model.predict_proba(x_pred)
print(len(y_pred))
a = pd.DataFrame(y_pred)[1].values
pred=pd.DataFrame({'User_id':data33['User_id'].values,"Coupon_id":data33['Coupon_id'].values,'Date_received':data33['Date_received'].values,'pred':a})
pred.to_csv('/项目准备/O2O优惠券使用预测/result16.csv',index=False,header=False)
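The code above pairs the predictions with identifier columns read again from the raw test file, which only works if `x_pred` still has one row per raw row, in the same order. A minimal sketch of a more defensive variant is given below: it takes the identifiers from the same frame the prediction matrix was derived from. The helper `build_submission` and the toy data are hypothetical.

```python
import pandas as pd


def build_submission(pred_frame, proba):
    """Pair each probability with the identifier columns of the same row.

    Hypothetical helper: `pred_frame` is the frame the prediction matrix was
    derived from, so its row order matches `proba` by construction.
    """
    if len(pred_frame) != len(proba):
        raise ValueError(
            "got %d rows but %d predictions" % (len(pred_frame), len(proba))
        )
    out = pred_frame[["User_id", "Coupon_id", "Date_received"]].copy()
    out["pred"] = list(proba)
    return out


# Toy frame: identifier columns plus one feature column.
toy_frame = pd.DataFrame({
    "User_id": [1, 2],
    "Coupon_id": [11, 12],
    "Date_received": [20160701, 20160702],
    "f1": [0.1, 0.7],
})
submission = build_submission(toy_frame, [0.2, 0.9])
print(submission)
```

If `Date_received` was converted to a datetime type during preprocessing, it would need to be formatted back to the raw file's format before the CSV is written.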
After some simple parameter tuning, the submission scored 0.7882.
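8. Takeaways
When I first came across this dataset, I only intended to use Tableau to produce a data analysis report. The competition turned out to be interesting, though, so I tried writing some code. Along the way I borrowed ideas from other bloggers, such as the sliding-window approach to splitting the data, but many of the constructed features still come from insights in my earlier data analysis report. I learned a lot from this competition, for example:
- Feature extraction is the essence of machine learning, and it tests your insight into the specific business. For example, the earlier data analysis showed that distance, coupon threshold, and day of the week are all key factors in coupon redemption, so building these indicators into the feature set improves the model substantially.
- Finding suitable features takes insight. For example, the feature "mean distance of historical redemptions for this coupon" on its own may have no significant effect on the predictions, whereas "distance of the coupon the user just received minus the mean distance of historical redemptions" has a significant effect. Finding the right features therefore takes some business intuition.
- Model tuning: parameter tuning does not affect the model much, yet it takes a long time. I initially spent a great deal of time on tuning, but after improving the features I found that the parameters mattered little; feature engineering and the choice of prediction method usually matter more. With enough compute, grid search can be used to explore the parameters, but it is very time-consuming, so this article only does simple manual tuning. For the detailed steps, see the blog post "xgboost参数调节的一般思路" (a general approach to XGBoost parameter tuning).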
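The reason grid search is so slow is that the number of model fits is the product of the number of candidate values for each parameter. A minimal sketch of the idea follows, assuming a hypothetical `score_fn` that would train a model and return a validation score; a stand-in scorer is used here so the sketch runs without training anything.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in `param_grid` and return the best one.

    `score_fn` is a hypothetical callable that trains a model with the given
    parameters and returns a validation score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    n_tried = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


# Stand-in scorer; a real one would fit XGBoost and return the validation AUC.
def toy_score(params):
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.03)


grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.03, 0.1],
    "subsample": [0.7, 0.8],
}
best_params, best_score, n_tried = grid_search(grid, toy_score)
print(best_params, n_tried)  # 3 * 3 * 2 = 18 model fits
```

Even this small grid needs 18 fits, and with cross-validation each fit is repeated once per fold. scikit-learn's `GridSearchCV`, already imported above, automates the same loop.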