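O2O Coupon Redemption: Model Prediction

Contents

1. Project Background and Goal

2. Data Description

3. Problem Analysis

4. Data Exploration and Preprocessing

5. Feature Engineering (Feature Construction)

5.1 Feature Construction: Whole Dataset

5.1.1 Time Features

5.1.2 Coupon Features

5.1.3 Target Construction

5.2 Data Split: Sliding Time Windows

5.3 Feature Construction: Sliding-Window Data

5.3.1 User Features

5.3.2 Merchant Features

5.3.3 Coupon Features

5.3.4 User-Coupon Joint Features

5.3.5 User-Merchant Joint Features

5.3.6 Merchant-Coupon Joint Features

5.3.7 User-Merchant-Coupon Joint Features

6. Model Building

7. Saving the Data

8. Takeaways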



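1. Project Background and Goal

The O2O (online-to-offline) industry reaches hundreds of millions of consumers, and apps record more than ten billion user-behavior and location events every day, which makes it one of the best meeting points between big-data research and commercial operations. Using coupons to reactivate existing users or attract new customers into stores is an important O2O marketing tool. Coupons handed out at random, however, are meaningless noise for most users. For merchants, indiscriminate coupons can damage brand reputation and make marketing costs hard to estimate. Personalized delivery is a key technique for raising the coupon redemption rate: it gives consumers with a real preference a genuine benefit and gives merchants stronger marketing capability.

Goal: using users' real offline consumption records from 2016-01-01 to 2016-06-30, predict whether a coupon received in July 2016 will be used within 15 days of receipt.

A preliminary data analysis was carried out before modeling; see the O2O coupon data analysis report.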

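2. Data Description

Language: Python
Data source: https://tianchi.aliyun.com/competition/entrance/231593/information
Data fields (as referenced in the code below): User_id, Merchant_id, Coupon_id, Discount_rate, Distance, Date_received, Date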

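3. Problem Analysis

Problem 1: the prediction set has only 6 raw fields. How can feature engineering describe each sample fully?

Problem 2: how should the data be split so that historical data predicts future data?

Historical data -> extracted features -> a habit or built-in inertia -> unlikely to change quickly

The months before July (the month to predict) can therefore be used to extract these stable features, and the prediction is based on them.

Data split:

  • 2016.01.01-2016.04.30 predicts → 2016.05.01-2016.05.31 (dataset 1)
  • 2016.02.01-2016.05.31 predicts → 2016.06.01-2016.06.30 (dataset 2)
  • 2016.03.01-2016.06.30 predicts → 2016.07.01-2016.07.31 (dataset to predict)

Procedure:

- Extract features from the 2016.01.01-2016.04.30 data

- Attach the extracted features to the 2016.05.01-2016.05.31 rows; the other two windows are handled the same way

- Merge the processed 2016.05.01-2016.05.31 and 2016.06.01-2016.06.30 data into one dataset

- Split the merged dataset into train and test sets and train the model

- Use the trained model to predict 2016.07.01-2016.07.31.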
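The split described above can be sketched with a small helper. This is a minimal illustration, not the code used later in the article: the names `WINDOWS` and `split_window` are hypothetical, and for brevity it selects rows on `Date_received` only, whereas the article's own split also looks at `Date`.

```python
import pandas as pd

# Hypothetical window table mirroring the three splits listed above:
# (feature start, feature end, label start, label end)
WINDOWS = [
    ("2016-01-01", "2016-04-30", "2016-05-01", "2016-05-31"),
    ("2016-02-01", "2016-05-31", "2016-06-01", "2016-06-30"),
    ("2016-03-01", "2016-06-30", "2016-07-01", "2016-07-31"),
]

def split_window(df, feat_start, feat_end, label_start, label_end):
    """Return (feature_rows, label_rows) for one sliding window."""
    in_feat = df["Date_received"].between(feat_start, feat_end)
    in_label = df["Date_received"].between(label_start, label_end)
    return df[in_feat].reset_index(drop=True), df[in_label].reset_index(drop=True)

# Toy data, only to show the mechanics
toy = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "Date_received": pd.to_datetime(
        ["2016-01-15", "2016-04-30", "2016-05-01", "2016-06-20"]
    ),
})
feat, label = split_window(toy, *WINDOWS[0])
print(len(feat), len(label))
```

Features are computed only from `feat`, then joined onto `label`, so nothing from the label month leaks into the historical statistics.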

4. Data Exploration and Preprocessing

import pandas as pd
import numpy as np
data_off = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_train.csv")
off_test = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_test.csv")
# keep an independent copy of the raw test set (plain assignment would only create an alias)
off_test1 = off_test.copy()
off_test.head()
data_off.shape
data_off.info()
data_off.describe()
# latest and earliest purchase date
# latest and earliest coupon-receipt date
print(data_off['Date'].max(),data_off['Date'].min())
print(data_off['Date_received'].max(),data_off['Date_received'].min())

Output:
20160630.0 20160101.0
20160615.0 20160101.0
# missing values
data_off.isnull().sum()
# when Coupon_id is missing, Discount_rate and Date_received are missing as well
nan1 = data_off["Discount_rate"].isnull()
nan2 = data_off['Date_received'].isnull()
nan3 = data_off['Coupon_id'].isnull()
np.all(nan1==nan2),np.all(nan1==nan3)

Output:
(True, True)
# drop duplicate rows
data_off.drop_duplicates(inplace=True) 
data_off.info()
# convert the float64-coded dates to datetime
data_off['Date'] = pd.to_datetime(data_off['Date'],format='%Y%m%d')
data_off['Date_received'] = pd.to_datetime(data_off['Date_received'],format='%Y%m%d')
off_test['Date_received'] = pd.to_datetime(off_test['Date_received'],format='%Y%m%d')
data_off.info()
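The date columns arrive as floats such as 20160630.0, with NaN where no date exists. As a hedged alternative that does not depend on how a given pandas version treats float input in `to_datetime`, the conversion can go through integer strings. The helper name `parse_yyyymmdd` is hypothetical and the values are toy data.

```python
import numpy as np
import pandas as pd

def parse_yyyymmdd(col):
    """Parse float-coded YYYYMMDD dates; missing values become NaT."""
    # Drop NaN first so the float -> int -> str conversion is exact
    text = col.dropna().astype("int64").astype(str)
    parsed = pd.to_datetime(text, format="%Y%m%d")
    # Reindexing restores the dropped rows as NaT
    return parsed.reindex(col.index)

raw = pd.Series([20160101.0, np.nan, 20160630.0])
parsed = parse_yyyymmdd(raw)
print(parsed)
```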

5. Feature Engineering (Feature Construction)

5.1 Feature Construction: Whole Dataset

5.1.1 Time Features

# days from receiving the coupon to the purchase
date_interval = data_off['Date']-data_off['Date_received']
data_off['date_interval'] = [d.days for d in date_interval]
# day of the week the coupon was received (1 = Monday ... 7 = Sunday)
data_off['receive_week']=[d.weekday()+1 for d in data_off['Date_received']]
off_test['receive_week']=[d.weekday()+1 for d in off_test['Date_received']]

# whether the coupon was received on a weekend
data_off['receive_isWeekend']=data_off['receive_week'].apply(lambda x:1 if x>5 else 0)
off_test['receive_isWeekend']=off_test['receive_week'].apply(lambda x:1 if x>5 else 0)
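The same three time features can also be computed with pandas' vectorized `.dt` accessors instead of Python-level loops. A minimal sketch on toy rows, assuming the date columns are already datetime:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date_received": pd.to_datetime(["2016-01-01", "2016-01-02", None]),
    "Date": pd.to_datetime(["2016-01-11", None, "2016-01-05"]),
})

# Days between receipt and purchase; NaN when either date is missing
toy["date_interval"] = (toy["Date"] - toy["Date_received"]).dt.days
# 1 = Monday ... 7 = Sunday; NaN when no coupon was received
toy["receive_week"] = toy["Date_received"].dt.dayofweek + 1
# NaN > 5 evaluates to False, so rows without a coupon get 0
toy["receive_isWeekend"] = (toy["receive_week"] > 5).astype(int)
print(toy)
```

2016-01-01 was a Friday and 2016-01-02 a Saturday, so only the second row is flagged as a weekend.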

5.1.2 Coupon Features

# discount rate
def deal_rate(x):
    if pd.isna(x):
        y =float(x)
    elif ":" in x:
        a = float(x.split(":")[0])# spend threshold (denominator)
        b = a-float(x.split(":")[1])# amount paid after the reduction (numerator)
        y = np.round(b/a,2)
    else:
        y = float(x)
    return y
data_off['Discount_rate_%'] = data_off['Discount_rate'].map(deal_rate)
off_test['Discount_rate_%'] = off_test['Discount_rate'].map(deal_rate)

# spend threshold
def deal_mk(x):
    if pd.isna(x):# no coupon
        y =float(x)
    elif ":" in x:# "spend X, save Y" coupon
        y = int(x.split(":")[0])# spend threshold
    else:# percentage-discount coupon, no threshold
        y = np.nan
    return y
# map() applies the function element-wise; the stray positional argument of apply(deal_mk,1) is not needed
data_off['Discount_rate_mk'] = data_off['Discount_rate'].map(deal_mk)
off_test['Discount_rate_mk'] = off_test['Discount_rate'].map(deal_mk)
data_off.head()
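A worked example of the two encodings in `Discount_rate`. This sketch folds both parsers into one hypothetical helper, `parse_discount`, purely to show the arithmetic: a "150:20" coupon means "spend 150, save 20", so the effective rate is (150 - 20) / 150, rounded to two decimals.

```python
import numpy as np
import pandas as pd

def parse_discount(raw):
    """Return (effective_rate, spend_threshold) for one Discount_rate value."""
    if pd.isna(raw):
        return np.nan, np.nan
    text = str(raw)
    if ":" in text:
        threshold, reduction = (float(part) for part in text.split(":"))
        return round((threshold - reduction) / threshold, 2), threshold
    # Plain rates such as "0.95" have no spend threshold
    return float(text), np.nan

for value in ["150:20", "100:10", "0.95", np.nan]:
    print(value, parse_discount(value))
```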

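5.1.3 Target Construction

# Y = 1 if the coupon was redeemed within 15 days of receipt; rows with no redemption (NaN interval) get 0
data_off['Y'] = data_off['date_interval'].apply(lambda x:1 if x<=15 else 0)
data_off.head()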
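The label rule relies on the fact that a comparison with NaN is False, so coupons that were never redeemed fall into class 0 without special handling. A minimal sketch on toy intervals:

```python
import numpy as np
import pandas as pd

# Days from receipt to redemption; NaN means the coupon was never used
interval = pd.Series([3.0, 15.0, 16.0, np.nan])

# le() is an element-wise "<="; NaN compares as False and becomes 0
y = interval.le(15).astype(int)
print(y.tolist())
```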

5.2 Data Split: Sliding Time Windows

feature1=data_off[((data_off['Date_received']>='2016-01-01')&(data_off['Date_received']<='2016-04-30')) | ((data_off['Date']>='2016-01-01')&(data_off['Date']<='2016-04-30'))]
feature1.reset_index(drop=True,inplace=True)
database1=data_off[((data_off['Date_received']>='2016-05-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-05-01')&(data_off['Date']<='2016-05-31'))]
database1.reset_index(drop=True,inplace=True)
print('Jan-Apr rows: %i'%len(feature1))
print('May rows: %i'%len(database1))
feature2=data_off[((data_off['Date_received']>='2016-02-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-02-01')&(data_off['Date']<='2016-05-31'))]
feature2.reset_index(drop=True,inplace=True)
database2=data_off[((data_off['Date_received']>='2016-06-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-06-01')&(data_off['Date']<='2016-06-30'))]
database2.reset_index(drop=True,inplace=True)
print('Feb-May rows: %i'%len(feature2))
print('Jun rows: %i'%len(database2))
feature3=data_off[((data_off['Date_received']>='2016-03-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-03-01')&(data_off['Date']<='2016-06-30'))]
feature3.reset_index(drop=True,inplace=True)
database3=off_test
print('Mar-Jun rows: %i'%len(feature3))
print('Jul rows: %i'%len(database3))

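5.3 Feature Construction: Sliding-Window Data

Features are extracted separately from each window's data.

5.3.1 User Features

def user_feature(feature):
    all_users = feature['User_id']
    users = all_users.drop_duplicates()
    # 1. Number of purchases per user (merchants not de-duplicated)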
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum('Merchant_id')
    users_goods_nums.columns=['buy_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 2. Number of coupons received per user
    Coupon = feature[pd.notna(feature['Coupon_id'])][['User_id','Coupon_id']]
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum('Coupon_id')
    Coupon_num.columns = ['Coupon_get_num']
    users = pd.merge(users,Coupon_num,on='User_id',how='left')
    users['Coupon_get_num']=users['Coupon_get_num'].replace(np.nan,0)
    # 3. Number of purchases the user made with a coupon
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num = Used_Coupon.groupby(by='User_id').sum('Coupon_id')
    Used_Coupon_num.columns = ['Coupon_use_num']
    users = pd.merge(users,Used_Coupon_num,on='User_id',how='left')
    users['Coupon_use_num']=users['Coupon_use_num'].replace(np.nan,0)
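    # 4. Share of the user's purchases that used a coupon
    users['yqgmgl'] = users['Coupon_use_num']/users['buy_num']
    # 5. User redemption rate (coupons used / coupons received)
    users['Coupon_use_rate'] = users['Coupon_use_num']/users['Coupon_get_num']
    # 6. Number of coupons the user redeemed within 15 days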
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num15 = Used_Coupon.groupby(by='User_id').sum('Coupon_id')
    Used_Coupon_num15.columns = ['Coupon_use_num15']
    users = pd.merge(users,Used_Coupon_num15,on='User_id',how='left')
    users['Coupon_use_num15']=users['Coupon_use_num15'].replace(np.nan,0)
    # 7. User redemption rate within 15 days
    users['Coupon_use_rate15'] = users['Coupon_use_num15']/users['Coupon_get_num']
    # 8. Number of distinct merchants the user bought from (merchants de-duplicated)
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods = users_goods.drop_duplicates()
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum('Merchant_id')
    users_goods_nums.columns=['buy_merchant_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 9. Days from receipt to redemption (min / mean)
    get_user_date = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','date_interval']]
    min_interval = get_user_date.groupby('User_id').min('date_interval')
    min_interval.columns = ['user_min_interval']
    mean_interval = get_user_date.groupby('User_id').mean('date_interval')
    mean_interval.columns = ['user_mean_interval']
    users = pd.merge(users,min_interval,on='User_id',how='left')
    users = pd.merge(users,mean_interval,on='User_id',how='left')
    # 10. User-merchant distance for coupon purchases (max / min / mean)
    distance = feature[(pd.notna(feature.Distance))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Distance']]
    user_distance_max = distance.groupby(by='User_id').max('Distance')
    user_distance_max.columns = ['user_distance_max']
    user_distance_min = distance.groupby(by='User_id').min('Distance')
    user_distance_min.columns = ['user_distance_min']
    user_distance_mean = distance.groupby(by='User_id').mean('Distance')
    user_distance_mean.columns = ['user_distance_mean']
    users = pd.merge(users,user_distance_max,on='User_id',how='left')
    users = pd.merge(users,user_distance_mean,on='User_id',how='left')
    users = pd.merge(users,user_distance_min,on='User_id',how='left')
    
    # 11. Spend threshold of the user's coupons (mean / min / max); note that mk is not restricted to redeemed coupons
    mk = feature[pd.notna(feature['Discount_rate_mk'])][['User_id','Discount_rate_mk']]
    user_Discount_mk_mean =mk.groupby(by='User_id').mean('Discount_rate_mk')
    user_Discount_mk_mean.columns = ['user_Discount_mk_mean']
    users = pd.merge(users,user_Discount_mk_mean,on='User_id',how='left')
    
    # min() here: the original called mean() for the min and max columns as well
    user_Discount_mk_min =mk.groupby(by='User_id').min('Discount_rate_mk')
    user_Discount_mk_min.columns = ['user_Discount_mk_min']
    users = pd.merge(users,user_Discount_mk_min,on='User_id',how='left')
    
    user_Discount_mk_max =mk.groupby(by='User_id').max('Discount_rate_mk')
    user_Discount_mk_max.columns = ['user_Discount_mk_max']
    users = pd.merge(users,user_Discount_mk_max,on='User_id',how='left')
    
    users.buy_num =users.buy_num.replace(np.nan,0)
    users.buy_merchant_num =users.buy_merchant_num.replace(np.nan,0)
    
    return users
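The count features above are built by setting a column to 1, summing per group and merging. As a hedged alternative (not the article's code), the same per-user counts can be produced in one pass with boolean flags and pandas named aggregation. The toy rows and the intermediate column names `bought`, `got` and `used` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Merchant_id": [10, 10, 11, 10],
    "Date_received": pd.to_datetime(["2016-01-01", "2016-01-05", None, "2016-02-01"]),
    "Date": pd.to_datetime(["2016-01-03", None, "2016-01-20", None]),
})

# Boolean flags: summing them per user gives the counts directly
flags = toy.assign(
    bought=toy["Date"].notna(),
    got=toy["Date_received"].notna(),
    used=toy["Date"].notna() & toy["Date_received"].notna(),
)

users = flags.groupby("User_id").agg(
    buy_num=("bought", "sum"),
    Coupon_get_num=("got", "sum"),
    Coupon_use_num=("used", "sum"),
).reset_index()

# Redemption rate: coupons used / coupons received
users["Coupon_use_rate"] = users["Coupon_use_num"] / users["Coupon_get_num"]
print(users)
```

Because every user in the window appears in the grouped result, no follow-up merge or NaN-to-0 replacement is needed for these counts.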

5.3.2 Merchant Features

def Merchant_feature(feature):
    all_Merchants = feature['Merchant_id']
    Merchants = all_Merchants.drop_duplicates()
    # 1. Total number of purchases at the merchant
    Merchant_sale = feature[pd.notna(feature['Date'])][['Merchant_id']] 
    Merchant_sale['Merchant_sale_num'] = 1
    Merchant_sale_num = Merchant_sale.groupby(by='Merchant_id').sum('Merchant_sale_num')
    Merchants = pd.merge(Merchants,Merchant_sale_num,on='Merchant_id',how='left')
    Merchants['Merchant_sale_num']=Merchants['Merchant_sale_num'].replace(np.nan,0)
    # 2. Number of the merchant's coupons that were received
    Merchant_coupons = feature[pd.notna(feature['Date_received'])][['Merchant_id']]
    Merchant_coupons['Merchant_coupons_num'] = 1
    Merchant_coupons_num = Merchant_coupons.groupby(by='Merchant_id').sum('Merchant_coupons_num')
    Merchants = pd.merge(Merchants,Merchant_coupons_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_num']=Merchants['Merchant_coupons_num'].replace(np.nan,0)
    # 3. Number of purchases at the merchant made with a coupon
    Merchant_coupons_buy = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id']]
    Merchant_coupons_buy['Merchant_coupons_buy_num'] = 1
    Merchant_coupons_buy_num = Merchant_coupons_buy.groupby(by='Merchant_id').sum('Merchant_coupons_buy_num')
    Merchants = pd.merge(Merchants,Merchant_coupons_buy_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_buy_num']=Merchants['Merchant_coupons_buy_num'].replace(np.nan,0)
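    # 4. Share of the merchant's purchases that used a coupon
    Merchants['Merchant_user_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_sale_num']
    # 5. Merchant redemption rate (coupons used / coupons received)
    Merchants['Merchant_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_coupons_num']
    # 6. Redemptions within 15 days at the merchant, and the 15-day redemption rate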
    Merchant_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Coupon_id']]
    Merchant_Coupon['Coupon_id'] = 1
    Merchant_Coupon_num15 = Merchant_Coupon.groupby(by='Merchant_id').sum('Coupon_id')
    Merchant_Coupon_num15.columns = ['Merchant_Coupon_use_num15']
    Merchants = pd.merge(Merchants,Merchant_Coupon_num15,on='Merchant_id',how='left')
    Merchants['Merchant_Coupon_use_num15']=Merchants['Merchant_Coupon_use_num15'].replace(np.nan,0)
    Merchants['Merchant_Coupon_use_rate15'] = Merchants['Merchant_Coupon_use_num15']/Merchants['Merchant_coupons_num']
    # 7. Merchant-consumer distance for redeemed coupons (max / mean)
    Merchant_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Distance']]
    Merchant_distance_max = Merchant_distance.groupby(by='Merchant_id').max('Distance')
    Merchant_distance_max.columns = ['Merchant_distance_max']
    Merchant_distance_mean = Merchant_distance.groupby(by='Merchant_id').mean('Distance')
    Merchant_distance_mean.columns = ['Merchant_distance_mean']
    Merchants = pd.merge(Merchants,Merchant_distance_mean,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_distance_max,on='Merchant_id',how='left')
    # 8. Spend threshold of coupons used at the merchant (min / mean / max)
    Merchant_mk = feature[(pd.notna(feature['Discount_rate_mk']))&(pd.notna(feature['Date']))][['Discount_rate_mk','Merchant_id']]
    Merchant_mk_min = Merchant_mk.groupby(by='Merchant_id').min('Discount_rate_mk')
    Merchant_mk_min.columns = ['Merchant_mk_min']
    Merchant_mk_mean = Merchant_mk.groupby(by='Merchant_id').mean('Discount_rate_mk')
    Merchant_mk_mean.columns = ['Merchant_mk_mean']
    # max() here: the original called mean() for the max column
    Merchant_mk_max = Merchant_mk.groupby(by='Merchant_id').max('Discount_rate_mk')
    Merchant_mk_max.columns = ['Merchant_mk_max']
    Merchants = pd.merge(Merchants,Merchant_mk_min,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_max,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_mean,on='Merchant_id',how='left')
    # 9. Days from receipt to redemption for the merchant's coupons (min / mean)
    Merchant_interval = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','date_interval']]
    min_interval = Merchant_interval.groupby('Merchant_id').min('date_interval')
    min_interval.columns = ['Merchant_min_interval']
    mean_interval = Merchant_interval.groupby('Merchant_id').mean('date_interval')
    mean_interval.columns = ['Merchant_mean_interval']
    Merchants = pd.merge(Merchants,min_interval,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,mean_interval,on='Merchant_id',how='left')
    return Merchants

5.3.3 Coupon Features

def couponsType_feature(feature):
    all_coupons = feature[pd.notna(feature['Discount_rate'])]['Discount_rate']
    Coupons = all_coupons.drop_duplicates()
    # 1. Times each coupon type (Discount_rate value) was received
    Coupons_Type_get = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Discount_rate']))][['Discount_rate']]
    Coupons_Type_get['Coupons_Type_get_num'] = 1
    Coupons_Type_get_num = Coupons_Type_get.groupby(by='Discount_rate').sum('Coupons_Type_get_num')
    Coupons = pd.merge(Coupons,Coupons_Type_get_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_get_num']=Coupons['Coupons_Type_get_num'].replace(np.nan,0)
    # 2. Times each coupon type was used
    Coupons_Type_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Discount_rate']]
    Coupons_Type_use['Coupons_Type_use_num'] = 1
    Coupons_Type_use_num = Coupons_Type_use.groupby(by='Discount_rate').sum('Coupons_Type_use_num')
    Coupons = pd.merge(Coupons,Coupons_Type_use_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_use_num']=Coupons['Coupons_Type_use_num'].replace(np.nan,0)
    # 3. Redemption rate per coupon type
    Coupons['Coupons_Type_rate']=Coupons['Coupons_Type_use_num']/Coupons['Coupons_Type_get_num']
    # 4. Redemptions within 15 days per coupon type
    Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate']]
    Coupon15['Coupon15_use_num'] = 1
    Coupon15_num15 = Coupon15.groupby(by='Discount_rate').sum('Coupon15_use_num')
    Coupons = pd.merge(Coupons,Coupon15_num15,on='Discount_rate',how='left')
    # fill missing counts with 0, as is done for the other count features
    Coupons['Coupon15_use_num']=Coupons['Coupon15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate per coupon type
    Coupons['Coupons15_Type_rate']=Coupons['Coupon15_use_num']/Coupons['Coupons_Type_get_num']
    # 6. Distance at which each coupon type was used (max / mean)
    Coupon_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Distance']]
    Coupon_distance_max = Coupon_distance.groupby(by='Discount_rate').max('Distance')
    Coupon_distance_max.columns = ['Coupons_Type_distance_max']
    Coupon_distance_mean = Coupon_distance.groupby(by='Discount_rate').mean('Distance')
    Coupon_distance_mean.columns = ['Coupons_Type_distance_mean']
    Coupons = pd.merge(Coupons,Coupon_distance_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_distance_max,on='Discount_rate',how='left')
    # 7. Days from receipt to redemption per coupon type (mean / min)
    Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','date_interval']]
    Coupon_interval_min = Coupon_interval.groupby(by='Discount_rate').min('date_interval')
    Coupon_interval_min.columns = ['Coupons_Type_interval_min']
    Coupon_interval_mean = Coupon_interval.groupby(by='Discount_rate').mean('date_interval')
    Coupon_interval_mean.columns = ['Coupons_Type_interval_mean']
    Coupons = pd.merge(Coupons,Coupon_interval_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_interval_min,on='Discount_rate',how='left')
    return Coupons

5.3.4 User-Coupon Joint Features

def User_CouponsType_feature(feature):
    User_Coupons = feature[['User_id','Discount_rate']]
    User_Coupons = User_Coupons.drop_duplicates()
    # 1. Times the user received this coupon type
    User_CouponType_get = feature[pd.notna(feature['Date_received'])][['User_id','Discount_rate']]
    User_CouponType_get['User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum('User_CouponType_get_num')
    User_Coupons = pd.merge(User_Coupons,User_CouponType_get,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_get_num']=User_Coupons['User_CouponType_get_num'].replace(np.nan,0)
    # 2. Times the user used this coupon type
    User_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Discount_rate']]
    User_CouponType_use['User_CouponType_use_num'] = 1
    User_CouponType_use = User_CouponType_use.groupby(['User_id','Discount_rate']).sum('User_CouponType_use_num')
    User_Coupons = pd.merge(User_Coupons,User_CouponType_use,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_use_num']=User_Coupons['User_CouponType_use_num'].replace(np.nan,0)
    # 3. User redemption rate for this coupon type
    User_Coupons['User_Coupons_rate'] = User_Coupons['User_CouponType_use_num']/User_Coupons['User_CouponType_get_num']
    # 4. Redemptions within 15 days
    User_Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Discount_rate']]
    User_Coupon15['User_Coupon15_use_num'] = 1
    User_Coupon15_num15 = User_Coupon15.groupby(['User_id','Discount_rate']).sum('User_Coupon15_use_num')
    User_Coupons = pd.merge(User_Coupons,User_Coupon15_num15,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_Coupon15_use_num']=User_Coupons['User_Coupon15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    User_Coupons['User_Coupons_rate15'] = User_Coupons['User_Coupon15_use_num']/User_Coupons['User_CouponType_get_num']
    # 6. Days from receipt to redemption (min / mean)
    User_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Discount_rate','date_interval']]
    User_Coupon_interval_min = User_Coupon_interval.groupby(['Discount_rate','User_id']).min('date_interval')
    User_Coupon_interval_min.columns = ['User_Coupons_Type_interval_min']
    User_Coupon_interval_mean = User_Coupon_interval.groupby(['Discount_rate','User_id']).mean('date_interval')
    User_Coupon_interval_mean.columns = ['User_Coupons_Type_interval_mean']
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_mean,on=['Discount_rate','User_id'],how='left')
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_min,on=['Discount_rate','User_id'],how='left')
    return User_Coupons

5.3.5 User-Merchant Joint Features

def User_Merchants_feature(feature):
    User_Merchants = feature[['User_id','Merchant_id']]
    User_Merchants = User_Merchants.drop_duplicates()
    # 1. Number of purchases by the user at this merchant
    User_Merchant_buy = feature[pd.notna(feature['Date'])][['User_id','Merchant_id']]
    User_Merchant_buy['User_Merchant_buy_num'] = 1
    User_Merchant_buy = User_Merchant_buy.groupby(['User_id','Merchant_id']).sum('User_Merchant_buy_num')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_buy,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_buy_num']=User_Merchants['User_Merchant_buy_num'].replace(np.nan,0)
    # 2. Number of coupons the user received from this merchant
    User_Merchant_get = feature[pd.notna(feature['Date_received'])][['User_id','Merchant_id']]
    User_Merchant_get['User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum('User_Merchant_get_num')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_get_num']=User_Merchants['User_Merchant_get_num'].replace(np.nan,0)
    # 3. Number of coupons the user used at this merchant
    User_Merchant_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Merchant_id']]
    User_Merchant_use['User_Merchant_use_num'] = 1
    User_Merchant_use = User_Merchant_use.groupby(['User_id','Merchant_id']).sum('User_Merchant_use_num')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_use,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_use_num']=User_Merchants['User_Merchant_use_num'].replace(np.nan,0)
    # 4. User redemption rate at this merchant (coupons used / coupons received)
    User_Merchants['User_Merchants_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_get_num']
    # 5. Share of the user's purchases at this merchant that used a coupon
    User_Merchants['User_Merchants_user_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_buy_num']
    # 6. Redemptions within 15 days
    User_Merchant15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Merchant_id']]
    User_Merchant15['User_Merchant15_use_num'] = 1
    User_Merchant15_num15 = User_Merchant15.groupby(['User_id','Merchant_id']).sum('User_Merchant15_use_num')
    User_Merchants = pd.merge(User_Merchants,User_Merchant15_num15,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant15_use_num']=User_Merchants['User_Merchant15_use_num'].replace(np.nan,0)
    # 7. 15-day redemption rate
    User_Merchants['User_Merchant_rate15'] = User_Merchants['User_Merchant15_use_num']/User_Merchants['User_Merchant_get_num']
    # 8. Days from receipt to redemption (min / mean)
    User_Merchant_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Merchant_id','date_interval']]
    User_Merchant_interval_min = User_Merchant_interval.groupby(['User_id','Merchant_id']).min('date_interval')
    User_Merchant_interval_min.columns = ['User_Merchants_Type_interval_min']
    User_Merchant_interval_mean = User_Merchant_interval.groupby(['User_id','Merchant_id']).mean('date_interval')
    User_Merchant_interval_mean.columns = ['User_Merchants_Type_interval_mean']
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_mean,on=['User_id','Merchant_id'],how='left')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_min,on=['User_id','Merchant_id'],how='left')
    return User_Merchants

5.3.6 Merchant-Coupon Joint Features

def Merchants_CouponsType_feature(feature):
    Merchants_Coupons = feature[['Merchant_id','Discount_rate']]
    Merchants_Coupons = Merchants_Coupons.drop_duplicates()
    # 1. Times this coupon type of the merchant was received
    Merchants_CouponType_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate']]
    Merchants_CouponType_get['Merchants_CouponType_get_num'] = 1
    Merchants_CouponType_get = Merchants_CouponType_get.groupby(['Merchant_id','Discount_rate']).sum('Merchants_CouponType_get_num')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_get,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_get_num']=Merchants_Coupons['Merchants_CouponType_get_num'].replace(np.nan,0)
    # 2. Times this coupon type of the merchant was used
    Merchants_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate']]
    Merchants_CouponType_use['Merchants_CouponType_use_num'] = 1
    Merchants_CouponType_use = Merchants_CouponType_use.groupby(['Merchant_id','Discount_rate']).sum('Merchants_CouponType_use_num')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_use,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_use_num']=Merchants_Coupons['Merchants_CouponType_use_num'].replace(np.nan,0)
    # 3. Redemption rate of this coupon type at the merchant
    Merchants_Coupons['Merchants_Coupons_rate'] = Merchants_Coupons['Merchants_CouponType_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']
    
    # 4. Redemptions within 15 days
    Merchants_CouponsType15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate','Merchant_id']]
    Merchants_CouponsType15['Merchants_CouponsType15_use_num'] = 1
    Merchants_CouponsType15_num15 = Merchants_CouponsType15.groupby(['Discount_rate','Merchant_id']).sum('Merchants_CouponsType15_use_num')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponsType15_num15,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons['Merchants_CouponsType15_use_num']=Merchants_Coupons['Merchants_CouponsType15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    Merchants_Coupons['Merchants_CouponsType_rate15'] = Merchants_Coupons['Merchants_CouponsType15_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']
    # 6. Days from receipt to redemption (min / mean)
    Merchants_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Merchant_id','date_interval']]
    Merchants_Coupon_interval_min = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).min('date_interval')
    Merchants_Coupon_interval_min.columns = ['Merchants_Coupons_Type_interval_min']
    Merchants_Coupon_interval_mean = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).mean('date_interval')
    Merchants_Coupon_interval_mean.columns = ['Merchants_Coupons_Type_interval_mean']
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_mean,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_min,on=['Discount_rate','Merchant_id'],how='left')
    return Merchants_Coupons
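The joint-feature functions in this part all follow one pattern: pick a set of key columns, count receipts and redemptions per key combination, then divide. A sketch of that shared pattern as a single helper, assuming the column names used in this article; the name `group_counts` and its `prefix` argument are hypothetical and this is not the article's implementation.

```python
import pandas as pd

def group_counts(feature, keys, prefix):
    """Receipts, redemptions and redemption rate per combination of `keys`."""
    flags = feature.assign(
        _get=feature["Date_received"].notna(),
        _use=feature["Date_received"].notna() & feature["Date"].notna(),
    )
    out = flags.groupby(keys).agg(**{
        f"{prefix}_get_num": ("_get", "sum"),
        f"{prefix}_use_num": ("_use", "sum"),
    }).reset_index()
    # 0 receipts gives 0/0 = NaN, i.e. "rate undefined"
    out[f"{prefix}_rate"] = out[f"{prefix}_use_num"] / out[f"{prefix}_get_num"]
    return out

toy = pd.DataFrame({
    "User_id": [1, 1, 1],
    "Merchant_id": [10, 10, 11],
    "Date_received": pd.to_datetime(["2016-01-01", "2016-01-05", None]),
    "Date": pd.to_datetime(["2016-01-03", None, "2016-01-20"]),
})
out = group_counts(toy, ["User_id", "Merchant_id"], "User_Merchant")
print(out)
```

Passing `["User_id", "Discount_rate"]` or `["Merchant_id", "Discount_rate"]` as `keys` would give the corresponding counts for the other pairs.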

5.3.7 User-Merchant-Coupon Joint Features

def M_C_UType_feature(feature):
    M_C_U = feature[['Merchant_id','Discount_rate','User_id']]
    M_C_U = M_C_U.drop_duplicates()
    # 1. Times received, per user-merchant-coupon combination
    M_C_U_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate','User_id']]
    M_C_U_get['M_C_U_get_num'] = 1
    M_C_U_get = M_C_U_get.groupby(['Merchant_id','Discount_rate','User_id']).sum('M_C_U_get_num')
    M_C_U = pd.merge(M_C_U,M_C_U_get,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_get_num']=M_C_U['M_C_U_get_num'].replace(np.nan,0)
    # 2. Times used, per user-merchant-coupon combination
    M_C_U_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate','User_id']]
    M_C_U_use['M_C_U_use_num'] = 1
    M_C_U_use = M_C_U_use.groupby(['Merchant_id','Discount_rate','User_id']).sum('M_C_U_use_num')
    M_C_U = pd.merge(M_C_U,M_C_U_use,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_use_num']=M_C_U['M_C_U_use_num'].replace(np.nan,0)
    # 3. Redemption rate per user-merchant-coupon combination
    M_C_U['M_C_U_rate'] = M_C_U['M_C_U_use_num']/M_C_U['M_C_U_get_num']
    
    # 4. Redemptions within 15 days
    M_C_U15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Discount_rate','User_id']]
    M_C_U15['M_C_U15_use_num'] = 1
    M_C_U15_num15 = M_C_U15.groupby(['Merchant_id','Discount_rate','User_id']).sum('M_C_U15_use_num')
    M_C_U = pd.merge(M_C_U,M_C_U15_num15,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U15_use_num']=M_C_U['M_C_U15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    M_C_U['M_C_UType_rate15'] = M_C_U['M_C_U15_use_num']/M_C_U['M_C_U_get_num']
    
    # 6. Days from receipt to redemption (min / mean)
    M_C_U_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Discount_rate','User_id','date_interval']]
    M_C_U_interval_min = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).min('date_interval')
    M_C_U_interval_min.columns = ['M_C_Us_Type_interval_min']
    M_C_U_interval_mean = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).mean('date_interval')
    M_C_U_interval_mean.columns = ['M_C_Us_Type_interval_mean']
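    # merge back into M_C_U itself; the original assigned to M_C_Us, so the returned frame lacked both interval columns
    M_C_U = pd.merge(M_C_U,M_C_U_interval_mean,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U = pd.merge(M_C_U,M_C_U_interval_min,on=['Merchant_id','Discount_rate','User_id'],how='left')
    return M_C_U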
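
# Leakage features: counts computed on the label window itself (the month being predicted)
def leakage(database3):
    # 1. Number of coupons received per user (this column shares its name with the user-level Coupon_get_num, so a later merge suffixes the two as _x / _y)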
    Coupon = database3[pd.notna(database3['Coupon_id'])][['User_id','Coupon_id']]
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum('Coupon_id')
    Coupon_num.columns = ['Coupon_get_num']
    database3 = pd.merge(database3,Coupon_num,on=['User_id'],how='left')
    # 2. Number of coupons of each type the user received this month
    User_CouponType_get = database3[pd.notna(database3['Date_received'])][['User_id','Discount_rate']]
    User_CouponType_get['leakage_User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum('leakage_User_CouponType_get_num')
    database3 = pd.merge(database3,User_CouponType_get,on=['User_id','Discount_rate'],how='left')

    # 3. Number of coupons the user received from each merchant
    User_Merchant_get = database3[pd.notna(database3['Date_received'])][['User_id','Merchant_id']]
    User_Merchant_get['leakage_User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum('leakage_User_Merchant_get_num')
    database3 = pd.merge(database3,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    # 4. Number of coupons the user received on the same day
    Coupon_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received']]
    Coupon_day['leakage_Coupon_dayget_num'] = 1
    Coupon_num_day = Coupon_day.groupby(['User_id','Date_received']).sum('leakage_Coupon_dayget_num')
    database3 = pd.merge(database3,Coupon_num_day,on=['User_id','Date_received'],how='left')
    # 5. Number of coupons of the same type the user received on the same day
    Coupon_s_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received','Discount_rate']]
    Coupon_s_day['speleakage_Coupon_dayget_num'] = 1
    Coupon_num_s_day = Coupon_s_day.groupby(['User_id','Date_received','Discount_rate']).sum('speleakage_Coupon_dayget_num')
    database3 = pd.merge(database3,Coupon_num_s_day,on=['User_id','Date_received','Discount_rate'],how='left')
    # 6. For coupon types a user received more than once: latest/earliest receipt day, plus flags for whether this row is the first/last receipt (1 = yes, 0 = no, -1 = not applicable)
    lekge_user_SpeCouSum_maxday=database3[database3['leakage_User_CouponType_get_num']>1].groupby(['User_id','Discount_rate'])['Date_received'].max().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_maxday'})
    lekge_user_SpeCouSum_minday=database3[database3['leakage_User_CouponType_get_num']>1].groupby(['User_id','Discount_rate'])['Date_received'].min().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_minday'})
    database3=pd.merge(database3,lekge_user_SpeCouSum_maxday,how='left',on=['User_id','Discount_rate'])
    database3=pd.merge(database3,lekge_user_SpeCouSum_minday,how='left',on=['User_id','Discount_rate'])
    database3['lekge_user_SpeCou_ifirst']=(database3['Date_received']-database3['lekge_user_SpeCouSum_minday']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    database3['lekge_user_SpeCou_iflast']=(database3['lekge_user_SpeCouSum_maxday']-database3['Date_received']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    return database3
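The first/last-receipt flags can also be written with `groupby(...).transform`, which returns values aligned to the original rows and so avoids the intermediate merges. A minimal sketch on toy rows; the column names `type_get_num`, `is_first` and `is_last` are illustrative stand-ins for the `leakage_*` and `lekge_*` columns above.

```python
import pandas as pd

toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Discount_rate": ["100:10", "100:10", "0.9", "100:10"],
    "Date_received": pd.to_datetime(
        ["2016-07-01", "2016-07-10", "2016-07-10", "2016-07-03"]
    ),
})

grp = toy.groupby(["User_id", "Discount_rate"])["Date_received"]
counts = grp.transform("count")    # receipts of this coupon type by this user
first_day = grp.transform("min")   # earliest receipt in the group
last_day = grp.transform("max")    # latest receipt in the group

toy["type_get_num"] = counts
repeated = counts > 1
# 1 = first (or last) receipt, 0 = not, -1 = type received only once
toy["is_first"] = (toy["Date_received"] == first_day).astype(int).where(repeated, -1)
toy["is_last"] = (toy["Date_received"] == last_day).astype(int).where(repeated, -1)
print(toy)
```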

Merge all the features constructed above:

def feature_all(feature3,y):
    # user features
    users = user_feature(feature3)
    # merchant features
    Merchants = Merchant_feature(feature3)
    # coupon features
    Coupons_type = couponsType_feature(feature3)
    # user-merchant features
    User_Merchants = User_Merchants_feature(feature3)
    # user-coupon features
    User_CouponsType = User_CouponsType_feature(feature3)
    # merchant-coupon features
    Merchants_CouponsType = Merchants_CouponsType_feature(feature3)
    # leakage features are computed on the label window y, not on the history
    y = leakage(y)
    feature_final = pd.merge(y,users,on='User_id',how='left')
    feature_final = pd.merge(feature_final,Merchants,on='Merchant_id',how='left')
    feature_final = pd.merge(feature_final,Coupons_type,on='Discount_rate',how='left')
    feature_final = pd.merge(feature_final,User_Merchants,on=['User_id','Merchant_id'],how='left')
    feature_final = pd.merge(feature_final,User_CouponsType,on=['User_id','Discount_rate'],how='left')
    feature_final = pd.merge(feature_final,Merchants_CouponsType,on=['Merchant_id','Discount_rate'],how='left')
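    # Keep only rows that have a coupon (Discount_rate is not NaN); copy so the column assignments below act on an independent frame
    feature_final = feature_final[feature_final['Discount_rate'].notna()].copy()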
    feature_final['user_distance_max_interval'] = feature_final['Distance']-feature_final['user_distance_max']
    feature_final['user_distance_mean_interval'] = feature_final['Distance']-feature_final['user_distance_mean']
    feature_final['Merchant_distance_max_interval'] = feature_final['Distance']-feature_final['Merchant_distance_max']
    feature_final['Merchant_distance_mean_interval'] = feature_final['Distance']-feature_final['Merchant_distance_mean']
    feature_final['Coupons_Type_distance_mean_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_mean']
    feature_final['Coupons_Type_distance_max_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_max']
    feature_final['user_Discount_mk_mean_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_mean']
    feature_final['user_Discount_mk_min_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_min']
    feature_final['user_Discount_mk_max_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_max']
    
#     feature_final = feature_final.replace(np.nan,-99999)
    return feature_final
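
The pattern inside `feature_all` is: aggregate the historical window per key, left-merge the aggregates onto the samples to be scored, then subtract to get an "interval" feature. Below is a minimal sketch of that pattern on invented toy data. The frames `history` and `samples` are hypothetical stand-ins for the feature window and the label window, and only two of the aggregates are reproduced.

```python
import pandas as pd

# Historical window (feature side): distances of past redemptions. Invented values.
history = pd.DataFrame({"User_id": [1, 1, 2], "Distance": [1.0, 3.0, 5.0]})
# Label window (sample side): the coupons to score. User 3 has no history.
samples = pd.DataFrame({"User_id": [1, 2, 3], "Distance": [4.0, 5.0, 2.0]})

# Per-user aggregates computed from the historical window only.
user_stats = (
    history.groupby("User_id")["Distance"]
    .agg(user_distance_mean="mean", user_distance_max="max")
    .reset_index()
)

# A left merge keeps every sample row; users without history get NaN aggregates.
merged = samples.merge(user_stats, on="User_id", how="left")

# Interval feature: how far this coupon is relative to the user's usual distance.
merged["user_distance_mean_interval"] = (
    merged["Distance"] - merged["user_distance_mean"]
)

print(merged)
```

Rows for users with no history keep NaN in the aggregate and interval columns, which tree boosters such as xgboost can handle as missing values.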

Splitting the dataset

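# Build the feature tables: window 3 is the July set to predict, windows 1 and 2 are labeled
data3 = feature_all(feature3,database3)
data2 = feature_all(feature2,database2)
data1 = feature_all(feature1,database1)
data_train = pd.concat([data2,data1],axis =0)

y_train = data_train['Y'].values
# Drop identifiers, raw dates, the label and helper columns so that only model features remain
x_train1 = data_train.drop(columns=['date_interval','Discount_rate','Date_received','Date','User_id','Merchant_id','Coupon_id','Y','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_train = x_train1.values
print('%i features in total'%len(x_train1.columns))
from sklearn.model_selection import train_test_split
# Note: test_size=0.8 holds out 80% of the labeled rows, so the model is fit on the remaining 20%
(train_x,test_x,train_y,test_y)=train_test_split(x_train, y_train,test_size=0.8,random_state=0)
# The prediction set lacks the 'Date', 'date_interval' and 'Y' columns, so fewer columns are dropped here
x_pred1 = data3.drop(columns=['Discount_rate','Date_received','User_id','Merchant_id','Coupon_id','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_pred = x_pred1.values
print('%i features in total'%len(x_pred1.columns))

Output:
83 features in total
83 features in total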
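
Matching feature counts do not guarantee matching feature order, and `.values` discards the column names, so a reordered column would go unnoticed by the model. A minimal sketch of a check that could be run before calling `.values` follows. The helper `align_columns` and the demo frames are hypothetical and not part of the original code.

```python
import pandas as pd

def align_columns(train_df: pd.DataFrame, pred_df: pd.DataFrame) -> pd.DataFrame:
    """Return pred_df with its columns in train_df's order; raise if the sets differ."""
    missing = set(train_df.columns) - set(pred_df.columns)
    extra = set(pred_df.columns) - set(train_df.columns)
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return pred_df[list(train_df.columns)]

# Demo: same columns, different order.
train_demo = pd.DataFrame({"a": [1], "b": [2]})
pred_demo = pd.DataFrame({"b": [20], "a": [10]})
aligned = align_columns(train_demo, pred_demo)
print(list(aligned.columns))
```

In this article's setting the call would presumably look like `x_pred1 = align_columns(x_train1, x_pred1)`.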

6. Model Building

import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import plot_importance
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score,roc_curve,auc,recall_score,precision_score
from sklearn.preprocessing import MinMaxScaler
xgb_model = xgb.XGBClassifier(
 booster='gbtree',
 objective= 'binary:logistic', 
 eval_metric='auc',
 learning_rate =0.03, 
 n_estimators=1000, 
 max_depth=5, 
 min_child_weight=1.1, 
 gamma=0.1, 
 subsample=0.8, 
 colsample_bytree=0.8, 
 seed=10, 
 reg_alpha=0, 
 reg_lambda=0 
)
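xgb_model.fit(train_x,train_y)
# predict() returns hard 0/1 labels (for recall and precision); predict_proba() returns probabilities (for AUC)
test_pred = xgb_model.predict(test_x)
print('Recall of the xgboost model:',recall_score(test_y, test_pred))
print('Precision of the xgboost model:',precision_score(test_y, test_pred))
print('AUC of the xgboost model:',roc_auc_score(test_y, xgb_model.predict_proba(test_x)[:,1]))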
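
Recall and precision here are computed from hard labels, which for a binary classifier means an implicit 0.5 probability cut-off, whereas AUC depends only on how the probabilities rank the samples. The sketch below illustrates the difference on invented labels and scores; the arrays are hypothetical and no xgboost model is involved.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Invented labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.15])

# AUC uses the ranking of the probabilities and needs no threshold.
auc_value = roc_auc_score(y_true, proba)

# Recall and precision change with the cut-off applied to the probabilities.
scores = {}
for threshold in (0.5, 0.25):
    y_hat = (proba >= threshold).astype(int)
    scores[threshold] = (recall_score(y_true, y_hat), precision_score(y_true, y_hat))

print("AUC:", auc_value)
print("(recall, precision) by threshold:", scores)
```

With an imbalanced target such as coupon redemption, a low recall at the default cut-off can therefore coexist with a good AUC.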

7. Saving the Results

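# Re-read the raw test file for the key columns; this assumes its rows are in the same order as x_pred
data33= pd.read_csv('/项目准备/O2O优惠券使用预测/offline_test.csv')
y_pred= xgb_model.predict_proba(x_pred)
print(len(y_pred))
# Column 1 of predict_proba is the probability that the coupon is redeemed
a = pd.DataFrame(y_pred)[1].values
pred=pd.DataFrame({'User_id':data33['User_id'].values,"Coupon_id":data33['Coupon_id'].values,'Date_received':data33['Date_received'].values,'pred':a})
# Write the result without an index column or a header row
pred.to_csv('/项目准备/O2O优惠券使用预测/result16.csv',index=False,header=False)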
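
The code above pairs the predictions with keys re-read from the raw file, which only works if the row count and order are unchanged by feature construction. A slightly more defensive variant takes the keys from the same frame that produced the features and checks the lengths first. This is a minimal sketch: `build_submission` and the demo data are hypothetical, and in this article's setting the `scored` argument would presumably be `data3`.

```python
import numpy as np
import pandas as pd

def build_submission(scored: pd.DataFrame, proba: np.ndarray) -> pd.DataFrame:
    """Attach positive-class probabilities to the key columns of the scored rows."""
    if len(scored) != len(proba):
        raise ValueError(f"{len(scored)} rows but {len(proba)} predictions")
    out = scored[["User_id", "Coupon_id", "Date_received"]].reset_index(drop=True)
    out["pred"] = proba[:, 1]  # column 1 holds the probability of class 1
    return out

# Demo with invented rows and a fake predict_proba output.
demo_rows = pd.DataFrame({
    "User_id": [7, 8],
    "Coupon_id": [101, 102],
    "Date_received": [20160701, 20160702],
    "Distance": [1, 3],
})
demo_proba = np.array([[0.9, 0.1], [0.3, 0.7]])
submission = build_submission(demo_rows, demo_proba)
print(submission)
```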

After some simple parameter tuning, the submission scored 0.7882.

8. Takeaways

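When I first came across this dataset, I only planned to produce a data analysis report with Tableau. The competition turned out to be interesting, though, so I tried writing some code. Along the way I borrowed ideas from other bloggers, such as the sliding time window used to split the data, but many of the constructed features are still based on insights from my earlier data analysis report. I also learned a lot from this competition, for example: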

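  • Feature extraction is the essence of machine learning, and it tests your insight into the specific business. The earlier data analysis showed that distance, the coupon threshold and the day of the week are all key factors in coupon redemption, so constructing features from these indicators improves the model considerably.
  • Finding suitable features takes insight. For example, the feature "historical mean redemption distance of the coupon" on its own may have little effect on the prediction, whereas "distance for the coupon the user just received minus the historical mean redemption distance of the coupon" has a significant effect. Finding the right features therefore takes some business intuition.
  • Model tuning: parameter tuning takes a long time but does not change the model much. I initially spent a lot of time on it, yet after improving the features I found that feature engineering and the choice of prediction method usually matter more. With ample computing power, grid search can be used to explore the parameters, but it is very slow, so this post only does simple manual tuning; for the steps, see the blog post "xgboost参数调节的一般思路" (a general approach to xgboost parameter tuning).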
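
To make the cost argument in the last point concrete, the sketch below counts how many model fits a full grid search would need compared with tuning the parameters in stages. The grid values, the staging and the fold count are hypothetical and are not the settings used in this article; nothing is trained.

```python
from sklearn.model_selection import ParameterGrid

cv_folds = 5  # assumed number of cross-validation folds

# Hypothetical search space over common xgboost parameters.
grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
    "learning_rate": [0.03, 0.1],
}

# Full grid search: every combination is fitted once per fold.
n_full = len(ParameterGrid(grid)) * cv_folds

# Staged (manual) tuning: search a few related parameters at a time,
# fixing the best values before moving to the next stage.
stages = [
    {"max_depth": grid["max_depth"], "min_child_weight": grid["min_child_weight"]},
    {"subsample": grid["subsample"], "colsample_bytree": grid["colsample_bytree"]},
    {"learning_rate": grid["learning_rate"]},
]
n_staged = sum(len(ParameterGrid(stage)) for stage in stages) * cv_folds

print("fits for full grid search:", n_full)
print("fits for staged tuning:", n_staged)
```

Staged tuning can miss interactions between parameters from different stages, which is the price paid for the much smaller number of fits.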

Original article: https://blog.csdn.net/twlve/article/details/128609147
