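I want to calculate the average delay between transactions for different splits. I already have a solution, but I need to compute the delay with a different method.

The dataset looks like this: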

customer_id      transaction_date       type      sign     period  
    A               01/01/15              A         C     30 days    
    A               05/01/15              A         C     30 days    
    A               10/01/15              B         D     30 days    
    A               25/01/15              B         D     30 days    

transaction_data = structure(list(customer_id = c("A", "A", "A", "A"), 
transaction_date = c("01/01/15", 
"05/01/15", "10/01/15", "25/01/15"), type = c("A", "A", "B", 
"B"), sign = c("C", "C", "D", "D"), period = c("30 days", "30 days", 
"30 days", "30 days")), .Names = c("customer_id", "transaction_date", 
"type", "sign", "period"), row.names = c(NA, -4L), class = "data.frame")

Solution: the old method

What I used to do was first calculate the delay between subsequent transactions, like this:

# Delay between subsequent transactions
# (the seconds-to-days step below implies transaction_date is stored as a
# date-time, not as the character strings shown in the dput() output above)
library(data.table)
library(dplyr)  # provides mutate()
setDT(transaction_data)[,delay_in_transactions_days:= c(0, diff.Date(transaction_date)), .(customer_id)]

# Convert seconds to days
transaction_data <- mutate(transaction_data, delay_in_days = delay_in_transactions_days/86400)
# Convert to integer
transaction_data$delay_in_days <- as.integer(transaction_data$delay_in_days)

Then I compute the mean delay for each split with dcast:

dcast(setDT(transaction_data), customer_id ~ paste0("avg_delay_",period), value.var = "delay_in_days", mean)

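Problem: the new method

The new method I want to use computes the delay with the following equation:

For each customer: (latest transaction - first transaction) / (number of transactions - 1)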
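
As a quick check of the equation, here is a minimal base R sketch on the four sample dates. The variable names (`dates`, `span_days`, `avg_delay`, `gaps`) are illustrative only and do not appear in the original code. It also shows that, for these dates, the formula gives the same value as the mean of the gaps between consecutive transactions.

```r
# Sample dates from the question, parsed as day/month/two-digit year
dates <- as.Date(c("01/01/15", "05/01/15", "10/01/15", "25/01/15"),
                 format = "%d/%m/%y")

# (latest transaction - first transaction) / (number of transactions - 1)
span_days <- as.numeric(max(dates) - min(dates))  # 24 days between first and last
avg_delay <- span_days / (length(dates) - 1)      # 24 / 3 = 8

# Gaps between consecutive transactions: 4, 5 and 15 days
gaps <- as.numeric(diff(dates))

# The formula matches the mean of the consecutive gaps
all.equal(avg_delay, mean(gaps))
```

Unlike the old method, this does not count a leading 0 for the first transaction, which is why the two approaches can give different averages.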

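Of course, the problem is that the delay cannot simply be computed per period, because that would be the delay over all transactions. Instead, it needs to be computed per period for a specific type, sign, or combination of splits.

Any ideas on how to solve this?

Expected output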

customer_id   av.delay_30days  av.delay_30_days_TYPE_A  av.delay_30_days_TYPE_B

     A               8                   4                         15

1 Answer

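Please try the following approach, which uses dcast() and a join from the data.table package.

The formula given by the OP

(latest transaction - first transaction) / (number of transactions - 1)

is implemented as diff(range(transaction_date)) / (length(transaction_date) - 1L).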
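
The expression above can be wrapped in a small function to see how it behaves on its own. This is only a sketch: `avg_delay_days` is a hypothetical helper name, not part of data.table or base R, and returning `NA` for fewer than two transactions is an assumption added here, since the formula would otherwise divide by zero.

```r
# Hypothetical helper: average delay in days between the first and last date
avg_delay_days <- function(dates) {
  n <- length(dates)
  # Assumption: the delay is undefined when there are fewer than two transactions
  if (n < 2L) return(NA_real_)
  # diff(range(.)) is last minus first; convert the difftime to plain days
  as.numeric(diff(range(dates)), units = "days") / (n - 1L)
}

x <- as.Date(c("2015-01-01", "2015-01-05", "2015-01-10", "2015-01-25"))

all_dates  <- avg_delay_days(x)       # all four transactions
type_a     <- avg_delay_days(x[1:2])  # the two type A transactions
type_b     <- avg_delay_days(x[3:4])  # the two type B transactions
single_row <- avg_delay_days(x[1])    # one transaction only
```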

library(data.table)
setDT(transaction_data)

# coerce transaction_date to class Date
transaction_data[, transaction_date := lubridate::dmy(transaction_date)]

# compute average delay for each customer according to OP's formula
avg_dly_total <- transaction_data[
  , .(av.delay_30days = diff(range(transaction_date), units = "days") / (.N - 1L)), 
  by = customer_id]

avg_dly_total
#   customer_id av.delay_30days
#1:           A          8 days

# compute average delay by Type for each customer
avg_dly_type <- transaction_data[
  , .(av.delay_30days = diff(range(transaction_date), units = "days") / (.N - 1L)), 
  by = .(customer_id, type)]

avg_dly_type
#   customer_id type av.delay_30days
#1:           A    A          4 days
#2:           A    B         15 days

# cast type results from long to wide
value_var <- "av.delay_30days"
temp <- dcast(avg_dly_type, customer_id ~ paste0(value_var, "_TYPE_", type), 
              value.var = value_var)

temp
#   customer_id av.delay_30days_TYPE_A av.delay_30days_TYPE_B
#1:           A                 4 days                15 days

# join with totals
result <- avg_dly_total[temp, on = "customer_id"]

The final result looks almost identical to the expected output:

 result
#   customer_id av.delay_30days av.delay_30days_TYPE_A av.delay_30days_TYPE_B
#1:           A          8 days                 4 days                15 days
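
Since the question also mentions splitting by sign, the same pattern could be generalised to any split column. The following is a sketch under stated assumptions, not part of the original answer: `avg_delay_by` and its `prefix` argument are hypothetical, the sample data is rebuilt with base R date parsing instead of lubridate, and the delays are stored as plain numbers of days.

```r
library(data.table)

# Sample data from the question, with dates parsed as day/month/two-digit year
dt <- data.table(
  customer_id = c("A", "A", "A", "A"),
  transaction_date = as.Date(c("01/01/15", "05/01/15", "10/01/15", "25/01/15"),
                             format = "%d/%m/%y"),
  type = c("A", "A", "B", "B"),
  sign = c("C", "C", "D", "D")
)

# Hypothetical helper: average delay per customer and per level of split_col,
# returned in wide form with one column per level
avg_delay_by <- function(dt, split_col, prefix = "av.delay_30days") {
  long <- dt[, .(delay = as.numeric(diff(range(transaction_date)), units = "days") /
                   (.N - 1L)),
             by = c("customer_id", split_col)]
  # Build the wide column names, e.g. av.delay_30days_TYPE_A
  long[, col := paste0(prefix, "_", toupper(split_col), "_", get(split_col))]
  dcast(long, customer_id ~ col, value.var = "delay")
}

by_type <- avg_delay_by(dt, "type")
by_sign <- avg_delay_by(dt, "sign")

# Combine both splits into one row per customer
combined <- merge(by_type, by_sign, by = "customer_id")
```

Groups with a single transaction divide by zero here and would yield NaN, so they may need separate handling.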
answered 2017-04-18T10:15:27.143