I have a dataset, and I am trying to create a "session ID" based on the timestamps at which a certain event (in my case, a load) occurs.

My data:

userid  event  timestamp
xyz     load   '2016-12-01 08:21:13:000'
xyz     view   '2016-12-01 08:21:14:000'
xyz     view   '2016-12-01 08:21:16:000'
xyz     exit   '2016-12-01 08:21:17:000'
xyz     load   '2016-12-02 08:01:13:000'
xyz     view   '2016-12-02 08:01:16:000'
abc     load   '2016-12-01 08:11:13:000'
abc     view   '2016-12-01 08:11:14:000'

What I want is a new column called session_start_timestamp, in which each row is tagged with the timestamp of the most recent "load" event for that user.

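I know how to do this by creating a subset table (taking the minimum timestamp and self-joining), but is there a lag/lead/max/partition function that could do it instead?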

The final output should look like this:

userid  event  timestamp                  session_start_timestamp
xyz     load   '2016-12-01 08:21:13:000'  '2016-12-01 08:21:13:000'
xyz     view   '2016-12-01 08:21:14:000'  '2016-12-01 08:21:13:000'
xyz     view   '2016-12-01 08:21:16:000'  '2016-12-01 08:21:13:000'
xyz     exit   '2016-12-01 08:21:17:000'  '2016-12-01 08:21:13:000'
xyz     load   '2016-12-02 08:01:13:000'  '2016-12-02 08:01:13:000'
xyz     view   '2016-12-02 08:01:16:000'  '2016-12-02 08:01:13:000'
abc     load   '2016-12-01 08:11:13:000'  '2016-12-01 08:11:13:000'
abc     view   '2016-12-01 08:11:14:000'  '2016-12-01 08:11:13:000'
1 Answer

This is a gaps-and-islands problem:

SQL demo (PostgreSQL)

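  1. Compute the gap, or break point: flag every row where "event" is 'load'.
  2. Then use a cumulative SUM() of that flag to compute the groups.
  3. Then use MIN() to pick the start time of each group.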
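
To make the three steps concrete, here is a rough Python sketch of the same logic outside the database. It is only an illustration and not part of the SQL demo: the function name add_session_start and the tuple layout are made up for this example, and it assumes the timestamps are fixed-width strings, so plain string comparison orders them chronologically.

```python
from collections import defaultdict

# Sample rows from the question: (userid, event, timestamp).
rows = [
    ("xyz", "load", "2016-12-01 08:21:13:000"),
    ("xyz", "view", "2016-12-01 08:21:14:000"),
    ("xyz", "view", "2016-12-01 08:21:16:000"),
    ("xyz", "exit", "2016-12-01 08:21:17:000"),
    ("xyz", "load", "2016-12-02 08:01:13:000"),
    ("xyz", "view", "2016-12-02 08:01:16:000"),
    ("abc", "load", "2016-12-01 08:11:13:000"),
    ("abc", "view", "2016-12-01 08:11:14:000"),
]


def add_session_start(rows):
    """Hypothetical helper: append the session start timestamp to each row."""
    # Mirror PARTITION BY userid ORDER BY timestamp.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))

    # Steps 1 and 2: flag each 'load' row, then keep a running total per user.
    running = defaultdict(int)
    labelled = []
    for userid, event, ts in ordered:
        gap = 1 if event == "load" else 0
        running[userid] += gap
        labelled.append((userid, event, ts, running[userid]))

    # Step 3: earliest timestamp within each (userid, grp) island.
    start = {}
    for userid, _event, ts, grp in labelled:
        key = (userid, grp)
        start[key] = min(start.get(key, ts), ts)

    return [(u, e, ts, start[(u, g)]) for u, e, ts, g in labelled]


result = add_session_start(rows)
for row in result:
    print(*row)
```

Rows that appear before a user's first 'load' would all land in group 0 and share that group's earliest timestamp, which matches what the cumulative SUM() approach does in SQL.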

--

WITH gap as (
    SELECT *, CASE WHEN "event" = 'load' THEN 1 ELSE 0 END as gap
    FROM Table1
), island as (
    SELECT *, SUM(gap) OVER (PARTITION BY "userid" ORDER BY "timestamp" ) as grp
    FROM gap
)    
SELECT *, MIN("timestamp") OVER (PARTITION BY "userid", "grp") as new_timestamp
FROM island

You can merge the first two queries:

WITH island as (
    SELECT *, SUM (CASE WHEN "event" = 'load' THEN 1 ELSE 0 END ) 
              OVER (PARTITION BY "userid" ORDER BY "timestamp" ) as grp
    FROM Table1
)    
SELECT *, MIN("timestamp") OVER (PARTITION BY "userid", "grp") as new_timestamp
FROM island
Answered 2017-03-22T02:33:13.687