sql - TPC-DS Query 6: Why do we need the 'where j.i_category = i.i_category' condition?

Question

I am running TPC-DS on Amazon Athena.

Everything worked fine up through query 5.

I ran into a problem with query 6 (shown below).

select  a.ca_state state, count(*) cnt
 from customer_address a
     ,customer c
     ,store_sales s
     ,date_dim d
     ,item i
 where       a.ca_address_sk = c.c_current_addr_sk
    and c.c_customer_sk = s.ss_customer_sk
    and s.ss_sold_date_sk = d.d_date_sk
    and s.ss_item_sk = i.i_item_sk
    and d.d_month_seq = 
         (select distinct (d_month_seq)
          from date_dim
               where d_year = 2002
            and d_moy = 3 )
    and i.i_current_price > 1.2 * 
             (select avg(j.i_current_price) 
         from item j 
         where j.i_category = i.i_category)
 group by a.ca_state
 having count(*) >= 10
 order by cnt, a.ca_state 
 limit 100;

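It ran for more than 30 minutes and then failed with a timeout.

I tried to find the part causing the problem, so I went through the where conditions and narrowed it down to the last one, where j.i_category = i.i_category.

I don't know why this condition is needed, so I removed it, and the query ran fine.

Can you tell me why this part is needed?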

score 0 · Accepted Answer

j.i_category = i.i_category is the subquery's correlation condition. If you remove it from the subquery

select avg(j.i_current_price)
from item j
where j.i_category = i.i_category

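the subquery becomes uncorrelated: it turns into a global aggregate over the item table, which is cheap to compute and which the query engine needs to evaluate only once. With the condition in place, the average is computed per category, so each item's price is compared against the average price of its own category. Removing it therefore changes what the query returns, not just how fast it runs.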
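If the goal is to keep the per-category semantics while avoiding a correlated subquery, one option is to compute the category averages once and join against them. The following is a minimal sketch of that rewrite, not something taken from the TPC-DS kit: the names category_avg and avg_price are made up for illustration, and whether it actually runs faster on Athena is an assumption that would have to be checked against the real data.

```sql
-- Sketch only: TPC-DS query 6 with the correlated subquery replaced by a
-- join against per-category averages computed once.
-- category_avg and avg_price are illustrative names (assumptions).
with category_avg as (
    select i_category, avg(i_current_price) as avg_price
    from item
    group by i_category
)
select a.ca_state state, count(*) cnt
from customer_address a
    ,customer c
    ,store_sales s
    ,date_dim d
    ,item i
    ,category_avg ca
where a.ca_address_sk = c.c_current_addr_sk
  and c.c_customer_sk = s.ss_customer_sk
  and s.ss_sold_date_sk = d.d_date_sk
  and s.ss_item_sk = i.i_item_sk
  -- this join takes the place of the correlation condition
  and ca.i_category = i.i_category
  and d.d_month_seq =
      (select distinct d_month_seq
       from date_dim
       where d_year = 2002
         and d_moy = 3)
  and i.i_current_price > 1.2 * ca.avg_price
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
```

Items whose i_category is NULL are filtered out in both forms: in the original the correlated average is NULL for them, and here they simply find no match in category_avg.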

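If you want a fast, high-performance query engine on AWS, I can recommend Starburst Presto (disclaimer: I'm from Starburst). For a related comparison, see https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ (note: this is not a comparison against Athena).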

If you don't need it to be that fast, you can use PrestoSQL on EMR (note that the "PrestoSQL" and "Presto" components on EMR are not the same thing).
