
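I need to delete specific rows from a partitioned Hive table. The rows to delete match certain conditions, so I cannot simply drop entire partitions to do this. Suppose the table Table has three columns: partner, date and source_key, and that it is partitioned by date and source_key.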

It is well known that Hive does not support deleting or updating a specific set of records (see How to delete and update a record in Hive).

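Following this solution, I successfully executed the queries below in order to keep only the records that match some given conditions, for example: date within a given range, source_key = 'heaven' and partner <> 'angel'...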

Create a temporary empty copy of the table Table.

    CREATE TABLE IF NOT EXISTS tmpTable LIKE Table;

Fill it with the current rows.

    INSERT OVERWRITE TABLE tmpTable
    PARTITION (date,source_key)
    SELECT * FROM Table
    WHERE
    date >= '2020-05-01' AND date < '2020-11-30' AND
    source_key = 'heaven';

Drop the target partitions.

    ALTER TABLE Table DROP IF EXISTS
    PARTITION (source_key = 'heaven' , date >= '2020-05-01' , date < '2020-11-30' );

Insert the edited partitions into the target table. (I could not get INSERT OVERWRITE to work because of syntax errors.)

    INSERT INTO Table
    PARTITION (source_key,date)
    SELECT * FROM tmpTable
    WHERE
    partner <> 'angel';

Drop the temporary table.

    DROP TABLE IF EXISTS tmpTable;

The queries run fine. Because the table Table is managed, dropping the partitions should also remove the underlying HDFS files. Still, something is wrong (perhaps in the last INSERT INTO statement): after all these queries have run, the target table Table still contains every record with partner = 'angel' in the given date range and with source_key = 'heaven'; it basically stays the same.

Where is the fault? What is missing? How can I reliably delete specific rows matching certain conditions from such a Hive table?


1 Answer


Table partitions can be overwritten directly with a SELECT from the same table plus a WHERE filter. The scenario is quite simple and you do not need a temporary table. Make a backup table first if you are not sure what will happen.

  1. If you want to drop entire partitions (not overwrite them), execute:

    ALTER TABLE TableName DROP IF EXISTS
    PARTITION (<partition spec to be dropped>); --check partition spec to be dropped carefully
    

Skip this step if there are no partitions to drop.

  2. Overwrite the other partitions with the filtered rows:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.allow.move.on.s3=true; --If you are on Qubole/S3
    
    insert overwrite table TableName partition (date, source_key ) --partition spec should match table DDL
    select * from TableName 
     where <condition> --condition should be TRUE for the rows to keep (the rows that should NOT be deleted)
    

Your code is rather confusing because you created the temporary table using LIKE, but then used a different partition specification, PARTITION (source_key, date), while selecting * (which returns the columns in the same order as in the original table). The order of the columns should match exactly: the partition columns come last, in the same order as in the table DDL.
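
Applied to the table from the question, the overwrite could look roughly like the sketch below. It assumes the DDL declares PARTITIONED BY (date, source_key) in that order and that partner is the only non-partition column; the explicit column list stands in for `SELECT *` so that the column order is visible. The backticks are an assumption too: `Table` and `date` may be treated as reserved words depending on the Hive version and settings.

```sql
-- Sketch only: table, column names and filter values are taken from the question.
-- Assumption: the table is PARTITIONED BY (`date`, source_key), in that order.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE `Table` PARTITION (`date`, source_key)
SELECT
  partner,      -- regular column(s) first
  `date`,       -- partition columns last, same order as in the PARTITION clause
  source_key
FROM `Table`
WHERE `date` >= '2020-05-01' AND `date` < '2020-11-30'
  AND source_key = 'heaven'
  AND partner <> 'angel';   -- rows to keep; note this also filters out rows where partner IS NULL

-- Optional check: count the rows that should be gone by now.
SELECT COUNT(*) AS remaining_angel_rows
FROM `Table`
WHERE `date` >= '2020-05-01' AND `date` < '2020-11-30'
  AND source_key = 'heaven'
  AND partner = 'angel';
```

A dynamic-partition overwrite only replaces the partitions for which the SELECT returns rows. A partition in which every row has partner = 'angel' produces no rows and is left untouched, so it has to be dropped as in step 1; a non-zero count from the check query would point to such partitions. If rows with a NULL partner should survive, the filter would need to be `(partner <> 'angel' OR partner IS NULL)`.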

Answered 2021-04-30T07:44:29.303