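I need to delete specific rows from a partitioned Hive table. The rows to delete match certain conditions, so I cannot simply drop entire partitions. Suppose the table Table has three columns: partner, date and source_key, and it is partitioned by date and source_key.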
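It is known that Hive does not support deleting or updating a specific set of records in a regular (non-transactional) table (see How to delete and update a record in Hive).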
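Following this solution, I successfully executed the queries below in order to keep only the records that match some given criteria, say: date within a given range, source_key='heaven' and partner<>'angel':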
Create a temporary empty copy of the table Table.
CREATE TABLE IF NOT EXISTS tmpTable LIKE Table;
Fill it with the current rows.
INSERT OVERWRITE TABLE tmpTable
PARTITION (date,source_key)
SELECT * FROM Table
WHERE
date >= '2020-05-01' AND date < '2020-11-30' AND
source_key = 'heaven';
Drop the target partitions.
ALTER TABLE Table DROP IF EXISTS
PARTITION (source_key = 'heaven' , date >= '2020-05-01' , date < '2020-11-30' );
Insert the edited partitions into the target table. (I could not get INSERT OVERWRITE to work because of syntax errors.)
INSERT INTO Table
PARTITION (source_key,date)
SELECT * FROM tmpTable
WHERE
partner <> 'angel';
Drop the temporary table.
DROP TABLE IF EXISTS tmpTable;
The queries run fine. Because the table Table is managed, dropping the partitions should also delete the underlying HDFS files. Something is wrong, though (perhaps in the last INSERT INTO statement): after all these queries have run, the target table Table still contains all records with partner = 'angel' in the given date range and with the given source_key, so it basically stays the same.
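To narrow down which statement misbehaves, a check along the following lines could be run once right after the DROP PARTITION statement and again after the final INSERT INTO. This is only a diagnostic sketch: it reuses the table name, columns and literal values from the queries above, and it quotes Table and date with backticks on the assumption that they are treated as reserved words by the Hive version in use.

```sql
-- Count the rows that were supposed to be removed.
-- After a successful delete this should return 0.
SELECT COUNT(*) AS remaining_angel_rows
FROM `Table`
WHERE partner = 'angel'
  AND source_key = 'heaven'
  AND `date` >= '2020-05-01'
  AND `date` < '2020-11-30';

-- List the partitions that still exist for this source_key.
-- Immediately after the DROP PARTITION statement, no partition
-- inside the date range should show up here.
SHOW PARTITIONS `Table` PARTITION (source_key = 'heaven');
```

If partitions inside the date range are still listed straight after the drop, the problem would be in the ALTER TABLE statement rather than in the final INSERT INTO.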
Where is the fault? What is missing? How to accurately delete specific rows matching certain conditions for such a Hive table?