I have three tables:

grade         (grade_id,       grade_value,       grade_date)     ~100M rows
grade_archive (grade_id,       grade_value,       grade_date)         0 rows
peer_review   (grade_id, peer_review_value, peer_review_date)      ~10M rows

I want to move every row from the grade table into grade_archive if it is more than a month old and has no matching row in the peer_review table.

These tables are in active use, so any insert must run at low priority to avoid disrupting existing and new processes while it runs.

When it is done, the expected row counts should look like this:

grade          ~10M rows
grade_archive  ~90M rows
peer_review    ~10M rows

I think it is close to:

INSERT
    LOW_PRIORITY
    INTO grade_archive
        (grade_id,grade_value,grade_date)
    SELECT
        grade_id,grade_value,grade_date
    FROM
        grade
    WHERE
            grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
        AND grade_id NOT IN
            (
                SELECT grade_id FROM peer_review
            );

Then clean up the grade table by deleting every row that now exists in the archive table:

DELETE LOW_PRIORITY FROM grade WHERE grade_id IN (SELECT grade_id FROM grade_archive);

But these subselects are very slow on large tables, and I am nervous about the result. I am looking for a better direction.

2 Answers

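I have run into a similar problem in the past when migrating part of the data from a large, active table into an archive table. The approach I used (modified for your use case) is as follows: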

/* Set time for calculation basis */
SET @calc_time = NOW();
/* Create empty copy of grade table */
CREATE TABLE grade_temp LIKE grade;
/* Add rows you want to save from grade into temp table */
INSERT INTO grade_temp
SELECT
    g.grade_id AS grade_id,
    g.grade_value AS grade_value,
    g.grade_date AS grade_date
FROM grade AS g
LEFT JOIN peer_review AS pr
  ON g.grade_id = pr.grade_id
WHERE
/*
To keep the record it must either have an entry in peer review
or it is less than a month old
*/
    pr.grade_id IS NOT NULL
    OR g.grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/*
Switch new temp table for active table.
This happens really fast (it is just file name switching on the system).
*/
RENAME TABLE grade TO grade_old, grade_temp TO grade;
/*
You are now taking new records into new version of grade table
and free to do your much slower operations against the grade_old table
*/
/* Delete more recent rows */
DELETE FROM grade_old
WHERE grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/*
Delete rows that exist in peer review.
No date filter is needed here: peer_review has no grade_date column,
and the previous DELETE already removed every recent row.
*/
DELETE FROM grade_old
WHERE grade_id IN (
    SELECT grade_id
    FROM peer_review
);
/*
As an alternate to the above action, you could also try deleting across join as shown below. Which is faster will likely depend upon number of records that are returned from that subquery shown above. You can try both out and see what works best
*/
DELETE go FROM grade_old AS go
INNER JOIN peer_review AS pr
  ON go.grade_id = pr.grade_id
WHERE go.grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Add all rows from grade_old to grade_archive */
INSERT INTO grade_archive
SELECT
    grade_id,
    grade_value,
    grade_date
FROM grade_old;
/* Drop grade_old table */
DROP TABLE grade_old;

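The key here is to get a new version of the grade table, containing only the rows you need, in place as soon as you can, and then sort out what goes into the archive table after the fact. You do not want to run any bulk delete against a live table of that size. This keeps the time that the grade table is tied up by these archive operations to a minimum.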
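
Since the question mentions being nervous about the result, a few counts run before the final DROP TABLE can help confirm that nothing went missing. This is only a sketch: it assumes the same session as the script above (so @calc_time is still set), and the column aliases are names chosen here for illustration.

```sql
-- Rows left in the live table (recent rows plus peer-reviewed rows).
SELECT COUNT(*) AS live_rows FROM grade;

-- Rows copied into the archive.
SELECT COUNT(*) AS archived_rows FROM grade_archive;

-- Archived rows that are not older than the cutoff; expected to be 0.
SELECT COUNT(*) AS archived_too_recent
FROM grade_archive
WHERE grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);

-- Archived rows that have a peer review; expected to be 0,
-- unless a review was added after the archive step ran.
SELECT COUNT(*) AS archived_with_review
FROM grade_archive AS a
INNER JOIN peer_review AS pr
  ON pr.grade_id = a.grade_id;
```

If live_rows and archived_rows do not add up to roughly the original size of grade, it is safer to keep grade_old around until the difference is explained.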

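That said, your database schema looks like it could be optimized for this kind of operation. For example, you could keep a peer-review flag on the grade table and filter on it directly, rather than filtering through a join. I actually question whether the peer_review table is needed at all, unless it has a many-to-one relationship with the grade table (which your question does not seem to indicate). If there is only one peer review entry per grade_id, I think those columns should be normalized into the grade table. That would greatly simplify this maintenance process.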
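
The flag idea could be sketched roughly as follows. The column name has_peer_review is hypothetical, and the sketch assumes the application (or a trigger) sets the flag whenever a peer_review row is inserted. Note that the ALTER TABLE itself is a heavy operation on a table of this size, and that grade_archive would need the same column if rows are copied without an explicit column list.

```sql
-- Hypothetical flag column: 0 = no peer review, 1 = at least one peer review.
ALTER TABLE grade
    ADD COLUMN has_peer_review TINYINT(1) NOT NULL DEFAULT 0;

-- One-time backfill from the existing peer_review table.
UPDATE grade AS g
INNER JOIN peer_review AS pr
  ON pr.grade_id = g.grade_id
SET g.has_peer_review = 1;

-- Archive candidates can then be selected without touching peer_review.
SELECT grade_id, grade_value, grade_date
FROM grade
WHERE has_peer_review = 0
  AND grade_date < DATE_SUB(NOW(), INTERVAL 1 MONTH);
```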

answered 2015-03-02T19:26:32.733

Because NOT IN ( SELECT ... ) is very slow, use LEFT JOIN .. IS NULL to get the same effect:

SELECT  g.grade_id, g.grade_value, g.grade_date
    FROM  grade AS g
    LEFT JOIN  peer_review AS p USING(grade_id)
    WHERE  g.grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
      AND  p.grade_id IS NULL;

No explicit tmp table is needed.
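
Plugged into the statements from the question, that SELECT might be used as in the sketch below. It assumes MySQL, an index on grade_id in all three tables, and that grade_archive starts out empty as described in the question. LOW_PRIORITY is kept from the question, but it only has an effect on storage engines that use table-level locking, such as MyISAM.

```sql
-- Copy old grades that have no peer review into the archive.
INSERT LOW_PRIORITY INTO grade_archive (grade_id, grade_value, grade_date)
SELECT g.grade_id, g.grade_value, g.grade_date
FROM grade AS g
LEFT JOIN peer_review AS p USING (grade_id)
WHERE g.grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
  AND p.grade_id IS NULL;

-- Remove from grade only the rows that actually reached the archive,
-- using a join instead of the IN ( SELECT ... ) subquery.
DELETE LOW_PRIORITY g
FROM grade AS g
INNER JOIN grade_archive AS a
  ON a.grade_id = g.grade_id;
```

Deleting by joining against grade_archive, rather than repeating the date and peer-review conditions, means a row is removed from grade only after its copy exists in the archive.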

answered 2015-03-05T04:16:31.787