python - Efficiently check for duplicates when uploading batch data with pymongo

Question

I have a MongoDB collection that is updated sequentially with batches of DataFrames:

print(batch_df_0)

id      date     shop  product
1   28/10/2021    1     apple
2   28/10/2021    2     apple
3   28/10/2021    3     apple

##################
# MongoDB Update #
##################

print(batch_df_1)

id      date     shop  product
1   28/10/2021    1     apple # not to be uploaded, since already in DB
1   29/10/2021    1     apple # OK
1   29/10/2021    1     banana # OK, since product is not key
10  29/10/2021    1     apple # OK

1   29/10/2021    2     banana # OK
1   29/10/2021    3     apple # OK

print(batch_df_1_to_be_updated)

id      date     shop  product
1   29/10/2021    1     apple
1   29/10/2021    1     banana
10  29/10/2021    1     apple
1   29/10/2021    2     banana
1   29/10/2021    3     apple

##################
# MongoDB Update #
##################

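I want to make sure I never upload the same row twice (for example, the row 1 28/10/2021 1 apple from batch_df_1, which already exists in batch_df_0), with "id", "date" and "shop" as the keys the DB should use to detect duplicates.

So far, I have tried setting a compound index: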

compound_index = [('id', 1), ('date', 1), ('shop', 1)]
collection.create_index(compound_index, unique=True)

insert_result = collection.insert_many(batch_df_1.to_dict("records"))

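However, the upload stops as soon as a duplicate is found.

Is there an efficient way to check every row of the DataFrame for duplicates without aborting the upload of the whole DataFrame?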

score 0 · Accepted Answer

Pass ordered=False to insert_many so that it keeps inserting the remaining documents and raises the exception only at the end.

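from pymongo.errors import BulkWriteError

try:
    insert_result = collection.insert_many(batch_df_1.to_dict("records"), ordered=False)
except BulkWriteError:
    # Raised only after every document has been attempted;
    # documents rejected by the unique index are simply not inserted
    pass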
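
Ignoring the exception wholesale also hides write failures that have nothing to do with duplicates. As a rough sketch, the details attribute of BulkWriteError can be inspected so that only duplicate-key errors (server error code 11000) are skipped. The helper names non_duplicate_errors and insert_skipping_duplicates are made up for this illustration and are not part of pymongo:

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB server error code for a unique-index violation


def non_duplicate_errors(details):
    """Return the write errors from BulkWriteError.details that are NOT duplicate-key errors."""
    return [
        err
        for err in details.get("writeErrors", [])
        if err.get("code") != DUPLICATE_KEY_CODE
    ]


def insert_skipping_duplicates(collection, records):
    """Hypothetical helper: unordered insert that tolerates duplicates only.

    Returns the number of documents actually inserted.
    """
    # Imported here so the pure helper above stays usable without pymongo installed
    from pymongo.errors import BulkWriteError

    if not records:
        return 0  # insert_many raises on an empty list
    try:
        result = collection.insert_many(records, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        if non_duplicate_errors(exc.details):
            raise  # something other than a duplicate went wrong
        return exc.details["nInserted"]
```

It would be called as insert_skipping_duplicates(collection, batch_df_1.to_dict("records")). Note that with a unique index on id, date and shop, two rows in the same batch that share those three values also count as duplicates of each other, so only one of them is inserted.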
