
I want to call a custom Python function on an existing attribute of every document in a collection and store the result as a new key-value pair in that same document. Is there a way to do that, given that each call is independent of the others?

I noticed cursor.forEach, but can't this be done efficiently using just Python?

A simple example would be to split the string in text and store the number of words as a new attribute.

def split_count(text):
    # some complex preprocessing...

    return len(text.split())

# Need something like this...
db.collection.update_many({}, {'$set': {"split": split_count('$text') }}, upsert=True)

But it seems that setting a new attribute in a document based on the value of another attribute in the same document is not yet possible this way. This post is old, but the issues still seem to be open.


2 Answers


I found a way to call any custom Python function on a collection using parallel_scan in PyMongo.

import threading

# `db` is assumed to be an existing pymongo Database handle.
# Note: parallel_scan was removed in PyMongo 4.0, so this needs an older driver.

def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()

        # Write the results back to the same document, matched by _id.
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text)}},
                                 upsert=True)


def preprocess(num_threads=4):
    # Get up to 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,)) for cursor in cursors]

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

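This isn't really faster than cursor.forEach (though it isn't much slower either), but it lets me run arbitrarily complex Python code and save the results from Python itself.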

Also, if one of the attributes holds an array of ints, cursor.forEach converts them to floats, which I don't want. So I prefer this approach.

But I'd be glad to know if there's a better way than this :)

answered 2016-06-24T08:37:47.753

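Doing this kind of thing in Python is unlikely to be efficient. That's because every document has to make a round trip to the client machine and pass through the Python function there.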
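If the work has to stay in Python, the write half of that round trip can at least be batched. The following is only a sketch under some assumptions: `split_count` stands in for the question's preprocessing, `build_update`, `batched` and `update_in_batches` are hypothetical helper names, and `collection` is assumed to be a PyMongo collection whose documents all have a string `text` field. It sends one `bulk_write` per batch instead of one update per document.

```python
from itertools import islice


def split_count(text):
    # Stand-in for the "complex preprocessing" from the question.
    return len(text.split())


def build_update(row):
    """Return (filter, update) for one document, as plain dicts."""
    return {'_id': row['_id']}, {'$set': {'split': split_count(row['text'])}}


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def update_in_batches(collection, batch_size=1000):
    """Send one bulk_write per batch instead of one update per document."""
    # Imported here so the helpers above can be tried without pymongo installed.
    from pymongo import UpdateOne

    # Only fetch the field we need; _id is included by default.
    rows = collection.find({}, {'text': 1})
    for batch in batched(rows, batch_size):
        ops = [UpdateOne(*build_update(row)) for row in batch]
        collection.bulk_write(ops, ordered=False)
```

The documents still travel to the client once, so this does not remove the cost described above; it only cuts the number of write requests. The batch size of 1000 is an arbitrary starting point, not a tuned value.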

In your example code, you pass the function's result to the MongoDB update query, which won't work. You can't run any Python code inside a MongoDB query on the database server.
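To see why, it may help to build the update document from the question on its own, without a database. In this small sketch Python calls `split_count` on the literal string `'$text'` before anything is sent, so the server would only ever receive a constant.

```python
def split_count(text):
    # Same word count as in the question.
    return len(text.split())


# Python evaluates split_count('$text') right here, on the client.
# '$text' is just a five-character string at this point, not a field reference.
update = {'$set': {'split': split_count('$text')}}

print(update)  # {'$set': {'split': 1}}
```

Every matched document would therefore get `split` set to 1, regardless of what its `text` field contains.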

As the answer to the question you linked suggests, this type of operation has to be performed in the mongo shell. For example:

db.collection.find().snapshot().forEach(
    function (elem) {
        // Count the space-separated words in the text field.
        var splitLength = elem.text.split(" ").length;
        db.collection.update(
            { _id: elem._id },
            { $set: { split: splitLength } }
        );
    }
);
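For this particular word count there may also be a way to stay in PyMongo and still avoid the round trip. MongoDB 4.2 and later accept an aggregation pipeline as the update, and a pipeline stage can read other fields of the same document. The sketch below makes several assumptions: the server and driver are recent enough to accept a pipeline update, `word_count_stage` and `add_word_counts` are hypothetical helper names, and splitting on a single space is good enough (it is closer to the shell's `split(" ")` than to Python's `str.split()`, which collapses runs of whitespace). It does not help with arbitrary Python preprocessing, only with logic that can be expressed in aggregation operators.

```python
def word_count_stage(src='text', dst='split'):
    """Build a $set stage storing the number of space-separated pieces of `src` in `dst`."""
    return {'$set': {dst: {'$size': {'$split': ['$' + src, ' ']}}}}


def add_word_counts(collection):
    # Passing a list as the update tells the server to treat it as a pipeline.
    # The filter skips documents whose text is missing or not a string,
    # because $size would fail on the null that $split returns for them.
    return collection.update_many({'text': {'$type': 'string'}},
                                  [word_count_stage()])
```

Calling `add_word_counts(db.collection)` would then run entirely on the server in a single request.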
answered 2016-06-13T22:21:19.723