python - PyMongo: cursor iteration

Question

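I've recently started testing MongoDB through the shell and through PyMongo. I've noticed that returning a cursor and then iterating over it seems to be the bottleneck in the actual iteration. Is there a way to return more than one document at a time during iteration?

Pseudocode: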

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        (deal with single entry each time)

What I'm hoping to do is something like this:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        (deal with all entries at once rather than iterate each time)

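I've tried using batch_size() as suggested in this question, raising the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).

Any help is greatly appreciated. Please go easy on this Mongo newbie!

-- Edit --

Thanks, Caleb. I think you've pinpointed what I was really trying to ask: is there any way to run some kind of collection.findAll() or cursor.fetchAll() command, as with the cx_Oracle module? The problem isn't storing the data, but retrieving it from Mongo DB as fast as possible.

As far as I can tell, the speed at which the data is returned to me is dictated by my network, since Mongo has to fetch each record individually. Is that correct?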

score 17 · Accepted Answer

Have you considered an approach like this:

for line in file:
  value = line[a:b]
  cursor = collection.find({"field": value})
  # slicing a PyMongo cursor returns a cursor, not a list, so use list()
  entries = list(cursor)  # or use a loop or comprehension; just get all the docs
  # then process entries as a list, either singly or in batch

Or, similarly:

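entries = {}
for line in file:  # same loop start as above
  value = line[a:b]
  entries[value] = list(collection.find({"field": value}))
# after the loop, every cursor has been fully read and is out of scope
for value in entries:
  docs = entries[value]  # process docs, either singly or in batch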

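Basically, as long as you have enough RAM to hold your result sets, you should be able to pull them off the cursors and hold on to them before processing. This probably won't be significantly faster, but it removes any slowdown that is specific to the cursors, and it frees you to process the data in parallel if you're set up for that.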
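To make the idea concrete, here is a minimal sketch of the "drain every cursor first, process later" pattern. The function name fetch_all_by_value and the field parameter are made up for illustration; the only thing assumed about collection is that it has a find(query) method returning an iterable of documents, as a PyMongo collection does.

```python
def fetch_all_by_value(collection, values, field="field"):
    """Return {value: [documents]} with every cursor fully read into a list.

    `collection` is assumed to behave like a PyMongo collection, i.e. it
    exposes find(query) and returns something iterable. The function name
    and the `field` parameter are hypothetical, not part of PyMongo.
    """
    entries = {}
    for value in values:
        if value in entries:
            continue  # the same value was already queried; skip the round trip
        # list() iterates the cursor to the end, so nothing lazy is left over
        entries[value] = list(collection.find({field: value}))
    return entries
```

With a real collection this would be called roughly as fetch_all_by_value(collection, [line[a:b] for line in file]). It still issues one query per distinct value, so it mainly helps by separating fetching from processing, not by reducing round trips.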

score 15 · Accepted Answer

You could also try:

results = list(collection.find({'field':value}))

That should load everything into RAM.

Or perhaps this, if your file is not too large:

values = list()
for line in file:
    values.append(line[a:b])
results = list(collection.find({'field': {'$in': values}}))
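One thing the single $in query loses is which documents belong to which value. A rough sketch of regrouping them afterwards follows; find_grouped and its parameters are hypothetical names, and it assumes the queried field holds a plain scalar (string or number) in every document, not an array.

```python
def find_grouped(collection, values, field="field"):
    """Run one $in query and regroup the documents by the queried field.

    Assumptions (not from the original answer): `collection` has a
    PyMongo-style find(query), and `field` stores a hashable scalar in
    every matching document. The function name is made up.
    """
    unique_values = list(dict.fromkeys(values))  # drop duplicates, keep order
    grouped = {value: [] for value in unique_values}
    for doc in collection.find({field: {"$in": unique_values}}):
        grouped[doc[field]].append(doc)
    return grouped
```

Values with no matching documents map to an empty list. For a very long list of values it is probably safer to split it into several smaller $in queries, since a query document is subject to MongoDB's document size limit.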

score 2 · Accepted Answer

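toArray() might be a solution. According to the documentation, it first iterates over the entire cursor on the Mongo side and returns the results only once, as an array.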

http://docs.mongodb.org/manual/reference/method/cursor.toArray/

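This is different from list(coll.find()) or [doc for doc in coll.find()], which fetch one document at a time into Python, then go back to Mongo for the next one.

However, this method is not implemented in PyMongo... strange.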

score -1 · Accepted Answer

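As @jmelesky mentioned above, I always follow the same approach. Here is my sample code. To hold the contents of my cursor twts_result, I declare a list below and copy the documents into it. Use RAM to store the data if you can. This avoids the cursor timeout problem, provided you don't need to process and update the same collection you are reading from.

Here, I am fetching tweets from a collection.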

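# maindb is an existing PyMongo database object
twts_result = maindb.economy_geolocation.find({}, {'_id': False})

tweets_sentiment = []
batch_tweets = []
# Copy the cursor data into a list (list() reads the cursor to the end)
tweets_collection = list(twts_result)
# Count from the list; Cursor.count() was removed in PyMongo 4
print("Tweets for processing -> %d" % len(tweets_collection))
for twt in tweets_collection:
    pass  # do stuff here with the twt data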
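If the result set does not fit in RAM, a middle ground is to pull documents off the cursor in fixed-size batches, which is close to the "deal with many entries at once" loop the question asks for. This is only a sketch: iter_batches is a made-up helper, and it works on any iterable, which includes a PyMongo cursor.

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items taken from any iterable.

    Hypothetical helper, not part of PyMongo. A cursor can be passed
    directly because cursors are iterable.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice takes the next batch_size items without reading further
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch
```

Usage would look something like: for batch in iter_batches(collection.find({"field": value}), 500): process(batch), where process stands in for whatever you do with a list of documents. Only one batch is held in memory at a time, although the cursor stays open for the whole loop, so the timeout concern mentioned above still applies to slow processing.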
