1

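I want to filter two lists in a Python script as fast as possible. I used the built-in filter() function for this, but it is slow and takes too much time because my lists are very large: I think each one has more than 5 million items, possibly more. I don't know how to approach this. If anyone has an idea, or can write a small function for it, please share.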

4

6 Answers

15

Maybe your lists are too large and do not fit in memory, and you experience thrashing. If the sources are in files, you do not need the whole list in memory all at once. Try using itertools, e.g.:

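# Python 2: ifilter is the lazy counterpart of filter()
from itertools import ifilter

def is_important(s):
    # Keep only lines longer than 10 characters (the trailing newline counts)
    return len(s) > 10

# Iterating over the file object yields one line at a time
filtered_list = ifilter(is_important, open('mylist.txt'))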

Note that ifilter returns an iterator that is fast and memory efficient.
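
For readers on Python 3, where itertools.ifilter no longer exists, the built-in filter() already returns a lazy iterator. The following is only a minimal sketch of the same idea under that assumption; the io.StringIO object and its sample lines are made up to stand in for open('mylist.txt'), and important_lines is a hypothetical helper name.

```python
import io

def is_important(line):
    # Same predicate as in the answer: keep lines longer than 10 characters.
    # The trailing newline counts toward the length.
    return len(line) > 10

def important_lines(lines):
    # In Python 3, filter() is lazy: nothing is read or tested until the
    # result is iterated, and no intermediate list is built.
    return filter(is_important, lines)

# Hypothetical sample data standing in for a large file on disk.
sample = io.StringIO(
    "short\nthis line is long enough\ntiny\nanother long line here\n"
)

result = [line.rstrip("\n") for line in important_lines(sample)]
print(result)
```

Only the final list comprehension materializes anything; with a real file you would normally loop over important_lines(f) inside a with open(...) block and process each line as it arrives.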

Generator Tricks is a tutorial by David M. Beazley that teaches some interesting uses for generators.

answered 2008-10-14T10:18:45.160
5

You'll be happier if you can avoid creating the list in the first place.

Instead of

aBigList = someListMakingFunction()
filter( lambda x:x>10, aBigList )

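you might want to look at the function that builds the list.

def someListMakingGenerator( some_source ):
    # some_source is a placeholder for whatever iterable produces the items
    for x in some_source:
        yield x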

Then your filter won't involve a large amount of memory:

def myFilter( aGenerator ):
    for x in aGenerator:
        if x > 10: 
            yield x

By using generators, you don't keep much in memory at any one time.
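
To illustrate how the two generators chain together, here is a small sketch. The names make_numbers and greater_than, and the choice of a million integers as the source, are assumptions for the example, not part of the original question.

```python
def make_numbers(n):
    # Hypothetical source: yields 0..n-1 one at a time instead of
    # returning a list with n elements.
    for x in range(n):
        yield x

def greater_than(values, threshold):
    # Generator filter: passes along only the items above the threshold.
    for x in values:
        if x > threshold:
            yield x

# Chaining the generators builds a pipeline; nothing runs until it is consumed.
pipeline = greater_than(make_numbers(1_000_000), 10)

# Counting the survivors pulls items through one by one, so only a single
# item is held in memory at any moment.
count = sum(1 for _ in pipeline)
print(count)
```

A generator can only be consumed once, so if the filtered items are needed again the pipeline has to be rebuilt or the results stored.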

answered 2008-10-14T12:50:25.213
2

filter() will create a new list, so if your original is very big, you could end up using up to twice as much memory. If you only need to process the results iteratively, rather than use them as a real random-access list, you are probably better off using ifilter instead, i.e.:

for x in itertools.ifilter(condition_func, my_really_big_list):
    do_something_with(x)

Another speed tip is to use a Python builtin as the predicate, rather than a function you write yourself. There's an itertools.ifilterfalse specifically for the case where you would otherwise need to introduce a lambda to negate your check (e.g. "ifilter(lambda x: not x.isalpha(), l)" should be written "ifilterfalse(str.isalpha, l)").
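
In Python 3 the same function is spelled itertools.filterfalse. As a rough sketch, with a made-up list of words, the two spellings below select the same items, but the first avoids a Python-level lambda call per element:

```python
from itertools import filterfalse

# Hypothetical sample data.
words = ["abc", "123", "x1", "hello", ""]

# Builtin method as predicate; filterfalse keeps items where it returns False.
non_alpha = list(filterfalse(str.isalpha, words))

# Equivalent result using a lambda to negate the check.
with_lambda = list(filter(lambda s: not s.isalpha(), words))

print(non_alpha)
```

Note that str.isalpha returns False for the empty string, so it is kept by both versions.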

answered 2008-10-14T10:24:46.800
2

It may be useful to know that a conditional list comprehension is usually much faster than the corresponding lambda:

>>> import timeit
>>> timeit.Timer('[x for x in xrange(10) if (x**2 % 4) == 1]').timeit()
2.0544309616088867
>>> timeit.f = lambda x: (x**2 % 4) == 1
>>> timeit.Timer('[x for x in xrange(10) if f(x)]').timeit()
3.4280929565429688

(I don't know why I needed to put f in the timeit namespace there. I haven't really used this module before.)
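
On the namespace question: the statement string is compiled and run inside timeit's own namespace, not the caller's, which is most likely why f was not visible until it was attached to the module. The usual ways to expose a name are the setup argument or, since Python 3.5, the globals argument. The sketch below assumes Python 3 (range instead of xrange) and uses a reduced iteration count to keep it quick; it makes no claim about what numbers you will see.

```python
import timeit

def f(x):
    # Same predicate as in the answer, as a named function.
    return (x**2 % 4) == 1

N = 10000  # far fewer runs than timeit's default of one million

# Condition written inline in the comprehension.
inline = timeit.timeit(
    '[x for x in range(10) if (x**2 % 4) == 1]', number=N)

# Condition behind a function call; globals= makes f visible to the statement.
via_call = timeit.timeit(
    '[x for x in range(10) if f(x)]', globals={'f': f}, number=N)

# The filter() equivalent, wrapped in list() so the work is actually done.
via_filter = timeit.timeit(
    'list(filter(f, range(10)))', globals={'f': f}, number=N)

print(inline, via_call, via_filter)
```

The extra function call per element is what the original measurement is showing; how large the gap is depends on the interpreter version and the predicate.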

answered 2008-10-14T12:49:35.840
2

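I guess filter() is about as fast as you can get without writing the filtering function in C (and in that case, you would be better off writing the whole filtering process in C).

Why don't you paste the function you are filtering with? That might lead to easier optimizations.

Read this article about optimization in Python, and this one about the Python/C API.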

answered 2008-10-14T08:39:16.123
1

Before doing it in C, you could try numpy. Perhaps you can turn your filtering into number crunching.
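
As a rough illustration of what "number crunching" could look like, assuming the items are numeric and NumPy is installed, a boolean mask applies the condition to the whole array in compiled code instead of calling a Python function per element. The data and the threshold here are made up.

```python
import numpy as np

# Hypothetical numeric data; in practice this would be the large list,
# ideally loaded straight into an array rather than converted from a list.
values = np.arange(20)

# The comparison runs element-wise in C and yields a boolean array.
mask = values > 10

# Indexing with the mask selects only the elements where it is True.
filtered = values[mask]

print(filtered.tolist())
```

This only helps when the predicate can be expressed with array operations; arbitrary Python callables or string-heavy checks may not translate.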

answered 2008-10-14T10:19:48.060