Rosalind - GC skew takes too long to run

Question

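I am currently writing a script that analyzes the GC skew of a sequence. My problem is that as the string gets longer, the running time becomes so long that I cannot get an answer.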

def SkewGC(file):
    countG = 0
    countC = 0
    diffGtoC = ""
    # first, we need to find number of G's.
    # the idea is, if G appears, we add it to the count.
    # We'll just do the same to each one.
    for pos in range(0,len(file)):
        if file[pos] == "G":
            countG = countG+1
        if file[pos] == "C":
            countC = countC+1
        diffGtoC = diffGtoC + str(countG-countC) + ","
    return diffGtoC.split(",")

SkewGCArray = SkewGC(data)
# This because I included extra "," at the end...
SkewGCArray = [int(i) for i in SkewGCArray[:len(SkewGCArray)-1]]

def min_locator(file):
    min_indices = ""
    for pos in range(0,len(file)):
        if file[pos] == min(file):
            min_indices = min_indices + str(pos) + " "
    return min_indices

print min_locator(SkewGCArray)

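Essentially, this script counts the Gs and Cs (nucleotides in DNA), computes the difference at each position, and then I try to find the indices of the minimum. It works for short inputs, but once the input string gets long, even around 90,000+ characters, the script runs but does not produce an answer in a reasonable time (about 4-5 minutes).

Can anyone point out what I could do to make it faster? I have considered whether it would be better to take the difference (diffGtoC), set it as the minimum, and then recompute each difference until it sees something different, replacing the minimum along the way.

My concern with this approach is finding and keeping the indices of the minimum. Say I have an array with the values: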

[-4,-2,-5,-6,-5,-6]

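I can see that updating the minimum as I go (-4 to -5, then to -6) would be faster in terms of running time, but how would I keep track of the positions of -6? Not sure if this makes complete sense.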

score 0 · Accepted Answer

A few suggestions to improve the performance of your code:

diffGtoC = diffGtoC + str(countG-countC) + ","
    return diffGtoC.split(",")

is effectively equivalent to:

diffGtoC = list()
diffGtoC.append(countG - countC)

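Strings are immutable in Python, so you are creating a new string for every position, which is not very efficient. Using a list also saves you the int-to-str conversion (and the conversion back to int afterwards) as well as the truncation of the list. If you do end up with an extra item at the end, you can also use pop() to remove the last item of the list instead of building a new list.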
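As a minimal sketch of the list-based version (the names skew_gc and sequence are mine, not from your script), something along these lines should avoid the string building entirely:

```python
def skew_gc(sequence):
    """Return the running (G count - C count) after each position of sequence."""
    count_g = 0
    count_c = 0
    skew = []  # a list of ints: no str() conversion, no trailing "," to trim
    for base in sequence:
        if base == "G":
            count_g += 1
        elif base == "C":
            count_c += 1
        # Appending to a list does not copy the values already stored,
        # unlike string concatenation, which builds a new string each time.
        skew.append(count_g - count_c)
    return skew


print(skew_gc("CCGGA"))  # [-1, -2, -1, 0, 0]
```

The result is already a list of integers, so the follow-up list comprehension that strips the last element and calls int() is no longer needed.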

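A very simple alternative is to search for the minimum and store only the minimum value and its position. Then iterate starting from that position and check whether the minimum occurs again; if it does, append that position to the first one. Less data manipulation saves both time and memory.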
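Here is a rough one-pass variant of that idea, which also shows how to keep the positions of -6 in your example. The function name min_positions is hypothetical, and positions are 0-based indices into the list, the same as in your min_locator. Note that your min_locator calls min(file) inside the loop, so the whole list is rescanned for every position; tracking the minimum as you go avoids that.

```python
def min_positions(values):
    """Return (minimum value, list of 0-based indices where it occurs)."""
    current_min = None
    positions = []
    for pos, value in enumerate(values):
        if current_min is None or value < current_min:
            # New, smaller minimum: forget the positions of the old one.
            current_min = value
            positions = [pos]
        elif value == current_min:
            # Same minimum seen again: remember this position too.
            positions.append(pos)
    return current_min, positions


print(min_positions([-4, -2, -5, -6, -5, -6]))  # (-6, [3, 5])
```

Each value is looked at once, so the work grows linearly with the length of the input instead of quadratically.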
