
I have this data from sequencing a bacterial community. I know some basic Python and am working through the Codecademy tutorials. For practical purposes, treat OTU as another word for "species".

Here is a sample of the raw data:

OTU ID   OTU Sum Lineage
591820   1083    k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752   517     k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456   346     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248   330     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284   321     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__

The data consists of three things: a reference number for the species, the number of times that species occurs in the sample, and the taxonomy of said species.

What I'm trying to do is add up all the counts for each taxonomic family (designated by f__ in the data).

Here is an example of the desired output:

f__Fusobacteriaceae 1600
f__Alcaligenaceae  676
f__Comamonadaceae  321

This isn't for a class. I started learning Python a few months ago, so I'm at least able to look up any suggestions. I know how to do it the slow way (copying and pasting in Excel), so this is for future reference.
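
For reference, a compact pandas sketch of the same aggregation (the answers below use only the standard library); it assumes the table is tab-separated and saved as otu_table.txt, with the column names taken from the header shown above:

import pandas as pd

# Assumed file name and tab-separated layout; adjust to the real file.
df = pd.read_csv("otu_table.txt", sep="\t")
# Pull the f__Family token out of the semicolon-separated lineage string.
df["family"] = df["Lineage"].str.extract(r"(f__\w+)", expand=False)
# Sum the OTU counts per family.
print(df.groupby("family")["OTU Sum"].sum())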


4 Answers


If the lines in the file really look like this, you can do it like so:

from collections import defaultdict
import re
nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split at most twice, on any whitespace
        if items[0].isdigit():  # Skip the header row, whose first field is not a number
            key = re.search(r"f__\w+", items[2]).group(0)  # e.g. "f__Fusobacteriaceae"
            nums[key] += int(items[1])

Result:

>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600, 
'f__Alcaligenaceae': 676})
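
One caveat with that regex lookup: if a row has no f__ entry at all, re.search() returns None and .group(0) raises an AttributeError. A guarded variant of the two inner lines, as a sketch:

match = re.search(r"f__\w+", items[2])
if match:  # skip rows with no family-level annotation
    nums[match.group(0)] += int(items[1])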
Answered 2014-03-14T21:17:13.210

Another solution, using collections.Counter:

from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()

        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)

            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]

            # Extract family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]

            # Increase count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print "%s %s" % (family, count)

See the str.split() documentation for details on the None argument (it makes the split happen on any run of consecutive whitespace).
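
For instance, passing None collapses any run of spaces or tabs, and the maxsplit of 2 keeps the semicolon-separated lineage together as the third item:

>>> "591820   1083\tk__Bacteria; p__Fusobacteria".split(None, 2)
['591820', '1083', 'k__Bacteria; p__Fusobacteria']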

Answered 2014-03-14T21:25:38.097

Take all the raw data and process it first, by which I mean give it some structure; then perform whatever operations you want on the structured data. If you have gigabytes of data, you could use Elasticsearch: feed it your raw data, query for f_*, fetch all the matching entries, and add them up.
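
A minimal sketch of that idea in plain Python (leaving Elasticsearch aside), assuming the table is saved as data.txt: first parse every row into a small record, then run whatever aggregation you like over the structured records:

import re

# Step 1: structure the raw rows into records.
records = []
with open("data.txt") as f:
    next(f)  # skip the header line
    for line in f:
        if not line.strip():
            continue  # ignore blank lines
        otu_id, otu_sum, lineage = line.split(None, 2)
        match = re.search(r"f__\w+", lineage)
        records.append({
            "otu_id": otu_id,
            "count": int(otu_sum),
            "family": match.group(0) if match else None,
        })

# Step 2: query the structured records, here the per-family totals.
totals = {}
for record in records:
    if record["family"]:
        totals[record["family"]] = totals.get(record["family"], 0) + record["count"]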

Answered 2014-03-14T21:12:17.687

This is quite doable with basic Python. Keep the library reference under your pillow, because you'll be referring to it often.

You might end up doing something like this (I'll write it the longer, more readable way; there are ways to compress the code and do it faster).

# Open up a file handle
file_handle = open('myfile.txt')
# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()

    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()

    # Get the first column
    reference_number = line_parts[0]

    # Get the second column, convert it to an integer
    sum = int(line_parts[1])

    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue
        # remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')
        if taxonomy in sums:  # dict.has_key() is gone in Python 3; use `in` instead
            sums[taxonomy] += sum
        else:
            sums[taxonomy] = sum

# All done, do some fancy reporting.  We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))
Answered 2014-03-14T21:22:07.767