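I want to run pairwise comparisons between all protein sequences contained in a FASTA file and then print the percent sequence similarity (either as an average or for each pair individually). I think I need to use itertools to create all the combinations, align each pair, and then probably divide the number of matches by the aligned sequence length to get the % sequence similarity, but I'm having trouble writing the actual script to do this, preferably in Biopython if possible. Any help is appreciated.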

1 Answer

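My answer does not involve Biopython, but since no other answer has been posted yet, I'll post it anyway:

Biotite (https://www.biotite-python.org/), a bioinformatics package I am currently developing, would solve your problem with the following script: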

import numpy as np
import biotite
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.sequence.align as align
import biotite.database.entrez as entrez


# 5 example sequences (bacterial luciferase variants)
uids = [
    'Q7N575', 'P19839', 'P09140', 'P07740', 'P24113'
]
# Download these sequences as one file from NCBI
file_name = entrez.fetch_single_file(
    uids, biotite.temp_file("fasta"), db_name="protein", ret_type="fasta"
)

# Read each sequence in the file as 'ProteinSequence' object
fasta_file = fasta.FastaFile()
fasta_file.read(file_name)
sequences = list(fasta.get_sequences(fasta_file).values())

# Standard protein substitution matrix (BLOSUM62)
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
# Matrix that will be filled with pairwise sequence identities;
# the diagonal stays 1 because each sequence is identical to itself
identities = np.ones((len(sequences), len(sequences)))
# Iterate over all unordered pairs of sequences (j < i)
for i in range(len(sequences)):
    for j in range(i):
        # Align sequences pairwise; align_optimal() returns a list of
        # optimal alignments, of which the first one is used
        alignment = align.align_optimal(
            sequences[i], sequences[j], substitution_matrix
        )[0]
        # Calculate the pairwise sequence identity and fill the matrix symmetrically
        identity = align.get_sequence_identity(alignment)
        identities[i, j] = identity
        identities[j, i] = identity

print(identities)

Output:

[[1.         0.97214485 0.62921348 0.84225352 0.59776536]
 [0.97214485 1.         0.62359551 0.85352113 0.60055866]
 [0.62921348 0.62359551 1.         0.61126761 0.85393258]
 [0.84225352 0.85352113 0.61126761 1.         0.59383754]
 [0.59776536 0.60055866 0.85393258 0.59383754 1.        ]]
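
To get from this matrix to the per-pair percentages or the average that the question asks about, something like the following sketch could be used. It is plain Python with itertools and does not depend on Biotite. The helper name `pairwise_percent` and the `labels` list are made up for illustration, and the labels assume that the sequences come back in the same order as `uids`. The matrix values are fractions, so they are multiplied by 100.

```python
from itertools import combinations
from statistics import mean

# Identity matrix copied from the output above (fractions between 0 and 1)
identities = [
    [1.0,        0.97214485, 0.62921348, 0.84225352, 0.59776536],
    [0.97214485, 1.0,        0.62359551, 0.85352113, 0.60055866],
    [0.62921348, 0.62359551, 1.0,        0.61126761, 0.85393258],
    [0.84225352, 0.85352113, 0.61126761, 1.0,        0.59383754],
    [0.59776536, 0.60055866, 0.85393258, 0.59383754, 1.0],
]
# Assumption: the downloaded file keeps the order of the `uids` list
labels = ["Q7N575", "P19839", "P09140", "P07740", "P24113"]


def pairwise_percent(matrix, names):
    """Return {(name_a, name_b): percent identity} for each unordered pair."""
    return {
        (names[i], names[j]): 100.0 * matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }


# Individual values, one line per pair
percents = pairwise_percent(identities, labels)
for (name_a, name_b), value in percents.items():
    print(f"{name_a} vs {name_b}: {value:.1f} %")

# Average over the off-diagonal pairs only (self-comparisons are excluded)
mean_percent = mean(percents.values())
print(f"Mean pairwise identity: {mean_percent:.1f} %")
```

Since the helper only uses `matrix[i][j]` indexing, it should also accept the NumPy array from the script directly instead of the copied list.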
Answered 2019-12-16T15:56:13.213