biopython - How can I run multiple pairwise alignments from a FASTA file and print the percent similarity?

Question

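I want to run pairwise comparisons between every protein sequence contained in a FASTA file and then print the percent sequence similarity, either per pair or as an average. I think I need itertools to create all the combinations, align each pair, and then divide the number of matches by the aligned sequence length to get the % sequence similarity, but I am having trouble writing the actual script that does this, preferably in Biopython. Any help is appreciated.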
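The calculation described above (all combinations, then matches divided by aligned length) can be sketched with the standard library alone. This is a minimal illustration, assuming the sequences have already been aligned and are given as gapped strings of equal length; the alignment step itself is left out. The helper `percent_identity` and the toy sequences are hypothetical, not part of any library.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both rows hold the same residue.

    Columns where both rows contain a gap ('-') are ignored, because they
    carry no information about this particular pair.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical, already-aligned toy sequences ('-' marks a gap)
aligned = {
    "seq1": "MKT-AYIAK",
    "seq2": "MKTQAYLAK",
    "seq3": "MRT-AYIGK",
}

# One entry per unordered pair of sequence names
pairwise = {
    (name_a, name_b): percent_identity(aligned[name_a], aligned[name_b])
    for name_a, name_b in combinations(aligned, 2)
}

for (name_a, name_b), pct in pairwise.items():
    print(f"{name_a} vs {name_b}: {pct:.1f}%")

average = sum(pairwise.values()) / len(pairwise)
print(f"Average: {average:.1f}%")
```

Note that this counts identical residues, so strictly speaking it measures sequence identity. A similarity measure would also count substitutions that score positively in a matrix such as BLOSUM62.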

score 1 · Accepted Answer

My answer does not use Biopython, but since no other answer has been posted yet, I will post it anyway:

Biotite (https://www.biotite-python.org/), a bioinformatics package I am currently developing, can solve your problem with the following script:

import os
import tempfile

import numpy as np
import biotite.sequence.io.fasta as fasta
import biotite.sequence.align as align
import biotite.database.entrez as entrez


# 5 example sequences (bacterial luciferase variants)
uids = [
    'Q7N575', 'P19839', 'P09140', 'P07740', 'P24113'
]
# Download these sequences as one FASTA file from the NCBI protein database.
# Recent Biotite releases no longer provide biotite.temp_file(),
# so the target path is built with the standard library instead.
file_name = entrez.fetch_single_file(
    uids, os.path.join(tempfile.gettempdir(), "luciferase_variants.fasta"),
    db_name="protein", ret_type="fasta"
)

# Read the file; in recent Biotite releases read() is a class method
# that returns the FastaFile object
fasta_file = fasta.FastaFile.read(file_name)
# Convert each entry into a 'ProteinSequence' object
sequences = list(fasta.get_sequences(fasta_file).values())

# BLOSUM62
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
# Matrix that will be filled with pairwise sequence identities
identities = np.ones((len(sequences), len(sequences)))
# Iterate over sequences
for i in range(len(sequences)):
    for j in range(i):
        # Align sequences pairwise
        alignment = align.align_optimal(
            sequences[i], sequences[j], substitution_matrix
        )[0]
        # Calculate pairwise sequence identities and fill matrix
        identity = align.get_sequence_identity(alignment)
        identities[i,j] = identity
        identities[j,i] = identity

print(identities)

Output:

[[1.         0.97214485 0.62921348 0.84225352 0.59776536]
 [0.97214485 1.         0.62359551 0.85352113 0.60055866]
 [0.62921348 0.62359551 1.         0.61126761 0.85393258]
 [0.84225352 0.85352113 0.61126761 1.         0.59383754]
 [0.59776536 0.60055866 0.85393258 0.59383754 1.        ]]
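The matrix holds identity fractions between 0 and 1, and each pair appears twice because the matrix is symmetric. To report percentages per pair and as an average, as the question asks, each unordered pair can be visited once with itertools. The following is a minimal sketch: the values are copied from the output above into a plain nested list so that the snippet runs on its own, and the names `pair_percent` and `mean_percent` are only illustrative. The same indexing also works on the NumPy array `identities` from the script.

```python
from itertools import combinations

uids = ['Q7N575', 'P19839', 'P09140', 'P07740', 'P24113']

# Identity fractions copied from the printed matrix above
identities = [
    [1.0,        0.97214485, 0.62921348, 0.84225352, 0.59776536],
    [0.97214485, 1.0,        0.62359551, 0.85352113, 0.60055866],
    [0.62921348, 0.62359551, 1.0,        0.61126761, 0.85393258],
    [0.84225352, 0.85352113, 0.61126761, 1.0,        0.59383754],
    [0.59776536, 0.60055866, 0.85393258, 0.59383754, 1.0],
]

# Visit each unordered pair (i < j) once and convert to percent
pair_percent = {}
for i, j in combinations(range(len(uids)), 2):
    pair_percent[(uids[i], uids[j])] = 100.0 * identities[i][j]

for (uid_a, uid_b), pct in pair_percent.items():
    print(f"{uid_a} vs {uid_b}: {pct:.1f}%")

# Average over the distinct pairs only (the diagonal is excluded)
mean_percent = sum(pair_percent.values()) / len(pair_percent)
print(f"Mean pairwise identity: {mean_percent:.1f}%")
```

These numbers are sequence identities (exact residue matches), not similarities. If a similarity measure is needed, the counting rule would have to be adapted, for example by also counting positively scoring substitutions.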
