python - Compare part of a line against every line of another file in Python

Question

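I am trying to compare each line of one file against the lines of another file and write every match to an output file. For example, here is the first file.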

chr8    18      .       T       T       *       *
chr8    29      .       C       T       .       .
chr9    21      .       TA      T       .       .
chr18    22      .       C       T       .       .
chr18    23      .       A       G       .       .

And here is the other file:

chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";
chr8    ensembl CDS     41      1006    .       -       0       gene_id "ENSCAFG00000029302"; gene_version "1"; transcript_id "ENSCAFT00000048043"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000036901"; protein_version "1";

The output I want is:

chr8    18      .       T       T       *       *
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    29      .       C       T       .       .   
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";

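So I want to take each line of the first file and search the second file for lines whose first column matches. If column 1 matches, I then check whether the second value from file 1 falls within the range given by columns 4 and 5 of file 2. If it does, I write the line from the first file to a new file, followed by every matching line from file 2. Here is what I tried: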

opt=''
with open('file1.vcf') as vfh:
    with open('file2.gtf') as gfh:
        for line in vfh:
                ct=0
                vll=line.split('\t')
                for gline in gfh:
                    gll=gline.split('\t')
                    if vll[0] == gll[0]:
                        if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):
                            while ct < 1:
                                opt+=line
                                ct+=1
                            opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)

But I never get the output I want.

score 0 · Accepted Answer

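Found the problem: I just needed to move my with open statement for the second file inside the loop. I also added a check to handle some comment lines in the original file: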

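opt=''
with open('a1.vcf') as vfh:
    for line in vfh:
        # Skip VCF header/comment lines, which start with '#'.
        if not line.startswith('#'):
            ct=0
            vll=line.split('\t')
            # Reopen the GTF for every VCF line so the inner loop starts from the top.
            with open('cds.gtf') as gfh:
                for gline in gfh:
                    gll=gline.split('\t')
                    if vll[0] == gll[0]:
                        if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):
                            # Write the VCF line once, before its first match.
                            while ct < 1:
                                opt+=line
                                ct+=1
                            opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)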
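
Moving the open call helps because a file object is consumed as it is iterated. In the question's version, gfh is exhausted after the first VCF line, so every later pass of the inner loop sees no lines at all. Reopening the GTF for each VCF line works but rereads the file every time. A possible alternative, sketched here under assumptions, reads the annotation once and groups it by chromosome. The function name annotate and the shortened gene_id values are made up for illustration, and the comparison is kept strict to match the code above (GTF coordinates are inclusive, so <= may be what is actually wanted).

```python
import io
from collections import defaultdict


def annotate(vcf_lines, gtf_lines):
    """Return each matching VCF line followed by the GTF lines that contain it."""
    # Read the annotation once and group (start, end, line) by chromosome.
    by_chrom = defaultdict(list)
    for gline in gtf_lines:
        if gline.startswith('#') or not gline.strip():
            continue
        fields = gline.split('\t')
        by_chrom[fields[0]].append((int(fields[3]), int(fields[4]), gline))

    out = []
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue
        fields = line.split('\t')
        pos = int(fields[1])
        # Strict comparison, as in the answer above.
        hits = [g for start, end, g in by_chrom.get(fields[0], []) if start < pos < end]
        if hits:
            out.append(line)
            out.extend(hits)
    return ''.join(out)


# In-memory stand-ins for the two files; the gene_id values are placeholders.
vcf_text = (
    'chr8\t18\t.\tT\tT\t*\t*\n'
    'chr8\t29\t.\tC\tT\t.\t.\n'
    'chr9\t21\t.\tTA\tT\t.\t.\n'
)
gtf_text = (
    'chr8\tensembl\tCDS\t1\t1042\t.\t-\t0\tgene_id "G1";\n'
    'chr8\tensembl\tCDS\t27\t1227\t.\t+\t0\tgene_id "G2";\n'
    'chr8\tensembl\tCDS\t41\t1006\t.\t-\t0\tgene_id "G3";\n'
)
result = annotate(io.StringIO(vcf_text), io.StringIO(gtf_text))
print(result, end='')
```

With real files, the same function could be called as annotate(open('a1.vcf'), open('cds.gtf')), ideally inside with blocks, and the returned string written to out.txt.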

score 0 · Accepted Answer

I believe your indexing is wrong.

if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):

"vll[1]" is 18 and "gll[3]" is 1042, because "ensembl CDS" appears to be separated by " " rather than "\t". Try using a debugger to verify the indices.

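A quick way to check the indices without a debugger is to split one sample line and print the fields. The snippet below is only a sketch: the three sample strings are hypothetical variations of the first GTF line (with a shortened gene_id), since the page does not show whether the real file uses tabs or spaces.

```python
# Case 1: every column is tab-separated, as in a standard GTF.
tabbed = 'chr8\tensembl\tCDS\t1\t1042\t.\t-\t0\tgene_id "X";'
tab_fields = tabbed.split('\t')
print(len(tab_fields), tab_fields[3], tab_fields[4])  # 9 1 1042

# Case 2: "ensembl CDS" joined by a space, so every later index shifts by one.
joined = 'chr8\tensembl CDS\t1\t1042\t.\t-\t0\tgene_id "X";'
joined_fields = joined.split('\t')
print(joined_fields[2], joined_fields[3])  # 1 1042

# Case 3: columns separated by runs of spaces; split('\t') finds nothing to split on.
spaced = 'chr8    ensembl CDS     1       1042    .       -       0       gene_id "X";'
print(len(spaced.split('\t')))  # 1

# Splitting on any whitespace, at most 8 times, keeps the attribute column whole.
ws_fields = spaced.split(None, 8)
print(ws_fields[3], ws_fields[4], ws_fields[8])  # 1 1042 gene_id "X";
```

If the real file behaves like case 2 or case 3, the indices in the comparison need to be adjusted, or the line needs to be split on whitespace instead of "\t".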