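I am trying to compare each line of one file against the lines of another file and write every match to an output file. For example, here is the first file: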
chr8 18 . T T * *
chr8 29 . C T . .
chr9 21 . TA T . .
chr18 22 . C T . .
chr18 23 . A G . .
And here is the other file:
chr8 ensembl CDS 1 1042 . - 0 gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8 ensembl CDS 27 1227 . + 0 gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";
chr8 ensembl CDS 41 1006 . - 0 gene_id "ENSCAFG00000029302"; gene_version "1"; transcript_id "ENSCAFT00000048043"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000036901"; protein_version "1";
The output I want is:
chr8 18 . T T * *
chr8 ensembl CDS 1 1042 . - 0 gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8 29 . C T . .
chr8 ensembl CDS 1 1042 . - 0 gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8 ensembl CDS 27 1227 . + 0 gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";
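So, I want to take each line of the first file and check whether its first column matches the first column of a line in file 2. If column 1 matches, I then check whether the second value from file 1 (the position) falls within the range given by columns 4 and 5 of file 2. If both conditions hold, I write the line from the first file to a new file, followed by every matching line from file 2 beneath it. This is what I tried: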
opt=''
with open('file1.vcf') as vfh:
    with open('file2.gtf') as gfh:
        for line in vfh:
            ct=0
            vll=line.split('\t')
            for gline in gfh:
                gll=gline.split('\t')
                if vll[0] == gll[0]:
                    if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):
                        while ct < 1:
                            opt+=line
                            ct+=1
                        opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)
But I never get the output I want.
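A likely cause is that the open GTF file handle is an iterator: the inner loop consumes it completely while processing the first VCF line, so every later VCF line is compared against nothing. Below is a minimal sketch of one way around that, which reads the GTF records into a list once and reuses them. The helper names `parse_gtf`, `annotate`, and `main` are made up for illustration, and the sketch assumes whitespace- or tab-delimited input, skips `#` header lines, and keeps the strict `>` / `<` comparison from the attempt above.

```python
def parse_gtf(gtf_lines):
    """Return (chrom, start, end, original_line) for every GTF data line."""
    features = []
    for gline in gtf_lines:
        if not gline.strip() or gline.startswith('#'):
            continue  # skip blank lines and header/comment lines
        # Only the first five columns are needed; maxsplit leaves the rest intact.
        fields = gline.split(None, 5)
        features.append((fields[0], int(fields[3]), int(fields[4]), gline))
    return features


def annotate(vcf_lines, gtf_lines):
    """Return each matching VCF line followed by the GTF lines that contain it."""
    features = parse_gtf(gtf_lines)  # read once, reuse for every VCF line
    out = []
    for line in vcf_lines:
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.split()
        chrom, pos = fields[0], int(fields[1])
        # Same chromosome, and position strictly inside (start, end).
        hits = [g for c, start, end, g in features
                if c == chrom and start < pos < end]
        if hits:
            out.append(line)
            out.extend(hits)
    return out


def main(vcf_path='file1.vcf', gtf_path='file2.gtf', out_path='out.txt'):
    """File-based wrapper using the file names from the attempt above."""
    with open(vcf_path) as vfh, open(gtf_path) as gfh:
        result = annotate(vfh, gfh)
    with open(out_path, 'w') as fh:
        fh.writelines(result)
```

Calling `main()` would write `out.txt` from the two input files. Since GTF coordinates are 1-based and inclusive at both ends, changing the comparison to `start <= pos <= end` may be closer to the intent if positions on a feature boundary should also count. For large GTF files, the linear scan per VCF line could be slow; grouping the features by chromosome in a dictionary first would be one possible refinement.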