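As Antimony pointed out, it sounds like your data occasionally has missing values, which the csv module can't easily handle out of the box. I'd suggest using a library like pandas, whose read_csv function copes with missing values. Take this data as an example: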
gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,
gene_id, ENSDARG00000104632, gene_version, , transcript_id,
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
It can be read as follows:
import pandas as pd
# Use the 2nd, 5th and 6th columns, i.e. column indices 1, 4 and 5 respectively.
# Empty fields are parsed as NaN by default; `na_values` additionally marks 'N/A' as missing.
data = pd.read_csv('test.dat', na_values='N/A', header=None, skipinitialspace=True, usecols=[1,4,5])
# now keep only the rows whose 5th column (index 4) is not 'gene_name':
d = data.loc[data[4] != 'gene_name']
# and, now we only select columns with index 1 and 5:
selected_data = d[[1, 5]]
which yields:
1 5
1 ENSDARG00000104632 ENSDART00000166186
2 ENSDARG00000104632 ENSDART00000166186
3 ENSDARG00000104632 ENSDART00000166186
4 ENSDARG00000104632 ENSDART00000166186
5 ENSDARG00000104632 ENSDART00000166186
6 ENSDARG00000104632 ENSDART00000166186
7 ENSDARG00000104632 NaN
8 ENSDARG00000104632 NaN
9 ENSDARG00000104632 ENSDART00000166186
as expected.
However, if some data is missing, as it is in this example, all you have to do is drop those rows, like so:
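If you want to try this without creating test.dat first, here is a minimal sketch that feeds four of the rows above to read_csv through io.StringIO. The `sample` string and the `n_missing` variable are only there for illustration; the read_csv arguments are the same ones used above, minus `na_values`, since empty fields are treated as missing anyway.

```python
import io

import pandas as pd

# Four rows copied from the example data, including the two with missing values.
sample = (
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,\n"
    "gene_id, ENSDARG00000104632, gene_version, , transcript_id,\n"
)

# header=None keeps the integer column labels, so the kept columns are 1, 4 and 5.
data = pd.read_csv(io.StringIO(sample), header=None,
                   skipinitialspace=True, usecols=[1, 4, 5])

# Count the rows where the last column came back empty (NaN).
n_missing = int(data[5].isna().sum())
print(data)
print("rows with a missing value in column 5:", n_missing)
```

Checking `isna()` like this before dropping anything is a cheap way to see how many rows you are about to lose.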
selected_data.dropna()
which outputs:
1 5
1 ENSDARG00000104632 ENSDART00000166186
2 ENSDARG00000104632 ENSDART00000166186
3 ENSDARG00000104632 ENSDART00000166186
4 ENSDARG00000104632 ENSDART00000166186
5 ENSDARG00000104632 ENSDART00000166186
6 ENSDARG00000104632 ENSDART00000166186
9 ENSDARG00000104632 ENSDART00000166186
(However, this may not be what you want.)
Reference
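If dropping whole rows is too aggressive, two other options may fit better: keep every row and fill the gaps with a placeholder, or drop a row only when one particular column is missing. The sketch below uses a small hand-built frame as a stand-in for `selected_data`, and the placeholder string 'unknown' is made up for the example.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for `selected_data` (hypothetical rows, kept short on purpose).
selected_data = pd.DataFrame({
    1: ["ENSDARG00000104632", "ENSDARG00000104632", np.nan],
    5: ["ENSDART00000166186", np.nan, "ENSDART00000166186"],
})

# Option 1: keep every row and replace missing entries with a placeholder.
filled = selected_data.fillna("unknown")

# Option 2: drop a row only when column 5 is missing.
only_col5 = selected_data.dropna(subset=[5])

# For comparison: plain dropna() drops a row if any column is missing.
strict = selected_data.dropna()

print(filled)
print(only_col5)
print(strict)
```

Which one is appropriate depends on what the downstream code expects; a placeholder keeps the row count stable, while `subset` lets you be selective about which gaps matter.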
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html