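I can use scanpy to load tabular data into a DataFrame, but I can't work out how to iterate over it to access selected rows/columns.
This is single-cell genomics data in which each row is a gene and each column holds the expression values for a particular cell. Both the rows and the columns are labeled. The raw tabular data looks like this: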
Gene_symbol Cancer--Cell_1 Cancer--Cell_10 Cancer--Cell_100
A2M.AS1 0.0 0.0 0.0
A2MP1 0.0 0.0 0.0
AADACL2 0.0 0.0 0.0
AAGAB 154.561226827488 0.0 0.0
AAR2 295.875190529996 299.455534712676 0.0
AATF 546.792205537953 323.38381204192996 0.0
AATK 0.0 0.0 0.0
AATK.AS1 0.0 0.0 0.0
ABAT 0.0 0.0 0.0
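To make the layout concrete, here is a minimal sketch that parses a few of the sample rows above with plain pandas and selects by label. It assumes the fields are whitespace-separated exactly as displayed (the real fig1.tab may be tab-delimited), and the names `raw`, `table`, `two_genes`, and `one_cell` are illustrative only.

```python
import io

import pandas as pd

# A few rows copied from the sample above; the separator is an assumption.
raw = """Gene_symbol Cancer--Cell_1 Cancer--Cell_10 Cancer--Cell_100
A2M.AS1 0.0 0.0 0.0
AAGAB 154.561226827488 0.0 0.0
AAR2 295.875190529996 299.455534712676 0.0
AATF 546.792205537953 323.38381204192996 0.0
"""

# Genes become the row index, cell names become the column labels.
table = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col=0)

# Two gene rows, all cell columns.
two_genes = table.loc[["AAR2", "AATF"]]

# The same two genes, restricted to a single cell column.
one_cell = table.loc[["AAR2", "AATF"], ["Cancer--Cell_1"]]

print(two_genes)
print(one_cell)
```

In this orientation genes are rows; the scanpy code further down transposes the data, so there the genes end up on the variable axis instead.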
The table is easily converted to h5ad like this:
import pandas as pd
import scanpy.api as sc
adata = sc.read('fig1.tab', ext='txt', first_column_names=True).transpose()
adata.write('fig1.h5')
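I can load it again, but I can't then access all of its parts. For example, how would I select two gene rows and get all of the columns with their corresponding values? What if I only want certain columns?
My attempts, with their output shown in the comments, are as follows: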
adata = sc.read_h5ad('fig1.h5')
# this is for the cancer dataset
selected = adata[:, adata.var_names.isin({'AAR2', 'ECT2'})]
## this line spews information on the columns like:
# Empty DataFrameView
# Columns: []
# Index: [Cancer--Cell_1, Cancer--Cell_10, Cancer--Cell_100, Cancer--Cell_1000, Cancer--Cell_1001
print(selected.obs)
## this line gives the row information:
# Empty DataFrameView
# Columns: []
#Index: [AAR2, ECT2]
print(selected.var)
# Nothing happens here at all
#for i, row in selected.obs.iteritems():
# print(i, row)
for gene_name, row in selected.var.T.iteritems():
    # this prints like: Series([], Name: AAR2, dtype: float64)
    print(row)
    # Nothing happens here
    for cell_name, val in row.iteritems():
        print("{0}\t{1}\t{2}".format(gene_name, cell_name, val))
If it helps, here is a Dropbox link to the fig1.h5 file.