python - How to change variable type when working with pandas-profiling?

Question

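To reproduce the problem (notebook, data, output): github link
My dataset has a Contract variable/column that looks like this. The values all look like numbers, but they are actually categorical.

When the data is read with pandas, info shows the column is read as int. Since the Contract variable is a category (according to the metadata I received), I manually changed the variable type as shown below.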

# pandas names this dtype 'category'; 'categorical' is not a valid dtype string
df['Contract'] = df['Contract'].astype('category')
df.dtypes  # shows the modified dtype now

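Then I tried to generate a report with pandas_profiling. The generated report shows that Contract is interpreted as a real number, even though I changed its type from int to str/category.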

from pandas_profiling import ProfileReport

# Tried both, but they gave the same result.
ProfileReport(df)
df.profile_report()

Can you explain the right way to get pandas_profiling to interpret the data types? That is, how to have the Contract variable treated as categorical.

score 0 · Accepted Answer

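Long after posting this question, raising an issue on the pandas-profiling GitHub page, and creating a pull request for it, I had almost forgotten about this question. Thanks to IampShadesDrifter for reminding me to close it out with an answer.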

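This behaviour of pandas-profiling is actually expected. pandas-profiling tries to infer the data type that best fits each column, and that is how it was originally written. Since there was no workaround, it prompted me to create my first pull request on GitHub.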
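To see why a value-based check can disagree with the pandas dtype, here is a rough illustration using only pandas. It is not pandas-profiling's actual inference code; the sample values and the `looks_numeric` name are made up for the example.

```python
import pandas as pd

# Made-up sample values standing in for the Contract column, cast to str.
contract = pd.Series([1, 2, 1, 3], name="Contract").astype(str)

# As far as pandas is concerned, the column no longer holds numbers.
is_numeric_dtype = pd.api.types.is_numeric_dtype(contract)

# A check that looks at the values instead of the dtype still finds
# that every entry parses cleanly as a number.
looks_numeric = bool(pd.to_numeric(contract, errors="coerce").notna().all())

print("numeric dtype:", is_numeric_dtype)
print("values look numeric:", looks_numeric)
```

Any inference that inspects the values rather than the declared dtype can therefore still classify such a column as numeric.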

Now, with the newly added parameter infer_dtypes in ProfileReport / profile_report, we can explicitly ask pandas-profiling not to infer any data types and to use the data types from pandas (df.dtypes) instead.

# for the df in the question,

df['Contract'] = df['Contract'].astype(str)

# By default the dtype is inferred, so `Contract` is read as a number (because it looks like one).
ProfileReport(df)
df.profile_report()

# With inference turned off, the `Contract` dtype stays `str`, as explicitly cast with pandas.
ProfileReport(df, infer_dtypes=False)
df.profile_report(infer_dtypes=False)
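Before building the report, it can help to confirm that the cast actually took effect on the pandas side. The following is a minimal sketch using only pandas. The helper name `cast_to_category`, the `MonthlyCharges` column, and the sample values are hypothetical and not taken from the question's dataset.

```python
import pandas as pd


def cast_to_category(df, columns):
    """Return a copy of df with the given columns cast to the 'category' dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Hypothetical sample data: Contract holds integer codes that are really labels.
df = pd.DataFrame(
    {
        "Contract": [1, 2, 1, 3],
        "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    }
)

typed = cast_to_category(df, ["Contract"])

# Inspect the dtypes that a report built without inference would rely on.
print(typed.dtypes)
```

The `typed` frame can then be passed to ProfileReport with infer_dtypes=False, as in the snippet above, so that the report follows df.dtypes.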

If you find anything worth mentioning, feel free to contribute to this answer.
