Consider the following text scenario:

df = pd.read_csv('dummy.txt', sep='|')
    ID   Name           Email Country  Quantity
0  2.0  name2  name@email.com      UK       8.0
1  3.0  name3  name@email.com     NaN       NaN
2  NaN     UK               8     NaN       NaN
3  5.0  name4  name@email.com     NaN       NaN
4  NaN     UK               8     NaN       NaN
5  7.0  name5  name@email.com      UK       8.0

The raw data is:

ID|Name|Email|Country|Quantity
2|name2|name@email.com|UK|8
3|name3|name@email.com
|UK|8
5|name4|name@email.com
|UK|8
7|name5|name@email.com|UK|8

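So some records are broken across two lines, and the continuation line starts with "|". The logic should be: if a line starts with "|", merge it with the previous line, which it belongs to.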

The result should be:

ID|Name|Email|Country|Quantity
2|name2|name@email.com|UK|8
3|name3|name@email.com|UK|8
5|name4|name@email.com|UK|8
7|name5|name@email.com|UK|8

On Linux, this sed command does the job:

sed -z 's/\n|/|/g'

However, I can't work out how to do this in Python.

1 Answer

Using the re module (regex101):

txt = '''ID|Name|Email|Country|Quantity
2|name2|name@email.com|UK|8
3|name3|name@email.com
|UK|8
5|name4|name@email.com
|UK|8
7|name5|name@email.com|UK|8'''

import re

txt = re.sub(r'\n\|', '|', txt)
print(txt)

Prints:

ID|Name|Email|Country|Quantity
2|name2|name@email.com|UK|8
3|name3|name@email.com|UK|8
5|name4|name@email.com|UK|8
7|name5|name@email.com|UK|8

Load it into a pandas DataFrame:

import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(txt), sep='|')
print(df)

Prints:

   ID   Name           Email Country  Quantity
0   2  name2  name@email.com      UK         8
1   3  name3  name@email.com      UK         8
2   5  name4  name@email.com      UK         8
3   7  name5  name@email.com      UK         8

EDIT: To read from a file, you can use:

import re
import sys
import pandas as pd

if sys.version_info[0] == 2:  # Python 2 keeps StringIO in its own module
    from StringIO import StringIO
else:
    from io import StringIO

with open('dummy.txt', 'r') as f_in:
    txt = f_in.read()

txt = re.sub(r'\n\|', '|', txt)

df = pd.read_csv(StringIO(txt), sep='|')

print(df)  # or 'print df' in Python2

Prints:

   ID   Name           Email Country  Quantity
0   2  name2  name@email.com      UK         8
1   3  name3  name@email.com      UK         8
2   5  name4  name@email.com      UK         8
3   7  name5  name@email.com      UK         8
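
If the file is too large to read into a single string, the same rule (a line starting with "|" belongs to the previous line) can also be applied line by line. The following is only a minimal sketch of that idea, not part of the regex approach above: the helper name merge_continuation_lines is hypothetical, and the sample text stands in for an open file handle.

```python
import io


def merge_continuation_lines(lines):
    """Yield logical records, joining any line that starts with '|'
    onto the record before it (hypothetical helper, for illustration)."""
    pending = None
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line.startswith('|') and pending is not None:
            # Continuation line: append it to the previous record.
            pending += line
        else:
            # A new record starts: emit the previous one, if any.
            if pending is not None:
                yield pending
            pending = line
    if pending is not None:
        yield pending


# Sample data from the question; in practice this would be open('dummy.txt').
sample = """ID|Name|Email|Country|Quantity
2|name2|name@email.com|UK|8
3|name3|name@email.com
|UK|8
5|name4|name@email.com
|UK|8
7|name5|name@email.com|UK|8
"""

merged = '\n'.join(merge_continuation_lines(io.StringIO(sample)))
print(merged)
```

The merged string can then be passed to pd.read_csv(StringIO(merged), sep='|') exactly as above. Unlike the regex, this version only ever holds one pending record in memory while scanning the input.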
answered 2020-01-15T16:54:28.840