python - Check whether the next row in a CSV has the same ID as the current row

Question

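I'm working on a project that reads purchase data from a CSV and outputs JSON for an API payload. There are multiple rows with the same order ID, because each row holds a single item, and I want to combine those items into an array before creating the purchase payload.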

order id             name    product_code   purchase_price
012006251700-68811   item1   321618         1380
012006251700-68811   item1   321618         690
012006241026-13750   item2   329452         1490
012006221101-40527   item3   326353         1990
012006221101-40527   item4   321625         1490
012006192158-63823   item5   323098         1990
012006192158-63823   item6   320923         590
012006192158-63823   item7   325051         590
012006192158-63823   item8   325446         1990

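I have been able to import the rows from the CSV, and I am checking the ID value of the current purchase, but I can't get the result I want.

The code below should check whether the next row has the same ID value and, if so, add only the item details to the items array.

If the next row does not have the same ID, the else statement adds the complete purchase to the purchases array.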

import csv
import json

output = {'purchases': []}
items = {'items': []}
purchaseBody = {}
current_purchase = None

with open('tester - tester.csv') as csv_file:
    for purchase in csv.DictReader(csv_file):
        if current_purchase is not None and purchase['id'] == current_purchase['id']:
            items['items'].append({'id': purchase['id'],
                                  'name': purchase['name'],
                                  'product_code': purchase['product_code'],
                                  'purchase_price': purchase['purchase_price'],
                                  })
                 
        else:

            purchaseBody = {
                'id': purchase['id'],
                'user': {'email': purchase['email']},
                'total': purchase['total'],
                'createdAt': purchase['createdAt']
                }
            items['items'].append({'id': purchase['id'],
                                  'name': purchase['name'],
                                  'product_code': purchase['product_code'],
                                  'purchase_price': purchase['purchase_price'],
                                  })
            output['purchases'].append(purchaseBody)
            items = {'items': []}   
            purchaseBody.update(items)


        current_purchase = purchase


with open('file.json', 'w') as jsonfile:
    json.dump(output, jsonfile, ensure_ascii=False)
    jsonfile.write('\n')

The desired output should look something like this:

{
    "purchases": [{
        "id": "purchase id",
        "user": {
            "email": "email"
        },
        "items": [{
                "id": "id1",
                "name": "name1",
                "additionalFields": {
                    "product_code": "product_code1",
                    "purchase_price": "purchase_price1"
                }
            },
            {
                "id": "id2",
                "name": "name2",
                "additionalFields": {
                    "product_code": "product_code2",
                    "purchase_price": "purchase_price2"
                }
            }
        ],
        "total": "total",
        "createdAt": "createdAt"
    }]
}

score 2 · Accepted Answer

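Consider using pandas.
Use .groupby to select the groups.
- When .groupby is performed on a single column, the group key is returned as a str; when it is performed on multiple columns, a tuple is returned.
- o_id is the str that represents the value used for the groupby.
- o_id must be a list or tuple so that it can be zipped with groupby_list to create a dict.
- d is the dataframe of each groupby group.
Use .iterrows to iterate through the rows of each group.
- It returns the index, which is assigned to _ because it is not needed.
- It returns data, from which the groupby_list labels are dropped; the remainder is converted to a dict with .to_dict() and appended to the list items_list.
- After all rows of the group have been iterated, items_list is assigned to group['items'].
After each group has been iterated, the dict group is appended to dict_list.
dict_list can be converted back to a dataframe with the following:
- df = pd.json_normalize(dict_list, 'items', meta=groupby_list)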
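
The round trip in the last step can be sketched as follows. This is only an illustration: the small dict_list below is made-up sample data in the same shape as the real result, not output from the actual CSV.

```python
import pandas as pd

# Hypothetical sample in the same shape as dict_list (values are made up for illustration).
dict_list = [
    {'order id': 'A-1', 'items': [
        {'name': 'item1', 'product_code': 111, 'purchase_price': 10},
        {'name': 'item2', 'product_code': 222, 'purchase_price': 20},
    ]},
    {'order id': 'B-2', 'items': [
        {'name': 'item3', 'product_code': 333, 'purchase_price': 30},
    ]},
]
groupby_list = ['order id']

# One row per item; the grouping column is repeated on every row as metadata.
df = pd.json_normalize(dict_list, 'items', meta=groupby_list)
print(df)
```

Each nested item becomes one row again, with the order id restored as an ordinary column.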

Expected output

{'items': [{'name': 'item6', 'product_code': '323098', 'purchase_price': 1990},
           {'name': 'item7', 'product_code': '320923', 'purchase_price': 590}],
 'orderId': '012006192158-63823'}

Code that produces the expected output

import pandas as pd
import json

# read in the file
df = pd.read_csv('test.csv')

dict_list = list()
groupby_list = ['order id']

for o_id, d in df.groupby(groupby_list):
    # Normalise the group key to a sequence so it can be zipped with groupby_list.
    if not isinstance(o_id, tuple):
        o_id = [o_id]
    group = dict(zip(groupby_list, o_id))
    items_list = list()
    for _, data in d.iterrows():
        data = data.drop(labels=groupby_list)
        items_list.append(data.to_dict())
    group['items'] = items_list
    dict_list.append(group)

# save to a file
with open('test.json', 'w') as f:
    json.dump(dict_list, f, ensure_ascii=False)
    f.write('\n')

Final output: `dict_list`

[{
        'items': [{
                'name': 'item6',
                'product_code': 323098,
                'purchase_price': 1990
            }, {
                'name': 'item7',
                'product_code': 320923,
                'purchase_price': 590
            }, {
                'name': 'item8',
                'product_code': 325051,
                'purchase_price': 590
            }, {
                'name': 'item9',
                'product_code': 325446,
                'purchase_price': 1990
            }
        ],
        'order id': '012006192158-63823'
    }, {
        'items': [{
                'name': 'item4',
                'product_code': 326353,
                'purchase_price': 1990
            }, {
                'name': 'item5',
                'product_code': 321625,
                'purchase_price': 1490
            }
        ],
        'order id': '012006221101-40527'
    }, {
        'items': [{
                'name': 'item3',
                'product_code': 329452,
                'purchase_price': 1490
            }
        ],
        'order id': '012006241026-13750'
    }, {
        'items': [{
                'name': 'item1',
                'product_code': 321618,
                'purchase_price': 1380
            }, {
                'name': 'item2',
                'product_code': 321618,
                'purchase_price': 690
            }
        ],
        'order id': '012006251700-68811'
    }
]

`test.csv`

order id,name,product_code,purchase_price
012006251700-68811,item1,321618,1380
012006251700-68811,item2,321618,690
012006241026-13750,item3,329452,1490
012006221101-40527,item4,326353,1990
012006221101-40527,item5,321625,1490
012006192158-63823,item6,323098,1990
012006192158-63823,item7,320923,590
012006192158-63823,item8,325051,590
012006192158-63823,item9,325446,1990

score 2 · Accepted Answer

For the code in the question, I think it would do what you want if you indented the line

 current_purchase = purchase

so that it is inside the else block.
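
As a rough sketch of that previous-row comparison, not the asker's exact code: the loop below starts a new purchase whenever the ID changes and attaches the items list to the purchase before the first item is appended, so the first row of each order is kept. The sample data, the email, total and createdAt columns, and the names previous_id and purchase_body are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names follow the question's code, not its sample table.
sample = io.StringIO(
    "id,name,product_code,purchase_price,email,total,createdAt\n"
    "A-1,item1,111,10,a@example.com,30,2020-08-01\n"
    "A-1,item2,222,20,a@example.com,30,2020-08-01\n"
    "B-2,item3,333,30,b@example.com,30,2020-08-02\n"
)

output = {'purchases': []}
previous_id = None

for row in csv.DictReader(sample):
    if row['id'] != previous_id:
        # A new order starts: create its body with an empty items list and register it.
        purchase_body = {
            'id': row['id'],
            'user': {'email': row['email']},
            'total': row['total'],
            'createdAt': row['createdAt'],
            'items': [],
        }
        output['purchases'].append(purchase_body)
        previous_id = row['id']
    # Every row, first or not, contributes one item to the most recent order.
    output['purchases'][-1]['items'].append({
        'id': row['id'],
        'name': row['name'],
        'additionalFields': {
            'product_code': row['product_code'],
            'purchase_price': row['purchase_price'],
        },
    })
```

This only works when rows of the same order are adjacent in the file, which is the same precondition the question's loop relies on.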

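However, this kind of task (iterating over a collection and grouping by a key) can be simplified with the itertools.groupby function. Given a sorted collection, it does the grouping for you. The operator.itemgetter function can be used to reduce the amount of code needed to get values from the row dictionaries.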

import csv
import itertools
import operator
import json

output = {'purchases': []}

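keyfunc = operator.itemgetter('id')

# Sort the rows by `id`, in case the data is not guaranteed to be sorted.
# If the order is guaranteed, list(reader) can be passed to itertools.groupby instead.
# The file name is taken from the question; adjust it as needed.
with open('tester - tester.csv', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    rows = sorted(reader, key=keyfunc)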

# Make a function to build the item dictionaries.
item_keys = ('id', 'name')
item_values = operator.itemgetter(*item_keys)
additional_keys = ('product_code', 'purchase_price')
additional_values = operator.itemgetter(*additional_keys)


def build_item(purchase):
    item = dict(zip(item_keys, item_values(purchase)))
    item['additionalFields'] = dict(zip(additional_keys, additional_values(purchase)))
    return item


for _, purchases in itertools.groupby(rows, keyfunc):
    # Get the first row, because we need some of the data to build purchaseBody.
    purchase = next(purchases)
    # Initialise the items list with data from the first purchase, and add the rest.
    items = [build_item(purchase)]
    items.extend(build_item(p) for p in purchases)
    purchaseBody = {
        'id': purchase['id'],
        'user': {'email': purchase['email']},
        'total': sum(float(item['additionalFields']['purchase_price']) for item in items),
        'createdAt': '2020-08-02',
        'items': items,
    }
    output['purchases'].append(purchaseBody)

with open('file.json', 'w') as jsonfile:
    json.dump(output, jsonfile, ensure_ascii=False)
    jsonfile.write('\n')
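
Why the rows are sorted first can be seen in a small sketch. itertools.groupby only groups adjacent elements with equal keys, so an ID that reappears later in unsorted input starts a second group. The rows below are made-up sample data.

```python
import itertools
import operator

# Hypothetical rows; the id 'A' appears twice but not on adjacent rows.
rows = [
    {'id': 'A', 'name': 'item1'},
    {'id': 'B', 'name': 'item2'},
    {'id': 'A', 'name': 'item3'},
]
keyfunc = operator.itemgetter('id')

# Unsorted input: 'A' appears as two separate groups because its rows are not adjacent.
unsorted_keys = [key for key, _ in itertools.groupby(rows, keyfunc)]

# Sorted input: each id yields exactly one group.
grouped = {
    key: [row['name'] for row in group]
    for key, group in itertools.groupby(sorted(rows, key=keyfunc), keyfunc)
}
print(unsorted_keys)  # ['A', 'B', 'A']
print(grouped)        # {'A': ['item1', 'item3'], 'B': ['item2']}
```

Because sorted is stable, the items inside each group keep their original relative order.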
