I have a dataset of the following form (about 8,000 rows):

Employee ID | Manager ID
a | b
c | b
b | e
d | e
e | f

I want to convert it into a form that shows the whole chain between each lowest-level employee and all of their managers up to the top level, i.e.:

Employee ID | Manager ID 1 | Manager ID 2 | Manager ID 3
a | b | e | f
c | b | e | f
d | e | f

What is the most efficient way to compute this with pandas in Python?

2 Answers

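This is more of a graph and tree theory problem, and pandas does not specialize in that area. networkx is a better fit for this kind of task, so I propose a solution using networkx. Before running it, you need to install networkx (for example with pip).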

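Construct a DiGraph from your dataframe, with edges pointing from employee to manager. Get the list of leaves of the graph (here, nodes with no outgoing edge, i.e. the top-level managers). Then use a list comprehension with shortest_path to get the list of nodes from each root (a node with no incoming edge, i.e. an employee who manages no one) to each leaf.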

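import networkx as nx
import pandas as pd

# Edges point from each employee to their manager.
G = nx.from_pandas_edgelist(df, 'Employee ID', 'Manager ID', create_using=nx.DiGraph)
# Nodes with no outgoing edge are the top-level managers.
leaves = [node for node in G if G.out_degree(node)==0]
# Walk from every employee who manages no one up to each reachable top-level manager.
data   = [nx.shortest_path(G, node, leaf) for node in G if G.in_degree(node)==0 
                                               for leaf in leaves if nx.has_path(G, node, leaf)]
# One manager column per level of the longest chain (the first entry of a path is the employee).
manager_cols = [f'Manager ID {i}' for i in range(1, max(map(len, data)))]

df_final = pd.DataFrame(data, columns=['Employee ID', *manager_cols])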

Out[371]:
  Employee ID Manager ID 1 Manager ID 2 Manager ID 3
0           a            b            e            f
1           c            b            e            f
2           d            e            f         None
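
The same root-to-leaf walk can also be sketched without networkx. The following is a minimal dependency-free illustration, not part of the original answer: it assumes each employee has exactly one manager, and the names `rows`, `manager_of`, `chain` and `table` are made up for this example.

```python
# Sample rows from the question: (Employee ID, Manager ID).
rows = [("a", "b"), ("c", "b"), ("b", "e"), ("d", "e"), ("e", "f")]

# Map each employee to their direct manager (assumes one manager per employee).
manager_of = dict(rows)
managers = set(manager_of.values())

def chain(employee):
    """Return [employee, manager 1, manager 2, ...] up to the top of the hierarchy."""
    path = [employee]
    seen = {employee}
    while path[-1] in manager_of:
        nxt = manager_of[path[-1]]
        if nxt in seen:  # guard against cycles in the data
            break
        path.append(nxt)
        seen.add(nxt)
    return path

# Employees who never appear as a manager are the starting points.
chains = [chain(e) for e in manager_of if e not in managers]

# Pad shorter chains with None so every row has the same width.
width = max(len(c) for c in chains)
table = [c + [None] * (width - len(c)) for c in chains]

for row in table:
    print(row)
```

If a DataFrame is needed, `table` can be passed to `pd.DataFrame` with one column name per position, as in the answer above.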
Answered 2020-01-31T19:33:39.643

Here is a numpy solution without pandas, but maybe it helps you:

import numpy as np

employee = np.array(['a', 'c', 'b', 'd', 'e', 'f'])  # Add 'f' as employee 
manager = np.array(['b', 'b', 'e', 'e', 'f', 'f'])   # being his own manager

Get the index of each employee's manager (sorry):

manager_idx = np.array([np.where(employee == mng)[0] for mng in manager]).ravel()
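
The comprehension above scans the whole `employee` array once per manager, which grows quadratically with the number of rows. As a hedged alternative that is not part of the original answer, a dictionary from ID to position yields the same indices in a single pass. The name `position` is introduced here purely for illustration, and the sketch assumes employee IDs are unique.

```python
employee = ["a", "c", "b", "d", "e", "f"]
manager = ["b", "b", "e", "e", "f", "f"]

# Position of every employee ID in the employee list (IDs assumed unique).
position = {emp: i for i, emp in enumerate(employee)}

# Index of each employee's manager, in the same order as `employee`.
manager_idx = [position[mng] for mng in manager]
print(manager_idx)  # [2, 2, 4, 4, 5, 5]
```

Wrapping the result in `np.array(manager_idx)` gives an index array that can be used for the array indexing in the rest of this answer.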

Loop until you reach the end of the hierarchy:

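manager_idx_list = [manager_idx]
while True:
    # Indexing the last level with manager_idx moves every employee one level further up.
    new_manager_idx = manager_idx_list[-1][manager_idx]
    # Stop once nothing changes, i.e. every chain has reached the top manager.
    if all(new_manager_idx == manager_idx_list[-1]):
        break
    else:
        manager_idx_list.append(new_manager_idx)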


manager_list = np.array([employee[mng_idx] for mng_idx in manager_idx_list]).T
# 'a': [['b' 'e' 'f']
# 'c':  ['b' 'e' 'f']
# 'b':  ['e' 'f' 'f']
# 'd':  ['e' 'f' 'f']
# 'e':  ['f' 'f' 'f']
# 'f':  ['f' 'f' 'f']]
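
The result above has one row per ID, managers included, and pads short chains by repeating the top manager. One possible post-processing step, sketched here in plain Python, keeps only employees who manage no one and drops the repeated padding. It assumes the top manager is listed as their own manager, as in the setup above. The names `is_manager` and `requested` are illustrative and do not come from the original answer.

```python
employee = ["a", "c", "b", "d", "e", "f"]
manager = ["b", "b", "e", "e", "f", "f"]
# Rows as produced above: the managers at level 1, 2 and 3 for each employee.
manager_list = [
    ["b", "e", "f"],
    ["b", "e", "f"],
    ["e", "f", "f"],
    ["e", "f", "f"],
    ["f", "f", "f"],
    ["f", "f", "f"],
]

# IDs that manage someone other than themselves.
is_manager = {m for e, m in zip(employee, manager) if e != m}

requested = {}
for emp, chain in zip(employee, manager_list):
    if emp in is_manager:
        continue  # keep only the lowest-level employees
    trimmed = []
    for m in chain:
        if trimmed and m == trimmed[-1]:
            break  # the top manager repeating marks the end of the chain
        trimmed.append(m)
    requested[emp] = trimmed

print(requested)
```

Each entry of `requested` then corresponds to one row of the table asked for in the question, with the employee as the key and the manager chain as the value.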
Answered 2020-01-31T10:26:00.607