
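I have a dataset containing a list of company names along with their respective IDs. Each company appears multiple times, and some of the instances are spelled differently. Every company has at least one instance with an ID, but because of the inconsistent spelling, not all instances have one. All instances of a company are grouped together. The data looks like this: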

company_name                 id

T. Rowe Price Group
Group, T. Rowe Price         576
T. ROWE PRICE GROUP
Transatlantic, Inc           458
Transatlantic, Incorporated
Transatlantic, Inc           458

Is there a good way to match the company names that are missing an ID to the correct ID?


2 Answers


Here is one way to do it using pandas:

import re
from collections import OrderedDict

import pandas as pd


# Split a string into its text and number parts, dropping empty pieces.
# list() is needed because filter() returns an iterator in Python 3.
def my_splitter(s):
    return list(filter(None, re.split(r'(\d+)', s)))


# Read the file as a one-column dataframe, skipping the header row
df = pd.read_csv('dataset.txt', sep='\t', header=None, skiprows=1, names=['Name'])

# Separate each row into a name and an ID ('na' when the ID is missing)
join = []
for raw in df['Name']:
    parts = my_splitter(raw)
    if len(parts) != 2:
        join.append({'Name': parts[0], 'ID': 'na'})
    else:
        join.append({'Name': parts[0], 'ID': parts[1]})
df_new = pd.DataFrame(join, columns=['ID', 'Name'])

# Dictionary that maps each known ID to the words of its company name
diction = OrderedDict()
for i in range(len(df_new)):
    if df_new.loc[i, 'ID'] != 'na':
        diction[df_new.loc[i, 'ID']] = df_new.loc[i, 'Name'].split()

# Give each row without an ID the ID of a known name that shares a word with it
for i in range(len(df_new)):
    if df_new.loc[i, 'ID'] == 'na':
        for j in diction:
            if set(df_new.loc[i, 'Name'].split()) & set(diction[j]):
                df_new.loc[i, 'ID'] = j  # .loc avoids chained assignment

print(df)  # contents of the input file read as a dataframe
print("####################")
print(df_new)
# Save the result to a file - dataset.txt (this overwrites the input file)
df_new.to_csv('dataset.txt', sep='\t')

Output:

                              Name
0               T. Rowe Price Group
1  Group, T. Rowe Price         576
2               T. ROWE PRICE GROUP
3  Transatlantic, Inc           458
4       Transatlantic, Incorporated
5  Transatlantic, Inc           458
####################
    ID                           Name
0  576            T. Rowe Price Group
1  576  Group, T. Rowe Price         
2  576            T. ROWE PRICE GROUP
3  458  Transatlantic, Inc           
4  458    Transatlantic, Incorporated
5  458  Transatlantic, Inc   
answered 2020-02-06T16:05:42.097

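Using NLTK, you can reduce the words in each company name to their root forms (see the stemming and lemmatization examples at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html). Names that reduce to the same form can then be given the same ID.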
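
The general idea of reducing each name to a canonical form and then sharing IDs can be tried without NLTK. The following is only a minimal sketch under stated assumptions: it swaps stemming for a simpler normalization (lowercasing, dropping punctuation, sorting the words) and uses `difflib.SequenceMatcher` from the standard library to pick the closest known name. The helper names `normalize` and `fill_missing_ids`, and the use of `None` for a missing ID, are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort the words so word order is ignored."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(words))


def fill_missing_ids(rows):
    """rows is a list of (name, id_or_None); returns the list with missing IDs filled."""
    # Normalized form of every name that already has an ID
    known = {normalize(name): id_ for name, id_ in rows if id_ is not None}
    filled = []
    for name, id_ in rows:
        if id_ is None:
            key = normalize(name)
            # Pick the known name whose normalized form is most similar
            best = max(known, key=lambda k: SequenceMatcher(None, key, k).ratio())
            id_ = known[best]
        filled.append((name, id_))
    return filled


# Sample data from the question (None marks a missing ID)
rows = [
    ("T. Rowe Price Group", None),
    ("Group, T. Rowe Price", 576),
    ("T. ROWE PRICE GROUP", None),
    ("Transatlantic, Inc", 458),
    ("Transatlantic, Incorporated", None),
    ("Transatlantic, Inc", 458),
]
result = fill_missing_ids(rows)
for name, id_ in result:
    print(id_, name)
```

Note that `max` always returns some match, even a poor one. On real data it would probably be wise to require a minimum similarity ratio and leave the ID empty when no known name is close enough.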

answered 2020-02-06T16:40:40.840