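I have a data frame containing article titles and associated URL links.
My problem is that a URL is not necessarily in the same row as its corresponding title, for example: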
title | urls
Who will be the next president? | https://website/5-ways-to-make-a-cocktail.com
5 ways to make a cocktail | https://website/who-will-be-the-next-president.com
2 millions raised by this startup | https://website/how-did-you-find-your-house.com
How did you find your house | https://website/2-millions-raised-by-this-startup.com
How did you find your house | https://washingtonpost/article/latest-movies-in-theater.com
Latest movies in Theater | www.newspaper/mynews/what-to-cook-in-summer.com
What to cook in summer | https://website/2-millions-raised-by-this-startup.com
My guess is that I need some kind of fuzzy matching logic, but I'm not sure how to do it. For duplicates, I will simply use the `unique` function.
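I started with the `levenshteinSim` function from the `RecordLinkage` package, which gives a similarity score for each row, but since the titles and URLs in a given row don't belong together, the scores are low everywhere.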
I have also heard of the `stringdistmatrix` function from the `stringdist` package, but I don't know how to use it here.
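
To make the idea concrete, here is a minimal sketch of how `stringdistmatrix` might be applied to the sample data above. It compares every title against every URL instead of going row by row, then keeps the closest URL for each title. The `normalize` helper, the choice of the Levenshtein method (`"lv"`), and the assumption that the title is encoded in the last path segment of the URL (ending in `.com`) are illustrative choices of mine, not something confirmed to be the right approach.

```r
library(stringdist)

# Sample data copied from the table above.
df <- data.frame(
  title = c("Who will be the next president?",
            "5 ways to make a cocktail",
            "2 millions raised by this startup",
            "How did you find your house",
            "How did you find your house",
            "Latest movies in Theater",
            "What to cook in summer"),
  urls = c("https://website/5-ways-to-make-a-cocktail.com",
           "https://website/who-will-be-the-next-president.com",
           "https://website/how-did-you-find-your-house.com",
           "https://website/2-millions-raised-by-this-startup.com",
           "https://washingtonpost/article/latest-movies-in-theater.com",
           "www.newspaper/mynews/what-to-cook-in-summer.com",
           "https://website/2-millions-raised-by-this-startup.com"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase the text and collapse every run of
# non-alphanumeric characters into a single space.
normalize <- function(x) trimws(gsub("[^a-z0-9]+", " ", tolower(x)))

# Drop duplicates on each side independently.
titles <- unique(df$title)
urls   <- unique(df$urls)

# Assumption: the title lives in the last path segment of the URL.
# Strip everything up to the last "/" and the trailing ".com".
slugs <- normalize(sub("\\.com$", "", sub(".*/", "", urls)))

# Distance matrix: one row per title, one column per URL slug.
d <- stringdistmatrix(normalize(titles), slugs, method = "lv")

# For each title, pick the column (URL) with the smallest distance.
best <- apply(d, 1, which.min)

matched <- data.frame(
  title    = titles,
  url      = urls[best],
  distance = d[cbind(seq_along(titles), best)],
  stringsAsFactors = FALSE
)
print(matched)
```

In this sample every normalized title is identical to its slug, so the best distance is 0 in each row. On real data the distances would presumably be larger, and a cutoff on `distance` (or a normalized method such as `"jw"`) might be needed to reject titles that have no matching URL.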