3

I need to join two tables based on names. And the problem is that names may be a slight mispelling in one of the database. I have remedy this problem in the past using Stata and Python's fuzzy merging, where names are matched based on how closely similar they are, but I am wondering if this is possible to do in Postgresql.

For example, may data may be something similar to this:

Table A:

first_name_a | last_name_a | id_a
----------------------------------
   William   |   Hartnell  | 1
   Matt      |   Smithe    | 2
   Paul      |   McGann    | 3
   David     |   Tennant   | 4
   Colin     |   Baker     | 5

Table B:

first_name_b | last_name_b | id_b
----------------------------------
   Matt      |   Smith     | a
   Peter     |   Davison   | b
   Dave      |   Tennant   | c
   Colin     |   Baker     | d
   Will      |   Hartnel   | e

And in the end, I hope my results would look something like:

first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b
----------------------------------------------------------------------
 William     |  Hartnell   |  1   |  Will        | Hartnel     | e
 Matt        |  Smithe     |  2   |  Matt        | Smith       | a
 Paul        |  McGann     |  3   |              |             | 
 David       |  Tennant    |  4   |  Dave        | Tennant     | c
 Colin       |  Baker      |  5   |  Colin       | Baker       | d
             |             |      |  Peter       | Davison     | b

My Sonic Screwdriver gives me some pseudo-code like this:

SELECT a.*, b.* FROM A a
     JOIN B b
     WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
     AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
4

2 回答 2

3

你提到的DML:

SELECT a.*, b.* FROM A a
 JOIN B b
 WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
 AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1

看起来正确,只需提高“模糊性”(假设“小于 1”,用 1 代替您需要的“模糊性”级别)

有关 LEVENSHTEIN 的参考信息,请参阅http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

于 2013-12-07T04:39:49.483 回答
0

完成为SQLFiddle。使用阈值/查看匹配模糊字符串中提到的其他一些映射函数。

于 2013-12-07T04:35:57.880 回答