4

Every day we have approximately 50,000 instances of a data structure (this could eventually grow much larger) that encapsulate the following:

DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers

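This is probably irrelevant, but the list values is a list of distinct integers with the property that, for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.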

The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.

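Given adjacent days, we are trying to find instances of these objects for which the value of key differs between the two days, but the list values contains the same integers.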

We are using the following algorithm. Convert the list values to a string via

string signature = String.Join("|", values.OrderBy(n => n).ToArray());

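then hash signature to an integer, sort the resulting lists of hash codes (one list per day), walk through the two lists looking for matches, and then check whether the associated keys differ. (We also check the associated lists to make sure we did not hit a hash collision.)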

Is there a better method?


6 Answers

5

You could probably just hash the list itself, instead of going through a string.

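Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m), where n and m are the numbers of entries for each of the two days you are comparing. (The sorting is the bottleneck.)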

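You can do this in O(n + m) if you use a bucket array (essentially: a hash table) that you plug the hashes into. You can compare the two bucket arrays in O(max(n, m)), assuming their length depends on the number of entries (to get a reasonable load factor).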

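It should be possible to have the library do this for you (it looks like you are using .NET) by using HashSet.IntersectWith() and writing a suitable comparison function.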

You cannot do better than O(n + m), because every entry needs to be visited at least once.
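As a rough sketch of the hash-table idea, assuming each day's data is available as (key, values) pairs: the names DayMatcher and FindRekeyed are made up for illustration, and a Dictionary lookup with the set comparer from HashSet<int>.CreateSetComparer() stands in for the IntersectWith() approach mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public static class DayMatcher
{
    // Returns (day 1 key, day 2 key) pairs whose value sets match but whose keys differ.
    public static List<(int Day1Key, int Day2Key)> FindRekeyed(
        IEnumerable<(int Key, List<int> Values)> day1,
        IEnumerable<(int Key, List<int> Values)> day2)
    {
        // The set comparer treats two HashSet<int> as equal when they hold the same
        // elements, so element order in the source lists does not matter.
        var byValues = new Dictionary<HashSet<int>, int>(HashSet<int>.CreateSetComparer());
        foreach (var (key, values) in day1)
            byValues[new HashSet<int>(values)] = key;

        var result = new List<(int Day1Key, int Day2Key)>();
        foreach (var (key, values) in day2)
        {
            // Expected O(1) lookup per entry, so O(n + m) overall for short lists.
            if (byValues.TryGetValue(new HashSet<int>(values), out int oldKey) && oldKey != key)
                result.Add((oldKey, key));
        }
        return result;
    }
}
```

Because the dictionary's comparer also checks set equality, a hash collision cannot produce a false match; it only costs an extra comparison.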

Edit: misread, fixed.

answered 2009-02-27T02:04:06.223
4

On top of the other answers you could make the process faster by creating a low-cost hash simply constructed of a XOR amongst all the elements of each List. You wouldn't have to order your list and all you would get is an int which is easier and faster to store than strings.

Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it. If there is already an existing key, only then do you sort the corresponding Lists and compare them.

You still need to compare them if you find a match, because a simple XOR can produce collisions.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.

If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
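A minimal sketch of that idea, assuming a small wrapper class rather than a full List<> replacement (XorTrackedList and its members are hypothetical names):

```csharp
using System.Collections.Generic;

// Hypothetical wrapper: keeps the XOR of its elements up to date on every change,
// so the key never has to be recomputed by walking the list.
public sealed class XorTrackedList
{
    private readonly List<int> items = new List<int>();

    public int XorKey { get; private set; }
    public int Count => items.Count;
    public IReadOnlyList<int> Items => items;

    public void Add(int value)
    {
        items.Add(value);
        XorKey ^= value;
    }

    public bool Remove(int value)
    {
        if (!items.Remove(value)) return false;
        XorKey ^= value; // XOR is its own inverse, so this undoes the earlier Add
        return true;
    }
}
```

Only Add and Remove are covered here; any other mutating operation would need the same treatment to keep XorKey consistent.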

Code

Below is a first-attempt at implementing this.

Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList) {
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    if (checkHash.TryGetValue(xorkey, out existingLists)) {
        // XOR key already seen: compare against each list stored under it
        foreach (List<int> li in existingLists) {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical) {
                // Elements are distinct, so equal counts plus containment means equal sets
                foreach (int v in theList) {
                    if (!li.Contains(v)) {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    } else {
        // First time this XOR key appears: create its bucket
        existingLists = new List<List<int>>();
        checkHash.Add(xorkey, existingLists);
    }
    if (!isIdentical) {
        // Never seen this list before: store it under its XOR key.
        // (Calling checkHash.Add again for an existing key would throw.)
        existingLists.Add(theList);
    }
    return isIdentical;
}

It is not the most elegant or the easiest to read at first sight; it is rather hacky, and I am not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing Lists of List<int> in the Dictionary.

If a duplicate key is found, we loop through the previously stored lists until we find a match.

The good point about the code is that it should be about as fast as you can get in most cases, and still faster than building strings when there is a collision.

answered 2009-02-27T03:08:04.320
2

Implement an IEqualityComparer<List<int>>, then you can use the list as a key in a dictionary.

If the lists are sorted, it could be as simple as this:

class IntListEqualityComparer : IEqualityComparer<List<int>> {

   // XOR of all elements: cheap, and independent of element order
   public int GetHashCode(List<int> list) {
      int code = 0;
      foreach (int value in list) code ^= value;
      return code;
   }

   // Element-by-element comparison; assumes both lists are sorted
   public bool Equals(List<int> list1, List<int> list2) {
      if (list1.Count != list2.Count) return false;
      for (int i = 0; i < list1.Count; i++) {
        if (list1[i] != list2[i]) return false;
      }
      return true;
   }

}

Now you can create a dictionary that uses the IEqualityComparer:

Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());

Add all the items from the first day to the dictionary, then loop through the items from the second day and check whether the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.

You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much whatever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).
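One alternative worth trying, as a sketch only: a multiply-and-add hash, which depends on element order and therefore assumes the lists are sorted, as above. With plain XOR the lists [1, 2] and [3] get the same code (1 ^ 2 = 3); this variant separates them. The class name SortedIntListComparer and the constants 17 and 31 are illustrative choices, not something measured against your data.

```csharp
using System.Collections.Generic;

// Hypothetical variant of the comparer with an order-sensitive hash.
public sealed class SortedIntListComparer : IEqualityComparer<List<int>>
{
    public int GetHashCode(List<int> list)
    {
        unchecked // let the arithmetic wrap around instead of throwing on overflow
        {
            int code = 17;
            foreach (int value in list) code = code * 31 + value;
            return code;
        }
    }

    public bool Equals(List<int> a, List<int> b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

Whether this beats XOR in practice depends on the distribution of your integers, so it is worth timing both.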

answered 2009-02-27T03:10:12.443
0

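Does ordering matter? That is, are [1,2] on day 1 and [2,1] on day 2 equal? If they are, then hashing might not work that well. You could use a sorted array/vector instead to help with the comparison.

Also, what kind of keys are they? Do they have a definite range (e.g. 0-63)? You might be able to concatenate them into a large integer (this may require more than 64 bits of precision) and hash that, instead of converting to a string, because that conversion might take a while.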
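To illustrate the bounded-range case: if every integer in values did fall in 0-63 (an assumption taken from the example range above, not something the question states), the whole list could be packed into one 64-bit bitmask. This is a bitmask rather than a literal concatenation, and BitmaskSignature is a made-up name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; only valid under the assumption that all values lie in 0..63.
public static class BitmaskSignature
{
    public static ulong FromValues(IEnumerable<int> values)
    {
        ulong mask = 0;
        foreach (int v in values)
        {
            if (v < 0 || v > 63)
                throw new ArgumentOutOfRangeException(nameof(values), "value outside 0..63");
            mask |= 1UL << v; // set the bit that corresponds to this value
        }
        return mask;
    }
}
```

Two lists of distinct integers then hold the same values exactly when their masks are equal, regardless of order, so no sorting and no collision check is needed in that case.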

answered 2009-02-27T02:36:45.987
0

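It might be worthwhile to put this in an SQL database. If you don't want a full-blown DBMS, you could use sqlite.

This would turn uniqueness checks, unions and similar operations into very simple queries, and it would be very efficient. It would also let you easily store the information in case it is ever needed again.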

answered 2009-02-27T04:02:00.090
0

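Have you considered summing up the values list to obtain an integer that can be used as a precheck of whether different lists contain the same set of values?

There will be many more collisions (the same sum does not necessarily mean the same set of values), but I think it can first eliminate a large portion of the comparisons required.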
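A small sketch of the sum precheck, assuming both lists hold distinct integers as the question describes (SumPrecheck and SameValues are illustrative names):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: cheap rejections first, full set comparison last.
public static class SumPrecheck
{
    public static bool SameValues(List<int> a, List<int> b)
    {
        if (a.Count != b.Count) return false;

        // Sum as long so that large values cannot overflow; different sums
        // prove the lists differ.
        if (a.Sum(v => (long)v) != b.Sum(v => (long)v)) return false;

        // Equal sums prove nothing (1 + 4 == 2 + 3), so confirm with a set comparison.
        return new HashSet<int>(a).SetEquals(b);
    }
}
```

In a real run the sums would be computed once per list and stored, rather than recomputed for every pair as in this sketch.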

answered 2009-02-27T04:02:48.667