java - What is an efficient way to get values that are not contained in several arrays?

Question

Suppose I have two arrays:

int[] a1 = {5,2,1,13,4,9,7};
int[] a2 = {3,1,6,9,23,12,34};

Now I want to get some values that are not contained in either array, for example: 8, 10, 11, 14, ...

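My current solution stores the state (used/unused) of every possible value (about 14000 of them) in an additional boolean array. As soon as I use a value, it is marked in that array. So to find a value that is not contained in the other arrays, I only have to walk the additional array and look for an unmarked entry.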
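For reference, the approach described above might look roughly like the sketch below. The class and method names (`UsedFlags`, `mark`, `firstUnused`) are made up for illustration, and the code assumes every value lies in `[0, limit)`.

```java
import java.util.ArrayList;
import java.util.List;

final class UsedFlags {
    // Marks every value that occurs in any of the arrays; the index is the value.
    // Assumes all values lie in [0, limit).
    static boolean[] mark(int limit, int[]... arrays) {
        boolean[] used = new boolean[limit];
        for (int[] array : arrays) {
            for (int value : array) {
                used[value] = true;
            }
        }
        return used;
    }

    // Collects up to 'count' unmarked values, scanning upward from 'from'.
    static List<Integer> firstUnused(boolean[] used, int from, int count) {
        List<Integer> result = new ArrayList<>();
        for (int value = from; value < used.length && result.size() < count; value++) {
            if (!used[value]) {
                result.add(value);
            }
        }
        return result;
    }
}
```

With the two arrays above, `UsedFlags.firstUnused(UsedFlags.mark(14000, a1, a2), 8, 4)` yields 8, 10, 11, 14.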

Is there another (efficient) way to do this?

score 5 · Accepted Answer

Load the values into a Set<Integer> (once), then use set.contains().

If you use a HashSet, the contains() method is O(1), i.e. very fast.

Here is the code:

// Do this once
Set<Integer> set = new HashSet<Integer>();
for (int i : a1) set.add(i);
for (int i : a2) set.add(i);

Then, to check whether a number is in it:

if (set.contains(i))

Or not in it:

if (!set.contains(i))

To get the first number, starting from 1, that is not in the arrays:

int i = 0;
while (set.contains(++i));

To find all numbers in a given range that are not in the set, as an array:

// Exact final size, provided every value in the set lies within [min, max]
int[] arr = new int[max - min + 1 - set.size()];
int index = 0;
for (int i = min; i <= max; i++)
    if (!set.contains(i))
        arr[index++] = i;
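Sizing the result from set.size() is only exact when every stored value falls inside [min, max]. A minimal self-contained sketch that counts the missing values first, and so does not rely on that assumption, could look like this (the `MissingInRange` class and its `missing` method are illustrative names, not part of the answer):

```java
import java.util.HashSet;
import java.util.Set;

final class MissingInRange {
    // Returns all values in [min, max] that occur in none of the arrays.
    static int[] missing(int min, int max, int[]... arrays) {
        Set<Integer> set = new HashSet<>();
        for (int[] array : arrays) {
            for (int value : array) {
                set.add(value);
            }
        }

        // First pass: count, so the result array has exactly the right size.
        int count = 0;
        for (int i = min; i <= max; i++) {
            if (!set.contains(i)) {
                count++;
            }
        }

        // Second pass: fill the array in ascending order.
        int[] result = new int[count];
        int index = 0;
        for (int i = min; i <= max; i++) {
            if (!set.contains(i)) {
                result[index++] = i;
            }
        }
        return result;
    }
}
```

For the arrays in the question, `MissingInRange.missing(1, 15, a1, a2)` gives 8, 10, 11, 14, 15, even though 23 and 34 lie outside the range.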

score 3 · Accepted Answer

If all the values are relatively small and non-negative, you can do well with a BitSet:

int[] a1 = {5,2,1,13,4,9,7};
int[] a2 = {3,1,6,9,23,12,34};
BitSet bits = new BitSet();
for (int i : a1) {
    bits.set(i);
}
// Flip the bit for each value in a2: a value already set by a1 is cleared again
for (int i : a2) {
    bits.set(i, !bits.get(i));
}
int[] result = new int[bits.cardinality()];
int next = 0;
for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i+1)) {
    result[next++] = i;
}

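This is likely to be as fast as any solution using Set<Integer>, and it eliminates the autoboxing overhead associated with using the collections framework.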
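For the question as literally posed (values that appear in neither array), a variation on the same idea is to set a bit for every value in both arrays and then walk the clear bits with BitSet.nextClearBit. The sketch below is only an illustration; `BitSetUnused` and `firstUnused` are invented names, and it assumes all values are non-negative.

```java
import java.util.BitSet;

final class BitSetUnused {
    // Returns the first 'count' values >= 'from' that occur in none of the arrays.
    // Assumes every array value is non-negative (BitSet indices must be >= 0).
    static int[] firstUnused(int from, int count, int[]... arrays) {
        BitSet used = new BitSet();
        for (int[] array : arrays) {
            for (int value : array) {
                used.set(value);
            }
        }

        int[] result = new int[count];
        int candidate = from;
        for (int k = 0; k < count; k++) {
            // Jump straight to the next value whose bit is not set.
            candidate = used.nextClearBit(candidate);
            result[k] = candidate++;
        }
        return result;
    }
}
```

With the arrays above, `BitSetUnused.firstUnused(8, 4, a1, a2)` returns 8, 10, 11, 14.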

score 3 · Accepted Answer

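If memory is no object and performance is what matters ... and you know every possible value that could appear in the arrays ... then, sure, use a flag to indicate whether each value has been used. This gives you a fast test, but as the number of unused values shrinks, generating a new value can take a long time if you are trying to generate them randomly.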

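If you know the range of values, why not fill a container with all the possible values ... and then randomize the order in the container. When you need a value, just pop the "next" one off the container. With a system like this: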

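There is no need to test whether a value has already been used at all (because you already know the container holds each such value only once).
Generating a new value stays very fast no matter how many values have been used.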
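A rough sketch of such a pre-shuffled container follows. It is one possible reading of the idea, not the answerer's code: the `UnusedPool` class is hypothetical, and it assumes that values already present in the given arrays should be left out of the pool from the start.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class UnusedPool {
    private final Deque<Integer> pool;

    // Fills the pool with every value in [min, max] that is not in 'taken',
    // then shuffles it once up front.
    UnusedPool(int min, int max, Random random, int[]... taken) {
        Set<Integer> used = new HashSet<>();
        for (int[] array : taken) {
            for (int value : array) {
                used.add(value);
            }
        }
        List<Integer> candidates = new ArrayList<>();
        for (int value = min; value <= max; value++) {
            if (!used.contains(value)) {
                candidates.add(value);
            }
        }
        Collections.shuffle(candidates, random);
        pool = new ArrayDeque<>(candidates);
    }

    // Hands out an unused value in O(1); throws NoSuchElementException when empty.
    int next() {
        return pool.pop();
    }

    int remaining() {
        return pool.size();
    }
}
```

Each call to `next()` costs the same regardless of how many values have already been handed out, which is the property the answer is after.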

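Another data structure that can help you quickly determine whether a set of items already contains a particular item is a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter), although it does not sound like such a filter would be useful in your case.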

score 1 · Accepted Answer

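There are two performance aspects: the time to construct the set from the given arrays, and the time to obtain the next non-contained value.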

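There is also the memory usage aspect: if you have thousands of such independent sets, you probably want the set data structure to consume as little memory as possible. (But I guess that is not your case.)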

Finally, the typical number of non-contained values matters. I tested 2 cases: half of the values used, and 99% of the values used.

I benchmarked 3 solutions: your original boolean array, BitSet, and HashSet. https://gist.github.com/leventov/6749728
Results:

Benchmark                                   Mean    Units
construction_bitSet_05_load               19,184  usec/op
construction_bitSet_099_load              38,319  usec/op
construction_booleanArray_05_load          7,987  usec/op
construction_booleanArray_099_load        16,255  usec/op
construction_complementHashSet_05_load   859,151  usec/op
construction_complementHashSet_099_load  923,588  usec/op
construction_hashSet_05_load             262,920  usec/op
construction_hashSet_099_load            441,306  usec/op

nextIndex_bitSet_05_load                   2,086  nsec/op
nextIndex_bitSet_099_load                  2,147  nsec/op
nextIndex_booleanArray_05_load             9,264  nsec/op
nextIndex_booleanArray_099_load           65,424  nsec/op
nextIndex_complementHashSet_05_load       27,298  nsec/op
nextIndex_complementHashSet_099_load     142,565  nsec/op
nextIndex_hashSet_05_load                 27,159  nsec/op
nextIndex_hashSet_099_load              1948,120  nsec/op

(The complement HashSet is faster than the ordinary one at 99% load, but it is very expensive to construct.)

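As I personally expected, your original solution is the fastest at set construction, while BitSet is the fastest at retrieving the next non-contained value.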

Memory consumption:

boolean[]: 14000 bytes.
BitSet: 1750 bytes (1 byte per 8 possible values)
HashSet: ~= 62 bytes per contained value (~= 38 bytes with -XX:+UseCompressedOops)
Complement HashSet: similarly, per non-contained value.

score 0 · Accepted Answer

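This approach is probably the most CPU-efficient one, and it uses zero memory. It requires modifying the initial data and is based on the Google Guava library.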

If you cannot modify your arrays, you can clone them with a1.clone(); it will still be very efficient.

Method:

Sort both arrays.
Use Iterators.mergeSorted to iterate over the merged, sorted sequence of numbers.

Code:

Arrays.sort(a1);
Arrays.sort(a2);

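// Guava's mergeSorted takes an Iterable of iterators plus a Comparator
Iterator<Integer> mergedIterator = Iterators.mergeSorted(
        Arrays.asList(Ints.asList(a1).iterator(), Ints.asList(a2).iterator()),
        Ordering.<Integer>natural());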

... iterate to find gaps in the sorted sequence ...

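Its time complexity is O(n log(n)), and its memory consumption is zero (constant). It could be even faster with iterators that do not box their values (for example those in the Trove4j library).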
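The gap-finding step is left open above. One way to sketch it without Guava and without boxing inside the scan is to walk the range while advancing a cursor through each sorted array. This is a stand-in for the elided loop, not the answerer's code: `SortedGaps` and `gaps` are invented names, the arrays are cloned so the inputs stay untouched, and `max` is assumed to be smaller than Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortedGaps {
    // Returns the values in [min, max] that occur in neither array.
    // Assumes max < Integer.MAX_VALUE so the loop counter cannot overflow.
    static List<Integer> gaps(int min, int max, int[] first, int[] second) {
        int[] a = first.clone();
        int[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);

        List<Integer> result = new ArrayList<>();
        int i = 0; // cursor into a
        int j = 0; // cursor into b
        for (int value = min; value <= max; value++) {
            // Skip entries smaller than the current candidate.
            while (i < a.length && a[i] < value) i++;
            while (j < b.length && b[j] < value) j++;
            boolean present = (i < a.length && a[i] == value)
                    || (j < b.length && b[j] == value);
            if (!present) {
                result.add(value);
            }
        }
        return result;
    }
}
```

The scan itself is linear in the size of the range plus the array lengths; the sorting dominates for large arrays.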

score -1 · Accepted Answer

Maybe use an ArrayList instead of int[] and then use the .contains(Object obj) method to see whether an int value already exists in either array.
