c - 在 C 中删除数组中的重复项

Question

这个问题有点复杂。这里的问题是摆脱重复并将数组的唯一元素保存到具有原始序列的另一个数组中。

例如：

如果输入的是 bacadt

结果应该是： bacdt 处于输入输入的确切状态。

因此，为了对数组进行排序，然后检查无法正常工作，因为我丢失了原始序列。有人建议我使用索引数组，但我不知道该怎么做。那么你有什么建议这样做呢？

对于那些愿意回答问题的人，我想添加一些具体信息。

char** finduni(char *words[100],int limit)
{
//
//Methods here
//
}

是我的功能。应删除其重复项并将其存储在不同数组中的数组是 words[100]。因此，该过程将在此完成。我首先考虑将单词的所有元素放入另一个数组并对该数组进行排序，但经过一些测试后这不起作用。只是提醒解决者:)。

score 3 · Accepted Answer

好吧，这是char类型的一个版本。请注意，它不会缩放。

#include "stdio.h"
#include "string.h"

void removeDuplicates(unsigned char *string)
{
   unsigned char allCharacters [256] = { 0 };
   int lookAt;
   int writeTo = 0;
   for(lookAt = 0; lookAt < strlen(string); lookAt++)
   {
      if(allCharacters[ string[lookAt] ] == 0)
      {
         allCharacters[ string[lookAt] ] = 1;  // mark it seen
         string[writeTo++] = string[lookAt];     // copy it
      }
   }
   string[writeTo] = '\0';
}

int main()
{
   char word[] = "abbbcdefbbbghasdddaiouasdf";
   removeDuplicates(word);
   printf("Word is now [%s]\n", word);
   return 0;
}

以下是输出：

Word is now [abcdefghsiou]

那是你想要的吗？如果字母之间有空格，您可以修改该方法，但如果您使用int、float、double或char *作为类型，则此方法根本无法缩放。

编辑

我发布然后看到你的澄清，它是一个数组char *。我会更新方法。

我希望这不是太多的代码。我调整了这个 QuickSort 算法，基本上给它添加了索引内存。该算法是 O(n log n)，因为下面的 3 个步骤是相加的，这是其中 2 个步骤的最坏情况复杂度。

对字符串数组进行排序，但每次交换也应反映在索引数组中。在此阶段之后，第 i 个元素originalIndices保存已排序数组的第 i 个元素的原始索引。
删除排序数组中的重复元素，方法是将它们设置为NULL，并将索引值设置为elements，这是可以达到的最高值。
对原始索引数组进行排序，并确保每个交换都反映在字符串数组中。这给了我们原始的字符串数组，除了重复的在最后而且它们都是NULL.
为了更好地衡量，我返回了新的元素计数。

代码：

#include "stdio.h"
#include "string.h"
#include "stdlib.h"

void sortArrayAndSetCriteria(char **arr, int elements, int *originalIndices)
{
   #define  MAX_LEVELS  1000
   char *piv;
   int  beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
   int idx, cidx;
   for(idx = 0; idx < elements; idx++)
      originalIndices[idx] = idx;
   beg[0] = 0;
   end[0] = elements;
   while (i>=0)
   {
      L = beg[i];
      R = end[i] - 1;
      if (L<R)
      {
         piv = arr[L];
         cidx = originalIndices[L];
         if (i==MAX_LEVELS-1)
            return;
         while (L < R)
         {
            while (strcmp(arr[R], piv) >= 0 && L < R) R--;
            if (L < R)
            {
               arr[L] = arr[R];
               originalIndices[L++] = originalIndices[R];
            }
            while (strcmp(arr[L], piv) <= 0 && L < R) L++;
            if (L < R)
            {
               arr[R] = arr[L];
               originalIndices[R--] = originalIndices[L];
            }
         }
         arr[L] = piv;
         originalIndices[L] = cidx;
         beg[i + 1] = L + 1;
         end[i + 1] = end[i];
         end[i++] = L;
      }
      else
      {
         i--;
      }
   }
}

int removeDuplicatesFromBoth(char **arr, int elements, int *originalIndices)
{
   // now remove duplicates
   int i = 1, newLimit = 1;
   char *curr = arr[0];
   while (i < elements)
   {
      if(strcmp(curr, arr[i]) == 0)
      {
         arr[i] = NULL;   // free this if it was malloc'd
         originalIndices[i] = elements;  // place it at the end
      }
      else
      {
         curr = arr[i];
         newLimit++;
      }
      i++;
   }
   return newLimit;
}

void sortArrayBasedOnCriteria(char **arr, int elements, int *originalIndices)
{
   #define  MAX_LEVELS  1000
   int piv;
   int beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
   int idx;
   char *cidx;
   beg[0] = 0;
   end[0] = elements;
   while (i>=0)
   {
      L = beg[i];
      R = end[i] - 1;
      if (L<R)
      {
         piv = originalIndices[L];
         cidx = arr[L];
         if (i==MAX_LEVELS-1)
            return;
         while (L < R)
         {
            while (originalIndices[R] >= piv && L < R) R--;
            if (L < R)
            {
               arr[L] = arr[R];
               originalIndices[L++] = originalIndices[R];
            }
            while (originalIndices[L] <= piv && L < R) L++;
            if (L < R)
            {
               arr[R] = arr[L];
               originalIndices[R--] = originalIndices[L];
            }
         }
         arr[L] = cidx;
         originalIndices[L] = piv;
         beg[i + 1] = L + 1;
         end[i + 1] = end[i];
         end[i++] = L;
      }
      else
      {
         i--;
      }
   }
}

int removeDuplicateStrings(char *words[], int limit)
{
   int *indices = (int *)malloc(limit * sizeof(int));
   int newLimit;
   sortArrayAndSetCriteria(words, limit, indices);
   newLimit = removeDuplicatesFromBoth(words, limit, indices);
   sortArrayBasedOnCriteria(words, limit, indices);
   free(indices);
   return newLimit;
}

int main()
{
   char *words[] = { "abc", "def", "bad", "hello", "captain", "def", "abc", "goodbye" };
   int newLimit = removeDuplicateStrings(words, 8);
   int i = 0;
   for(i = 0; i < newLimit; i++) printf(" Word @ %d = %s\n", i, words[i]);
   return 0;
}

score 0 · Accepted Answer

您知道如何为 char 类型执行此操作，对吗？你可以用字符串做同样的事情，但不是使用布尔数组（这在技术上是“set”对象的实现），你必须用线性字符串数组来模拟“set”（或布尔数组）你已经遇到了。即你有一个你已经看到的字符串数组，对于每个新字符串，你检查它是否在“seen”字符串数组中，如果是，那么你忽略它（不是唯一的），如果它不在数组中，你添加它既可以看到字符串数组，也可以输出。如果您有少量不同的字符串（低于 1000），您可以忽略性能优化，只需将每个新字符串与您之前已经看到的所有内容进行比较。

但是，对于大量字符串（几千个），您需要稍微优化一下：

1) 每次将新字符串添加到已经看到的字符串数组时，使用插入排序算法对数组进行排序。不要使用快速排序，因为当数据几乎排序时，插入排序往往会更快。

2) 检查字符串是否在数组中时，使用二分查找。

如果不同字符串的数量是合理的（即您没有数十亿个唯一字符串），那么这种方法应该足够快。

score 0 · Accepted Answer

遍历数组中的项——O(n)操作
对于每个项目，将其添加到另一个排序数组
在将其添加到排序数组之前，请检查该条目是否已存在 -O(log n)操作

最后，O(n log n)操作

score 0 · Accepted Answer

我认为在 C 中你可以创建第二个数组。然后仅当该元素不在发送数组中时，才从原始数组中复制该元素。这也保留了元素的顺序。

如果您一个一个地读取元素，您可以在插入原始数组之前丢弃该元素，这可以加快该过程。

score 0 · Accepted Answer

正如 Thomas 在评论中所建议的那样，如果保证数组的每个元素都来自一组有限的值（例如 a char），那么您可以及时实现这一点O(n)。

保留一个 256 的数组bool（或者int如果您的编译器不支持bool），或者数组中可能有许多不同的离散值。将所有值初始化为false。
逐一扫描输入数组。
对于每个元素，如果数组中对应的值为，bool则将false其添加到输出数组中，并将bool数组值设置为true。否则，什么也不做。

c - 在 C 中删除数组中的重复项

5 回答 5

Related

Reference