
I'm trying to find a way to use Cognitive Services to detect whether a string contains a coherent piece of text or just junk. For example:

SDF#%# ASFSDS b

versus

Hi, my name is Sam.

This seems to be impossible. I had the idea of running the text through key phrase extraction in Text Analytics (which gives me a key phrase of ASFSDS — how useful!) and then running that key phrase through Bing Spell Check. I'm not sure what's going on over in the US, but apparently ASFSDS is English. This is really... uh... dumb.

I've run similar text through a range of services, such as Language Detection, and they all seem convinced that my gibberish sample is 100% coherent English.

I'll be quizzing an MS rep on Friday, but I'm wondering whether anyone has achieved something similar with Cognitive Services?


1 Answer


Rather than a binary is-a-word-or-not question, what you might consider instead is the probability of a word being gibberish. You can then choose a threshold that suits you.

For computing word probabilities, you might try the Web Language Model API. You could look at the joint probability, for example. For your set of words, the response looks as follows (values for the body corpus):

{
  "results": [
    {
      "words": "sdf#%#",
      "probability": -12.215
    },
    {
      "words": "asfsds",
      "probability": -12.215
    },
    {
      "words": "b",
      "probability": -3.127
    },
    {
      "words": "hi",
      "probability": -3.905
    },
    {
      "words": "my",
      "probability": -2.528
    },
    {
      "words": "name",
      "probability": -3.128
    },
    {
      "words": "is",
      "probability": -2.201
    },
    {
      "words": "sam.",
      "probability": -12.215
    },
    {
      "words": "sam",
      "probability": -4.431
    }
  ]
}
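Given a response like the one above, one way to apply the threshold idea is to flag a phrase when most of its words score near the out-of-vocabulary floor. A minimal sketch — the floor of -9.0 is an arbitrary cutoff chosen for this sample, not anything the API prescribes:

```python
import json

# Response from the Web Language Model API for the sample words
# (the same values shown above).
response = json.loads("""
{"results": [
  {"words": "sdf#%#", "probability": -12.215},
  {"words": "asfsds", "probability": -12.215},
  {"words": "b", "probability": -3.127},
  {"words": "hi", "probability": -3.905},
  {"words": "my", "probability": -2.528},
  {"words": "name", "probability": -3.128},
  {"words": "is", "probability": -2.201},
  {"words": "sam", "probability": -4.431}
]}
""")

probs = {r["words"]: r["probability"] for r in response["results"]}

# Words near the corpus floor of roughly -12.2 are effectively
# out-of-vocabulary; -9.0 is an arbitrary cutoff for this sample.
FLOOR = -9.0

def is_gibberish(words, probs, floor=FLOOR):
    """Flag a phrase as gibberish when at least half of its words
    score below the chosen log-probability floor."""
    low = sum(1 for w in words if probs[w] < floor)
    return low / len(words) >= 0.5

print(is_gibberish(["sdf#%#", "asfsds", "b"], probs))          # True
print(is_gibberish(["hi", "my", "name", "is", "sam"], probs))  # False
```

Averaging the raw log probabilities would work too; the fraction-of-low-words rule just keeps one garbage token from dragging down an otherwise coherent sentence, and vice versa.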

You will notice a couple of idiosyncrasies:

  1. Probabilities are negative. This is because they are logarithmic.
  2. All terms are case-folded. This means that the corpus won't distinguish between, say, GOAT and goat.
  3. The caller must perform a certain amount of normalization themselves (note the probability of sam. vs. sam).
  4. Corpora are only available for the en-us market. This could be problematic depending on your use case.
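For point 3, a small normalization pass before querying avoids the sam. vs. sam discrepancy — a sketch using a hypothetical `normalize` helper:

```python
import string

def normalize(word):
    """Case-fold and strip surrounding punctuation so that, e.g.,
    'Sam.' is queried as 'sam' and 'GOAT' as 'goat'."""
    return word.lower().strip(string.punctuation)

print(normalize("Sam."))  # sam
print(normalize("GOAT"))  # goat
```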

An advanced use case would be computing conditional probabilities, i.e. the probability of a word in the context of words preceding it.

answered 2017-07-25T17:53:57.710