I'm using PHP 5.3 and I want to count the words in some text for validation purposes. My problem is that the JavaScript function I use to validate the text returns a different word count than the PHP code does.

Here is the PHP code:

//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace numbers with X
$text = preg_replace('/\d/', 'X', $text);
// remove ./,/-/&
$text = str_replace(array('.',',','-','&'), '', $text);
// number of words
$count = str_word_count($text);

I noticed that with PHP 5.5 I get the right number of words, but with PHP 5.3 I don't. I searched for the cause and found this link (http://grokbase.com/t/php/php-bugs/12c14e0y6q/php-bug-bug-63663-new-str-word-count-does-not-properly-handle-non-latin-characters), which describes a bug where str_word_count does not properly handle non-Latin characters. I tried to work around it with this code:

// remove every character outside printable ASCII (note: this also strips tabs and newlines)
$text = preg_replace('/[^(\x20-\x7F)]*/','', $text);

But I still didn't get the right result. The word count was usually close to the expected value and sometimes exact, but it was often wrong.

I decided to write another PHP function to work around the bug. Here is the PHP code:

//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace one or more line breaks with a single space
$text = preg_replace("/[\n]+/", " ", $text);
// replace one or more whitespace characters with a separator string (@SEPARATOR@)
$text = preg_replace("/[\s]+/", "@SEPARATOR@", $text);
// explode the separator string (@SEPARATOR@) and get the array
$text_array = explode('@SEPARATOR@', $text);
// get the numbers of the array/words
$count = count($text_array);
// check if the last key of the array is empty and decrease the count by one 
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}

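This last version works well for me, and I would like to ask two questions:

  1. What can I do about the str_word_count function in the first case?
  2. Is my second solution accurate, or is there something I can do to improve it?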
2 Answers

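Have you considered splitting with a regular expression, so that you count words using your own definition of a word? I would probably recommend /[^\s]+/ as a "word", which means splitting on whitespace (/\s/) and counting the resulting array of "words".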

PHP: $input = 'your input here'; then count(preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY))

JS: var input = 'your input here'; then input.split(/\s+/).filter(Boolean).length

You can also use a regex character range to capture the set of characters you want to treat as valid word content. More on regular expressions: http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
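
As a rough sketch of the character-range idea, the snippet below counts runs of "word" characters instead of splitting on whitespace. The function names and the character class (Unicode letters, numbers and the apostrophe) are assumptions for illustration, not something from the question; it also assumes the input is valid UTF-8 and that PCRE was built with Unicode property support, which is the usual case.

```php
<?php
// Hypothetical helper: counts runs of characters that we choose to call a "word".
// The class below is an assumption; adjust it to your own definition.
function count_words_by_class($text)
{
    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    // The /u modifier makes PCRE treat $text as UTF-8.
    $matched = preg_match_all('/[\p{L}\p{N}\']+/u', $text, $matches);

    // preg_match_all() returns false on error (for example invalid UTF-8).
    return $matched === false ? 0 : $matched;
}

// Whitespace-based count, for comparison with the approach above.
function count_words_by_whitespace($text)
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by
    // leading or trailing whitespace.
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$sample = "Caf\xC3\xA9 au lait - it's 2014"; // "Café" written as UTF-8 bytes

echo count_words_by_class($sample), "\n";      // 5 (the lone "-" is not a word)
echo count_words_by_whitespace($sample), "\n"; // 6 (the lone "-" counts as a word)
```

Whichever definition you pick, the JavaScript validator has to use the same one, otherwise the two counts will keep drifting apart.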

answered 2014-03-31T00:59:14.403
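  1. Assuming you are asking how to keep using str_word_count: you could try preg_replace('/[^a-zA-Z0-9\s]/','',$string) after you have already replaced any punctuation. Without a "test string" that you know fails I can't try it myself, but at least you can test it on your side.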

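  2. One improvement would be to actually trim the text: the first comment mentions trimming, but the first line only strips the HTML tags. Add a trim($string), and then you can remove the last part: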

Change the first 2 lines:

//trim it & remove tags
$text = trim(strip_tags(html_entity_decode($text,ENT_QUOTES)));

Remove:

// check if the last key of the array is empty and decrease the count by one 
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}
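
Putting both suggestions together could look roughly like the sketch below. The function names are hypothetical, and the empty-string guard is an addition of this sketch: once the trailing check is removed, explode() on an empty string would still return one element. Note also that the ASCII filter drops accented letters, so a word written entirely in non-ASCII characters disappears from the count.

```php
<?php
// Sketch for point 1: keep str_word_count(), but feed it ASCII-only text.
// count_words_ascii() is a hypothetical name used for illustration.
function count_words_ascii($text)
{
    $text = trim(strip_tags(html_entity_decode($text, ENT_QUOTES)));
    // str_word_count() does not treat digits as word characters,
    // so map each digit to a letter first (as the question does).
    $text = preg_replace('/\d/', 'X', $text);
    // Drop everything except ASCII letters, digits and whitespace.
    $text = preg_replace('/[^a-zA-Z0-9\s]/', '', $text);

    return str_word_count($text);
}

// Sketch for point 2: trim first, then split on whitespace.
// count_words_split() is also a hypothetical name.
function count_words_split($text)
{
    $text = trim(strip_tags(html_entity_decode($text, ENT_QUOTES)));
    if ($text === '') {
        return 0; // explode() on '' would return array('') and count as 1
    }
    // \s already matches line breaks, so one replacement is enough.
    $text = preg_replace('/\s+/', '@SEPARATOR@', $text);

    return count(explode('@SEPARATOR@', $text));
}

echo count_words_ascii("<p>Caf\xC3\xA9 costs 5 euros.</p>"), "\n"; // 4
echo count_words_split("<p>Hello   world</p>\n\nfoo "), "\n";      // 3
```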
answered 2014-03-31T01:01:55.023