4 Answers

4

UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that:

  • ASCII characters keep the same single-byte values when encoded as UTF-8
  • every byte in the encoding of a non-ASCII character is 0x80 or higher, so it can never be mistaken for an ASCII character

This means that when you split a UTF-8 string on the semicolon character ;, which is an ASCII character, you can just use the standard single-byte string functions.

In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
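As a minimal sketch of that claim (the sample record is made up for illustration, and the script file is assumed to be saved as UTF-8):

```php
<?php
// Sample record, assumed for illustration: Greek text followed by ASCII digits.
$utf8encodedText = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242";

// explode() compares raw bytes. The byte 0x3B (';') never occurs inside a
// multibyte UTF-8 sequence, so every split lands on a character boundary.
$fields = explode(';', $utf8encodedText);

print_r($fields);

// strlen() counts bytes, not characters: each Greek letter here takes 2 bytes.
echo strlen($fields[0]), " bytes in the first field\n";
```

Note that only the splitting is byte-safe this way; functions that count or index characters, such as strlen() or substr(), still work on bytes.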

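PS: Since the UTF-8 encoding is prefix-free and self-synchronizing (lead bytes and continuation bytes use separate ranges, so the encoding of one character can never match in the middle of another), you can actually use explode() with any UTF-8 encoded separator.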
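A small sketch with a non-ASCII separator, assuming the script file is saved as UTF-8 (the arrow separator and the sample string are just illustrations):

```php
<?php
// Hypothetical separator: U+2192 RIGHTWARDS ARROW, encoded as 3 bytes in UTF-8.
$separator = "→";
$route = "Αθήνα→Πειραιάς→Κόρινθος";

// explode() searches for the separator's byte sequence. In valid UTF-8 that
// sequence can only match where a whole "→" character actually occurs.
$stops = explode($separator, $route);

print_r($stops);
```

This only holds when both the separator and the text are valid UTF-8.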

PPS: It looks like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work fine with UTF-8 encoded data as long as you use ASCII characters for separators, quotes, etc.
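A rough sketch of the fgetcsv() approach. To stay self-contained it reads from an in-memory stream instead of a real file, and the second row is invented to show a quoted field containing the separator:

```php
<?php
// In practice $handle would come from fopen() on the CSV file; php://memory
// is used here only so the example runs without an input file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889\n");
fwrite($handle, "\"ΝΕΑ;ΣΜΥΡΝΗ\";ΑΘΗΝΑ;17121;0;0\n"); // made-up sample row
rewind($handle);

$rows = [];
// Separator, enclosure and escape character are all ASCII, as recommended above.
while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

Unlike a plain explode(), fgetcsv() keeps a quoted field together even when it contains the separator.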

answered 2011-12-03T22:32:49.947
1

The mb_split() function should be fine, but you should also set the character encoding it uses with mb_regex_encoding():

mb_regex_encoding('UTF-8');
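Put together, a minimal sketch might look like this. It assumes the mbstring extension with its regex functions is available, and the sample line is made up:

```php
<?php
// Tell the multibyte regex functions (mb_split, mb_ereg, ...) to treat
// patterns and subjects as UTF-8.
mb_regex_encoding('UTF-8');

// Sample line, assumed for illustration; the script file is saved as UTF-8.
$line = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242";

// The first argument of mb_split() is a regular expression, not a plain
// string; ';' has no special meaning in a regex, so it needs no escaping.
$fields = mb_split(';', $line);

print_r($fields);
```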

About mb_detect_encoding(): it can fail, but that is simply because an encoding can never be detected with certainty. You either know the encoding or you can make a guess, but that's all. Encoding detection is mostly a guessing game; however, you can pass the strict parameter to that function and specify the encoding(s) you're looking for.
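A small sketch of the strict check, assuming the mbstring extension is available. With a single candidate encoding and strict mode, the call effectively answers "is this valid UTF-8?" (the byte strings are examples chosen for illustration):

```php
<?php
$valid   = "\xCE\x91\xCE\x98"; // the UTF-8 bytes of "ΑΘ"
$invalid = "\xC1\xC8";         // 0xC1 can never appear in valid UTF-8

// Third argument true = strict mode: the input must be fully valid in the
// candidate encoding, otherwise the function returns false.
$validResult   = mb_detect_encoding($valid, 'UTF-8', true);
$invalidResult = mb_detect_encoding($invalid, 'UTF-8', true);

var_dump($validResult);
var_dump($invalidResult);
```

If a yes/no answer is all that is needed, mb_check_encoding($str, 'UTF-8') expresses the same check more directly.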

How to remove the BOM (byte order mark):

You can filter the string input and remove a UTF-8 BOM with a small helper function:

/**
 * remove UTF-8 BOM if string has it at the beginning
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
   if (substr($str, 0, 3) === "\xEF\xBB\xBF")
   {
       $str = substr($str, 3);
   }
   return $str;
}

Usage:

$line = remove_utf8_bom($line);

There are probably better ways to do it, but this should work.
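One possible variation, sketched under the assumption that the BOM can only occur at the very start of the input, is to strip it from the first line only while reading. The helper name strip_utf8_bom and the sample data are hypothetical, and an in-memory stream stands in for the real file:

```php
<?php
// Hypothetical helper: returns $str without a leading UTF-8 BOM, if any.
function strip_utf8_bom(string $str): string
{
    // strncmp() compares only the first 3 bytes, so no temporary copy is needed.
    return strncmp($str, "\xEF\xBB\xBF", 3) === 0 ? substr($str, 3) : $str;
}

// Made-up sample input that starts with a BOM.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBFΑΘΗΝΑ;12242\nΑΙΓΑΛΕΩ;12243\n");
rewind($handle);

$lines = [];
$isFirstLine = true;
while (($line = fgets($handle)) !== false) {
    if ($isFirstLine) {
        $line = strip_utf8_bom($line); // a BOM is only meaningful at the start
        $isFirstLine = false;
    }
    $lines[] = rtrim($line, "\r\n");
}
fclose($handle);

print_r($lines);
```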

answered 2011-12-03T17:43:40.057
1

Edit: I just read your post more closely. You seem to be saying that mb_split() introduces a BOM, in which case the following should output false.

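header('content-type: text/plain;charset=utf-8');
// Base64 of the UTF-8 bytes for "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

$pieces = mb_split(';', $str);

// "ΑΘΗΝΑ" is 5 characters of 2 bytes each, so the first field is the first 10 bytes
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);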

Does it? It works as expected for me (bool(true), and the strings in the array are correct).

answered 2011-12-03T20:04:45.267
1

When you write debug/testing scripts in PHP, make sure you output a more or less valid HTML page.

I like to use a PHP file similar to the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
  </head>
  <body>
     <h1>Test Page</h1>
     <pre><?php
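        // Escape the dump so that request data cannot inject markup into the page
        echo htmlspecialchars(print_r($_GET, true), ENT_QUOTES, 'UTF-8');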
     ?></pre>
  </body>
</html>

If you don't include any HTML tags, the browser might treat the output as a plain text file and has to guess its encoding, so all kinds of weird things can happen. In your case, I assume the browser interpreted the output as Latin-1 encoded text. I assume it worked with the BOM because, whenever the BOM was present, the browser recognized the output as UTF-8.

answered 2011-12-04T17:35:21.133