4 Answers
UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that:
- ASCII characters stay the same when encoded to UTF-8
- no non-ASCII character is ever encoded using bytes from the ASCII range (0x00 to 0x7F)
This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use the standard single-byte string functions.
In your example, you can just use explode(';', $utf8encodedText) and everything should work as expected.
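As a minimal sketch of the point above (the sample string is made up for illustration, and the script is assumed to be saved as UTF-8), a byte-oriented explode() on a UTF-8 string could look like this:

```php
<?php
// Hypothetical sample data: two Greek place names and a number.
$utf8encodedText = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242";

// explode() works on bytes. That is safe here because the byte 0x3B (';')
// never occurs inside a multi-byte UTF-8 sequence.
$fields = explode(';', $utf8encodedText);

print_r($fields);

// strlen() counts bytes, not characters: the five Greek letters of the
// first field take two bytes each in UTF-8.
var_dump(strlen($fields[0])); // int(10)
```

The pieces are never cut in the middle of a character, so each one is still valid UTF-8.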
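PS: Since UTF-8 is self-synchronizing (the byte sequence of one character never appears inside the encoding of another character), you can actually use explode() with any UTF-8 encoded separator, as long as both the text and the separator are valid UTF-8.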
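To illustrate the PS, here is a small sketch with a multi-byte separator. The input string and the choice of the arrow character are just assumptions for the example, and the script is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical input using U+2192 (RIGHTWARDS ARROW) as the separator.
// The arrow is three bytes in UTF-8 (E2 86 92).
$text = "α→β→γ";

// explode() searches for the separator's byte sequence. In valid UTF-8
// that sequence can only match at a real occurrence of the arrow.
$parts = explode("→", $text);

print_r($parts);
```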
PPS: It looks like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work fine on UTF-8 encoded input as long as you use ASCII characters for separators, quotes, etc.
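A rough sketch of the fgetcsv() approach follows. The CSV content is invented for the example and is written to an in-memory stream so the snippet is self-contained; with a real file you would use something like fopen('file.csv', 'r') instead. Note that fgetcsv() takes the locale into account, so results may differ under unusual locale settings.

```php
<?php
// Hypothetical CSV data; the second row has a quoted field containing ';'.
$csv = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242\n\"ΝΕΑ;ΣΜΥΡΝΗ\";x;17121\n";

// Put the data into an in-memory stream so fgetcsv() has a handle to read.
$handle = fopen('php://memory', 'w+');
fwrite($handle, $csv);
rewind($handle);

$rows = [];
// Separator, enclosure and escape are all ASCII characters.
while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

Unlike a plain explode(), this also handles quoted fields that contain the separator.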
The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding:
mb_regex_encoding('UTF-8');
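Put together, a minimal sketch could look like the following. The input line is made up, the script is assumed to be saved as UTF-8, and the mbstring extension (with regex support) is assumed to be installed:

```php
<?php
// Tell the mb_ereg family (including mb_split) to treat input as UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical input line.
$line = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889";

// The first argument of mb_split() is a regular expression;
// ';' has no special meaning in it, so it matches a literal semicolon.
$pieces = mb_split(';', $line);

print_r($pieces);
```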
About mb_detect_encoding: it can fail, but that's simply because an encoding can never be detected with certainty. You either know it, or you can guess, but that's all. Encoding detection is mostly a gamble; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.
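For instance, a small sketch of the strict check, asking only "is this valid UTF-8 or not?". The two sample strings are assumptions for illustration, and the script is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical inputs: one valid UTF-8 string, and "café" encoded as
// ISO-8859-1, where the lone byte 0xE9 is not valid UTF-8.
$utf8   = "ΑΘΗΝΑ";
$latin1 = "caf\xE9";

// With strict = true and a single candidate, the function returns the
// encoding name if the string is valid in it, and false otherwise.
$first  = mb_detect_encoding($utf8, 'UTF-8', true);
$second = mb_detect_encoding($latin1, 'UTF-8', true);

var_dump($first);  // string(5) "UTF-8"
var_dump($second); // bool(false)
```

This does not tell you what the second string actually is, only that it is not UTF-8, which is usually the most you can reliably find out.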
How to remove the BOM:
You can filter the string input and remove a UTF-8 BOM with a small helper function:
/**
 * Remove the UTF-8 BOM if the string has one at the beginning.
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
    // Compare the first three bytes against the UTF-8 BOM (EF BB BF).
    if (substr($str, 0, 3) === "\xEF\xBB\xBF")
    {
        $str = substr($str, 3);
    }
    return $str;
}
Usage:
$line = remove_utf8_bom($line);
There are probably better ways to do it, but this should work.
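One possible alternative, shown only as a sketch: an anchored preg_replace() that strips the three BOM bytes if they are at the very start. The sample line is hypothetical and the script is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical first line of a file that was saved with a UTF-8 BOM.
$line = "\xEF\xBB\xBFΑΘΗΝΑ;ΑΙΓΑΛΕΩ";

// '^' anchors the match to the start of the string, so only a leading
// BOM is removed. The pattern is matched byte-wise (no /u modifier).
$clean = preg_replace('/^\xEF\xBB\xBF/', '', $line);

// Exactly three bytes were removed.
var_dump(strlen($line) - strlen($clean)); // int(3)

// After that, the line can be split as usual.
$fields = explode(';', $clean);
print_r($fields);
```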
Edit: I just read your post more closely. You're saying that mb_split() introduces a BOM, which would mean the following outputs false:
header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);
$pieces = mb_split(';', $str);
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);
Does it? It works as expected for me (bool true, and the strings in the array are correct).
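If you want to rule out stray bytes without depending on how the browser renders the output, you could dump the raw bytes of the first piece with bin2hex(). This is just a sketch reusing the same base64 test string as above; it assumes the mbstring extension is available.

```php
<?php
// Same test data as above, base64-encoded so the script's own encoding
// does not matter.
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

mb_regex_encoding('UTF-8');
$pieces = mb_split(';', $str);

// Hex dump of the first piece: five Greek letters, two bytes each.
$hex = bin2hex($pieces[0]);
var_dump($hex); // string(20) "ce91ce98ce97ce9dce91"

// A UTF-8 BOM would show up as a leading "efbbbf".
var_dump(strncmp($hex, 'efbbbf', 6) === 0); // bool(false)
```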
When you write debug/testing scripts in PHP, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>Test page for project XY</title>
</head>
<body>
<h1>Test Page</h1>
<pre><?php
echo print_r($_GET,1);
?></pre>
</body>
</html>
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.