perl - How do I get Unicode code points in Perl v5.24?

Question

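I want to log the hexadecimal Unicode code points of strings that are cut and pasted into bash as arguments. ord does not do this; ord seems to work only in the ASCII range.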

Most of what I have found about ord is at least six years old and no longer relevant, since I am using v5.24, which I have read has Unicode support built in. In Python it is trivial:


for i in unicode(sys.argv[1], 'utf-8'):
    print i.encode("utf_16_be").encode("hex")

This works from bash. I think the problem is in the ord function itself, which does not seem to have been updated for Unicode.


# ord.pl does not provide the unicode code point for a pasted variable.
use strict;
use warnings;
#use charnames (); #nope.
#use feature 'unicode_strings'; #nope.  Already automatically using as of v5.12.
#use utf8; #nope.
#binmode(STDOUT, ":encoding(UTF-8)"); #nope.

my $arg = "";

foreach $arg  (@ARGV) {
  print $arg . " is " . ord($arg) . " in code.\n";  # seems to me ord is ascii only.
  #utf8::encode($arg);  #nope.
  #print unpack("H*", $arg) . "\n";  #nope.

  #printf "%vX\n", $arg;  #nope.
}

This gives:

david@A8DT01:~/bin$ ord.pl A B C D a b c d \  \\ … —  €
A is 65 in code.
41
B is 66 in code.
42
C is 67 in code.
43
D is 68 in code.
44
a is 97 in code.
61
b is 98 in code.
62
c is 99 in code.
63
d is 100 in code.
64
  is 32 in code.
20
\ is 92 in code.
5c
… is 226 in code.
c3a2c280c2a6
— is 226 in code.
c3a2c280c294
 is 239 in code.
c3afc280c2a8
€ is 226 in code.
c3a2c282c2ac
david@A8DT01:~/bin$

I want to get the same output that I get in Python:

david@A8DT01:~/bin$ python code-points.py "ABCDabcd \ … —  €"
0041
0042
0043
0044
0061
0062
0063
0064
0020
005c
0020
2026
0020
2014
0020
f028
0020
20ac
david@A8DT01:~/bin$

score 5 · Accepted Answer

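This is not a problem with ord but with encoding. Input from the command line is usually UTF-8 encoded, and ord expects a single character, not a multi-byte string; given an undecoded argument it returns the value of the first byte, which is why … shows up as 226 (0xE2) in your output. You can use the -CA switch to decode @ARGV automatically (or -CSA to also apply UTF-8 to the standard streams, so that STDOUT is encoded for the terminal), or decode in the script.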

use strict;
use warnings;
use Encode;
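foreach my $arg (@ARGV) {
  # Turn the UTF-8 bytes from the command line into a string of characters.
  my $decoded = decode 'UTF-8', $arg;
  # ord returns the ordinal of the first character only.
  print $arg . " is " . ord($decoded) . " in code.\n";
}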
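
The loop above reports only the first character of each argument, because ord returns the ordinal of the first character of its argument. As a minimal sketch, assuming the arguments arrive as UTF-8 bytes (that is, the script is run without -CA), every character could be listed like this; code_points is a hypothetical helper name, not part of Encode:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: takes a UTF-8 encoded byte string (as found in @ARGV
# when -CA is not in effect) and returns one "U+XXXX" label per character.
sub code_points {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    return map { sprintf('U+%04X', ord($_)) } split(//, $chars);
}

# One line of output per command-line argument.
print join(' ', code_points($_)), "\n" for @ARGV;
```

Called with the three UTF-8 bytes of € (0xE2 0x82 0xAC), code_points returns the single label U+20AC.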

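However, your Python script does something quite different: it returns the hexadecimal representation of the string encoded as UTF-16BE, not the decimal ordinal of the Unicode character. You can do that in Perl as well.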

use strict;
use warnings;
use Encode;
foreach my $arg (@ARGV) {
  my $utf16 = encode 'UTF-16BE', decode 'UTF-8', $arg;
  # %vX prints the ordinal of each byte in hex, joined by ".", e.g. "0.41" for "A".
  print $arg . " is " . sprintf("%vX", $utf16) . " in code.\n";
}

score 3 · Accepted Answer

The Perl equivalent of

for ucp_str in unicode(sys.argv[1], 'utf-8'):
    print ucp_str.encode("utf_16_be").encode("hex")

is

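use feature qw( say );
use Encode qw( decode encode );

# Decode the UTF-8 argument, then print each character's UTF-16BE bytes in hex.
for my $ucp_str (split(//, decode("UTF-8", $ARGV[0]))) {
   say unpack("H*", encode("UTF-16be", $ucp_str));
}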

Demo:

$ ./a.py aé€♠𠀀
0061
00e9
20ac
2660
d840dc00

$ ./a.pl aé€♠𠀀
0061
00e9
20ac
2660
d840dc00

But you asked for code points, and that is not what these programs output. For that, you can use the following:

use feature qw( say );
use Encode qw( decode_utf8 );

# 'W*' yields the ordinal of every character in the decoded string.
for my $ucp_num (unpack('W*', decode_utf8($ARGV[0]))) {
   say sprintf("%04X", $ucp_num);
}

Demo:

$ ./a2.pl aé€♠𠀀
0061
00E9
20AC
2660
20000
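
The two kinds of program disagree only for characters outside the Basic Multilingual Plane, which is why just the last line of the demos differs (d840dc00 versus 20000): UTF-16 stores such a character as a surrogate pair. A small sketch of the comparison, using the hypothetical helper names utf16be_hex and code_point_hex:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: hex of the UTF-16BE bytes of a single character.
sub utf16be_hex {
    my ($char) = @_;
    return unpack('H*', encode('UTF-16BE', $char));
}

# Hypothetical helper: the code point of a single character, in hex.
sub code_point_hex {
    my ($char) = @_;
    return sprintf('%04X', ord($char));
}

# U+20AC is inside the BMP, U+20000 is outside it.
for my $char ("\x{20AC}", "\x{20000}") {
    printf "%s  %s\n", code_point_hex($char), utf16be_hex($char);
}
# prints:
# 20AC  20ac
# 20000  d840dc00
```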

To get the characters of a string as strings:

unpack('(a)*', $_)
split(//, $_)

To get the characters of a string as numbers:

unpack('W*', $_)
map { ord($_) } split(//, $_)

To convert a string of bytes (characters in the range 0x00..0xFF) to hex:

unpack('H*', $_)
join "", map { sprintf('%02X', ord($_)) } split(//, $_)

An easy way to view the characters of a string in hex for debugging:

sprintf("%vX", $_)
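
As a rough illustration of the idioms above, here they are applied to fixed data instead of $_. The variable names and sample strings are assumptions chosen for the example; $str is an already-decoded character string and $bytes is a byte string:

```perl
use strict;
use warnings;

my $str   = "a\x{E9}\x{20AC}";    # three characters: a, é, €
my $bytes = "\x00\x41\x20\xAC";   # four bytes, each in 0x00..0xFF

my @chars = unpack('(a)*', $str);   # characters as strings, like split(//, $str)
my @nums  = unpack('W*', $str);     # characters as numbers: 0x61, 0xE9, 0x20AC
my $hex   = unpack('H*', $bytes);   # "004120ac" (lowercase)
my $hex2  = join "", map { sprintf('%02X', ord($_)) } split(//, $bytes);   # "004120AC"
my $debug = sprintf('%vX', $str);   # "61.E9.20AC"

print scalar(@chars), " characters: ", join(' ', map { sprintf('%X', $_) } @nums), "\n";
print "$hex $hex2 $debug\n";
```

Note that unpack('H*', ...) produces lowercase hex while the sprintf('%02X', ...) form produces uppercase; the digits are otherwise the same.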
