如何在 Perl 中确定 unicode 字符是全角(占用两个单元格;双倍宽度)还是半角(如通常的拉丁字符)?
例如,表情符号是双倍宽度的,但也有较低块中的字符,例如"\N{MEDIUM BLACK CIRCLE}"
(U+26ab)。
我试过了
Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()
但它也返回 1。
如何在 Perl 中确定 unicode 字符是全角(占用两个单元格;双倍宽度)还是半角(如通常的拉丁字符)?
例如,表情符号是双倍宽度的,但也有较低块中的字符,例如"\N{MEDIUM BLACK CIRCLE}"
(U+26ab)。
我试过了
Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()
但它也返回 1。
我有一些 C++ 代码来计算字符宽度。因此,稍后快速转换为 perl,然后...
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;
# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
state %cache;
my $cp = shift; # Numeric codepoint
return $cache{$cp} if exists $cache{$cp};
if ($cp == 0 || $cp == 0x200B) {
# nul and ZERO WIDTH SPACE
$cache{$cp} = 0;
return 0;
} elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
# Hangul Jamo vowels and final consonants
$cache{$cp} = 0;
return 0;
} elsif ($cp == 0xAD) {
# SOFT HYPHEN
$cache{$cp} = 1;
return 1;
}
my $ci = charinfo($cp);
return undef unless defined $ci;
my $type = $ci->{category};
if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
# Control Code, Non Spacing Mark, Enclosing Mark, Format Char
$cache{$cp} = 0;
return 0;
}
state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
Neutral => 1, Ambiguous => 1 };
my $eaw = charprop($cp, "East_Asian_Width");
my $width = $widths->{$eaw} // 1;
$cache{$cp} = $width;
return $width;
}
sub testwidth ($) {
my $char = shift;
my $cp = ord $char;
printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
charwidth($cp);
}
testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";
示例使用:
$ ./charwidths.pl
Width of (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of (U+1F637 FACE WITH MEDICAL MASK) is 2
它只是对特定范围和代码点类别进行一些特殊情况检查,然后使用 East Asian Width 属性以及TR11的建议来确定其他所有内容的宽度。
这有点混乱,我很犹豫是否将其发布到互联网上而不将其清理到适当的库中......但我不太可能有时间制作那个库,所以在这里以防万一它有用. 它很大程度上源于Shawn 的贡献,但它没有使用可能增长到数百万个条目的每个代码点“缓存”,而是使用 Unicode::UCD 数据在第一次调用时构建代码点范围及其相关宽度的“invmap” ; charprop
查询该地图的工作方式类似于(并且成本与单个调用相同或略低于) 。
map_im
接受由返回的 invmapprop_invmap
并通过哈希映射属性值。在散列中找不到的任何值都将变为undef
Unicode::UCD 不使用它,但我们的代码将其视为“无关”。merge_im
获取两个这样的 invmap 并将它们合并,以使“右”invmap 中的值覆盖“左”invmap 中的值,但右侧的 undef 范围允许左侧值“穿透”。charwidth
的状态初始化根据Shawn自己的逻辑映射和合并三个invmap(East_Asian_Width、Category和特殊情况覆盖列表)charwidth
,该函数只是使用Unicode::UCD自己的search_invlist
例程查询。
初始化在我的笔记本电脑上花费了 <60 毫秒,并产生了一个 909 元素的 invmap(使用 perl 5.32.1 中的 UCD),之后每次调用大约需要 2.5 微秒。
use Unicode::UCD;
use open qw/:std :locale/;
use charnames qw/:full/;
use feature 'state';
use List::Util 'reduce';
sub map_im {
my ($im, $h) = @_;
die unless $im->[2] eq 's';
my $out;
for my $i (0 .. $#{ $im->[0] }) {
my $val = $h->{ $im->[1][$i] };
my $different = @{ $out->[0] } ? ($val ne $out->[1][-1]) : defined($val);
if ($different) {
push @{ $out->[0] }, $im->[0][$i];
push @{ $out->[1] }, $val;
}
}
return $out;
}
sub merge_im {
my ($l, $r) = @_;
die unless $l->[0][0] == 0;
my $out;
my $idx_l = my $idx_r = 0;
my $val_l = $l->[1][0];
my $val_r;
while ($idx_r < @{ $r->[0] } || $idx_l < @{ $l->[0] }) {
my $newcp;
# Take from the list with the lower next entry. Or the one with entries left.
# This could probably be simplified.
if ($idx_r >= @{ $r->[0] } || ($idx_l < @{ $l->[0] }
&& $l->[0][$idx_l] <= $r->[0][$idx_r])) {
$newcp = $l->[0][$idx_l];
$val_l = $l->[1][$idx_l];
$idx_l ++;
} else {
$newcp = $r->[0][$idx_r];
$val_r = $r->[1][$idx_r];
$idx_r ++;
}
# But if they both have a transition at the same codepoint, take both so there's
# not a duplicate.
if ($idx_r < @{ $r->[0] } && $r->[0][$idx_r] == $newcp) {
$val_r = $r->[1][$idx_r];
$idx_r ++;
}
my $newval = defined($val_r) ? $val_r : $val_l;
# This gets skipped if we updated $val_l but $val_r is overriding, or
# $val_r went from undef to equaling $val_l.
if ($newval ne $out->[1][-1]) {
push @{ $out->[0] }, $newcp;
push @{ $out->[1] }, $newval;
}
}
return $out;
}
sub charwidth {
state $width_eaw = map_im([Unicode::UCD::prop_invmap('East_Asian_Width')],
{ F => 2, W => 2, H => 1, Na => 1, Neutral => 1, A => 1 }
);
state $width_cat = map_im([Unicode::UCD::prop_invmap('Category')],
{ Cc => 0, Mn => 0, Me => 0, Cf => 0 }
);
state $width_override = [
[ 0x0000, 0x0001, # NUL
0x00AD, 0x00AE, # Soft Hyphen
0x1160, 0x1200, # Hangul Jamo vowels and final consonants
0x200B, 0x200C, # ZWSP
],
[ 0, undef,
1, undef,
0, undef,
0, undef,
],
];
state $merged = reduce { merge_im($a, $b) } $width_eaw, $width_cat, $width_override;
my $cp = shift;
my $idx = Unicode::UCD::search_invlist($merged->[0], $cp);
return $merged->[1][$idx];
}
sub testwidth($) {
my $char = shift;
my $cp = ord $char;
printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp,
charnames::viacode($cp), charwidth($cp);
}
testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";