perl - 确定 unicode 字符在 Perl 中是全角还是半角

Question

如何在 Perl 中确定 unicode 字符是全角（占用两个单元格；双倍宽度）还是半角（如通常的拉丁字符）？

例如，表情符号是双倍宽度的，但也有较低块中的字符，例如"\N{MEDIUM BLACK CIRCLE}"(U+26ab)。

我试过了

Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()

但它也返回 1。

score 3 · Accepted Answer

我有一些 C++ 代码来计算字符宽度。因此，稍后快速转换为 perl，然后...

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;

# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
  state %cache;

  my $cp = shift; # Numeric codepoint
  return $cache{$cp} if exists $cache{$cp};

  if ($cp == 0 || $cp == 0x200B) {
    # nul and ZERO WIDTH SPACE
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
    # Hangul Jamo vowels and final consonants
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp == 0xAD) {
    # SOFT HYPHEN
    $cache{$cp} = 1;
    return 1;
  }

  my $ci = charinfo($cp);
  return undef unless defined $ci;

  my $type = $ci->{category};
  if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
    # Control Code, Non Spacing Mark, Enclosing Mark, Format Char
    $cache{$cp} = 0;
    return 0;
  }

  state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
                    Neutral => 1, Ambiguous => 1 };
  my $eaw = charprop($cp, "East_Asian_Width");
  my $width = $widths->{$eaw} // 1;
  $cache{$cp} = $width;
  return $width;
}

sub testwidth ($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
    charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

示例使用：

$ ./charwidths.pl
Width of  (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of  (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of  (U+1F637 FACE WITH MEDICAL MASK) is 2

它只是对特定范围和代码点类别进行一些特殊情况检查，然后使用 East Asian Width 属性以及TR11的建议来确定其他所有内容的宽度。

score 3 · Accepted Answer

这有点混乱，我很犹豫是否将其发布到互联网上而不将其清理到适当的库中......但我不太可能有时间制作那个库，所以在这里以防万一它有用. 它很大程度上源于Shawn 的贡献，但它没有使用可能增长到数百万个条目的每个代码点“缓存”，而是使用 Unicode::UCD 数据在第一次调用时构建代码点范围及其相关宽度的“invmap” ; charprop查询该地图的工作方式类似于（并且成本与单个调用相同或略低于）。

map_im接受由返回的 invmapprop_invmap并通过哈希映射属性值。在散列中找不到的任何值都将变为undefUnicode::UCD 不使用它，但我们的代码将其视为“无关”。merge_im获取两个这样的 invmap 并将它们合并，以使“右”invmap 中的值覆盖“左”invmap 中的值，但右侧的 undef 范围允许左侧值“穿透”。charwidth的状态初始化根据Shawn自己的逻辑映射和合并三个invmap（East_Asian_Width、Category和特殊情况覆盖列表）charwidth，该函数只是使用Unicode::UCD自己的search_invlist例程查询。

初始化在我的笔记本电脑上花费了 <60 毫秒，并产生了一个 909 元素的 invmap（使用 perl 5.32.1 中的 UCD），之后每次调用大约需要 2.5 微秒。

use Unicode::UCD;
use open qw/:std :locale/;
use charnames qw/:full/;
use feature 'state';
use List::Util 'reduce';

sub map_im {
  my ($im, $h) = @_;
  die unless $im->[2] eq 's';

  my $out;

  for my $i (0 .. $#{ $im->[0] }) {
    my $val = $h->{ $im->[1][$i] };

    my $different = @{ $out->[0] } ? ($val ne $out->[1][-1]) : defined($val);
    if ($different) {
      push @{ $out->[0] }, $im->[0][$i];
      push @{ $out->[1] }, $val;
    }
  }

  return $out;
}

sub merge_im {
  my ($l, $r) = @_;
  die unless $l->[0][0] == 0;

  my $out;
  my $idx_l = my $idx_r = 0;
  my $val_l = $l->[1][0];
  my $val_r;

  while ($idx_r < @{ $r->[0] } || $idx_l < @{ $l->[0] }) {
    my $newcp;
    # Take from the list with the lower next entry. Or the one with entries left.
    # This could probably be simplified.
    if ($idx_r >= @{ $r->[0] } || ($idx_l < @{ $l->[0] } 
        && $l->[0][$idx_l] <= $r->[0][$idx_r])) {
      $newcp = $l->[0][$idx_l];
      $val_l = $l->[1][$idx_l];
      $idx_l ++;
    } else {
      $newcp = $r->[0][$idx_r];
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    # But if they both have a transition at the same codepoint, take both so there's
    # not a duplicate.
    if ($idx_r < @{ $r->[0] } && $r->[0][$idx_r] == $newcp) {
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    my $newval = defined($val_r) ? $val_r : $val_l;
    # This gets skipped if we updated $val_l but $val_r is overriding, or
    # $val_r went from undef to equaling $val_l.
    if ($newval ne $out->[1][-1]) {
      push @{ $out->[0] }, $newcp;
      push @{ $out->[1] }, $newval;
    }
  }

  return $out;
}

sub charwidth {
  state $width_eaw = map_im([Unicode::UCD::prop_invmap('East_Asian_Width')], 
    { F => 2, W => 2, H => 1, Na => 1, Neutral => 1, A => 1 }
  );

  state $width_cat = map_im([Unicode::UCD::prop_invmap('Category')],
    { Cc => 0, Mn => 0, Me => 0, Cf => 0 }
  );

  state $width_override = [
    [ 0x0000, 0x0001, # NUL
      0x00AD, 0x00AE, # Soft Hyphen
      0x1160, 0x1200, # Hangul Jamo vowels and final consonants
      0x200B, 0x200C, # ZWSP
    ],
    [ 0, undef,
      1, undef,
      0, undef,
      0, undef,
    ],
  ];

  state $merged = reduce { merge_im($a, $b) } $width_eaw, $width_cat, $width_override;

  my $cp = shift;
  my $idx = Unicode::UCD::search_invlist($merged->[0], $cp);
  return $merged->[1][$idx];
}

sub testwidth($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp,
     charnames::viacode($cp), charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

perl - 确定 unicode 字符在 Perl 中是全角还是半角

2 回答 2

Related

Reference