r - Convert a full-width string to a half-width string in R

Question

How can I convert Ａb９８７６５４３２１０ to Ab9876543210? Is there a regular-expression solution? Thanks.

test <- dput("Ａb９８７６５４３２１０")

score 1 · Accepted Answer

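Disclaimer: the following works on my machine, but because I could not reproduce your full-width string from the example alone, this is a best guess based on my own version of the problem (I pasted the string into a text file, saved it with UTF-8 encoding, and loaded it with the encoding specified as UTF-8).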

Step 1. Read in the text (I added a half-width version for comparison):

> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"

Step 2. Verify that the full-width and half-width versions are not equal:

# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"

# compare raw bytes
> charToRaw(test1)
 [1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
 [1] 41 62 39 38 37 36 35 34 33 32 31 30

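For anyone interested: if you paste the raw bytes as hex input into a UTF-8 decoder, you will see that, apart from the letter b (mapped from 62 in the 7th byte), every character is encoded as a 3-byte sequence. In addition, the first 3-byte sequence (ef bb bf) maps to the "zero width no-break space" character, U+FEFF, so it is invisible when you print the string to the console.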
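The same inspection can be done without an external decoder. The following is a minimal sketch in base R, assuming the string read in Step 1 consists of a leading U+FEFF followed by the full-width characters; the string is rebuilt here from `\u` escapes rather than read from `fullwidth.txt`, and the variable names are illustrative only.

```r
# Rebuild the Step 1 string from explicit escapes. The leading \uFEFF stands
# in for the invisible character read from the file (an assumption).
s <- "\uFEFF\uFF21b\uFF19\uFF18\uFF17\uFF16\uFF15\uFF14\uFF13\uFF12\uFF11\uFF10"

# One integer code point per character.
cp <- utf8ToInt(s)

# Number of UTF-8 bytes needed to encode each code point.
bytes_per_char <- vapply(
  cp,
  function(i) length(charToRaw(intToUtf8(i))),
  integer(1)
)

# Side-by-side view: code point in U+XXXX notation and its byte length.
print(data.frame(
  code_point = sprintf("U+%04X", cp),
  bytes      = bytes_per_char
))
```

The first row is U+FEFF at 3 bytes, and the only 1-byte row is U+0062 (the letter b), which matches the `charToRaw()` output above.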

Step 3. Convert from full-width to half-width using the Nippon package:

library(Nippon)
test1.converted <- zen2han(test1)

> test1.converted
[1] "Ab9876543210"

# If you want to compare against the original test2 string, remove the zero 
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
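If the leading zero-width character comes from a byte order mark at the start of the file (an assumption, since the original file is not available), it can also be dropped at read time or stripped with a pattern instead of `substring()`. This is a sketch using a temporary file in place of `fullwidth.txt`:

```r
# Write a small UTF-8 file that starts with the bytes EF BB BF, mimicking
# the assumed contents of fullwidth.txt (ASCII text only, for brevity).
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Ab9876543210\n")), path)

# Option 1: declare the connection as "UTF-8-BOM" so the mark is removed
# while reading.
con <- file(path, encoding = "UTF-8-BOM")
clean <- readLines(con)
close(con)

# Option 2: read as in Step 1, then strip a leading U+FEFF if one is present.
raw_line <- readLines(path, encoding = "UTF-8")
stripped <- sub("^\uFEFF", "", raw_line)

unlink(path)
```

Unlike `substring(x, 2)`, the `sub()` variant leaves strings without a leading U+FEFF unchanged.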

score 0 · Accepted Answer

A few years late, but here is a base R solution.

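The full-width ASCII variants occupy the range 0xFF01:0xFF5E (part of the Halfwidth and Fullwidth Forms block, which ends at 0xFFEF) and can be shifted down by a fixed offset like this.

x <- "Ａb９８７６５４３２１０"
# Shift only U+FF01..U+FF5E (inclusive); 65248 is 0xFEE0, the distance
# between a full-width character and its ASCII counterpart.
iconv(x, to = "UTF-8") |>
  utf8ToInt() |>
  (\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 65248, .))() |>
  intToUtf8()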

[1] "Ab9876543210"
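For more than one string, the same offset idea can be wrapped in a small function. This is a sketch rather than a definitive implementation: `to_halfwidth` is a hypothetical name, and restricting the shift to U+FF01..U+FF5E (the full-width ASCII variants) is a choice made here so that other characters in the block, such as half-width katakana, are left untouched.

```r
# Hypothetical helper: map full-width ASCII variants (U+FF01..U+FF5E) to
# their ASCII counterparts (U+0021..U+007E), element by element.
to_halfwidth <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)      # pass missing values through
    cp <- utf8ToInt(enc2utf8(s))             # code points of one string
    shift <- cp >= 0xFF01 & cp <= 0xFF5E     # which ones are full-width ASCII
    cp[shift] <- cp[shift] - 0xFEE0          # 0xFEE0 == 65248
    intToUtf8(cp)                            # back to a single string
  }, character(1), USE.NAMES = FALSE)
}

res <- to_halfwidth(c("\uFF21b\uFF19\uFF18\uFF17\uFF16\uFF15\uFF14\uFF13\uFF12\uFF11\uFF10",
                      "plain", NA))
print(res)
```

Strings that contain no full-width characters come back unchanged, and `NA` stays `NA`.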
