r - Can quantifiers be used in regex replacement in R?

Question

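My goal is to replace a string with a symbol repeated as many times as the string has characters, in the same way that letters can be converted to upper case with \\U\\1. If my pattern were "...(*)...", my replacement for whatever (*) captures would be something like x\\q1 or {\\q1}x, so that I would get as many x characters as * matched.

Is this possible?

I am mainly thinking of sub and gsub, but you may answer using other libraries such as stringi, stringr, etc. Feel free to use perl = TRUE or perl = FALSE, or any other option, as convenient.

I suspect the answer is no, because the options seem very limited (?gsub):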

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
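
For illustration, here is a minimal sketch of the case-conversion escapes described in the passage above. The input string and variable names are made up for this example and are not part of the quoted documentation.

```r
# "\\U" upper-cases the rest of the replacement; "\\E" ends the conversion.
# Both are only available with perl = TRUE.
x <- "hello world"

# Upper-case every word in full.
upper_all <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)

# Upper-case only the first letter: \\E stops the conversion before group 2.
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\E\\2", x, perl = TRUE)

print(upper_all)   # "HELLO WORLD"
print(title_case)  # "Hello World"
```

Note that these escapes change case only; nothing in the replacement syntax repeats a character a computed number of times.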

The main quantifiers are (?base::regex):

?
    The preceding item is optional and will be matched at most once.

*
    The preceding item will be matched zero or more times.

+
    The preceding item will be matched one or more times.

{n}
    The preceding item is matched exactly n times.

{n,}
    The preceding item is matched n or more times.

{n,m}
    The preceding item is matched at least n times, but not more than m times.

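OK, but there seems to be an option (*) (it is not in PCRE; I am not sure whether it is in Perl or where...) that captures the number of characters the star quantifier was able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) can be used to refer to the first captured quantifier (and \q2, and so on). I have also read that (*) is equivalent to {0,}, but I am not sure that this is really what I am interested in.

EDIT UPDATE:

Since commenters asked, I have updated my question with a concrete example taken from this interesting question. I modified the example a little. Suppose we have a <- "I hate extra spaces elephant", and we want to keep a single space between words and the first 5 characters of each word (up to here it is the original question), followed by a dot for every remaining character (I am not sure this is what the original question intended, but never mind). The resulting string would be "I hate extra space. eleph..." (one . for the final s of spaces, and 3 dots for the 3 letters ant at the end of elephant). So I started by keeping the first 5 characters: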

gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"

How should I replace the exact number of characters matched by \\S* with dots or any other symbol?

score 0 · Accepted Answer

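Quantifiers cannot be used in the replacement pattern, and neither can the information about how many characters they matched.

What you need is a \G-based PCRE pattern, which finds consecutive matches after a specific position in the string: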

a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)

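See the R demo and the regex demo.

Details

(?:\G(?!^)|(?<!\S)\S{5}) - either the end of the previous successful match, or five non-whitespace characters that are not preceded by a non-whitespace character
\K - match reset operator that discards the text matched so far
\S - any non-whitespace character.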
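
As a rough check of the breakdown above, the same pattern can be applied to each word separately. This is only a sketch; the `words` vector and the `masked` name are introduced here for illustration.

```r
# Same pattern as in the answer, applied word by word.
pattern <- "(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S"
words <- c("I", "hate", "extra", "spaces", "elephant")

# Words of five characters or fewer never satisfy \S{5} followed by \S,
# so they are left untouched. In longer words, the first match replaces the
# sixth character, and \G then chains one match per remaining character.
masked <- gsub(pattern, ".", words, perl = TRUE)
print(masked)  # "I" "hate" "extra" "space." "eleph..."
```

Each replaced character is a separate match, which is why no count is needed in the replacement string.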

score 0 · Accepted Answer

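gsubfn is like gsub except that the replacement can be a function that takes the match as input and returns the replacement. The function can optionally be expressed as a formula, as we do here: each word is replaced by the output of the function applied to that word. No complex regular expressions are needed.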

library(gsubfn)

gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."

Or almost the same, with a slightly different function:

gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."
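
If installing gsubfn is not an option, the same idea can be sketched in base R with gregexpr and the regmatches replacement function. This is an assumed alternative and not part of the answer above; `mask_tail` and its `keep` argument are hypothetical names chosen for this example.

```r
a <- "I hate extra spaces elephant"

# Hypothetical helper: keep the first `keep` characters of each word and
# replace every remaining character with a dot.
mask_tail <- function(word, keep = 5) {
  n_extra <- pmax(0, nchar(word) - keep)
  paste0(substr(word, 1, keep), strrep(".", n_extra))
}

# Locate every run of non-whitespace characters, then overwrite each match
# with the value computed from it.
m <- gregexpr("\\S+", a)
res <- a
regmatches(res, m) <- lapply(regmatches(res, m), mask_tail)
print(res)  # "I hate extra space. eleph..."
```

As with gsubfn, the length arithmetic happens in R code rather than in the regex, so the pattern itself stays trivial.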
