
I don't understand this Ruby code:

>> puts '\\ <- single backslash'
# \ <- single backslash

>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because 2 backslashes get replaced

So far, everything is as expected. But if we search for 1 backslash with /\\/ and replace it with 2, encoded as '\\\\', why do we get this:

>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2

And then, when we encode 3 backslashes with '\\\\\\', we only get 2:

>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3

Can anyone explain why a backslash gets swallowed in the replacement string? This happens on 1.8 and 1.9.


5 Answers


Quick Answer

If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:

"some\\path".gsub('\\') { '\\\\' }

Gruesome Details

The problem is that when sub (or gsub) is called without a block, Ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:

\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
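
A quick way to see most of these sequences in action is to run each one against the same input. This is only a sketch: the sample text and variable names are made up for illustration, and \+ is left out.

```ruby
text = 'hello world'

# In a single-quoted literal, '\&' stays as backslash + ampersand,
# so sub receives the sequence intact.
whole  = text.sub(/o w/, '[\&]')       # => "hell[o w]orld"
zero   = text.sub(/o w/, '[\0]')       # => "hell[o w]orld"
first  = text.sub(/(o) (w)/, '[\1]')   # => "hell[o]orld"
second = text.sub(/(o) (w)/, '[\2]')   # => "hell[w]orld"
pre    = text.sub(/o w/, '[\`]')       # => "hell[hell]orld"

# \' needs a double-quoted literal here, because in single quotes
# the backslash would escape the quote itself.
post   = text.sub(/o w/, "[\\']")      # => "hell[orld]orld"

puts whole, zero, first, second, pre, post
```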

Like any escaping, this creates an obvious problem. If you want to include the literal value of one of the above sequences (e.g. \1) in the output string, you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again, like this: "Hello \\\\1"

So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
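
The two passes can be observed separately by checking the string's length before handing it to sub. A minimal sketch, with illustrative variable names:

```ruby
# Pass 1: the Ruby parser turns the literal into a string value.
escaped = "Hello \\\\1"   # value: Hello \\1
plain   = "Hello \\1"     # value: Hello \1

puts escaped.length       # => 9 ("Hello " + two backslashes + "1")
puts plain.length         # => 8 ("Hello " + one backslash + "1")

# Pass 2: sub interprets the backslash sequences in that value.
literal_result = 'x'.sub(/(x)/, escaped)  # \\ collapses to \, the 1 is plain text
group_result   = 'x'.sub(/(x)/, plain)    # \1 is replaced by the first group

puts literal_result       # prints: Hello \1
puts group_result         # prints: Hello x
```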

If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and the character that follows) passes through unaltered. This also affects a backslash at the end of the string: it passes through unaltered. It's easiest to see this logic in the Rubinius code; just look for the to_sub_replacement method in the String class.

Here are some examples of how String#sub is parsing the replacement string:

  • 1 backslash \ (which has a string literal of "\\")

    Passes through unaltered because the backslash is at the end of the string and has no characters after it.

    Result: \

  • 2 backslashes \\ (which have a string literal of "\\\\")

    The pair of backslashes matches the escaped backslash sequence (see \\ above) and gets converted into a single backslash.

    Result: \

  • 3 backslashes \\\ (which have a string literal of "\\\\\\")

    The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.

    Result: \\

  • 4 backslashes \\\\ (which have a string literal of "\\\\\\\\")

    Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.

    Result: \\

  • 2 backslashes with a character in the middle \a\ (which have a string literal of "\\a\\")

    The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.

    Result: \a\

    Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
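
The cases above can be checked by replacing a lone backslash with each replacement string and counting what comes out. This is a sketch; replace_backslash is a hypothetical helper defined only for this example.

```ruby
# Hypothetical helper: replace a single backslash using the given
# replacement string (no block, so sub parses the replacement).
def replace_backslash(replacement)
  '\\'.sub(/\\/, replacement)
end

one   = replace_backslash('\\')        # replacement value: 1 backslash
two   = replace_backslash('\\\\')      # replacement value: 2 backslashes
three = replace_backslash('\\\\\\')    # replacement value: 3 backslashes
four  = replace_backslash('\\\\\\\\')  # replacement value: 4 backslashes
mixed = replace_backslash('\\a\\')     # replacement value: \a\

puts one.length    # => 1
puts two.length    # => 1
puts three.length  # => 2
puts four.length   # => 2
puts mixed         # prints: \a\
```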

In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.

Answered 2010-11-10T21:08:12.463

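This is a problem because the backslash (\) serves as the escape character for both regular expressions and strings. You can use the special variable \& to reduce the number of backslashes in the gsub replacement string.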

foo.gsub(/\\/,'\&\&\&') #for some string foo replace each \ with \\\

Edit: I should mention that the value of \& comes from the regex match, which in this case is a single backslash.
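
As a concrete check of the \& approach, here is a small sketch with a made-up input string; each \& expands to the matched backslash, so three of them yield three backslashes.

```ruby
foo = 'a\b'                          # three characters: a, backslash, b
tripled = foo.gsub(/\\/, '\&\&\&')   # every \& is replaced by the match

puts tripled          # prints: a\\\b
puts tripled.length   # => 5
```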

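Also, I thought there was a special way to create a string with escape characters disabled, but apparently not. None of these produce two backslashes: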

puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF  
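
One literal form missing from the list above is the heredoc with a single-quoted delimiter, which as far as I know performs no escape processing at all. A minimal sketch (the `<<-` variant and the strip call are only there so that indentation does not matter):

```ruby
# A heredoc with a single-quoted delimiter keeps backslashes verbatim.
raw = <<-'EOF'
\\
EOF

puts raw.strip          # prints: \\
puts raw.strip.length   # => 2
```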
Answered 2009-10-09T07:11:52.733

To clear up a bit of confusion about the author's second line of code.

You said:

>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because 2 backslashes get replaced

Two backslashes are not being replaced here. You are replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would see only one 'a'.

'\\'.sub(/\\/, 'anything') #=> anything
Answered 2009-10-09T19:22:21.967

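Ah, after typing all of this out, I realized that \ is used to refer to groups in the replacement string. I guess this means you need a literal \\ in the replacement string to get one replaced \. To get a literal \\ you need four \s, so to replace one with two you actually need eight (!).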

# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')

Am I missing anything? Is there a more efficient way?
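
To make the eight-backslash count concrete, the sketch below measures each stage and compares it with the block form of sub, whose return value is inserted without any backslash processing. Variable names are illustrative.

```ruby
replacement = '\\\\\\\\'            # eight backslashes in the literal
puts replacement.length             # => 4 (value after literal parsing)

doubled_string = '\\'.sub(/\\/, replacement)
puts doubled_string.length          # => 2 (sub collapses each \\ pair)

# With a block, only the literal parsing pass applies.
doubled_block = '\\'.sub(/\\/) { '\\\\' }
puts doubled_block.length           # => 2
```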

Answered 2009-10-09T07:02:36.127

Actually, the Pickaxe book mentions this exact problem. Here is another alternative (from page 130 of the latest edition):

str = 'a\b\c'               # => "a\b\c"
str.gsub(/\\/) { '\\\\' }   # => "a\\b\\c"
Answered 2009-10-09T07:30:21.547