java - Fixing unescaped XML entities in Java with Regex?

Question

I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.

The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;.

If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.

Where <characters> is some set of characters defining an entity between the initial & and the closing ;. In particular, < and > are not literals that would otherwise denote an XML element.

Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a (space), end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.

I think that I need the power of a Push Down Automaton to do this, I don't think that a Finite State Machine will work because of what I think is a memory requirement - is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?

Remember: there could be multiple replacements per line.

(I'm aware of this question, but it does not provide the answer that I am looking for.)

score 8 · Accepted Answer

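This is the regex you're looking for: &([^;\\W]*([^;\\w]|$)), and the corresponding replacement string would be &amp;$1. It matches on &, followed by zero or more word characters other than a semicolon (it needs to allow zero to match a stand-alone ampersand), followed by a non-word character that is not a semicolon (or the end of the line). The capturing group lets you do the replacement with the &amp; that you're looking for.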

Here's some sample code that uses it:

String s = "&amp; & &nsbp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))";
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);

After running this in a sandbox, I get the following result for t:

&amp; &amp; &nsbp; &amp;tc., &amp;tc. &amp;tc

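As you can see, the original &amp; and &nsbp; are left unchanged. However, if you try it with "&&" you get &amp;&, and if you try it with "&&&" you get &amp;&&amp;, which I take to be a symptom of the look-ahead problem you alluded to. However, if you replace the line: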

final String t = s.replaceAll(regex, replacement);

with:

final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);

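it works on all of these strings, and on any others I could think of. (In a finished product, you would presumably write a routine that performs this double replaceAll call.)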
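The routine mentioned above might look something like the following. This is only a sketch: the class name AmpersandEscaper and the method name escapeBareAmpersands are made up for illustration, and the pattern is the one from this answer, compiled once so that both passes reuse it.

```java
import java.util.regex.Pattern;

final class AmpersandEscaper {
    // Regex from the answer: '&', then word characters, then either a
    // non-word character other than ';' or the end of the input.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&([^;\\W]*([^;\\w]|$))");

    private AmpersandEscaper() {}

    // Runs the replacement twice. The first pass can consume the second '&'
    // of a pair such as "&&" as the terminating character, so that one is
    // only escaped on the second pass.
    static String escapeBareAmpersands(String input) {
        String once = BARE_AMPERSAND.matcher(input).replaceAll("&amp;$1");
        return BARE_AMPERSAND.matcher(once).replaceAll("&amp;$1");
    }
}
```

With this, escapeBareAmpersands("&&") gives &amp;&amp; and the sample string from the answer comes out the same as with a single pass.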

score 5 · Accepted Answer

I think you can also use look-ahead to see if & characters are followed by characters & a semicolon (e.g. &(?!\w+;)). Here's an example:

import java.util.*;
import java.util.regex.*;

public class HelloWorld {
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&amp;c=3/",
            "Three in a row: &amp;&&amp;",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}

// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.

score 2 · Accepted Answer

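First, take a look at the syntax for entities: http://www.w3.org/TR/xml/#NT-EntityRef

Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html

Then implement one that reads the actual input character by character. When it sees an ampersand, it switches to "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that is not allowed in Name, it writes it to the output verbatim. Otherwise, it writes &amp; followed by everything after the ampersand.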
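A rough sketch of that scanning logic follows. To keep it short it works on a String rather than wrapping a FilterInputStream, and it approximates the Name production with a simplified character test. Both are assumptions for illustration, and the class and method names are invented. Numeric character references such as &#38; are not recognized here and would be escaped.

```java
final class EntityScanner {
    private EntityScanner() {}

    // Simplified stand-ins for NameStartChar and NameChar from the XML
    // grammar (an approximation, not the full Unicode ranges).
    private static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    private static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    static String escapeStrayAmpersands(String in) {
        StringBuilder out = new StringBuilder(in.length() + 16);
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // "Entity mode": scan ahead for Name followed by ';'.
            int j = i + 1;
            if (j < in.length() && isNameStart(in.charAt(j))) {
                j++;
                while (j < in.length() && isNameChar(in.charAt(j))) {
                    j++;
                }
            }
            boolean wellFormed = j > i + 1 && j < in.length() && in.charAt(j) == ';';
            if (wellFormed) {
                out.append(in, i, j + 1); // copy "&Name;" verbatim
                i = j + 1;
            } else {
                out.append("&amp;");      // escape only the ampersand
                i++;                      // the rest is rescanned normally
            }
        }
        return out.toString();
    }
}
```

The only state carried between characters is the scan position, so a single left-to-right pass is enough and no stack is needed.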

score 1 · Accepted Answer

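Rather than trying to handle every possible kind of bad data in a general way, just handle the bad data one case at a time. Chances are that whatever generates the XML messes up one or two characters, but not all of them. That is an assumption, of course.

Try replacing every & with &amp; except when the & is followed by amp;. If the next improperly encoded character you run into is <, replace all of those with &lt;. Keep the rule set small and manageable, and only handle the things you know are wrong.

If you try to do too much, you may end up replacing things you did not intend to and messing up the data yourself.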
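The narrow ampersand rule described above could be written with a negative look-ahead, roughly as follows. The class and method names are placeholders. Note that under this rule any other entity, such as &lt;, would also be escaped, which is consistent with the assumption that only & is known to be affected.

```java
final class NarrowRule {
    private NarrowRule() {}

    static String escapeAmpersands(String s) {
        // Escape every '&' that is not already the start of "&amp;".
        return s.replaceAll("&(?!amp;)", "&amp;");
    }
}
```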

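I just want to point out that the best solution is to encourage whoever generates the XML to fix the encoding at their end. It may be awkward to ask, but if you explain professionally that they are not producing valid XML, they may be willing to fix the bug. This has the added benefit that the next person who has to consume it won't need to write some crazy custom code to work around a problem that should be solved at the source. At least consider it. The worst that can happen is that you ask, they say no, and you are right where you are now.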

score 0 · Accepted Answer

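Sorry for stirring up an old thread:
I ran into the same problem, and the workaround I used has 3 steps:

1. Identify the valid entity references and "hide" them from the regex
2. Use a regex to replace the unescaped characters
3. Restore the previously "hidden" entity references

Hiding is done by wrapping the entity in a custom character sequence, for example "#||<ENTITY_NAME>||#".

To illustrate, suppose we have this XML fragment with unescaped & characters: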

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&happy; at the same time!
    its still &lt; ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

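Step 1:
We use a regex to replace "[&](amp|apos|gt|lt|quot)[;]" with "#||$1||#". This is because, according to the W3C, the valid (predefined) XML entity references are amp, lt, gt, apos and quot. The string now looks like this: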

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&happy; at the same time!
    its still #||lt||# ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

Only the valid entity references have been hidden. &happy; is left untouched.

Step 2:
Use a regex to replace "[&]" with "&amp;". The string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&amp;happy; at the same time!
    its still #||lt||# ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>

Step 3:
Use a regex to replace "#\|\|([a-z]+)\|\|#" with "&$1;". The final corrected string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&amp;happy; at the same time!
    its still &lt; ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>

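Drawback: the custom character sequence used to hide valid entities must be chosen carefully, so that no valid content happens to contain the same sequence. The chance is small, but admittedly this is not a completely foolproof solution...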
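Put together in Java, the three steps might look roughly like this. It is a sketch under the same assumptions as the walkthrough: the "#||name||#" marker never occurs in real content, and only the five predefined entities are treated as valid. The class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

final class HideAndRestore {
    // Step 1 pattern: the five predefined entity references.
    private static final Pattern VALID_ENTITY =
        Pattern.compile("&(amp|apos|gt|lt|quot);");
    // Step 3 pattern: the marker that step 1 left behind.
    private static final Pattern HIDDEN =
        Pattern.compile("#\\|\\|([a-z]+)\\|\\|#");

    private HideAndRestore() {}

    static String fix(String xml) {
        // Step 1: hide the valid entity references.
        String hidden = VALID_ENTITY.matcher(xml).replaceAll("#||$1||#");
        // Step 2: every '&' still present is unescaped, so escape it.
        String escaped = hidden.replace("&", "&amp;");
        // Step 3: restore the hidden entity references.
        return HIDDEN.matcher(escaped).replaceAll("&$1;");
    }
}
```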

score 0 · Accepted Answer

I used the UNESCAPED_AMPERSAND solution above, but I had to change the regex to

private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

adding |#x[0-9a-fA-F]+ to account for hexadecimal character references.

(I wanted to comment on that solution, but apparently I can't.)
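As a quick illustration of the difference, here is the extended pattern wrapped in a small helper. The class name CharRefAwareEscaper and the method name escape are made up. With the original pattern, a hexadecimal reference such as &#x26; matches neither #\d+ nor \w+, so its ampersand would be escaped. With the extra alternative it is left alone.

```java
import java.util.regex.Pattern;

final class CharRefAwareEscaper {
    // '&' not followed by a decimal reference, a hexadecimal reference,
    // or a run of word characters, each terminated by ';'.
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    private CharRefAwareEscaper() {}

    static String escape(String s) {
        return UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;");
    }
}
```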
