5

I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.

The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;

If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.

Here <characters> is some set of characters defining an entity between the initial & and the closing ;. In particular, the < and > are placeholder brackets, not literals that would otherwise denote an XML element.

Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a (space), end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.

I think that I need the power of a pushdown automaton to do this. I don't think that a finite state machine will work, because of what I believe is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?

Remember: there could be multiple replacements per line.

(I'm aware of this question, but it does not provide the answer that I am looking for.)

4

6 Answers

8

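Here is the regex you are looking for: &([^;\\W]*([^;\\w]|$)), and the corresponding replacement string is &amp;$1 (both shown with Java string escaping). It matches an &, followed by zero or more word characters ([^;\\W] excludes the semicolon and every non-word character; zero must be allowed so that a standalone ampersand matches), and then a non-word character other than a semicolon (or the end of the line). The capture group lets you put the matched text back after the &amp; in the replacement.

Here is some sample code that uses it: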

String s = "&amp; & &nsbp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))";
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);

After running it in a sandbox, I get the following result for t:

&amp; &amp; &nsbp; &amp;tc., &amp;tc. &amp;tc

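As you can see, the original &amp; and &nsbp; are left unchanged (&nsbp; is a misspelling of &nbsp;, but any &name; sequence is treated the same way). However, if you try it with "&&" you get &amp;&, and if you try it with "&&&" you get &amp;&&amp;. Each match consumes the character that follows the ampersand, so an & in that position is skipped; I believe this is a symptom of the look-ahead problem you alluded to. However, if you replace the line: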

final String t = s.replaceAll(regex, replacement);

with:

final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);

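it works for all of these strings, and for any others I can think of. (In a finished product, you would probably write a routine that performs this double replaceAll call.)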
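The routine mentioned above might look roughly like the following. This is only a sketch: the class name AmpersandEscaper and the method name escapeBareAmpersands are made up for illustration, and the regex and replacement string are the ones from this answer, compiled once so that the two passes share a Pattern.

```java
import java.util.regex.Pattern;

final class AmpersandEscaper {
    // Regex and replacement taken from the answer above.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&([^;\\W]*([^;\\w]|$))");
    private static final String REPLACEMENT = "&amp;$1";

    // Hypothetical helper: runs the replacement twice, because a match
    // consumes the character after the ampersand and can skip an adjacent &.
    static String escapeBareAmpersands(String s) {
        String once = BARE_AMPERSAND.matcher(s).replaceAll(REPLACEMENT);
        return BARE_AMPERSAND.matcher(once).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("&&"));   // &amp;&amp;
        System.out.println(escapeBareAmpersands("&&&"));  // &amp;&amp;&amp;
        System.out.println(escapeBareAmpersands("&amp; & &nbsp; &tc"));
    }
}
```

The second pass leaves the output of the first pass alone wherever an &amp; was already produced, since &amp; is followed by a semicolon and therefore never matches.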

answered 2011-07-11T23:41:14.827
5

I think you can also use a negative look-ahead to match only those & characters that are not followed by word characters and a semicolon (e.g. &(?!\w+;)). Here's an example:

import java.util.Arrays;
import java.util.regex.Pattern;

public class HelloWorld {
    // Matches an & that is NOT the start of "&#digits;" or "&word;",
    // i.e. an ampersand that is not already part of a reference.
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&amp;c=3/",
            "Three in a row: &amp;&&amp;",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            // The look-ahead consumes nothing, so adjacent ampersands are all seen.
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}

// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.
answered 2014-04-14T21:47:05.920
2

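First, learn the syntax of entity references: http://www.w3.org/TR/xml/#NT-EntityRef

Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html

Then implement one that reads the actual input character by character. When it sees an ampersand, it switches to "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that is not allowed in Name, it writes the reference to the output verbatim. Otherwise, it writes &amp; followed by everything after the ampersand.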
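A minimal sketch of that "entity mode" scan is below. It makes several simplifying assumptions that are not part of the answer: it works on a String rather than wrapping a stream in a FilterInputStream, the names EntityScanner and escapeStrayAmpersands are invented for illustration, and isNameChar is only a rough approximation of the XML Name production. Character references such as &#38; are not recognised here and would be escaped.

```java
final class EntityScanner {
    // Hypothetical helper: copies well-formed "&Name;" references verbatim
    // and rewrites every other ampersand as "&amp;".
    static String escapeStrayAmpersands(String in) {
        StringBuilder out = new StringBuilder(in.length() + 16);
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Entity mode: look ahead while characters are allowed in a name.
            int j = i + 1;
            while (j < in.length() && isNameChar(in.charAt(j), j == i + 1)) {
                j++;
            }
            boolean isEntity = j > i + 1 && j < in.length() && in.charAt(j) == ';';
            if (isEntity) {
                out.append(in, i, j + 1); // copy "&Name;" unchanged
                i = j + 1;
            } else {
                out.append("&amp;");      // escape the ampersand only;
                i++;                      // the following text is scanned normally
            }
        }
        return out.toString();
    }

    // Rough approximation of XML Name: letters, '_' and ':' may start a name;
    // digits, '-' and '.' may follow. The real production allows more.
    private static boolean isNameChar(char ch, boolean first) {
        if (Character.isLetter(ch) || ch == '_' || ch == ':') {
            return true;
        }
        return !first && (Character.isDigit(ch) || ch == '-' || ch == '.');
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayAmpersands("a&b &amp; &&lt; &x"));
        // a&amp;b &amp; &amp;&lt; &amp;x
    }
}
```

The look-ahead only needs to remember where the ampersand was, so the scan uses a bounded amount of state apart from the name being buffered.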

answered 2011-07-11T18:29:19.443
1

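Rather than trying to handle all possible bad data in a general way, just handle the bad data one occurrence at a time. Chances are that whatever generates the XML is messing up one or two characters, not all of them. That is an assumption, of course.

Try replacing every & with &amp;, except when the & is followed by amp;. If the next incorrectly encoded character you run into is <, then replace all of those with &lt;. Keep the rule set small and manageable, and only handle the things you know are wrong.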
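As a rough illustration of such a narrow rule, the sketch below escapes every & that is not already the start of &amp;. The class and method names are hypothetical. Note that, exactly as the rule is stated, any other entity in the input (for example &lt;) would be escaped too, so this only suits data where & is the one character known to be wrong.

```java
final class NarrowAmpersandFix {
    // Hypothetical helper: escape & unless it is immediately followed by "amp;".
    static String fix(String s) {
        return s.replaceAll("&(?!amp;)", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(fix("a & b &amp; c")); // a &amp; b &amp; c
        System.out.println(fix("x &lt; y"));      // x &amp;lt; y (other entities are NOT preserved)
    }
}
```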

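If you try to do too much, you may end up replacing things you did not intend to and corrupting the data yourself.

I just want to point out that the best solution is to encourage whoever generates the XML to fix the encoding at their end. It may be awkward to ask, but if you explain to them professionally that they are not producing valid XML, they may be willing to fix the bug. That has the added benefit that the next person who has to consume it will not need some crazy custom code to work around a problem that should be solved at the source. At least consider it. The worst that can happen is that you ask, they say no, and you are right where you are now.

answered 2011-07-11T18:22:10.173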
0

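Sorry for stirring up an old thread:
I ran into the same problem, and the workaround I used has 3 steps:

  1. Identify the valid entity references and "hide" them from the regex
  2. Use a regex to replace the unescaped characters
  3. Restore the previously "hidden" entity references

The hiding is done by enclosing the entity name in a custom character sequence, e.g. "#||<ENTITY_NAME>||#".

To illustrate, suppose we have this XML fragment containing unescaped & characters: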

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&happy; at the same time!
    its still &lt; ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

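Step 1:
Replace the regex "[&]\(amp|apos|gt|lt|quot\)[;]" with "#||$1||#". (The escaped parentheses denote the capture group in the regex flavour used here; in a Java regex the group is written with plain parentheses.) These five names are used because the entity references predefined by the W3C XML specification are amp, lt, gt, apos and quot. The string now looks like this: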

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&happy; at the same time!
    its still #||lt||# ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

Only the valid entity references have been hidden. &happy; is left untouched.

Step 2:
Replace the regex "[&]" with "&amp;". The string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&amp;happy; at the same time!
    its still #||lt||# ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>

Step 3:
Replace the regex "#\|\|([a-z]+)\|\|#" with "&$1;". The final, corrected string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&amp;happy; at the same time!
    its still &lt; ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>
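For reference, the three steps could be written in Java roughly as follows. This is a sketch under a few assumptions: the class name EntityHider and method name escape are invented, the capture group uses plain parentheses as Java requires, and the marker "#||...||#" is assumed never to occur in the real content.

```java
final class EntityHider {
    // Hypothetical helper chaining the three steps described above.
    static String escape(String xml) {
        // Step 1: hide the five predefined entity references behind a marker.
        String hidden = xml.replaceAll("&(amp|apos|gt|lt|quot);", "#||$1||#");
        // Step 2: every ampersand still present is unescaped, so escape it.
        String escaped = hidden.replace("&", "&amp;");
        // Step 3: turn the markers back into entity references.
        return escaped.replaceAll("#\\|\\|([a-z]+)\\|\\|#", "&$1;");
    }

    public static void main(String[] args) {
        System.out.println(escape("two & three, 2&three, &amp; &lt; sad&happy; &gt"));
        // two &amp; three, 2&amp;three, &amp; &lt; sad&amp;happy; &amp;gt
    }
}
```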


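Drawback: the custom character sequence used to hide the valid entities must be chosen carefully, so that no valid content happens to contain the same sequence. The chance is small but, admittedly, this is not a completely reliable solution...

answered 2014-04-04T18:10:04.970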
0

I used the UNESCAPED_AMPERSAND solution above, but I had to change the regex to

private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

This adds |#x[0-9a-fA-F]+ to account for hexadecimal character references.
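A quick way to check the extended pattern is sketched below; the class name HexAwareEscaper, the method name escape, and the sample input are made up for illustration.

```java
import java.util.regex.Pattern;

final class HexAwareEscaper {
    // Pattern from this answer: decimal references, hex references, or named entities.
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    // Hypothetical helper wrapping the replacement.
    static String escape(String s) {
        return UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("&#38; &#x26; &amp; & &#xZZ;"));
        // &#38; &#x26; &amp; &amp; &amp;#xZZ;
    }
}
```

The malformed &#xZZ; is escaped because ZZ is not a hexadecimal number, while the well-formed decimal and hexadecimal references pass through untouched.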

(I wanted to comment on that solution, but apparently I can't.)

answered 2020-09-17T20:11:56.363