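I am editing an XML file whose declaration specifies the encoding ASCII. In the generated file I want the encoding to be UTF-8 so that I can write Swedish characters such as åäö, which I currently cannot do.
An example file equivalent to mine can be found on the archivematica wiki.
The SIP.xml produced by running my program on a copy of that example file can be accessed via this link. The added tag containing the text åäö is at the very end of the document.
As the code below shows, I have tried setting the encoding on the transformer, and I have also tried using an OutputStreamWriter to set the encoding. Finally, I edited the declaration in the original file by hand to say UTF-8, and only then was åäö written out correctly. So the problem seems to be the encoding declared in the original file. If I am not mistaken, changing the declaration from ASCII to UTF-8 should not cause any problems, since every ASCII byte sequence is also valid UTF-8. The question is: how do I do this in my program? Can I do it after parsing the file into a Document object, or do I need to do something before parsing?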
    package provklasser;

    import java.io.File;
    import java.io.IOException;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.xml.sax.SAXException;

    /**
     *
     * @author
     */
    public class Provklass {

        /**
         * @param args the command line arguments
         */
        public static void main(String[] args) {
            try {
                File chosenFile = new File("myFile.xml");

                // Parse the XML file. Passing the File itself avoids handing a
                // plain file-system path to parse(String), which expects a URI.
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                DocumentBuilder builder = factory.newDocumentBuilder();
                Document metsDoc = builder.parse(chosenFile);

                // Append a new element containing Swedish characters to the root.
                Element agent = (Element) metsDoc.getDocumentElement().appendChild(
                        metsDoc.createElementNS("http://www.loc.gov/METS/", "mets:agent"));
                agent.appendChild(metsDoc.createTextNode("åäö"));
                DOMSource source = new DOMSource(metsDoc);

                // Write the content into a new XML file next to the original.
                File newFile = new File(chosenFile.getParent(), "SIP.xml");
                TransformerFactory transformerFactory = TransformerFactory.newInstance();
                Transformer transformer = transformerFactory.newTransformer();
                transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                StreamResult result = new StreamResult(newFile);
                // Second attempt, setting the encoding on the writer instead:
                //Writer out = new OutputStreamWriter(new FileOutputStream("SIP.xml"), "UTF-8");
                //StreamResult result = new StreamResult(out);
                transformer.transform(source, result);
            } catch (ParserConfigurationException | SAXException | IOException | TransformerException ex) {
                // TransformerConfigurationException is a subclass of TransformerException.
                Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
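Update: metsDoc.getInputEncoding() returns UTF-8, while metsDoc.getXmlEncoding() returns ASCII. If I parse the new file after saving it and build a new Document from it, I get the same result. So the document itself seems to have the correct encoding, but the XML declaration is wrong.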
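To make the difference between the two getters easier to reproduce, here is a minimal sketch that parses a tiny in-memory document instead of my real METS file. The class name EncodingProbe and the helper methods are made up for this illustration. As far as I understand it, getXmlEncoding() reports what the XML declaration says, while getInputEncoding() reports what the parser used or detected, and the latter may vary between parser implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of my program.
public class EncodingProbe {

    // Parses an in-memory XML string; checked exceptions are wrapped
    // in an unchecked one only to keep the sketch short.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // The encoding named in the XML declaration.
    static String declaredEncoding(String xml) {
        return parse(xml).getXmlEncoding();
    }

    // The encoding the parser reports for the input; implementation dependent.
    static String detectedEncoding(String xml) {
        return parse(xml).getInputEncoding();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ASCII\"?><root/>";
        System.out.println("getXmlEncoding:   " + declaredEncoding(xml));
        System.out.println("getInputEncoding: " + detectedEncoding(xml));
    }
}
```

With my real file the first call gives ASCII and the second gives UTF-8, which is what made me suspect the declaration rather than the content.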
For now I edit the XML as a text file before parsing it. I replaced the parsing part above with parseXML(chosenFile.getAbsolutePath());
and use the following methods:
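    // Additional imports needed by these two methods:
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.StringReader;
    import java.util.Scanner;
    import org.xml.sax.InputSource;

    // Reads the file as text and rewrites the encoding in the first line
    // (the XML declaration) from ASCII to UTF-8.
    private String withEditedDeclaration(String fileName) {
        StringBuilder text = new StringBuilder();
        try {
            String NL = System.getProperty("line.separator");
            // Read as UTF-8 explicitly instead of relying on the platform default.
            try (Scanner scanner = new Scanner(new FileInputStream(fileName), "UTF-8")) {
                String line = scanner.nextLine();
                text.append(line.replaceFirst("ASCII", "UTF-8")).append(NL);
                while (scanner.hasNextLine()) {
                    text.append(scanner.nextLine()).append(NL);
                }
            }
        } catch (FileNotFoundException ex) {
            Logger.getLogger(MetsAdaption.class.getName()).log(Level.SEVERE, null, ex);
        }
        return text.toString();
    }

    // metsDoc is a Document field of the enclosing class (MetsAdaption).
    private void parseXML(String fileName) throws SAXException, IOException, ParserConfigurationException {
        String xmlString = withEditedDeclaration(fileName);
        // Parse the edited XML text
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        InputSource is = new InputSource();
        is.setCharacterStream(new StringReader(xmlString));
        metsDoc = builder.parse(is);
    }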
It works, but it seems like an ugly solution. If anyone knows a better way, I would be grateful.
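One direction I am considering instead of rewriting the declaration as text is the DOM Level 3 Load and Save API, where the output encoding is set on an LSOutput rather than on a Transformer. The sketch below is only an experiment on a tiny in-memory document, not on my real METS file, and I am assuming that the encoding set on the LSOutput takes precedence over the one stored in the Document. The class name LsSerializeSketch and the method serializeAsUtf8 are made up for this illustration. The Swedish characters are written as Unicode escapes so that the encoding of the Java source file cannot interfere.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical experiment class, not part of my program.
public class LsSerializeSketch {

    // Parses XML that declares ASCII, appends some text to the root element,
    // and serializes the document with UTF-8 requested on the LSOutput.
    static String serializeAsUtf8(String asciiDeclaredXml, String extraText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            byte[] input = asciiDeclaredXml.getBytes(StandardCharsets.US_ASCII);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(input));
            doc.getDocumentElement().appendChild(doc.createTextNode(extraText));

            // Ask the DOM implementation for its Load and Save support.
            DOMImplementationLS ls =
                    (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
            LSSerializer serializer = ls.createLSSerializer();
            LSOutput output = ls.createLSOutput();
            output.setEncoding("UTF-8"); // requested output encoding
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            output.setByteStream(bytes);
            serializer.write(doc, output);

            // Decode the produced bytes as UTF-8 so the result can be inspected.
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ASCII\"?><root/>";
        // "\u00e5\u00e4\u00f6" is the escaped form of the three Swedish letters.
        System.out.println(serializeAsUtf8(xml, "\u00e5\u00e4\u00f6"));
    }
}
```

In the real program the ByteArrayOutputStream would be replaced by a FileOutputStream for SIP.xml. I do not know whether this is considered cleaner than the text replacement, so other suggestions are still welcome.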