azure-data-lake - How to extract an attribute value from an XML element using the XML extractor in U-SQL

Question

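How can I use the XML extractor in U-SQL to extract an attribute value from an XML element for my Azure Data Lake Analytics job?

Update: more details about the problem

My XML file looks like this: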

<?xml version="1.0" encoding="utf-8"?>
<testelement testatr="xyz">
</testelement>

This is my U-SQL script:

DECLARE @testfile string = "sample2.xml";
@logText =
EXTRACT log string            
FROM @testfile
USING Extractors.Tsv();

@gethID = SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(@logText.log, "testelement/attribute::testatr").ElementAt(0) AS siteName FROM @logText;
OUTPUT @gethID TO "result.out" USING Outputters.Tsv();

While debugging, I observed that the Load method of the XPath class throws an exception when it tries to load:

"<?xml version=1.0 encoding=utf-8?>"

This is the exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl)
\n   at System.Xml.XmlTextReaderImpl.Read()
\n   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

Update 2:

After using quoting:false, I get a different exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
\n   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

score 3 · Accepted Answer

You can identify the value with an XPath expression. Use @attr_name (or the full axis expression attribute::attr_name) to query an attribute.
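As a minimal sketch of the two equivalent forms, the script below applies both to the element from the question. It assumes the XPath helper from the U-SQL XML samples (the same class the question calls) and that the samples assembly is registered under the name [Microsoft.Analytics.Samples.Formats]; that name and the need for the System.Xml reference are assumptions about your setup. An inline one-row rowset is used so that no extractor is involved.

```usql
// Assumptions: these assembly references match how the XML samples are registered in your account.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// Inline one-row rowset holding a complete XML document in a single string.
@logText =
    SELECT *
    FROM (VALUES ("<testelement testatr=\"xyz\"></testelement>")) AS T(log);

@attrs =
    SELECT
        // Abbreviated syntax: @testatr
        Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, "testelement/@testatr").ElementAt(0) AS shortForm,
        // Full axis syntax: attribute::testatr
        Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, "testelement/attribute::testatr").ElementAt(0) AS axisForm
    FROM @logText;

OUTPUT @attrs TO "result.out" USING Outputters.Tsv();
```

Both expressions select the same attribute node, so the two output columns should carry the same value.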

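Update based on the updated question:

It looks like the parser is somehow being confused by the " in the XML declaration. I see that you use the built-in Tsv() extractor, which by default currently treats a " inside a field as a quoting character and then removes it. This is a bug that we plan to fix.

Until then, I suggest you use Extractors.Tsv(quoting:false).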
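Applied to the script from the question, the workaround might look like the sketch below. Only the extractor call changes; the assembly references at the top are assumptions about how the XML samples are registered. It can only work if every row of the input file holds one complete XML document on a single line and contains no tab characters.

```usql
// Assumptions: these assembly references match how the XML samples are registered in your account.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @testfile string = "sample2.xml";

// quoting:false keeps the " characters in the field instead of stripping them.
@logText =
    EXTRACT log string
    FROM @testfile
    USING Extractors.Tsv(quoting:false);

// Each row is parsed as its own XML document, then the attribute is selected.
@gethID =
    SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, "testelement/attribute::testatr").ElementAt(0) AS siteName
    FROM @logText;

OUTPUT @gethID TO "result.out" USING Outputters.Tsv();
```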

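Also, if you use any of the built-in text extractors (Extractors.*), make sure that your XML document does not contain any CR/LF, and if you use .Tsv, make sure that it does not contain any tab characters.

If your XML will contain CR and/or LF, you will have to use a custom extractor with a different row delimiter. If you need to do that, please leave me a message, since I am currently tracking such requests to see what we can improve in the built-in extractors.

If your file contains just one XML document (rather than several rows of XML documents), I recommend using the XML extractor, which is also part of the XML samples on GitHub.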
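A rough sketch of that approach for the file from the question follows. The class name XmlDomExtractor, its constructor shape (a row XPath plus a map from column XPath to column name), and the assembly references are my recollection of the GitHub XML samples rather than something stated in this thread, so treat them as assumptions and check them against the sample source before relying on this.

```usql
// Assumptions: these assembly references match how the XML samples are registered in your account.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @testfile string = "sample2.xml";

// Assumed signature: XmlDomExtractor(rowPath, map of column XPath -> column name).
// The whole file is read as one XML document, so CR/LF inside it is not a problem.
@attrs =
    EXTRACT siteName string
    FROM @testfile
    USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(
        "/testelement",
        new SQL.MAP<string, string> { { "@testatr", "siteName" } });

OUTPUT @attrs TO "result.out" USING Outputters.Tsv();
```

Because the extractor parses the document itself, the separate XPath.Evaluate step from the original script is no longer needed in this variant.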

score 0 · Accepted Answer

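Regarding the new error message: it looks like the XML document contains a CR, an LF, or both after the XML declaration, so the Tsv() extractor splits the XML document into several rows. See my comment in the previous answer:

Also, if you use any of the built-in text extractors (Extractors.*), make sure that your XML document does not contain any CR/LF, and if you use .Tsv, make sure that it does not contain any tab characters.

If your XML will contain CR and/or LF, you will have to use a custom extractor with a different row delimiter. If you need to do that, please leave me a message, since I am currently tracking such requests to see what we can improve in the built-in extractors.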
