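I have some files in .sgm format that I need to evaluate (apply a language model and get the perplexity of the text).
The main problem is that I need these files in plain-text form, i.e. as .txt. I have searched the internet for an online converter or a script that does this, but could not find one.
Apart from that, one of my teachers sent me this Perl command: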
perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < file.sgm > file
I have never used Perl and, to be honest, I do not understand what this command does. I think I have Perl installed:
$ perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
Copyright 1987-2013, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
By the way, I am on Mac OS X.
Example .sgm file:
<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>
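Output .txt file:
This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia.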
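To make the goal concrete, here is a minimal sketch of the conversion in Python. It rests on a few assumptions: Python 3 is available, each <seg> element sits on a single line as in the sample above, and the names extract_segments and sgm2txt.py are purely illustrative. It mirrors the regular expression from the Perl command and is not a full SGML parser.

```python
import re
import sys

# Same idea as the Perl one-liner: capture the text between <seg ...> and </seg>,
# ignoring case and trimming surrounding whitespace.
SEG_RE = re.compile(r"<seg[^>]+>\s*(.*\S)\s*</seg>", re.IGNORECASE)


def extract_segments(lines):
    """Return the text of each <seg> element found, at most one per input line."""
    segments = []
    for line in lines:
        match = SEG_RE.search(line)
        if match:
            segments.append(match.group(1))
    return segments


if __name__ == "__main__":
    # Hypothetical usage: python3 sgm2txt.py < file.sgm > file.txt
    for segment in extract_segments(sys.stdin):
        print(segment)
```

Run this way, the script writes one segment per line. Lines without a <seg> element (such as <doc> or <p>) are skipped, and character entities such as &amp; are left unchanged.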