python - How to use PyGTK's TextBuffer.register_serialize_format?

Question

I'm currently working with serialization and deserialization, and when I decode the serialized text buffer as utf-8 I get this:

GTKTEXTBUFFERCONTENTS-0001 <text_view_markup>
 <tags>
  <tag name="bold" priority="1">
   <attr name="weight" type="gint" value="700" />
  </tag>
  <tag name="#efef29292929" priority="2">
   <attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
  </tag>
  <tag name="underline" priority="0">
   <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
  </tag>
 </tags>
<text><apply_tag name="underline">At</apply_tag> the first <apply_tag name="bold">comes</apply_tag> rock!  <apply_tag name="underline">Rock</apply_tag>, <apply_tag name="bold">paper,</apply_tag> <apply_tag name="#efef29292929">scissors!</apply_tag></text>
</text_view_markup>

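I'm trying to apply the tags using HTML tags such as <u></u> and <b></b>. I asked about this before and the question was closed as a duplicate, so I'll ask in a different way. If every one of these tags ends with </apply_tag>, rather than something like </apply_tag name="nameoftag">, how can I know where each one ends? This is what I tried previously: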

def correctTags(text):
    tags = []
    newstring = ''
    # Guess each closing tag from the first character of the tag name,
    # which sits 17 characters after the '<' of '<apply_tag name="'.
    for i in range(len(text)):
        if text[i] == '<' and i+18 <= len(text):
            if text[i+17] == '#':
                tags.append('</font color>')
            elif text[i+17] == 'b':
                tags.append('</b>')
            elif text[i+17] == 'u':
                tags.append('</u>')

    newstring = text.replace('<apply_tag name="#', '<font color="#').replace('<apply_tag name="bold">', '<b>').replace('<apply_tag name="underline">', '<u>')

    # Replace the closing tags one at a time, in the order collected above.
    for j in tags:
        newstring = newstring.replace('</apply_tag>', j, 1)

    return '<text>' + newstring + '</text>'

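But there is a problem with inner tags: they get closed where they shouldn't. I think the answer may be gtk.TextBuffer.register_serialize_format, because I believe it should serialize using the MIME type I pass to it, such as HTML, and then I would know where the tags end. But I haven't found any approachable example of how to use it.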

score 0 · Accepted Answer

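I found a solution for correctly getting the tags out of a serialized text buffer in the article Serialising Gtk TextBuffers to HTML. It does not use register_serialize_format, but as the site says, it is possible to write a serializer, although the documentation is sparse (I assume that means using register_serialize_format). Either way, the solution uses html.parser and xml.etree.ElementTree, though BeautifulSoup could be used instead.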

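Basically, the script processes the serialized text buffer contents with an HTML parser. The hard work starts in feed, which receives bytes (the serialized text buffer contents) and returns a string (the formatted text with HTML tags). First it finds the index of <text_view_markup> so that it can skip the GTKTEXTBUFFERCONTENTS-0001 header. That header is the part that cannot be decoded with decode('utf-8'); trying to do so raises "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte". You could use decode('utf-8', errors='ignore') or errors='replace', but since the feed method discards this part anyway, a plain .decode() is enough for the rest.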
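
To make the header step concrete, here is a minimal sketch. The sample bytes are invented for illustration (a text header, four bytes that are not valid UTF-8, then a tiny XML payload), and strip_header is a hypothetical helper name, not part of the original script.

```python
# Made-up sample: header text, four non-UTF-8 bytes, then the XML payload.
HEADER = b"GTKTEXTBUFFERCONTENTS-0001"
sample = (HEADER + b"\x00\x00\x01\xb4"
          + b"<text_view_markup><text>hi</text></text_view_markup>")


def strip_header(data: bytes) -> str:
    """Drop everything before <text_view_markup> and decode the rest."""
    start = data.find(b"<text_view_markup>")
    if start == -1:
        raise ValueError("no <text_view_markup> element found")
    return data[start:].decode("utf-8")


# Decoding the whole buffer fails because of the bytes after the header.
try:
    sample.decode("utf-8")
    decoded_whole = True
except UnicodeDecodeError:
    decoded_whole = False

# Decoding only the XML part works.
markup = strip_header(sample)
```

Slicing the bytes before decoding avoids errors='ignore', which would silently drop any other undecodable bytes as well.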

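The tags and the text are then handled separately. The tags are processed first; here I used xml.etree.ElementTree, but BeautifulSoup could be used as in the original script. Once the tags have been processed, feed is called with the text; this feed is the HTMLParser method.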
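
As a rough illustration of the tag-processing step, the sketch below reads a <tags> block shaped like the one in the question with xml.etree.ElementTree. The helper name read_tags is hypothetical. Unlike the script in this answer, which pairs tags and attributes with zip, this variant reads each tag's own attr children, so it would also cope with a tag carrying several attributes.

```python
from xml.etree.ElementTree import fromstring

# Shortened copy of the <tags> block from the question.
tags_xml = """<tags>
  <tag name="bold" priority="1">
   <attr name="weight" type="gint" value="700" />
  </tag>
  <tag name="underline" priority="0">
   <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
  </tag>
</tags>"""


def read_tags(xml_text):
    """Map each tag name to a list of (attr name, type, value) triples."""
    result = {}
    for tag in fromstring(xml_text).iter("tag"):
        result[tag.attrib["name"]] = [
            (attr.attrib["name"], attr.attrib["type"], attr.attrib["value"])
            for attr in tag.iter("attr")
        ]
    return result


parsed = read_tags(tags_xml)
```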

Also, it can handle more tags than just italics, bold, and colour; you only need to update the tag2html dictionary.
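
A sketch of what extending the mapping could look like, using plain string keys so that it runs without Pango installed. The "700" and PANGO_UNDERLINE_SINGLE values come from the dump in the question. The PANGO_UNDERLINE_DOUBLE entry and the html_pair helper are assumptions added for illustration and are not part of the original script.

```python
# Keys are the attribute values found in the serialized <attr> elements.
tag2html = {
    "700": ("<b>", "</b>"),                      # weight seen in the dump
    "PANGO_UNDERLINE_SINGLE": ("<u>", "</u>"),   # underline seen in the dump
    "PANGO_STYLE_ITALIC": ("<i>", "</i>"),
    # Assumed addition: render double underline with inline CSS.
    "PANGO_UNDERLINE_DOUBLE": (
        '<u style="text-decoration-style: double">', "</u>"),
}


def html_pair(value):
    """Return the (opening, closing) pair, or empty strings if unknown."""
    return tag2html.get(value, ("", ""))
```

Looking values up with .get means an attribute that is missing from the dictionary is skipped rather than raising KeyError.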

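Besides not using BeautifulSoup, I made a few other changes. As for tag names, all of my tags have names, so they do not use ids. My colour tags also already hold hex values, so I did not need the pango_to_html_hex method. This is what it looks like now: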

from html.parser            import HTMLParser
from typing                 import Dict, List, Optional, Tuple
from xml.etree.ElementTree  import fromstring

from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango

class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """
    def __init__(self):
        super().__init__()
        self.markup_text:           str  = ""  # the resulting content
        self.current_opening_tags:  str  = ""  # used during parsing
        self.current_closing_tags:  List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    tag2html: Dict[str, Tuple[str, str]] = {
                                            Pango.Style.ITALIC.value_name:      ("<i>", "</i>"),  # Pango doesn't do <em>
                                            str(Pango.Weight.BOLD.real):        ("<b>", "</b>"),
                                            Pango.Underline.SINGLE.value_name:  ("<u>", "</u>"),
                                            "foreground-gdk":                   (r'<span foreground="{}">', "</span>"),
                                            "background-gdk":                   (r'<span background="{}">', "</span>")
                                            }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and the buffer's iterators) to an HTML string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as a valid char.
        header_end  = data.find(b"<text_view_markup>")
        data        = data[header_end:].decode()

        # Get the tags
        tags_begin  = data.index("<tags>")
        tags_end    = data.index("</tags>") + len("</tags>")
        tags        = data[tags_begin:tags_end]
        data        = data[tags_end:]

        # Get the textual content
        text_begin  = data.index("<text>")
        text_end    = data.index("</text>") + len("</text>")
        text        = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal

        root            = fromstring(tags)
        tags_name       = list(root.iter('tag'))
        tags_attributes = list(root.iter('attr'))
        tags            = [ [tag_name, tag_attribute] for tag_name, tag_attribute in zip(tags_name, tags_attributes)]

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            tag_name    = tag[0].attrib['name']
            vtype       = tag[1].attrib['type']
            value       = tag[1].attrib['value'] 
            name        = tag[1].attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    hex_color = f'{value.replace(":","")}' #hex color already handled by gtk.gdk.color.to_string() method
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags   # closing tags are FILO

            if opening_tags:
                tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed
        # all of the text.
        self.markup_text                = ""
        self.current_opening_tags       = ""
        self.current_closing_tags       = []  # Closing tags are FILO

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs       = dict(attrs)
            tag_name    = attrs.get('name')
            tags        = self.tags.get(tag_name)

            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""

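Many thanks also to Cyril Danilevski, who wrote the article; all the credit goes to him. As he explains, "There's also <text> and </text>, marking the beginning and end of a TextBuffer's content." So if you follow the example on the site, where handle_endtag does self.markup_text += self.current_closing_tags.pop(), it will try to pop from an empty list. I therefore suggest that anyone who wants to handle tags also look at pango_html.py, which handles this by checking that the list is not empty (the same check is in handle_endtag in this answer's code). There is also a test file, test_pango_html.py.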
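
The guard can be seen in isolation in the stripped-down sketch below. It is not the class from the article or from this answer: ApplyTagToHtml, its hard-coded pairs mapping, and convert are hypothetical names, and opening tags are written in handle_starttag for brevity. The point is only that closing tags are kept on a stack and that </text> arrives when the stack is empty.

```python
from html.parser import HTMLParser


class ApplyTagToHtml(HTMLParser):
    """Hypothetical minimal converter using a stack of closing tags."""

    # Assumed mapping from TextBuffer tag names to HTML pairs.
    pairs = {"bold": ("<b>", "</b>"), "underline": ("<u>", "</u>")}

    def __init__(self):
        super().__init__()
        self.out = []
        self.closing = []  # closing tags, last in first out

    def handle_starttag(self, tag, attrs):
        if tag != "apply_tag":
            return  # e.g. the surrounding <text> element
        name = dict(attrs).get("name")
        opening, closing = self.pairs.get(name, ("", ""))
        self.out.append(opening)
        self.closing.append(closing)  # pushed even if empty, to stay balanced

    def handle_data(self, data):
        self.out.append(data)

    def handle_endtag(self, tag):
        # </text> arrives with an empty stack, so check before popping.
        if tag == "apply_tag" and self.closing:
            self.out.append(self.closing.pop())


def convert(text):
    parser = ApplyTagToHtml()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


html = convert('<text><apply_tag name="underline">At</apply_tag> the '
               '<apply_tag name="bold">first</apply_tag></text>')
```

Because each </apply_tag> pops the most recently pushed closing tag, every closing tag matches the opening tag it belongs to, which is what the positional replace in the question could not guarantee.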

Usage example

from pango_html import PangoToHtml  # assumes the class above is saved as pango_html.py

start_iter  = text_buffer.get_start_iter()
end_iter    = text_buffer.get_end_iter()
format      = text_buffer.register_serialize_tagset()
exported    = text_buffer.serialize( text_buffer,
                                     format,
                                     start_iter,
                                     end_iter )

p = PangoToHtml()
p.feed(exported)
