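I found a solution for correctly getting the tags out of a serialized text buffer in Serialising Gtk TextBuffers to HTML. It does not use register_serialize_format; as the site says, it is possible to write a serializer, but the documentation is scarce (I think that would be done with register_serialize_format). Either way, the solution uses html.parser and xml.etree.ElementTree, although BeautifulSoup can be used instead.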
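Basically, this script processes the serialized text buffer content with an HTML parser. The hard work starts in feed, which receives bytes (the serialized text buffer content) and returns a string (the formatted text with HTML tags). First it finds the index of <text_view_markup> in order to drop the header GTKTEXTBUFFERCONTENTS-0001. That header is the part that cannot be decoded with decode('utf-8'); trying to do so raises "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte". You could use decode('utf-8', errors='ignore') or errors='replace', but since feed discards this part of the content anyway, a plain .decode() is enough.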
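The header-stripping step can be tried on its own. The sketch below uses a made-up byte string in place of real output from text_buffer.serialize(); the sample payload and the helper name strip_header are assumptions for illustration, not part of the original script.

```python
# Hypothetical sample: a header with a byte that is not valid UTF-8,
# followed by the XML payload. Real data comes from Gtk.TextBuffer.serialize().
sample = (
    b"GTKTEXTBUFFERCONTENTS-0001\x00\x00\x00\xb4"
    b"<text_view_markup><tags></tags><text>hello</text></text_view_markup>"
)


def strip_header(data: bytes) -> str:
    """Drop everything before <text_view_markup> and decode the rest."""
    start = data.find(b"<text_view_markup>")
    if start == -1:
        raise ValueError("no <text_view_markup> element found")
    return data[start:].decode("utf-8")


markup = strip_header(sample)

# Decoding the whole payload, header included, fails on the 0xb4 byte.
try:
    sample.decode("utf-8")
    header_decodes = True
except UnicodeDecodeError:
    header_decodes = False
```

Unlike the feed method in this answer, the helper raises an error when the marker is missing instead of slicing from index -1.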
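Then the tags and the text are processed separately. The tags are processed first; here I used xml.etree.ElementTree, but BeautifulSoup can be used as in the original script. Once the tags are processed, feed is called with the text; this feed is the HTMLParser method.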
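As a rough illustration of the tag-processing step, here is a small sketch that reads a <tags> section with xml.etree.ElementTree. The XML sample is hypothetical and only shaped like what the script expects. It is also a variation on the script: it collects the <attr> children of each <tag> rather than pairing tags and attributes with zip, so a tag with several attributes would still be read correctly.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical <tags> section, shaped like the one the script parses.
tags_xml = """<tags>
 <tag name="bold" priority="0">
  <attr name="weight" type="gint" value="700" />
 </tag>
 <tag name="italic" priority="1">
  <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
 </tag>
</tags>"""

root = fromstring(tags_xml)

# Map each tag name to the (name, type, value) triples of its attributes.
parsed = {}
for tag in root.iter("tag"):
    parsed[tag.attrib["name"]] = [
        (attr.attrib["name"], attr.attrib["type"], attr.attrib["value"])
        for attr in tag.iter("attr")
    ]
```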
Also, the tags it handles are not limited to italics, bold and color; you only need to update the tag2html dictionary.
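A minimal sketch of what such an update could look like. To keep it runnable without PyGObject, the dictionary below is a stand-in for PangoToHtml.tag2html with its keys written as literal strings; I am assuming these strings match what Pango.Style.OBLIQUE.value_name and Pango.Underline.DOUBLE.value_name return, so check them against your own serialized buffer. The wrap helper is hypothetical.

```python
from typing import Dict, Tuple

# Stand-in for PangoToHtml.tag2html. The keys are literal strings here
# (assumed to match the serialized attribute values).
tag2html: Dict[str, Tuple[str, str]] = {
    "PANGO_STYLE_ITALIC": ("<i>", "</i>"),
    "700": ("<b>", "</b>"),
    "PANGO_UNDERLINE_SINGLE": ("<u>", "</u>"),
}

# Hypothetical additions: map two more attribute values to HTML tags.
tag2html.update({
    "PANGO_STYLE_OBLIQUE": ("<i>", "</i>"),
    "PANGO_UNDERLINE_DOUBLE": ("<u>", "</u>"),
})


def wrap(value: str, text: str) -> str:
    """Wrap text in the opening and closing tags registered for value."""
    opening, closing = tag2html[value]
    return opening + text + closing
```

Because the lookup in feed is done by attribute value, this only works for attributes whose values are distinctive; a boolean attribute whose value is just TRUE would need a lookup by attribute name instead.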
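Besides not using BeautifulSoup, I made a few other changes. As for tag names, all of my tags have names, so none of them use an id, and my color tags already hold hex values, so I did not need the pango_to_html_hex method. This is how it looks now: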
from html.parser import HTMLParser
from typing import Dict, List, Tuple
from xml.etree.ElementTree import fromstring

from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango


class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """

    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    # Class attribute: maps a serialized attribute value (or, for colours, the
    # attribute name) to its opening and closing html tags.
    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and the buffer's iterators) to an html string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as valid chars.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal
        root = fromstring(tags)

        # Pair each <tag> with an <attr> in document order. This assumes that
        # every tag carries exactly one attribute.
        tag_elements = list(root.iter('tag'))
        attr_elements = list(root.iter('attr'))

        tags_list = {}
        for tag_element, attr_element in zip(tag_elements, attr_elements):
            opening_tags = ""
            closing_tags = ""

            tag_name = tag_element.attrib['name']
            vtype = attr_element.attrib['type']
            value = attr_element.attrib['value']
            name = attr_element.attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    # hex color already handled by the
                    # gtk.gdk.color.to_string() method
                    hex_color = value.replace(":", "")
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags  # closing tags are FILO

            tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed
        # all of it.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('name')
            tags = self.tags.get(tag_name)

            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
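Many thanks also to Cyril Danilevski, who wrote that post; all credit goes to him. As he explains, "There is also <text> and </text>, which mark the beginning and end of the TextBuffer's content." So if you follow the example on the site, where handle_endtag contains self.markup_text += self.current_closing_tags.pop(), it will try to pop from an empty list. I therefore suggest that anyone who wants to handle the tags also look at pango_html.py, which deals with this by checking that the list is not empty (the same check is in handle_endtag in the code of this answer). There is also a test file, test_pango_html.py.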
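To see why the check matters, here is a stripped-down sketch that needs no GTK. The class ClosingTagDemo and its hard-coded <b> tag are hypothetical and only isolate the guard; the input string mimics the <text> section of a serialized buffer.

```python
from html.parser import HTMLParser
from typing import List


class ClosingTagDemo(HTMLParser):
    """Hypothetical, stripped-down parser that only shows the guard."""

    def __init__(self) -> None:
        super().__init__()
        self.output = ""
        self.closing: List[str] = []  # stack of pending closing tags

    def handle_starttag(self, tag, attrs):
        if tag == "apply_tag":
            self.output += "<b>"
            self.closing.append("</b>")

    def handle_data(self, data):
        self.output += data

    def handle_endtag(self, tag):
        # </text> arrives when the stack is already empty; without this
        # check, pop() would raise IndexError.
        if self.closing:
            self.output += self.closing.pop()


demo = ClosingTagDemo()
demo.feed('<text>plain <apply_tag name="bold">bold</apply_tag></text>')
demo.close()
```

After this runs, demo.output holds plain <b>bold</b>, and the closing </text> is skipped without an error.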
Usage example
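# Assumes the class above is saved as pango_html.py and that text_buffer is
# an existing Gtk.TextBuffer.
from pango_html import PangoToHtml

start_iter = text_buffer.get_start_iter()
end_iter = text_buffer.get_end_iter()
serialize_format = text_buffer.register_serialize_tagset()
exported = text_buffer.serialize(text_buffer,
                                 serialize_format,
                                 start_iter,
                                 end_iter)

p = PangoToHtml()
html = p.feed(exported)  # feed returns the text with html tags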