python - How to change the case of HTML text to sentence case in Python

Question

I have a string containing HTML text; let's call it S.

S = "<b>this is a sentence. and this is one more sentence</b>"

What I want is to convert S above into the following text:

S = <b>This is a sentence. And this is one more sentence</b>

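The problem is that my function can convert any plain text to sentence case, but when the text contains HTML, I have no way to tell the function which parts are text and which parts are HTML that should be left alone. So when I pass S to the function, it produces an incorrect result like this: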

S = <b>this is a sentence. And this is one more sentence</b>

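Because it treats '<' as the first character of the sentence, it tries to convert '<' to uppercase, which is still '<'.

So my question is: if the text is already encoded as HTML, how can I convert it to sentence case in Python? I also don't want to lose the HTML formatting.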

score 0 · Accepted Answer

An overly simplistic approach would be:

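import xml.etree.ElementTree as ET

S = "<b> This is sentence. and this is one more. </b>"

delim = '. '

def convert(sentence):
    # Capitalize the first non-whitespace character; leave the rest untouched.
    stripped = sentence.lstrip()
    if not stripped:
        return sentence
    lead = sentence[:len(sentence) - len(stripped)]
    return lead + stripped[0].upper() + stripped[1:]

def convert_text(text):
    # Split on the delimiter, convert each sentence, and rejoin with the same delimiter.
    if not text:
        return text
    return delim.join(convert(s) for s in text.split(delim))

def convert_node(node):
    # Convert the node's own text and tail, then recurse into its children.
    # Simplification: a tail is treated as if it always starts a new sentence.
    node.text = convert_text(node.text)
    node.tail = convert_text(node.tail)
    for child in node:
        convert_node(child)
    return node

node = ET.fromstring(S)
# encoding='unicode' makes tostring return a str instead of bytes on Python 3
S = ET.tostring(convert_node(node), encoding='unicode')

# gives '<b> This is sentence. And this is one more. </b>'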

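Obviously this will not cover every case, but it will work if the task is constrained well enough. Note that xml.etree.ElementTree only accepts well-formed XML, so HTML with unclosed tags will fail to parse. This approach should fit with the function you already have. Essentially, I believe you need to parse the HTML with a parser and then operate on the text value of each HTML node.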
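
For HTML that is not well-formed XML, the same "parse, then touch only the text" idea can be sketched with the standard library's html.parser. This is a minimal sketch, not a definitive implementation: the names SentenceCaser and sentence_case_html are made up for illustration, a sentence is assumed to end at '.', '!' or '?' followed by whitespace (mirroring the '. ' delimiter above), end tags are re-emitted in lowercase, and doctype declarations are dropped.

```python
from html.parser import HTMLParser


class SentenceCaser(HTMLParser):
    """Hypothetical helper: re-emit the markup and capitalize sentence starts in text."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.out = []
        self.at_start = True      # the next letter begins a sentence
        self.after_stop = False   # the previous character was '.', '!' or '?'

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written, attributes included
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_data(self, data):
        chars = []
        for ch in data:
            if self.after_stop:
                # Assumption: a stop only ends a sentence when whitespace follows it
                if ch.isspace():
                    self.at_start = True
                self.after_stop = False
            if ch in ".!?":
                self.after_stop = True
            elif self.at_start and ch.isalpha():
                ch = ch.upper()
                self.at_start = False
            chars.append(ch)
        self.out.append("".join(chars))


def sentence_case_html(html):
    parser = SentenceCaser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


S = "<b>this is a sentence. and this is one more sentence</b>"
print(sentence_case_html(S))
# <b>This is a sentence. And this is one more sentence</b>
```

Because the sentence state lives on the parser rather than on each node, text that continues after an inline tag (for example "hello <i>world</i>. next one") is not capitalized mid-sentence. Abbreviations such as "e.g. " are still treated as sentence ends.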

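If you would rather not use a parser, use a regular expression. This is likely to be far more fragile, so you will have to constrain the input even more. Something like this to start: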

>>> import re
>>> split_str = re.split(r'(</?\w+>|\.)', S)
# with S from the question, split_str is ['', '<b>', 'this is a sentence', '.', ' and this is one more sentence', '</b>', '']

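Then check whether each item in the split list starts with < and ends with > (that is, whether it is a tag), and capitalize only the items that are actual text: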

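for i, word in enumerate(split_str):
    # Skip empty pieces, the '.' separators, and anything that looks like a tag
    is_tag = word.startswith('<') and word.endswith('>')
    if word.strip() and word != '.' and not is_tag:
        lead = len(word) - len(word.lstrip())  # keep leading whitespace as-is
        split_str[i] = word[:lead] + word[lead].upper() + word[lead + 1:]

# The separators were captured by the split, so join without adding spaces
S = ''.join(split_str)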
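
The regex steps above can be put together as one function. This is only a sketch under stated assumptions: the name sentence_case_regex and the pattern TAG_OR_STOP are made up for illustration, the pattern is widened to <[^>]+> so tags with attributes are matched too, and only '.' is treated as a sentence end. Unlike a plain loop over the pieces, it remembers whether a sentence has already started, so text following an inline tag is not capitalized by mistake.

```python
import re

# Hypothetical pattern: any <...> tag (attributes allowed) or a literal period.
TAG_OR_STOP = re.compile(r'(<[^>]+>|\.)')


def sentence_case_regex(html):
    # Because the pattern has a capturing group, re.split keeps the separators:
    # even indexes hold text, odd indexes hold the matched tag or '.'.
    parts = TAG_OR_STOP.split(html)
    at_start = True  # the next non-blank text piece begins a sentence
    for i, part in enumerate(parts):
        if i % 2:
            if part == '.':
                at_start = True
            continue  # tags and periods are copied through unchanged
        if at_start and part.strip():
            lead = len(part) - len(part.lstrip())  # preserve leading whitespace
            parts[i] = part[:lead] + part[lead].upper() + part[lead + 1:]
            at_start = False
    return ''.join(parts)


S = "<b>this is a sentence. and this is one more sentence</b>"
print(sentence_case_regex(S))
# <b>This is a sentence. And this is one more sentence</b>
```

It remains fragile: a '>' inside an attribute value, comments, or script content would confuse the pattern, which is why a parser is the safer choice.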
