An overly simplistic approach is:
import xml.etree.ElementTree as ET

S = "<b> This is sentence. and this is one more. </b>"
delim = '. '

def convert(sentence):
    # Upper-case the first character and put the delimiter back.
    return sentence[0].upper() + sentence[1:] + delim

def convert_node(child):
    # Rewrite the text inside the element.
    sentences = child.text
    if sentences:
        child.text = ''
        for sentence in sentences.split(delim):
            if sentence:
                child.text += convert(sentence)
    # Rewrite the text that follows the element's closing tag.
    sentences = child.tail
    if sentences:
        child.tail = ''
        for sentence in sentences.split(delim):
            if sentence:
                child.tail += convert(sentence)
    return child

node = ET.fromstring(S)
# encoding='unicode' makes Python 3 return a str instead of bytes
S = ET.tostring(convert_node(node), encoding='unicode')
# gives '<b> This is sentence. And this is one more. </b>'
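Obviously this won't cover every case, but it will work if the task is constrained well enough. This approach should fit in with the functionality you already have. Essentially, I believe you need to parse the HTML with a parser and then manipulate the text value of each HTML node.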
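As a rough illustration of "manipulate the text value of each node", the sketch below walks every element of a parsed tree and capitalizes the sentences in both its `text` and its `tail`. The helper names `capitalize_sentences` and `capitalize_tree` are made up for this example, and it assumes well-formed markup that `xml.etree.ElementTree` can parse and sentences separated by `'. '`.

```python
import xml.etree.ElementTree as ET

DELIM = ". "

def capitalize_sentences(text):
    """Upper-case the first non-space character of each DELIM-separated sentence."""
    if not text:
        return text  # None or '' stays untouched
    fixed = []
    for part in text.split(DELIM):
        stripped = part.lstrip()
        if stripped:
            lead = part[:len(part) - len(stripped)]  # keep leading whitespace
            part = lead + stripped[0].upper() + stripped[1:]
        fixed.append(part)
    return DELIM.join(fixed)

def capitalize_tree(root):
    # iter() yields the root and every descendant element in document order.
    for element in root.iter():
        element.text = capitalize_sentences(element.text)
        element.tail = capitalize_sentences(element.tail)
    return root

doc = ET.fromstring("<div><p>first one. second one.</p><p>third one. fourth.</p></div>")
result = ET.tostring(capitalize_tree(doc), encoding="unicode")
print(result)
```

Each chunk of text is treated on its own, so text that continues a sentence after an inline tag (for example the tail of a `<b>` element) would also be capitalized. Whether that is acceptable depends on how constrained the input is.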
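If you would rather not use a parser, use a regular expression. This is likely to be much more brittle, so you will have to constrain the input more. Something like this as a start: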
>>> import re
>>> S = '<b>this is a sentence. and this is one more sentence</b>'
>>> split_str = re.split(r'(</?\w+>|\.)', S)
# split_str is ['', '<b>', 'this is a sentence', '.', ' and this is one more sentence', '</b>', '']
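Then you can check whether each piece of the split string starts with < and ends with >, and convert only the pieces that are not tags: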
for i, word in enumerate(split_str):
    # Skip empty strings, the one-character '.' separators, and tags.
    if len(word) > 1 and not (word.startswith('<') and word.endswith('>')):
        split_str[i] = convert(word)
S = ' '.join(split_str)
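For reference, here is one way the regex idea could be completed end to end. It is only a sketch: `capitalize_first_letter` and `capitalize_markup` are hypothetical helpers, not part of any library. Because `re.split` keeps the captured tags and periods in the list, the pieces are joined with an empty string so the original spacing is preserved, and no delimiter is appended.

```python
import re

# Same pattern as above: split on simple tags and on periods, keeping both.
TAG_OR_PERIOD = re.compile(r"(</?\w+>|\.)")

def capitalize_first_letter(chunk):
    """Upper-case the first non-space character, keeping leading whitespace."""
    stripped = chunk.lstrip()
    if not stripped:
        return chunk
    lead = chunk[:len(chunk) - len(stripped)]
    return lead + stripped[0].upper() + stripped[1:]

def capitalize_markup(markup):
    pieces = TAG_OR_PERIOD.split(markup)
    for i, piece in enumerate(pieces):
        is_tag = piece.startswith("<") and piece.endswith(">")
        if not is_tag and piece != ".":
            pieces[i] = capitalize_first_letter(piece)
    # The separators are still in the list, so '' rebuilds the string exactly.
    return "".join(pieces)

example = capitalize_markup("<b>this is a sentence. and this is one more sentence</b>")
print(example)  # <b>This is a sentence. And this is one more sentence</b>
```

The pattern only recognizes bare tags such as `<b>` or `</b>`. Tags with attributes, periods inside abbreviations or numbers, and text that continues a sentence after an inline tag would all need extra handling, which is why the input has to be tightly constrained for this route.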