
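I have some texts, some of which follow predefined templates that add no value to the analysis.

I want to use regex to systematically remove the template (usually a header text such as greetings and a closing text such as a thank you), so that I can focus on the variable text.

Both the header and the closing may themselves contain variable text, for example a variable location or a variable staff name. So text 1 might have location equal to ABC and staff name equal to Sofia.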

have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"


header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"

tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"


My current attempt is as follows.

# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that 
# starts with "Hello, thank you for contacting" up to "Please find our available menu"

# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that 
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"

Second attempt

# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# remove any text in between 'Hello, thank you for contacting` 
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',''
     , x = have))

# remove any text after `\n\n Sincerely,\nThe Awesome Pizza Team\n`, inclusive the text itself 
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',''
             , x = want))


1 Answer


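One option could be to match all lines before Menu, then capture all consecutive lines that start with Menu, and finally match the remaining lines starting from Sincerely.

In the replacement, use capture group 1.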

^[\s\S]*?\R((?:Menu .*\R+)*)\s*Sincerely,[\s\S]*

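The pattern matches:

  • ^ Start of the string
  • [\s\S]*?\R Match any character, as few times as possible, followed by a newline
  • ( Capture group 1
    • (?:Menu .*\R+)* Repeatedly match all lines that start with Menu, each followed by 1 or more newlines
  • ) Close group 1
  • \s* Match optional whitespace characters
  • Sincerely, Match literally
  • [\s\S]* Match the rest of the lines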


Example

have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
trimws(gsub('^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*','\\1', have, perl = TRUE))

Output

[1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"
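
Since the header and closing contain variable text, it may help to check the same pattern against several texts at once. The following is a minimal sketch, not part of the original answer: the helper name strip_template and the two sample texts (locations ABC and XYZ, staff names Sofia and Marco) are made up for illustration.

```r
# Hypothetical helper: keeps only the consecutive "Menu ..." lines.
strip_template <- function(x) {
  pattern <- '^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*'
  # perl = TRUE is needed because \R is PCRE syntax
  trimws(sub(pattern, '\\1', x, perl = TRUE))
}

# Made-up sample texts with different variable parts
texts <- c(
  "Hello, thank you for contacting our Pizza Store, ABC. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\nSofia\nDelivering Pizza 24/7",
  "Hello, thank you for contacting our Pizza Store, XYZ. \n\r Please find below our available menu:\nMenu 3 USD 5.49\n\n\n Sincerely,\nThe Awesome Pizza Team\nMarco\nDelivering Pizza 24/7"
)

result <- strip_template(texts)
print(result)
```

Because sub() is vectorised over x, the helper can be applied directly to a character vector or a data frame column. A text that does not contain "Sincerely," does not match the pattern at all and is returned unchanged apart from the trimws() call.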

A slightly longer, more precise pattern could be:

 ^(?:(?!Menu ).*(?:\R(?!Menu ).*)*\R+)?(Menu .*(?:\RMenu .*)*)\R\s*Sincerely,[\s\S]*
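
As a sketch of how this longer pattern could be used from R (the variable names precise and want_precise are illustrative, not from the original answer), the backslashes are doubled inside the R string and perl = TRUE is set, since \R and the negative lookahead (?!Menu ) are PCRE features:

```r
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# Same pattern as above, with backslashes escaped for an R string literal
precise <- '^(?:(?!Menu ).*(?:\\R(?!Menu ).*)*\\R+)?(Menu .*(?:\\RMenu .*)*)\\R\\s*Sincerely,[\\s\\S]*'

# Group 1 holds only the Menu lines, without leading or trailing newlines
want_precise <- sub(precise, '\\1', have, perl = TRUE)
print(want_precise)
```

With this pattern, group 1 starts at the first Menu line and ends at the last one, so the result should need no trimws() call. It differs from the want value in the question only by the surrounding newlines.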


Answered 2021-07-08T09:57:18.153