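I have some texts, some of which contain predefined templates that add no value to the analysis.
I want to use regex to systematically remove the template (it usually consists of header text like greetings and closing text like thank you), so that I can focus on the variable text.
Both the header and the closing may contain variable text, such as a variable location or a variable staff name. So text 1 might have location equal to ABC and staff name equal to Sofia.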
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"
header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"
tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
My current attempt is below.
# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that
# starts with "Hello, thank you for contacting" up to "Please find below our available menu:"
# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"
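
One possible way to express the two corrections described in the comments above, as a minimal sketch rather than a definitive answer. It assumes the fixed anchor phrases always appear exactly as in `have`, and it uses `perl = TRUE` with the `(?s)` flag so that `.` also matches newlines; the names `header_pattern`, `footer_pattern`, and `body` are hypothetical.

```r
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# Header: fixed opening phrase, then any text (lazy, newlines included),
# then the fixed phrase that ends the header.
header_pattern <- "(?s)Hello, thank you for contacting.*?Please find below our available menu:"

# Footer: fixed sign-off, any variable staff line (lazy), fixed last line.
footer_pattern <- "(?s)\n\n Sincerely,\nThe Awesome Pizza Team\n.*?\nDelivering Pizza 24/7"

# sub() replaces only the first match, which is enough for one template per text.
body <- sub(header_pattern, "", have, perl = TRUE)
body <- sub(footer_pattern, "", body, perl = TRUE)

# Compare against the desired result from the question.
identical(body, "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n")
```

The lazy quantifier `.*?` keeps each match from running past the nearest closing anchor, which matters if the same phrase could occur again later in the text.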
Second attempt
# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
# remove any text in between 'Hello, thank you for contacting'
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',
              replacement = '',
              x = have))
# remove any text after '\n\n Sincerely,\nThe Awesome Pizza Team\n', including the text itself
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',
              replacement = '',
              x = want))
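
To apply the same idea to many texts at once, the two steps could be wrapped in a small helper. This is only a sketch: `strip_template` and the sample `texts` vector are hypothetical, the location `ABC` and staff name `Sofia` are taken from the example values mentioned earlier, and the patterns assume the template wording never changes apart from those variable parts.

```r
# Hypothetical helper: strips the known header and footer from each element.
strip_template <- function(x) {
  # (?s) lets '.' match newlines under perl = TRUE.
  header_pattern <- "(?s)^Hello, thank you for contacting.*?Please find below our available menu:"
  footer_pattern <- "(?s)\n\n Sincerely,\nThe Awesome Pizza Team\n.*$"
  x <- sub(header_pattern, "", x, perl = TRUE)
  sub(footer_pattern, "", x, perl = TRUE)
}

texts <- c(
  "Hello, thank you for contacting our Pizza Store, ABC. \n\r Please find below our available menu:\nMenu 1 USD 1.99\n\n\n Sincerely,\nThe Awesome Pizza Team\nSofia\nDelivering Pizza 24/7",
  "No template here, just free text."
)

# sub() is vectorised, so every element is processed in one call.
cleaned <- strip_template(texts)
cleaned
```

Texts that do not contain the template are returned unchanged, so the helper can be run over a mixed column without filtering first.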