r - Regular expression in R: splitting sentences at keywords

Question

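Hello,

I would like to split each sentence into two parts: one from keyword_1 up to keyword_2, and one from keyword_2 to the end of the sentence, preferably using a regular expression.

For example, my ideal output is shown below.

Below is a dataset I made.

Dataset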

    library(tibble)

    keyword_1 <- c("coffee", "apple", "rainbow", "strawberry shortcake")
    keyword_2 <- c("life", "new york", "seven colours", "sweet and yummy")

    raw <-
      tibble(
        sentence = c(
          "coffee is keyword_1_1 life is keyword_2_1",
          "apple is keyword_1_2 new york is keyword_2_2",
          "rainbow is keyword_1_3 seven colours is keyword_2_3",
          "strawberry shortcake is keyword_1_4 sweet and yummy is keyword 2_4"
        ))
        
    raw

    #> # A tibble: 4 x 1
    #>   sentence                                                          
    #>   <chr>                                                             
    #> 1 coffee is keyword_1_1 life is keyword_2_1                         
    #> 2 apple is keyword_1_2 new york is keyword_2_2                      
    #> 3 rainbow is keyword_1_3 seven colours is keyword_2_3               
    #> 4 strawberry shortcake is keyword_1_4 sweet and yummy is keyword 2_4

Expected output

library(tibble)

output = tibble(
  output1 = c(
    "coffee is keyword_1_1",
    "apple is keyword_1_2",
    "rainbow is keyword_1_3",
    "strawberry shortcake is keyword_1_4"
  ),
  output2 = c("life is keyword_2_1", "new york is keyword_2_2",
              "seven colours is keyword_2_3", "sweet and yummy is keyword 2_4")
)

output

#> # A tibble: 4 x 2
#>   output1                             output2                       
#>   <chr>                               <chr>                         
#> 1 coffee is keyword_1_1               life is keyword_2_1           
#> 2 apple is keyword_1_2                new york is keyword_2_2       
#> 3 rainbow is keyword_1_3              seven colours is keyword_2_3  
#> 4 strawberry shortcake is keyword_1_4 sweet and yummy is keyword 2_4

Created on 2021-03-18 by the reprex package (v0.3.0)

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS  10.16                
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2021-03-18                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source                     
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)             
#>  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)             
#>  cli           2.3.1   2021-02-23 [1] CRAN (R 4.0.2)             
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)             
#>  debugme       1.1.0   2017-10-22 [1] CRAN (R 4.0.2)             
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)             
#>  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.2)             
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)             
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)             
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)             
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)             
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)             
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)             
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)             
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)             
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.2)             
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)             
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)             
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)             
#>  pillar        1.5.0   2021-02-22 [1] CRAN (R 4.0.2)             
#>  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)             
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)             
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)             
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)             
#>  processx      3.4.4   2020-09-03 [1] CRAN (R 4.0.2)             
#>  ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.2)             
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)             
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)             
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)             
#>  rmarkdown     2.5     2020-10-21 [1] CRAN (R 4.0.2)             
#>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)             
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)             
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)             
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)             
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)             
#>  testthat      3.0.0   2020-10-31 [1] CRAN (R 4.0.2)             
#>  tibble      * 3.1.0   2021-02-25 [1] CRAN (R 4.0.2)             
#>  usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.2)             
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 4.0.2)             
#>  vctrs         0.3.4   2020-08-29 [1] CRAN (R 4.0.2)             
#>  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)             
#>  xfun          0.19.3  2020-11-06 [1] Github (yihui/xfun@12e77f5)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)             
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

score 0 · Accepted Answer

Here is a data.table approach that splits on a lookbehind regex pattern:

library( data.table )
setDT(raw)[, paste0( "output", 1:2 ) := 
             lapply( tstrsplit(sentence, "(?<=_[0-9]{1}_[0-9]{1})", perl = TRUE ), 
                     trimws ) ][, sentence := NULL][]

#                                output1                        output2
# 1:               coffee is keyword_1_1            life is keyword_2_1
# 2:                apple is keyword_1_2        new york is keyword_2_2
# 3:              rainbow is keyword_1_3   seven colours is keyword_2_3
# 4: strawberry shortcake is keyword_1_4 sweet and yummy is keyword 2_4
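
For readers without data.table, the same lookbehind idea can be sketched in base R. This is only an illustration under a few assumptions: the vector `sentence` copies the question's data, the names `parts` and `output` are made up for the example, and the pattern additionally consumes the whitespace after the `_digit_digit` suffix, so no `trimws` step is needed.

```r
# Sentences copied from the question (row 4 keeps its original "keyword 2_4").
sentence <- c(
  "coffee is keyword_1_1 life is keyword_2_1",
  "apple is keyword_1_2 new york is keyword_2_2",
  "rainbow is keyword_1_3 seven colours is keyword_2_3",
  "strawberry shortcake is keyword_1_4 sweet and yummy is keyword 2_4"
)

# Split at whitespace that directly follows "_<digit>_<digit>".
# The lookbehind (?<=...) requires perl = TRUE; the matched whitespace is dropped.
parts <- strsplit(sentence, "(?<=_[0-9]_[0-9])\\s+", perl = TRUE)

# Every sentence yields two pieces here, so bind them into two columns.
output <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(output) <- c("output1", "output2")
output
```

On other data it is worth checking `lengths(parts)` first, since a sentence with more or fewer than one such suffix followed by whitespace would not split into exactly two pieces.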

score 0 · Accepted Answer

Assuming the pattern is always "keyword_number_number", the fourth entry is missing a "_" and should be:

raw[4,1] = "strawberry shortcake is keyword_1_4 sweet and yummy is keyword_2_4"

Then we can write:

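pattern = "([a-z ]+ keyword_[0-9]_[0-9]) ([a-z ]+ keyword_[0-9]_[0-9])"

# one row per sentence, one column per captured half
a = matrix(NA_character_, nrow(raw), 2)
for (i in seq_len(nrow(raw))) {
  for (j in 1:2) {
    # "\\1" / "\\2" keep only the first / second capture group
    a[i, j] = gsub(pattern, paste0("\\", j), raw$sentence[i])
  }
}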

Output:

> a
     [,1]                                  [,2]                            
[1,] "coffee is keyword_1_1"               "life is keyword_2_1"           
[2,] "apple is keyword_1_2"                "new york is keyword_2_2"       
[3,] "rainbow is keyword_1_3"              "seven colours is keyword_2_3"  
[4,] "strawberry shortcake is keyword_1_4" "sweet and yummy is keyword_2_4"
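
The same capture-group pattern can also be applied without an explicit loop, because `sub` is vectorised over its input. The sketch below assumes the corrected fourth sentence and reuses the pattern from this answer; the names `sentence` and `out` are illustrative only.

```r
# Sentences from the question, with the fourth entry corrected to "keyword_2_4".
sentence <- c(
  "coffee is keyword_1_1 life is keyword_2_1",
  "apple is keyword_1_2 new york is keyword_2_2",
  "rainbow is keyword_1_3 seven colours is keyword_2_3",
  "strawberry shortcake is keyword_1_4 sweet and yummy is keyword_2_4"
)

# Two capture groups, one per half of the sentence (same pattern as above).
pattern <- "([a-z ]+ keyword_[0-9]_[0-9]) ([a-z ]+ keyword_[0-9]_[0-9])"

# sub() works on the whole vector, so each backreference yields a full column.
out <- data.frame(
  output1 = sub(pattern, "\\1", sentence),
  output2 = sub(pattern, "\\2", sentence),
  stringsAsFactors = FALSE
)
out
```

Note that `sub` returns a sentence unchanged when the pattern does not match, so on real data it may be worth checking with `grepl(pattern, sentence)` before trusting both columns.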
