r - Search for every nth consecutive or frozen character in R

Question

date                   val                          cal_val
1/12/2017 0:15  (0_04),(1_08),(0_12),(1_14)         (0_04),(1_08),(0_12),(1_14)
1/12/2017 0:30  (0_22),(0_25),(1_29)                 (0_22),(1_29)
1/12/2017 0:45  (1_34),(1_38),(0_40),(1_44)         (1_38),(0_40),(1_44)
1/12/2017 1:00  (1_47),(1_49),(1_53),(1_57),(0_59)  (1_57),(0_59)
1/12/2017 1:15  (0_07),(0_09),(0_10),(0_13),(1_14)  (0_7),(1_14)

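How can I search each character after the special character "(" and collapse consecutive (frozen) values? If the consecutive values start with "0", take the minimum value after "_"; if they start with "1", take the maximum value instead. If there are no consecutive values, the row stays unchanged.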

i.e. in row_1 : there are no consecutive values, so it stays unchanged.
       row_2 : (0_22),(0_25) are consecutive, so take the min, i.e. (0_22), and keep the rest
       row_3 : (1_34),(1_38) are consecutive, so take the max, i.e. (1_38), and keep the rest
       row_5 : (0_07),(0_09),(0_10),(0_13) are consecutive, so take the min, i.e. (0_7), and keep the rest

Thanks in advance.

score 0 · Accepted Answer

Another approach could be:

library(tidyverse)
library(data.table)

#prepare data to count consecutive 0 or 1
df1 <- df %>%
  mutate(val = gsub("[()]", "", val)) %>%
  separate_rows(val, sep = ",") %>%
  separate("val", c("val_pre", "val_post")) 

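#identify consecutive 0 or 1 - TRUE in 'flag' column marks a row that belongs to a run of 0s or 1s
#(seq_ind is the position within each run; a row is in a run if it or the next row has seq_ind > 1)
setDT(df1)[, seq_ind := seq(.N), by = .(date_col, rleid(val_pre))
           ][, flag := shift(seq_ind, type = "lead") > 1 | seq_ind > 1, by = date_col]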

#keep only the rows that belong to a run; repeated 0s are replaced with the min value and repeated 1s with the max value
df2 <- setDF(df1) %>%
  filter(flag) %>%
  group_by(date_col, val_pre) %>%
  mutate(val_post = ifelse(val_pre == 0, min(val_post), max(val_post))) %>%
#row-bind non-consecutive rows as is (flag is FALSE or NA)
  bind_rows(setDF(df1) %>% filter(!flag | is.na(flag))) %>%
  select(-seq_ind, -flag) %>%
  distinct() %>%
  mutate(cal_val = paste0("(", val_pre, "_", val_post, ")")) %>%
  group_by(date_col) %>%
  summarise(cal_val = paste(cal_val, collapse = ","))

This gives:

df2

  date_col       cal_val                                 
1 1/12/2017 0:15 (0_04),(1_08),(0_12),(1_14)
2 1/12/2017 0:30 (0_22),(1_29)              
3 1/12/2017 0:45 (1_38),(0_40),(1_44)       
4 1/12/2017 1:00 (1_57),(0_59)              
5 1/12/2017 1:15 (0_07),(1_14)

Sample data:

df <- structure(list(date_col = c("1/12/2017 0:15", "1/12/2017 0:30", 
"1/12/2017 0:45", "1/12/2017 1:00", "1/12/2017 1:15"), val = c("(0_04),(1_08),(0_12),(1_14)", 
"(0_22),(0_25),(1_29)", "(1_34),(1_38),(0_40),(1_44)", "(1_47),(1_49),(1_53),(1_57),(0_59)", 
"(0_07),(0_09),(0_10),(0_13),(1_14)")), .Names = c("date_col", 
"val"), class = "data.frame", row.names = c(NA, -5L))
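The run-detection idea can also be illustrated on a single string with base R's rle(). This is only a minimal sketch, not the code from the answer above: the helper name collapse_runs is hypothetical, and it assumes every token has the form (prefix_number). Unlike the grouped pipeline, it treats each run on its own.

```r
# Hypothetical helper: collapse runs of equal prefixes within one "val" string.
collapse_runs <- function(x) {
  # Strip parentheses and split into tokens such as "0_22"
  tokens <- strsplit(gsub("[()]", "", x), ",", fixed = TRUE)[[1]]
  parts  <- strsplit(tokens, "_", fixed = TRUE)
  pre    <- vapply(parts, `[`, character(1), 1)
  post   <- vapply(parts, `[`, character(1), 2)

  # rle() gives the length of each run of identical prefixes
  runs   <- rle(pre)
  run_id <- rep(seq_along(runs$lengths), runs$lengths)

  # Per run: keep the smallest suffix for "0", the largest for "1"
  kept <- vapply(split(seq_along(pre), run_id), function(idx) {
    num  <- as.numeric(post[idx])
    pick <- if (pre[idx[1]] == "0") which.min(num) else which.max(num)
    paste0("(", pre[idx[1]], "_", post[idx[pick]], ")")
  }, character(1))

  paste(kept, collapse = ",")
}

collapse_runs("(0_22),(0_25),(1_29)")
# "(0_22),(1_29)"
collapse_runs("(1_34),(1_38),(0_40),(1_44)")
# "(1_38),(0_40),(1_44)"
```

Applied to a whole column, something like sapply(df$val, collapse_runs, USE.NAMES = FALSE) should produce one collapsed string per row.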

score 0 · Accepted Answer

Here is a tidyverse solution:

You can use stringr functions to pull out the 0-matching and 1-matching cases separately, then combine them after applying min/max:

library(tidyverse)

df %>%
  rowwise() %>%
  mutate(
    zero = min(
      as.numeric(
        str_extract_all(
          str_extract(val, "(\\(0_\\d+\\),){2,}"), # find 0-consecutives
          "\\d{2}")[[1]])), # pull out the 2-digit values
    one = max(
      as.numeric(
        str_extract_all(
          str_extract(val, "(\\(1_\\d+\\),){2,}"), # find 1-consecutives
          "\\d{2}")[[1]])),
    final = sum(zero, one, na.rm=TRUE)) 

# A tibble: 5 x 5
  date           val                          zero   one final
  <chr>          <chr>                       <dbl> <dbl> <dbl>
1 1/12/2017 0:15 (0_04),(1_08),(0_12),(1_14)   NA    NA     0.
2 1/12/2017 0:30 (0_22),(0_25),(1_29)          22.   NA    22.
3 1/12/2017 0:45 (1_34),(1_38),(0_40),(1_44)   NA    38.   38.
4 1/12/2017 1:00 (1_47),(1_49),(1_53),(1_57…   NA    57.   57.
5 1/12/2017 1:15 (0_07),(0_09),(0_10),(0_13…    7.   NA     7.
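To see what the nested calls do, it may help to run them one step at a time on a single row. The sketch below uses the second sample row; the variable names row2, zero_run, zero_vals and one_run are only illustrative.

```r
library(stringr)

row2 <- "(0_22),(0_25),(1_29)"

# Step 1: grab the first stretch of two or more "(0_..)," tokens in a row
zero_run <- str_extract(row2, "(\\(0_\\d+\\),){2,}")
zero_run
# "(0_22),(0_25),"

# Step 2: pull the two-digit values out of that stretch
zero_vals <- as.numeric(str_extract_all(zero_run, "\\d{2}")[[1]])
zero_vals
# 22 25

min(zero_vals)
# 22

# There is no run of 1s in this row, so str_extract() returns NA
one_run <- str_extract(row2, "(\\(1_\\d+\\),){2,}")
one_run
# NA
```

Note that the pattern expects a comma after every token, so a run sitting at the very end of a string would not be matched. None of the sample rows ends in such a run.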
