
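I am trying to detect whether certain combinations of patterns are present or absent in one variable of a data frame.

There are similar questions, but I could not find one that answers exactly what I am trying to achieve.

I am trying to:

  • Check whether a pattern is present
  • Combine multiple patterns with logical operators (and, or, not = &, |, !)
  • Ignore case
  • Return the output as another column of TRUE/FALSE values

I still have not found a solution, but I will share what I have done so far to get your guidance:

Create a sample data frame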

x=structure(list(Sources = structure(c(1L, 7L, 6L, 8L, 9L, 4L,
3L, 5L, 2L), .Label = 
  c("Found in all nutritious foods in moderate amounts: pork, whole grain foods or enriched breads and cereals, legumes, nuts and seeds",
  
"Found only in fruits and vegetables, especially citrus fruits, vegetables in the cabbage family, cantaloupe, strawberries, peppers, tomatoes, potatoes, lettuce, papayas, mangoes, kiwifruit",
  
"Leafy green vegetables and legumes, seeds, orange juice, and liver; now added to most refined grains",
"Meat, fish, poultry, vegetables, fruits", 
  "Meat, poultry, fish, seafood, eggs, milk and milk products; not found in plant foods",
"Meat, poultry, fish, whole grain foods, enriched breads and cereals, vegetables (especially mushrooms, asparagus, and leafy green vegetables), peanut butter",
  
"Milk and milk products; leafy green vegetables; whole grain foods, enriched breads and cereals",
"Widespread in foods", "Widespread in foods; also produced in intestinal tract by bacteria"
), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))

This code detects the presence of either of the 2 specified strings; (?i) means ignore case.

library(stringr)

x$present <- str_detect(x$Sources, "(?i)Vegetables|(?i)Meat")

# but it does not work with "and"
x$present <- str_detect(x$Sources, "(?i)Vegetables&(?i)Meat")
# here it gives FALSE for all; my expected output is TRUE for the rows that contain both words
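
The all-FALSE result is consistent with how regular expressions treat `&`: it has no special meaning there, so the pattern only matches the literal text "Vegetables&Meat". The sketch below illustrates this in base R and shows one possible single-pattern workaround using lookaheads. The `sources` vector is a made-up stand-in for `x$Sources`, and the lookahead pattern assumes each string contains no line breaks.

```r
# Hypothetical stand-in for x$Sources (not the original data).
sources <- c("Meat, fish, poultry, vegetables, fruits",
             "Vegetables&Meat",
             "Widespread in foods")

# "&" is an ordinary character in a regex, so only the literal text matches.
literal <- grepl("vegetables&meat", sources, ignore.case = TRUE)
print(literal)  # FALSE TRUE FALSE

# One lookahead per required word; perl = TRUE enables lookahead syntax.
both <- grepl("^(?=.*vegetables)(?=.*meat)", sources,
              ignore.case = TRUE, perl = TRUE)
print(both)     # TRUE TRUE FALSE
```

Combining two separate logical vectors with `&` gives the same result as the lookahead pattern and is usually easier to read.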

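This one works by filtering for the desired combination:

  • It works with |, & and !
  • But it only filters the rows of interest. Is there a way to add another column to the dataset indicating whether the pattern is present?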
x %>% filter(str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Meat"))

x %>% filter(str_detect(Sources, "(?i)Vegetables") & !str_detect(Sources, "(?i)Meat")) # does not contain meat

x %>% filter(!str_detect(Sources, "(?i)Meat") & str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Grain"))

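Finally, I found this package, which looks like it can do the job, but it only works on vectors. Is there a way to make it work on a variable in a data frame, for example by using lapply or something similar to return another variable with TRUE/FALSE?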

library(sjmisc)
 
str_contains(x$Sources, "Meat", ignore.case = TRUE)

2 Answers


Use mutate with str_detect to create a new column:

library(tidyverse)

x %>% 
  mutate(pattern_detected = 
           str_detect(Sources, "(?i)Vegetables") & 
           str_detect(Sources, "(?i)Meat"))
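
The same idea extends to "or" and "not": build one logical vector per pattern, then combine them with `&`, `|` and `!`. Below is a minimal sketch, assuming dplyr and stringr are installed. The `foods` data frame and the column names `has_veg`, `has_meat`, `both`, `veg_only` and `either` are invented for illustration, and `regex(ignore_case = TRUE)` is used as an alternative to the inline `(?i)` flag.

```r
library(dplyr)
library(stringr)

# Hypothetical three-row sample standing in for the Sources column.
foods <- data.frame(
  Sources = c("Meat, fish, poultry, vegetables, fruits",
              "Leafy green vegetables and legumes",
              "Widespread in foods"),
  stringsAsFactors = FALSE
)

result <- foods %>%
  mutate(
    # One case-insensitive logical vector per pattern.
    has_veg  = str_detect(Sources, regex("vegetables", ignore_case = TRUE)),
    has_meat = str_detect(Sources, regex("meat", ignore_case = TRUE)),
    # Combine the logical vectors, not the regex strings.
    both     = has_veg & has_meat,
    veg_only = has_veg & !has_meat,
    either   = has_veg | has_meat
  )

print(result)
```

Keeping the intermediate `has_*` columns makes each combination easy to check; they can be dropped afterwards with `select()` if they are not needed.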
answered 2020-09-26T18:59:24.193

Use the function from the sjmisc package on a data.frame. The workhorse here is sapply, applied twice: once over the columns of the data.frame and once over the rows.

library(sjmisc)
# build dummy data.frame
df <- data.frame(x, x, x)

sapply(df, function(x) sapply(x, 
                             str_contains, 
                             pattern = c("Meat", "Vegetables"), 
                             logic = "and", ignore.case = TRUE))
         Sources Sources.1 Sources.2
 [1,]   FALSE     FALSE     FALSE
 [2,]   FALSE     FALSE     FALSE
 [3,]    TRUE      TRUE      TRUE
 [4,]   FALSE     FALSE     FALSE
 [5,]   FALSE     FALSE     FALSE
 [6,]    TRUE      TRUE      TRUE
 [7,]   FALSE     FALSE     FALSE
 [8,]   FALSE     FALSE     FALSE
 [9,]   FALSE     FALSE     FALSE

The output is a matrix. If you want a data.frame, wrap it in as.data.frame.

as.data.frame(sapply(df, function(x) sapply(x, 
                                            str_contains, 
                                            pattern = c("Meat", "Vegetables"), 
                                            logic = "and", ignore.case = TRUE)))

  Sources Sources.1 Sources.2
1   FALSE     FALSE     FALSE
2   FALSE     FALSE     FALSE
3    TRUE      TRUE      TRUE
4   FALSE     FALSE     FALSE
5   FALSE     FALSE     FALSE
6    TRUE      TRUE      TRUE
7   FALSE     FALSE     FALSE
8   FALSE     FALSE     FALSE
9   FALSE     FALSE     FALSE
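
If only one column needs checking and the result should sit next to it as a new column, a single pass may be enough. This is a sketch under assumptions: the `foods` data frame and the column name `meat_and_veg` are invented for illustration, and `vapply` is swapped in for `sapply` so that the result is guaranteed to be one logical value per row.

```r
library(sjmisc)

# Hypothetical three-row sample standing in for the Sources column.
foods <- data.frame(
  Sources = c("Meat, fish, poultry, vegetables, fruits",
              "Leafy green vegetables and legumes",
              "Widespread in foods"),
  stringsAsFactors = FALSE
)

# Call str_contains once per row; as.character() guards against factor columns.
foods$meat_and_veg <- vapply(
  as.character(foods$Sources),
  function(s) str_contains(s, pattern = c("Meat", "Vegetables"),
                           logic = "and", ignore.case = TRUE),
  logical(1),
  USE.NAMES = FALSE
)

print(foods)
```

Changing `logic` to "or" or "not" should give the other combinations that `str_contains` supports.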
answered 2020-09-26T19:07:08.153