r - Banded variables in R

Question

Given the following code:

x <- data.frame(year = c(1730, 1860, 1941, 2011))

century_bands <- data.frame(min_year = c(1700, 1800, 1900, 2000),
                            max_year = c(1799, 1899, 1999, 2099),
                            century_name = c("18th", "19th", "20th", "21st"))

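For each value in x, I want to work out the name of the century it belongs to, using the information in century_bands. I can't imagine this is hard to do, but I can't figure it out. Can anyone help? Is there a way to do this with the dplyr package (which I use a lot), or with some other technique?

This is a very simplified version of a real-life case in which the bands do not fall neatly in steps of 100 years, so unfortunately any shortcut based on dividing the year by 100 and the like won't work.

Thank you.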

score 3 · Accepted Answer

One option, using fuzzyjoin, could be:

library(fuzzyjoin)

# Keep every row of x; match where min_year <= year <= max_year
fuzzy_left_join(x, century_bands, 
                by = c("year" = "min_year",
                       "year" = "max_year"),
                match_fun = list(`>=`, `<=`)) 

  year min_year max_year century_name
1 1730     1700     1799         18th
2 1860     1800     1899         19th
3 1941     1900     1999         20th
4 2011     2000     2099         21st
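Since the question asks about dplyr specifically, the same range join can also be sketched with dplyr alone. This is not part of the answer above: it assumes dplyr 1.1.0 or later, where `join_by()` accepts a `between()` helper, and the result name `res` is only illustrative.

```r
library(dplyr)

x <- data.frame(year = c(1730, 1860, 1941, 2011))
century_bands <- data.frame(min_year = c(1700, 1800, 1900, 2000),
                            max_year = c(1799, 1899, 1999, 2099),
                            century_name = c("18th", "19th", "20th", "21st"),
                            stringsAsFactors = FALSE)

# Left join on min_year <= year <= max_year (both bounds inclusive by default).
# Assumes dplyr >= 1.1.0 for join_by() and its between() helper.
res <- left_join(x, century_bands,
                 by = join_by(between(year, min_year, max_year)))

res$century_name
# "18th" "19th" "20th" "21st"
```

As with any left join, a year that falls in no band would be kept with NA in the band columns.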

score 3 · Accepted Answer

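Here are some approaches.

1) sqldf In SQL, joins can be made on complex conditions. The between syntax matches if year is greater than or equal to the lower bound and less than or equal to the upper bound. If a particular year has no match, the left join fills in NA (although this does not occur in the question's example).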

library(sqldf)
sqldf("select year, century_name from x
  left join century_bands on year between min_year and max_year")
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

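2) findInterval This approach uses only base R. For each component of its first argument, findInterval returns the number of values in its second argument that are less than or equal to it. The second argument is assumed to be sorted in ascending order. The number returned by findInterval can then be used to index into century_name. findInterval tends to be quite efficient.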

transform(x, year_name = 
  with(century_bands, century_name[findInterval(year, min_year)]))
##   year year_name
## 1 1730      18th
## 2 1860      19th
## 3 1941      20th
## 4 2011      21st
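To make the counting behaviour concrete, here is a small sketch of what findInterval returns on its own, including values on and outside the band boundaries. The variable names `inside` and `edges` are only illustrative.

```r
min_year <- c(1700, 1800, 1900, 2000)

# Each result is the number of min_year values <= the given year
inside <- findInterval(c(1730, 1860, 1941, 2011), min_year)
inside
# 1 2 3 4

# Below the first band gives 0; a lower bound counts as part of its own band;
# anything past the last lower bound is assigned to the last band
edges <- findInterval(c(1699, 1800, 2150), min_year)
edges
# 0 2 4
```

Note that 2150 still maps to the last band, because only the lower bounds are consulted.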

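Although it does not arise in the question, if year could lie outside all the bands, we could handle that without changing the code by adding extra rows to century_bands associated with an NA century_name. Otherwise, we could extend findInterval as follows: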

FindInterval <- function(x, vec, upper) {
  ifelse(x < vec[1] | x > upper, NA, findInterval(x, vec))
}
transform(x, year_name = 
  with(century_bands, century_name[FindInterval(year, min_year, max(max_year))]))
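The findInterval approach consults only min_year, so a year that falls in a gap between two bands would be assigned to the lower band. Because the question says the real bands are not tidy, the following sketch also checks max_year. The function `band_name` and the data frame `gappy_bands` are hypothetical, made up for this illustration.

```r
# Hypothetical bands with a gap between 1751 and 1799
gappy_bands <- data.frame(min_year = c(1700, 1800),
                          max_year = c(1750, 1899),
                          century_name = c("early 18th", "19th"),
                          stringsAsFactors = FALSE)

# Returns the band name, or NA if the year is below, above or between bands.
# Assumes bands are sorted by min_year and do not overlap.
band_name <- function(year, bands) {
  idx <- findInterval(year, bands$min_year)
  idx[idx == 0] <- NA                          # below the first band
  inside <- !is.na(idx) & year <= bands$max_year[idx]
  ifelse(inside, as.character(bands$century_name[idx]), NA_character_)
}

result <- band_name(c(1650, 1730, 1760, 1860, 1950), gappy_bands)
result
# NA "early 18th" NA "19th" NA
```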

If dplyr is being used anyway, we could replace transform with mutate; otherwise, using transform avoids that dependency.

3) sapply Another base solution is

Match <- function(x) with(century_bands, century_name[x >= min_year & x <= max_year])
transform(x, century_name = sapply(year, Match))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

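This should be sufficient if every year falls within a band. That is the case for the example in the question, but if it cannot be guaranteed, extend Match as follows: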

Match <- function(x) {
  Name <- with(century_bands, century_name[x >= min_year & x <= max_year])
  if (length(Name)) Name else NA
}

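4) cut This base solution is similar to findInterval but returns NA if year is not in any band.

transform(x, year_name = with(century_bands, century_name[
    # right = FALSE makes each band [min, next min), so 1800 maps to "19th"
    cut(year, c(min_year, max(max_year)), labels = FALSE,
        right = FALSE, include.lowest = TRUE)
]))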
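The boundary behaviour of cut is easy to get wrong, so here is a small sketch of the indices it returns. It assumes `right = FALSE`, so that each interval is closed on the left and a lower bound such as 1800 belongs to the band it starts; with that setting, `include.lowest = TRUE` closes the final interval on the right. The names `breaks` and `idx` are only illustrative.

```r
# Lower bounds of each band plus the upper bound of the last band
breaks <- c(1700, 1800, 1900, 2000, 2099)

idx <- cut(c(1650, 1730, 1800, 2099, 2150), breaks,
           labels = FALSE, right = FALSE, include.lowest = TRUE)
idx
# NA 1 2 4 NA
```

With the default `right = TRUE`, the intervals would instead be closed on the right and 1800 would land in the first band.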

5) car::recode This function allows values to be recoded as follows.

library(car)

recodes <- 
  "1700:1799='18th'; 1800:1899='19th'; 1900:1999='20th'; 2000:2099='21st'; else=NA"
transform(x, year_name = recode(year, recodes))
##   year year_name
## 1 1730      18th
## 2 1860      19th
## 3 1941      20th
## 4 2011      21st

To avoid hard-coding the recodes string, it can be derived from century_bands like this:

recodes <- with(century_bands, 
  paste(sprintf("%d:%d='%s'", min_year, max_year, century_name), collapse = ";")
)
recodes <- paste0(recodes, "; else=NA")

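6) Expand the bands We can expand the bands into individual years, in which case we can simply match on year. Years that do not match any band result in an NA century_name.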

century_bands2 <- with(century_bands, 
  stack(setNames(Map(seq, min_year, max_year), century_name)))
transform(x, century_name = with(century_bands2, ind[match(year, values)]))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

6a) A tidyverse variation would be mostly along these lines:

library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

century_bands2 <- century_bands %>%
  { set_names(map2(.$min_year, .$max_year, seq), .$century_name) %>%
    as_tibble %>%
    pivot_longer(everything(), names_to = "century_name", values_to = "year")
  }

x %>% left_join(century_bands2, by = "year")
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

7) case_when We can hard-code the band definitions into a case_when:

library(dplyr)

x %>% mutate(century_name = case_when(
    year %in% 1700:1799 ~ "18th",
    year %in% 1800:1899 ~ "19th",
    year %in% 1900:1999 ~ "20th",
    year %in% 2000:2099 ~ "21st",
    TRUE ~ NA_character_))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

7a) Another way to express the case_when is:

x %>% mutate(century_name = case_when(
    year < 1700 ~ NA_character_,
    year < 1800 ~ "18th",
    year < 1900 ~ "19th",
    year < 2000 ~ "20th",
    year < 2100 ~ "21st",
    TRUE ~ NA_character_))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

score 2 · Accepted Answer

Since the max_year column seems to be redundant, you could also simply do the following:

century_bands[colSums(sapply(x$year, function(x) `>=`(x, century_bands$min_year))), 3]
# [1] "18th" "19th" "20th" "21st"
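The one-liner above counts, for each year, how many min_year values it has reached, and uses that count as a row index. A sketch of the same idea with the intermediate steps spelled out, using outer instead of sapply (the names `at_or_after` and `band_index` are only illustrative):

```r
x <- data.frame(year = c(1730, 1860, 1941, 2011))
century_bands <- data.frame(min_year = c(1700, 1800, 1900, 2000),
                            max_year = c(1799, 1899, 1999, 2099),
                            century_name = c("18th", "19th", "20th", "21st"),
                            stringsAsFactors = FALSE)

# Rows are bands, columns are years; TRUE where min_year <= year
at_or_after <- outer(century_bands$min_year, x$year, "<=")

# Number of band starts at or before each year, i.e. the band's row number
band_index <- colSums(at_or_after)
band_index
# 1 2 3 4

century_bands$century_name[band_index]
# "18th" "19th" "20th" "21st"
```

Like findInterval, this relies on min_year being sorted in ascending order and on the bands having no gaps, because max_year is never checked.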
