r - 捕获在列中使用“to”指示的范围 - tidycensus（美国人口普查 API）

Question

我如何在一个列中捕获所有年龄，其中一组值为“20 至 24 岁”，另一组为“22 至 24 岁”？这将使我能够确认我在tidycensus（R 包）美国人口普查 API 查询中捕获了所有工作年龄（18-64 岁）变量名称。

目标

我想要的是，对于这个例子中的 20-24 岁，一个从标签条目中提取年龄的数据框，例如"22 to 24 years"：

MEN  WOMEN ETHNORACE
18   18    BLACK
19   19    BLACK
20   20    BLACK
21   21    BLACK
22   22    BLACK
23   23    BLACK
24         BLACK

然后，我可以轻松地创建一个包含所有年龄的数据框并进行比较以查看是否缺少任何数据。

人口普查变量（tidycensus）

可以在https://api.census.gov/data/2019/acs/acs5/variables.html看到，至少美国人口普查局的美国社区调查 (ACS) 具有不同语法的年龄范围字段（例如“20年”和“22 至 24 年”）：

tidycensus 包的 load_variables 函数中的示例行

tidycensus R 包 1.1 版

## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)

v19     <- load_variables(2019, "acs5", cache = TRUE)
v19 %>% 
  dplyr::filter(
    str_detect(label, "18|20|24") & 
               concept %in% c("SEX BY AGE",
                              "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
               grepl('FEMALE', toupper(label))
                )

v19_Total_AndBlack_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, "18|20|24") & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE', toupper(label)))

 print(v19_Total_AndBlack_Age18_24)

  name        label                                      concept                                     
  <chr>       <chr>                                      <chr>                                       
1 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
2 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
3 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...

在此示例中，我想确保从18到 24 岁的总人口和黑人人口的每个年龄都存在于如下所示的数据框中 - 请注意使用上面的人口普查 API 名称v19_Total_AndBlack_Age18_24。

v19_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
 )

请注意，Total的“ 22到 24 年”与Black的“ 20到 24 年”相比。

让我们关注v19_Total_AndBlack_Age18_24上面的数据框，它列出了 18 - 24 岁的人口普查 API 名称和标签，旨在确认所有年份都存在。

我可以通过以下方式使用正则表达式获取年龄中的所有数字：

unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"

但是我按类别分组的尝试失败了，当“to”这个词出现在“20 到 24”时，我仍然需要得到一个跨越年龄范围的向量。

v19_Total_AndBlack_Age18_24_grp <- 
  v19_Total_AndBlack_Age18_24 %>% 
   mutate(EthnoRace = case_when(
   grepl('BLACK', concept) ~ "BLACK",
   TRUE ~ "TOTAL"))

v19_Total_AndBlack_Age18_24_grp %>% 
  group_by(EthnoRace) %>% 
  mutate(ages = str_extract_all(label, "\\d{2"))

错误

Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".

score 1 · Accepted Answer

我倾向于分两步进行。第一步指定元数据文件中的一些特征。第二步将元数据应用于问题。

看起来你需要近似一些，因为水平不完全符合你的界限。例如，“15 至 19 岁”跨越 18。

（为了简单起见，我假设您永远不想保持“20 到 24 岁”的水平，但排除“20 到 24岁”的水平。

# Step 1a: create a list of poential age labels
v19 |> 
  dplyr::mutate(
    concept_age = grepl(pattern = "AGE$", concept),       # Concept must end with "AGE"
  ) |> 
  dplyr::filter(concept_age) |> 
  tidyr::separate_rows(label, sep = "!!")  |>             # Isolate the different dimensions of a variable
  dplyr::rename(level = label) |> 
  dplyr::mutate(
    level_year  = grepl(pattern = "\\byears?\\b", level), # Label must contain "year" or "years"
  ) |> 
  dplyr::filter(level_year) |> 
  dplyr::count(level, name = "variable_count") |>         # Reduce to the unique (overlapping) age levels
  dplyr::mutate(
    desired = TRUE                                        # Create variable to manually toggle in Step 2
  ) |> 
  dplyr::arrange(level) |>                                # Careful this is still a string, so "26" precedes "3"
  # View()
  readr::write_csv(path_metadata_age_label)

# Step 1b: Manual edit the 78 `desired` values in the csv & save.

第 1 步输出：

level,variable_count,desired
10 to 14 years,2,TRUE
12 to 14 years,5,TRUE
12 to 17 years,5,TRUE
15 to 17 years,45,TRUE
15 to 19 years old,6,TRUE
...

# Step 2a: Read your metadata, retaining only the desired age levels.
pattern_age <-
  path_metadata_age_label |> 
  readr::read_csv() |> 
  dplyr::filter(desired) |> 
  dplyr::mutate(
    level = paste0("\\b", level, "\\b")   # Text starts & stops with a word boundary
  ) |> 
  dplyr::pull(level) |> 
  paste(collapse = "|")

# Step 2b: Apply the age levels to `v19`
v19 |> 
  dplyr::mutate(
     keep = grepl(pattern_age, label, perl = T)
  ) |> 
  dplyr::filter(keep)

第 2 步输出：

   name       label                                    concept    keep 
   <chr>      <chr>                                    <chr>      <lgl>
 1 B01001_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE TRUE 
 2 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE TRUE 
 3 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE TRUE 
 4 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE TRUE 
...

score 0 · Accepted Answer

首先，人们想要哪些年龄、性别和种族群体的数据？这可以修改为只选择一种性别。Gender_var需要出现在https://api.census.gov/data/2019/acs/acs5/variables.html的标签列中（或者可以在创建load_variables()数据框时使用如下所示v19）。

参数

只需设置这些代码即可运行。

min_age_desired <- 18
max_age_desired <- 24
Gender_var = c("MALE", "FEMALE")
EthnoRace_var = c("BLACK","TOTAL")

现在，让我们创建一个包含我们需要的所有种族和年龄组的 QC 数据框。

加载 R 包

library(arsenal)
library(dplyr)
library(stringr)
library(tidyr)
library(tidycensus)

options(scipen = 8)

验证数据框

拥有人口普查 API 中想要的所有年龄和种族群体

AGE_var  = as.numeric(seq(min_age_desired, max_age_desired, 1))
all_grp_qc_frm <- 
  data.frame(
    ## dupli
    expand.grid(EthnoRace_var,
                Gender_var,
                AGE_var
          )
  )

colnames(all_grp_qc_frm) <- 
  c("EthnoRace", "Gender", "AGE")

all_grp_qc_frm$AGE <- as.numeric(
  all_grp_qc_frm$AGE)
all_grp_qc_frm$EthnoRace <- as.character(
  all_grp_qc_frm$EthnoRace)
all_grp_qc_frm$Gender <- as.character(
  all_grp_qc_frm$Gender)

all_grp_qc_frm <- all_grp_qc_frm %>% 
  arrange(EthnoRace,Gender,AGE)

print(all_grp_qc_frm)

   EthnoRace Gender AGE
1      BLACK   MALE  18
2      BLACK   MALE  19
3      BLACK   MALE  20
4      BLACK   MALE  21
5      BLACK   MALE  22
6      BLACK   MALE  23
7      BLACK   MALE  24
8      BLACK FEMALE  18
9      BLACK FEMALE  19
10     BLACK FEMALE  20
11     BLACK FEMALE  21
12     BLACK FEMALE  22
13     BLACK FEMALE  23
14     BLACK FEMALE  24
15     TOTAL   MALE  18
16     TOTAL   MALE  19
17     TOTAL   MALE  20
18     TOTAL   MALE  21
19     TOTAL   MALE  22
20     TOTAL   MALE  23
21     TOTAL   MALE  24
22     TOTAL FEMALE  18
23     TOTAL FEMALE  19
24     TOTAL FEMALE  20
25     TOTAL FEMALE  21
26     TOTAL FEMALE  22
27     TOTAL FEMALE  23
28     TOTAL FEMALE  24

使用 tidycensus 加载 Census 变量

这是 2019 年美国社区调查的 5 年估计

v19     <- load_variables(2019, "acs5", cache = TRUE)

将这些变量设置为所需的子集

人口普查 API 提供了许多变量。

作为子集，我们首先获取一个向量，每个年龄从 18 到 24 用管道分隔。

working_age_vec <- paste0(seq(18,24,1), collapse = "|")

请注意，我需要使用正确的概念值来获得黑人和跨种族群体的总人口。

v19_Total_And_EthnoRace_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, working_age_vec) & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE|MALE', toupper(label)))

print(v19_Total_And_EthnoRace_Age18_24)

# A tibble: 12 x 3
   name        label                                      concept                                     
   <chr>       <chr>                                      <chr>                                       
 1 B01001_007  Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE                                  
 2 B01001_008  Estimate!!Total:!!Male:!!20 years          SEX BY AGE                                  
 3 B01001_009  Estimate!!Total:!!Male:!!21 years          SEX BY AGE                                  
 4 B01001_010  Estimate!!Total:!!Male:!!22 to 24 years    SEX BY AGE                                  
 5 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
 6 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
 7 B01001_033  Estimate!!Total:!!Female:!!21 years        SEX BY AGE                                  
 8 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
 9 B01001B_007 Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
10 B01001B_008 Estimate!!Total:!!Male:!!20 to 24 years    SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
11 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
12 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)

使用 Census API 提取这些变量

Census_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
  )

获取类似的概念值和类似SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) 的标签值Estimate!!Total:!!Male:!!10 to 14 years。

Census_Total_AndBlack_Age18_24 <- left_join(
  Census_Total_AndBlack_Age18_24, 
  v19 %>% 
    select(name, concept, label) %>% 
    rename(variable = name)
 )

正则表达式提取+序列

用于提取年龄并创建一个向量的正则表达式，该向量表示每个范围内从最低年龄到最高年龄的序列。

Census_Total_AndBlack_Age18_24_grp <- 
  Census_Total_AndBlack_Age18_24 %>%
  distinct(label, concept) %>% 
    ## regular expression to extract all the numbers in labels like
    ## Estimate!!Total:!!Male:!!5 to 9 years
    mutate(ages = sapply(str_extract_all(label,"\\d{2}"),
                            function(x) paste(x,collapse=""))) %>% 
    mutate(start = str_sub(ages, 1, 2),
             end = str_sub(ages, 3, 4)) %>% 
    mutate(
      start = case_when(
            is.na(start) ~ "99",
            TRUE ~ start),
      end = case_when(
            is.na(end) ~ "99",
            TRUE ~ end)) %>% 
    dplyr::filter(grepl('Female|Male', label)) %>% 
    mutate(Gender = case_when(
      grepl('Female', label) ~ "FEMALE",
      grepl('Male', label) ~ "MALE",
      TRUE ~ "MISSING")) %>% 
    mutate(EthnoRace = case_when(
      grepl('BLACK', concept) ~ "BLACK",
      TRUE ~ "TOTAL")) %>% 
    mutate(end = case_when(
      is.na(end) | end == "" ~ start,
      TRUE ~ end))
    
    Census_Total_AndBlack_Age18_24_grp_sum <- Census_Total_AndBlack_Age18_24_grp %>% 
      group_by(EthnoRace, Gender) %>% 
      summarize(AGE = as.numeric(unlist(purrr::map2(start, end, `:`)))) %>% 
      ungroup() %>% 
      distinct(EthnoRace, Gender, AGE)

质检比较

请记住，所需的种族群体、性别群体和年龄打印在此解决方案的顶部。

arsenal::comparedf(all_grp_qc_frm, Census_Total_AndBlack_Age18_24_grp_sum)

[...]

Not shared: 0 variables and 0 observations.

Differences found in 0/3 variables compared.
0 variables compared have non-identical attributes.