r - Difficulty splitting data evenly?

Question

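I am trying to split data evenly in R. For example, I am working in RStudio with the built-in cars dataset, which has 50 rows. If I want to split the data into two parts, I would do something like this: cars$split <- rep(1:2, each=25). This creates a column named split and assigns 1 to the first 25 rows and 2 to the next 25. However, if I want to split my data into 8 parts (the number is chosen by the user), 50 / 8 = 6.25 does not divide evenly. In that case I would like to use the function above and simply assign the last two rows (since 50 / 8 = 6.25 and 6 * 8 = 48, two rows are left over) to group 8. But I can't do that, because the vector returned by rep has to match the number of rows exactly, so I tried to write some logic around it and ran into this error: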

Error in `$<-.data.frame`(`*tmp*`, "split", value = c(1L, 1L, 1L, 1L,  : replacement has 48 rows, data has 50

Any ideas on how to solve this? My attempt is shown below:

numDataPerSection <- floor(nrow(cars) / userInputNum)
if(nrow(cars) %% userInputNum != 0){
  #If not divisible, assign last few data points to the last number
  cars$split <- rep(1:ncls, each=numDataPerSection, len = nrow(cars) - (nrow(cars) %% userInputNum))
  for(i in nrow(cars) %% userInputNum){
    cars$split[nrow(cars) - i] <- userInputNum 
  }
}
#Everything divides correctly
else{
  cars$split <- rep(1:ncls, each=numDataPerSection)
}

score 0 · Accepted Answer

You could use something like

split1 <- function(n,s){ c( rep(1:s, each=n%/%s), rep(s, n%%s) ) } 
cars$split <- split1(nrow(cars), userInputNum)

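But this is not very balanced: in your example, group 8 is two rows larger than any other group, and it gets worse with 55 rows and 8 sections, where the last group absorbs all 7 leftover rows: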

> split1(50,8)
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7
[39] 7 7 7 7 8 8 8 8 8 8 8 8
> table(split1(50,8))
1 2 3 4 5 6 7 8 
6 6 6 6 6 6 6 8
> table(split1(55,8))
 1  2  3  4  5  6  7  8 
 6  6  6  6  6  6  6 13

You can do better with something like

split2 <- function(n,s){ ((1:n)*s+n-s) %/% n }

which produces

> split2(50,8)
 [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 5 6 6 6 6 6 6
[39] 7 7 7 7 7 7 8 8 8 8 8 8
> table(split2(50,8))
1 2 3 4 5 6 7 8 
7 6 6 6 7 6 6 6 
> table(split2(55,8))
1 2 3 4 5 6 7 8 
7 7 7 7 7 7 7 6
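
As a quick check of the balance claim, the sketch below applies split2 to the cars data with 8 groups and confirms that the group sizes differ by at most one row. The value 8 stands in for userInputNum, and the names n_groups, sizes and parts are illustrative only.

```r
# split2 as defined in the answer above
split2 <- function(n, s) ((1:n) * s + n - s) %/% n

n_groups <- 8                      # stand-in for userInputNum (assumed value)
cars$split <- split2(nrow(cars), n_groups)

sizes <- table(cars$split)
diff(range(sizes)) <= 1            # TRUE: no group is more than one row larger

# Optionally turn the labels into a list of data frames, one per group
parts <- split(cars, cars$split)
length(parts)                      # 8
```

If the pieces are needed separately, split() on the new column is probably the simplest way to get them.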

score 0 · Accepted Answer

How about using a function like this to create the indices?

create.indices <- function(nrows, sections) {
  indices <- rep(1:sections,each=floor(nrows/sections))
  indices <- append(indices, rep(tail(indices, 1), nrows%%sections))
  return(indices)
}

create.indices(50,8)
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 8 8
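
Note that with this approach the last section absorbs the whole remainder, so its size is floor(nrows/sections) + nrows %% sections. A small sketch to check that arithmetic, reusing create.indices from above (the 55-row input and the names sizes and last_expected are just for illustration):

```r
create.indices <- function(nrows, sections) {
  indices <- rep(1:sections, each = floor(nrows / sections))
  # Pad with the last section number for any leftover rows
  append(indices, rep(tail(indices, 1), nrows %% sections))
}

# Count how many rows land in each of the 8 sections
sizes <- tabulate(create.indices(55, 8), nbins = 8)

last_expected <- floor(55 / 8) + 55 %% 8   # 6 + 7 = 13
sizes[8] == last_expected                  # TRUE
```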

score 0 · Accepted Answer

You can use the length.out argument of rep() to create your split column: rep(1:8, length.out = 50, each = round(50/8)). Using round() gives a reasonably even distribution of group sizes:

> table(rep(1:8, length.out = 50, each = round(50/8)))

1 2 3 4 5 6 7 8 
8 6 6 6 6 6 6 6
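
One thing to be aware of with this approach: rep() applies each first and then recycles the result to reach length.out, so the leftover rows are labelled 1 again at the end of the vector rather than joining the last group. The sketch below shows that, along with one possible alternative (a suggestion that is not part of the answer above) that uses rep_len() plus sort() to keep the groups contiguous. The names wrapped and balanced are illustrative.

```r
n <- 50; s <- 8

wrapped <- rep(1:s, each = round(n / s), length.out = n)
tail(wrapped, 4)         # 8 8 1 1: the two extra rows wrap back to group 1

# Alternative sketch: deal the labels out round-robin, then sort them
balanced <- sort(rep_len(1:s, n))
table(balanced)          # sizes 7 7 6 6 6 6 6 6
```

Whether the wrap-around matters depends on whether the groups need to be contiguous blocks of rows.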
