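I want to classify some data in R in a particular way.
Suppose I have the following data (a vector of 5000 integers drawn from 1 to 500):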
> data = sample(1:500, 5000, replace = TRUE)
To classify this data, I create the following classes:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
    (0,10]   (10,20]   (20,30]   (30,40]   (40,50]
       102        80        87       113       117
   (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
       101        89        95       106       104
 (100,200] (200,350] (350,480] (480,500]
      1002      1492      1318       194
If I want 0 to be included, I only need to add include.lowest = TRUE:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
    [0,10]   (10,20]   (20,30]   (30,40]   (40,50]
       102        80        87       113       117
   (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
       101        89        95       106       104
 (100,200] (200,350] (350,480] (480,500]
      1002      1492      1318       194
In this example it makes no difference, because 0 does not occur in the data at all. If it did, say 4 times, the class [0,10] would contain 106 elements instead of 102:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
    [0,10]   (10,20]   (20,30]   (30,40]   (40,50]
       106        80        87       113       117
   (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
       101        89        95       106       104
 (100,200] (200,350] (350,480] (480,500]
      1002      1492      1318       194
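To make the effect of include.lowest easier to see, here is a minimal sketch on a tiny made-up vector that does contain 0. The names x, breaks, open_first and closed_first are only for illustration and are not part of the data above.

```r
# Made-up vector containing 0 (an assumption for illustration only)
x <- c(0, 0, 5, 10, 15, 20)
breaks <- c(0, 10, 20)

# Default (right = TRUE): classes are (0,10] and (10,20], so 0 becomes NA
open_first <- cut(x, breaks = breaks)
table(open_first, useNA = "ifany")

# include.lowest = TRUE closes the first class: [0,10] and (10,20]
closed_first <- cut(x, breaks = breaks, include.lowest = TRUE)
table(closed_first, useNA = "ifany")
```

Without include.lowest the two zeros are counted as NA; with it they fall into [0,10].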
There is another way to change the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE, you get:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
    [0,10)   [10,20)   [20,30)   [30,40)   [40,50)
        92        81        87       111       118
   [50,60)   [60,70)   [70,80)   [80,90)  [90,100)
       103        89        94       103       103
 [100,200) [200,350) [350,480) [480,500]
      1003      1497      1320       199
include.lowest now effectively acts as "include.highest". The price is that the class limits change, so some classes contain a different number of members.
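A small sketch of this behaviour, again on a made-up vector (x, breaks, a and b are illustrative names): with right = FALSE the classes are closed on the left, and include.lowest = TRUE then closes the last class on the right.

```r
# Made-up values sitting exactly on the breaks (illustration only)
x <- c(0, 5, 10, 20)
breaks <- c(0, 10, 20)

# right = FALSE: classes are [0,10) and [10,20); 20 falls outside and is NA
a <- cut(x, breaks = breaks, right = FALSE)
levels(a)

# Adding include.lowest = TRUE closes the LAST class: [0,10) and [10,20]
b <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
levels(b)
```

Note that the value 10 moves from the first class to the second compared with the default right = TRUE, which is why the counts above differ.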
But what if I want the data classified like this
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
    (0,10]   (10,20]   (20,30]   (30,40]   (40,50]
       102        80        87       113       117
   (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
       101        89        95       106       104
 (100,200] (200,350] (350,480] (480,500)
      1002      1492      1318       194
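so that 500 is excluded as well? What should I do?
Of course, one could say: "Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), since you are dealing with integers."
True, but what if that is not the case and I am using floating-point numbers instead? How do I exclude 500 then?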
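To make the floating-point case concrete, here is a small sketch with made-up values (y, with_500 and with_499 are illustrative names). It only shows the problem and is not a solution.

```r
# Made-up floating-point values near the upper limit (illustration only)
y <- c(480.5, 499.5, 500)

# Upper break 500: 500 itself lands in (480,500]
with_500 <- cut(y, breaks = c(480, 500))
with_500

# Upper break 499: 500 is excluded, but 499.5 is lost as well (both NA)
with_499 <- cut(y, breaks = c(480, 499))
with_499
```

So lowering the last break to 499 only works for integers; with floats it also drops values between 499 and 500.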