Consider this data frame:
set.seed(123)
# Vectors shorter than 200 are recycled by data.frame() to fill 200 rows
dat1 <- data.frame(Region = rep(c("r1", "r2"), each = 100),
                   State  = rep(c("NY", "MA", "FL", "GA"), each = 10),
                   Loc    = rep(c("a", "b", "c", "d", "e", "f", "g", "h"), each = 5),
                   ID     = rep(1:10, each = 2),
                   var1   = rnorm(200),
                   var2   = rnorm(200),
                   var3   = rnorm(200),
                   var4   = rnorm(200),
                   var5   = rnorm(200),
                   var6   = rnorm(200))
dat1$ID <- factor(dat1$ID)
I am using the mclust package to fit a mixture model and obtain bootstrap standard errors, as follows:
library(tidyverse)
library(mclust)
set.seed(123)
# Fit a 5-component VEI mixture to the var columns (columns 5 to 10)
mod <- Mclust(dat1[, 5:10], G = 5, modelNames = "VEI",
              initialization = list(hcPairs = randomPairs(dat1[, 5:10], seed = 123)))
plot(mod, what = "classification")
plot(mod, what = "density")
# Nonparametric bootstrap; nboot = 25 for now to speed things up
boot <- MclustBootstrap(mod, nboot = 25, type = "bs")
Next, I want to create a new data frame showing the mixture probabilities for each observation, so I combine the original data, dat1, with mod$z.
probs <- mod$z
colnames(probs) <- paste0("Prob", 1:mod$G)
probs <- cbind(dat1, probs)
probs <- cbind(probs, cluster = mod$classification)
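Now I want to collect: 1) the mean of each variable within each cluster, and 2) the standard error of each mean (from summary(boot, what = "se")$mean). I will use these to create the plot below and to return a table showing the mean and SE values.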
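newdat <- cbind(dat1, cluster = mod$classification)
# Mean of each numeric variable within each cluster, in long format
a <- newdat %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::group_by(cluster) %>%
  summarise_all(mean) %>%
  pivot_longer(-c("cluster"), names_to = "Vars", values_to = "mean")
# Bootstrap SEs of the component means: transpose the variables-by-components matrix, then go long
b <- as.data.frame(cbind(cluster = 1:mod$G, t(summary(boot, what = "se")$mean))) %>%
  pivot_longer(-c("cluster"), names_to = "Vars", values_to = "SE")
# Attach SE by row position; this assumes a and b are in the same cluster/variable order
a <- dplyr::mutate(a, "SE" = b$SE)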
ggplot(a, aes(x = Vars, y = mean, group = factor(cluster))) +
  geom_line(aes(colour = factor(cluster))) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - SE, ymax = mean + SE), width = .1) +
  ggtitle("Mean by cluster") +
  expand_limits(y = 0) +
  scale_y_continuous(breaks = 0:20 * 4) +
  labs(x = "Var", y = "Cluster Average") +
  theme_bw() +
  theme(legend.justification = c(1, 0),
        legend.position = c(1, 0))
library(knitr)
kable(a)
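The step above that attaches the SE column with mutate(a, "SE" = b$SE) only works while a and b happen to be in the same row order. A minimal sketch of matching on the cluster and Vars keys instead is shown here. The data frames a_demo and b_demo are made-up stand-ins for a and b, and the sketch assumes the variable names in the two tables agree.

```r
library(dplyr)

# Toy stand-ins for `a` (means) and `b` (SEs); values are invented for illustration.
a_demo <- data.frame(cluster = c(1, 1, 2, 2),
                     Vars    = c("var1", "var2", "var1", "var2"),
                     mean    = c(0.1, -0.2, 0.3, 0.4))

# Same keys as a_demo, deliberately stored in a different row order.
b_demo <- data.frame(cluster = c(2, 2, 1, 1),
                     Vars    = c("var2", "var1", "var2", "var1"),
                     SE      = c(0.04, 0.03, 0.02, 0.01))

# Match rows on the (cluster, Vars) keys rather than on row position.
joined <- left_join(a_demo, b_demo, by = c("cluster", "Vars"))
print(joined)
```

Because the match is on keys, the result would not change with the number of var columns or with G, which is what a general function needs.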
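Ultimately, I want to write a function that performs each of these steps and returns the output all at once. It will be applied to data frames structured like dat1, but with different numbers of var columns. I also need to be able to specify the number of clusters and the model to use when creating mod. What is a better way to collect the information stored in a, so that the same process can be applied inside a function to any combination of variables (var columns) and mixture components (G, as specified in mod)?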
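One possible shape for such a function is sketched below, not as a definitive implementation. The name fit_cluster_summary and its arguments are hypothetical. The sketch assumes the variables to model are exactly the columns whose names start with "var", and that summary(boot, what = "se")$mean is a variables-by-components matrix, as the transposition in the code above implies.

```r
library(dplyr)
library(tidyr)
library(mclust)

# Hypothetical helper (not part of mclust): fit the mixture, bootstrap it,
# and return cluster means with their bootstrap SEs in long format.
fit_cluster_summary <- function(data, G, model_name, nboot = 25, seed = 123) {
  # Assumption: the variables to model are the columns named var*.
  vars <- dplyr::select(data, dplyr::starts_with("var"))

  set.seed(seed)
  mod <- Mclust(vars, G = G, modelNames = model_name,
                initialization = list(hcPairs = randomPairs(vars, seed = seed)))
  boot <- MclustBootstrap(mod, nboot = nboot, type = "bs")

  # Mean of every variable within each hard-classified cluster.
  means <- vars %>%
    mutate(cluster = mod$classification) %>%
    group_by(cluster) %>%
    summarise(across(everything(), mean), .groups = "drop") %>%
    pivot_longer(-cluster, names_to = "Vars", values_to = "mean")

  # Bootstrap SEs: variables in rows, components in columns.
  se_mat <- summary(boot, what = "se")$mean
  # as.vector() unrolls column by column, so all variables of component 1 come first.
  se <- data.frame(cluster = rep(seq_len(ncol(se_mat)), each = nrow(se_mat)),
                   Vars    = rep(names(vars), times = ncol(se_mat)),
                   SE      = as.vector(se_mat))

  # Join on keys so the result does not depend on row order.
  out <- left_join(means, se, by = c("cluster", "Vars"))
  list(model = mod, boot = boot, summary = out)
}
```

Under these assumptions, a call such as res <- fit_cluster_summary(dat1, G = 5, model_name = "VEI") would return the fitted model, the bootstrap object, and a table in res$summary that can be passed to the ggplot and kable calls above in place of a.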