Original 生信技能树 生信技能树 2020-01-09 14:2
Hi大家好,我是老米!生信入门一个月,感谢生信技能树平台。这是我的第3篇Markdown。不知道多少人还记得我的前两个投稿:
现在继续完成生信技能树的作业题:重复一篇WGCNA分析的文章。WGCNA学习教程:一文学会WGCNA分析。
复现的文献:A novel microglial subset plays a key role in myelinogenesis in developing brain。EMBO J,好杂志。
该文章对五个处理组,共17个老鼠:
orange represents neonatal CD11c+ microglia (n = 4),
green neonatal CD11c microglia (n = 4),
blue EAE CD11c+ microglia (n = 3),
purple EAE CD11c microglia (n = 3),
black adult microglia (n = 3).
其实就是 neonatal(新生的) 和 EAE的Microglia,还有CD11C阳性和阴性,然后和成年小鼠的Microglia进行比较。这些样本进行了RNASeq测序,数据在GEO可供下载:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78809。当然我偷了个懒,该文章Supplemental material提供了整理之后的csv矩阵,大概1万3千个基因。
作者进行了WGCNA分析,以此证明小胶质细胞的作用及寻找关键信号通路及基因(嗯,估计这个实验室后续还有大文章出来)。
组学及生信分析的篇幅占了整个文章的一半以上。在此,仅重复文章的Figure3。
其实我跌跌撞撞学了WGCNA半个月,大概了解了这个原理,现在还一知半解,对很多细节摸索了很久,头很大。下面进入正题,请大家一起跟着我头大。
1.数据处理
- rm(list = ls())
- dev.off()
- library(WGCNA)# 偷懒,读取作者给出的附件的表达矩阵
- norm.counts <- read.csv("Dataset_EV2.csv",header = T,sep = ",") ##original counts
- rownames(norm.counts)<-norm.counts[,1]
- norm.counts <- norm.counts[,2:18]
- save(norm.counts,file = "norm.counts.Rdata")
-
- ## 处理表型信息,就是老鼠的分组信息
- a <- colnames(norm.counts)
- library(stringr)
- str_tmp=as.data.frame(str_split(a,"[_]",simplify = T))
- rownames(str_tmp) <- colnames(norm.counts)
- colnames(str_tmp) <- c("group","sampleRep")
- datTraits <- str_tmp
- datTraits$groupNo <- c(rep(1,3),rep(2,4),rep(3,4),rep(4,3),rep(5,3))
- ###数据初步处理完成
-
- ## 筛选中位绝对偏差前75%的基因,取MAD大于0.3
- ## 筛选后会降低运算量,也会失去部分信息
- ## 也可不做筛选,使MAD大于0即可
- norm.counts <- log2(norm.counts) ##注意我在这里进行了log2,因为我发现如果不log的话,出现了个很有意思的现象,请您往下看
- m.mad <- apply(norm.counts,1,mad)
- dataExprMad <- norm.counts[which(m.mad > max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.3)),] #25% = 0.22
- datExpr0 <- as.data.frame(t(dataExprMad))
- ##最终取到8223个基因。
-
- ###########start handling the data
- ##看看有没有为空的值,需要剔除。这里下载的都是作者处理好的,基本没问题。
- gsg <- goodSamplesGenes(datExpr0,verbose = 3)
- gsg$allOK
- if (!gsg$allOK){
- # Optionally, print the gene and sample names that were removed:
- if (sum(!gsg$goodGenes)>0)
- printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes],
- collapse = ", ")));
- if (sum(!gsg$goodSamples)>0)
- printFlush(paste("Removing samples:",
- paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
- # Remove the offending genes and samples from the data:
- datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
- }
- gsg <- goodSamplesGenes(datExpr0,verbose = 3)
- gsg$allOK
-
- ###画个树看看样本有没有离群值!!!
- if(T){
- # Plot a line to show the cut
- sampleTree <- hclust(dist(datExpr0),method = "average")
- par(cex=.6)
- par(mar=c(0,4,2,0))
- plot(sampleTree, main = "Sample clustering to detect outliers", sub = "",
- xlab = "",cex.lab = 1.5, cex.axis =1.5, cex.main=2
- )
- }
-
- ##然后保存数据
- dim(datExpr0) ##17,8222
- datExpr <- datExpr0
- nGenes <- ncol(datExpr)
- nSamples <- nrow(datExpr)
- save(nGenes,nSamples,datExpr,datTraits,file="input.Rdata")
结果如图:

这结果挺好的,符合实验设计,都聚类好了。但是!但是如果不对原始数值进行log2处理,会出现一个outlier老鼠,如图:

看到没,那个新生的NEO1号鼠有点调皮!他超脱于所有老鼠之上,统领其他所有老鼠!这要么是问题少年,要么是智商超群,比如科大少年班的那种!而我大概深究了一下,控制这个离群值的大概是500个基因,哈哈哈。。。
2. MDS分析
专业名词叫多维标度法(Multidimensional Scaling)MDS,是一种维数缩减方法,把高维的数据点映射到一个低维的流形上;同时也是一种可视化方法,实践中通常利用2D或3D的MDS 结果观察(投影后)点的分布和聚集来研究数据的性质。实际上你也可以把它看做是PCA分析,反正都是看样本间的相关性计算距离远近。
说人话叫做:通过基因表达log2组间差异,看看这17个老鼠是否能够分到一个类别,比如成年鼠归到一类,EAE归到一类,新生鼠归到一类。和上面的有点类似,又不完全一样。
参考学习了stat的视频,也参考了他提供的代码。如下:
- rm(list = ls())
- load(file = 'input.Rdata')
- log2.data.matrix <- as.data.frame(t(datExpr))
- ## now create an empty distance matrix
- log2.distance.matrix <- matrix(0,
- nrow=ncol(log2.data.matrix),
- ncol=ncol(log2.data.matrix),
- dimnames=list(colnames(log2.data.matrix),
- colnames(log2.data.matrix)))
- ## now compute the distance matrix using avg(absolute value(log2(FC)))
- for(i in 1:ncol(log2.distance.matrix)) {
- for(j in 1:i) {
- log2.distance.matrix[i, j] <-
- mean(abs(log2.data.matrix[,i] - log2.data.matrix[,j]))
- }
- }
- ## do the MDS math (this is basically eigen value decomposition)
- ## cmdscale() is the function for "Classical Multi-Dimensional Scalign"
- mds.stuff <- cmdscale(as.dist(log2.distance.matrix),
- eig=TRUE,
- x.ret=TRUE)
- save(mds.stuff,file = "step7_MDS.Rdata")
- ## calculate the percentage of variation that each MDS axis accounts for...
- mds.var.per <- round(mds.stuff$eig/sum(mds.stuff$eig)*100, 1)
- ## now make a fancy looking plot that shows the MDS axes and the variation:
- mds.values <- mds.stuff$points
- mds.data <- data.frame(Sample=rownames(mds.values),
- X=mds.values[,1],
- Y=mds.values[,2])
- ## 然后画图
- library(ggpubr)
- library(ggplot2)
- groups <- datTraits$group
- png("figures/step7_MDS_logfc.png",width = 800,height=600)
- ggplot(mds.data, aes(x=X, y=Y, label=Sample,col=groups)) +
- geom_point(size = 10, alpha = 0.8) +
- theme(panel.grid.minor = element_blank()) +
- coord_fixed() + theme_bw()+
- ggtitle("MDS plot using avg(logFC) as the distance")+
- xlab(paste("Leading logFC dim1 ", mds.var.per[1], "%", sep="")) +
- ylab(paste("Leading logFC dim2 ", mds.var.per[2], "%", sep="")) +
- ggtitle("MDS plot using avg(logFC) as the distance")
- dev.off()
结果如下:
嗯,,,结果表示完美!别问我为什么要先做这个图,因为它好看,我忍不住!
点评:老米在这里开始秀技术了,实际上大家随便跑一下PCA分析即可,暂时无需深入写这么长代码!
下面正式进入WGCNA分析。
3. 计算beta值
这个没啥好讲的,代码如下:
- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- dim(datExpr)
- #################################################
- if(T){
- powers = c(c(1:10), seq(from = 12, to=30, by=2))
- # Call the network topology analysis function
- sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
- png("figures/step2-beta-value.png",width = 800,height = 600)
- # Plot the results:
- ##sizeGrWindow(9, 5)
- par(mfrow = c(1,2));
- cex1 = 0.9;
- # Scale-free topology fit index as a function of the soft-thresholding power
- plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
- xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
- main = paste("Scale independence"));
- text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
- labels=powers,cex=cex1,col="red");
- # this line corresponds to using an R^2 cut-off of h
- abline(h=0.9,col="red")
- # Mean connectivity as a function of the soft-thresholding power
- plot(sft$fitIndices[,1], sft$fitIndices[,5],
- xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
- main = paste("Mean connectivity"))
- text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
- dev.off()
- }
- sft$powerEstimate ## beta=22 SCI文章里面用了20
- save(sft,file = "step2_beta_value.Rdata")

4. 采用一步法构建加权共表达网络
(weight co-expression network)
- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- load(file = "step2_beta_value.Rdata")
- enableWGCNAThreads()
- if(T){
- net = blockwiseModules(
- datExpr,
- power = sft$powerEstimate,
- maxBlockSize = nGenes,
- TOMType = "unsigned", minModuleSize = 30,
- reassignThreshold = 0, mergeCutHeight = 0.28, ## 注意这个0.28
- numericLabels = TRUE, pamRespectsDendro = FALSE,
- saveTOMs = F,
- verbose = 3
- )
- table(net$colors)
- }
-
- ##模块可视化,画那个传说中的树
- if(T){
- # Convert labels to colors for plotting
- moduleColors=labels2colors(net$colors)
- table(moduleColors)
- # Plot the dendrogram and the module colors underneath
- png("figures/step3-genes-modules.png",width = 800,height = 600)
- plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
- "Module colors",
- dendroLabels = FALSE, hang = 0.03,
- addGuide = TRUE, guideHang = 0.05)
- dev.off()
- }
- save(net,moduleColors,file = "step3_genes_modules.Rdata")
结果如下:

这里需要停顿一下:SCI文章只取到8个模块,图注说用了12691个基因(被我发现全部加起来也才9999,没错!9999个基因)。事实上,能够分这么少模块,我不知道作者参数细节。有两个参数特别影响模块大小:
一个是minModuleSize,表示每个模块里面最少放多少个基因,这很好理解,设定越大,模块越少;
另一个是mergeCutHeight,这个参数表示你在哪里砍树。值设定越小,树枝越多,通常是0.25。
google一下,发现这么一个说法:
I am not aware of a principle from which
这里,我的mergeCutHeight = 0.28,最终取到13个模块!
5. 模块与基因与表型
这是关键,其实就是开始解读这些关联的基因群对老鼠表型的影响,比如敲除CDC11之后的老鼠,哪些信号通路激活或抑制,EAE的老鼠哪些基因激活?等等等。




- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- load(file = "step2_beta_value.Rdata")
- load(file = "step3_genes_modules.Rdata")
-
- if(T){
- nGenes = ncol(datExpr)
- nSamples = nrow(datExpr)
- design <- model.matrix(~0+datTraits$group)
- colnames(design)= levels(datTraits$group) ## get the group
- MES0 <- moduleEigengenes(datExpr,moduleColors)$eigengenes
- MEs = orderMEs(MES0)
- moduleTraitCor <- cor(MEs,design,use = "p")
- moduleTraitPvalue <- corPvalueStudent(moduleTraitCor,nSamples)
- textMatrix = paste(signif(moduleTraitCor,2),"\n(",
- signif(moduleTraitPvalue,1),")",sep = "")
- dim(textMatrix)=dim(moduleTraitCor)
-
- png("figures/step4-Module-trait-relationship.png",width = 800,height = 1200,res = 120)
- par(mar=c(6, 8.5, 3, 3))
- labeledHeatmap(Matrix = moduleTraitCor,
- xLabels = colnames(design),
- yLabels = names(MEs),
- ySymbols = names(MEs),
- colorLabels = FALSE,
- colors = blueWhiteRed(50),
- textMatrix = textMatrix,
- setStdMargins = FALSE,
- cex.text = 0.5,
- zlim = c(-1,1),
- main = paste("Module-trait relationships"))
- dev.off()
- save(design,file = "step4_design.Rdata")
- }




通过模块与各种表型的相关系数,可发现自己感兴趣的模块。如图:

可以看到有些模块与成年鼠关联,有些与EAE相关性很高。和原SCI的图有点像。
就是上面的图转化成bar图。
- if(T){
- mes_group <- merge(MEs,datTraits,by="row.names") # 小技巧,可以通过rowname进行merge
- ##写了个画图的function
- library(gplots)
- library(ggplot2)
- library(ggpubr)
- library(grid)
- library(gridExtra)
- draw_ggboxplot <- function(data,gene="P53",group="group"){
- #print(gene)
- ggboxplot(data,x=group, y=gene,
- ylab = sprintf("Expression of %s",gene),
- xlab = group,
- fill = group,
- palette = "jco",
- #add="jitter",
- legend = "") +stat_compare_means()
- }
- ###开始批量画boxplot
- colorNames = names(MEs)
- png("figures/step4-expression-group.png",width = 800,height=2000,res = 120)
- #par(mfrow=c(ceiling(length(colorNames)/2),2))
- p <- lapply(colorNames,function(x) {
- draw_ggboxplot(mes_group,gene= x,group="group")
- })
- do.call(grid.arrange,c(p,ncol=2))
- dev.off()
- }

mes_group <- merge(MEs,datTraits,by="row.names") # 小技巧,可以通过rowname进行merge

部分结果如下:

- if(T){
- modNames = substring(names(MEs), 3)
- geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"))
- ## 算出每个模块跟基因的皮尔森相关系数矩阵
- ## MEs是每个模块在每个样本里面的值
- ## datExpr是每个基因在每个样本的表达量
- MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
- names(geneModuleMembership) = paste("MM", modNames, sep="")
- names(MMPvalue) = paste("p.MM", modNames, sep="")
-
- geneTraitSignificance <- as.data.frame(cor(datExpr,datTraits$groupNo,use = "p"))
- GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
- names(geneTraitSignificance)<- paste("GS.",names(datTraits$group),sep = "")
- names(GSPvalue)<-paste("GS.",names(datTraits$group),sep = "")
-
- #selectModule<-c("blue","green","purple","grey") ##可以选择自己喜欢的模块
- selectModule <- modNames ## 也可以批量作图
- png("figures/step4-Module-trait-significance.png",width = 800,height=2000,res = 120)
- par(mfrow=c(ceiling(length(selectModule)/2),2)) #批量作图开始
- for(module in selectModule){
- column <- match(module,selectModule)
- print(module)
- moduleGenes <- moduleColors==module
- verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
- abs(geneTraitSignificance[moduleGenes, 1]),
- xlab = paste("Module Membership in", module, "module"),
- ylab = paste("Gene significance for", module, "module"),
- main = paste("Module membership vs. gene significance\n"),
- cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
- }
- dev.off()
- }





截取部分如下:

5.4 关注感兴趣的模块,导出基因进行GO分析



- rm(list = ls())
- library(WGCNA)
- load(file = 'input.Rdata')
- load(file = "step2_beta_value.Rdata")
- load(file = "step3_genes_modules.Rdata")
- load(file = "step4_design.Rdata")
- load(file="step4_datTraits.Rdata")
-
- table(moduleColors)
- group_g <- data.frame(gene=colnames(datExpr),
- group=moduleColors)
- write.csv(group_g,file = "figures/group_g.csv") ## 导出了对应模块所有基因
-
- # 选择mouse的基因组进行注释及ID转化啥的,如果是人的,另有R包
- if(!require("clusterProfiler")) BiocManager::install("clusterProfiler",ask = F,update = F)
- library(clusterProfiler)
- if(!require("org.Mm.eg.db")) BiocManager::install("org.Mm.eg.db",ask = F,update = F)
- library(org.Mm.eg.db)
- tmp <- bitr(group_g$gene,fromType = "ENSEMBL",
- toType = "ENTREZID",
- OrgDb = "org.Mm.eg.db")
- de_gene_cluster <- merge(tmp,group_g,by.x="ENSEMBL",by.y="gene")
- table(de_gene_cluster$group)
-
- ###run go analysis
- formula_res <- compareCluster(
- ENTREZID~group,
- data = de_gene_cluster,
- fun = "enrichGO",
- OrgDb = "org.Mm.eg.db",
- style="margin: 0px; padding: 0px; font-size: inherit; line-height: inherit; color: rgb(80, 161, 79); overflow-wrap: inherit !important; word-break: inherit !important;">"BP",
- pAdjustMethod = "BH",
- pvalueCutoff = 0.01,
- qvalueCutoff = 0.05
- )
-
- lineage1_ego <-simplify(
- formula_res,
- cutoff=0.5,
- by="p.adjust",
- select_fun=min
- )
- save(group_g,formula_res,lineage1_ego,file="step5_GOananlysis.Rdata")
- #出图
- pdf("figures/step5_Microglia_GO_term_DE.pdf",width = 15,height = 15)
- dotplot(lineage1_ego,showCategory=10)
- dev.off()
- write.csv(lineage1_ego@compareClusterResult,
- file="figures/Microglia_GO_term_DE.csv")

GO结果分析如下:

通过与上面的几个图表配合,发现EAE鼠与炎症通路/免疫相关,而新生鼠的胶质细胞则是cell cycle/胶质细胞分化/神经细胞发育/轴树突形成通路相关。与原SCI分析相符,与预期相符。
- rm(list = ls())
- library(WGCNA)
- load(file = 'input.Rdata')
- load(file = "step2_beta_value.Rdata")
- load(file = "step3_genes_modules.Rdata")
- load(file = "step4_design.Rdata")
- load(file="step4_datTraits.Rdata")
- load(file="step5_GOananlysis.Rdata")
- # 主要是可视化 TOM矩阵,WGCNA的标准配图
- # 然后可视化不同 模块 的相关性 热图
- # 不同模块的层次聚类图
- # 还有模块诊断,主要是 intramodular connectivity
- if(T){
- #geneTree = net$dendrograms[[1]]
- TOM=TOMsimilarityFromExpr(datExpr,power=20)
- dissTOM=1-TOM
- #plotTOM = dissTOM^7
- #diag(plotTOM)=NA
- #TOMplot(plotTOM,geneTree,moduleColors,main="Network heapmap plot of all genes")
- ### 我这里只取了1000个基因哈,我试了一下全部基因,结果跑了半个小时没跑完,被我强行退出!
- nSelect =1000
- set.seed(20)
- select=sample(nGenes,size = nSelect)
- selectTOM = dissTOM[select,select]
- selectTree = hclust(as.dist(selectTOM),method = "average")
- selectColors = moduleColors[select]
- plotDiss=selectTOM^7
- diag(plotDiss)=NA
- png("figures/step6_select_Network-heatmap.png",width = 800,height=600)
- TOMplot(plotDiss,selectTree,selectColors,main="Network heapmap of selected gene")
- dev.off()
- }
如下:

5.6 提取指定模块的基因并做热图
- if(T){
- module="turquoise"
- which.module=module
- dat=datExpr[,moduleColors==which.module]
- library(pheatmap)
- pheatmap(dat,show_colnames = F,show_rownames = F)
- n=scale(t(dat+1)) # 'scale'可以对log-ratio数值进行归一化
- n[n>2]=2
- n[n< -2]= -2
- n[1:4,1:4]
- pheatmap(n,show_colnames =F,show_rownames = F)
- group_list=datTraits$group
- ac=data.frame(g=group_list)
- rownames(ac)=colnames(n)
- png("figures/step6-moduleGene-heatmap.png",width = 800,height = 600)
- pheatmap(n,show_colnames =F,show_rownames = F,annotation_col =ac )
- dev.off()
-
- }

都挺好的。
点评:这个热图有问题,具体代码是 n=scale(t(dat+1)) 里面少了一个转置!
- if(T){
- ## 连续性的变量,如体重等才好和模块进行比较。
- MEs=moduleEigengenes(datExpr,moduleColors)$eigengenes
- MET = orderMEs(cbind(MEs,datTraits$groupNo))
- par(cex = 0.9)
- png("figures/step6-Eigengene-dendrogram.png",width = 800,height = 600)
- plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle
- = 90)
- dev.off()
- }
如图:

6. 模块导出
感兴趣的模块导出到cytoscape,VisANT等软件进一步进行可视化分析。
- #######gene export for VisANT or cytoscape
- if(T){
-
- module="turquoise"
- probes = colnames(datExpr)
- inModule = (moduleColors==module)
- modProbes=probes[inModule]
- head(modProbes)
- modTOM = TOM[inModule,inModule]
- dimnames(modTOM)=list(modProbes,modProbes)
- ### 这里只选了top100的基因
- nTop=100
- IMConn = softConnectivity(datExpr[,modProbes])
- top=(rank(-IMConn)<=nTop)
- filterTOM=modTOM[top,top]
- # for visANT
- vis = exportNetworkToVisANT(filterTOM,file = paste("figures/visANTinput-",module,".txt",sep = ""),
- weighted = T,threshold = 0)
-
- # for cytoscape
- cyt = exportNetworkToCytoscape(filterTOM,
- edgeFile = paste("figures/CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
- nodeFile = paste("figures/CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
- weighted = TRUE,
- threshold = 0.02,
- nodeNames = modProbes[top],
- nodeAttr = moduleColors[inModule][top])
- }
总结
至此,复现结束。如果你仔细的看到了这里,说明你是想学WGCNA的了。回到课题本身,是否对小胶质细胞的认识更近了一步?提个小小的科学问题:在PD中,是否有一个胶质细胞亚群可对MPP+诱导的黑质神经细胞死亡有抵抗和保护作用?哈哈,别说我是搞肿瘤的了。。。
这些内容,换个思路应该够一个硕士毕业了。。。
- rm(list = ls())
- dev.off()
- library(WGCNA)
- # 偷懒,读取作者给出的附件的表达矩阵
- getwd()
- file="G:/papers_for_ARDS/r_scripts_for_ARDS_From_zhongda/lncRNA_mRNA_cicleRNA/microglial cells/EMBJ-36-3292-s002/embj201696056-sup-0003-DatasetEV1"
- dir.create(file)
- setwd(file)
- getwd()
- ??read.csv
- norm.counts <- openxlsx::read.xlsx("Dataset_EV1.xlsx",
- sheet = 2) ##original counts
- head(norm.counts)
- rownames(norm.counts)<-norm.counts[,1]
- norm.counts <- norm.counts[,2:18]
- head(norm.counts)
- save(norm.counts,file = "norm.counts.Rdata")
- load("norm.counts.Rdata")
-
- ## 处理表型信息,就是老鼠的分组信息
- a <- colnames(norm.counts)
- library(stringr)
- str_split(a,"[,]",simplify = T)
- strsplit("UK, USA, Germany", ",(?=[^,]+$)", perl=TRUE)
- str_split(a, ",(?=[^,]+$)",simplify = T)
-
- #以最后一个逗号为分割符号 切割字符串
- str_tmp=as.data.frame(str_split(a, ",(?=[^,]+$)",simplify = T))
- rownames(str_tmp) <- colnames(norm.counts)
- colnames(str_tmp) <- c("group","sampleRep")
- str_tmp
- datTraits <- str_tmp
- datTraits$groupNo <- c(rep(1,3),rep(2,4),rep(3,4),rep(4,3),rep(5,3))
- datTraits
- ###数据初步处理完成
-
- ## 筛选中位绝对偏差前75%的基因,取MAD大于0.3
- ## 筛选后会降低运算量,也会失去部分信息
- ## 也可不做筛选,使MAD大于0即可
- head(norm.counts)
- norm.counts <- log2(norm.counts) ##注意我在这里进行了log2,因为我发现如果不log的话,出现了个很有意思的现象,请您往下看
- m.mad <- apply(norm.counts,1,mad)
- dataExprMad <- norm.counts[which(m.mad > max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.3)),] #25% = 0.22
- datExpr0 <- as.data.frame(t(dataExprMad))
- dim(datExpr0)
- head(datExpr0)[1:10,1:10]
- ##最终取到8222个基因。
-
-
-
- ###########start handling the data
- ##看看有没有为空的值,需要剔除。这里下载的都是作者处理好的,基本没问题。
- gsg <- goodSamplesGenes(datExpr0,verbose = 3)
- gsg$allOK
- if (!gsg$allOK){
- # Optionally, print the gene and sample names that were removed:
- if (sum(!gsg$goodGenes)>0)
- printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes],
- collapse = ", ")));
- if (sum(!gsg$goodSamples)>0)
- printFlush(paste("Removing samples:",
- paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
- # Remove the offending genes and samples from the data:
- datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
- }
- gsg <- goodSamplesGenes(datExpr0,verbose = 3)
- gsg$allOK
-
-
- ###画个树看看样本有没有离群值!!!
- if(T){
- # Plot a line to show the cut
- sampleTree <- hclust(dist(datExpr0),method = "average")
- par(cex=.6)
- par(mar=c(0,4,2,0))
- plot(sampleTree, main = "Sample clustering to detect outliers", sub = "",
- xlab = "",cex.lab = 1.5, cex.axis =1.5, cex.main=2
- )
- }
-
- ##然后保存数据
- dim(datExpr0) ##17,8222
- datExpr <- datExpr0
- nGenes <- ncol(datExpr)
- nSamples <- nrow(datExpr)
- save(nGenes,nSamples,datExpr,datTraits,file="input.Rdata")
-
-
-
-
- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- dim(datExpr)
- head(datTraits)
- #################################################
- if(T){
- powers = c(c(1:10), seq(from = 12, to=30, by=2))
- # Call the network topology analysis function
- sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
-
- dev.off()
- getwd()
- dir.create("./figures")
- png("figures/step2-beta-value.png",width = 800,height = 600)
- # Plot the results:
- ##sizeGrWindow(9, 5)
- par(mfrow = c(1,2));
- cex1 = 0.9;
- # Scale-free topology fit index as a function of the soft-thresholding power
- plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
- xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
- main = paste("Scale independence"));
- text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
- labels=powers,cex=cex1,col="red");
- # this line corresponds to using an R^2 cut-off of h
- abline(h=0.9,col="red")
- # Mean connectivity as a function of the soft-thresholding power
- plot(sft$fitIndices[,1], sft$fitIndices[,5],
- xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
- main = paste("Mean connectivity"))
- text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
- dev.off()
- }
- sft$powerEstimate ## beta=22 SCI文章里面用了20
- save(sft,file = "step2_beta_value.Rdata")
-
-
-
-
-
- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- load(file = "step2_beta_value.Rdata")
- enableWGCNAThreads()
- if(T){
- net = blockwiseModules(
- datExpr,
- power = sft$powerEstimate,
- maxBlockSize = nGenes,
- TOMType = "unsigned", minModuleSize = 30,
- reassignThreshold = 0, mergeCutHeight = 0.28, ## 注意这个0.28
- numericLabels = TRUE, pamRespectsDendro = FALSE,
- saveTOMs = F,
- verbose = 3
- )
- table(net$colors)
- }
-
- ##模块可视化,画那个传说中的树
- if(T){
- # Convert labels to colors for plotting
- moduleColors=labels2colors(net$colors)
- table(moduleColors)
- # Plot the dendrogram and the module colors underneath
- png("figures/step3-genes-modules.png",width = 800,height = 600)
- plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
- "Module colors",
- dendroLabels = FALSE, hang = 0.03,
- addGuide = TRUE, guideHang = 0.05)
- dev.off()
- }
- save(net,moduleColors,file = "step3_genes_modules.Rdata")
-
-
-
- rm(list = ls())
- library(WGCNA)
- load(file = "input.Rdata")
- load(file = "step2_beta_value.Rdata")
- load(file = "step3_genes_modules.Rdata")
-
- if(T){
- nGenes = ncol(datExpr)
- nSamples = nrow(datExpr)
- datTraits
- design <- model.matrix(~0+datTraits$group)
- colnames(design)= levels(as.factor(datTraits$group)) ## get the group
- design
- MES0 <- moduleEigengenes(datExpr,moduleColors)$eigengenes
- head(MES0)
- MEs = orderMEs(MES0)
- head(MEs)
- head(design)
- moduleTraitCor <- cor(MEs,design,use = "p")
- head(moduleTraitCor)
- moduleTraitPvalue <- corPvalueStudent(moduleTraitCor,nSamples)
- textMatrix = paste(signif(moduleTraitCor,2),"\n(",
- signif(moduleTraitPvalue,1),")",sep = "")
- dim(textMatrix)=dim(moduleTraitCor)
-
- textMatrix
-
- png("figures/step4-Module-trait-relationship.png",width = 800,height = 1200,res = 120)
- par(mar=c(6, 8.5, 3, 3))
- labeledHeatmap(Matrix = moduleTraitCor,
- xLabels = colnames(design),
- yLabels = names(MEs),
- ySymbols = names(MEs),
- colorLabels = FALSE,
- colors = blueWhiteRed(50),
- textMatrix = textMatrix,
- setStdMargins = FALSE,
- cex.text = 0.5,
- zlim = c(-1,1),
- main = paste("Module-trait relationships"))
- dev.off()
- save(design,file = "step4_design.Rdata")
- }
-
-
-
- if(T){
- mes_group <- merge(MEs,datTraits,by="row.names") # 小技巧,可以通过rowname进行merge
- ##写了个画图的function
- library(gplots)
- library(ggplot2)
- library(ggpubr)
- library(grid)
- library(gridExtra)
- draw_ggboxplot <- function(data,gene="P53",group="group"){
- #print(gene)
- ggboxplot(data,x=group, y=gene,
- ylab = sprintf("Expression of %s",gene),
- xlab = group,
- fill = group,
- palette = "jco",
- #add="jitter",
- legend = "") +stat_compare_means()
- }
- ###开始批量画boxplot
- colorNames = names(MEs)
- png("figures/step4-expression-group.png",width = 800,height=2000,res = 120)
- #par(mfrow=c(ceiling(length(colorNames)/2),2))
- p <- lapply(colorNames,function(x) {
- draw_ggboxplot(mes_group,gene= x,group="group")
- })
- do.call(grid.arrange,c(p,ncol=2))
- dev.off()
- }
-
-
-
- if(T){
- modNames = substring(names(MEs), 3)
- modNames
- head(MEs)
- head(datExpr)[1:6:1:9]
- geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"))
- ## 算出每个模块跟基因的皮尔森相关系数矩阵
- ## MEs是每个模块在每个样本里面的值
- ## datExpr是每个基因在每个样本的表达量
- MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
- names(geneModuleMembership) = paste("MM", modNames, sep="")
- names(MMPvalue) = paste("p.MM", modNames, sep="")
- head(MMPvalue)
-
- geneTraitSignificance <- as.data.frame(cor(datExpr,datTraits$groupNo,use = "p"))
- head(geneTraitSignificance)
- GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
- head(GSPvalue)
- names(geneTraitSignificance)<- paste("GS.",names(datTraits$group),sep = "")
- names(GSPvalue)<-paste("GS.",names(datTraits$group),sep = "")
- head(GSPvalue)
-
- #selectModule<-c("blue","green","purple","grey") ##可以选择自己喜欢的模块
- selectModule <- modNames ## 也可以批量作图
- png("figures/step4-Module-trait-significance.png",width = 800,height=2000,res = 120)
- par(mfrow=c(ceiling(length(selectModule)/2),2)) #批量作图开始
- for(module in selectModule){
- column <- match(module,selectModule)
- print(module)
- moduleGenes <- moduleColors==module
- verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
- abs(geneTraitSignificance[moduleGenes, 1]),
- xlab = paste("Module Membership in", module, "module"),
- ylab = paste("Gene significance for", module, "module"),
- main = paste("Module membership vs. gene significance\n"),
- cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
- }
- dev.off()
- }
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-