原文链接Chapter 13 Integrating Datasets
1、目的
- 在实际试验中,不可避免的可能generate data across multiple batches;
- 正如天下没有两片完全的叶子一样,分批实验总会对数据产生不可避免且不同的影响。different batches is often subject to uncontrollable differences
-Hence,computational correction of these effects is critical for eliminating batch-to-batch variation
2、准备实验示例数据
使用TENxPBMCData()
包提供的多种 10X PBMC data中的两批做示范
BiocManager::install("TENxPBMCData")
library(TENxPBMCData)
all.sce <- list(
pbmc3k=TENxPBMCData('pbmc3k'),
pbmc4k=TENxPBMCData('pbmc4k'))
colnames(rowData(all.sce$pbmc3k))
#将gene的不同ID储存到rowData是一种值得借鉴的方法
head((rowData(all.sce$pbmc3k)))
table(is.na(rowData(all.sce$pbmc3k)$Symbol))
#也有1w+个symbol为NA值
- Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable.
- 因此分别对两批数据进行前期的处理
注意下面流程中的R语言技巧值得借鉴学习,以及复习下截止目前学习的pipelines
(1)质控
以线粒体基因为示例
library(scater)
stats <- high.mito <- list()
for (n in names(all.sce)) {
current <- all.sce[[n]]
is.mito <- grep("MT", rowData(current)$Symbol_TENx)
stats[[n]] <- perCellQCMetrics(current, subsets=list(Mito=is.mito))
high.mito[[n]] <- isOutlier(stats[[n]]$subsets_Mito_percent, type="higher")
all.sce[[n]] <- current[,!high.mito[[n]]]
}
(2)标准化
all.sce <- lapply(all.sce, logNormCounts)
(3)高变基因
library(scran)
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)
(4)降维
library(BiocSingular)
set.seed(10000)
all.sce <- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs,
MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()),
SIMPLIFY=FALSE)
set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")
set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")
(5)聚类
for (n in names(all.sce)) {
g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
clust <- igraph::cluster_walktrap(g)$membership
all.sce[[n]]$clust <- factor(clust)
}
(6)整理
pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k
#挑选两个批次相同的基因
universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)
pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k
(7)消除测序深度差异
- rescale each batch to adjust for differences in sequencing depth between batches(adjusting the size factors)
- 可以认为先去除批次效应的一个方面
library(batchelor)
rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]
(8)新的hvg
library(scran)
combined.dec <- combineVar(dec3k, dec4k)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)
最终得到的数据是
pbmc3k
、pbmc4k
两个processed and batch-normalized sce 以及chosen.hvgs
。供下面的批次效应流程处理。
3、检查是否存在批次效应
- 简单思路:直接合并两个sce,进行整体聚类;再与原始sce的分类结果比较观察。
- 理想的情况是整体数据的细胞分类平均分到每个batch中,而不是单一、异常的分布。
(1)合并sce对象
rowData(pbmc3k) <- rowData(pbmc4k) #使用4k的rowData
#因为两个原始sce对象的rowData(feature annotation)不完全一致(无法合并)
pbmc3k$batch <- "3k" #增加批次colData annotation
pbmc4k$batch <- "4k"
#直接合并两个sce对象(未校正)
uncorrected <- cbind(pbmc3k, pbmc4k)
#基于直接合并的sce进行聚类
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam())
library(scran)
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership
(2)观察直接合并的聚类与原始sce的聚类有无差异
- Ideally, each cluster should ideally consist of cells from both batches.
tab <- table(Cluster=clusters, Batch=uncorrected$batch)
tab

- 但是从上面结果来看, cells of the same type are artificially separated due to technical differences between batches.
- 可视化看一看,的确批次效应造成的分离影响很大,需要去除该效应
set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")

4、去除方法
法一:Linear regression
(1)前提假设
- assume that the composition of cell subpopulations is the same across batches.
- assume that the batch effect is additive.
(2)实现方法
- batchelor包的
rescaleBatches()
函数 - roughly equivalent to applying a linear regression to the log-expression values per gene.
set.seed(1010101010) # To ensure reproducibility of IRLBA.
#降维
rescaled <- runPCA(rescaled, subset_row=chosen.hvgs, exprs_values="corrected")
#聚类
snn.gr <- buildSNNGraph(rescaled, use.dimred="PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
#聚类结果在不同batch的分布情况
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled$batch)
tab.resc
-
如下图结果,可以看到相当部分的批次效应导致的分类偏离已经消除了
-
However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete.(batch effect 未完全消除)
4-1
-
再tsne可视化看一看
rescaled <- runTSNE(rescaled, dimred="PCA")
rescaled$batch <- factor(rescaled$batch)
plotTSNE(rescaled, colour_by="batch")

法二:MNN(Mutual nearest neighbors)【推荐】
- Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors.
- 简单来说寻找两个批次中理想情况下不受影响,互相距离最近的点(pairs)。利用这些pairs将两个batch“拉到一起”
- batchelor包的
fastMNN()
函数
set.seed(1000101001)
mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
assay slot 的 reconstructed 矩阵为两个batch 的调整后的融合矩阵;
reducedDim slot 的 corrected 是融合矩阵的主成分信息,即不需要再进行降维处理。
关于fastMNN()
函数的参数--
-
d=
:To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. -
k=
:specifies the number of nearest neighbors to consider when defining MNN pairs. -
subset.row=
一般仅挑选矩阵的hvg代表基因
library(scran)
#聚类
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn
如下图结果
- We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.

library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
-
与第一种方法相比,两个batch也的确更加紧密的结合在一起了
4-4
以上是第十三章Integrating Datasets部分的简单流程笔记,主要学习了linear regression与MNN两种整合数据并去除批次效应的两种方法。但整合的同时也要考虑是否去除了数据本身的 biological difference,因此可进一步对整合数据进行诊断,这里暂时不做记录了,详见Chapter 13 Integrating Datasets
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录。
网友评论