美文网首页单细胞转录组
8、Integrating Datasets

8、Integrating Datasets

作者: 小贝学生信 | 来源:发表于2020-09-29 20:52 被阅读0次

原文链接Chapter 13 Integrating Datasets

1、目的

  • 在实际试验中,不可避免的可能generate data across multiple batches;
  • 正如天下没有两片完全的叶子一样,分批实验总会对数据产生不可避免且不同的影响。different batches is often subject to uncontrollable differences
    -Hence,computational correction of these effects is critical for eliminating batch-to-batch variation

2、准备实验示例数据

使用TENxPBMCData()包提供的多种 10X PBMC data中的两批做示范

BiocManager::install("TENxPBMCData")
library(TENxPBMCData)

all.sce <- list(
  pbmc3k=TENxPBMCData('pbmc3k'),
  pbmc4k=TENxPBMCData('pbmc4k'))

colnames(rowData(all.sce$pbmc3k))
#将gene的不同ID储存到rowData是一种值得借鉴的方法
head((rowData(all.sce$pbmc3k)))
table(is.na(rowData(all.sce$pbmc3k)$Symbol))
#也有1w+个symbol为NA值
  • Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable.
  • 因此分别对两批数据进行前期的处理

注意下面流程中的R语言技巧值得借鉴学习,以及复习下截止目前学习的pipelines

(1)质控

以线粒体基因为示例

library(scater)
stats <- high.mito <- list()
for (n in names(all.sce)) {
  current <- all.sce[[n]]
  is.mito <- grep("MT", rowData(current)$Symbol_TENx)
  stats[[n]] <- perCellQCMetrics(current, subsets=list(Mito=is.mito))
  high.mito[[n]] <- isOutlier(stats[[n]]$subsets_Mito_percent, type="higher")
  all.sce[[n]] <- current[,!high.mito[[n]]]
}

(2)标准化

all.sce <- lapply(all.sce, logNormCounts)

(3)高变基因

library(scran)
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)

(4)降维

library(BiocSingular)
set.seed(10000)
all.sce <- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, 
                  MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), 
                  SIMPLIFY=FALSE)

set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")

set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")

(5)聚类

for (n in names(all.sce)) {
  g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
  clust <- igraph::cluster_walktrap(g)$membership
  all.sce[[n]]$clust  <- factor(clust)
}

(6)整理

pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k

#挑选两个批次相同的基因
universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)
pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k

(7)消除测序深度差异

  • rescale each batch to adjust for differences in sequencing depth between batches(adjusting the size factors)
  • 可以认为先去除批次效应的一个方面
library(batchelor)
rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]

(8)新的hvg

library(scran)
combined.dec <- combineVar(dec3k, dec4k)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)

最终得到的数据是pbmc3kpbmc4k两个processed and batch-normalized sce 以及chosen.hvgs。供下面的批次效应流程处理。

3、检查是否存在批次效应

  • 简单思路:直接合并两个sce,进行整体聚类;再与原始sce的分类结果比较观察。
  • 理想的情况是整体数据的细胞分类平均分到每个batch中,而不是单一、异常的分布。

(1)合并sce对象

rowData(pbmc3k) <- rowData(pbmc4k) #使用4k的rowData
#因为两个原始sce对象的rowData(feature annotation)不完全一致(无法合并)
pbmc3k$batch <- "3k" #增加批次colData annotation
pbmc4k$batch <- "4k"
#直接合并两个sce对象(未校正)
uncorrected <- cbind(pbmc3k, pbmc4k)

#基于直接合并的sce进行聚类
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
                      BSPARAM=BiocSingular::RandomParam())
library(scran)
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership

(2)观察直接合并的聚类与原始sce的聚类有无差异

  • Ideally, each cluster should ideally consist of cells from both batches.
tab <- table(Cluster=clusters, Batch=uncorrected$batch)
tab
3-1
  • 但是从上面结果来看, cells of the same type are artificially separated due to technical differences between batches.
  • 可视化看一看,的确批次效应造成的分离影响很大,需要去除该效应
set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")
3-2

4、去除方法

法一:Linear regression

(1)前提假设
  • assume that the composition of cell subpopulations is the same across batches.
  • assume that the batch effect is additive.
(2)实现方法
  • batchelor包的rescaleBatches()函数
  • roughly equivalent to applying a linear regression to the log-expression values per gene.
set.seed(1010101010) # To ensure reproducibility of IRLBA.
#降维
rescaled <- runPCA(rescaled, subset_row=chosen.hvgs, exprs_values="corrected")
#聚类
snn.gr <- buildSNNGraph(rescaled, use.dimred="PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
#聚类结果在不同batch的分布情况
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled$batch)
tab.resc
  • 如下图结果,可以看到相当部分的批次效应导致的分类偏离已经消除了

  • However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete.(batch effect 未完全消除)


    4-1
  • 再tsne可视化看一看

rescaled <- runTSNE(rescaled, dimred="PCA")
rescaled$batch <- factor(rescaled$batch)
plotTSNE(rescaled, colour_by="batch")
4-2

法二:MNN(Mutual nearest neighbors)【推荐】

  • Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors.
  • 简单来说寻找两个批次中理想情况下不受影响,互相距离最近的点(pairs)。利用这些pairs将两个batch“拉到一起”
  • batchelor包的fastMNN()函数
set.seed(1000101001)
mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs,
                   BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out

assay slot 的 reconstructed 矩阵为两个batch 的调整后的融合矩阵;
reducedDim slot 的 corrected 是融合矩阵的主成分信息,即不需要再进行降维处理。
关于fastMNN()函数的参数--

  • d=:To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components.
  • k=:specifies the number of nearest neighbors to consider when defining MNN pairs.
  • subset.row=一般仅挑选矩阵的hvg代表基因
library(scran)
#聚类
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn

如下图结果

  • We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.
4-3
library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")

mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
  • 与第一种方法相比,两个batch也的确更加紧密的结合在一起了


    4-4

以上是第十三章Integrating Datasets部分的简单流程笔记,主要学习了linear regression与MNN两种整合数据并去除批次效应的两种方法。但整合的同时也要考虑是否去除了数据本身的 biological difference,因此可进一步对整合数据进行诊断,这里暂时不做记录了,详见Chapter 13 Integrating Datasets
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录

相关文章

网友评论

    本文标题:8、Integrating Datasets

    本文链接:https://www.haomeiwen.com/subject/jagjuktx.html