MuSiC: BulkRNA Deconvolution tool using scRNA-seq data
๐ Wang, Xuran, et al. โBulk tissue cell type deconvolution with multi-subject single-cell expression reference.โ Nature communications 10.1 (2019): 380.
MUlti-Subject SIngle Cell deconvolution (MuSiC)
Deconvolution ์ด๋..
Bulk RNA seq data ๋ถ์์ ํ ๋ฐฉ๋ฒ์ค ํ๋๋ก. ๋ชจ๋ cell type์ด ๊ตฌ๋ถ์์ด ๋ค ํผ์ฌ๋์ด์๋ bulkRNA ๋ฐ์ดํฐ์์ ์ฌ๋ฌ cell type์ ๋น์จ์ ์ถ์ ํ๊ฑฐ๋, ์ฌ๋ฌ ์ํ๋ค์์ ๊ฐ cell type์ ์ ์ ์ ๋ฐํ ํจํด์ ๋ถ๋ฆฌํด๋ด๊ธฐ ์ํด ์ฌ์ฉ๋๋ ๋ถ์ ๋ฐฉ๋ฒ์ด๋ค.
๋ณดํต Deconvolution ๊ณผ์ ์๋ 2๊ฐ์ง input์ด ํ์ํ๋ฐ,..
- Reference profile
- ์์๋ด๊ณ ์ํ๋ cell type์ ์์ํ ์ ์ ์ ๋ฐํ ํ๋กํ์ผ. ์์ ์๋ FACs(Fluorescence-Activated Cell Sorting) ๊ฐ์ ๊ฑธ๋ก ๋ถ๋ฆฌ๋ ์ธํฌ๊ตฐ์์ ์ป์ด์ง bulkRNA-seq data ๋๋ microarray data๋ก ๋ง๋ค์ด์ง database๋ฅผ ์ฌ์ฉํ๋ค. ์์ฆ์ single-cell data์ป๊ธฐ๊ฐ ๋ ์ฌ์์ ธ์ ์ด๋ฅผ ํ์ฉํ ํด์ด ๋ง์ด ๋์ค๊ณ ์๋ ์ถ์ธ๋ค. like MuSiC!!
- Query profile (Mixed)
- cell type proportion์ ์์๋ด๊ณ ์ ํ๋ Bulk RNA-seq data. ์ฌ๋ฌ cell type์ด ํผํฉ๋ผ ์๋ ์ํ์์ ์ป์ ๋ฐ์ดํฐ.
์ด๋ ๊ฒ ๋๊ฐ์ง data๋ฅผ ๊ฐ๊ณ deconvolution tool์ ์ด ๋์ ๋น๊ตํ๋ฉด์ Query data๋ด์ cell type ๋น์จ์ ์ถ์ ํ๋ค. ์ด ๊ณผ์ ์ค์ ์ด๋ค ๋ฐฉ๋ฒ๋ก ์ ์ฌ์ฉํ ๊ฒ์ด๋์ ๋ฐ๋ผ์ tool์ ์์ธก๋ ฅ๊ฐ์๊ฒ ์ฐจ์ด๋๊ฒ ๋๋๊ฒ. ๋ณดํต Linear regression, Machine learning, Matrix fractorization๊ฐ์ ๋ฐฉ๋ฒ์ ์ฌ์ฉํ๋ค.
MuSiC์..
wieghted non-negtive least squares regression(W-NNLS) ๋ฅผ ์ฌ์ฉํ ๋ฐฉ๋ฒ์ผ๋ก linear regression์ ์ผ์ข ์ ์ฌ์ฉํ๋ค. ๊ฐ๋ตํ ์ค๋ช ํ๋ฉด ์ฐ๋ฆฌ๊ฐ ์์๋ด๊ณ ์ ํ๋ cell type์ ๋น์จ์ ์ถ์ ํ๋๋ฐ ๋ฐฉ์ ์์ ์ฌ์ฉํ๊ณ ๊ฐ ํด๊ฐ ์์๊ฐ ์๋ ๊ฒ์ด๋ผ๊ณ ๊ฐ์ ํ๊ณ ์ด๋ฅผ ํ์ด๊ฐ๋ ๊ฒ์ด๋ค. ์ ์๊ฐํด ๋ณด๋ฉด ๋น์ฐํ ๊ฒ. cell type์ ๋น์จ์ด ์๋ฌด๋ฆฌ ์์๋ 0์ด์ง ์์๊ฐ ๋ ์ ์์ผ๋๊ป..
MuSic์ ํน์ฅ์ ?์ โmarker gene consistencyโ๋ผ๊ณ ํจ. ์ง๊ธ๊น์ง ๋์จ CIBERSORT๋ (CIBERSORx ์๋) BSEQ-sc, TIMBER๊ฐ์ ํด๋ค์ pre-selected cell type-specific marker gene์ ์ฌ์ฉํด์๋ค๋ฉด, MuSiC์ ํด๋ด์์ subject
์ง๊ธ๊น์ง์ ํด๋ค์ cross-subject heterogeneity์ with-in cell type stocahsticitiy๋ฅผ ๋ฌด์ํ๊ณ ๊ทธ๋ฅ ํ๊ท ์ ์ผ๋ก ๊ฐ cell type์ marker๋ก ์ฌ์ฉ๋๋ ์ ์ ์๋ค์ ์ฌ์ฉํ๋ค๋ฉด, MuSic์ cross-subject and cross-cell consistency๋ฅผ ๊ณ ๋ คํ gene์ weight๋ฅผ ์ค์ deconvolutionํ๋ค.
๋ฌด์จ ๋ง์ด๋๋ฉด, ์ง๊ธ๊น์ง์ ํด๋ค์ single-cell data๋ฅผ reference๋ก ์ฌ์ฉํด์ bulkRNA๋ฅผ deconvolutionํ๋ค๊ณ ํด๋,.. ๊ทธ๋ฅ scRNA data์์ T cell cluster ํน์ด์ ์ผ๋ก logFC๋์ ์ ์ ์๋ฅผ ๊ณจ๋ผ์ marker๋ก ์ฌ์ฉํ๊ฑฐ๋ ๋๋ ์คํ์ ์ผ๋ก ๋ง์ด ์๋ ค์ ธ ์๋ surface protein marker๋ฅผ ๊ธฐ์ค์ผ๋ก ์ด๋ค์ ๋ฐํ์ ๋ฐ๋ผ์ deconvolution์ ์งํํ๋๋ฐ.. MuSiC์ด ๊ผฌ์ง๋ ์ด๋ฐ ๋ฐฉ๋ฒ์ ๋ฌธ์ ๋ reference๋ก ์ฌ์ฉํ๋ ๋ฐ์ดํฐ๊ฐ ํ์๋ค์ phenotype์ ๋ฐ๋ผ์ ๊ฐ์ cell type์ด๋๋ผ๋ subject๊ฐ์ ๋ฐํ ์ฐจ์ด๊ฐ ์์ ์ ์๊ณ (heterogeneity) ๋๋ ๊ฐ์ cell type๋ด์์ stocahsticity์ ์ํด์ ์ด๋ค ์ ์ ์๊ฐ ๋ฐํ์ด ๋ง๊ฑฐ๋/์ ๊ฒ ์บก์ณ๋์ ์ ์๋๋ฐ ์ด๋ฐ ๊ฒ๋ค์ ๊ณ ๋ คํ์ง ์๊ณ ๋ฐ์ง์๋ฉด noise๋ฅผ ๊ณ์ฐ์ ์ง์ด ๋ฃ์๋ค๋ ๊ฒ์ด๋ค.
์๋ฅผ ๋ค์ด ์ ์ ์ A๊ฐ T cell population์ marker๋ก ์กํ๋๋ฐ ์๊ณ ๋ดค๋๋ ํน์ ์งํ์ ์๋ ํ์์ Tcell์์ ๊ฐ๋ ฅํ๊ฒ ๋ฐํ๋๋ ์ ์ ์์๋ค. ์ด๋ฐ ์ ์ ์๋ค์ cross-subject consistency๋ฅผ ๋ณธ๋ค๋ฉด ํ๋ฝ. ๋น์ทํ ๊ฒฝ์ฐ๋ก ํ์ phenotype์ผ๋ก ๋๋ ์ ์์ง๋ง stocahsticity์ ์ํด์ ๋ช๋ช cell์์ ํน์ ์ ์ ์ B์ ๋ฐํ์ด ๋๊ฒ ๋์๋๋ฐ population์ ์ฌ์ด์ฆ๊ฐ ์ ์ด์ ๊ทธ๋งํผ๋ง์ผ๋ก๋ B๋ฅผ marker๋ก ๋ฝํ๊ฒ ํ ์ ์์๋ค๋ฉด?..
MuSiC์ cross-subject and cross-cell consistency ๋ฅผ ์ ์งํ๋ marker gene์ ์ฐพ์๋ด๊ธฐ ์ด์ ์, collinearity๋ฅผ ์ํด tree-guided ๋จ๊ณ๋ฅผ ์ถ๊ฐํ๋ค. ๋ณ๊ฑด ์๋๊ณ hierarchical clusteringํด์ ์ป์ ์ ์๋ ๊ฒฐ๊ณผ๋ฅผ ์ฌ์ฉ. ๋ด๊ฐ input์ผ๋ก ์ฃผ์ด์ง ๋ชจ๋ cell type์ ๋ํด์ clustering์ ๋จผ์ ์งํดํด์ ์ด cell type๋ค์ ๋ ํฐ ๋ฉ์ด๋ฆฌ๋ก ๋๋๊ณ ๊ทธ ์์์ ๋จผ์ cluster-consistent genes
์ ์ฐพ๊ณ ํฐ cluster๋ณ๋ก ๊ทธ ์์์ ๋ค์ intra-cell type variance๋ฅผ ๋ณด์ด๋ ์ ์ ์๋ฅผ ์ฐพ์์ deconvolution์ ์ด์ฉํ๊ฒ ๋๋ค. ๋ง์ฝ ํ์ํ๋ค๋ฉด recursiveํ๊ฒ ์ด ๊ณผ์ ์ ๋ฐ๋ณตํ๋ค.
TL;DR
Subject๊ฐ์์๋ ์ฐจ์ด๋ฅผ ๋ณด์ด์ง ์๊ณ constantly cell type์ ๋ํํ๋ ์ ์ ์๋ค์ weight์ ์ฃผ์ด์ deconvolution์ ์ฌ์ฉํ๋ ํด์ด๋ค.
์ค๋น๋ฌผ
- Bulk RNA-seq data (Query) : raw count ์ฌ์ฉ ๊ถ์ฅ
- scRNA-seq data (Reference) : cell types๋ ์ด๋ฏธ annotated ๋ ์ํ, raw UMI counts matrix
SingleCellExperiment
object๋ฅผ input์ผ๋ก ๋ฐ์ ๐ Seurat v5 to SCE ํฌ์คํธ ์ฐธ์กฐ
Installation
1
devtools::install_github('xuranw/MuSiC')
๐จ v.1.0.0 ๋ณ ๋ฌธ์ ์์ด ์ค์น ๋๊ธดํ๋๋ฐ ๋๋ฆฌ๋ ์ค bug๊ฐ ์๋ ๊ฒ ๊ฐ์์ music_prop()
function ๋ถ๋ถ ์ฝ๋ github์์ ๋ฐ์ ๊ฐ์ธ์ ์ผ๋ก ์์ ํ ์ฌ์ฉ. ์๋๋ด์ฉ ์ฐธ์กฐ.
01. Data Preparations
์์ ๋งํ๋ฏ์ด MuSic์ single-cell data๋ฅผ SCE ํฌ๋งท์ผ๋ก ๋ฐ๊ธฐ ๋๋ฌธ์ ๋ณ๊ฒฝํด์ ์ค๋นํด์ค๋ค.
1
2
3
4
5
6
7
8
GEX[["RNA"]] <- as(object = GEX[["RNA"]], Class = "Assay")
saveRDS(GEX,"GEX.v3.rds")
conda activate Seurat4
library(Seurat)
GEX <- readRDS("GEX.v3.rds")
test <- as.SingleCellExperiment(GEX)
saveRDS(test,"GEX.SCE.rds")
02. Estimation of cell type proportions
1
2
3
library(MuSiC)
library(reshape)
library(cowplot)
music_prop() functionโs parameters
bulk.eset
: ExpressionSet of bulk data; rownames = genename, colnames = sampleIDsc.eset
: ExpressionSet of annotated single cell data;markers
: vector or list of gene names, default as NULL. If NULL, then use all genes that overlapping in bulk and single-cell dataset;clusters
: character, must be one of the column names of the phenoData from single cell data, used as clusters;samples
: character, must be one of the column names of the phenoData from single cell data, used as samples;select.ct
: vector of cell types included for deconvolution, default as NULL. If NULL, then use all cell types that provided by single-cell dataset.verbose
: logical that toggles log messages.
bulk.eset
(bulkRNA-seq expression matrix)๋ ๊ฐ์ ์์ ๋ค์ด์์ผ๋ฉด ๋์ค์ ์๋ฌ๋๋ค.
๋ ผ๋ฌธ์์๋ raw count ๊ฐ์ ์ฌ์ฉํ๋ ๊ฒ์ ๊ถ์ฅํ๊ณ , TPM์ผ๋ก๋ ๊ฒฐ๊ณผ๊ฐ ์๋์จ๋ค๊ณ ๋์ด ์์. FPKM์ ๋๋ค๊ณ ํจ.
์๋ํด๋ณธ ๊ฒฐ๊ณผ, CUF๋ก๋ ๋์๊ฐ์ง๋ง logCPM์ผ๋ก๋ ์ ๋๋ก ๋์๊ฐ์ง ์๋๋ค.
bulk.eset
์์ ์ํ๋ณ๋ก ๋ฐํ๋๋ ์ ์ ์์ ๊ฐฏ์์ ์ด ์ ์ ์๋ค์ด single-cell data์ ์ ์ ์ ๋ฆฌ์คํธ์ ๊ฒน์น๋์ง๊ฐ ์ค์ํ๋ค.
๊ฒน์น๋ ์ ์ ์์ ๊ฐฏ์๊ฐ 20%์ดํ๋ฉด ์๋ฌ๋์ ๋์๊ฐ์ง ์๋๋ค.๋ฐ๋ผ์ ๋ฏธ๋ฆฌ input์ผ๋ก ์ฃผ๋ bulk.eset์ matrix๋ฅผ ์ ๋ฆฌํด์ ๋ฃ๊ฑฐ๋
(์ ์ ์ ๋ฐํ์ด ์ ์ ์ํ์ ๊ฑฐ ๋ฐ single-cell dataset์์ ๋ฐํ๋๋ ์ ์ ์๋ง row์ ํฌํจ์ํค๊ธฐ)
markers
์ต์ ์ผ๋ก ์ฌ์ฉํ ์ ์ ์๋ฅผ limit์ํค๋๊ฒ์ด ํ์
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Bulk expression matrix
CUF = read.table("CUF.tsv",sep='\t',quote="",header=1)
# single-cell RNA-seq exp SCE obj.
sc.sce <- readRDS("GEX.SCE.rds")
# (optional) data cleaning
## rowname = hgnc_symbol๋ก ํ๊ธฐ ์ํด ์ ๋ฆฌ
# bulk.mtx = CUF[CUF$hgnc_symbol!="",c(1,9:826)] # 818 samples
# gene_names <- bulk.mtx$hgnc_symbol
# bulk.mtx <- bulk.mtx %>% group_by(hgnc_symbol) %>%
# summarise(across(everything(), max, na.rm = TRUE)) %>% as.data.frame()
# rownames(bulk.mtx) <- bulk.mtx$hgnc_symbol
# bulk.mtx <- bulk.mtx[,-1]
## singlecell data์ ์ผ์นํ๋ ์ ์ ์๋ง ์ฌ์ฉ
# bulk.mtx <- bulk.mtx[intersect(rownames(sc.sce),rownames(bulk.mtx)),]
## ๋ฐํ๋๋ ์ ์ ์์ ๊ฐฏ์๊ฐ ์ ์ฒด์ 20% ์ดํ์ธ ์ํ์ ์ฐพ์์ ์ ๊ฑฐ
# threshold <- nrow(bulk.mtx) * 0.2
# bulk.mtx <- bulk.mtx[, colSums(bulk.mtx > 0) >= threshold] # 808 samples
# Estimate cell type proportions
Est.prop = music_prop(bulk.mtx, sc.sce, clusters = 'predicted.celltype.l2',samples = 'SampleID')
# saveRDS(Est.prop,"Est.prop_result.rds")
# names(Est.prop)
#[1] "Est.prop.weighted" "Est.prop.allgene" "Weight.gene" "r.squared.full" "Var.prop"
๐จ music_prop ์์ ๋ ์ฝ๋
๊ทธ๋ฅ ๋ค์ด๋ฐ์ library ๊ทธ๋๋ก ๋๋ ธ๋๋, ์ถฉ๋ถํ ์์ ์ ์ ์๊ฐ ๋ฐํ๋๊ณ ์์์๋ ๋ถ๊ตฌํ๊ณ ์๋๊ฐ์ ์๋ฌ๋ฉ์ธ์ง๊ฐ ๋์ค๊ณ ์งํ์ด ์๋ผ์ ๋ช์ค ์ถ๊ฐํด์ ์์ ํจ
1
2
3
4
5
6
7
8
9
# error code
Creating Relative Abudance Matrix...
Creating Variance Matrix...
Creating Library Size Matrix...
Used 13092 common genes...
Used 29 cell types in deconvolution...
X208.d0 has common genes 13063 ...
Error in music.iter(Yjg.temp, D1.temp, M.S, Sigma.temp, iter.max = iter.max, :
Not enough common genes!
๐๐๐ ์์ ํ ์ฝ๋
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
## music_prop ์ฝ๋ ์์
music_prop = function(bulk.mtx, sc.sce, markers = NULL, clusters, samples, select.ct = NULL, cell_size = NULL, ct.cov = FALSE, verbose = TRUE,
iter.max = 1000, nu = 0.0001, eps = 0.01, centered = FALSE, normalize = FALSE, ... ){
bulk.gene = rownames(bulk.mtx)[rowMeans(bulk.mtx) != 0]
bulk.mtx = bulk.mtx[bulk.gene, ]
if(is.null(markers)){
sc.markers = bulk.gene
}else{
sc.markers = intersect(bulk.gene, unlist(markers))
}
sc.basis = music_basis(sc.sce, non.zero = TRUE, markers = sc.markers, clusters = clusters, samples = samples, select.ct = select.ct, cell_size = cell_size, ct.cov = ct.cov, verbose = verbose)
cm.gene = intersect( rownames(sc.basis$Disgn.mtx), bulk.gene )
if(is.null(markers)){
if(length(cm.gene)< 0.2*min(length(bulk.gene), nrow(sc.sce)) )
stop("Too few common genes!")
}else{
if(length(cm.gene)< 0.2*length(unlist(markers)))
stop("Too few common genes!")
}
if(verbose){message(paste('Used', length(cm.gene), 'common genes...'))}
m.sc = match(cm.gene, rownames(sc.basis$Disgn.mtx)); m.bulk = match(cm.gene, bulk.gene)
D1 = sc.basis$Disgn.mtx[m.sc, ];
M.S = colMeans(sc.basis$S, na.rm = T);
if(!is.null(cell_size)){
if(!is.data.frame(cell_size)){
stop("cell_size paramter should be a data.frame with 1st column for cell type names and 2nd column for cell sizes")
}else if(sum(names(M.S) %in% cell_size[, 1]) != length(names(M.S))){
stop("Cell type names in cell_size must match clusters")
}else if (any(is.na(as.numeric(cell_size[, 2])))){
stop("Cell sizes should all be numeric")
}
my_ms_names <- names(M.S)
cell_size <- cell_size[my_ms_names %in% cell_size[, 1], ]
M.S <- cell_size[match(my_ms_names, cell_size[, 1]),]
M.S <- M.S[, 2]
names(M.S) <- my_ms_names
}
Yjg = relative.ab(bulk.mtx[m.bulk, ]); N.bulk = ncol(bulk.mtx);
if(ct.cov){
Sigma.ct = sc.basis$Sigma.ct[, m.sc];
Est.prop.allgene = NULL
Est.prop.weighted = NULL
Weight.gene = NULL
r.squared.full = NULL
Var.prop = NULL
for(i in 1:N.bulk){
if(sum(Yjg[, i] == 0) > 0){
D1.temp = D1[Yjg[, i]!=0, ];
Yjg.temp = Yjg[Yjg[, i]!=0, i];
names(Yjg.temp) <- rownames(Yjg[Yjg[, i]!=0, ])
Sigma.ct.temp = Sigma.ct[, Yjg[,i]!=0];
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...') )
}else{
D1.temp = D1;
Yjg.temp = Yjg[, i];
names(Yjg.temp) <- rownames(Yjg)
Sigma.ct.temp = Sigma.ct;
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...'))
}
lm.D1.weighted = music.iter.ct(Yjg.temp, D1.temp, M.S, Sigma.ct.temp, iter.max = iter.max,
nu = nu, eps = eps, centered = centered, normalize = normalize)
Est.prop.allgene = rbind(Est.prop.allgene, lm.D1.weighted$p.nnls)
Est.prop.weighted = rbind(Est.prop.weighted, lm.D1.weighted$p.weight)
weight.gene.temp = rep(NA, nrow(Yjg)); weight.gene.temp[Yjg[,i]!=0] = lm.D1.weighted$weight.gene;
Weight.gene = cbind(Weight.gene, weight.gene.temp)
r.squared.full = c(r.squared.full, lm.D1.weighted$R.squared)
Var.prop = rbind(Var.prop, lm.D1.weighted$var.p)
}
}else{
Sigma = sc.basis$Sigma[m.sc, ];
valid.ct = (colSums(is.na(Sigma)) == 0)&(colSums(is.na(D1)) == 0)&(!is.na(M.S))
if(sum(valid.ct)<=1){
stop("Not enough valid cell type!")
}
if(verbose){message(paste('Used', sum(valid.ct), 'cell types in deconvolution...' ))}
D1 = D1[, valid.ct]; M.S = M.S[valid.ct]; Sigma = Sigma[, valid.ct];
Est.prop.allgene = NULL
Est.prop.weighted = NULL
Weight.gene = NULL
r.squared.full = NULL
Var.prop = NULL
for(i in 1:N.bulk){
if(sum(Yjg[, i] == 0) > 0){
D1.temp = D1[Yjg[, i]!=0, ];
Yjg.temp = Yjg[Yjg[, i]!=0, i];
names(Yjg.temp) <- rownames(Yjg[Yjg[, i]!=0, ])
Sigma.temp = Sigma[Yjg[,i]!=0, ];
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...') )
}else{
D1.temp = D1;
Yjg.temp = Yjg[, i];
names(Yjg.temp) <- rownames(Yjg)
Sigma.temp = Sigma;
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...'))
}
lm.D1.weighted = music.iter(Yjg.temp, D1.temp, M.S, Sigma.temp, iter.max = iter.max,
nu = nu, eps = eps, centered = centered, normalize = normalize)
Est.prop.allgene = rbind(Est.prop.allgene, lm.D1.weighted$p.nnls)
Est.prop.weighted = rbind(Est.prop.weighted, lm.D1.weighted$p.weight)
weight.gene.temp = rep(NA, nrow(Yjg)); weight.gene.temp[Yjg[,i]!=0] = lm.D1.weighted$weight.gene;
Weight.gene = cbind(Weight.gene, weight.gene.temp)
r.squared.full = c(r.squared.full, lm.D1.weighted$R.squared)
Var.prop = rbind(Var.prop, lm.D1.weighted$var.p)
}
}
colnames(Est.prop.weighted) = colnames(D1)
rownames(Est.prop.weighted) = colnames(Yjg)
colnames(Est.prop.allgene) = colnames(D1)
rownames(Est.prop.allgene) = colnames(Yjg)
names(r.squared.full) = colnames(Yjg)
colnames(Weight.gene) = colnames(Yjg)
rownames(Weight.gene) = cm.gene
colnames(Var.prop) = colnames(D1)
rownames(Var.prop) = colnames(Yjg)
return(list(Est.prop.weighted = Est.prop.weighted, Est.prop.allgene = Est.prop.allgene,
Weight.gene = Weight.gene, r.squared.full = r.squared.full, Var.prop = Var.prop))
}
๐จ music_prop.cluster ์์ ๋ ์ฝ๋
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
#music_prop.cluster ์ฝ๋ ์์
music_prop.cluster = function(bulk.mtx, sc.sce, group.markers, groups, clusters, samples, clusters.type,
verbose = TRUE, iter.max = 1000, nu = 0.0001, eps = 0.01, centered = FALSE, normalize = FALSE, ... ){
bulk.gene = rownames(bulk.mtx)[rowMeans(bulk.mtx) != 0]
bulk.mtx = bulk.mtx[bulk.gene, ]
select.ct = unlist(clusters.type)
if(length(setdiff(names(group.markers), names(clusters.type))) > 0 || length(setdiff(names(clusters.type), names(group.markers))) > 0){
stop("Cluster number is not matching!")
}else{
group.markers = group.markers[names(clusters.type)]
}
if(verbose){message('Start: cluster estimations...')}
cluster.sc.basis = music_basis(sc.sce, non.zero = TRUE, markers = NULL, clusters = groups, samples = samples, select.ct = names(clusters.type), verbose = verbose)
if(verbose){message('Start: cell type estimations...')}
sc.basis = music_basis(sc.sce, non.zero = TRUE, markers = NULL, clusters = clusters, samples = samples, select.ct = select.ct, verbose = verbose)
cm.gene = intersect(rownames(sc.basis$Disgn.mtx), bulk.gene)
if(length(cm.gene)< 0.2*min(length(bulk.gene), nrow(sc.sce)) ){
stop("Too few common genes!")
}
if(verbose){message(paste('Used', length(cm.gene), 'common genes...'))}
m.sc = match(cm.gene, rownames(sc.basis$Disgn.mtx)); m.bulk = match(cm.gene, bulk.gene)
group.markers = lapply(group.markers, intersect, cm.gene)
D1 = sc.basis$Disgn.mtx[m.sc,]; M.S = sc.basis$M.S; Sigma = sc.basis$Sigma[m.sc, ]
cluster.select = setdiff(rownames(D1), unique(unlist(group.markers)))
cluster.diff = unique(unlist(group.markers))
D1.cluster = cluster.sc.basis$Disgn.mtx[cluster.select, ]; M.S.cluster = cluster.sc.basis$M.S;
Yjg = relative.ab(bulk.mtx[m.bulk, ]); N.bulk = ncol(bulk.mtx);
Sigma.cluster = cluster.sc.basis$Sigma[cluster.select, ];
D1.sub = cluster.sc.basis$Disgn.mtx[cluster.diff, ]; Sigma.sub = cluster.sc.basis$Sigma[cluster.diff, ];
Est.prop.weighted.cluster = NULL
for(i in 1:N.bulk){
if(sum(Yjg[, i] == 0) > 0){
name.temp = rownames(Yjg)[Yjg[, i] != 0]
D1.cluster.temp = D1.cluster[rownames(D1.cluster)%in%name.temp, ];
D1.sub.temp = D1.sub[rownames(D1.sub)%in%name.temp, ];
Yjg.temp = Yjg[Yjg[, i]!=0, i];
names(Yjg.temp) <- rownames(Yjg[Yjg[, i]!=0, ])
Sigma.cluster.temp = Sigma.cluster[rownames(Sigma.cluster)%in%name.temp, ];
Sigma.sub.temp = Sigma.sub[rownames(Sigma.sub)%in%name.temp, ];
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...') )
}else{
D1.cluster.temp = D1.cluster;
D1.sub.temp = D1.sub;
Yjg.temp = Yjg[, i];
names(Yjg.temp) <- rownames(Yjg)
Sigma.cluster.temp = Sigma.cluster;
Sigma.sub.temp = Sigma.sub;
if(verbose) message(paste(colnames(Yjg)[i], 'has common genes', sum(Yjg[, i] != 0), '...'))
}
lm.D1.cluster = music.iter(Yjg.temp, D1.cluster.temp, M.S.cluster, Sigma.cluster.temp, iter.max = iter.max,
nu = nu, eps = eps, centered = centered, normalize = normalize)
p.weight = NULL
p.cluster.weight = lm.D1.cluster$p.weight
for(j in 1:length(clusters.type)){
if(length(clusters.type[[j]]) == 1){
p.weight = c(p.weight, p.cluster.weight[j])
}else{
if(p.cluster.weight[j] == 0){
p.weight = c(p.weight, rep(0, length(clusters.type[[j]])))
}else{
c.marker = intersect(group.markers[[j]], names(Yjg.temp))
Y.sub = D1.sub.temp[c.marker, j]*p.cluster.weight[j] +
(Yjg.temp[c.marker] - D1.sub.temp[c.marker, ]%*% p.cluster.weight) * p.cluster.weight[j]
names(Y.sub) = c.marker
Y.sub = Y.sub[Y.sub > 0]
lm.D1.sub = music.iter(Y.sub, D1[c.marker, clusters.type[[j]]], M.S[clusters.type[[j]]],
Sigma[c.marker, clusters.type[[j]]])
p.weight = c(p.weight, p.cluster.weight[j] * lm.D1.sub$p.weight)
}
}
}
Est.prop.weighted.cluster = rbind(Est.prop.weighted.cluster, p.weight)
}
colnames(Est.prop.weighted.cluster) = unlist(clusters.type)
rownames(Est.prop.weighted.cluster) = colnames(Yjg)
return(list(Est.prop.weighted.cluster = Est.prop.weighted.cluster))
}
Results
๊ทธ๋์ ๋๋ฆฌ๊ณ ๋ Est.prop
์๋ ์ํ๋ณ๋ก ์์ธก๋ ๊ฒฐ๊ณผ๊ฐ ์ ์ฅ๋๋๋ฐ..
Est.prop.weighted
: data.frame of MuSiC estimated proportions, subjects by cell types;Est.prop.allgene
: data.frame of NNLS estimated proportions, subjects by cell types;Weight.gene
: matrix, MuSiC estimated weight for each gene, genes by subjects;r.squared.full
: vector of R squared from MuSiC estimated proportions for each subject;Var.prop
: matrix of variance of MuSiC estimates.
NNLS๋โฆ
Non-Negative Least Squares์ ์ฝ์ด. ๋น์์ ์ต์์ ๊ณฑ๋ฒ์ด๋ผ๊ณ ๋ํจ.
๋ ผ๋ฌธ์์ CIBERSORT, BSEQ-sc์ ํจ๊ป MuSiC์ ๊ฒฐ๊ณผ๋ฅผ ํ๊ฐํ๊ธฐ ์ํ ๋ฐฉ๋ฒ์ค ํ๋๋ก ์ฌ์ฉํจ.linear regression์ ์ํ ์ต์ ํ ์๊ณ ๋ฆฌ์ฆ ์ค ํ๋.
์ด ๋ฐฉ๋ฒ์ ํน์ง์ ๋ชจ๋ x๊ฐ ๋น์์์ฌ์ผ ํ๋ค๋ ์ ์ฝ์กฐ๊ฑด์ด ์์.์ฌ๋ฌ cell type์ ํผํฉ๋ฌผ (bulkRNA-seq data)์์ ๊ฐ cell type์ ๋น์จ์ ์ ํ๋ฐฉ์ ์์ผ๋ก ํํ์ด ๊ฐ๋ฅํ ํ ๋ฐ, ์ด๋ ๊ฐ cell type์ ๋น์จ์ ์ต์ 0์ด์ง ์์๊ฐ ๋ ์ ์๊ธฐ ๋๋ฌธ์
1
2
Est.prop$Est.prop.weighted # sample X celltypes
rowSums(Est.prop$Est.prop.weighted) # ์ ๋ถ 1
๊ทธ๋ฆผ์ผ๋ก ํํ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# ๊ฒฐ๊ณผ์ ๋ฆฌ
m.prop = rbind(meltnew(Est.prop$Est.prop.weighted), meltnew(Est.prop$Est.prop.allgene))
colnames(m.prop) = c('Sub', 'CellType', 'Prop')
m.prop$CellType = factor(m.prop$CellType, levels = unique(sort(colData(sc.sce)$predicted.celltype.l2)))
m.prop$Method = factor(rep(c('MuSiC', 'NNLS'), each = ncol(expression_matrix_filtered)*length(levels(m.prop$CellType))), levels = c('MuSiC', 'NNLS'))
m.prop$day = sapply(m.prop$Sub,function(x) strsplit(as.character(x),split="[.]")[[1]][2])
m.prop$day = factor(m.prop$day,levels=c('d0','d3','d7','d28'))
m.prop$PatientID = sapply(m.prop$Sub,function(x) strsplit(as.character(x),split="[.]")[[1]][1])
# celltype๋ณ๋ก ๊ทธ๋ฆผ
cell_types <- unique(m.prop$CellType)
for (cell_type in cell_types) {
# ํด๋น Cell Type์ ๋ํ ๋ฐ์ดํฐ ์ถ์ถ
cell_type_data <- m.prop[m.prop$Method=="MuSiC",] %>% filter(CellType == cell_type)
# Box Plot ๊ทธ๋ฆฌ๊ธฐ
p <- ggplot(data = cell_type_data, aes(x = factor(day), y = Prop, fill = factor(day))) +
geom_boxplot() +
labs(title = paste("Box Plot for", cell_type), x = "Day", y = "Proportion") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# ๊ทธ๋ฆผ ์ ์ฅ
ggsave(filename = paste(cell_type, "_MuSiC.png", sep = ""), plot = p)
}
์ด ํด ๋ง๊ณ ๋ ์ฌ๋ฌ๊ฐ์ง ๋ค๋ฅธ deconvolution ํด์ด ์กด์ฌํจ.
๋
Reference
- https://xuranw.github.io/MuSiC/articles/pages/MuSiC2.html#sample-analysis