inferCNV
Inferring copy number alterations from tumor single cell RNA-seq data
tumor sample로 scRNA seq data생산했을때, 가장 기본적으로 진행하는 분석이다. 어떤 cell이 cancer cell인지 알아내는것! 일반적인 cell type annotation tool(such as SingleR)을 사용해서 분석하면 cancer cell은 fibroblast나 epithelial cell같은 type으로 분류될 수 있다.
inferCNV는 RNAseq data에서 large-scale choromosomal Copy Number alteration을 추론해 normal에 비해 alteration이 많은 cancer cell을 구분한다.
Installation
Pre-requisites
JAGS (Just Another Gibbs Sampler)
install JAGS from source code (downlaod here)
1
2
3
4
tar -xvzf JAGS-4.3.0.tar.gz
cd JAGS-4.3.0
./configure --prefix=[JAGS_installation_dir]
make & make install
:rotating_light: configure: error: "You need to install the LAPACK library"
1
2
3
...
checking for cheev_ in -llapack_rs6k... no
configure: error: "You need to install the LAPACK library"
configuration하는 도중에 위와 같은 error가 나면, lapack
설치해준다 (download source from here)
1
2
3
4
5
tar -xvzf v3.9.1.tar.gz
cd lapack-3.9.1
#make.inc.example 파일의 내용을 본인 환경에 맞게 수정한 후, make.inc파일로 저장한다.
mv make.inc.example make.inc
make & make install
잘 설치되면 다시 한번 configuration시도해보고, lapack 깔았는데도 같은 에러가 난다면 이렇게 해보자
1
LDFLAGS="-L/source/lapack-3.9.1" F77=gfortran ./configure --prefix=/source/JAGS-4.3.0 --with-lapack=/source/lapack-3.9.1
아니면 conda로 설치
1
conda install -c conda-forge r-rjags
Install inferCNV
1
2
3
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("infercnv")
Running inferCNV
Input files
Raw counts matrix (Gene x Cell)
The counts matrix can be generated using any conventional single cell transcriptome quantification pipeline, yielding a matrix of genes (rows) vs. cells (columns) containing assigned read counts.
난 10X data사용하고 seurat으로 QC거친 data가 있으니까 거기에서 뽑아내면
1
2
counts_matrix = GetAssayData(seurat_obj, slot="counts")
write.table(round(counts_matrix, digits=3), file='sc.10x.counts.matrix', quote=F, sep="\t")
Sample Annotation file
The sample annotation file is used to define the different cell types, and optionally, indicating how the cells should be grouped according to sample (ie. patient). The format is simply two columns, tab-delimited, and there is no column header.
Only those cells listed in the sample annotations file will be analyzed by inferCNV.This is useful in case you cells of interest are a subset of the total counts matrix, without needing create a new matrix containing the subset of interest.
나는 immune cell type을 미리 나눠놨어서 그 cell들을 normal이라고 annotation 해서 넣을 예정
1
2
3
4
5
6
7
8
9
10
11
12
annot_n <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2, 5, 9, 10))), annot="normal immune")
annot_u <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2, 5, 9, 10), invert=T)), annot="not immune")
annot<-rbind(annot_n, annot_u)
#다른 방법###
annot_n <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2,10,0,8,4))),
annot=as.character(sapply(colnames(subset(sub, idents=c(2,10,0,8,4))),function(x) paste0("cluster",sub$seurat_clusters[x]))))
annot_u <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2,10,0,8,4), invert=T)),
annot=as.character(sapply(colnames(subset(sub, idents=c(2,10,0,8,4), invert=T)),function(x) paste0("cluster",sub$seurat_clusters[x]))))
annot<-rbind(annot_n, annot_u)
########################
write.table(annot,"cellAnnotations.txt", sep="\t", col.names = F, row.names = F,quote=F)
Gene ordering file
The gene ordering file provides the chromosomal location for each gene. The format is tab-delimited and has no column header, simply providing the gene name, chromosome, and gene span
count할때 사용했던 reference와 같은 버전을 사용하자. (Gencode gtf file download here)
bed file형식으로 만들기 위해 다음과 같은 과정을 거쳤다.
1
2
3
4
5
# gene_name chromosome start end 의 포맷으로 만들기 위해 전처리
cat gencode.vM23.primary_assembly.annotation.gtf | awk 'OFS="\t" {if ($3=="gene") {print $14,$1,$4-1,$5}}' | tr -d '";' > inferCNV_geneOrdering.txt
#보니까 중복되는 유전자가 있어서 R로 다시 한번 처리해줌
geneAnnot <- read.table("gene_ordering_file.txt",sep='\t',stringsAsFactors=F)
write.table(geneAnnot %>% distinct(V1, .keep_all = TRUE), "gene_ordering_file.txt", sep="\t", col.names = F, row.names = F,quote=F)
InferCNV 2-step execution:
Step1. Create inferCNV object
만들어둔 input file들 넣어서 inferCNV obj만들어준다.
1
2
3
4
5
6
infercnv_obj = CreateInfercnvObject(raw_counts_matrix="sc.10x.counts.matrix",
annotations_file="cellAnnotations.txt",
delim="\t",
gene_order_file="gene_ordering_file.txt",
ref_group_names=c("normal immune"),
chr_exclude = c("chrX", "chrY","chrM"))
verbose
1
2
3
4
5
6
7
8
9
INFO [2021-05-25 17:35:49] Parsing matrix: sc.10x.counts.matrix
INFO [2021-05-25 17:36:02] Parsing gene order file: gene_ordering_file.txt
INFO [2021-05-25 17:36:02] Parsing cell annotations file: cellAnnotations.txt
INFO [2021-05-25 17:36:02] ::order_reduce:Start.
INFO [2021-05-25 17:36:02] .order_reduce(): expr and order match.
INFO [2021-05-25 17:36:02] ::process_data:order_reduce:Reduction from positional data, new dimensions (r,c) = 16766,5684 Total=114172878 Min=0 Max=5041.
INFO [2021-05-25 17:36:02] num genes removed taking into account provided gene ordering list: 529 = 3.15519503757605% removed.
INFO [2021-05-25 17:36:02] -filtering out cells < 100 or > Inf, removing 0 % of cells
INFO [2021-05-25 17:36:04] validating infercnv_obj
Step2. perform inferCNV operation
obj만들어지면 standard invercnv procedure실행 infercnv::run()
시간꽤 걸린다
1
2
3
4
5
6
7
8
infercnv_obj = infercnv::run(infercnv_obj,
cutoff=0.1, #0.1 for 10x-genomics
out_dir="output_dir", # auto-created for storing outputs
cluster_by_groups=T, # cluster
denoise=T,
HMM=T,
num_threads = 30,
output_format="pdf"))
verbose
1
Results
잘 돌아가고 나면, out_dir
옵션에 넣은 값과 같은 이름의 output directory가 생성된다.
inferCNV.pdf