inferCNV

Posted May 25, 2021

By Subin Cho 5 min read

Inferring copy number alterations from tumor single cell RNA-seq data

tumor sample로 scRNA seq data생산했을때, 가장 기본적으로 진행하는 분석이다. 어떤 cell이 cancer cell인지 알아내는것! 일반적인 cell type annotation tool(such as SingleR)을 사용해서 분석하면 cancer cell은 fibroblast나 epithelial cell같은 type으로 분류될 수 있다.
inferCNV는 RNAseq data에서 large-scale choromosomal Copy Number alteration을 추론해 normal에 비해 alteration이 많은 cancer cell을 구분한다.

Installation

Pre-requisites

JAGS (Just Another Gibbs Sampler)

install JAGS from source code (downlaod here)

  
tar -xvzf JAGS-4.3.0.tar.gz
cd JAGS-4.3.0
./configure --prefix=[JAGS_installation_dir]
make & make install

:rotating_light: configure: error: "You need to install the LAPACK library"

  
...
checking for cheev_ in -llapack_rs6k... no
configure: error: "You need to install the LAPACK library"

configuration하는 도중에 위와 같은 error가 나면, lapack 설치해준다 (download source from here)

  
tar -xvzf v3.9.1.tar.gz
cd lapack-3.9.1
#make.inc.example 파일의 내용을 본인 환경에 맞게 수정한 후, make.inc파일로 저장한다.
mv make.inc.example make.inc 
make & make install

잘 설치되면 다시 한번 configuration시도해보고, lapack 깔았는데도 같은 에러가 난다면 이렇게 해보자

LDFLAGS="-L/source/lapack-3.9.1" F77=gfortran ./configure --prefix=/source/JAGS-4.3.0 --with-lapack=/source/lapack-3.9.1

아니면 conda로 설치

conda install -c conda-forge r-rjags

Install inferCNV

  
if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("infercnv")

Running inferCNV

Input files

Raw counts matrix (Gene x Cell)

The counts matrix can be generated using any conventional single cell transcriptome quantification pipeline, yielding a matrix of genes (rows) vs. cells (columns) containing assigned read counts.

난 10X data사용하고 seurat으로 QC거친 data가 있으니까 거기에서 뽑아내면

  
counts_matrix = GetAssayData(seurat_obj, slot="counts")
write.table(round(counts_matrix, digits=3), file='sc.10x.counts.matrix', quote=F, sep="\t")                                                                                                        

Sample Annotation file

The sample annotation file is used to define the different cell types, and optionally, indicating how the cells should be grouped according to sample (ie. patient). The format is simply two columns, tab-delimited, and there is no column header.
Only those cells listed in the sample annotations file will be analyzed by inferCNV.This is useful in case you cells of interest are a subset of the total counts matrix, without needing create a new matrix containing the subset of interest.

나는 immune cell type을 미리 나눠놨어서 그 cell들을 normal이라고 annotation 해서 넣을 예정

  
annot_n <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2, 5, 9, 10))), annot="normal immune")
annot_u <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2, 5, 9, 10), invert=T)), annot="not immune")
annot<-rbind(annot_n, annot_u)

#다른 방법###
annot_n <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2,10,0,8,4))), 
    annot=as.character(sapply(colnames(subset(sub, idents=c(2,10,0,8,4))),function(x) paste0("cluster",sub$seurat_clusters[x]))))
annot_u <- data.frame(cell_id=colnames(subset(seurat_obj, idents=c(2,10,0,8,4), invert=T)), 
    annot=as.character(sapply(colnames(subset(sub, idents=c(2,10,0,8,4), invert=T)),function(x) paste0("cluster",sub$seurat_clusters[x]))))
annot<-rbind(annot_n, annot_u)
########################
write.table(annot,"cellAnnotations.txt", sep="\t", col.names = F, row.names = F,quote=F)

Gene ordering file

The gene ordering file provides the chromosomal location for each gene. The format is tab-delimited and has no column header, simply providing the gene name, chromosome, and gene span

count할때 사용했던 reference와 같은 버전을 사용하자. (Gencode gtf file download here)
bed file형식으로 만들기 위해 다음과 같은 과정을 거쳤다.

  
# gene_name  chromosome  start  end 의 포맷으로 만들기 위해 전처리
cat gencode.vM23.primary_assembly.annotation.gtf |  awk 'OFS="\t" {if ($3=="gene") {print $14,$1,$4-1,$5}}' | tr -d '";' > inferCNV_geneOrdering.txt
#보니까 중복되는 유전자가 있어서 R로 다시 한번 처리해줌
geneAnnot <- read.table("gene_ordering_file.txt",sep='\t',stringsAsFactors=F)
write.table(geneAnnot %>% distinct(V1, .keep_all = TRUE), "gene_ordering_file.txt", sep="\t", col.names = F, row.names = F,quote=F)

InferCNV 2-step execution:

Step1. Create inferCNV object

만들어둔 input file들 넣어서 inferCNV obj만들어준다.

  
infercnv_obj = CreateInfercnvObject(raw_counts_matrix="sc.10x.counts.matrix",
                                    annotations_file="cellAnnotations.txt",
                                    delim="\t",
                                    gene_order_file="gene_ordering_file.txt",
                                    ref_group_names=c("normal immune"),
                                    chr_exclude = c("chrX", "chrY","chrM"))

verbose

  
INFO [2021-05-25 17:35:49] Parsing matrix: sc.10x.counts.matrix
INFO [2021-05-25 17:36:02] Parsing gene order file: gene_ordering_file.txt
INFO [2021-05-25 17:36:02] Parsing cell annotations file: cellAnnotations.txt
INFO [2021-05-25 17:36:02] ::order_reduce:Start.
INFO [2021-05-25 17:36:02] .order_reduce(): expr and order match.
INFO [2021-05-25 17:36:02] ::process_data:order_reduce:Reduction from positional data, new dimensions (r,c) = 16766,5684 Total=114172878 Min=0 Max=5041.
INFO [2021-05-25 17:36:02] num genes removed taking into account provided gene ordering list: 529 = 3.15519503757605% removed.
INFO [2021-05-25 17:36:02] -filtering out cells < 100 or > Inf, removing 0 % of cells
INFO [2021-05-25 17:36:04] validating infercnv_obj

Step2. perform inferCNV operation

obj만들어지면 standard invercnv procedure실행 infercnv::run() 시간꽤 걸린다

  
infercnv_obj = infercnv::run(infercnv_obj,
                     cutoff=0.1,  #0.1 for 10x-genomics
                     out_dir="output_dir",  # auto-created for storing outputs
                     cluster_by_groups=T,   # cluster
                     denoise=T,
                     HMM=T,
                     num_threads = 30,
                     output_format="pdf"))

verbose

Results

잘 돌아가고 나면, out_dir 옵션에 넣은 값과 같은 이름의 output directory가 생성된다.
inferCNV.pdf

reference

inferCNV github page

Study, Single-Cell

sc tool

This post is licensed under CC BY 4.0 by the author.