RSEM custom reference로 single-cell 10X 버전이랑 일치 시키기

Posted Apr 9, 2026

By Subin Cho 4 min read

같은 subject에서 Bulk RNA-seq 이랑 single-cell data를 함께 제작해서 분석하는 프로젝트가 생겨서 둘의 reference를 맞추기 위해서.. single-cell 쪽을 맞추기보다는 Bulk 쪽을 재제작하는게 나을것 같다는 판단하에 진행함.

사실 둘다 최신버전으로 다운받아서 진행했으면 더 깔끔했을수 있지만 귀찮으니까.

scRNA-seq 분석에서 사용한 reference ver. = References 2024-A (March 13, 2024)
🔗10X genomics: step to build references 참조.

01. Downlaod source files

일치가 목표니까 10X가 reference만들때 사용했던것과 동일한 버전의 .fa 랑 .gtf 파일을 다운 받아준다.

        
      
fasta_url="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"

fasta_in="${build}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
gtf_in="${build}/gencode.v44.primary_assembly.annotation.gtf"

if [ ! -f "$fasta_in" ]; then
    curl -L "$fasta_url" | zcat > "$fasta_in"
fi

if [ ! -f "$gtf_in" ]; then
    curl -L "$gtf_url" | zcat > "$gtf_in"
fi

02. header 수정

10X에서 다운받은 fa(ENSEMBL)랑 gtf(GENCODE)는 서로 header모양이 달라서 그냥 RSEM한테 주고 reference만들라고 하면 에러날 가능성이 10000%기 때문에 맞게 수정해줘야한다. 어떻게 알았냐고? 알고 싶지 않았다. 10X reference buil step에도 있었는데 안읽었다가 똥인지 된장인지 먹어보고 수정함.

        
      
# 1) FASTA header harmonization
# Ensembl: >1 ...  ->  GENCODE/UCSC style: >chr1 1
# MT -> chrM
build="/sc/arion/scratch/chos14/sysFluR01/ref"
fasta_modified="${build}/$(basename "$fasta_in").modified"

cat "$fasta_in" \
    | sed -E 's/^>(\S+).*/>\1 \1/' \
    | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
    | sed -E 's/^>MT />chrM /' \
    > "$fasta_modified"

# 2) Remove version suffix from gene/transcript/exon IDs
# gene_id "ENSG00000223972.5" -> gene_id "ENSG00000223972"; gene_version "5";
gtf_modified="${build}/$(basename "$gtf_in").modified"
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"

cat "$gtf_in" \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
    | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
    > "$gtf_modified"

03. 10X랑 맞게 gene filtering

10X측에서도 ensembl이랑 genecode 합치려고 뭔가 열심히 한거 같은데 GTF에서 특정 패턴의 유전자만 골라서 사용했다. 주석 함께 퍼옴.

        
      
# Since Ensembl 110, polymorphic pseudogenes are now just protein_coding.
# Readthrough genes are annotated with the readthrough_transcript tag.

# Construct the gene ID allowlist. We filter the list of all transcripts
# based on these criteria:
#   - allowable gene_type (biotype)
#   - allowable transcript_type (biotype)
#   - no "readthrough_transcript" tag
# We then collect the list of gene IDs that have at least one associated
# transcript passing the filters.

BIOTYPE_PATTERN="(protein_coding|protein_coding_LoF|lncRNA|IG_C_gene|IG_D_gene|IG_J_gene|IG_LV_gene|IG_V_gene|IG_V_pseudogene|IG_J_pseudogene|IG_C_pseudogene|TR_C_gene|TR_D_gene|TR_J_gene|TR_V_gene|TR_V_pseudogene|TR_J_pseudogene)"
GENE_PATTERN="gene_type \"${BIOTYPE_PATTERN}\""
TX_PATTERN="transcript_type \"${BIOTYPE_PATTERN}\""
READTHROUGH_PATTERN='tag "readthrough_transcript"'

cat "$gtf_modified" \
    | awk '$3 == "transcript"' \
    | grep -E "$GENE_PATTERN" \
    | grep -E "$TX_PATTERN" \
    | grep -Ev "$READTHROUGH_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > "${build}/gene_allowlist"

# Since Ensembl 110, the PAR locus genes are included on chrY as copies of chrX
# Using the GRCh38.p13 assembly hard masks these regions on chrY, but removing the
# chrY PAR genes is still desirable so they do not end up as extra entries in the output.
# The awk command below excludes all PAR_Y genes, including XGY2.
# The non-coding gene XGY2 straddles the PAR1 boundary on chrY, and is homologous to XG on chrX.
# GRCh38-2024-A excludes XGY2, but includes SRY and ENSG00000286130, which are in an intron of XGY2,
# and RPS4Y1, which overlaps XGY2.

gtf_filtered="${build}/$(basename "$gtf_in").filtered"

grep -E '^#' "$gtf_modified" > "$gtf_filtered"

grep -Ff "${build}/gene_allowlist" "$gtf_modified" \
    | awk -F '\t' '$1 != "chrY" || ($1 == "chrY" && $4 >= 2752083 && $4 < 56887903 && !/ENSG00000290840/)' \
    >> "$gtf_filtered"

04. Build RSEM custom reference

여기까지하면 필요한 input준비는 완. 이제 build만 하면된다.

        
      
rsem_ref="${build}/rsem-Href"

rsem-prepare-reference \
    --gtf "$gtf_filtered" \
    --star \
    -p "$threads" \
    "$fasta_modified" \
    "$rsem_ref"

Study, Bulk RNA-seq

code NYU/UGA4/STAR-RSEM

This post is licensed under CC BY 4.0 by the author.

01. Downlaod source files

02. header 수정

03. 10X랑 맞게 gene filtering

04. Build RSEM custom reference

Trending Tags