computeCSMat
computeCSMat quantifies the genomic distribution of chromatin states (CSs) across gene or transcript regions. It generates a segment-wise CS occupancy matrix, which reveals the spatial preferences of different CSs relative to transcriptional units (e.g., promoters, gene bodies). It then selects a subset of Highly Informative Transcripts (HITs), or genes, based on their distinct CS occupancy profiles. These HITs are chosen to improve the biological signal-to-noise ratio and serve as the core input for downstream functional clustering of chromatin states (CSCs) in chromIDEAS.
Usage: computeCSMat [options] -i <input_CS> -r <region_file> -o <out_prefix> -f gtf -t tx -O 0.1
or: computeCSMat [options] -i <input_CS> -r <region_file> -o <out_prefix> -f gtf -t gene
or: computeCSMat [options] -i <input_CS> -r <region_file> -o <out_prefix> -f bed
Required arguments
-i <input_CS>  Chromatin state segmentation file (space-delimited) generated by chromIDEAS or ideasCS. [Default: None] Example:
    #ID CHR POSst POSed cell1 cell2
    1 chr1 792600 792800 1 1
    2 chr1 792800 793000 0 0
    3 chr1 793000 793200 0 0
    4 chr1 793200 793400 0 0
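As a rough illustration of the layout shown in the example, the following Python sketch reads such a table into memory. It assumes only what the example shows: a header line starting with `#`, four leading columns (ID, CHR, POSst, POSed), and one integer CS label per cell type after that. The function name `read_cs_table` is hypothetical and is not part of chromIDEAS.

```python
import io


def read_cs_table(handle):
    """Parse a space-delimited CS segmentation table.

    Returns (cell_names, rows), where each row is
    (chrom, start, end, {cell_name: cs_label}).
    """
    header = handle.readline().lstrip("#").split()
    cells = header[4:]  # columns after ID, CHR, POSst, POSed are cell types
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, start, end = fields[1], int(fields[2]), int(fields[3])
        states = dict(zip(cells, (int(x) for x in fields[4:])))
        rows.append((chrom, start, end, states))
    return cells, rows


example = io.StringIO(
    "#ID CHR POSst POSed cell1 cell2\n"
    "1 chr1 792600 792800 1 1\n"
    "2 chr1 792800 793000 0 0\n"
)
cells, rows = read_cs_table(example)
print(cells)
print(rows[0])
```

Each row in the example spans a 200 bp bin, and every cell-type column holds the CS label assigned to that bin.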
-r <region_file>  Input genomic regions file. Defines the set of regions (e.g., all genes or transcripts) across which the CS distribution is analyzed. The regions ultimately used for functional clustering (HITs) are a selected subset of these. Supported formats are BED and GTF. If GTF is used, specify “-f gtf” and choose the analysis unit with -t. [Default: None]
-o <out_prefix>  Output file path and prefix. The following primary output files are generated [Default: None]:
    1. <out_prefix>.<n_HITs>_Highly_Informative_<location_type>s.qs
       Contains the list of selected HITs (or genes) for each analyzed cell type.
    2. <out_prefix>.<location_type>_Body_<body_bin_num>segments_based_on_CSPercentage.<cell>.qs
       Contains the segment-wise CS occupancy matrix: the proportion of each CS across the subdivided body of each region, for each cell type.
Optional arguments
-l <body_bin_num>  Number of segments per region body. Each input region is divided into <body_bin_num> equal segments to quantify positional CS preferences. To ensure each segment contains at least one CS, regions shorter than <body_bin_num> bins are automatically filtered out. [Default: 10]
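The segmentation described for -l can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes a region has already been reduced to an ordered list of per-bin CS labels, and the function name `segment_occupancy` is hypothetical. How computeCSMat handles strand and partially covered bins is not stated here and is left out.

```python
from collections import Counter


def segment_occupancy(bin_states, n_segments=10):
    """Split a region's ordered per-bin CS labels into n_segments
    near-equal parts and return the CS proportions in each part.

    Returns None when the region has fewer bins than segments,
    mirroring the filtering rule described for -l.
    """
    n_bins = len(bin_states)
    if n_bins < n_segments:
        return None  # some segment would contain no bin
    occupancy = []
    for k in range(n_segments):
        lo = k * n_bins // n_segments
        hi = (k + 1) * n_bins // n_segments
        counts = Counter(bin_states[lo:hi])
        total = hi - lo
        occupancy.append({cs: c / total for cs, c in counts.items()})
    return occupancy


# A region covered by 8 bins, split into 4 segments of 2 bins each.
profile = segment_occupancy([1, 1, 0, 0, 2, 2, 2, 0], n_segments=4)
print(profile)

# A region with only 2 bins cannot fill 4 segments and is dropped.
print(segment_occupancy([1, 0], n_segments=4))
```

Stacking these per-segment proportions over all regions gives one row per region and one column per (segment, CS) pair, which is the general shape of a segment-wise occupancy matrix.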
-z <length_leveles>  Number of length strata. To mitigate bias caused by substantial variation in region lengths, regions are stratified into this many quantiles by length. HIT selection is then performed within each stratum to ensure balanced representation. Recommended to be ≤ 10. [Default: 10]
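The idea of selecting within length strata can be sketched as below. The per-region `score` is a placeholder: this page does not state which informativeness measure computeCSMat uses, so both the score and the even split of the quota across strata are assumptions made purely for illustration. The function name `stratified_top` is hypothetical.

```python
def stratified_top(regions, n_select, n_strata=10):
    """Pick the top-scoring regions within each length stratum.

    regions: list of (name, length, score) tuples.
    The quota n_select is spread as evenly as possible over the strata.
    """
    ordered = sorted(regions, key=lambda r: r[1])  # shortest first
    n = len(ordered)
    selected = []
    for k in range(n_strata):
        stratum = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        quota = (k + 1) * n_select // n_strata - k * n_select // n_strata
        best = sorted(stratum, key=lambda r: r[2], reverse=True)[:quota]
        selected.extend(name for name, _, _ in best)
    return selected


regions = [
    ("a", 100, 0.8),
    ("b", 200, 0.9),
    ("c", 5000, 0.5),
    ("d", 9000, 0.1),
]
picked = stratified_top(regions, n_select=2, n_strata=2)
print(picked)
```

In this toy input a global top-2 by score would return only the two short regions ("b" and "a"), whereas the stratified version takes one region from the short stratum and one from the long stratum.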
-f <file_type>  Format of the region file. Specifies whether the <region_file> is in gtf or bed format. The standard GTF format is a 9-column, tab-delimited file. The required BED format should contain 4 columns: chrom, chromStart, chromEnd, and strand. [Default: gtf]
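Because the 4-column BED layout described here puts strand in the fourth column (where standard BED has a name field), a quick pre-check of the region file may be useful. The sketch below is a hypothetical helper, not part of chromIDEAS. It assumes whitespace-separated columns and treats "+", "-", and "." as acceptable strand values, which is an assumption rather than documented behaviour.

```python
def read_bed4(lines):
    """Read regions from a 4-column BED layout: chrom, chromStart, chromEnd, strand."""
    regions = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and comments
        if len(fields) < 4:
            raise ValueError(f"line {number}: expected 4 columns, got {len(fields)}")
        chrom, start, end, strand = fields[:4]
        if strand not in {"+", "-", "."}:
            raise ValueError(f"line {number}: unexpected strand {strand!r}")
        regions.append((chrom, int(start), int(end), strand))
    return regions


bed_regions = read_bed4([
    "chr1\t792600\t801000\t+",
    "chr1\t900000\t905400\t-",
])
print(bed_regions)
```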
-t <location_type>  Analysis unit for GTF files. Required if “-f gtf”. Choose “tx” to analyze non-overlapping transcripts or “gene” to analyze genes (collapsing all transcripts per gene). [Default: tx]
-O <overlap_cutoff>  Overlap cutoff for transcript selection. Used only when “-t tx”. Transcripts sharing more than this fraction of their length with any other transcript are considered overlapping and are excluded to avoid ambiguous CS assignments. [Default: 0.1]
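The overlap rule for -O can be illustrated with a small sketch. It assumes half-open coordinates, ignores strand, and compares every pair of transcripts on the same chromosome, which is simple but quadratic; whether computeCSMat considers strand or uses a different overlap definition is not stated here. The function name `non_overlapping` is hypothetical.

```python
def non_overlapping(transcripts, cutoff=0.1):
    """Keep transcripts whose overlap with every other transcript is
    at most `cutoff`, expressed as a fraction of their own length.

    transcripts: list of (name, chrom, start, end), half-open coordinates.
    """
    kept = []
    for name, chrom, start, end in transcripts:
        length = end - start
        worst = 0.0  # largest shared fraction with any other transcript
        for other, ochrom, ostart, oend in transcripts:
            if other == name or ochrom != chrom:
                continue
            shared = min(end, oend) - max(start, ostart)
            if shared > 0:
                worst = max(worst, shared / length)
        if worst <= cutoff:
            kept.append(name)
    return kept


transcripts = [
    ("t1", "chr1", 0, 1000),     # shares 50 bp (5%) with t2
    ("t2", "chr1", 950, 3000),   # shares 500 bp (about 24%) with t3
    ("t3", "chr1", 2000, 2500),  # lies entirely inside t2
    ("t4", "chr2", 0, 1000),     # alone on its chromosome
]
print(non_overlapping(transcripts, cutoff=0.1))  # ['t1', 't4']
```

Note that the fraction is taken relative to each transcript's own length, so a short transcript nested in a long one is excluded even when the long one would pass on that pair alone.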
-n <n_HITs>  Number of Highly Informative Transcripts (HITs) to select per sample. This defines the size of the output HITs list for each cell type, which serves as the high-signal feature set for CS clustering. Analogous to selecting highly variable genes in scRNA-seq analysis. [Default: 2000]
-p <nthreads>  Number of parallel processes. Recommended to be ≤ 10. [Default: 4]
-h  Show this help message and exit.
-v  Show program’s version number and exit.