chromIDEAS_CSC

chromIDEAS_CSC identifies functionally coherent chromatin state clusters (CSCs) by integrating two complementary biological perspectives: 1) the genomic spatial distribution patterns of chromatin states (CSs) across gene bodies, and 2) their epigenetic signal compositions. The tool operates on the principle that CSs sharing similar genomic distributions and epigenetic profiles likely perform related biological functions. To implement this, chromIDEAS_CSC employs the Weighted Nearest Neighbor (WNN) algorithm to construct a multimodal similarity space that balances contributions from both feature types. Within this integrated space, graph-based clustering partitions CSs into functionally consistent groups. For methodological details of WNN, see: https://doi.org/10.1016/j.cell.2021.04.048.

Usage:   chromIDEAS_CSC [options] -i <input_CS> -e <emission> -r <region_file> -o <out_prefix> -f gtf -t tx -O 0.1
   or:   chromIDEAS_CSC [options] -i <input_CS> -e <emission> -r <region_file> -o <out_prefix> -f gtf -t gene
   or:   chromIDEAS_CSC [options] -i <input_CS> -e <emission> -r <region_file> -o <out_prefix> -f bed
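As a rough illustration of how the three usage forms differ, the sketch below assembles the argument list in Python. It uses only flags documented on this page; the helper name `build_command` and the file names are hypothetical placeholders, not part of chromIDEAS.

```python
import shlex


def build_command(input_cs, emission, region_file, out_prefix,
                  file_type="gtf", location_type="tx", overlap_cutoff=0.1):
    """Assemble a chromIDEAS_CSC argument list (hypothetical helper)."""
    cmd = ["chromIDEAS_CSC", "-i", input_cs, "-e", emission,
           "-r", region_file, "-o", out_prefix, "-f", file_type]
    if file_type == "gtf":
        # -t only applies to GTF input
        cmd += ["-t", location_type]
        if location_type == "tx":
            # -O is used only for transcript-level analysis
            cmd += ["-O", str(overlap_cutoff)]
    return cmd


# File names below are placeholders chosen for illustration.
command = build_command("states.state", "states.para", "genes.gtf", "results/csc")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the tool is on the `PATH`.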

Content

Required arguments

-i <input_CS>

Chromatin state segmentation file (space delimited) generated by chromIDEAS or ideasCS [Default: None]. Example:

#ID CHR POSst POSed cell1 cell2
chr1 792600 792800 1 1
chr1 792800 793000 0 0
chr1 793000 793200 0 0
chr1 793200 793400 0 0

-e <emission>

Chromatin state emission file (tab delimited) generated by chromIDEAS or ideasCS that defines each CS by the co-occurrence probabilities of epigenetic signals [Default: None]. Example:

State Percentage  ATAC  H3K27ac ...  H3K79me2  H3K9me3
S24   0.12        11.32 4.08    ...  0.41      0.22
S30   0.05        10.80 0.63    ...  0.34      0.27
S3    4.61        0.34  0.22    ...  0.42      0.20
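For readers who want to inspect an emission file before running the tool, the following is a minimal sketch of a reader for the tab-delimited layout shown above. The function name `read_emission` is hypothetical, and the inline example is a trimmed copy of the rows above with the elided ("...") columns left out.

```python
import csv
import io


def read_emission(handle):
    """Read a tab-delimited emission table into a dict keyed by state."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    marks = header[2:]  # columns after State and Percentage are signals
    table = {}
    for row in reader:
        if not row:
            continue
        state, percentage, *signals = row
        table[state] = {
            "percentage": float(percentage),
            "signals": dict(zip(marks, map(float, signals))),
        }
    return table


# Trimmed copy of the example rows; elided columns are omitted.
example = "\n".join([
    "State\tPercentage\tATAC\tH3K27ac\tH3K79me2\tH3K9me3",
    "S24\t0.12\t11.32\t4.08\t0.41\t0.22",
    "S30\t0.05\t10.80\t0.63\t0.34\t0.27",
    "S3\t4.61\t0.34\t0.22\t0.42\t0.20",
])
emission = read_emission(io.StringIO(example))
print(emission["S24"]["signals"]["ATAC"])
```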

-r <region_file>

Input genomic regions file. Defines the set of regions (e.g., all genes/transcripts) across which CS distribution will be analyzed. The final regions used for functional clustering (HITs) are a selected subset of these. Supported formats are BED and GTF; specify the format with -f. For GTF input, choose the analysis unit with -t. [Default: None]

-o <out_prefix>

Output file path and prefix [Default: None]. The following primary output files will be generated:

1. <out_prefix>.<cell/merge>.cluster.csv
2. <out_prefix>.<cell/merge>.clustree.pdf
3. <out_prefix>.<n_HITs>_Highly_Informative_<location_type>s.qs
4. <out_prefix>.<location_type>_Body_<body_bin_num>segments_based_on_CSPercentage.<cell>.qs
5. <out_prefix>.<cell/merge>.CS_Distance.qs

- File 1) A CSV file containing the cluster membership of each CS at each tested resolution.
- File 2) A PDF clustering tree plot (generated by the clustree package) showing the relationships
          between clusterings across different resolutions.
- File 3) A qs format file containing the list of selected HITs (transcripts or genes) for each
          analyzed cell type.
- File 4) A qs format file containing the segment-wise CS occupancy matrix, i.e. the proportion
          of each CS across the subdivided body of each region for each cell type.
- File 5) A qs format file containing the distance matrix of all CSs within the WNN space,
          used for downstream differential CSC gene analysis.

Optional arguments

-l <body_bin_num>

Number of segments per region body. Each input region is divided into <body_bin_num> equal segments to quantify positional CS preferences. To ensure each segment contains at least one CS, regions shorter than <body_bin_num> bins are automatically filtered out. [Default: 10]
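The segmentation step can be pictured with a small sketch. It assumes a region is represented as an ordered list of per-bin CS labels; the function name and the exact rule for splitting bins that do not divide evenly are assumptions for illustration, not the tool's confirmed implementation.

```python
from collections import Counter


def segment_state_fractions(states, n_segments=10):
    """Split per-bin CS labels into n_segments chunks and return CS fractions.

    Returns None for regions with fewer bins than segments, mirroring the
    filtering rule described above.
    """
    n_bins = len(states)
    if n_bins < n_segments:
        return None
    fractions = []
    for k in range(n_segments):
        # Integer boundaries give near-equal, non-empty chunks when n_bins >= n_segments
        lo = k * n_bins // n_segments
        hi = (k + 1) * n_bins // n_segments
        chunk = states[lo:hi]
        counts = Counter(chunk)
        fractions.append({cs: c / len(chunk) for cs, c in counts.items()})
    return fractions


profile = segment_state_fractions([1, 1, 0, 0, 0, 3], n_segments=3)
print(profile)
```

Each returned dictionary corresponds to one body segment and sums to 1 across the CSs observed in that segment.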

-z <length_leveles>

Number of length strata for stratification. To mitigate bias caused by substantial variation in region lengths, regions are stratified into this many quantiles based on their length. HITs selection is then performed within each stratum to ensure balanced representation. Recommended to be ≤ 10. [Default: 10]
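A minimal sketch of length-stratified selection follows. The rank-based assignment to strata and the per-region `scores` are assumptions made for illustration; the actual criterion chromIDEAS_CSC uses to rank regions within a stratum is not described here.

```python
def stratify_by_length(lengths, n_strata=10):
    """Assign each region to a length stratum (0 = shortest) by rank."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    strata = [0] * len(lengths)
    for rank, idx in enumerate(order):
        strata[idx] = rank * n_strata // len(lengths)
    return strata


def select_top_per_stratum(scores, strata, per_stratum):
    """Pick the highest-scoring regions inside each stratum.

    `scores` is a hypothetical informativeness score per region.
    """
    chosen = []
    for s in sorted(set(strata)):
        members = [i for i, x in enumerate(strata) if x == s]
        members.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(members[:per_stratum])
    return sorted(chosen)


lengths = [100, 5000, 300, 20000]
strata = stratify_by_length(lengths, n_strata=2)
picked = select_top_per_stratum([0.2, 0.9, 0.8, 0.1], strata, per_stratum=1)
print(strata, picked)
```

Selecting within strata keeps short and long regions both represented, rather than letting one length class dominate the selected set.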

-f <file_type>

Format of the region file. Specifies whether the <region_file> is in gtf or bed format. The standard GTF format is a 9-column, tab-delimited file. The required BED format should contain 4 columns: chrom, chromStart, chromEnd, and strand. [Default: gtf]

-t <location_type>

Analysis unit for GTF files. Applies only when “-f gtf” is used. Choose “tx” to analyze non-overlapping transcripts or “gene” to analyze genes (collapsing all transcripts per gene). [Default: tx]

-O <overlap_cutoff>

Overlap cutoff for transcript selection. Used only when “-t tx”. Transcripts sharing more than this fraction of their length with any other transcript are considered overlapping and are excluded to avoid ambiguous CS assignments. [Default: 0.1]
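The overlap rule can be sketched as follows, assuming transcripts are given as `(chrom, start, end)` tuples with `end > start` and that strand is ignored. Both the representation and the pairwise comparison are illustrative assumptions rather than the tool's actual code.

```python
def filter_overlapping(transcripts, cutoff=0.1):
    """Keep transcripts whose largest overlap fraction with any other is <= cutoff."""
    kept = []
    for i, (chrom, start, end) in enumerate(transcripts):
        length = end - start
        worst = 0.0
        for j, (o_chrom, o_start, o_end) in enumerate(transcripts):
            if i == j or chrom != o_chrom:
                continue
            shared = min(end, o_end) - max(start, o_start)
            if shared > 0:
                # Fraction of THIS transcript's length covered by the other one
                worst = max(worst, shared / length)
        if worst <= cutoff:
            kept.append((chrom, start, end))
    return kept


txs = [("chr1", 0, 100), ("chr1", 50, 200), ("chr1", 300, 400)]
print(filter_overlapping(txs, cutoff=0.1))
```

The quadratic loop is fine for a demonstration; a genome-wide run would normally sort intervals or use an interval tree.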

-n <n_HITs>

Number of Highly Informative Transcripts (HITs) to select per sample. This defines the size of the output HITs list for each cell type, which serves as the high-signal feature set for CS clustering. The selection is analogous to choosing highly variable genes in scRNA-seq analysis. [Default: 2000]

-p <nthreads>

Number of parallel processes. Recommended to be ≤ 10. [Default: 4]

-E <excludeCS>

Exclude specified CSs from functional clustering. When users have clear evidence about the functions of certain CSs, they can cluster those states manually and exclude them from chromIDEAS’s unsupervised functional clustering analysis. CSs are specified by their numerical labels, separated by commas (e.g., “1,2,3” excludes states S1, S2, and S3 from the analysis). [Default: none]
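To make the label convention concrete, here is a small sketch that turns an -E value into state names. The function name is hypothetical, and treating the literal "none" as an empty set is an assumption based on the stated default.

```python
def parse_excluded_states(arg):
    """Convert a comma-separated list of numeric CS labels into state names."""
    if not arg or arg.strip().lower() == "none":
        return set()
    states = set()
    for token in arg.split(","):
        token = token.strip()
        if not token.isdigit():
            raise ValueError(f"CS label must be a non-negative integer: {token!r}")
        states.add(f"S{int(token)}")
    return states


excluded = parse_excluded_states("1,2,3")
print(sorted(excluded))
```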

-m <mode>

Analysis mode. Specifies which datasets to cluster. [Default: 3]

1. Cluster CSs from cell type 1 only (independent analysis).
2. Cluster CSs from cell type 2 only (independent analysis).
3. Joint clustering of CSs from both cell types (merged analysis).
4. Perform all three analyses (modes 1, 2, and 3) simultaneously.
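Assuming the four descriptions above correspond to mode values 1 to 4 in order, the mapping from mode to analyses can be sketched as a lookup. The labels `cell1`, `cell2`, and `merge` are illustrative names, not guaranteed to match the tool's internal or output naming.

```python
# Hypothetical labels for the three possible analyses
ANALYSES_BY_MODE = {
    1: ["cell1"],
    2: ["cell2"],
    3: ["merge"],
    4: ["cell1", "cell2", "merge"],
}


def analyses_for_mode(mode=3):
    """Return the list of analyses implied by a -m value."""
    if mode not in ANALYSES_BY_MODE:
        raise ValueError(f"mode must be one of 1, 2, 3, 4; got {mode!r}")
    return list(ANALYSES_BY_MODE[mode])


print(analyses_for_mode(4))
```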

-R <resolutions>

Clustering resolution(s). Specifies the graph clustering resolution parameter(s) to test. Accepts a comma-separated list that can include individual values and ranges. Ranges use the “start-end-step” format. All values must be within (0, 5]. Example: “0.9,1.3-1.8-0.1,2” tests 8 resolutions: 0.9, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, and 2. [Default: “0.1-0.9-0.1,1-1.8-0.2,2-5-1”]
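The resolution syntax can be expanded with a short parser. This is a sketch under the assumption that range endpoints are inclusive, as the example above implies; the function name is hypothetical, and `Decimal` is used only to avoid floating-point drift while stepping.

```python
from decimal import Decimal


def parse_resolutions(spec):
    """Expand a spec such as "0.9,1.3-1.8-0.1,2" into a list of floats."""
    values = []
    for token in spec.split(","):
        parts = token.strip().split("-")
        if len(parts) == 1:
            values.append(Decimal(parts[0]))
        elif len(parts) == 3:
            start, end, step = (Decimal(p) for p in parts)
            if step <= 0 or start > end:
                raise ValueError(f"invalid range: {token!r}")
            current = start
            while current <= end:  # inclusive end point
                values.append(current)
                current += step
        else:
            raise ValueError(f"expected a value or start-end-step: {token!r}")
    for value in values:
        if not (0 < value <= 5):
            raise ValueError(f"resolution outside (0, 5]: {value}")
    return [float(v) for v in values]


resolutions = parse_resolutions("0.9,1.3-1.8-0.1,2")
print(resolutions)
```

Applied to the default string, the same function yields 18 resolutions (9 + 5 + 4 from the three ranges).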

-P

Disable the creation of multi-resolution clustering tree plots. By default, these plots are generated automatically. Use this flag to suppress plot output if only the CSV results are needed. [Default: plots enabled]

-h

Show this help message and exit.

-v

Show program’s version number and exit.
