CRAM Subsetting¶
Subset CRAM to Genomic Region¶
Extract reads from a specific genomic region to create a smaller, focused CRAM file.
Python API Documentation¶
grid.utils.subset_cram¶
- grid.utils.subset_cram.subset_cram(cram_path, region, output_path, reference=None)¶
Subset a CRAM file for a specific genomic region.
- Parameters:
cram_path (str) – Input CRAM file.
region (str) – Region in ‘chr:start-end’ format (e.g., ‘chr6:160000000-160100000’).
output_path (str) – Path to output CRAM file.
reference (str) – Reference genome FASTA (required for CRAM output)
- Returns:
Path to the subset CRAM file.
- Return type:
str
Usage¶
Via CLI¶
# Using region string
grid subset \
--cram input.cram \
--region chr6:160000000-160100000 \
--output subset.cram \
--reference refs/hs37d5.fa
# Using separate coordinates
grid subset \
--cram input.cram \
--chrom chr6 \
--start 160000000 \
--end 160100000 \
--output subset.cram \
--reference refs/hs37d5.fa
Via Python¶
from grid.utils.subset_cram import subset_cram
subset_path = subset_cram(
cram_path="input.cram",
region="chr6:160000000-160100000",
output_path="subset.cram",
reference="refs/hs37d5.fa"
)
Description¶
This utility extracts reads overlapping a specific genomic region:
Uses samtools view to extract region
Maintains CRAM format and compression
Creates index (.crai) for output file
Preserves header information
Use cases:
Testing - Work with smaller files during development
Sharing - Extract specific regions for collaboration
Performance - Faster processing on region of interest
Storage - Reduce file sizes for targeted analysis
Region Format¶
Regions can be specified as:
Full format:
chr6:160000000-160100000Chromosome only:
chr6(entire chromosome)No end:
chr6:160000000(from position to end)
Coordinates are 0-based, following SAM/BAM conventions.
Dependencies¶
samtools - Must be installed and in PATH
pysam - Python library for BAM/CRAM handling
Output¶
Creates two files:
<output>.cram- Subset CRAM file<output>.cram.crai- Index for subset CRAM