Read Counting Module¶
Overview¶
Read Counting Module¶
Count properly paired reads within specified genomic regions (e.g., LPA KIV-2 VNTR) across multiple CRAM files.
Purpose¶
This module is the first analytical step in the GRiD pipeline. It quantifies how many properly paired, mapped reads overlap with the repeat region of interest. These counts are essential for downstream copy number estimation.
Main Module¶
count_reads.py¶
Located in parent directory: grid/utils/count_reads.py
Key Function:
def count_reads(
cram_dir: str,
output_file: str,
ref_fasta: str,
chrom: str,
start: int,
end: int,
config: str,
threads: int = 1
) -> None
Parameters:
cram_dir: Directory containing CRAM filesoutput_file: Output TSV file pathref_fasta: Reference genome FASTA (for CRAM decoding)chrom: Chromosome name (e.g., “chr6”)start: Region start coordinate (0-based)end: Region end coordinate (0-based)config: YAML config with SAM flag filtersthreads: Parallel processing threads
Output Format:
sample_id read_count
sample001 1234
sample002 2345
sample003 3456
Usage¶
Via CLI¶
grid count-reads \
--cram-dir data/CRAMs \
--output-file output/read_counts.tsv \
--ref-fasta refs/hs37d5.fa \
--chrom chr6 \
--start 160000000 \
--end 160100000 \
--config config.yaml \
--threads 4
As Python Module¶
from grid.utils.count_reads import count_reads
count_reads(
cram_dir="data/CRAMs",
output_file="output/counts.tsv",
ref_fasta="refs/hs37d5.fa",
chrom="chr6",
start=160000000,
end=160100000,
config="config.yaml",
threads=4
)
SAM Flag Filtering¶
Reads are filtered based on flags specified in the YAML config:
sam_flags:
required_flags:
- PROPER_PAIR # Read is properly paired
- READ_MAPPED # Read is mapped
excluded_flags:
- DUPLICATE # PCR/optical duplicate
- SECONDARY # Secondary alignment
- SUPPLEMENTARY # Supplementary alignment
Common Required Flags:
PROPER_PAIR(0x2): Both reads in pair properly alignedREAD_MAPPED(0x1): Read is mapped to reference
Common Excluded Flags:
DUPLICATE(0x400): PCR or optical duplicateSECONDARY(0x100): Not primary alignmentSUPPLEMENTARY(0x800): Supplementary alignmentQCFAIL(0x200): Fails quality checksUNMAPPED(0x4): Read unmapped
Technical Details¶
Algorithm¶
Discover all CRAM files in specified directory
For each CRAM:
Ensure .crai index exists
Open CRAM with reference genome
Fetch reads overlapping target region
Filter reads by SAM flags
Count properly paired reads
Write results to TSV
Performance¶
Parallelization: Uses multiprocessing to process CRAMs concurrently
Memory: Streams reads, minimal memory footprint
Speed: ~30-60 seconds per CRAM (varies by region size and coverage)
Dependencies¶
pysam- CRAM/BAM readingpandas- Data outputmultiprocessing- Parallel execution
Troubleshooting¶
CRAM index missing:
Error: index file <file>.crai does not exist
→ Run grid crai or ensure index files exist
Reference mismatch:
Error: [E::hts_open_format] fail to open file
→ Verify reference FASTA matches CRAM build (hg19/hg38)
No reads found:
Check chromosome naming (“chr6” vs “6”)
Verify coordinates are correct for genome build
Ensure region overlaps with sequencing data
Python API Documentation¶
grid.utils.count_reads¶
- grid.utils.count_reads.count_reads(cram_dir, output_file, ref_fasta, chrom, start, end, config_file, threads=1)¶
Count reads for all CRAMs in a directory using parallel processing.
- Parameters:
cram_dir (str) – Directory containing CRAM files
output_file (str) – Path to output TSV file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
config_file (str) – Path to YAML config file
threads (int) – Number of threads for parallel processing
grid.utils.count_reads_dir¶
Count reads utilities.
This package contains modular functions for counting reads in directories and processing read count results.
- grid.utils.count_reads_dir.count_reads_in_region(cram_file, ref_fasta, chrom, start, end, proper_flags)¶
Count properly paired reads in a genomic region using pysam. Ensures CRAI index exists.
- Parameters:
cram_file (str)
ref_fasta (str)
chrom (str)
start (int)
end (int)
proper_flags (set[int])
- Return type:
int
- grid.utils.count_reads_dir.process_single_cram(cram_path, ref_fasta, chrom, start, end, proper_flags)¶
Process a single CRAM file and count reads in region.
- Parameters:
cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
proper_flags (set[int]) – Set of SAM flags to count
- Returns:
Tuple of (basename, count) where count is int or “Error”
- Return type:
tuple[str, int | str]