Read Counting Module¶

Overview¶

Read Counting Module¶

Count properly paired reads within specified genomic regions (e.g., LPA KIV-2 VNTR) across multiple CRAM files.

Purpose¶

This module is the first analytical step in the GRiD pipeline. It quantifies how many properly paired, mapped reads overlap with the repeat region of interest. These counts are essential for downstream copy number estimation.

Main Module¶

`count_reads.py`¶

Located in parent directory: grid/utils/count_reads.py

Key Function:

def count_reads(
    cram_dir: str,
    output_file: str,
    ref_fasta: str,
    chrom: str,
    start: int,
    end: int,
    config: str,
    threads: int = 1
) -> None

Parameters:

cram_dir: Directory containing CRAM files
output_file: Output TSV file path
ref_fasta: Reference genome FASTA (for CRAM decoding)
chrom: Chromosome name (e.g., “chr6”)
start: Region start coordinate (0-based)
end: Region end coordinate (0-based)
config: YAML config with SAM flag filters
threads: Parallel processing threads

Output Format:

sample_id       read_count
sample001       1234
sample002       2345
sample003       3456

Usage¶

Via CLI¶

grid count-reads \
    --cram-dir data/CRAMs \
    --output-file output/read_counts.tsv \
    --ref-fasta refs/hs37d5.fa \
    --chrom chr6 \
    --start 160000000 \
    --end 160100000 \
    --config config.yaml \
    --threads 4

As Python Module¶

from grid.utils.count_reads import count_reads

count_reads(
    cram_dir="data/CRAMs",
    output_file="output/counts.tsv",
    ref_fasta="refs/hs37d5.fa",
    chrom="chr6",
    start=160000000,
    end=160100000,
    config="config.yaml",
    threads=4
)

SAM Flag Filtering¶

Reads are filtered based on flags specified in the YAML config:

sam_flags:
  required_flags:
    - PROPER_PAIR    # Read is properly paired
    - READ_MAPPED    # Read is mapped
  excluded_flags:
    - DUPLICATE      # PCR/optical duplicate
    - SECONDARY      # Secondary alignment
    - SUPPLEMENTARY  # Supplementary alignment

Common Required Flags:

PROPER_PAIR (0x2): Both reads in pair properly aligned
READ_MAPPED (0x1): Read is mapped to reference

Common Excluded Flags:

DUPLICATE (0x400): PCR or optical duplicate
SECONDARY (0x100): Not primary alignment
SUPPLEMENTARY (0x800): Supplementary alignment
QCFAIL (0x200): Fails quality checks
UNMAPPED (0x4): Read unmapped

Technical Details¶

Algorithm¶

Discover all CRAM files in specified directory
For each CRAM:
- Ensure .crai index exists
- Open CRAM with reference genome
- Fetch reads overlapping target region
- Filter reads by SAM flags
- Count properly paired reads
Write results to TSV

Performance¶

Parallelization: Uses multiprocessing to process CRAMs concurrently
Memory: Streams reads, minimal memory footprint
Speed: ~30-60 seconds per CRAM (varies by region size and coverage)

Dependencies¶

pysam - CRAM/BAM reading
pandas - Data output
multiprocessing - Parallel execution

Troubleshooting¶

CRAM index missing:

Error: index file <file>.crai does not exist

→ Run grid crai or ensure index files exist

Reference mismatch:

Error: [E::hts_open_format] fail to open file

→ Verify reference FASTA matches CRAM build (hg19/hg38)

No reads found:

Check chromosome naming (“chr6” vs “6”)
Verify coordinates are correct for genome build
Ensure region overlaps with sequencing data

Python API Documentation¶

grid.utils.count_reads¶

grid.utils.count_reads.count_reads(cram_dir, output_file, ref_fasta, chrom, start, end, config_file, threads=1)¶

Count reads for all CRAMs in a directory using parallel processing.

Parameters:

cram_dir (str) – Directory containing CRAM files
output_file (str) – Path to output TSV file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
config_file (str) – Path to YAML config file
threads (int) – Number of threads for parallel processing

grid.utils.count_reads_dir¶

Count reads utilities.

This package contains modular functions for counting reads in directories and processing read count results.

grid.utils.count_reads_dir.count_reads_in_region(cram_file, ref_fasta, chrom, start, end, proper_flags)¶

Count properly paired reads in a genomic region using pysam. Ensures CRAI index exists.

Parameters:

cram_file (str)
ref_fasta (str)
chrom (str)
start (int)
end (int)
proper_flags (set[int])

Return type:

int

grid.utils.count_reads_dir.process_single_cram(cram_path, ref_fasta, chrom, start, end, proper_flags)¶

Process a single CRAM file and count reads in region.

Parameters:

cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
proper_flags (set[int]) – Set of SAM flags to count

Returns:

Tuple of (basename, count) where count is int or “Error”

Return type:

tuple[str, int | str]

Read Counting Module¶

Overview¶