Mosdepth Coverage Analysis

Overview

Mosdepth Coverage Analysis Module

Compute per-base or binned read depth coverage across genomic regions using the mosdepth tool.

Purpose

This module generates coverage profiles for all samples, which are essential for:

  1. Identifying high-coverage VNTR regions

  2. Normalizing copy number estimates across samples

  3. Quality control and outlier detection

Main Module

run_mosdepth.py

Located in parent directory: grid/utils/run_mosdepth.py

Key Function:

def run_mosdepth(
    cram_dir: str,
    output_file: str,
    ref_fasta: str,
    chrom: str,
    start: int,
    end: int,
    region_name: str,
    work_dir: str = "~/mosdepth_work",
    by: int = 1000,
    fast_mode: bool = True,
    threads: int = 1
) -> None

Parameters:

  • cram_dir: Directory with CRAM files

  • output_file: Final aggregated TSV output

  • ref_fasta: Reference genome FASTA

  • chrom: Chromosome (e.g., “chr6”)

  • start: Region start position

  • end: Region end position

  • region_name: Name for mosdepth output (e.g., “LPA_VNTR”)

  • work_dir: Temporary directory for mosdepth outputs

  • by: Bin size in base pairs (default: 1000)

  • fast_mode: Use mosdepth –fast-mode (recommended: True)

  • threads: Parallel processing threads

Output Format:

sample_id       position        depth
sample001       160000000       45.2
sample001       160001000       48.7
sample002       160000000       52.3

Usage

Via CLI
grid mosdepth \
    --cram-dir data/CRAMs \
    --output-file output/coverage.tsv \
    --ref-fasta refs/hs37d5.fa \
    --chrom chr6 \
    --start 160000000 \
    --end 160100000 \
    --region-name LPA_VNTR \
    --work-dir /tmp/mosdepth \
    --by 1000 \
    --fast \
    --threads 4
As Python Module
from grid.utils.run_mosdepth import run_mosdepth

run_mosdepth(
    cram_dir="data/CRAMs",
    output_file="output/coverage.tsv",
    ref_fasta="refs/hs37d5.fa",
    chrom="chr6",
    start=160000000,
    end=160100000,
    region_name="LPA_VNTR",
    by=1000,
    fast_mode=True,
    threads=4
)

Mosdepth Integration

This module wraps the mosdepth tool with GRiD-specific configurations.

Mosdepth Command Structure
mosdepth \
    --by <bin_size> \
    --chrom <chromosome> \
    --fast-mode \
    --threads <n> \
    <output_prefix> \
    <input.cram>
Fast Mode

The --fast-mode flag is enabled by default:

  • Faster: ~3-5x speed improvement

  • Trade-off: May miss some edge cases in coverage calculation

  • Recommended: Use for initial analysis and large datasets

Output Files

Intermediate Files (in work_dir)

For each sample, mosdepth generates:

<sample>.regions.bed.gz         # Per-region coverage summary
<sample>.mosdepth.global.dist.txt  # Global coverage distribution
<sample>.mosdepth.region.dist.txt  # Region coverage distribution
Final Output

Aggregated TSV with columns:

  • sample_id: Sample identifier (from CRAM filename)

  • position: Genomic position (bin start)

  • depth: Mean coverage depth in that bin

Technical Details

Algorithm
  1. Create BED file for target region

  2. For each CRAM in parallel:

    • Run mosdepth with specified parameters

    • Parse .regions.bed.gz output

    • Extract coverage values for target region

  3. Aggregate all results into single TSV

  4. Clean up intermediate files (optional)

Performance
  • Parallelization: Process multiple CRAMs simultaneously

  • Memory: Low memory footprint with streaming

  • Speed: ~1-2 minutes per CRAM for 100kb region

  • Disk: Intermediate files are small (<1MB per sample)

Dependencies
  • External: mosdepth (must be in PATH)

  • Python: pandas, subprocess, pathlib

Binning Strategies

Per-base (–by 0)
--by 0
  • Most detailed coverage profile

  • Large output files

  • Slower processing

  • Use for small regions (<10kb)

Fixed bins (default)
--by 1000  # 1kb bins
  • Balanced detail vs. speed

  • Recommended for VNTR analysis

  • Typical bin sizes: 500-2000bp

Larger bins
--by 5000  # 5kb bins
  • Faster processing

  • Less detail

  • Use for genome-wide analysis

Quality Control

Expected Coverage

For 30X WGS data in a VNTR region:

  • Single-copy: ~30X coverage

  • Duplicated: ~60X coverage

  • High-copy (10+): 300X+ coverage

Outlier Detection

Samples with unusual coverage patterns may indicate:

  • Library preparation issues

  • Alignment problems

  • True structural variants

Troubleshooting

Mosdepth not found:

Error: mosdepth: command not found

→ Install mosdepth: conda install -c bioconda mosdepth

Empty output:

  • Check BED coordinates match reference

  • Verify CRAM has coverage in region

  • Ensure chromosome naming matches (“chr6” vs “6”)

Slow performance:

  • Use --fast mode

  • Increase bin size (--by)

  • Reduce --threads if I/O limited

Python API Documentation

grid.utils.run_mosdepth

grid.utils.run_mosdepth.run_mosdepth(cram_dir, output_file, ref_fasta, chrom, start, end, region_name, work_dir, by, fast_mode, threads=1)

Run mosdepth on all CRAMs in a directory using parallel processing.

Parameters:
  • cram_dir (str) – Directory containing CRAM files

  • output_file (str) – Path to output TSV file

  • ref_fasta (str) – Path to reference genome FASTA

  • chrom (str) – Chromosome name

  • start (int) – Start position

  • end (int) – End position

  • work_dir (str) – Working directory for intermediate files

  • by (int) – Bin size for mosdepth

  • fast_mode (bool) – Whether to use fast mode

  • threads (int) – Number of threads for parallel processing

  • region_name (str)

grid.utils.mosdepth_dir

grid.utils.mosdepth_dir.build_mosdepth_command(cram_path, ref_fasta, output_prefix, by, fast_mode)

Build mosdepth command with appropriate flags.

Parameters:
  • cram_path (str) – Path to CRAM file

  • ref_fasta (str) – Path to reference genome FASTA

  • output_prefix (Path) – Output prefix for mosdepth

  • by (int) – Bin size for coverage calculation

  • fast_mode (bool) – Whether to use fast mode

Returns:

List of command arguments

Return type:

List[str]

grid.utils.mosdepth_dir.check_mosdepth_available()

Check if mosdepth executable is available in PATH.

Parameters:

None

Returns:

None

Raises:

RuntimeError – If mosdepth is not found.

grid.utils.mosdepth_dir.compute_region_coverage(regions_file, chrom, start, end)

Compute average coverage for a genomic region from mosdepth output.

Parameters:
  • regions_file (Path) – Path to mosdepth regions.bed.gz file

  • chrom (str) – Chromosome name

  • start (int) – Start position

  • end (int) – End position

Returns:

Average coverage as integer (rounded and scaled by 100)

Return type:

int

grid.utils.mosdepth_dir.run_mosdepth_single_cram(cram_path, ref_fasta, work_dir, chrom, start, end, region_name, by, fast_mode)

Run mosdepth on a single CRAM file and compute coverage for region.

Parameters:
  • cram_path (str) – Path to CRAM file

  • ref_fasta (str) – Path to reference genome FASTA

  • work_dir (Path) – Working directory for intermediate files

  • chrom (str) – Chromosome name

  • start (int) – Start position

  • end (int) – End position

  • by (int) – Bin size for mosdepth

  • fast_mode (bool) – Whether to use fast mode

  • region_name (str)

Returns:

Tuple of (sample_name, coverage) where coverage is int or “Error”

Return type:

Tuple[str, int | str]

grid.utils.mosdepth_dir.wait_for_mosdepth_output(work_dir, sample_name, max_attempts=3, sleep_seconds=2)

Wait for mosdepth output file to appear.

Parameters:
  • work_dir (Path) – Working directory where mosdepth writes output

  • sample_name (str) – Sample name to search for

  • max_attempts (int) – Maximum number of attempts to find file

  • sleep_seconds (int) – Seconds to wait between attempts

Returns:

Path to regions.bed.gz file

Raises:

FileNotFoundError – If file is not found after max attempts

Return type:

Path

grid.utils.mosdepth_dir.write_coverage_result(output_file, sample_name, coverage, write_lock, progress_console=None)

Write coverage result to output file in a thread-safe manner.

Parameters:
  • output_file (Path) – Path to output TSV file

  • sample_name (str) – Sample name

  • coverage (int) – Coverage value

  • write_lock (lock) – Threading lock for safe file writing

  • progress_console – Optional Progress console for proper rendering

Return type:

None