Mosdepth Coverage Analysis¶

Overview¶

Mosdepth Coverage Analysis Module¶

Compute per-base or binned read depth coverage across genomic regions using the mosdepth tool.

Purpose¶

This module generates coverage profiles for all samples, which are essential for:

Identifying high-coverage VNTR regions
Normalizing copy number estimates across samples
Quality control and outlier detection

Main Module¶

`run_mosdepth.py`¶

Located in parent directory: grid/utils/run_mosdepth.py

Key Function:

def run_mosdepth(
    cram_dir: str,
    output_file: str,
    ref_fasta: str,
    chrom: str,
    start: int,
    end: int,
    region_name: str,
    work_dir: str = "~/mosdepth_work",
    by: int = 1000,
    fast_mode: bool = True,
    threads: int = 1
) -> None

Parameters:

cram_dir: Directory with CRAM files
output_file: Final aggregated TSV output
ref_fasta: Reference genome FASTA
chrom: Chromosome (e.g., “chr6”)
start: Region start position
end: Region end position
region_name: Name for mosdepth output (e.g., “LPA_VNTR”)
work_dir: Temporary directory for mosdepth outputs
by: Bin size in base pairs (default: 1000)
fast_mode: Use mosdepth –fast-mode (recommended: True)
threads: Parallel processing threads

Output Format:

sample_id       position        depth
sample001       160000000       45.2
sample001       160001000       48.7
sample002       160000000       52.3

Usage¶

Via CLI¶

grid mosdepth \
    --cram-dir data/CRAMs \
    --output-file output/coverage.tsv \
    --ref-fasta refs/hs37d5.fa \
    --chrom chr6 \
    --start 160000000 \
    --end 160100000 \
    --region-name LPA_VNTR \
    --work-dir /tmp/mosdepth \
    --by 1000 \
    --fast \
    --threads 4

As Python Module¶

from grid.utils.run_mosdepth import run_mosdepth

run_mosdepth(
    cram_dir="data/CRAMs",
    output_file="output/coverage.tsv",
    ref_fasta="refs/hs37d5.fa",
    chrom="chr6",
    start=160000000,
    end=160100000,
    region_name="LPA_VNTR",
    by=1000,
    fast_mode=True,
    threads=4
)

Mosdepth Integration¶

This module wraps the mosdepth tool with GRiD-specific configurations.

Mosdepth Command Structure¶

mosdepth \
    --by <bin_size> \
    --chrom <chromosome> \
    --fast-mode \
    --threads <n> \
    <output_prefix> \
    <input.cram>

Fast Mode¶

The --fast-mode flag is enabled by default:

Faster: ~3-5x speed improvement
Trade-off: May miss some edge cases in coverage calculation
Recommended: Use for initial analysis and large datasets

Output Files¶

Intermediate Files (in work_dir)¶

For each sample, mosdepth generates:

<sample>.regions.bed.gz         # Per-region coverage summary
<sample>.mosdepth.global.dist.txt  # Global coverage distribution
<sample>.mosdepth.region.dist.txt  # Region coverage distribution

Final Output¶

Aggregated TSV with columns:

sample_id: Sample identifier (from CRAM filename)
position: Genomic position (bin start)
depth: Mean coverage depth in that bin

Technical Details¶

Algorithm¶

Create BED file for target region
For each CRAM in parallel:
- Run mosdepth with specified parameters
- Parse .regions.bed.gz output
- Extract coverage values for target region
Aggregate all results into single TSV
Clean up intermediate files (optional)

Performance¶

Parallelization: Process multiple CRAMs simultaneously
Memory: Low memory footprint with streaming
Speed: ~1-2 minutes per CRAM for 100kb region
Disk: Intermediate files are small (<1MB per sample)

Dependencies¶

External: mosdepth (must be in PATH)
Python: pandas, subprocess, pathlib

Binning Strategies¶

Per-base (–by 0)¶

--by 0

Most detailed coverage profile
Large output files
Slower processing
Use for small regions (<10kb)

Fixed bins (default)¶

--by 1000  # 1kb bins

Balanced detail vs. speed
Recommended for VNTR analysis
Typical bin sizes: 500-2000bp

Larger bins¶

--by 5000  # 5kb bins

Faster processing
Less detail
Use for genome-wide analysis

Quality Control¶

Expected Coverage¶

For 30X WGS data in a VNTR region:

Single-copy: ~30X coverage
Duplicated: ~60X coverage
High-copy (10+): 300X+ coverage

Outlier Detection¶

Samples with unusual coverage patterns may indicate:

Library preparation issues
Alignment problems
True structural variants

Troubleshooting¶

Mosdepth not found:

Error: mosdepth: command not found

→ Install mosdepth: conda install -c bioconda mosdepth

Empty output:

Check BED coordinates match reference
Verify CRAM has coverage in region
Ensure chromosome naming matches (“chr6” vs “6”)

Slow performance:

Use --fast mode
Increase bin size (--by)
Reduce --threads if I/O limited

Python API Documentation¶

grid.utils.run_mosdepth¶

grid.utils.run_mosdepth.run_mosdepth(cram_dir, output_file, ref_fasta, chrom, start, end, region_name, work_dir, by, fast_mode, threads=1)¶

Run mosdepth on all CRAMs in a directory using parallel processing.

Parameters:

cram_dir (str) – Directory containing CRAM files
output_file (str) – Path to output TSV file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
work_dir (str) – Working directory for intermediate files
by (int) – Bin size for mosdepth
fast_mode (bool) – Whether to use fast mode
threads (int) – Number of threads for parallel processing
region_name (str)

grid.utils.mosdepth_dir¶

grid.utils.mosdepth_dir.build_mosdepth_command(cram_path, ref_fasta, output_prefix, by, fast_mode)¶

Build mosdepth command with appropriate flags.

Parameters:

cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
output_prefix (Path) – Output prefix for mosdepth
by (int) – Bin size for coverage calculation
fast_mode (bool) – Whether to use fast mode

Returns:

List of command arguments

Return type:

List[str]

grid.utils.mosdepth_dir.check_mosdepth_available()¶

Check if mosdepth executable is available in PATH.

Parameters:: None
Returns:: None
Raises:: RuntimeError – If mosdepth is not found.

grid.utils.mosdepth_dir.compute_region_coverage(regions_file, chrom, start, end)¶

Compute average coverage for a genomic region from mosdepth output.

Parameters:

regions_file (Path) – Path to mosdepth regions.bed.gz file
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position

Returns:

Average coverage as integer (rounded and scaled by 100)

Return type:

int

grid.utils.mosdepth_dir.run_mosdepth_single_cram(cram_path, ref_fasta, work_dir, chrom, start, end, region_name, by, fast_mode)¶

Run mosdepth on a single CRAM file and compute coverage for region.

Parameters:

cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
work_dir (Path) – Working directory for intermediate files
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
by (int) – Bin size for mosdepth
fast_mode (bool) – Whether to use fast mode
region_name (str)

Returns:

Tuple of (sample_name, coverage) where coverage is int or “Error”

Return type:

Tuple[str, int | str]

grid.utils.mosdepth_dir.wait_for_mosdepth_output(work_dir, sample_name, max_attempts=3, sleep_seconds=2)¶

Wait for mosdepth output file to appear.

Parameters:

work_dir (Path) – Working directory where mosdepth writes output
sample_name (str) – Sample name to search for
max_attempts (int) – Maximum number of attempts to find file
sleep_seconds (int) – Seconds to wait between attempts

Returns:

Path to regions.bed.gz file

Raises:

FileNotFoundError – If file is not found after max attempts

Return type:

Path

grid.utils.mosdepth_dir.write_coverage_result(output_file, sample_name, coverage, write_lock, progress_console=None)¶

Write coverage result to output file in a thread-safe manner.

Parameters:

output_file (Path) – Path to output TSV file
sample_name (str) – Sample name
coverage (int) – Coverage value
write_lock (lock) – Threading lock for safe file writing
progress_console – Optional Progress console for proper rendering

Return type:

None

Mosdepth Coverage Analysis¶

Overview¶