Mosdepth Coverage Analysis¶
Overview¶
Mosdepth Coverage Analysis Module¶
Compute per-base or binned read depth coverage across genomic regions using the mosdepth tool.
Purpose¶
This module generates coverage profiles for all samples, which are essential for:
Identifying high-coverage VNTR regions
Normalizing copy number estimates across samples
Quality control and outlier detection
Main Module¶
run_mosdepth.py¶
Located in parent directory: grid/utils/run_mosdepth.py
Key Function:
def run_mosdepth(
cram_dir: str,
output_file: str,
ref_fasta: str,
chrom: str,
start: int,
end: int,
region_name: str,
work_dir: str = "~/mosdepth_work",
by: int = 1000,
fast_mode: bool = True,
threads: int = 1
) -> None
Parameters:
cram_dir: Directory with CRAM filesoutput_file: Final aggregated TSV outputref_fasta: Reference genome FASTAchrom: Chromosome (e.g., “chr6”)start: Region start positionend: Region end positionregion_name: Name for mosdepth output (e.g., “LPA_VNTR”)work_dir: Temporary directory for mosdepth outputsby: Bin size in base pairs (default: 1000)fast_mode: Use mosdepth –fast-mode (recommended: True)threads: Parallel processing threads
Output Format:
sample_id position depth
sample001 160000000 45.2
sample001 160001000 48.7
sample002 160000000 52.3
Usage¶
Via CLI¶
grid mosdepth \
--cram-dir data/CRAMs \
--output-file output/coverage.tsv \
--ref-fasta refs/hs37d5.fa \
--chrom chr6 \
--start 160000000 \
--end 160100000 \
--region-name LPA_VNTR \
--work-dir /tmp/mosdepth \
--by 1000 \
--fast \
--threads 4
As Python Module¶
from grid.utils.run_mosdepth import run_mosdepth
run_mosdepth(
cram_dir="data/CRAMs",
output_file="output/coverage.tsv",
ref_fasta="refs/hs37d5.fa",
chrom="chr6",
start=160000000,
end=160100000,
region_name="LPA_VNTR",
by=1000,
fast_mode=True,
threads=4
)
Mosdepth Integration¶
This module wraps the mosdepth tool with GRiD-specific configurations.
Mosdepth Command Structure¶
mosdepth \
--by <bin_size> \
--chrom <chromosome> \
--fast-mode \
--threads <n> \
<output_prefix> \
<input.cram>
Fast Mode¶
The --fast-mode flag is enabled by default:
Faster: ~3-5x speed improvement
Trade-off: May miss some edge cases in coverage calculation
Recommended: Use for initial analysis and large datasets
Output Files¶
Intermediate Files (in work_dir)¶
For each sample, mosdepth generates:
<sample>.regions.bed.gz # Per-region coverage summary
<sample>.mosdepth.global.dist.txt # Global coverage distribution
<sample>.mosdepth.region.dist.txt # Region coverage distribution
Final Output¶
Aggregated TSV with columns:
sample_id: Sample identifier (from CRAM filename)position: Genomic position (bin start)depth: Mean coverage depth in that bin
Technical Details¶
Algorithm¶
Create BED file for target region
For each CRAM in parallel:
Run mosdepth with specified parameters
Parse
.regions.bed.gzoutputExtract coverage values for target region
Aggregate all results into single TSV
Clean up intermediate files (optional)
Performance¶
Parallelization: Process multiple CRAMs simultaneously
Memory: Low memory footprint with streaming
Speed: ~1-2 minutes per CRAM for 100kb region
Disk: Intermediate files are small (<1MB per sample)
Dependencies¶
External:
mosdepth(must be in PATH)Python:
pandas,subprocess,pathlib
Binning Strategies¶
Per-base (–by 0)¶
--by 0
Most detailed coverage profile
Large output files
Slower processing
Use for small regions (<10kb)
Fixed bins (default)¶
--by 1000 # 1kb bins
Balanced detail vs. speed
Recommended for VNTR analysis
Typical bin sizes: 500-2000bp
Larger bins¶
--by 5000 # 5kb bins
Faster processing
Less detail
Use for genome-wide analysis
Quality Control¶
Expected Coverage¶
For 30X WGS data in a VNTR region:
Single-copy: ~30X coverage
Duplicated: ~60X coverage
High-copy (10+): 300X+ coverage
Outlier Detection¶
Samples with unusual coverage patterns may indicate:
Library preparation issues
Alignment problems
True structural variants
Troubleshooting¶
Mosdepth not found:
Error: mosdepth: command not found
→ Install mosdepth: conda install -c bioconda mosdepth
Empty output:
Check BED coordinates match reference
Verify CRAM has coverage in region
Ensure chromosome naming matches (“chr6” vs “6”)
Slow performance:
Use
--fastmodeIncrease bin size (
--by)Reduce
--threadsif I/O limited
Python API Documentation¶
grid.utils.run_mosdepth¶
- grid.utils.run_mosdepth.run_mosdepth(cram_dir, output_file, ref_fasta, chrom, start, end, region_name, work_dir, by, fast_mode, threads=1)¶
Run mosdepth on all CRAMs in a directory using parallel processing.
- Parameters:
cram_dir (str) – Directory containing CRAM files
output_file (str) – Path to output TSV file
ref_fasta (str) – Path to reference genome FASTA
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
work_dir (str) – Working directory for intermediate files
by (int) – Bin size for mosdepth
fast_mode (bool) – Whether to use fast mode
threads (int) – Number of threads for parallel processing
region_name (str)
grid.utils.mosdepth_dir¶
- grid.utils.mosdepth_dir.build_mosdepth_command(cram_path, ref_fasta, output_prefix, by, fast_mode)¶
Build mosdepth command with appropriate flags.
- Parameters:
cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
output_prefix (Path) – Output prefix for mosdepth
by (int) – Bin size for coverage calculation
fast_mode (bool) – Whether to use fast mode
- Returns:
List of command arguments
- Return type:
List[str]
- grid.utils.mosdepth_dir.check_mosdepth_available()¶
Check if mosdepth executable is available in PATH.
- Parameters:
None
- Returns:
None
- Raises:
RuntimeError – If mosdepth is not found.
- grid.utils.mosdepth_dir.compute_region_coverage(regions_file, chrom, start, end)¶
Compute average coverage for a genomic region from mosdepth output.
- Parameters:
regions_file (Path) – Path to mosdepth regions.bed.gz file
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
- Returns:
Average coverage as integer (rounded and scaled by 100)
- Return type:
int
- grid.utils.mosdepth_dir.run_mosdepth_single_cram(cram_path, ref_fasta, work_dir, chrom, start, end, region_name, by, fast_mode)¶
Run mosdepth on a single CRAM file and compute coverage for region.
- Parameters:
cram_path (str) – Path to CRAM file
ref_fasta (str) – Path to reference genome FASTA
work_dir (Path) – Working directory for intermediate files
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
by (int) – Bin size for mosdepth
fast_mode (bool) – Whether to use fast mode
region_name (str)
- Returns:
Tuple of (sample_name, coverage) where coverage is int or “Error”
- Return type:
Tuple[str, int | str]
- grid.utils.mosdepth_dir.wait_for_mosdepth_output(work_dir, sample_name, max_attempts=3, sleep_seconds=2)¶
Wait for mosdepth output file to appear.
- Parameters:
work_dir (Path) – Working directory where mosdepth writes output
sample_name (str) – Sample name to search for
max_attempts (int) – Maximum number of attempts to find file
sleep_seconds (int) – Seconds to wait between attempts
- Returns:
Path to regions.bed.gz file
- Raises:
FileNotFoundError – If file is not found after max attempts
- Return type:
Path
- grid.utils.mosdepth_dir.write_coverage_result(output_file, sample_name, coverage, write_lock, progress_console=None)¶
Write coverage result to output file in a thread-safe manner.
- Parameters:
output_file (Path) – Path to output TSV file
sample_name (str) – Sample name
coverage (int) – Coverage value
write_lock (lock) – Threading lock for safe file writing
progress_console – Optional Progress console for proper rendering
- Return type:
None