Mosdepth Coverage Normalization¶
Overview¶
Mosdepth Coverage Normalization Module¶
Normalize raw coverage data across samples to account for sequencing depth variation and technical artifacts.
Purpose¶
Raw coverage values vary widely between samples due to:
Different sequencing depths (20X vs 60X)
Library preparation batch effects
GC content bias
Mappability differences
This module normalizes coverage using z-score transformation to enable fair comparison across samples.
Main Module¶
run_normalized_mosdepth.py¶
Located in parent directory: grid/utils/run_normalized_mosdepth.py
Key Function:
def run_normalize_mosdepth(
mosdepth_dir: str,
output_file: str,
repeat_mask: str,
chrom: str,
start: int,
end: int,
min_depth: int = 20,
max_depth: int = 100,
top_frac: float = 0.1,
threads: int = 1
) -> None
Parameters:
mosdepth_dir: Directory with mosdepth.regions.bed.gzfilesoutput_file: Output normalized z-depth file (.gz)repeat_mask: BED file of repeat-masked regions to excludechrom: Chromosomestart: Region startend: Region endmin_depth: Minimum mean depth threshold (default: 20)max_depth: Maximum mean depth threshold (default: 100)top_frac: Top fraction of high-variance regions to keep (default: 0.1)threads: Parallel processing threads
Output Format: Gzipped TSV with z-transformed depths:
sample_id position z_depth
sample001 160000000 1.23
sample001 160001000 -0.45
sample002 160000000 0.87
Usage¶
Via CLI¶
grid normalize-mosdepth \
--mosdepth-dir output/mosdepth/ \
--output-file output/normalized.zdepth.gz \
--repeat-mask refs/repeat_mask.bed \
--chrom chr6 \
--start 160000000 \
--end 160100000 \
--min-depth 20 \
--max-depth 100 \
--top-frac 0.1 \
--threads 4
As Python Module¶
from grid.utils.run_normalized_mosdepth import run_normalize_mosdepth
run_normalize_mosdepth(
mosdepth_dir="output/mosdepth/",
output_file="output/normalized.zdepth.gz",
repeat_mask="refs/repeat_mask.bed",
chrom="chr6",
start=160000000,
end=160100000,
min_depth=20,
max_depth=100,
top_frac=0.1,
threads=4
)
Normalization Algorithm¶
Step 1: Load Coverage Data¶
Read all mosdepth
.regions.bed.gzfilesExtract coverage for target region
Create sample × position matrix
Step 2: Filter Regions¶
Repeat Masking:
Remove positions overlapping repeat-masked regions
Prevents bias from low-complexity sequences
Depth Filtering:
Remove positions with mean depth <
min_depth(low coverage)Remove positions with mean depth >
max_depth(extreme outliers)
Variance Selection:
Calculate variance across samples for each position
Keep top
top_fracof positions by varianceFocuses on informative, variable regions
Step 3: Z-Score Transformation¶
For each sample independently:
z_depth = (depth - mean_depth) / std_depth
Where:
depth: Raw coverage at positionmean_depth: Sample’s mean coverage across all positionsstd_depth: Sample’s standard deviation
Result:
Mean = 0, StdDev = 1 for each sample
Comparable across samples with different depths
Output Interpretation¶
Z-Depth Values¶
z = 0: Average coverage for this sample
z = +2: 2 standard deviations above average (potential duplication)
z = -2: 2 standard deviations below average (potential deletion)
z > +3: Very high confidence variant
Example Interpretation¶
Sample Position Z-Depth Interpretation
S001 160050000 2.5 Likely duplication
S001 160051000 2.8 Strong duplication signal
S001 160052000 -0.3 Normal coverage
S002 160050000 0.1 Normal coverage
Sample S001 likely has a duplication in the 160050000-160051000 region.
Parameter Tuning¶
min_depth (default: 20)¶
Purpose: Filter low-coverage positions
Too low: Include unreliable positions
Too high: Remove informative data
Recommendation:
30X WGS: use 15-20
60X WGS: use 30-40
max_depth (default: 100)¶
Purpose: Remove extreme outliers
Too low: Filter true high-copy variants
Too high: Include technical artifacts
Recommendation:
Depends on expected max copy number
LPA KIV-2: 100-150 (max ~50 copies × 2 haploids)
top_frac (default: 0.1)¶
Purpose: Focus on most variable positions
Range: 0.05 - 0.3
Lower values: More stringent, fewer positions
Higher values: More positions, may include noise
Recommendation:
VNTR analysis: 0.1 (10% most variable)
CNV discovery: 0.2 (20% most variable)
Repeat Mask BED Format¶
The repeat mask file should be a standard BED file:
chr6 160000000 160001000 SimpleRepeat
chr6 160005000 160005500 LINE
chr6 160010000 160011000 Satellite
Purpose:
Exclude low-complexity sequences
Remove problematic alignment regions
Focus on informative VNTR sequences
How to generate:
Download RepeatMasker track from UCSC Genome Browser
Filter for target region
Optionally keep only certain repeat types
Technical Details¶
Performance¶
Parallelization: Sample-level parallel processing
Memory: Entire dataset loaded into memory (~GB for 1000 samples)
Speed: ~5-10 minutes for 1000 samples, 100kb region
Dependencies¶
pandas,numpy- Data processingscipy- Statistical functionsmultiprocessing- Parallel execution
Quality Control¶
Expected Results¶
After normalization:
All samples have mean ≈ 0, std ≈ 1
High-copy regions show consistently high z-scores
Low-noise regions show z-scores near 0
Diagnostic Plots (manual)¶
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("normalized.zdepth.gz", sep="\t")
sample_means = df.groupby("sample_id")["z_depth"].mean()
plt.hist(sample_means, bins=50)
plt.xlabel("Mean Z-Depth")
plt.ylabel("Samples")
plt.title("Should be centered at 0")
plt.show()
Troubleshooting¶
All z-scores near zero:
Region may not have sufficient variation
Try reducing
top_fracthresholdCheck that region contains actual VNTRs
Extreme z-scores (|z| > 10):
May indicate technical artifacts
Check for alignment issues
Verify repeat mask is correct
Memory errors:
Reduce number of samples processed at once
Increase system RAM
Use smaller genomic region
Python API Documentation¶
grid.utils.run_normalized_mosdepth¶
- grid.utils.run_normalized_mosdepth.run_normalize_mosdepth(mosdepth_dir, output_file, repeat_mask, chrom, start, end, min_depth=20, max_depth=100, top_frac=0.1, threads=1, console=None)¶
Normalize mosdepth coverage across all samples in a directory.
- Parameters:
mosdepth_dir (str) – Directory containing mosdepth output files
output_file (str) – Path to output normalized .tsv.gz file
repeat_mask (str) – Path to repeat mask file
chrom (str) – Chromosome name
start (int) – Start position
end (int) – End position
min_depth (int) – Minimum depth threshold for regions
max_depth (int) – Maximum depth threshold for regions
top_frac (float) – Fraction of high-variance regions to select
threads (int) – Number of threads for parallel processing
console (Console)
- Returns:
None
grid.utils.normalize_mosdepth_dir¶
- grid.utils.normalize_mosdepth_dir._process_one_individual(args_tuple)¶
This runs in its own process. Returns (individual_id, [(start,end,depth),…]) args_tuple contains:
individual_id, global_dist_file, chromosome, start, end, min_depth, max_depth, excluded
- grid.utils.normalize_mosdepth_dir.build_depth_matrix(regions_to_extract)¶
Build depth matrix from extracted regions.
- Parameters:
regions_to_extract (dict[str, list[tuple[int, int, float]]]) – Dictionary of sample IDs to region lists
- Returns:
{(start,end): depth}}
- Return type:
Dictionary mapping {sample
- grid.utils.normalize_mosdepth_dir.build_matrix_from_regions(regions_to_extract, individuals_order=None)¶
Build numpy matrix from regions_to_extract.
- Parameters:
regions_to_extract – dict of {individual_id: [(start,end,depth),…]}
individuals_order – optional list of individual IDs to order rows
- Returns:
list of individual IDs in order regions_list: list of (start,end) tuples in order mat: numpy array of shape (n_individuals, n_regions) with depths
- Return type:
individuals_order
- grid.utils.normalize_mosdepth_dir.extract_regions(individuals, chromosome, start, end, min_depth=20, max_depth=100, excluded=None, progress=None, task=None)¶
Extract depth regions passing filters and not overlapping repeats.
- Parameters:
individuals (dict[str, Path]) – dict mapping sample IDs to their mosdepth.global.dist file paths
chromosome (str) – Chromosome name to extract
start (int) – Start position of region
end (int) – End position of region
min_depth (int) – Minimum depth threshold
max_depth (int) – Maximum depth threshold
excluded (dict[str, set[int]]) – dict mapping chromosome to set of 1kb bins to exclude (from repeat mask)
progress – Optional Rich Progress object for updates
task – Optional Rich Progress task ID for updates
- Returns:
[(start, end, depth), …]} for extracted regions
- Return type:
Dictionary mapping {sample_id
- grid.utils.normalize_mosdepth_dir.get_individuals(mosdepth_dir)¶
Map sample IDs to their mosdepth.global.dist files.
- Parameters:
mosdepth_dir (str) – Directory containing mosdepth output files
- Returns:
Dictionary mapping sample IDs to their global.dist file paths
- Return type:
dict[str, Path]
- grid.utils.normalize_mosdepth_dir.load_repeat_mask(repeat_bed)¶
Load repeat regions into {chrom: set(kb_bins)}.
- Parameters:
repeat_bed (str) – Path to repeat mask BED file
- Returns:
Dictionary mapping chromosomes to sets of excluded kb bins
- Return type:
dict[str, set[int]]
- grid.utils.normalize_mosdepth_dir.normalize_matrix(mat)¶
Normalize the depth matrix in two steps: 1) Within-individual normalization (row-wise): divide each depth by the individual’s mean 2) Across-individual normalization (column-wise): for each region, compute mean (mu) and variance (var)
Transform each depth x to (x - mu) / sqrt(mu) where mu > 0
Also compute variance ratio var/mu for each region.
- Parameters:
mat – numpy array of shape (n_individuals, n_regions) with raw depths
- Returns:
numpy array of same shape with normalized depths variance_ratios: dict of {region_index: variance_ratio}
- Return type:
normalized_mat
- grid.utils.normalize_mosdepth_dir.regions_bed_gz(global_dist_file)¶
Get the path to the regions BED file corresponding to a global distribution file.
- Parameters:
global_dist_file (Path) – Path to the mosdepth.global.dist.txt file
- Returns:
Path to the corresponding .regions.bed.gz file
- Return type:
Path
- grid.utils.normalize_mosdepth_dir.select_high_variance_regions(variance_ratios, top_frac=0.1)¶
Select top fraction of regions by variance ratio.
- Parameters:
variance_ratios (dict[int, float]) – {region_index: variance_ratio}
top_frac (float) – fraction of top regions to retain
- Returns:
List of selected region indices
- Return type:
list[int]
- grid.utils.normalize_mosdepth_dir.write_normalized_output(mat, individuals_order, regions_list, selected_indices, output_file)¶
Write normalized matrix to gzipped TSV.
- Parameters:
mat (ndarray) – normalized matrix (n_individuals × n_regions)
individuals_order (list[str]) – list of sample IDs
regions_list (list[tuple[int, int]]) – list of region tuples
selected_indices (list[int]) – list of region indices to keep
output_file (Path) – Path to write .tsv.gz file