Mosdepth Coverage Normalization

Overview

Normalize raw coverage data across samples to account for sequencing depth variation and technical artifacts.

Purpose

Raw coverage values vary widely between samples due to:

  • Different sequencing depths (20X vs 60X)

  • Library preparation batch effects

  • GC content bias

  • Mappability differences

This module normalizes coverage using z-score transformation to enable fair comparison across samples.

Main Module

run_normalized_mosdepth.py

Located in parent directory: grid/utils/run_normalized_mosdepth.py

Key Function:

def run_normalize_mosdepth(
    mosdepth_dir: str,
    output_file: str,
    repeat_mask: str,
    chrom: str,
    start: int,
    end: int,
    min_depth: int = 20,
    max_depth: int = 100,
    top_frac: float = 0.1,
    threads: int = 1
) -> None

Parameters:

  • mosdepth_dir: Directory with mosdepth .regions.bed.gz files

  • output_file: Output normalized z-depth file (.gz)

  • repeat_mask: BED file of repeat-masked regions to exclude

  • chrom: Chromosome

  • start: Region start

  • end: Region end

  • min_depth: Minimum mean depth threshold (default: 20)

  • max_depth: Maximum mean depth threshold (default: 100)

  • top_frac: Top fraction of high-variance regions to keep (default: 0.1)

  • threads: Parallel processing threads

Output Format: Gzipped TSV with z-transformed depths:

sample_id       position        z_depth
sample001       160000000       1.23
sample001       160001000       -0.45
sample002       160000000       0.87

Usage

Via CLI
grid normalize-mosdepth \
    --mosdepth-dir output/mosdepth/ \
    --output-file output/normalized.zdepth.gz \
    --repeat-mask refs/repeat_mask.bed \
    --chrom chr6 \
    --start 160000000 \
    --end 160100000 \
    --min-depth 20 \
    --max-depth 100 \
    --top-frac 0.1 \
    --threads 4
As Python Module
from grid.utils.run_normalized_mosdepth import run_normalize_mosdepth

run_normalize_mosdepth(
    mosdepth_dir="output/mosdepth/",
    output_file="output/normalized.zdepth.gz",
    repeat_mask="refs/repeat_mask.bed",
    chrom="chr6",
    start=160000000,
    end=160100000,
    min_depth=20,
    max_depth=100,
    top_frac=0.1,
    threads=4
)

Normalization Algorithm

Step 1: Load Coverage Data
  • Read all mosdepth .regions.bed.gz files

  • Extract coverage for target region

  • Create sample × position matrix
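Step 1 can be sketched with pandas. The helper below is a hypothetical illustration (`load_coverage_matrix` is not part of the module) and assumes windowed mosdepth output with four columns per line: chrom, start, end, mean depth.

```python
import glob
import os
import pandas as pd

def load_coverage_matrix(mosdepth_dir, chrom, start, end):
    """Build a sample x position matrix from mosdepth .regions.bed.gz files.

    Assumes windowed mosdepth output with four tab-separated columns
    (chrom, start, end, mean depth) and no region-name column.
    """
    columns = {}
    for path in glob.glob(os.path.join(mosdepth_dir, "*.regions.bed.gz")):
        sample_id = os.path.basename(path).replace(".regions.bed.gz", "")
        df = pd.read_csv(path, sep="\t", header=None,
                         names=["chrom", "start", "end", "depth"])
        # Keep only windows fully inside the target region
        region = df[(df["chrom"] == chrom)
                    & (df["start"] >= start)
                    & (df["end"] <= end)]
        columns[sample_id] = region.set_index("start")["depth"]
    # Rows = samples, columns = window start positions
    return pd.DataFrame(columns).T

# matrix = load_coverage_matrix("output/mosdepth/", "chr6", 160000000, 160100000)
```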

Step 2: Filter Regions

Repeat Masking:

  • Remove positions overlapping repeat-masked regions

  • Prevents bias from low-complexity sequences

Depth Filtering:

  • Remove positions with mean depth < min_depth (low coverage)

  • Remove positions with mean depth > max_depth (extreme outliers)

Variance Selection:

  • Calculate variance across samples for each position

  • Keep top top_frac of positions by variance

  • Focuses on informative, variable regions
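The depth and variance filters of Step 2 can be sketched with numpy (`filter_positions` is a hypothetical helper, not the module's implementation):

```python
import numpy as np

def filter_positions(mat, min_depth=20, max_depth=100, top_frac=0.1):
    """Apply the Step 2 depth and variance filters.

    mat: samples x positions array of raw depths.
    Returns the sorted column indices that survive both filters.
    """
    # Depth filter: per-position mean across samples must fall in range
    mean_depth = mat.mean(axis=0)
    keep = (mean_depth >= min_depth) & (mean_depth <= max_depth)
    candidates = np.flatnonzero(keep)
    # Variance selection: keep the top_frac most variable positions
    variances = mat[:, candidates].var(axis=0)
    n_top = max(1, int(len(candidates) * top_frac))
    top = candidates[np.argsort(variances)[::-1][:n_top]]
    return np.sort(top)
```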

Step 3: Z-Score Transformation

For each sample independently:

z_depth = (depth - mean_depth) / std_depth

Where:

  • depth: Raw coverage at position

  • mean_depth: Sample’s mean coverage across all positions

  • std_depth: Sample’s standard deviation

Result:

  • Mean = 0, StdDev = 1 for each sample

  • Comparable across samples with different depths
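A minimal numpy sketch of the per-sample z-transformation described above:

```python
import numpy as np

def zscore_rows(mat):
    """Z-transform each sample (row) independently: (depth - mean) / std."""
    mean = mat.mean(axis=1, keepdims=True)
    std = mat.std(axis=1, keepdims=True)
    return (mat - mean) / std

mat = np.array([[20.0, 30.0, 40.0],     # a ~30X sample
                [60.0, 90.0, 120.0]])   # a ~60X sample
z = zscore_rows(mat)
# Each row now has mean 0 and std 1, so the two samples' coverage
# profiles are directly comparable despite different sequencing depths.
```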

Output Interpretation

Z-Depth Values
  • z = 0: Average coverage for this sample

  • z = +2: 2 standard deviations above average (potential duplication)

  • z = -2: 2 standard deviations below average (potential deletion)

  • z > +3: Very high confidence variant

Example Interpretation
Sample    Position      Z-Depth    Interpretation
S001      160050000     2.5        Likely duplication
S001      160051000     2.8        Strong duplication signal
S001      160052000     -0.3       Normal coverage
S002      160050000     0.1        Normal coverage

Sample S001 likely has a duplication in the 160050000-160051000 region.
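Calls like those in the table can be pulled from the output file with pandas (`flag_cnv_candidates` is a hypothetical helper; the column names follow the output format shown earlier):

```python
import pandas as pd

def flag_cnv_candidates(df, threshold=2.0):
    """Flag windows whose |z_depth| meets a CNV threshold."""
    calls = df[df["z_depth"].abs() >= threshold].copy()
    calls["call"] = calls["z_depth"].apply(
        lambda z: "duplication" if z > 0 else "deletion")
    return calls

# The example table above, as a DataFrame:
df = pd.DataFrame({
    "sample_id": ["S001", "S001", "S001", "S002"],
    "position": [160050000, 160051000, 160052000, 160050000],
    "z_depth": [2.5, 2.8, -0.3, 0.1],
})
print(flag_cnv_candidates(df))
```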

Parameter Tuning

min_depth (default: 20)
  • Purpose: Filter low-coverage positions

  • Too low: Include unreliable positions

  • Too high: Remove informative data

  • Recommendation:

    • 30X WGS: use 15-20

    • 60X WGS: use 30-40

max_depth (default: 100)
  • Purpose: Remove extreme outliers

  • Too low: Filter true high-copy variants

  • Too high: Include technical artifacts

  • Recommendation:

    • Depends on expected max copy number

    • LPA KIV-2: 100-150 (max ~50 copies × 2 haplotypes)

top_frac (default: 0.1)
  • Purpose: Focus on most variable positions

  • Range: 0.05 - 0.3

  • Lower values: More stringent, fewer positions

  • Higher values: More positions, may include noise

  • Recommendation:

    • VNTR analysis: 0.1 (10% most variable)

    • CNV discovery: 0.2 (20% most variable)

Repeat Mask BED Format

The repeat mask file should be a standard BED file:

chr6    160000000    160001000    SimpleRepeat
chr6    160005000    160005500    LINE
chr6    160010000    160011000    Satellite

Purpose:

  • Exclude low-complexity sequences

  • Remove problematic alignment regions

  • Focus on informative VNTR sequences

How to generate:

  1. Download RepeatMasker track from UCSC Genome Browser

  2. Filter for target region

  3. Optionally keep only certain repeat types
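A minimal sketch of how the mask is consumed, matching the {chrom: set(kb_bins)} contract of load_repeat_mask documented in the API section, assuming 1 kb bins:

```python
from collections import defaultdict

def load_repeat_mask(repeat_bed):
    """Load a repeat-mask BED into {chrom: set of 1 kb bins}.

    Any window whose 1 kb bin appears in the set can then be
    excluded with a fast membership test.
    """
    excluded = defaultdict(set)
    with open(repeat_bed) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            chrom, start, end = line.split("\t")[:3]
            # Mark every 1 kb bin the repeat interval touches
            for kb in range(int(start) // 1000, int(end) // 1000 + 1):
                excluded[chrom].add(kb)
    return excluded
```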

Technical Details

Performance
  • Parallelization: Sample-level parallel processing

  • Memory: Entire dataset loaded into memory (~GB for 1000 samples)

  • Speed: ~5-10 minutes for 1000 samples, 100kb region

Dependencies
  • pandas, numpy - Data processing

  • scipy - Statistical functions

  • multiprocessing - Parallel execution

Quality Control

Expected Results

After normalization:

  • All samples have mean ≈ 0, std ≈ 1

  • High-copy regions show consistently high z-scores

  • Low-noise regions show z-scores near 0

Diagnostic Plots (manual)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("normalized.zdepth.gz", sep="\t")
sample_means = df.groupby("sample_id")["z_depth"].mean()

plt.hist(sample_means, bins=50)
plt.xlabel("Mean Z-Depth")
plt.ylabel("Samples")
plt.title("Should be centered at 0")
plt.show()

Troubleshooting

All z-scores near zero:

  • Region may not have sufficient variation

  • Try reducing top_frac threshold

  • Check that region contains actual VNTRs

Extreme z-scores (|z| > 10):

  • May indicate technical artifacts

  • Check for alignment issues

  • Verify repeat mask is correct

Memory errors:

  • Reduce number of samples processed at once

  • Increase system RAM

  • Use smaller genomic region

Python API Documentation

grid.utils.run_normalized_mosdepth

grid.utils.run_normalized_mosdepth.run_normalize_mosdepth(mosdepth_dir, output_file, repeat_mask, chrom, start, end, min_depth=20, max_depth=100, top_frac=0.1, threads=1, console=None)

Normalize mosdepth coverage across all samples in a directory.

Parameters:
  • mosdepth_dir (str) – Directory containing mosdepth output files

  • output_file (str) – Path to output normalized .tsv.gz file

  • repeat_mask (str) – Path to repeat mask file

  • chrom (str) – Chromosome name

  • start (int) – Start position

  • end (int) – End position

  • min_depth (int) – Minimum depth threshold for regions

  • max_depth (int) – Maximum depth threshold for regions

  • top_frac (float) – Fraction of high-variance regions to select

  • threads (int) – Number of threads for parallel processing

  • console (Console) – Optional Rich Console for status output

Returns:

None

grid.utils.normalize_mosdepth_dir

grid.utils.normalize_mosdepth_dir._process_one_individual(args_tuple)

Worker that runs in its own process for a single individual. Returns (individual_id, [(start, end, depth), …]).

args_tuple contains: individual_id, global_dist_file, chromosome, start, end, min_depth, max_depth, excluded

grid.utils.normalize_mosdepth_dir.build_depth_matrix(regions_to_extract)

Build depth matrix from extracted regions.

Parameters:

regions_to_extract (dict[str, list[tuple[int, int, float]]]) – Dictionary of sample IDs to region lists

Returns:

Dictionary mapping {sample_id: {(start, end): depth}}

Return type:

dict[str, dict[tuple[int, int], float]]

grid.utils.normalize_mosdepth_dir.build_matrix_from_regions(regions_to_extract, individuals_order=None)

Build numpy matrix from regions_to_extract.

Parameters:
  • regions_to_extract – dict of {individual_id: [(start,end,depth),…]}

  • individuals_order – optional list of individual IDs to order rows

Returns:

A tuple (individuals_order, regions_list, mat) where:

  • individuals_order – list of individual IDs in order

  • regions_list – list of (start, end) tuples in order

  • mat – numpy array of shape (n_individuals, n_regions) with depths

Return type:

tuple

grid.utils.normalize_mosdepth_dir.extract_regions(individuals, chromosome, start, end, min_depth=20, max_depth=100, excluded=None, progress=None, task=None)

Extract depth regions passing filters and not overlapping repeats.

Parameters:
  • individuals (dict[str, Path]) – dict mapping sample IDs to their mosdepth.global.dist file paths

  • chromosome (str) – Chromosome name to extract

  • start (int) – Start position of region

  • end (int) – End position of region

  • min_depth (int) – Minimum depth threshold

  • max_depth (int) – Maximum depth threshold

  • excluded (dict[str, set[int]]) – dict mapping chromosome to set of 1kb bins to exclude (from repeat mask)

  • progress – Optional Rich Progress object for updates

  • task – Optional Rich Progress task ID for updates

Returns:

Dictionary mapping {sample_id: [(start, end, depth), …]} for extracted regions

Return type:

dict[str, list[tuple[int, int, float]]]

grid.utils.normalize_mosdepth_dir.get_individuals(mosdepth_dir)

Map sample IDs to their mosdepth.global.dist files.

Parameters:

mosdepth_dir (str) – Directory containing mosdepth output files

Returns:

Dictionary mapping sample IDs to their global.dist file paths

Return type:

dict[str, Path]

grid.utils.normalize_mosdepth_dir.load_repeat_mask(repeat_bed)

Load repeat regions into {chrom: set(kb_bins)}.

Parameters:

repeat_bed (str) – Path to repeat mask BED file

Returns:

Dictionary mapping chromosomes to sets of excluded kb bins

Return type:

dict[str, set[int]]

grid.utils.normalize_mosdepth_dir.normalize_matrix(mat)

Normalize the depth matrix in two steps:

  1. Within-individual normalization (row-wise): divide each depth by the individual’s mean

  2. Across-individual normalization (column-wise): for each region, compute its mean (mu) and variance (var), transform each depth x to (x - mu) / sqrt(mu) where mu > 0, and record the variance ratio var/mu

Parameters:

mat – numpy array of shape (n_individuals, n_regions) with raw depths

Returns:

A tuple (normalized_mat, variance_ratios) where:

  • normalized_mat – numpy array of same shape with normalized depths

  • variance_ratios – dict of {region_index: variance_ratio}

Return type:

tuple
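A minimal numpy sketch of the two-step scheme described above (an assumed re-implementation for illustration, not the library code):

```python
import numpy as np

def normalize_matrix(mat):
    """Two-step normalization sketch.

    1) Row-wise: divide each depth by the individual's mean.
    2) Column-wise: transform x -> (x - mu) / sqrt(mu) where mu > 0,
       and record the variance ratio var/mu per region.
    """
    # Step 1: within-individual normalization
    row_norm = mat / mat.mean(axis=1, keepdims=True)
    # Step 2: across-individual normalization, per region (column)
    mu = row_norm.mean(axis=0)
    var = row_norm.var(axis=0)
    normalized = np.zeros_like(row_norm)
    variance_ratios = {}
    for j in range(row_norm.shape[1]):
        if mu[j] > 0:
            normalized[:, j] = (row_norm[:, j] - mu[j]) / np.sqrt(mu[j])
            variance_ratios[j] = var[j] / mu[j]
    return normalized, variance_ratios
```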

grid.utils.normalize_mosdepth_dir.regions_bed_gz(global_dist_file)

Get the path to the regions BED file corresponding to a global distribution file.

Parameters:

global_dist_file (Path) – Path to the mosdepth.global.dist.txt file

Returns:

Path to the corresponding .regions.bed.gz file

Return type:

Path

grid.utils.normalize_mosdepth_dir.select_high_variance_regions(variance_ratios, top_frac=0.1)

Select top fraction of regions by variance ratio.

Parameters:
  • variance_ratios (dict[int, float]) – {region_index: variance_ratio}

  • top_frac (float) – fraction of top regions to retain

Returns:

List of selected region indices

Return type:

list[int]

grid.utils.normalize_mosdepth_dir.write_normalized_output(mat, individuals_order, regions_list, selected_indices, output_file)

Write normalized matrix to gzipped TSV.

Parameters:
  • mat (ndarray) – normalized matrix (n_individuals × n_regions)

  • individuals_order (list[str]) – list of sample IDs

  • regions_list (list[tuple[int, int]]) – list of region tuples

  • selected_indices (list[int]) – list of region indices to keep

  • output_file (Path) – Path to write .tsv.gz file