Mosdepth Coverage Normalization

Overview

Normalize raw coverage data across samples to account for sequencing depth variation and technical artifacts.

Purpose

Raw coverage values vary widely between samples due to:

  • Different sequencing depths (20X vs 60X)

  • Library preparation batch effects

  • GC content bias

  • Mappability differences

This module normalizes coverage using z-score transformation to enable fair comparison across samples.

Main Module

run_normalized_mosdepth.py

Located in parent directory: grid/utils/run_normalized_mosdepth.py

Key Function:

def run_normalize_mosdepth(
    mosdepth_dir: str,
    output_file: str,
    repeat_mask: str,
    chrom: str,
    start: int,
    end: int,
    min_depth: int = 20,
    max_depth: int = 100,
    top_frac: float = 0.1,
    threads: int = 1
) -> None

Parameters:

  • mosdepth_dir: Directory with mosdepth .regions.bed.gz files

  • output_file: Output normalized z-depth file (.gz)

  • repeat_mask: BED file of repeat-masked regions to exclude

  • chrom: Chromosome

  • start: Region start

  • end: Region end

  • min_depth: Minimum mean depth threshold (default: 20)

  • max_depth: Maximum mean depth threshold (default: 100)

  • top_frac: Top fraction of high-variance regions to keep (default: 0.1)

  • threads: Parallel processing threads

Output Format: Gzipped TSV with z-transformed depths:

sample_id       position        z_depth
sample001       160000000       1.23
sample001       160001000       -0.45
sample002       160000000       0.87

Usage

Via CLI
grid normalize-mosdepth \
    --mosdepth-dir output/mosdepth/ \
    --output-file output/normalized.zdepth.gz \
    --repeat-mask refs/repeat_mask.bed \
    --chrom chr6 \
    --start 160000000 \
    --end 160100000 \
    --min-depth 20 \
    --max-depth 100 \
    --top-frac 0.1 \
    --threads 4
As Python Module
from grid.utils.run_normalized_mosdepth import run_normalize_mosdepth

run_normalize_mosdepth(
    mosdepth_dir="output/mosdepth/",
    output_file="output/normalized.zdepth.gz",
    repeat_mask="refs/repeat_mask.bed",
    chrom="chr6",
    start=160000000,
    end=160100000,
    min_depth=20,
    max_depth=100,
    top_frac=0.1,
    threads=4
)

Normalization Algorithm

Step 1: Load Coverage Data
  • Read all mosdepth .regions.bed.gz files

  • Extract coverage for target region

  • Create sample × position matrix
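Step 1 can be sketched with pandas. The helper below is a hypothetical illustration (`load_coverage_matrix` is not part of the module) and assumes windowed mosdepth output with four columns per line: chrom, start, end, mean depth.

```python
import glob
import os
import pandas as pd

def load_coverage_matrix(mosdepth_dir, chrom, start, end):
    """Build a sample x position matrix from mosdepth .regions.bed.gz files.

    Assumes windowed mosdepth output with four tab-separated columns
    (chrom, start, end, mean depth) and no region-name column.
    """
    columns = {}
    for path in glob.glob(os.path.join(mosdepth_dir, "*.regions.bed.gz")):
        sample_id = os.path.basename(path).replace(".regions.bed.gz", "")
        df = pd.read_csv(path, sep="\t", header=None,
                         names=["chrom", "start", "end", "depth"])
        # Keep only windows fully inside the target region
        region = df[(df["chrom"] == chrom)
                    & (df["start"] >= start)
                    & (df["end"] <= end)]
        columns[sample_id] = region.set_index("start")["depth"]
    # Rows = samples, columns = window start positions
    return pd.DataFrame(columns).T

# matrix = load_coverage_matrix("output/mosdepth/", "chr6", 160000000, 160100000)
```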

Step 2: Filter Regions

Repeat Masking:

  • Remove positions overlapping repeat-masked regions

  • Prevents bias from low-complexity sequences

Depth Filtering:

  • Remove positions with mean depth < min_depth (low coverage)

  • Remove positions with mean depth > max_depth (extreme outliers)

Variance Selection:

  • Calculate variance across samples for each position

  • Keep top top_frac of positions by variance

  • Focuses on informative, variable regions
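The depth and variance filters of Step 2 can be sketched with numpy (`filter_positions` is a hypothetical helper, not the module's implementation):

```python
import numpy as np

def filter_positions(mat, min_depth=20, max_depth=100, top_frac=0.1):
    """Apply the Step 2 depth and variance filters.

    mat: samples x positions array of raw depths.
    Returns the sorted column indices that survive both filters.
    """
    # Depth filter: per-position mean across samples must fall in range
    mean_depth = mat.mean(axis=0)
    keep = (mean_depth >= min_depth) & (mean_depth <= max_depth)
    candidates = np.flatnonzero(keep)
    # Variance selection: keep the top_frac most variable positions
    variances = mat[:, candidates].var(axis=0)
    n_top = max(1, int(len(candidates) * top_frac))
    top = candidates[np.argsort(variances)[::-1][:n_top]]
    return np.sort(top)
```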

Step 3: Z-Score Transformation

For each sample independently:

z_depth = (depth - mean_depth) / std_depth

Where:

  • depth: Raw coverage at position

  • mean_depth: Sample’s mean coverage across all positions

  • std_depth: Sample’s standard deviation

Result:

  • Mean = 0, StdDev = 1 for each sample

  • Comparable across samples with different depths
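A minimal numpy sketch of the per-sample z-transformation described above:

```python
import numpy as np

def zscore_rows(mat):
    """Z-transform each sample (row) independently: (depth - mean) / std."""
    mean = mat.mean(axis=1, keepdims=True)
    std = mat.std(axis=1, keepdims=True)
    return (mat - mean) / std

mat = np.array([[20.0, 30.0, 40.0],     # a ~30X sample
                [60.0, 90.0, 120.0]])   # a ~60X sample
z = zscore_rows(mat)
# Each row now has mean 0 and std 1, so the two samples' coverage
# profiles are directly comparable despite different sequencing depths.
```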

Output Interpretation

Z-Depth Values
  • z = 0: Average coverage for this sample

  • z = +2: 2 standard deviations above average (potential duplication)

  • z = -2: 2 standard deviations below average (potential deletion)

  • z > +3: Very high confidence variant

Example Interpretation
Sample    Position      Z-Depth    Interpretation
S001      160050000     2.5        Likely duplication
S001      160051000     2.8        Strong duplication signal
S001      160052000     -0.3       Normal coverage
S002      160050000     0.1        Normal coverage

Sample S001 likely has a duplication in the 160050000-160051000 region.
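Calls like those in the table can be pulled from the output file with pandas (`flag_cnv_candidates` is a hypothetical helper; the column names follow the output format shown earlier):

```python
import pandas as pd

def flag_cnv_candidates(df, threshold=2.0):
    """Flag windows whose |z_depth| meets a CNV threshold."""
    calls = df[df["z_depth"].abs() >= threshold].copy()
    calls["call"] = calls["z_depth"].apply(
        lambda z: "duplication" if z > 0 else "deletion")
    return calls

# The example table above, as a DataFrame:
df = pd.DataFrame({
    "sample_id": ["S001", "S001", "S001", "S002"],
    "position": [160050000, 160051000, 160052000, 160050000],
    "z_depth": [2.5, 2.8, -0.3, 0.1],
})
print(flag_cnv_candidates(df))
```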

Parameter Tuning

min_depth (default: 20)
  • Purpose: Filter low-coverage positions

  • Too low: Include unreliable positions

  • Too high: Remove informative data

  • Recommendation:

    • 30X WGS: use 15-20

    • 60X WGS: use 30-40

max_depth (default: 100)
  • Purpose: Remove extreme outliers

  • Too low: Filter true high-copy variants

  • Too high: Include technical artifacts

  • Recommendation:

    • Depends on expected max copy number

    • LPA KIV-2: 100-150 (max ~50 copies × 2 haplotypes)

top_frac (default: 0.1)
  • Purpose: Focus on most variable positions

  • Range: 0.05 - 0.3

  • Lower values: More stringent, fewer positions

  • Higher values: More positions, may include noise

  • Recommendation:

    • VNTR analysis: 0.1 (10% most variable)

    • CNV discovery: 0.2 (20% most variable)

Repeat Mask BED Format

The repeat mask file should be a standard BED file:

chr6    160000000    160001000    SimpleRepeat
chr6    160005000    160005500    LINE
chr6    160010000    160011000    Satellite

Purpose:

  • Exclude low-complexity sequences

  • Remove problematic alignment regions

  • Focus on informative VNTR sequences

How to generate:

  1. Download RepeatMasker track from UCSC Genome Browser

  2. Filter for target region

  3. Optionally keep only certain repeat types
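A minimal sketch of how the mask is consumed, matching the {chrom: set(kb_bins)} contract of load_repeat_mask documented in the API section, assuming 1 kb bins:

```python
from collections import defaultdict

def load_repeat_mask(repeat_bed):
    """Load a repeat-mask BED into {chrom: set of 1 kb bins}.

    Any window whose 1 kb bin appears in the set can then be
    excluded with a fast membership test.
    """
    excluded = defaultdict(set)
    with open(repeat_bed) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            chrom, start, end = line.split("\t")[:3]
            # Mark every 1 kb bin the repeat interval touches
            for kb in range(int(start) // 1000, int(end) // 1000 + 1):
                excluded[chrom].add(kb)
    return excluded
```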

Technical Details

Performance
  • Parallelization: Sample-level parallel processing

  • Memory: Entire dataset loaded into memory (~GB for 1000 samples)

  • Speed: ~5-10 minutes for 1000 samples, 100kb region

Dependencies
  • pandas, numpy - Data processing

  • scipy - Statistical functions

  • multiprocessing - Parallel execution

Quality Control

Expected Results

After normalization:

  • All samples have mean ≈ 0, std ≈ 1

  • High-copy regions show consistently high z-scores

  • Low-noise regions show z-scores near 0

Diagnostic Plots (manual)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("normalized.zdepth.gz", sep="\t")
sample_means = df.groupby("sample_id")["z_depth"].mean()

plt.hist(sample_means, bins=50)
plt.xlabel("Mean Z-Depth")
plt.ylabel("Samples")
plt.title("Should be centered at 0")
plt.show()

Troubleshooting

All z-scores near zero:

  • Region may not have sufficient variation

  • Try reducing top_frac threshold

  • Check that region contains actual VNTRs

Extreme z-scores (|z| > 10):

  • May indicate technical artifacts

  • Check for alignment issues

  • Verify repeat mask is correct

Memory errors:

  • Reduce number of samples processed at once

  • Increase system RAM

  • Use smaller genomic region

Python API Documentation

grid.utils.run_normalized_mosdepth

grid.utils.run_normalized_mosdepth.run_normalize_mosdepth(mosdepth_dir, output_file, repeat_mask, chrom, start, end, min_depth=20, max_depth=100, top_frac=0.1, threads=1, console=None)

Normalize mosdepth coverage across all samples in a directory.

Parameters:
  • mosdepth_dir (str) – Directory containing mosdepth output files

  • output_file (str) – Path to output normalized .tsv.gz file

  • repeat_mask (str) – Path to repeat mask file

  • chrom (str) – Chromosome name

  • start (int) – Start position

  • end (int) – End position

  • min_depth (int) – Minimum depth threshold for regions

  • max_depth (int) – Maximum depth threshold for regions

  • top_frac (float) – Fraction of high-variance regions to select

  • threads (int) – Number of threads for parallel processing

  • console (Console) – Optional Rich Console for status output

Returns:

None

grid.utils.normalize_mosdepth_dir

grid.utils.normalize_mosdepth_dir._process_one_individual(args_tuple)

Worker that runs in its own process for a single individual. Returns (individual_id, [(start, end, depth), …]).

args_tuple contains: individual_id, global_dist_file, chromosome, start, end, min_depth, max_depth, excluded

grid.utils.normalize_mosdepth_dir.build_depth_matrix(regions_to_extract)

Build depth matrix from extracted regions.

Parameters:

regions_to_extract (dict[str, list[tuple[int, int, float]]]) – Dictionary of sample IDs to region lists

Returns:

Dictionary mapping {sample_id: {(start, end): depth}}

Return type:

dict[str, dict[tuple[int, int], float]]

grid.utils.normalize_mosdepth_dir.build_matrix_from_regions(regions_to_extract, individuals_order=None)

Build numpy matrix from regions_to_extract.

Parameters:
  • regions_to_extract – dict of {individual_id: [(start,end,depth),…]}

  • individuals_order – optional list of individual IDs to order rows

Returns:

A tuple (individuals_order, regions_list, mat) where:

  • individuals_order – list of individual IDs in order

  • regions_list – list of (start, end) tuples in order

  • mat – numpy array of shape (n_individuals, n_regions) with depths

Return type:

tuple

grid.utils.normalize_mosdepth_dir.extract_regions(individuals, chromosome, start, end, min_depth=20, max_depth=100, excluded=None, progress=None, task=None)

Extract depth regions passing filters and not overlapping repeats.

Parameters:
  • individuals (dict[str, Path]) – dict mapping sample IDs to their mosdepth.global.dist file paths

  • chromosome (str) – Chromosome name to extract

  • start (int) – Start position of region

  • end (int) – End position of region

  • min_depth (int) – Minimum depth threshold

  • max_depth (int) – Maximum depth threshold

  • excluded (dict[str, set[int]]) – dict mapping chromosome to set of 1kb bins to exclude (from repeat mask)

  • progress – Optional Rich Progress object for updates

  • task – Optional Rich Progress task ID for updates

Returns:

Dictionary mapping {sample_id: [(start, end, depth), …]} for extracted regions

Return type:

dict[str, list[tuple[int, int, float]]]

grid.utils.normalize_mosdepth_dir.get_individuals(mosdepth_dir)

Map sample IDs to their mosdepth.global.dist files.

Parameters:

mosdepth_dir (str) – Directory containing mosdepth output files

Returns:

Dictionary mapping sample IDs to their global.dist file paths

Return type:

dict[str, Path]

grid.utils.normalize_mosdepth_dir.load_repeat_mask(repeat_bed)

Load repeat regions into {chrom: set(kb_bins)}.

Parameters:

repeat_bed (str) – Path to repeat mask BED file

Returns:

Dictionary mapping chromosomes to sets of excluded kb bins

Return type:

dict[str, set[int]]

grid.utils.normalize_mosdepth_dir.normalize_matrix(mat)

Normalize the depth matrix in two steps:

  1. Within-individual normalization (row-wise): divide each depth by the individual’s mean

  2. Across-individual normalization (column-wise): for each region, compute its mean (mu) and variance (var), transform each depth x to (x - mu) / sqrt(mu) where mu > 0, and record the variance ratio var/mu

Parameters:

mat – numpy array of shape (n_individuals, n_regions) with raw depths

Returns:

A tuple (normalized_mat, variance_ratios) where:

  • normalized_mat – numpy array of same shape with normalized depths

  • variance_ratios – dict of {region_index: variance_ratio}

Return type:

tuple
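A minimal numpy sketch of the two-step scheme described above (an assumed re-implementation for illustration, not the library code):

```python
import numpy as np

def normalize_matrix(mat):
    """Two-step normalization sketch.

    1) Row-wise: divide each depth by the individual's mean.
    2) Column-wise: transform x -> (x - mu) / sqrt(mu) where mu > 0,
       and record the variance ratio var/mu per region.
    """
    # Step 1: within-individual normalization
    row_norm = mat / mat.mean(axis=1, keepdims=True)
    # Step 2: across-individual normalization, per region (column)
    mu = row_norm.mean(axis=0)
    var = row_norm.var(axis=0)
    normalized = np.zeros_like(row_norm)
    variance_ratios = {}
    for j in range(row_norm.shape[1]):
        if mu[j] > 0:
            normalized[:, j] = (row_norm[:, j] - mu[j]) / np.sqrt(mu[j])
            variance_ratios[j] = var[j] / mu[j]
    return normalized, variance_ratios
```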

grid.utils.normalize_mosdepth_dir.regions_bed_gz(global_dist_file)

Get the path to the regions BED file corresponding to a global distribution file.

Parameters:

global_dist_file (Path) – Path to the mosdepth.global.dist.txt file

Returns:

Path to the corresponding .regions.bed.gz file

Return type:

Path

grid.utils.normalize_mosdepth_dir.select_high_variance_regions(variance_ratios, top_frac=0.1)

Select top fraction of regions by variance ratio.

Parameters:
  • variance_ratios (dict[int, float]) – {region_index: variance_ratio}

  • top_frac (float) – fraction of top regions to retain

Returns:

List of selected region indices

Return type:

list[int]

grid.utils.normalize_mosdepth_dir.write_normalized_output(mat, individuals_order, regions_list, selected_indices, output_file)

Write normalized matrix to gzipped TSV.

Parameters:
  • mat (ndarray) – normalized matrix (n_individuals × n_regions)

  • individuals_order (list[str]) – list of sample IDs

  • regions_list (list[tuple[int, int]]) – list of region tuples

  • selected_indices (list[int]) – list of region indices to keep

  • output_file (Path) – Path to write .tsv.gz file