Neighbor Finding Module¶

Overview¶

Neighbor Finding Module¶

Identify nearest neighbors among samples based on genome-wide normalized coverage patterns. Used for sample-specific normalization in copy number estimation.

Purpose¶

Copy number estimation benefits from comparing each sample to genetically similar individuals rather than the entire cohort. This module finds the most similar samples based on coverage profiles, which:

Reduces batch effects - Similar samples likely sequenced together
Improves accuracy - Better baseline for CN estimation
Handles population structure - Accounts for ancestry-related variation
Minimizes noise - Related samples have similar technical characteristics

Main Module¶

`find_neighbors.py`¶

Located in parent directory: grid/utils/find_neighbors.py

Key Function:

def find_neighbors(
    input_file: str,
    output_file: str,
    zmax: float = 2.0,
    n_neighbors: int = 500,
    sigma2_max: float = 1000.0
) -> None

Parameters:

input_file: Normalized z-depth file from normalize_mosdepth (.gz)
output_file: Output neighbor file (auto-suffixed with .zMax{zmax}.txt.gz)
zmax: Maximum z-score clipping threshold (default: 2.0)
n_neighbors: Number of nearest neighbors to find (default: 500)
sigma2_max: Maximum variance threshold for region filtering (default: 1000.0)

Output Format:

sample_id       neighbor_1      neighbor_2      neighbor_3      ...
sample001       sample042       sample103       sample205       ...
sample002       sample073       sample192       sample301       ...

Each row contains a sample and its N nearest neighbors, ranked by similarity.

Usage¶

Via CLI¶

grid find-neighbors \
    --input-file output/normalized.zdepth.gz \
    --output-file output/neighbors.txt \
    --zmax 2.0 \
    --n-neighbors 500 \
    --sigma2-max 1000.0

The output will be automatically named: output/neighbors.zMax2.0.txt.gz

As Python Module¶

from grid.utils.find_neighbors import find_neighbors

find_neighbors(
    input_file="output/normalized.zdepth.gz",
    output_file="output/neighbors.txt",
    zmax=2.0,
    n_neighbors=500,
    sigma2_max=1000.0
)

Algorithm Details¶

Step 1: Load and Filter Data¶

Load z-depths: Read normalized coverage matrix
Clip outliers: Cap z-scores at ±zmax to reduce outlier influence
Filter high-variance regions: Remove positions with variance > sigma2_max

Z-clipping example:

Original:  [-3.2, -0.5, 1.2, 5.8, 0.3]
After clip (zmax=2): [-2.0, -0.5, 1.2, 2.0, 0.3]

Step 2: Compute Pairwise Distances¶

Calculate Euclidean distance between all sample pairs:

distance(A, B) = sqrt(Σ(z_A[i] - z_B[i])²)

Where the sum is over all retained genomic positions.

Similarity interpretation:

Small distance = Similar coverage patterns = Good neighbors
Large distance = Different patterns = Poor neighbors

Step 3: Rank Neighbors¶

For each sample:

Sort all other samples by distance (ascending)
Select top N closest samples as neighbors
Write neighbor list to output file

Parameter Tuning¶

`zmax` (default: 2.0)¶

Purpose: Reduce influence of extreme outliers

Lower (1.5): More conservative, less outlier influence
Higher (3.0): Keep more extreme variation
Recommended: 2.0 for most applications

Effect:

Too low: May lose real biological signal
Too high: Outliers dominate distance calculation

`n_neighbors` (default: 500)¶

Purpose: Number of similar samples to identify

Fewer (100-200): Very stringent, only closest matches
More (500-1000): Include more distant but still similar samples
Recommended:
- Small cohort (<500): 100-200 neighbors
- Medium cohort (500-2000): 200-500 neighbors
- Large cohort (>2000): 500-1000 neighbors

Trade-off:

Too few: May miss good reference samples
Too many: Include dissimilar samples, reduce specificity

`sigma2_max` (default: 1000.0)¶

Purpose: Filter extremely variable regions

Effect:

Lower (500): More stringent filtering, fewer positions
Higher (2000): Keep more variable positions
Infinite: Keep all positions

Recommended:

Standard VNTR: 1000
Highly variable CNVs: 2000
Conserved regions: 500

Output Interpretation¶

Neighbor Ranking¶

Neighbors are ordered by similarity:

sample001: neighbor1 (closest), neighbor2, neighbor3, ... neighbor500 (farthest)

Using neighbor data: The top neighbors are used in compute_dipcn for sample-specific normalization:

# Typically use top 100-200 neighbors for CN estimation
top_neighbors = neighbor_list[:200]
reference_coverage = mean(top_neighbors.coverage)

Quality Metrics¶

Good neighborhood:

Neighbors show consistent coverage patterns in VNTR
Similar sequencing depth across genome
Likely from same sequencing batch/population

Poor neighborhood:

High variance among neighbors
May indicate population outlier
Could suggest unique structural variants

Technical Details¶

Computational Complexity¶

Memory: O(samples × positions) - full matrix in memory
Time: O(samples² × positions) - pairwise comparisons
Parallelization: Not currently implemented (TODO)

Performance estimates:

1,000 samples, 1,000 positions: ~1-2 minutes
5,000 samples, 1,000 positions: ~10-20 minutes
10,000 samples: Memory intensive (>32GB RAM)

Memory Optimization¶

For very large cohorts (>5000 samples):

Use chunked processing
Implement approximate nearest neighbor search (e.g., FAISS)
Reduce number of positions used

Use in Pipeline¶

The neighbor file is critical for downstream analysis:

normalize_mosdepth → find_neighbors → compute_dipcn → estimate_kiv
                            ↓
                   Neighbor-based CN normalization

In compute_dipcn:

Load neighbor lists
For each sample, extract coverage of top N neighbors
Use neighbor median/mean as reference baseline
Compute sample CN relative to neighbor baseline

Validation¶

Manual Check¶

Inspect a few samples’ neighbors:

import pandas as pd

neighbors = pd.read_csv("neighbors.zMax2.0.txt.gz", sep="\t")
print(neighbors.iloc[0])  # Check first sample's neighbors

# Are neighbors from same population/batch?
# Do they have similar coverage characteristics?

PCA Visualization¶

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project samples into 2D
pca = PCA(n_components=2)
coords = pca.fit_transform(zdepth_matrix)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Sample similarity space")

Neighbors should cluster together in PCA space.

Troubleshooting¶

All distances similar:

May indicate insufficient variation in data
Check that normalization worked correctly
Verify region contains informative positions

Extreme distances:

Check for outlier samples with very unusual coverage
May indicate technical failures or true large CNVs
Consider removing outliers before neighbor finding

Memory errors:

Reduce n_neighbors
Process samples in batches
Use sparse matrix representation (TODO)

Python API Documentation¶

grid.utils.find_neighbors¶

grid.utils.find_neighbors.find_neighbors(input_file, output_file, zmax, n_neighbors, sigma2_max)¶

Main function to find nearest neighbors in normalized depth data.

Parameters:

input_file (str) – Path to gzipped normalized depth file.
output_file (str) – Path to output file (will be converted to .zMax{zmax}.txt.gz).
zmax (float) – Maximum z-score clipping threshold.
n_neighbors (int) – Number of nearest neighbors to find.
sigma2_max (float) – Maximum variance ratio for region filtering.

Returns:

None

grid.utils.find_neighbors_dir¶

grid.utils.find_neighbors_dir.filter_regions_by_variance(data_matrix, regions, sigma2_max=1000.0, frac_r=1.0)¶

Filter regions based on variance ratios, similar to C++ code.

Parameters:

data_matrix (np.ndarray) – Data matrix of shape [regions x individuals].
regions (list) – List of region identifiers.
sigma2_max (float) – Maximum allowed variance ratio.
frac_r (float) – Fraction of regions to retain based on variance.

Returns:

Indices of regions that pass the filtering. variance_ratios (np.ndarray): Calculated variance ratios for all regions.

Return type:

valid_indices (np.ndarray)

grid.utils.find_neighbors_dir.find_neighbors_sklearn(data_matrix, individuals, n_neighbors=500, zmax=2.0)¶

Find nearest neighbors using scikit-learn.

Parameters:

data_matrix – np.array of shape [regions x individuals]
individuals – list of IDs (length = number of individuals)
n_neighbors – how many neighbors to return per individual
zmax – max z-score clipping

Returns:

dict mapping individual ID to list of (neighbor ID, distance)

grid.utils.find_neighbors_dir.read_normalized_data(input_file)¶

Read gzipped normalized depth data.

Parameters:: input_file (str) – Path to gzipped normalized depth file.
Returns:: List of individual IDs. regions (list): List of region identifiers. mat (np.ndarray): Data matrix of shape [regions x individuals]. scales (dict): Dictionary of scaling factors for each individual.
Return type:: individuals (list)

grid.utils.find_neighbors_dir.save_neighbors(neighbors_dict, scales, output_prefix, zmax, R_use=None)¶

Save the nearest neighbors dictionary to a gzipped text file.

Parameters:

neighbors_dict – dict of {individual_id: [(neighbor_id, distance), …]}
scales – dict of {individual_id: scale_value}
output_prefix – str, prefix for output file
zmax – float, max z-score clipping
R_use – int, number of regions used for normalization (for normalized distance)

Neighbor Finding Module¶

Overview¶