Diploid Copy Number Computation¶

Overview¶

Diploid Copy Number Computation Module¶

Compute diploid copy numbers for LPA KIV-2 exons using neighbor-based normalization. This is the penultimate step before final KIV-2 CN estimation.

Purpose¶

This module transforms raw realignment read counts into normalized diploid copy number estimates by:

Neighbor-based normalization - Compare each sample to similar individuals
Batch effect correction - Remove technical variation
Diploid CN estimation - Account for both haplotypes
Exon-specific quantification - Separate estimates for 1A, 1B, and subtypes

The output provides refined CN estimates for each exon type, which are combined in the final step to estimate total KIV-2 copy number.

Main Module¶

`compute_dipcn.py`¶

Located in parent directory: grid/utils/compute_dipcn.py

Key Function:

def compute_dipcn_pipeline(
    count_file: Path,
    neighbor_file: Path,
    output_prefix: str,
    n_neighbors: int = 200,
    console: Optional[Console] = None
) -> None

Parameters:

count_file: Realignment output TSV (from lpa-realign)
neighbor_file: Neighbor file (from find-neighbors)
output_prefix: Prefix for output files (e.g., “output/sample”)
n_neighbors: Number of top neighbors to use (default: 200)
console: Rich console for pretty output (optional)

Output Files: Four files are generated with diploid CN for each exon type:

<output_prefix>.exon1A.dipCN.txt
<output_prefix>.exon1B.dipCN.txt
<output_prefix>.exon1B_KIV3.dipCN.txt
<output_prefix>.exon1B_notKIV3.dipCN.txt

Output Format (each file):

sample_id       diploid_CN
sample001       18.45
sample002       22.13
sample003       15.89

Usage¶

Via CLI¶

grid compute-dipcn \
    --count-file output/lpa_realignment_counts.tsv \
    --neighbor-file output/neighbors.zMax2.0.txt.gz \
    --output-prefix output/sample_cohort \
    --n-neighbors 200

Output:

output/sample_cohort.exon1A.dipCN.txt
output/sample_cohort.exon1B.dipCN.txt
output/sample_cohort.exon1B_KIV3.dipCN.txt
output/sample_cohort.exon1B_notKIV3.dipCN.txt

As Python Module¶

from pathlib import Path
from grid.utils.compute_dipcn import compute_dipcn_pipeline

compute_dipcn_pipeline(
    count_file=Path("output/lpa_realignment_counts.tsv"),
    neighbor_file=Path("output/neighbors.zMax2.0.txt.gz"),
    output_prefix="output/sample_cohort",
    n_neighbors=200
)

Algorithm Details¶

Step 1: Load Data¶

Read counts: Load realignment counts for all samples
Neighbors: Load neighbor lists for each sample
Validate: Ensure all samples have neighbor information

Step 2: Neighbor-Based Normalization¶

For each sample and each exon type:

# Get top N neighbors
neighbors = neighbor_list[:n_neighbors]

# Calculate neighbor statistics
neighbor_counts = counts[neighbors]
neighbor_median = median(neighbor_counts)
neighbor_std = std(neighbor_counts)

# Normalize sample count
normalized_count = sample_count / neighbor_median

Why neighbor-based?

Reduces batch effects: Similar samples likely from same batch
Population-aware: Accounts for ancestry-related variation
Adaptive: Each sample gets custom reference baseline

Step 3: Copy Number Estimation¶

Convert normalized counts to diploid CN:

# Assuming KIV-3 is diploid (2 copies) as anchor
reference_diploid = 2.0
diploid_CN = normalized_count r'\times{×}' reference_diploid

For exon_1B_KIV3:

Expected to be 2 copies in diploid individuals (1 per haplotype)
Used as normalization anchor
Deviations may indicate rare KIV-3 variants

For exon_1A and exon_1B:

Variable copy number
Normalized against neighbors with similar genomic background
Diploid estimate accounts for both haplotypes

Step 4: Quality Control¶

Filter and flag potential issues:

Extreme outliers (|z-score| > 4)
Very low read counts (< 10 reads)
Samples with insufficient neighbor data

Copy Number Interpretation¶

Exon 1A Diploid CN¶

Range: Typically 10-40 (diploid)

10-15: Lower copy number genotype
20-30: Common range
35-40: Higher copy number genotype
>40: Rare, verify with orthogonal methods

Biological meaning: Each KIV-2 repeat unit has 1 exon 1A, so:

haploid KIV-2 copies ≈ diploid_CN / 2

Exon 1B Diploid CN¶

Range: Typically 12-45 (diploid)

Includes all KIV types (KIV-2, KIV-3, etc.)
Should be slightly higher than exon 1A
Difference reflects non-KIV-2 copies

Relationship:

exon_1B_CN ≈ exon_1A_CN + exon_1B_KIV3_CN

Exon 1B_KIV3 Diploid CN¶

Expected: ~2.0 (diploid, single copy per haplotype)

1.5-2.5: Normal range with measurement noise
<1.0: Possible deletion (very rare)
>3.0: Possible duplication (very rare)

Use as QC: Samples with very aberrant KIV-3 CN may have:

Alignment issues
True rare variants
Technical artifacts

Exon 1B_notKIV3 Diploid CN¶

Calculation:

exon_1B_notKIV3 = exon_1B - exon_1B_KIV3

Range: Should closely match exon 1A

Additional normalization check
Validates exon relationships

Parameter Tuning¶

`n_neighbors` (default: 200)¶

Purpose: Number of similar samples for normalization baseline

Fewer (50-100):
- More stringent
- Only closest matches
- Better for homogeneous cohorts
More (300-500):
- More robust to outliers
- Better for diverse cohorts
- May include less similar samples

Recommendation:

Small cohort (<500): 100-150
Medium cohort (500-2000): 200-300
Large cohort (>2000): 200-500

Trade-off:

Too few: Sensitive to outlier neighbors
Too many: Include dissimilar samples, reduce specificity

Expected Results¶

Typical Distribution¶

For 1000 samples from diverse populations:

Exon 1A Diploid CN:
  Mean: 25.3
  Median: 24.8
  Range: 12.5 - 42.7

Exon 1B_KIV3 Diploid CN:
  Mean: 2.04
  Median: 2.01
  Range: 1.7 - 2.3

Quality Metrics¶

Coefficient of Variation (CV):

cv = std(diploid_CN) / mean(diploid_CN)

Good: CV < 0.3 for exon 1A
Acceptable: CV < 0.5
Poor: CV > 0.5 (check for batch effects)

Exon Correlation:

correlation(exon_1A, exon_1B)

Good: r > 0.85
Acceptable: r > 0.75
Poor: r < 0.75 (alignment or normalization issues)

Validation¶

Internal Consistency Checks¶

Check 1: Exon relationships

# exon_1B should be >= exon_1A
assert (exon_1B >= exon_1A).all()

# exon_1B_KIV3 should be ~2
assert 1.5 < exon_1B_KIV3.mean() < 2.5

Check 2: Correlation

import pandas as pd

df = pd.DataFrame({
    'exon_1A': exon_1A_CN,
    'exon_1B': exon_1B_CN
})

correlation = df.corr()
print(correlation)
# Should show strong positive correlation

Comparison with Known Samples¶

If validation samples with known CN are available:

known_samples = {
    'sample001': 20,  # Known KIV-2 CN
    'sample002': 25,
    'sample003': 15
}

for sample, true_CN in known_samples.items():
    estimated_CN = diploid_CN_df.loc[sample, 'diploid_CN'] / 2
    error = abs(estimated_CN - true_CN)
    print(f"{sample}: True={true_CN}, Est={estimated_CN:.1f}, Error={error:.1f}")

Technical Details¶

Performance¶

Memory: <4GB for 1000 samples
Speed: ~30-60 seconds for 1000 samples
Parallelization: Not currently implemented (fast enough)

Dependencies¶

pandas - Data manipulation
numpy - Numerical operations
scipy - Statistical functions (optional)

Troubleshooting¶

All diploid CNs near 2¶

Possible causes:

Normalization issue - check neighbor file
All samples similar - low biological variation
Incorrect reference scaling

Solution:

Verify neighbor file is correct
Check exon count distributions
Ensure exon_1B_KIV3 used as anchor

Extreme outliers (CN > 100)¶

Possible causes:

Technical artifacts
Very poor neighbors for that sample
True extreme variant (rare)

Solution:

Check read counts for that sample
Inspect neighbor similarity
Consider removing sample from analysis

Negative diploid CNs¶

Should never happen - indicates serious error

Check input count file format
Verify neighbor file integrity
Report as bug if reproducible

Poor correlation between exons¶

Possible causes:

Alignment issues in realignment step
Batch effects not properly normalized
Incorrect exon positions file

Solution:

Rerun realignment with validated positions file
Check for batch confounding in neighbor selection
Increase n_neighbors for more robust normalization

Next Step¶

After computing diploid CNs, proceed to final KIV-2 estimation:

grid estimate-kiv \
    --exon1a output/sample_cohort.exon1A.dipCN.txt \
    --exon1b output/sample_cohort.exon1B.dipCN.txt \
    --output output/final_kiv2_estimates.tsv

This combines the exon diploid CNs using the empirically-derived formula to produce final KIV-2 copy number estimates.

Python API Documentation¶

grid.utils.compute_dipcn¶

grid.utils.compute_dipcn.compute_dipcn_pipeline(count_file, neighbor_file, output_prefix, n_neighbors, console=None)¶

Main function to compute diploid copy numbers for all exon types.

Parameters:

count_file (Path) – Path to realignment count file
neighbor_file (Path) – Path to neighbor results file (can be gzipped)
output_prefix (Path) – Output file prefix
n_neighbors (int) – Number of top neighbors to use
console (Console) – Rich console for output (optional)

Returns:

None

grid.utils.compute_dipcn_dir¶

Diploid copy number computation utilities.

This package contains modular components for computing diploid copy numbers for LPA KIV-2 repeats using neighbor-based normalization.

grid.utils.compute_dipcn_dir.compute_diploid_cn_for_exon(counts, neighbors, exon_type, n_neighbors=200)¶

Compute diploid copy number for all samples for a given exon type.

Formula: dipCN = (sample_count / sample_scale) / (mean_neighbor_normalized_count) where mean_neighbor_normalized_count = mean(neighbor_count / neighbor_scale)

Parameters:

counts (Dict) – Dictionary mapping sample_id to count dict
neighbors (Dict) – Dictionary mapping sample_id to (scale, neighbor_list)
exon_type (str) – Exon type to compute (‘1B_KIV3’, ‘1B_notKIV3’, ‘1B’, ‘1A’)
n_neighbors (int) – Number of top neighbors to use

Returns:

Dictionary mapping sample_id to diploid copy number

Return type:

Dict[str, float]

grid.utils.compute_dipcn_dir.get_exon_count(counts_dict, exon_type)¶

Get the appropriate count for a given exon type.

Exon types: - 1B_KIV3: just the 1B_KIV3 count - 1B_notKIV3: 1B_KIV2 + 1B_tied - 1B: 1B_KIV3 + 1B_KIV2 + 1B_tied - 1A: just the 1A count

Parameters:

counts_dict (Dict[str, int]) – Dictionary with keys ‘1B_KIV3’, ‘1B_KIV2’, ‘1B_tied’, ‘1A’
exon_type (str) – Exon type to compute

Returns:

Count for the specified exon type

Return type:

int

Raises:

ValueError – If exon_type is not recognized

grid.utils.compute_dipcn_dir.load_count_results(count_file)¶

Load read counts from realignment output.

Expected format: sample_id 1B_KIV3 1B_KIV2 1B_tied 1A

Parameters:

count_file (Path) – Path to count file

Returns:

Dictionary mapping sample_id to count dict: Example: {‘NWD123’: {‘1B_KIV3’: 45, ‘1B_KIV2’: 23, …}}

Return type:

Dict[str, Dict[str, int]]

grid.utils.compute_dipcn_dir.load_neighbor_results(neighbor_file)¶

Load neighbor information from find_neighbors output.

Expected format: sample_id scale neighbor1 neighbor1_scale neighbor1_distance …

Parameters:: neighbor_file (Path) – Path to neighbor file (can be gzipped)
Returns:: Dict mapping sample_id to (scale, neighbor_list) where neighbor_list is [(neighbor_id, neighbor_scale, distance), …] Example: {‘NWD123’: (1.0, [(‘NWD456’, 0.98, 0.05), (‘NWD789’, 1.02, 0.07), …])}
Return type:: Dict[str, Tuple[float, List[Tuple[str, float, float]]]]

grid.utils.compute_dipcn_dir.normalize_sample_id(sample_id)¶

Normalize sample IDs by removing common file extensions and suffixes.

Examples

NWD278973.b38.irc.v1_subset → NWD278973 NWD278973.cram → NWD278973 NWD278973 → NWD278973

Parameters:: sample_id (str) – Original sample ID
Returns:: Normalized sample ID
Return type:: str

grid.utils.compute_dipcn_dir.validate_sample_overlap(counts, neighbors, console=None)¶

Validate that there are overlapping samples between counts and neighbors.

Parameters:

counts (Dict) – Dictionary of count results
neighbors (Dict) – Dictionary of neighbor results
console (Console) – Rich console for output (optional)

Returns:

Number of overlapping samples

Return type:

int

grid.utils.compute_dipcn_dir.write_dipcn_output(results, output_file)¶

Write diploid copy numbers to output file.

Output format: ID dipCN sample1 2.345678 sample2 1.987654 …

Parameters:

results (Dict[str, float]) – Dictionary mapping sample_id to diploid CN
output_file (str) – Output file path

Returns:

None

Return type:

None

Diploid Copy Number Computation¶

Overview¶