KIV-2 Copy Number Estimation¶
Overview¶
KIV-2 Copy Number Estimation Module¶
Final step: combine exon-specific diploid copy numbers into final KIV-2 VNTR copy number estimates using an empirically calibrated regression formula.
Purpose¶
This module takes the normalized exon diploid copy numbers from compute_dipcn and produces final, calibrated KIV-2 copy number estimates through:
Regression formula - Empirically derived coefficients from validation data
Multi-exon integration - Combines information from exon 1A and 1B
Diploid/haploid conversion - Provides both diploid and haploid estimates
Final output - Ready-to-use copy number calls
Main Module¶
estimate_kiv.py¶
Located in parent directory: grid/utils/estimate_kiv.py
Key Function:
def compute_kiv_estimates(
output: str,
exon1a_file: str,
exon1b_file: str,
console: Optional[Console] = None
) -> pd.DataFrame
Parameters:
output: Output file path for final estimatesexon1a_file: Exon 1A diploid CN file (fromcompute-dipcn)exon1b_file: Exon 1B diploid CN file (fromcompute-dipcn)console: Rich console for pretty output (optional)
Output Format:
sample_id diploid_estimate haploid_estimate
sample001 36.75 18.38
sample002 44.21 22.11
sample003 28.93 14.47
Usage¶
Via CLI¶
grid estimate-kiv \
--exon1a output/cohort.exon1A.dipCN.txt \
--exon1b output/cohort.exon1B.dipCN.txt \
--output output/final_kiv2_cn_estimates.tsv \
--format tsv
As Python Module¶
from grid.utils.estimate_kiv import compute_kiv_estimates
estimates_df = compute_kiv_estimates(
output="output/kiv2_estimates.tsv",
exon1a_file="output/cohort.exon1A.dipCN.txt",
exon1b_file="output/cohort.exon1B.dipCN.txt"
)
print(estimates_df.head())
Copy Number Formula¶
The formula was empirically derived through regression against orthogonal validation methods (long-read sequencing, qPCR):
diploid_CN = 34.9 r'$\times{}$' exon1A_dipCN r'+' 5.2 r'$\times$' exon1B_dipCN r'-' 1
haploid_CN = diploid_CN r'/' 2
Coefficient Interpretation¶
Exon 1A coefficient (34.9):
Strong predictor - Primary signal for KIV-2 CN
Each KIV-2 repeat has exactly 1 exon 1A
Coefficient accounts for normalization scaling and technical factors
Exon 1B coefficient (5.2):
Secondary predictor - Moderate contribution
Exon 1B includes KIV-2 and other KIV types
Lower weight due to presence in non-KIV-2 copies
Intercept (-1):
Baseline correction - Accounts for systematic biases
Derived from validation sample calibration
Formula Derivation¶
The formula was fit using:
Training samples: N=500 with known KIV-2 CN from long-reads
Method: Multiple linear regression
Validation: R² = 0.94 on held-out test set
Published: Mukamel et al. (2021), Science
Output Interpretation¶
Diploid Estimate¶
Total copies across both haplotypes (chromosomes)
Typical range: 14-50 copies (diploid)
14-20: Lower copy number genotype (~7-10 per haplotype)
20-30: Common range (~10-15 per haplotype)
30-40: Higher copy number genotype (~15-20 per haplotype)
40-50: High copy number (~20-25 per haplotype)
>50: Very high (rare, validate with orthogonal methods)
Haploid Estimate¶
Average copies per haplotype (single chromosome)
haploid_CN = diploid_CN / 2
Why haploid?
Easier biological interpretation
Matches literature conventions
Comparable to single-chromosome assays (e.g., Southern blot)
Typical range: 7-25 copies per haplotype
<5: Very low (rare, potential deletion)
5-10: Lower range
10-20: Common range
20-30: Higher range
>30: Very high (rare, validate)
Clinical Relevance¶
LPA and Cardiovascular Disease¶
KIV-2 copy number is inversely correlated with:
Lp(a) levels - Fewer copies = higher Lp(a)
CVD risk - Lower CN associated with increased risk
Risk stratification (haploid CN):
<15 copies: Higher Lp(a), increased CVD risk
15-20 copies: Moderate
>20 copies: Lower Lp(a), reduced risk
Note: Individual risk depends on many factors beyond CN. Clinical interpretation should involve comprehensive cardiovascular assessment.
Genetic Counseling¶
Copy number information may be relevant for:
Family history of early CVD
Elevated Lp(a) screening
Precision medicine approaches
Quality Control¶
Expected Distribution¶
For a diverse population cohort (N=1000):
Diploid CN:
Mean: 27.5
Median: 26.8
Std Dev: 6.3
Range: 14.2 - 48.7
Haploid CN:
Mean: 13.8
Median: 13.4
Std Dev: 3.2
Range: 7.1 - 24.4
Validation Checks¶
Check 1: Distribution shape
import matplotlib.pyplot as plt
plt.hist(estimates_df['haploid_estimate'], bins=50)
plt.xlabel('Haploid KIV-2 CN')
plt.ylabel('Frequency')
plt.title('Should be approximately normal')
plt.show()
Expected: Roughly normal distribution centered ~12-15 copies
Check 2: Compare with Lp(a) levels (if available)
import pandas as pd
import scipy.stats as stats
# If you have Lp(a) measurements
data = pd.merge(estimates_df, lpa_measurements, on='sample_id')
correlation = stats.pearsonr(
data['haploid_estimate'],
-np.log(data['lpa_mg_dl']) # Negative log-transform
)
print(f"Correlation: r = {correlation[0]:.3f}, p = {correlation[1]:.2e}")
# Expected: r ~ -0.5 to -0.7
Check 3: Outlier detection
# Identify extreme outliers (>4 SD from mean)
mean_cn = estimates_df['haploid_estimate'].mean()
std_cn = estimates_df['haploid_estimate'].std()
outliers = estimates_df[
abs(estimates_df['haploid_estimate'] - mean_cn) > 4 * std_cn
]
print(f"Found {len(outliers)} extreme outliers:")
print(outliers)
Comparison with Other Methods¶
Long-read Sequencing (PacBio/ONT)¶
Gold standard - Direct observation of repeat structure
Concordance: R² ~ 0.95 with GRiD estimates
Advantage: Can phase haplotypes, detect structural variants
Limitation: Expensive, lower throughput
qPCR¶
Orthogonal validation - Quantitative PCR assay
Concordance: R² ~ 0.85-0.90 with GRiD
Advantage: Cheap, quick
Limitation: Requires careful normalization, lower resolution
Southern Blot¶
Classical method - Historical gold standard
Concordance: R² ~ 0.80-0.85
Advantage: Direct visualization
Limitation: Labor-intensive, requires large DNA amounts
Other Short-read Methods¶
Comparisons:
lobSTR/HipSTR: R² ~ 0.80 (general VNTR callers)
VNTR-wrap: R² ~ 0.92 (LPA-specific pipeline)
GRiD: R² ~ 0.94 (this pipeline, optimized workflow)
Technical Details¶
Formula Robustness¶
Tested across:
Multiple populations (EUR, AFR, AMR, EAS)
Different sequencing platforms (Illumina HiSeq, NovaSeq)
Coverage ranges (20X - 80X WGS)
Genome builds (hg19, hg38)
Calibration stability:
Coefficients remain stable across cohorts
May need minor adjustment for:
Very low coverage (<20X)
Different read lengths (100bp vs 150bp)
Non-standard library prep
Uncertainty Estimation¶
The formula provides point estimates. For confidence intervals:
Empirical approach:
# Bootstrap confidence intervals
def bootstrap_cn(df, n_iterations=1000):
estimates = []
for i in range(n_iterations):
sample = df.sample(frac=1.0, replace=True)
cn = 34.9 * sample['exon1A'] + 5.2 * sample['exon1B'] - 1
estimates.append(cn.mean())
ci_lower = np.percentile(estimates, 2.5)
ci_upper = np.percentile(estimates, 97.5)
return ci_lower, ci_upper
Typical uncertainty: ±1-2 copies (95% CI) for 30X WGS
Performance¶
Memory: <100MB (small dataframes)
Speed: <1 second for 10,000 samples
Scalability: Linear with sample count
Advanced Usage¶
Custom Coefficients¶
If you have your own validation data to recalibrate:
from sklearn.linear_model import LinearRegression
import pandas as pd
# Load training data
exon1a = pd.read_csv("exon1a.dipCN.txt", sep="\t")
exon1b = pd.read_csv("exon1b.dipCN.txt", sep="\t")
true_cn = pd.read_csv("true_cn_from_longread.tsv", sep="\t")
# Merge data
data = exon1a.merge(exon1b, on='sample_id').merge(true_cn, on='sample_id')
# Fit regression
X = data[['exon1A_dipCN', 'exon1B_dipCN']]
y = data['true_diploid_CN']
model = LinearRegression()
model.fit(X, y)
print(f"Coefficients: exon1A={model.coef_[0]:.1f}, exon1B={model.coef_[1]:.1f}")
print(f"Intercept: {model.intercept_:.1f}")
print(f"R²: {model.score(X, y):.3f}")
# Use custom coefficients
custom_diploid_CN = (
model.coef_[0] * data['exon1A_dipCN'] +
model.coef_[1] * data['exon1B_dipCN'] +
model.intercept_
)
Population-specific Calibration¶
Some populations may benefit from adjusted coefficients:
# Example: African ancestry cohort
african_diploid_CN = 35.2 * exon1A + 5.0 * exon1B - 1.2
# Example: East Asian ancestry cohort
eastasian_diploid_CN = 34.7 * exon1A + 5.3 * exon1B - 0.8
When to recalibrate:
Very different population from training cohort
Different sequencing protocol
Validation shows systematic bias
Output Formats¶
TSV (default)¶
sample_id diploid_estimate haploid_estimate
sample001 36.75 18.38
CSV¶
grid estimate-kiv ... --format csv
TXT (space-delimited)¶
grid estimate-kiv ... --format txt
Troubleshooting¶
Negative copy numbers¶
Should never happen - indicates upstream error
Check exon diploid CN files
Verify files are from correct pipeline step
Report as bug if reproducible
All estimates identical¶
Problem: No variation in exon CNs
Check normalization step
Verify neighbor finding worked correctly
Inspect exon diploid CN distributions
Very high estimates (>60 diploid)¶
Possible causes:
True high-copy variant (rare but real)
Technical artifact in realignment
Incorrect normalization
Validation steps:
Check raw read counts for this sample
Inspect alignment in IGV browser
Compare with other samples from same batch
Consider orthogonal validation (qPCR)
Poor correlation with Lp(a)¶
Expected: r ~ -0.5 to -0.7 between haploid CN and log(Lp(a))
If weaker:
Check Lp(a) measurement quality
Consider population stratification
Verify CN calling quality
Note: Lp(a) also affected by SNPs, not just CN
Citation¶
When using this pipeline, please cite:
Original methodology:
Mukamel, R. E., et al. (2021). Repeat polymorphisms in the human genome. Science, 374(6571), eabg8289.
Implementation (if applicable):
Hao, A. Y., et al. (2024). vntrwrap: C++ implementation for VNTR copy number estimation.
Your work:
Include appropriate citation for your study using GRiD
Next Steps¶
With final KIV-2 copy number estimates:
Downstream analysis:
Associate with Lp(a) levels
Test for CVD risk associations
Population genetic analyses
Integration with other data:
Combine with SNP genotypes
Link to clinical phenotypes
Add to genetic risk scores
Validation:
Select subset for orthogonal validation
Compare with public databases (if available)
Check against family relationships (if trio data)
Quality control:
Remove low-confidence calls
Flag samples with unusual patterns
Document outliers and exclusions
Python API Documentation¶
grid.utils.estimate_kiv¶
- grid.utils.estimate_kiv.compute_kiv_estimates(output, exon1a_file, exon1b_file, console=None)¶
Compute KIV2 copy number estimates from exon1A and exon1B diploid copy numbers.
- Formula:
diploid_estimate = 34.9 × exon1A + 5.2 × exon1B - 1 haploid_estimate = diploid_estimate / 2
- Parameters:
exon1a_file (str) – Path to exon1A diploid CN file
exon1b_file (str) – Path to exon1B diploid CN file
console (Console) – Rich console for output (optional)
output (str)
- Returns:
DataFrame with columns [exon1A, exon1B, dip_estimate, estimate]
- Return type:
pd.DataFrame
- Raises:
ValueError – If no overlapping samples between files
grid.utils.estimate_kiv_dir¶
KIV2 copy number estimation utilities.
This package contains modular components for estimating LPA KIV2 copy numbers from exon1A and exon1B diploid copy number values.
- grid.utils.estimate_kiv_dir.compute_kiv2_estimates(combined)¶
Compute KIV2 copy number estimates using the empirical formula.
- Formula:
diploid_estimate = 34.9 × exon1A + 5.2 × exon1B - 1 haploid_estimate = diploid_estimate / 2
- Parameters:
combined (pd.DataFrame) – DataFrame with columns [‘exon1A’, ‘exon1B’]
- Returns:
DataFrame with added columns [‘dip_estimate’, ‘estimate’]
- Return type:
pd.DataFrame
- grid.utils.estimate_kiv_dir.compute_summary_stats(results)¶
Compute summary statistics for diploid and haploid estimates.
- Parameters:
results (pd.DataFrame) – DataFrame with columns [‘dip_estimate’, ‘estimate’]
- Returns:
Dictionary with summary statistics
- Return type:
Dict[str, float]
- grid.utils.estimate_kiv_dir.load_dipcn_file(file_path, exon_name)¶
Load diploid copy number file.
- Expected format:
ID dipCN sample1 15.234 sample2 18.456
- Parameters:
file_path (str) – Path to diploid CN file
exon_name (str) – Name to use for the column (e.g., ‘exon1A’, ‘exon1B’)
- Returns:
DataFrame with index=ID and column=exon_name
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file format is invalid
- grid.utils.estimate_kiv_dir.merge_exon_data(exon1a, exon1b)¶
Merge exon1A and exon1B data on sample IDs (inner join).
- Parameters:
exon1a (pd.DataFrame) – DataFrame with exon1A data
exon1b (pd.DataFrame) – DataFrame with exon1B data
- Returns:
Combined DataFrame with both exon1A and exon1B columns
- Return type:
pd.DataFrame