LPA Realignment Module¶
Overview¶
LPA Realignment Module¶
Realign reads from the LPA KIV-2 VNTR region to a high-quality reference sequence for improved copy number estimation.
Purpose¶
Standard genome alignment may be suboptimal in repeat regions due to:
Repetitive sequences confusing aligners
Copy number variation causing ambiguous mappings
Sequence homology between repeat units
This module realigns reads specifically to the LPA KIV-2 reference, enabling:
Precise read assignment to specific exons (1A, 1B, KIV-3, etc.)
Improved copy number accuracy through targeted alignment
Variant detection within repeat units
Main Module¶
align_lpa.py¶
Located in parent directory: grid/utils/align_lpa.py
Key Function:
def run_lpa_realignments(
cram_dir: str,
reference_fa: str,
lpa_ref_fasta: str,
positions_file: str,
genome_build: str,
chrom: str,
start: int,
end: int,
output_file: str,
threads: int = 1
) -> None
Parameters:
cram_dir: Directory containing CRAM filesreference_fa: Genome reference FASTA (for CRAM decoding)lpa_ref_fasta: LPA KIV-2 specific reference FASTApositions_file: TSV with exon positions in LPA referencegenome_build: Genome build (hg19/hg37/hg38)chrom: Chromosome (e.g., “chr6”)start: VNTR region startend: VNTR region endoutput_file: Output TSV with realignment countsthreads: Parallel processing threads
Output Format:
sample_id exon_1A exon_1B exon_1B_KIV3 exon_1B_notKIV3
sample001 234 1456 789 667
sample002 198 1523 823 700
sample003 267 1389 721 668
Usage¶
Via CLI¶
grid lpa-realign \
--cram-dir data/CRAMs \
--output-file output/lpa_realignment_counts.tsv \
--ref-fasta refs/hs37d5.fa \
--lpa-ref-fasta refs/lpa_kiv2_reference.fa \
--positions-file refs/lpa_exon_positions.tsv \
--genome-build hg38 \
--chrom chr6 \
--start 160500000 \
--end 160600000 \
--threads 4
As Python Module¶
from grid.utils.align_lpa import run_lpa_realignments
run_lpa_realignments(
cram_dir="data/CRAMs",
reference_fa="refs/hs37d5.fa",
lpa_ref_fasta="refs/lpa_kiv2_reference.fa",
positions_file="refs/lpa_exon_positions.tsv",
genome_build="hg38",
chrom="chr6",
start=160500000,
end=160600000,
output_file="output/realignment_counts.tsv",
threads=4
)
LPA Reference Structure¶
The LPA gene contains multiple exons in the KIV-2 VNTR:
LPA Gene Structure:
[KIV-1][KIV-2 (variable copies)][KIV-3][KIV-4]...[Protease]
KIV-2 Repeats:
- Exon 1A: Diagnostic marker for KIV-2 copy number
- Exon 1B: Present in all KIV-2 copies
- Exon 1B_KIV3: Specific to KIV-3 (single copy)
- Exon 1B_notKIV3: Non-KIV3 copies
Realignment Algorithm¶
Step 1: Extract Reads¶
For each CRAM:
Fetch reads overlapping LPA VNTR region
Extract read sequences and qualities
Convert to FASTQ format
Step 2: Realignment¶
Using BWA-MEM or similar aligner:
bwa mem lpa_reference.fa reads.fastq > aligned.sam
Parameters:
Optimized for short, repetitive sequences
Allowing multiple alignments per read
Sensitive to detect all exon matches
Step 3: Count Exon Hits¶
For each aligned read:
Check alignment position against exon coordinates
Assign read to exon type (1A, 1B, 1B_KIV3, etc.)
Count reads per exon type
Handle multi-mapping reads appropriately
Step 4: Aggregate Results¶
Combine counts across all samples
Write to TSV output
Positions File Format¶
The positions file defines exon boundaries in the LPA reference:
exon_type start end description
exon_1A 1000 1150 KIV-2 specific marker
exon_1B 2000 2200 All KIV-2 copies
exon_1B_KIV3 3000 3200 KIV-3 specific
exon_1B_notKIV3 4000 4200 Non-KIV3 copies
Columns:
exon_type: Exon identifierstart: Start position in LPA reference (0-based)end: End position in LPA reference (0-based)description: Human-readable description
Exon Types Explained¶
Exon 1A¶
Uniqueness: Found only in KIV-2 repeats
Copy number: ~1 per KIV-2 repeat unit
Use: Primary marker for KIV-2 CN estimation
Formula component: Strong predictor (weight ~35 in final formula)
Exon 1B¶
Uniqueness: Present in all KIV copies (KIV-2, KIV-3, etc.)
Copy number: Sum of all KIV types
Use: Secondary marker
Formula component: Moderate predictor (weight ~5)
Exon 1B_KIV3¶
Uniqueness: Specific to KIV-3 (single copy per haplotype)
Copy number: 1-2 (diploid)
Use: Normalization anchor point
Interpretation: Should be ~2 in diploid individuals
Exon 1B_notKIV3¶
Uniqueness: KIV copies excluding KIV-3
Copy number: Variable
Use: Additional normalization
Calculation: Exon_1B - Exon_1B_KIV3
Copy Number Formula¶
The final KIV-2 copy number is estimated using:
diploid_CN = 34.9 × exon_1A + 5.2 × exon_1B - 1
haploid_CN = diploid_CN / 2
Coefficients derived from:
Regression against orthogonal validation (PacBio long-reads, qPCR)
Empirical calibration on reference samples
Published in Mukamel et al. (2021)
Technical Details¶
Aligner Choice¶
Recommended: BWA-MEM
Fast for short reads
Handles repetitive sequences well
Standard in genomics pipelines
Alternative: Bowtie2, minimap2
May offer speed/sensitivity trade-offs
Performance¶
Parallelization: Process CRAMs in parallel
Speed: ~2-5 minutes per CRAM
Memory: <4GB per process
Disk: Minimal (streaming)
Dependencies¶
External:
bwa,samtoolsPython:
pysam,pandas
Quality Control¶
Expected Counts (30X WGS)¶
For a sample with 20 KIV-2 copies (diploid):
exon_1A: ~600 reads (20 copies × ~30X)
exon_1B: ~750 reads (includes KIV-3)
exon_1B_KIV3: ~60 reads (2 copies)
Validation Checks¶
Exon 1B > Exon 1A: Always true (includes KIV-3)
Exon 1B_KIV3 ≈ 60 reads: Should be ~2 diploid copies
Read ratios consistent: Within expected range
Outlier Detection¶
Warning signs:
Exon_1A = 0: Complete mapping failure
Exon_1B_KIV3 > 200: Possible KIV-3 duplication
Very low counts: Low coverage or alignment issues
Troubleshooting¶
No reads aligned:
Check LPA reference matches genome build
Verify positions file coordinates
Ensure BWA index exists for LPA reference
Unexpected ratios:
# Check exon relationships
ratio = exon_1B / exon_1A
# Should be ~1.2-1.5 for most samples
Index missing:
# Build BWA index
bwa index lpa_reference.fa
High KIV-3 counts:
May indicate KIV-3 duplication (rare variant)
Verify with orthogonal methods
Check family members if available
References¶
Mukamel et al. (2021) - Original LPA CN methodology
Hao et al. (2024) - vntrwrap implementation
Li & Durbin (2009) - BWA aligner algorithm
Future Enhancements¶
[ ] Support for long-read data (PacBio, ONT)
[ ] Phased copy number estimation
[ ] Variant calling within KIV-2 repeats
[ ] Integration with graph-based alignment
Python API Documentation¶
grid.utils.align_lpa¶
- grid.utils.align_lpa.run_lpa_realignments(cram_dir, reference_fa, lpa_ref_fasta, positions_file, genome_build, chrom, start, end, output_file, threads=1)¶
Main function to run LPA realignment for all CRAM files in a directory.
UPDATED: Now supports parallel processing with multiple threads.
- Parameters:
cram_dir (str) – Directory containing CRAM/BAM files
reference_fa (str) – Reference genome FASTA for CRAM files
lpa_ref_fasta (str) – LPA reference FASTA for realignment
positions_file (str) – File with hardcoded repeat positions
genome_build (str) – Genome build (‘hg19’, ‘hg37’, or ‘hg38’)
chrom (str) – Chromosome (e.g., ‘chr6’ or ‘6’)
start (int) – Start position (0-based)
end (int) – End position (0-based)
output_file (str) – Output TSV file path
threads (int) – Number of threads for parallel processing (default: 1)
- Returns:
None
grid.utils.align_lpa_dir¶
LPA realignment utilities.
This package contains modular components for realigning reads in the LPA KIV-2 region and classifying variant types.
- grid.utils.align_lpa_dir.classify_read_pairs(reads, ref_seq, starts, ref_offset)¶
Classify read pairs into variant categories.
UPDATED: Now properly groups reads by qname before processing pairs. This fixes the bug where pairs had to be consecutive in the list.
- Parameters:
reads (List[Dict]) – List of read dictionaries
ref_seq (str) – Reference sequence
starts (List[int]) – List of repeat start positions
ref_offset (int) – Reference offset
- Returns:
- Counts for each variant type:
’1B_KIV3’: Count of 1B_KIV3 variants
’1B_KIV2’: Count of 1B_KIV2 variants
’1B_tied’: Count of 1B_tied variants
’1A’: Count of 1A variants
- Return type:
Dict[str, int]
- grid.utils.align_lpa_dir.classify_variant_from_scores(combined_scores)¶
Classify variant based on combined scores from read pair.
- Parameters:
combined_scores (List[int]) – Array of 7 combined scores
- Returns:
Variant type (‘1B_KIV3’, ‘1B_KIV2’, ‘1B_tied’, ‘1A’) or None
- Return type:
Optional[str]
- grid.utils.align_lpa_dir.compute_alignment_scores(pos, cigar, seq, qual, ref_seq, starts, ref_offset)¶
Compute mismatch scores for each of the 7 repeat positions.
Lower score = better match to that repeat position.
UPDATED: Now uses ±1000bp window and closest repeat matching.
- Parameters:
pos (int) – Read alignment position (1-based from BAM)
cigar (str) – CIGAR string
seq (str) – Read sequence
qual (str) – Quality string
ref_seq (str) – Reference sequence
starts (List[int]) – List of 7 repeat start positions (1-based)
ref_offset (int) – Reference offset position (1-based)
- Returns:
Array of 7 scores, one for each potential repeat alignment
- Return type:
List[int]
- grid.utils.align_lpa_dir.extract_reads_from_cram(cram_file, reference_fa, region)¶
Extract properly paired reads from CRAM/BAM file using pysam.
Filters for flags: 99, 147, 83, 163 (proper pairs, forward/reverse)
- Parameters:
cram_file (str) – Path to CRAM/BAM file
reference_fa (str) – Reference genome FASTA
region (str) – Genomic region to extract (e.g., ‘chr6:160000000-161000000’)
- Returns:
- List of read dictionaries with keys:
qname: read name
pos: alignment position (1-based)
cigar: CIGAR string
seq: sequence
qual: quality string
- Return type:
List[Dict]
- grid.utils.align_lpa_dir.find_cram_files(cram_dir)¶
Find all CRAM and BAM files in the specified directory.
- Parameters:
cram_dir (str) – Directory to search for CRAM/BAM files
- Returns:
List of Path objects for found CRAM/BAM files with valid indices
- Return type:
List[Path]
- grid.utils.align_lpa_dir.load_positions(positions_file, genome_build)¶
Load hardcoded positions from file.
File format (tab-separated): Hg38file Hg19file 160611000 161032032 160611561 161032593 …
- Parameters:
positions_file (str) – Path to positions file
genome_build (str) – Genome build (‘hg38’ or ‘hg19’)
- Returns:
(list of 7 repeat starts, reference offset)
- Return type:
Tuple[List[int], int]
- Raises:
ValueError – If positions file is malformed or doesn’t contain 7 repeats
- grid.utils.align_lpa_dir.load_reference(ref_fasta_path)¶
Load reference sequence from FASTA file.
- Parameters:
ref_fasta_path (str) – Path to reference FASTA file
- Returns:
Reference sequence (without header lines)
- Return type:
str
- Raises:
FileNotFoundError – If FASTA file doesn’t exist
ValueError – If FASTA file is empty or malformed
- grid.utils.align_lpa_dir.parse_cigar(cigar)¶
Parse CIGAR string into list of (length, operation) tuples.
- Parameters:
cigar (str) – CIGAR string (e.g., ‘100M’, ‘50M1I49M’)
- Returns:
List of (length, operation) tuples
- Return type:
List[Tuple[int, str]]
- grid.utils.align_lpa_dir.process_cram_file(cram_file, reference_fa, region, ref_seq, starts, ref_offset, output_file)¶
Process a single CRAM/BAM file to classify LPA variants.
- Parameters:
cram_file (Path) – Path to CRAM/BAM file
reference_fa (str) – Reference genome FASTA
region (str) – Genomic region (e.g., ‘chr6:160000000-161000000’)
ref_seq (str) – LPA reference sequence
starts (List[int]) – List of 7 repeat start positions
ref_offset (int) – Reference offset position
output_file (str) – Output file to append results
- Returns:
Result dictionary with ‘success’, ‘sample_id’, ‘read_count’, ‘counts’
- Return type:
Dict
- grid.utils.align_lpa_dir.validate_inputs(cram_dir, reference_fa, lpa_ref_fasta, positions_file)¶
Validate that all required input files and directories exist.
- Parameters:
cram_dir (str) – Directory containing CRAM/BAM files
reference_fa (str) – Reference genome FASTA
lpa_ref_fasta (str) – LPA reference FASTA
positions_file (str) – Positions file
- Raises:
FileNotFoundError – If any required file/directory is missing