LPA Realignment Module¶

Overview¶

LPA Realignment Module¶

Realign reads from the LPA KIV-2 VNTR region to a high-quality reference sequence for improved copy number estimation.

Purpose¶

Standard genome alignment may be suboptimal in repeat regions due to:

Repetitive sequences confusing aligners
Copy number variation causing ambiguous mappings
Sequence homology between repeat units

This module realigns reads specifically to the LPA KIV-2 reference, enabling:

Precise read assignment to specific exons (1A, 1B, KIV-3, etc.)
Improved copy number accuracy through targeted alignment
Variant detection within repeat units

Main Module¶

`align_lpa.py`¶

Located in parent directory: grid/utils/align_lpa.py

Key Function:

def run_lpa_realignments(
    cram_dir: str,
    reference_fa: str,
    lpa_ref_fasta: str,
    positions_file: str,
    genome_build: str,
    chrom: str,
    start: int,
    end: int,
    output_file: str,
    threads: int = 1
) -> None

Parameters:

cram_dir: Directory containing CRAM files
reference_fa: Genome reference FASTA (for CRAM decoding)
lpa_ref_fasta: LPA KIV-2 specific reference FASTA
positions_file: TSV with exon positions in LPA reference
genome_build: Genome build (hg19/hg37/hg38)
chrom: Chromosome (e.g., “chr6”)
start: VNTR region start
end: VNTR region end
output_file: Output TSV with realignment counts
threads: Parallel processing threads

Output Format:

sample_id       exon_1A exon_1B exon_1B_KIV3    exon_1B_notKIV3
sample001       234     1456    789             667
sample002       198     1523    823             700
sample003       267     1389    721             668

Usage¶

Via CLI¶

grid lpa-realign \
    --cram-dir data/CRAMs \
    --output-file output/lpa_realignment_counts.tsv \
    --ref-fasta refs/hs37d5.fa \
    --lpa-ref-fasta refs/lpa_kiv2_reference.fa \
    --positions-file refs/lpa_exon_positions.tsv \
    --genome-build hg38 \
    --chrom chr6 \
    --start 160500000 \
    --end 160600000 \
    --threads 4

As Python Module¶

from grid.utils.align_lpa import run_lpa_realignments

run_lpa_realignments(
    cram_dir="data/CRAMs",
    reference_fa="refs/hs37d5.fa",
    lpa_ref_fasta="refs/lpa_kiv2_reference.fa",
    positions_file="refs/lpa_exon_positions.tsv",
    genome_build="hg38",
    chrom="chr6",
    start=160500000,
    end=160600000,
    output_file="output/realignment_counts.tsv",
    threads=4
)

LPA Reference Structure¶

The LPA gene contains multiple exons in the KIV-2 VNTR:

LPA Gene Structure:
[KIV-1][KIV-2 (variable copies)][KIV-3][KIV-4]...[Protease]

KIV-2 Repeats:
- Exon 1A: Diagnostic marker for KIV-2 copy number
- Exon 1B: Present in all KIV-2 copies
- Exon 1B_KIV3: Specific to KIV-3 (single copy)
- Exon 1B_notKIV3: Non-KIV3 copies

Realignment Algorithm¶

Step 1: Extract Reads¶

For each CRAM:

Fetch reads overlapping LPA VNTR region
Extract read sequences and qualities
Convert to FASTQ format

Step 2: Realignment¶

Using BWA-MEM or similar aligner:

bwa mem lpa_reference.fa reads.fastq > aligned.sam

Parameters:

Optimized for short, repetitive sequences
Allowing multiple alignments per read
Sensitive to detect all exon matches

Step 3: Count Exon Hits¶

For each aligned read:

Check alignment position against exon coordinates
Assign read to exon type (1A, 1B, 1B_KIV3, etc.)
Count reads per exon type
Handle multi-mapping reads appropriately

Step 4: Aggregate Results¶

Combine counts across all samples
Write to TSV output

Positions File Format¶

The positions file defines exon boundaries in the LPA reference:

exon_type       start   end     description
exon_1A         1000    1150    KIV-2 specific marker
exon_1B         2000    2200    All KIV-2 copies
exon_1B_KIV3    3000    3200    KIV-3 specific
exon_1B_notKIV3 4000    4200    Non-KIV3 copies

Columns:

exon_type: Exon identifier
start: Start position in LPA reference (0-based)
end: End position in LPA reference (0-based)
description: Human-readable description

Exon Types Explained¶

Exon 1A¶

Uniqueness: Found only in KIV-2 repeats
Copy number: ~1 per KIV-2 repeat unit
Use: Primary marker for KIV-2 CN estimation
Formula component: Strong predictor (weight ~35 in final formula)

Exon 1B¶

Uniqueness: Present in all KIV copies (KIV-2, KIV-3, etc.)
Copy number: Sum of all KIV types
Use: Secondary marker
Formula component: Moderate predictor (weight ~5)

Exon 1B_KIV3¶

Uniqueness: Specific to KIV-3 (single copy per haplotype)
Copy number: 1-2 (diploid)
Use: Normalization anchor point
Interpretation: Should be ~2 in diploid individuals

Exon 1B_notKIV3¶

Uniqueness: KIV copies excluding KIV-3
Copy number: Variable
Use: Additional normalization
Calculation: Exon_1B - Exon_1B_KIV3

Copy Number Formula¶

The final KIV-2 copy number is estimated using:

diploid_CN = 34.9 × exon_1A + 5.2 × exon_1B - 1
haploid_CN = diploid_CN / 2

Coefficients derived from:

Regression against orthogonal validation (PacBio long-reads, qPCR)
Empirical calibration on reference samples
Published in Mukamel et al. (2021)

Technical Details¶

Aligner Choice¶

Recommended: BWA-MEM

Fast for short reads
Handles repetitive sequences well
Standard in genomics pipelines

Alternative: Bowtie2, minimap2

May offer speed/sensitivity trade-offs

Performance¶

Parallelization: Process CRAMs in parallel
Speed: ~2-5 minutes per CRAM
Memory: <4GB per process
Disk: Minimal (streaming)

Dependencies¶

External: bwa, samtools
Python: pysam, pandas

Quality Control¶

Expected Counts (30X WGS)¶

For a sample with 20 KIV-2 copies (diploid):

exon_1A: ~600 reads (20 copies × ~30X)
exon_1B: ~750 reads (includes KIV-3)
exon_1B_KIV3: ~60 reads (2 copies)

Validation Checks¶

Exon 1B > Exon 1A: Always true (includes KIV-3)
Exon 1B_KIV3 ≈ 60 reads: Should be ~2 diploid copies
Read ratios consistent: Within expected range

Outlier Detection¶

Warning signs:

Exon_1A = 0: Complete mapping failure
Exon_1B_KIV3 > 200: Possible KIV-3 duplication
Very low counts: Low coverage or alignment issues

Troubleshooting¶

No reads aligned:

Check LPA reference matches genome build
Verify positions file coordinates
Ensure BWA index exists for LPA reference

Unexpected ratios:

# Check exon relationships
ratio = exon_1B / exon_1A
# Should be ~1.2-1.5 for most samples

Index missing:

# Build BWA index
bwa index lpa_reference.fa

High KIV-3 counts:

May indicate KIV-3 duplication (rare variant)
Verify with orthogonal methods
Check family members if available

References¶

Mukamel et al. (2021) - Original LPA CN methodology
Hao et al. (2024) - vntrwrap implementation
Li & Durbin (2009) - BWA aligner algorithm

Future Enhancements¶

[ ] Support for long-read data (PacBio, ONT)
[ ] Phased copy number estimation
[ ] Variant calling within KIV-2 repeats
[ ] Integration with graph-based alignment

Python API Documentation¶

grid.utils.align_lpa¶

grid.utils.align_lpa.run_lpa_realignments(cram_dir, reference_fa, lpa_ref_fasta, positions_file, genome_build, chrom, start, end, output_file, threads=1)¶

Main function to run LPA realignment for all CRAM files in a directory.

UPDATED: Now supports parallel processing with multiple threads.

Parameters:

cram_dir (str) – Directory containing CRAM/BAM files
reference_fa (str) – Reference genome FASTA for CRAM files
lpa_ref_fasta (str) – LPA reference FASTA for realignment
positions_file (str) – File with hardcoded repeat positions
genome_build (str) – Genome build (‘hg19’, ‘hg37’, or ‘hg38’)
chrom (str) – Chromosome (e.g., ‘chr6’ or ‘6’)
start (int) – Start position (0-based)
end (int) – End position (0-based)
output_file (str) – Output TSV file path
threads (int) – Number of threads for parallel processing (default: 1)

Returns:

None

grid.utils.align_lpa_dir¶

LPA realignment utilities.

This package contains modular components for realigning reads in the LPA KIV-2 region and classifying variant types.

grid.utils.align_lpa_dir.classify_read_pairs(reads, ref_seq, starts, ref_offset)¶

Classify read pairs into variant categories.

UPDATED: Now properly groups reads by qname before processing pairs. This fixes the bug where pairs had to be consecutive in the list.

Parameters:

reads (List[Dict]) – List of read dictionaries
ref_seq (str) – Reference sequence
starts (List[int]) – List of repeat start positions
ref_offset (int) – Reference offset

Returns:

Counts for each variant type:

’1B_KIV3’: Count of 1B_KIV3 variants
’1B_KIV2’: Count of 1B_KIV2 variants
’1B_tied’: Count of 1B_tied variants
’1A’: Count of 1A variants

Return type:

Dict[str, int]

grid.utils.align_lpa_dir.classify_variant_from_scores(combined_scores)¶

Classify variant based on combined scores from read pair.

Parameters:: combined_scores (List[int]) – Array of 7 combined scores
Returns:: Variant type (‘1B_KIV3’, ‘1B_KIV2’, ‘1B_tied’, ‘1A’) or None
Return type:: Optional[str]

grid.utils.align_lpa_dir.compute_alignment_scores(pos, cigar, seq, qual, ref_seq, starts, ref_offset)¶

Compute mismatch scores for each of the 7 repeat positions.

Lower score = better match to that repeat position.

UPDATED: Now uses ±1000bp window and closest repeat matching.

Parameters:

pos (int) – Read alignment position (1-based from BAM)
cigar (str) – CIGAR string
seq (str) – Read sequence
qual (str) – Quality string
ref_seq (str) – Reference sequence
starts (List[int]) – List of 7 repeat start positions (1-based)
ref_offset (int) – Reference offset position (1-based)

Returns:

Array of 7 scores, one for each potential repeat alignment

Return type:

List[int]

grid.utils.align_lpa_dir.extract_reads_from_cram(cram_file, reference_fa, region)¶

Extract properly paired reads from CRAM/BAM file using pysam.

Filters for flags: 99, 147, 83, 163 (proper pairs, forward/reverse)

Parameters:

cram_file (str) – Path to CRAM/BAM file
reference_fa (str) – Reference genome FASTA
region (str) – Genomic region to extract (e.g., ‘chr6:160000000-161000000’)

Returns:

List of read dictionaries with keys:

qname: read name
pos: alignment position (1-based)
cigar: CIGAR string
seq: sequence
qual: quality string

Return type:

List[Dict]

grid.utils.align_lpa_dir.find_cram_files(cram_dir)¶

Find all CRAM and BAM files in the specified directory.

Parameters:: cram_dir (str) – Directory to search for CRAM/BAM files
Returns:: List of Path objects for found CRAM/BAM files with valid indices
Return type:: List[Path]

grid.utils.align_lpa_dir.load_positions(positions_file, genome_build)¶

Load hardcoded positions from file.

File format (tab-separated): Hg38file Hg19file 160611000 161032032 160611561 161032593 …

Parameters:

positions_file (str) – Path to positions file
genome_build (str) – Genome build (‘hg38’ or ‘hg19’)

Returns:

(list of 7 repeat starts, reference offset)

Return type:

Tuple[List[int], int]

Raises:

ValueError – If positions file is malformed or doesn’t contain 7 repeats

grid.utils.align_lpa_dir.load_reference(ref_fasta_path)¶

Load reference sequence from FASTA file.

Parameters:

ref_fasta_path (str) – Path to reference FASTA file

Returns:

Reference sequence (without header lines)

Return type:

str

Raises:

FileNotFoundError – If FASTA file doesn’t exist
ValueError – If FASTA file is empty or malformed

grid.utils.align_lpa_dir.parse_cigar(cigar)¶

Parse CIGAR string into list of (length, operation) tuples.

Parameters:: cigar (str) – CIGAR string (e.g., ‘100M’, ‘50M1I49M’)
Returns:: List of (length, operation) tuples
Return type:: List[Tuple[int, str]]

grid.utils.align_lpa_dir.process_cram_file(cram_file, reference_fa, region, ref_seq, starts, ref_offset, output_file)¶

Process a single CRAM/BAM file to classify LPA variants.

Parameters:

cram_file (Path) – Path to CRAM/BAM file
reference_fa (str) – Reference genome FASTA
region (str) – Genomic region (e.g., ‘chr6:160000000-161000000’)
ref_seq (str) – LPA reference sequence
starts (List[int]) – List of 7 repeat start positions
ref_offset (int) – Reference offset position
output_file (str) – Output file to append results

Returns:

Result dictionary with ‘success’, ‘sample_id’, ‘read_count’, ‘counts’

Return type:

Dict

grid.utils.align_lpa_dir.validate_inputs(cram_dir, reference_fa, lpa_ref_fasta, positions_file)¶

Validate that all required input files and directories exist.

Parameters:

cram_dir (str) – Directory containing CRAM/BAM files
reference_fa (str) – Reference genome FASTA
lpa_ref_fasta (str) – LPA reference FASTA
positions_file (str) – Positions file

Raises:

FileNotFoundError – If any required file/directory is missing

LPA Realignment Module¶

Overview¶