LPA Realignment Module

Overview

LPA Realignment Module

Realign reads from the LPA KIV-2 VNTR region to a high-quality reference sequence for improved copy number estimation.

Purpose

Standard genome alignment may be suboptimal in repeat regions due to:

  • Repetitive sequences confusing aligners

  • Copy number variation causing ambiguous mappings

  • Sequence homology between repeat units

This module realigns reads specifically to the LPA KIV-2 reference, enabling:

  1. Precise read assignment to specific exons (1A, 1B, KIV-3, etc.)

  2. Improved copy number accuracy through targeted alignment

  3. Variant detection within repeat units

Main Module

align_lpa.py

Located in parent directory: grid/utils/align_lpa.py

Key Function:

def run_lpa_realignments(
    cram_dir: str,
    reference_fa: str,
    lpa_ref_fasta: str,
    positions_file: str,
    genome_build: str,
    chrom: str,
    start: int,
    end: int,
    output_file: str,
    threads: int = 1
) -> None

Parameters:

  • cram_dir: Directory containing CRAM files

  • reference_fa: Genome reference FASTA (for CRAM decoding)

  • lpa_ref_fasta: LPA KIV-2 specific reference FASTA

  • positions_file: TSV with exon positions in LPA reference

  • genome_build: Genome build (hg19/hg37/hg38)

  • chrom: Chromosome (e.g., “chr6”)

  • start: VNTR region start

  • end: VNTR region end

  • output_file: Output TSV with realignment counts

  • threads: Parallel processing threads

Output Format:

sample_id       exon_1A exon_1B exon_1B_KIV3    exon_1B_notKIV3
sample001       234     1456    789             667
sample002       198     1523    823             700
sample003       267     1389    721             668

Usage

Via CLI
grid lpa-realign \
    --cram-dir data/CRAMs \
    --output-file output/lpa_realignment_counts.tsv \
    --ref-fasta refs/hs37d5.fa \
    --lpa-ref-fasta refs/lpa_kiv2_reference.fa \
    --positions-file refs/lpa_exon_positions.tsv \
    --genome-build hg38 \
    --chrom chr6 \
    --start 160500000 \
    --end 160600000 \
    --threads 4
As Python Module
from grid.utils.align_lpa import run_lpa_realignments

run_lpa_realignments(
    cram_dir="data/CRAMs",
    reference_fa="refs/hs37d5.fa",
    lpa_ref_fasta="refs/lpa_kiv2_reference.fa",
    positions_file="refs/lpa_exon_positions.tsv",
    genome_build="hg38",
    chrom="chr6",
    start=160500000,
    end=160600000,
    output_file="output/realignment_counts.tsv",
    threads=4
)

LPA Reference Structure

The LPA gene contains multiple exons in the KIV-2 VNTR:

LPA Gene Structure:
[KIV-1][KIV-2 (variable copies)][KIV-3][KIV-4]...[Protease]

KIV-2 Repeats:
- Exon 1A: Diagnostic marker for KIV-2 copy number
- Exon 1B: Present in all KIV-2 copies
- Exon 1B_KIV3: Specific to KIV-3 (single copy)
- Exon 1B_notKIV3: Non-KIV3 copies

Realignment Algorithm

Step 1: Extract Reads

For each CRAM:

  1. Fetch reads overlapping LPA VNTR region

  2. Extract read sequences and qualities

  3. Convert to FASTQ format

Step 2: Realignment

Using BWA-MEM or similar aligner:

bwa mem lpa_reference.fa reads.fastq > aligned.sam

Parameters:

  • Optimized for short, repetitive sequences

  • Allowing multiple alignments per read

  • Sensitive to detect all exon matches

Step 3: Count Exon Hits

For each aligned read:

  1. Check alignment position against exon coordinates

  2. Assign read to exon type (1A, 1B, 1B_KIV3, etc.)

  3. Count reads per exon type

  4. Handle multi-mapping reads appropriately

Step 4: Aggregate Results
  • Combine counts across all samples

  • Write to TSV output

Positions File Format

The positions file defines exon boundaries in the LPA reference:

exon_type       start   end     description
exon_1A         1000    1150    KIV-2 specific marker
exon_1B         2000    2200    All KIV-2 copies
exon_1B_KIV3    3000    3200    KIV-3 specific
exon_1B_notKIV3 4000    4200    Non-KIV3 copies

Columns:

  • exon_type: Exon identifier

  • start: Start position in LPA reference (0-based)

  • end: End position in LPA reference (0-based)

  • description: Human-readable description

Exon Types Explained

Exon 1A
  • Uniqueness: Found only in KIV-2 repeats

  • Copy number: ~1 per KIV-2 repeat unit

  • Use: Primary marker for KIV-2 CN estimation

  • Formula component: Strong predictor (weight ~35 in final formula)

Exon 1B
  • Uniqueness: Present in all KIV copies (KIV-2, KIV-3, etc.)

  • Copy number: Sum of all KIV types

  • Use: Secondary marker

  • Formula component: Moderate predictor (weight ~5)

Exon 1B_KIV3
  • Uniqueness: Specific to KIV-3 (single copy per haplotype)

  • Copy number: 1-2 (diploid)

  • Use: Normalization anchor point

  • Interpretation: Should be ~2 in diploid individuals

Exon 1B_notKIV3
  • Uniqueness: KIV copies excluding KIV-3

  • Copy number: Variable

  • Use: Additional normalization

  • Calculation: Exon_1B - Exon_1B_KIV3

Copy Number Formula

The final KIV-2 copy number is estimated using:

diploid_CN = 34.9 × exon_1A + 5.2 × exon_1B - 1
haploid_CN = diploid_CN / 2

Coefficients derived from:

  • Regression against orthogonal validation (PacBio long-reads, qPCR)

  • Empirical calibration on reference samples

  • Published in Mukamel et al. (2021)

Technical Details

Aligner Choice

Recommended: BWA-MEM

  • Fast for short reads

  • Handles repetitive sequences well

  • Standard in genomics pipelines

Alternative: Bowtie2, minimap2

  • May offer speed/sensitivity trade-offs

Performance
  • Parallelization: Process CRAMs in parallel

  • Speed: ~2-5 minutes per CRAM

  • Memory: <4GB per process

  • Disk: Minimal (streaming)

Dependencies
  • External: bwa, samtools

  • Python: pysam, pandas

Quality Control

Expected Counts (30X WGS)

For a sample with 20 KIV-2 copies (diploid):

exon_1A: ~600 reads (20 copies × ~30X)
exon_1B: ~750 reads (includes KIV-3)
exon_1B_KIV3: ~60 reads (2 copies)
Validation Checks
  1. Exon 1B > Exon 1A: Always true (includes KIV-3)

  2. Exon 1B_KIV3 ≈ 60 reads: Should be ~2 diploid copies

  3. Read ratios consistent: Within expected range

Outlier Detection

Warning signs:

  • Exon_1A = 0: Complete mapping failure

  • Exon_1B_KIV3 > 200: Possible KIV-3 duplication

  • Very low counts: Low coverage or alignment issues

Troubleshooting

No reads aligned:

  • Check LPA reference matches genome build

  • Verify positions file coordinates

  • Ensure BWA index exists for LPA reference

Unexpected ratios:

# Check exon relationships
ratio = exon_1B / exon_1A
# Should be ~1.2-1.5 for most samples

Index missing:

# Build BWA index
bwa index lpa_reference.fa

High KIV-3 counts:

  • May indicate KIV-3 duplication (rare variant)

  • Verify with orthogonal methods

  • Check family members if available

References

  • Mukamel et al. (2021) - Original LPA CN methodology

  • Hao et al. (2024) - vntrwrap implementation

  • Li & Durbin (2009) - BWA aligner algorithm

Future Enhancements

  • [ ] Support for long-read data (PacBio, ONT)

  • [ ] Phased copy number estimation

  • [ ] Variant calling within KIV-2 repeats

  • [ ] Integration with graph-based alignment

Python API Documentation

grid.utils.align_lpa

grid.utils.align_lpa.run_lpa_realignments(cram_dir, reference_fa, lpa_ref_fasta, positions_file, genome_build, chrom, start, end, output_file, threads=1)

Main function to run LPA realignment for all CRAM files in a directory.

UPDATED: Now supports parallel processing with multiple threads.

Parameters:
  • cram_dir (str) – Directory containing CRAM/BAM files

  • reference_fa (str) – Reference genome FASTA for CRAM files

  • lpa_ref_fasta (str) – LPA reference FASTA for realignment

  • positions_file (str) – File with hardcoded repeat positions

  • genome_build (str) – Genome build (‘hg19’, ‘hg37’, or ‘hg38’)

  • chrom (str) – Chromosome (e.g., ‘chr6’ or ‘6’)

  • start (int) – Start position (0-based)

  • end (int) – End position (0-based)

  • output_file (str) – Output TSV file path

  • threads (int) – Number of threads for parallel processing (default: 1)

Returns:

None

grid.utils.align_lpa_dir

LPA realignment utilities.

This package contains modular components for realigning reads in the LPA KIV-2 region and classifying variant types.

grid.utils.align_lpa_dir.classify_read_pairs(reads, ref_seq, starts, ref_offset)

Classify read pairs into variant categories.

UPDATED: Now properly groups reads by qname before processing pairs. This fixes the bug where pairs had to be consecutive in the list.

Parameters:
  • reads (List[Dict]) – List of read dictionaries

  • ref_seq (str) – Reference sequence

  • starts (List[int]) – List of repeat start positions

  • ref_offset (int) – Reference offset

Returns:

Counts for each variant type:
  • ’1B_KIV3’: Count of 1B_KIV3 variants

  • ’1B_KIV2’: Count of 1B_KIV2 variants

  • ’1B_tied’: Count of 1B_tied variants

  • ’1A’: Count of 1A variants

Return type:

Dict[str, int]

grid.utils.align_lpa_dir.classify_variant_from_scores(combined_scores)

Classify variant based on combined scores from read pair.

Parameters:

combined_scores (List[int]) – Array of 7 combined scores

Returns:

Variant type (‘1B_KIV3’, ‘1B_KIV2’, ‘1B_tied’, ‘1A’) or None

Return type:

Optional[str]

grid.utils.align_lpa_dir.compute_alignment_scores(pos, cigar, seq, qual, ref_seq, starts, ref_offset)

Compute mismatch scores for each of the 7 repeat positions.

Lower score = better match to that repeat position.

UPDATED: Now uses ±1000bp window and closest repeat matching.

Parameters:
  • pos (int) – Read alignment position (1-based from BAM)

  • cigar (str) – CIGAR string

  • seq (str) – Read sequence

  • qual (str) – Quality string

  • ref_seq (str) – Reference sequence

  • starts (List[int]) – List of 7 repeat start positions (1-based)

  • ref_offset (int) – Reference offset position (1-based)

Returns:

Array of 7 scores, one for each potential repeat alignment

Return type:

List[int]

grid.utils.align_lpa_dir.extract_reads_from_cram(cram_file, reference_fa, region)

Extract properly paired reads from CRAM/BAM file using pysam.

Filters for flags: 99, 147, 83, 163 (proper pairs, forward/reverse)

Parameters:
  • cram_file (str) – Path to CRAM/BAM file

  • reference_fa (str) – Reference genome FASTA

  • region (str) – Genomic region to extract (e.g., ‘chr6:160000000-161000000’)

Returns:

List of read dictionaries with keys:
  • qname: read name

  • pos: alignment position (1-based)

  • cigar: CIGAR string

  • seq: sequence

  • qual: quality string

Return type:

List[Dict]

grid.utils.align_lpa_dir.find_cram_files(cram_dir)

Find all CRAM and BAM files in the specified directory.

Parameters:

cram_dir (str) – Directory to search for CRAM/BAM files

Returns:

List of Path objects for found CRAM/BAM files with valid indices

Return type:

List[Path]

grid.utils.align_lpa_dir.load_positions(positions_file, genome_build)

Load hardcoded positions from file.

File format (tab-separated): Hg38file Hg19file 160611000 161032032 160611561 161032593 …

Parameters:
  • positions_file (str) – Path to positions file

  • genome_build (str) – Genome build (‘hg38’ or ‘hg19’)

Returns:

(list of 7 repeat starts, reference offset)

Return type:

Tuple[List[int], int]

Raises:

ValueError – If positions file is malformed or doesn’t contain 7 repeats

grid.utils.align_lpa_dir.load_reference(ref_fasta_path)

Load reference sequence from FASTA file.

Parameters:

ref_fasta_path (str) – Path to reference FASTA file

Returns:

Reference sequence (without header lines)

Return type:

str

Raises:
  • FileNotFoundError – If FASTA file doesn’t exist

  • ValueError – If FASTA file is empty or malformed

grid.utils.align_lpa_dir.parse_cigar(cigar)

Parse CIGAR string into list of (length, operation) tuples.

Parameters:

cigar (str) – CIGAR string (e.g., ‘100M’, ‘50M1I49M’)

Returns:

List of (length, operation) tuples

Return type:

List[Tuple[int, str]]

grid.utils.align_lpa_dir.process_cram_file(cram_file, reference_fa, region, ref_seq, starts, ref_offset, output_file)

Process a single CRAM/BAM file to classify LPA variants.

Parameters:
  • cram_file (Path) – Path to CRAM/BAM file

  • reference_fa (str) – Reference genome FASTA

  • region (str) – Genomic region (e.g., ‘chr6:160000000-161000000’)

  • ref_seq (str) – LPA reference sequence

  • starts (List[int]) – List of 7 repeat start positions

  • ref_offset (int) – Reference offset position

  • output_file (str) – Output file to append results

Returns:

Result dictionary with ‘success’, ‘sample_id’, ‘read_count’, ‘counts’

Return type:

Dict

grid.utils.align_lpa_dir.validate_inputs(cram_dir, reference_fa, lpa_ref_fasta, positions_file)

Validate that all required input files and directories exist.

Parameters:
  • cram_dir (str) – Directory containing CRAM/BAM files

  • reference_fa (str) – Reference genome FASTA

  • lpa_ref_fasta (str) – LPA reference FASTA

  • positions_file (str) – Positions file

Raises:

FileNotFoundError – If any required file/directory is missing