Extract Reference Sequences

Extract FASTA Sequences from Reference

Extract specific genomic regions from a reference genome based on BED coordinates.

Python API Documentation

grid.utils.extract_reference

Extract reference FASTA sequences from a BED file.

grid.utils.extract_reference.extract_reference(reference_fa, bed_file, output_dir, output_prefix='ref_lpa')

Extracts sequences from a reference FASTA for regions listed in a BED file.

Parameters:

reference_fa (str) – Path to reference genome FASTA.
bed_file (str) – Path to BED file with regions to extract.
output_dir (str) – Directory to write output FASTA file.
output_prefix (str) – Output FASTA file prefix (default: ref_lpa).

Usage

Via CLI

grid extract-reference \
    --reference-fa refs/hs37d5.fa \
    --bed-file lpa_regions.bed \
    --output-dir refs/ \
    --output-prefix lpa_kiv2

Via Python

from grid.utils.extract_reference import extract_reference

extract_reference(
    reference_fa="refs/hs37d5.fa",
    bed_file="lpa_regions.bed",
    output_dir="refs/",
    output_prefix="lpa_kiv2"
)

Description

This utility extracts sequences from a reference genome for specific regions:

Reads the BED file with the target coordinates
Extracts the sequences using samtools faidx
Writes them to a new FASTA file
Creates an index (.fai) for the new reference
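
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: bed_to_regions and extract_regions_sketch are hypothetical names, and the sketch shells out to samtools directly. samtools faidx reads region strings as 1-based and inclusive, so the sketch shifts each BED start by one. Headers written this way would read chr6:160500001-160510000, which differs from the header example on this page, so the real tool may format regions differently.

```python
import subprocess
from pathlib import Path


def bed_to_regions(bed_path):
    """Convert BED intervals (0-based, end-exclusive) to samtools region strings."""
    regions = []
    with open(bed_path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the usual BED comment/header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            # samtools regions are 1-based and inclusive, so shift the start by one.
            regions.append(f"{chrom}:{int(start) + 1}-{int(end)}")
    return regions


def extract_regions_sketch(reference_fa, bed_file, output_dir, output_prefix="ref_lpa"):
    """Hypothetical stand-in for extract_reference(); needs samtools on PATH."""
    out_fa = Path(output_dir) / f"{output_prefix}.fa"
    out_fa.parent.mkdir(parents=True, exist_ok=True)
    regions = bed_to_regions(bed_file)

    # Steps 2 and 3: samtools faidx prints the requested regions to stdout.
    with open(out_fa, "w") as out:
        subprocess.run(
            ["samtools", "faidx", str(reference_fa), *regions],
            stdout=out,
            check=True,
        )

    # Step 4: index the new FASTA, producing <prefix>.fa.fai next to it.
    subprocess.run(["samtools", "faidx", str(out_fa)], check=True)
    return out_fa
```

Only bed_to_regions is pure Python; extract_regions_sketch raises subprocess.CalledProcessError if samtools exits with an error.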

Use cases:

LPA Reference - Extract KIV-2 region for realignment
Custom References - Build VNTR-specific references
Validation - Create test references for development

BED File Format

Standard BED format (0-based start, exclusive end):

chr6    160500000    160510000    KIV2_region1
chr6    160520000    160530000    KIV2_region2

Columns:

Chromosome name
Start position (0-based)
End position (exclusive)
Region name (optional)
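
To make the column layout concrete, here is a minimal sketch of parsing one BED line into a typed record. BedRegion and parse_bed_line are hypothetical helpers written for this page and are not part of the grid API. Because the end position is exclusive, the region length is simply end minus start.

```python
from typing import NamedTuple, Optional


class BedRegion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open intervals need no +1 correction.
        return self.end - self.start


def parse_bed_line(line: str) -> BedRegion:
    """Parse the first three (optionally four) whitespace-separated BED columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}: {line!r}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end} in: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    return BedRegion(chrom, start, end, name)


region = parse_bed_line("chr6    160500000    160510000    KIV2_region1")
print(region.name, region.length)
```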

Output

Creates:

<prefix>.fa - Extracted sequences in FASTA format
<prefix>.fa.fai - Index file for the extracted reference

The output FASTA will have headers based on the BED regions:

>chr6:160500000-160510000
ATCGATCGATCG...
>chr6:160520000-160530000
GCTAGCTAGCTA...
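
Downstream code often needs to map a header of this form back to genomic coordinates. The sketch below is one way to do that, assuming the contig:start-end layout shown above; parse_region_header is a hypothetical helper, not part of grid. It returns the numbers exactly as written and does not assume whether they follow the 0-based BED convention or the 1-based samtools convention, since this page does not say.

```python
import re

# Greedy \S+ lets the contig name itself contain ':' (some contig names do);
# the final ':<digits>-<digits>' is taken as the coordinate part.
HEADER_RE = re.compile(r"^>?(?P<chrom>\S+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region_header(header: str):
    """Split a header such as '>chr6:160500000-160510000' into (chrom, start, end)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a region-style FASTA header: {header!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


print(parse_region_header(">chr6:160500000-160510000"))
```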

Dependencies

samtools - Must be installed and in PATH (uses faidx)
pysam - Python library for FASTA handling
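
Before running an extraction, it can help to confirm that both dependencies are available. The following is a small sketch using only the standard library; missing_dependencies is a hypothetical helper and is not provided by grid.

```python
import importlib.util
import shutil


def missing_dependencies():
    """Return the names of required tools that cannot be found."""
    missing = []
    # samtools must be an executable on PATH.
    if shutil.which("samtools") is None:
        missing.append("samtools")
    # pysam must be importable in the current Python environment.
    if importlib.util.find_spec("pysam") is None:
        missing.append("pysam")
    return missing


missing = missing_dependencies()
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("samtools and pysam were both found")
```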

Notes

Reference genome must be indexed (.fai file required)
Chromosome names in the BED file must match the reference exactly (for example, chr6 and 6 are different names)
Extracted regions maintain original coordinates in headers
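
The first two notes can be checked up front by comparing the BED file against the reference index. The sketch below assumes the standard .fai layout, in which the first two tab-separated columns are the contig name and its length. check_bed_against_fai is a hypothetical helper, not part of grid.

```python
def contigs_from_fai(fai_path):
    """Map contig name to length using the first two columns of a .fai index."""
    contigs = {}
    with open(fai_path) as handle:
        for line in handle:
            if line.strip():
                name, length = line.rstrip("\n").split("\t")[:2]
                contigs[name] = int(length)
    return contigs


def check_bed_against_fai(bed_path, fai_path):
    """Return human-readable problems; an empty list means the BED looks usable."""
    contigs = contigs_from_fai(fai_path)
    problems = []
    with open(bed_path) as handle:
        for lineno, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue  # blank line or comment
            if len(fields) < 3:
                problems.append(f"line {lineno}: fewer than 3 columns")
                continue
            chrom, end = fields[0], int(fields[2])
            if chrom not in contigs:
                problems.append(f"line {lineno}: contig {chrom!r} is not in the reference index")
            elif end > contigs[chrom]:
                problems.append(
                    f"line {lineno}: end {end} exceeds {chrom} length {contigs[chrom]}"
                )
    return problems
```

A typical call would be check_bed_against_fai("lpa_regions.bed", "refs/hs37d5.fa.fai"), using the paths from the usage example; if the index file does not exist, open() raises FileNotFoundError, which also flags an unindexed reference.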