Extract Reference Sequences¶
Extract FASTA Sequences from Reference¶
Extract specific genomic regions from a reference genome based on BED coordinates.
Python API Documentation¶
grid.utils.extract_reference¶
Extract reference FASTA sequences from a BED file.
- grid.utils.extract_reference.extract_reference(reference_fa, bed_file, output_dir, output_prefix='ref_lpa')¶
Extracts sequences from a reference FASTA for regions listed in a BED file.
- Parameters:
reference_fa (str) – Path to reference genome FASTA.
bed_file (str) – Path to BED file with regions to extract.
output_dir (str) – Directory to write output FASTA file.
output_prefix (str) – Output FASTA file prefix (default: ref_lpa).
Usage¶
Via CLI¶
grid extract-reference \
  --reference-fa refs/hs37d5.fa \
  --bed-file lpa_regions.bed \
  --output-dir refs/ \
  --output-prefix lpa_kiv2
Via Python¶
from grid.utils.extract_reference import extract_reference

extract_reference(
    reference_fa="refs/hs37d5.fa",
    bed_file="lpa_regions.bed",
    output_dir="refs/",
    output_prefix="lpa_kiv2",
)
Description¶
This utility extracts sequences from a reference genome for specific regions:
Reads BED file with target coordinates
Extracts sequences using samtools faidx
Writes to new FASTA file
Creates index (.fai) for new reference
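The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code inside grid.utils.extract_reference: the helper names build_faidx_commands and run_commands are hypothetical, the region strings are assumed to already be in samtools chrom:start-end form, and the output path is assumed to be <output_dir>/<prefix>.fa. The sketch only builds the two samtools faidx command lines (extract, then index); actually running them requires samtools on PATH and an indexed reference.

```python
import subprocess
from pathlib import Path


def build_faidx_commands(reference_fa, regions, output_dir, output_prefix="ref_lpa"):
    """Return (extract_cmd, index_cmd) as argument lists for subprocess."""
    out_fa = Path(output_dir) / f"{output_prefix}.fa"  # assumed output location
    # `samtools faidx -o OUT REF REGION...` writes the requested regions to OUT.
    extract_cmd = ["samtools", "faidx", "-o", str(out_fa), str(reference_fa), *regions]
    # `samtools faidx OUT` then creates OUT.fai next to the new FASTA.
    index_cmd = ["samtools", "faidx", str(out_fa)]
    return extract_cmd, index_cmd


def run_commands(commands):
    """Run each command in order; raises CalledProcessError if one fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Build (but do not run) the commands for one illustrative region.
extract_cmd, index_cmd = build_faidx_commands(
    "refs/hs37d5.fa", ["chr6:160500000-160510000"], "refs", "lpa_kiv2"
)
print(" ".join(extract_cmd))
print(" ".join(index_cmd))
```

Keeping command construction separate from execution makes the logic easy to inspect or test on a machine without samtools; run_commands([extract_cmd, index_cmd]) would perform the extraction.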
Use cases:
LPA Reference - Extract KIV-2 region for realignment
Custom References - Build VNTR-specific references
Validation - Create test references for development
BED File Format¶
Standard BED format (0-based, half-open coordinates: the start is included, the end is not):
chr6 160500000 160510000 KIV2_region1
chr6 160520000 160530000 KIV2_region2
Columns:
Chromosome name
Start position (0-based)
End position (exclusive)
Region name (optional)
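Because BED intervals are 0-based with an exclusive end, a region's length is simply end minus start, and the equivalent samtools region string (1-based, inclusive) has its start shifted by one. The sketch below shows that arithmetic. BedRegion, parse_bed_line and to_samtools_region are hypothetical names used only for illustration; how extract_reference handles this conversion internally is not stated here, and the headers shown under Output carry the BED coordinates unchanged.

```python
from typing import NamedTuple, Optional


class BedRegion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open interval: no +1 needed.
        return self.end - self.start


def parse_bed_line(line: str) -> BedRegion:
    """Parse the first three or four whitespace-separated BED columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 BED columns, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval: {start}-{end}")
    name = fields[3] if len(fields) > 3 else None
    return BedRegion(chrom, start, end, name)


def to_samtools_region(region: BedRegion) -> str:
    """Convert to a 1-based inclusive region string covering the same bases."""
    return f"{region.chrom}:{region.start + 1}-{region.end}"


region = parse_bed_line("chr6\t160500000\t160510000\tKIV2_region1")
print(region.name, region.length, to_samtools_region(region))
```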
Output¶
Creates:
<prefix>.fa - Extracted sequences in FASTA format
<prefix>.fa.fai - Index file for the extracted reference
The output FASTA will have headers based on the BED regions:
>chr6:160500000-160510000
ATCGATCGATCG...
>chr6:160520000-160530000
GCTAGCTAGCTA...
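To inspect an extracted file, the headers can be split back into chromosome, start and end. The following is a minimal pure-Python sketch, assuming headers of the chrom:start-end form shown above; parse_region_header and read_fasta are hypothetical helpers, not part of the grid package, and the demo writes a tiny stand-in file rather than reading real output.

```python
import re
import tempfile
from pathlib import Path

# Greedy chromosome match so names that themselves contain ":" still parse.
HEADER_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region_header(header):
    """Split 'chrom:start-end' (with or without '>') into (chrom, start, end)."""
    match = HEADER_RE.match(header.lstrip(">").split()[0])
    if match is None:
        raise ValueError(f"not a region header: {header!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


def read_fasta(path):
    """Return {record name: sequence} for a (small) FASTA file."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


# Demo on a stand-in file with a shortened sequence.
with tempfile.TemporaryDirectory() as tmp:
    demo_fa = Path(tmp) / "lpa_kiv2.fa"
    demo_fa.write_text(">chr6:160500000-160510000\nATCGATCGATCG\n")
    for name, seq in read_fasta(demo_fa).items():
        print(parse_region_header(name), len(seq))
```

Loading whole sequences into a dictionary is fine for a handful of extracted regions; for large references an indexed reader such as pysam would be the better fit.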
Dependencies¶
samtools - Must be installed and in PATH (uses faidx)
pysam - Python library for FASTA handling
Notes¶
Reference genome must be indexed (.fai file required)
Chromosome names in BED must match reference exactly
Extracted regions maintain original coordinates in headers
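The first two notes can be checked before running the extraction. The sketch below is a hypothetical pre-flight check, not something the grid package is known to provide: it assumes the index sits at <reference_fa>.fai and relies only on the standard .fai layout, whose first two tab-separated columns are sequence name and length. A typical mismatch it would catch is a BED file using chr6 against a reference whose contig is named 6.

```python
import tempfile
from pathlib import Path


def read_fai_lengths(fai_path):
    """Map sequence name -> length from the first two columns of a .fai file."""
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            name, length = line.rstrip("\n").split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def check_bed_against_reference(reference_fa, bed_file):
    """Return a list of human-readable problems (empty if none were found)."""
    fai_path = Path(str(reference_fa) + ".fai")
    if not fai_path.exists():
        return [f"missing index: {fai_path} (run `samtools faidx {reference_fa}`)"]
    lengths = read_fai_lengths(fai_path)
    problems = []
    with open(bed_file) as handle:
        for line_no, line in enumerate(handle, start=1):
            # Skip blank lines and BED comment/track/browser lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, _start, end = line.split()[:3]
            if chrom not in lengths:
                problems.append(f"line {line_no}: '{chrom}' not in reference index")
            elif int(end) > lengths[chrom]:
                problems.append(f"line {line_no}: end {end} exceeds {chrom} length")
    return problems


# Demo with a made-up index in which chromosome 6 is named "6".
with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "ref.fa"
    Path(str(ref) + ".fai").write_text("6\t1000\t3\t60\t61\n")
    bed = Path(tmp) / "regions.bed"
    bed.write_text("chr6\t100\t200\tregion1\n6\t900\t1100\tregion2\n")
    for problem in check_bed_against_reference(ref, bed):
        print(problem)
```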