CRAM Index Management

Overview

Ensures CRAM index (.crai) files exist for efficient random access to CRAM files.

Random access to specific genomic regions of a CRAM file requires an index. This utility checks for the .crai file next to the CRAM and creates it if it doesn’t exist.

CLI Reference

grid crai [OPTIONS]

Options:

-c, --cram PATH: Input CRAM file [required]
-r, --reference PATH: Reference genome FASTA [required]
-h, --help: Show help message and exit

Usage Examples

Via CLI

Basic usage:

grid crai \
    --cram sample.cram \
    --reference refs/hs37d5.fa

Process multiple files:

# Loop through all CRAMs
for cram in data/CRAMs/*.cram; do
    grid crai --cram "$cram" --reference refs/hs37d5.fa
done

In a pipeline:

# Ensure index exists before processing
grid crai -c sample.cram -r refs/hs37d5.fa
grid count-reads -C data/CRAMs -o counts.tsv ...

Via Python

from grid.utils.ensure_crai import ensure_crai

# Single file
crai_path = ensure_crai(
    cram_path="sample.cram",
    reference="refs/hs37d5.fa"
)
print(f"Index ready: {crai_path}")

# Multiple files
from pathlib import Path

cram_dir = Path("data/CRAMs")
ref_fasta = "refs/hs37d5.fa"

for cram_file in cram_dir.glob("*.cram"):
    crai_path = ensure_crai(cram_path=str(cram_file), reference=ref_fasta)
    print(f"✓ {cram_file.name}")

Description

What it does:

Checks whether a .crai index exists alongside the CRAM file
If it is missing, creates the index using samtools index
Validates that the index was created properly
Returns the path to the index file

Why you need it:

CRAM is a reference-based compressed alignment format (a more compact alternative to BAM). CRAM files require an index for:

Random access - Jump to specific genomic regions without reading entire file
Parallel processing - Process different regions simultaneously
Memory efficiency - Read only necessary data
Pipeline requirements - Most downstream tools require indexed CRAMs

Index location:

The index is created in the same directory as the CRAM with a .crai extension:

data/
├── sample001.cram
├── sample001.cram.crai  ← Created automatically
├── sample002.cram
└── sample002.cram.crai  ← Created automatically
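
The naming convention shown above (the CRAM file name plus .crai) can be checked with a few lines of standard-library Python. This is a minimal sketch; `expected_crai` and `find_unindexed` are hypothetical helper names used for illustration and are not part of grid.

```python
from pathlib import Path


def expected_crai(cram_path):
    """Return the index path that sits next to the CRAM (file name + '.crai')."""
    cram = Path(cram_path)
    return cram.with_name(cram.name + ".crai")


def find_unindexed(cram_dir):
    """List CRAM files in cram_dir that have no .crai file beside them."""
    return sorted(
        cram for cram in Path(cram_dir).glob("*.cram")
        if not expected_crai(cram).exists()
    )
```

Calling `find_unindexed("data/CRAMs")` before a long run gives a quick list of files that still need indexing.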

Technical Details

Algorithm:

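from os.path import exists
from subprocess import run

def ensure_crai(cram_path, reference):
    # The index sits next to the CRAM: <cram_path>.crai
    # (reference is not used in this simplified outline)
    crai_path = f"{cram_path}.crai"

    if not exists(crai_path):
        # Create index using samtools (4 threads); raise if samtools fails
        run(["samtools", "index", "-@", "4", cram_path], check=True)

    return crai_path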
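
The outline above does not show the validation step listed under "What it does". One way that step might look is sketched here. It is an illustrative variant, not the code shipped in grid: `ensure_crai_checked` is a hypothetical name, and the `runner` parameter is a hook added only so the function can be exercised without samtools installed.

```python
import os
import subprocess


def ensure_crai_checked(cram_path, threads=4, runner=subprocess.run):
    """Create <cram_path>.crai if it is missing, then confirm it is non-empty."""
    crai_path = f"{cram_path}.crai"

    if not os.path.exists(crai_path):
        # samtools index writes <cram_path>.crai next to the CRAM by default
        runner(["samtools", "index", "-@", str(threads), cram_path], check=True)

    # A missing or zero-byte index usually means indexing failed or was interrupted
    if not os.path.isfile(crai_path) or os.path.getsize(crai_path) == 0:
        raise RuntimeError(f"Index missing or empty: {crai_path}")

    return crai_path
```

Raising an exception instead of returning the path of an unusable index keeps a failed indexing run from surfacing later as a confusing downstream error.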

Performance:

Speed: ~1-5 minutes per CRAM (depends on file size)
Memory: <2GB RAM
Disk: Index file is ~0.1% of CRAM size (e.g., 10MB for 10GB CRAM)
Threads: samtools index accepts a thread count via its -@ flag

Index structure:

The .crai file contains:

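Reference sequence ID, alignment start, and alignment span of each slice, for rapid seeking
Byte offsets of the containers and slices in the CRAM
Slice sizes in bytes
Gzip-compressed, tab-delimited text (not human-readable until decompressed)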
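
Although a .crai file cannot be read as-is, the CRAM specification describes it as gzip-compressed, tab-delimited text with six integer columns per slice, so it can be inspected with the standard library. The sketch below assumes that layout. `CraiEntry` and `read_crai` are hypothetical names, and the field names are labels chosen here for the six columns.

```python
import gzip
from typing import NamedTuple


class CraiEntry(NamedTuple):
    ref_id: int            # reference sequence id
    start: int             # alignment start
    span: int              # alignment span
    container_offset: int  # absolute byte offset of the container in the CRAM
    slice_offset: int      # slice offset relative to the container
    slice_size: int        # slice size in bytes


def read_crai(crai_path):
    """Parse a .crai file into a list of CraiEntry records."""
    entries = []
    with gzip.open(crai_path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 6:  # skip blank or malformed lines
                entries.append(CraiEntry(*map(int, fields)))
    return entries
```

An index that parses to zero entries, or that fails to decompress, is a sign that indexing did not complete.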

Python API Documentation

grid.utils.ensure_crai

grid.utils.ensure_crai.ensure_crai(cram_path, reference=None)

Ensure a CRAI index exists for the given CRAM file.

Parameters:

cram_path (str) – Path to CRAM file.
reference (str) – Optional reference genome FASTA (required if CRAM is unindexed)

Returns:

Path to CRAI file.

Return type:

str

Common Use Cases

Use Case 1: Pipeline Initialization

Ensure all CRAMs are indexed before starting analysis:

#!/bin/bash
# prepare_crams.sh

CRAM_DIR="data/CRAMs"
REF="refs/hs37d5.fa"

echo "Indexing CRAMs..."
for cram in "$CRAM_DIR"/*.cram; do
    if [ ! -f "${cram}.crai" ]; then
        echo "Indexing $(basename "$cram")..."
        grid crai -c "$cram" -r "$REF"
    else
        echo "✓ $(basename "$cram") already indexed"
    fi
done
echo "All CRAMs indexed!"

Use Case 2: Parallel Indexing

Index multiple CRAMs in parallel on HPC:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --cpus-per-task=4

# Get CRAM file for this array job (the Nth file in sorted order)
CRAM=$(ls data/CRAMs/*.cram | sed -n "${SLURM_ARRAY_TASK_ID}p")

grid crai -c "$CRAM" -r refs/hs37d5.fa
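
Where no scheduler is available, a similar fan-out can be approximated on a single machine. The sketch below uses a thread pool, on the assumption that each task mostly waits on an external samtools process. `index_many` is a hypothetical helper, and `index_fn` stands in for `ensure_crai` so the pattern can be tried without grid installed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def index_many(cram_paths, index_fn, max_workers=4):
    """Run index_fn over cram_paths concurrently.

    Returns (indexed, failed): a dict mapping CRAM path to index path,
    and a dict mapping CRAM path to the exception that was raised.
    """
    indexed, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(index_fn, path): path for path in cram_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                indexed[path] = future.result()
            except Exception as exc:  # keep going; report failures at the end
                failed[path] = exc
    return indexed, failed
```

A call might look like `index_many(paths, lambda p: ensure_crai(p, reference="refs/hs37d5.fa"))`. Since each samtools process may itself use several threads, keep `max_workers` small enough that the total stays within the available cores.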

Use Case 3: Conditional Indexing

Only create index if it doesn’t exist or is outdated:

from pathlib import Path
from grid.utils.ensure_crai import ensure_crai

def index_if_needed(cram_path, reference, force=False):
    """Index CRAM only if needed."""
    cram = Path(cram_path)
    crai = Path(f"{cram_path}.crai")

    # Reuse the index only if it exists and is at least as new as the CRAM
    if crai.exists() and not force:
        if crai.stat().st_mtime >= cram.stat().st_mtime:
            print(f"✓ {cram.name}: Index up to date")
            return str(crai)

    # Remove any stale index first: ensure_crai skips CRAMs that already have one
    crai.unlink(missing_ok=True)

    # Create or recreate index
    print(f"Indexing {cram.name}...")
    return ensure_crai(cram_path, reference)

Dependencies

Required:

samtools - Must be installed and in PATH

# Install via conda
conda install -c bioconda samtools

# Verify installation
samtools --version

pysam - Python library for BAM/CRAM handling

pip install pysam

Optional:

parallel - GNU parallel for batch indexing
slurm - For HPC array job indexing
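
Both required dependencies can be checked up front from Python before a batch run starts. This is a minimal sketch using only the standard library; `check_dependencies` is a hypothetical helper and not part of grid.

```python
import importlib.util
import shutil


def check_dependencies(tools=("samtools",), modules=("pysam",)):
    """Return a list of human-readable problems; the list is empty if all is present."""
    problems = []
    for tool in tools:
        # shutil.which searches PATH the same way the shell would
        if shutil.which(tool) is None:
            problems.append(f"executable not found in PATH: {tool}")
    for module in modules:
        # find_spec returns None when a top-level module cannot be imported
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module not importable: {module}")
    return problems
```

Printing the returned list and exiting early gives a clearer message than a failure partway through indexing.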

Troubleshooting

Error: samtools: command not found

Solution: Install samtools

conda install -c bioconda samtools
# or
apt-get install samtools  # Ubuntu/Debian

Error: [E::hts_open_format] fail to open file

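Cause: The CRAM path is wrong or unreadable, the CRAM is corrupted, or the reference genome does not match

Solution: Verify the file and the reference:
- Check that the CRAM exists and is readable: ls -l sample.cram
- Check CRAM header: samtools view -H sample.cram | grep '@SQ'
- Ensure reference build matches (hg19 vs hg38)
- Try redownloading CRAM if corrupted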
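
The header check can be automated by comparing the @SQ lines of the CRAM header with the reference's .fai index (created by samtools faidx). The sketch below works on text that has already been read in, for example the output of samtools view -H and the contents of refs/hs37d5.fa.fai. All three function names are hypothetical.

```python
def parse_sq_lines(header_text):
    """Map contig name -> length from the @SQ lines of a SAM/CRAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
            contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def parse_fai(fai_text):
    """Map contig name -> length from a FASTA .fai index (first two columns)."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def reference_mismatches(header_text, fai_text):
    """Return CRAM contigs that are absent from the reference or differ in length.

    Values are (length in CRAM header, length in reference or None).
    """
    ref = parse_fai(fai_text)
    return {
        name: (length, ref.get(name))
        for name, length in parse_sq_lines(header_text).items()
        if ref.get(name) != length
    }
```

An empty result means every contig in the CRAM header has a same-named, same-length entry in the reference. A result listing every contig often points to a naming difference such as chr6 versus 6.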

Error: Permission denied

Cause: No write permission in CRAM directory

Solution:
- Check directory permissions: ls -la data/CRAMs
- Ensure you have write access
- Or create index in writable location (not recommended)

Warning: Index exists but empty

Cause: Indexing interrupted or failed

Solution: Remove and recreate

rm sample.cram.crai
grid crai -c sample.cram -r refs/hs37d5.fa
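
For a whole directory, zero-byte indexes can be found and cleared in bulk before re-running the indexing step. This is a minimal sketch with hypothetical helper names. It only detects completely empty files, so a truncated but non-empty index would not be caught.

```python
from pathlib import Path


def find_empty_indexes(cram_dir):
    """Return .crai files in cram_dir that exist but are zero bytes long."""
    return sorted(
        crai for crai in Path(cram_dir).glob("*.cram.crai")
        if crai.stat().st_size == 0
    )


def remove_empty_indexes(cram_dir):
    """Delete zero-byte .crai files so they can be regenerated; return their paths."""
    removed = find_empty_indexes(cram_dir)
    for crai in removed:
        crai.unlink()
    return removed
```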

Performance: Indexing very slow

Cause: Large CRAM files or slow disk I/O

Solutions:
- Use SSD instead of HDD if possible
- Increase the samtools thread count (the -@ value) in ensure_crai.py
- Process in parallel across multiple files
- Consider pre-indexing large datasets

Quality Checks

Verify index integrity:

# Check index exists and has content
ls -lh sample.cram.crai

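# Verify a region can be read through the index (prints the read count)
# Use the contig naming from the CRAM header (e.g. 6 instead of chr6 for hs37d5)
samtools view -c sample.cram chr6:160000000-160100000

# Time a region query; it should be fast when the index is used
time samtools view sample.cram chr6:160000000-160100000 > /dev/null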

Expected output:

-rw-r--r-- 1 user group 15M Oct 30 12:00 sample.cram.crai

# Should complete in <1 second with index
real    0m0.523s

Best Practices

Index immediately - Create indices right after downloading/generating CRAMs
Store together - Keep .crai files alongside .cram files
Keep in sync - Recreate indices if the CRAM is modified
Backup strategy - Indices are small and can be regenerated if lost, so backing them up is optional
Parallel processing - Index multiple CRAMs simultaneously to save time
Check integrity - Verify index works before starting long pipelines