CRAM Index Management

Overview

Ensures CRAM index (.crai) files exist for efficient random access to CRAM files.

CRAM files require an index for efficient random access to specific genomic regions. This utility automatically checks for and creates the necessary .crai index file if it doesn’t exist.

CLI Reference

grid crai [OPTIONS]

Options:

-c, --cram PATH

Input CRAM file [required]

-r, --reference PATH

Reference genome FASTA [required]

-h, --help

Show help message and exit

Usage Examples

Via CLI

Basic usage:

grid crai \
    --cram sample.cram \
    --reference refs/hs37d5.fa

Process multiple files:

# Loop through all CRAMs
for cram in data/CRAMs/*.cram; do
    grid crai --cram "$cram" --reference refs/hs37d5.fa
done

In a pipeline:

# Ensure index exists before processing
grid crai -c sample.cram -r refs/hs37d5.fa
grid count-reads -C data/CRAMs -o counts.tsv ...

Via Python

from grid.utils.ensure_crai import ensure_crai

# Single file
crai_path = ensure_crai(
    cram_path="sample.cram",
    reference="refs/hs37d5.fa"
)
print(f"Index ready: {crai_path}")

# Multiple files
from pathlib import Path

cram_dir = Path("data/CRAMs")
ref_fasta = "refs/hs37d5.fa"

for cram_file in cram_dir.glob("*.cram"):
    crai_path = ensure_crai(cram_path=str(cram_file), reference=ref_fasta)
    print(f"✓ {cram_file.name}")

Description

What it does:

  1. Checks if .crai index exists alongside the CRAM file

  2. If missing, creates index using samtools index

  3. Returns path to the index file

  4. Validates index is properly created

Why you need it:

CRAM files are compressed BAM files that require an index for:

  • Random access - Jump to specific genomic regions without reading entire file

  • Parallel processing - Process different regions simultaneously

  • Memory efficiency - Read only necessary data

  • Pipeline requirements - Most downstream tools require indexed CRAMs

Index location:

The index is created in the same directory as the CRAM with a .crai extension:

data/
├── sample001.cram
├── sample001.cram.crai  ← Created automatically
├── sample002.cram
└── sample002.cram.crai  ← Created automatically

Technical Details

Algorithm:

def ensure_crai(cram_path, reference):
    crai_path = f"{cram_path}.crai"

    if not exists(crai_path):
        # Create index using samtools
        run(["samtools", "index", "-@", "4", cram_path])

    return crai_path

Performance:

  • Speed: ~1-5 minutes per CRAM (depends on file size)

  • Memory: <2GB RAM

  • Disk: Index file is ~0.1% of CRAM size (e.g., 10MB for 10GB CRAM)

  • Threads: Can specify number of threads with samtools -@ flag

Index structure:

The .crai file contains:

  • Genomic position index for rapid seeking

  • Compression block offsets

  • Metadata for efficient access

  • Binary format (not human-readable)

Python API Documentation

grid.utils.ensure_crai

grid.utils.ensure_crai.ensure_crai(cram_path, reference=None)

Ensure a CRAI index exists for the given CRAM file.

Parameters:
  • cram_path (str) – Path to CRAM file.

  • reference (str) – Optional reference genome FASTA (required if CRAM is unindexed)

Returns:

Path to CRAI file.

Return type:

str

Common Use Cases

Use Case 1: Pipeline Initialization

Ensure all CRAMs are indexed before starting analysis:

#!/bin/bash
# prepare_crams.sh

CRAM_DIR="data/CRAMs"
REF="refs/hs37d5.fa"

echo "Indexing CRAMs..."
for cram in "$CRAM_DIR"/*.cram; do
    if [ ! -f "${cram}.crai" ]; then
        echo "Indexing $(basename $cram)..."
        grid crai -c "$cram" -r "$REF"
    else
        echo "✓ $(basename $cram) already indexed"
    fi
done
echo "All CRAMs indexed!"

Use Case 2: Parallel Indexing

Index multiple CRAMs in parallel on HPC:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --cpus-per-task=4

# Get CRAM file for this array job
CRAM=$(ls data/CRAMs/*.cram | sed -n ${SLURM_ARRAY_TASK_ID}p)

grid crai -c "$CRAM" -r refs/hs37d5.fa

Use Case 3: Conditional Indexing

Only create index if it doesn’t exist or is outdated:

from pathlib import Path
from grid.utils.ensure_crai import ensure_crai

def index_if_needed(cram_path, reference, force=False):
    """Index CRAM only if needed."""
    cram = Path(cram_path)
    crai = Path(f"{cram_path}.crai")

    # Check if index exists and is newer than CRAM
    if crai.exists() and not force:
        if crai.stat().st_mtime > cram.stat().st_mtime:
            print(f"✓ {cram.name}: Index up to date")
            return str(crai)

    # Create or recreate index
    print(f"Indexing {cram.name}...")
    return ensure_crai(cram_path, reference)

Dependencies

Required:

  • samtools - Must be installed and in PATH

    # Install via conda
    conda install -c bioconda samtools
    
    # Verify installation
    samtools --version
    
  • pysam - Python library for BAM/CRAM handling

    pip install pysam
    

Optional:

  • parallel - GNU parallel for batch indexing

  • slurm - For HPC array job indexing

Troubleshooting

Error: samtools: command not found

Solution: Install samtools

conda install -c bioconda samtools
# or
apt-get install samtools  # Ubuntu/Debian

Error: [E::hts_open_format] fail to open file

Cause: Reference genome mismatch or corrupted CRAM

Solution: Verify reference matches CRAM:
- Check CRAM header: samtools view -H sample.cram | grep @SQ
- Ensure reference build matches (hg19 vs hg38)
- Try redownloading CRAM if corrupted

Error: Permission denied

Cause: No write permission in CRAM directory

Solution:
- Check directory permissions: ls -la data/CRAMs
- Ensure you have write access
- Or create index in writable location (not recommended)

Warning: Index exists but empty

Cause: Indexing interrupted or failed

Solution: Remove and recreate

rm sample.cram.crai
grid crai -c sample.cram -r refs/hs37d5.fa

Performance: Indexing very slow

Cause: Large CRAM files or slow disk I/O

Solutions:
- Use SSD instead of HDD if possible
- Increase samtools threads: modify ensure_crai.py
- Process in parallel across multiple files
- Consider pre-indexing large datasets

Quality Checks

Verify index integrity:

# Check index exists and has content
ls -lh sample.cram.crai

# Verify CRAM can be accessed with index
samtools view -H sample.cram chr6:160000000-160100000

# Compare indexed vs non-indexed access time
time samtools view sample.cram chr6:160000000-160100000 > /dev/null

Expected output:

-rw-r--r-- 1 user group 15M Oct 30 12:00 sample.cram.crai

# Should complete in <1 second with index
real    0m0.523s

Best Practices

  1. Index immediately - Create indices right after downloading/generating CRAMs

  2. Store together - Keep .crai files alongside .cram files

  3. Version control - Recreate indices if CRAM is modified

  4. Backup strategy - Indices are small, but can be regenerated if lost

  5. Parallel processing - Index multiple CRAMs simultaneously to save time

  6. Check integrity - Verify index works before starting long pipelines

See Also