Welcome to GRiD’s documentation!¶
GRiD - Genomic Repeat inference from Depth¶
LPA KIV-2 Copy Number Variant Estimation Pipeline¶
GRiD is a modular, extensible pipeline for estimating copy number variants (CNVs) in the LPA gene’s KIV-2 VNTR region using short-read sequencing data.
Although currently specialized for LPA, GRiD is designed with modularity in mind — future extensions to other repeat loci are encouraged.
Based on the methodology from:
Mukamel et al. (2021) - Original pipeline
Hao (2024) - C++/shell implementation
Installation¶
It is strongly recommended to use a dedicated environment (e.g., conda, micromamba, or venv).
From source¶
git clone https://github.com/caterer-z-t/GRiD.git
cd GRiD
# Create environment (recommended)
conda env create -n grid-env -f env.yaml
conda activate grid-env
# Install in editable mode (for now, we will incoorperate this into pypi + conda)
pip install -e .
Usage¶
GRiD provides a modular command-line interface (CLI) for reproducible and scalable execution on Linux or HPC environments (e.g., UNC Longleaf).
Basic Example¶
grid count-reads --cram-dir data/CRAMs \
--output-file output/counts.tsv \
--ref-fasta refs/hs37d5.fa \
--chrom chr6 \
--start 160000000 \
--end 160100000 \
--config config.yaml
Basic Pipeline Run¶
For large CRAM datasets, the CLI is the recommended interface. See examples/SOL.sh for an example SLURM submission script.
Configuration¶
GRiD uses YAML configuration files for flexible setup.
Customize paths, genomic coordinates, and environment parameters in your config.yaml before execution.
Pipeline Steps¶
Step |
Description |
|---|---|
1. GCloud Copy |
(Deprecated) Copy files from Google Cloud Storage. |
2. Count Reads |
Count reads within VNTR regions for each CRAM. |
3. Mosdepth Coverage |
Compute read coverage using |
4. Normalize Coverage |
Normalize per-sample coverage values. |
5. Find Neighbors |
Identify similar samples for reference-based normalization. |
6. Extract Reference |
Extract VNTR reference sequences. |
7. Realign Reads |
Realign reads to the extracted VNTR reference. |
8. Compute Diploid CN |
Estimate diploid copy numbers. |
9. Estimate KIV CN |
Estimate KIV-2 copy number per individual. |
Each step corresponds to a CLI command (e.g., grid mosdepth, grid normalize-mosdepth, etc.).
Run grid --help for a full command list.
Additionally, each of the corresponding commands also have inline documentation, run grid mosdepth --help, grid normalize-mosdepth --help for stepwise parameters.
Pipeline Orchestration¶
The grid/utils/ directory contains all computational logic and helper scripts.
Module |
Description |
|---|---|
|
Ensures CRAM index (.crai) files exist |
|
Counts paired reads in VNTR region |
|
Computes coverage using |
|
Normalizes coverage across samples |
|
Identifies nearest neighbors for normalization |
|
Extracts VNTR reference sequences |
|
Realigns reads to LPA reference |
|
Computes diploid copy number estimates |
|
Derives KIV-2 CN estimates |
Each utility can be imported directly in Python, e.g.:
from grid.utils.run_mosdepth import run_mosdepth
run_mosdepth(
cram_dir="data/crams",
output_file="out/mosdepth.tsv",
ref_fasta="refs/hs37d5.fa",
chrom="chr6",
start=160000000,
end=160100000
)
Development¶
Code Style¶
We follow standard Python formatting and type-checking conventions:
black grid/
isort grid/
flake8 grid/
mypy grid/
Build Documentation¶
conda activate grid-env
pip install -r docs/requirements.txt
make -C docs html
Documentation is built with Sphinx and published via ReadTheDocs:
Citation¶
If you use this pipeline, please cite:
Mukamel et al. (2021) Original methodology
Hao et al. (2024) Implementation
This pipeline Your paper when published
License¶
Released under MIT License
Support¶
To-Do’s¶
[] Add/Implement
pipeline.pyandconfig.pyfor unifiedgrid run [args]entrypoint[] Finish README Documentation
[] Add test suite
[] CI/CD workflows