blueFish — Sovereign Data Pipeline
Sovereign ETL and data pipeline — NCBI integration, format conversion, no cloud lock-in.
Repository: sporeGarden/blueFish (moving from syntheticChemistry — repo pending)
License: scyBorg (AGPL-3.0-or-later + ORC + CC-BY-SA 4.0)
What It Is
blueFish is a sovereign data pipeline and ETL (Extract-Transform-Load) tool for scientific data. It handles NCBI database integration, format conversion between bioinformatics standards, and data ingestion for the primal ecosystem — all without sending data to external services.
For any lab working with sequence data, taxonomic databases, or clinical datasets, blueFish provides a local pipeline that respects data sovereignty: your data stays on your hardware, processed by auditable code, with full provenance tracking.
Key Capabilities
- NCBI Integration: Direct access to NCBI databases (GenBank, SRA, Taxonomy) with local caching and incremental updates
- Format Conversion: Converts between FASTA, FASTQ, SAM/BAM, VCF, GFF3, BED, and other bioinformatics formats
- Provenance: Every transformation step is logged with BearDog-signed provenance via the RootPulse composition
- Offline Operation: Once data is fetched, all processing runs locally — no network required
- Pipeline Composition: Integrates with biomeOS Neural API for orchestrated multi-step pipelines
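
As an illustration of the format conversion capability, here is a minimal sketch of a FASTQ to FASTA converter using only the Rust standard library. The function name `fastq_to_fasta` is hypothetical and is not taken from the blueFish codebase. The sketch assumes four-line FASTQ records with unwrapped sequences and simply drops the quality scores.

```rust
use std::io::{self, BufRead, Write};

/// Build an InvalidData error with a short message.
fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Convert four-line FASTQ records to FASTA, dropping quality scores.
/// Returns the number of records written.
/// Assumption: sequences are not wrapped across multiple lines.
pub fn fastq_to_fasta<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<usize> {
    let mut lines = input.lines();
    let mut count = 0;
    while let Some(header) = lines.next() {
        let header = header?;
        if header.trim().is_empty() {
            continue; // tolerate blank lines between records
        }
        let id = header
            .strip_prefix('@')
            .ok_or_else(|| invalid("header must start with '@'"))?;
        let seq = lines.next().ok_or_else(|| invalid("missing sequence line"))??;
        let sep = lines.next().ok_or_else(|| invalid("missing '+' separator"))??;
        if !sep.starts_with('+') {
            return Err(invalid("separator line must start with '+'"));
        }
        let qual = lines.next().ok_or_else(|| invalid("missing quality line"))??;
        if qual.len() != seq.len() {
            return Err(invalid("quality and sequence lengths differ"));
        }
        // FASTA keeps the identifier and the sequence only.
        writeln!(output, ">{}", id)?;
        writeln!(output, "{}", seq)?;
        count += 1;
    }
    Ok(count)
}
```

Because the function is generic over `BufRead` and `Write`, the same code can be driven from a file, standard input, or an in-memory byte slice, which keeps the conversion fully local.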
How It Composes
blueFish consumes the following primals for storage, data integrity, orchestration, and validation:
| Primal | What It Provides |
|---|---|
| NestGate | Content-addressed storage for raw and processed datasets |
| BearDog | Cryptographic verification of data integrity |
| biomeOS | Pipeline orchestration via deploy graphs |
| 💧🔬 wetSpring | Validation of bioinformatics outputs against published methods |
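
To make the idea of content-addressed storage with integrity checks concrete, the sketch below shows a toy in-memory store whose keys are derived from the stored bytes. It is not NestGate's or BearDog's API: `ContentStore`, `put`, `get_verified`, and `blob_count` are hypothetical names. The FNV-1a hash is a non-cryptographic stand-in chosen so the example needs no external crates; a real store would presumably use a cryptographic digest such as SHA-256 or BLAKE3.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit hash. NOT cryptographic: a stand-in so the sketch
/// compiles with the standard library alone.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical in-memory content-addressed store.
#[derive(Default)]
pub struct ContentStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl ContentStore {
    /// Store bytes and return their address (hex digest of the content).
    /// Identical content always maps to the same single entry.
    pub fn put(&mut self, data: &[u8]) -> String {
        let key = format!("{:016x}", fnv1a64(data));
        self.blobs.entry(key.clone()).or_insert_with(|| data.to_vec());
        key
    }

    /// Fetch bytes by address, re-hashing them to detect corruption.
    pub fn get_verified(&self, key: &str) -> Option<&[u8]> {
        let data = self.blobs.get(key)?;
        if format!("{:016x}", fnv1a64(data)) == key {
            Some(data.as_slice())
        } else {
            None
        }
    }

    /// Number of distinct blobs currently stored.
    pub fn blob_count(&self) -> usize {
        self.blobs.len()
    }
}
```

Storing the same dataset twice yields the same address and a single stored copy, and any consumer can re-derive the address from the bytes it receives to check that nothing changed in between.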
Why It Matters
Many bioinformatics pipelines are chains of shell scripts: fragile, hard to reproduce, and tied to specific cluster configurations. blueFish replaces them with typed Rust pipelines that compose via JSON-RPC, run identically on a laptop and a cluster, and produce cryptographically signed outputs.
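
One way to picture what "typed" buys over a shell script chain is the sketch below, in which each stage declares its input and output types and two stages can only be chained if those types line up. This is an illustration under assumptions, not blueFish's actual design: the `Stage` trait, the `Chain` combinator, and the `ParseFasta` and `GcContent` stages are all hypothetical, and the JSON-RPC transport and output signing are left out.

```rust
/// A pipeline stage with statically known input and output types.
pub trait Stage {
    type Input;
    type Output;
    fn run(&self, input: Self::Input) -> Result<Self::Output, String>;
}

/// Two stages joined end to end. This only compiles when the first
/// stage's output type equals the second stage's input type.
pub struct Chain<A, B>(pub A, pub B);

impl<A, B> Stage for Chain<A, B>
where
    A: Stage,
    B: Stage<Input = A::Output>,
{
    type Input = A::Input;
    type Output = B::Output;
    fn run(&self, input: Self::Input) -> Result<Self::Output, String> {
        self.1.run(self.0.run(input)?)
    }
}

/// Hypothetical stage: parse FASTA text into (id, sequence) records.
pub struct ParseFasta;

impl Stage for ParseFasta {
    type Input = String;
    type Output = Vec<(String, String)>;
    fn run(&self, input: String) -> Result<Self::Output, String> {
        let mut records: Vec<(String, String)> = Vec::new();
        for line in input.lines() {
            let line = line.trim();
            if line.is_empty() {
                continue;
            }
            if let Some(id) = line.strip_prefix('>') {
                records.push((id.to_string(), String::new()));
            } else {
                match records.last_mut() {
                    // Wrapped sequence lines are appended to the current record.
                    Some((_, seq)) => seq.push_str(line),
                    None => return Err("sequence data before first header".to_string()),
                }
            }
        }
        Ok(records)
    }
}

/// Hypothetical stage: compute the GC fraction of each record.
pub struct GcContent;

impl Stage for GcContent {
    type Input = Vec<(String, String)>;
    type Output = Vec<(String, f64)>;
    fn run(&self, input: Self::Input) -> Result<Self::Output, String> {
        input
            .into_iter()
            .map(|(id, seq)| {
                if seq.is_empty() {
                    return Err(format!("record {} has no sequence", id));
                }
                let gc = seq
                    .bytes()
                    .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
                    .count();
                Ok((id, gc as f64 / seq.len() as f64))
            })
            .collect()
    }
}
```

With these definitions, `Chain(ParseFasta, GcContent)` is itself a `Stage` from `String` to `Vec<(String, f64)>`, while `Chain(GcContent, ParseFasta)` is rejected by the compiler. A mismatch that a shell pipeline would only reveal at run time, possibly hours into a job, surfaces here before anything executes.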
Together, blueFish (data pipeline), helixVision (structure prediction), and wetSpring (microbiology validation) form a sovereign structural genomics stack that runs on consumer hardware.
See also: wetSpring for microbiology validation, Deployment Model for the BYOB workflow, Ecosystem Inventory for the full repository map.