THE PLATFORM
The foundation underneath the science.
DataJoint is a computational database for life sciences R&D, purpose-built to codify experiments, pipelines, and results as first-class scientific data. The platforms you already run get cleaner inputs. Your science gets stronger outcomes.
THE CORE INNOVATION
The Computational Database.
At the heart of the DataJoint platform is a new kind of database: one that doesn't just store your data, but computes with it. It's the architectural foundation that makes every other capability possible.
TRADITIONAL DATABASE
| sample | value | date | run |
|---|---|---|---|
| Sample_001 | 0.342 | 2024-03-12 | Run_A |
| Sample_002 | 0.186 | 2024-03-12 | Run_A |
| Sample_003 | 0.987 | 2024-03-13 | Run_B |
| Sample_004 | 0.221 | 2024-03-14 | Run_B |
| Sample_005 | 0.103 | 2024-03-14 | Run_C |
Stores values. Each cell is just data: a static record of what was there. Updating one value doesn’t update anything else.
COMPUTATIONAL DATABASE
| sample | value | code | run |
|---|---|---|---|
| Sample_001 | 0.342 | =COMPUTE(raw_001) | Run_A |
| Sample_002 | =NORMALIZE(raw_002) | =COMPUTE(raw_002) | Run_A |
| Sample_003 | 0.987 | =VALIDATE(raw_003) | Run_B |
| Sample_004 | =NORMALIZE(raw_004) | =COMPUTE(raw_004) | Run_B |
| Sample_005 | 0.103 | =COMPUTE(raw_005) | Run_C |
Stores computations. Each cell can be data, code, or both. Change an input and everything downstream recomputes automatically, with full lineage preserved.
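To make the recompute-on-change idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not DataJoint's API: the `ToySheet` class, its methods, and the cell names are hypothetical, and the simple recursive propagation may recompute a cell more than once when dependencies form a diamond.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Cell:
    """A cell holds either a stored value or a function of other cells."""
    name: str
    value: Optional[float] = None
    func: Optional[Callable[..., float]] = None
    inputs: List[str] = field(default_factory=list)


class ToySheet:
    """Hypothetical toy model: stored cells plus derived cells that follow them."""

    def __init__(self) -> None:
        self.cells: Dict[str, Cell] = {}

    def set_value(self, name: str, value: float) -> None:
        # Store a raw value, then refresh every cell that depends on it.
        self.cells[name] = Cell(name, value=value)
        self._recompute_dependents(name)

    def define(self, name: str, func: Callable[..., float], inputs: List[str]) -> None:
        # Register a derived cell and compute its initial value.
        self.cells[name] = Cell(name, func=func, inputs=list(inputs))
        self._evaluate(name)

    def _evaluate(self, name: str) -> None:
        cell = self.cells[name]
        args = [self.cells[i].value for i in cell.inputs]
        cell.value = cell.func(*args)

    def _recompute_dependents(self, changed: str) -> None:
        # Walk downstream: any cell that reads `changed` is re-evaluated,
        # and its own dependents are refreshed in turn.
        for cell in self.cells.values():
            if changed in cell.inputs:
                self._evaluate(cell.name)
                self._recompute_dependents(cell.name)

    def lineage(self, name: str) -> List[str]:
        # List every upstream cell, furthest ancestor first.
        upstream: List[str] = []
        for parent in self.cells[name].inputs:
            upstream.extend(self.lineage(parent))
            upstream.append(parent)
        return upstream


sheet = ToySheet()
sheet.set_value("raw_002", 10.0)
sheet.define("normalized_002", lambda raw: raw / 100.0, ["raw_002"])
sheet.define("score_002", lambda norm: norm * 2, ["normalized_002"])

# Changing the raw input refreshes both derived cells.
sheet.set_value("raw_002", 20.0)
score_after_change = sheet.cells["score_002"].value
score_lineage = sheet.lineage("score_002")
```

After the second `set_value`, `score_002` reflects the new input without being touched directly, and `lineage` reports which cells it was derived from.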
WHY THIS MATTERS
When experiments, pipelines, and results are modeled as first-class data, the foundation behaves differently:
- Rerun analyses six months later. Same result.
- Fork colleagues' pipelines with one command.
- Trace outputs to inputs, code, and environment.
That’s not records management. That’s computational reproducibility.
BUILT ON SOUND PRINCIPLES
Four systems. One foundation.
DataJoint isn’t a single tool stitched onto your stack. It’s a unified architecture built on four integrated systems that work as one, managing data, code, computation, and provenance together.
Relational Database Management
Structures and manages your scientific data, enforcing critical relationships to ensure referential integrity across subjects, sessions, instruments, parameters, and results.
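As a rough illustration of what enforcing referential integrity means in practice, the sketch below uses Python's built-in `sqlite3` module rather than DataJoint itself. The `subject` and `session` tables and their columns are assumptions made for the example.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: every session must belong to a known subject.
conn.execute("CREATE TABLE subject (subject_id TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE session (
        subject_id TEXT NOT NULL REFERENCES subject(subject_id),
        session_id INTEGER NOT NULL,
        PRIMARY KEY (subject_id, session_id)
    )
    """
)

conn.execute("INSERT INTO subject VALUES ('mouse_01')")
conn.execute("INSERT INTO session VALUES ('mouse_01', 1)")

# A session that points at a subject that does not exist is rejected.
try:
    conn.execute("INSERT INTO session VALUES ('mouse_99', 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A subject that still has sessions attached cannot be deleted.
try:
    conn.execute("DELETE FROM subject WHERE subject_id = 'mouse_01'")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

session_count = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]
```

The database itself refuses orphaned records, so downstream results can never point at a subject or session that is missing.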
Object Storage Integration
Manages data files (raw images, recordings, sequence files) under unified control to maintain organization and context. Files stay where they live; metadata stays connected.
Source Code Management
Captures the pipeline data models, dependencies, and computational steps. Includes version control and CI/CD automation, so every result is reproducible from code.
Workflow Orchestration
Monitors the pipeline, executes compute steps just-in-time on appropriate infrastructure, and propagates changes to preserve internal consistency end to end.
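One way to picture just-in-time execution is a loop that computes only the results that are still missing. The sketch below is a simplified stand-in under that assumption, not DataJoint's orchestration engine; `run_pending` and the sample data are hypothetical.

```python
from typing import Callable, Dict, List


def run_pending(
    upstream: Dict[str, float],
    computed: Dict[str, float],
    step: Callable[[float], float],
) -> List[str]:
    """Run `step` only for upstream keys with no result yet; return the keys processed."""
    pending = [key for key in upstream if key not in computed]
    for key in pending:
        computed[key] = step(upstream[key])
    return pending


raw = {"Sample_001": 3.0, "Sample_002": 5.0}
results: Dict[str, float] = {}

# First pass: nothing has been computed, so both samples are processed.
first_pass = run_pending(raw, results, lambda x: x * x)

# New data arrives; the next pass touches only the new sample.
raw["Sample_003"] = 2.0
second_pass = run_pending(raw, results, lambda x: x * x)
```

Existing results are left alone on the second pass, so compute is spent only where inputs have no corresponding output.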
Gold Standard Science, By Policy And Design
Reproducibility and transparency are the first two pillars of the new US policy on the conduct and management of scientific activities.
Executive Order, Restoring Gold Standard Science § 3(a)(i)-(ii), May 23, 2025
Read the Executive Order →
HOW IT WORKS
From raw experimental output to business value, in four steps.
Every step preserves code, data, and compute context, turning fragmented experimental output into trusted scientific assets your AI, analytics, and governance can rely on.
Capture
Scientific context preserved, not lost.
Subjects, sessions, devices, parameters, and results are modeled as first-class data. Scientific context travels with the data into every downstream system.
DATA IN CONTEXT
Codify
Same inputs and code, same result every time.
Pipelines, code versions, and compute environments are captured together. Experiments can be rerun, forked, and safely reused across programs, sites, and CRO partners.
DETERMINISTIC WORKFLOWS
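A simple way to think about capturing inputs, code version, and environment together is a single fingerprint over all three. The sketch below uses only the Python standard library; the `provenance_key` function, the field names, and the example values are assumptions for illustration, not how DataJoint records provenance.

```python
import hashlib
import json


def provenance_key(inputs: dict, code_version: str, environment: dict) -> str:
    """Return a SHA-256 fingerprint of the inputs, code version, and environment."""
    record = {
        "inputs": inputs,
        "code_version": code_version,
        "environment": environment,
    }
    # Sorted keys and fixed separators give the same bytes for the same content.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


env = {"python": "3.11"}  # hypothetical environment description

# Same content in a different key order yields the same fingerprint.
key_a = provenance_key({"raw": "raw_001", "threshold": 0.5}, "abc1234", env)
key_b = provenance_key({"threshold": 0.5, "raw": "raw_001"}, "abc1234", env)

# A different code version yields a different fingerprint.
key_c = provenance_key({"raw": "raw_001", "threshold": 0.5}, "def5678", env)
```

If two runs share a fingerprint, they used the same inputs, code, and environment; if any of the three changes, the fingerprint changes with it.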
Execute
Work that compounds across programs and sites.
Curated, governed outputs become durable, AI-ready assets. This is the precondition for AI investments, from BI agents to Mosaic-style models, to scale on scientifically coherent data with full provenance.
REUSABLE, AI-READY ASSETS
Activate
Faster decisions on a defensible foundation.
Audit-ready lineage stands up to internal review, regulatory submission, and AI validation. That means more time advancing the science the board, the regulator, and the AI thesis depend on.
DEFENSIBLE, TRUSTED SCIENCE
Every step is deterministic. Every result is reproducible. Every asset compounds.
INDEPENDENT VALIDATION
Integration of data, software, and computational resources in one environment will shorten the time to make scientific discoveries.
Frederick National Laboratory for Cancer Research
BUILT FOR ENTERPRISE
Deploys in your environment. Defensible by design.
DataJoint runs where your science runs: your VPC, your cloud, your data residency requirements. Lineage, provenance, and governance are structural, not bolted on.
DEPLOYMENT
- Cloud-native architecture on AWS, Azure, and GCP
- Deploys in your VPC or hybrid environment
- On-premises options for restricted environments
- Multi-region data residency
- Customer-managed encryption keys (BYOK)
SECURITY & ACCESS
- Role-based access control with fine-grained permissions
- SSO integration (Okta, Azure AD, Google Workspace)
- Audit logging at every access and computation
- Encryption in transit and at rest
- Infrastructure provisioning and governance
COMPLIANCE & GOVERNANCE
- 21 CFR Part 11 (electronic records, signatures)
- SOC 2 Type II
- HIPAA-aligned architecture
- GDPR compliance
- GxP-ready deployment patterns
- ALCOA+ principles built in
THE INTERFACE
Built for the way scientists actually work.
DataJoint isn’t a foreign environment scientists have to learn. It lives inside the tools they already use: notebooks, visual exploration, dashboards, and code, all powered by the same computational foundation underneath.
Pipeline Explorer
See your entire experiment laid out like a navigable map. Zoom in or out to see what matters most in the moment, or pick just one session to focus deeply.
Custom Dashboards
View experiment progress, animals, sessions, data summaries, processed results, pipeline status, quality metrics, charts, all in real time.
Jupyter Notebooks
Unlock new insights with embedded notebooks and powerful querying. Use any available compute instance, including GPU.
Multi-User Collaboration
Inherently multi-user with robust security and the ability to invite guest users. Securely share a slice of your data with collaborating labs.
One-Click Publishing
Easily export data to standard formats and integrate with repositories like NIH DANDI. Compliance-ready outputs by default.
AI Agents
DataJoint's Agentic AI Control Layer brings trusted scientific automation directly into your pipelines. Reproducible AI, traceable to its training data and code.
FREQUENTLY ASKED
Questions every buyer asks.
The five questions we hear most often during evaluations. More answers on the full FAQ.
ELNs and LIMS capture what was done: they document experiments, samples, and inventory. Scientific data clouds like TetraScience harmonize the data plumbing across instruments. DataJoint captures the computation that produced the result: the experiments, pipelines, code, and compute environment, together as first-class data. That’s a different layer entirely. Most teams end up running DataJoint alongside an ELN, not instead of one.
DataJoint runs cloud-native on AWS, Azure, or GCP, and can deploy in your VPC, hybrid environment, or fully on-premises for regulated or air-gapped environments. Data residency is configurable by region. Customer-managed encryption keys are supported.
DataJoint sits upstream of these platforms. We publish governed, provenance-rich scientific data into Delta Lake (Databricks), Unity Catalog, Snowflake native tables, or Palantir Foundry, with full lineage intact. We’re complementary by design: your existing platform investments become more valuable because they finally get the inputs they were built for.
Workflow engines and notebooks are tools for executing computation. DataJoint codifies the science itself: the experiments, subjects, parameters, and results that the computation operates on.
Airflow, Prefect, and Dagster orchestrate when pipelines run. Nextflow and Snakemake describe how bioinformatics steps connect. Jupyter notebooks let scientists write and execute analysis interactively. All of these are useful, and DataJoint complements them. What none of them does is model the scientific work itself (subjects, sessions, experimental conditions, and computational provenance) as first-class data that persists, composes, and reproduces across teams and time.
DataJoint sits underneath these tools. Your team can keep using Airflow for orchestration, Jupyter for interactive analysis, and Nextflow for bioinformatics pipelines. DataJoint codifies what the science actually is, so the work compounds instead of disappearing when the notebook closes or the team turns over.
Yes. The platform is built for 21 CFR Part 11, supports GxP-ready deployment patterns, and follows ALCOA+ data integrity principles. It also covers SOC 2 Type II, HIPAA-aligned architecture, and GDPR compliance. Provenance and lineage are structural: every output is traceable to the inputs, code, and compute that produced it, ready for internal review, regulatory submission, or AI validation.
READY TO BUILD ON A FOUNDATION THAT HOLDS UP?