Output schema

canpy builds one analysis in memory and can serialize it two ways. The default is a single PyApplication artifact, analysis.json (or msgpack). With --emit neo4j, the same in-memory PyApplication is projected into a labeled property graph instead of a JSON file. This page is the schema reference for both: the JSON model and the Neo4j property graph.

flowchart TB
    SRC["Python project (--input)"] --> IR["in-memory PyApplication"]
    IR -->|"--emit json (default)"| JSON["analysis.json / msgpack"]
    IR -->|"--emit neo4j"| PG["labeled property graph"]
    IR -->|"--emit schema"| CONTRACT["schema.json (versioned contract)"]

    JSON --> APP[PyApplication]
    APP --> ST["symbol_table: {path: PyModule}"]
    APP --> CG["call_graph: [PyCallEdge]"]
    APP --> EP["entrypoints: {framework: [PyEntrypoint]}"]
    ST --> MOD[PyModule]
    MOD --> CLS[PyClass]
    MOD --> FN[PyCallable]
    CLS --> M[PyCallable]
    CLS --> ATTR[PyClassAttribute]
    FN --> CALL[PyCallsite]
    FN --> PARAM[PyCallableParameter]
    FN --> DEC[PyDecorator]

    PG -->|"no --neo4j-uri"| SNAP["graph.cypher snapshot"]
    PG -->|"--neo4j-uri (Bolt)"| LIVE["live Neo4j, incremental"]
    SNAP --> NODES[":PyApplication / :PyModule / :PyClass / :PyCallable …"]
    LIVE --> NODES
    NODES -->|"PY_HAS_MODULE / PY_DECLARES / PY_CALLS …"| NODES

The default artifact is a single PyApplication. Every model below is a Pydantic model defined in codeanalyzer.schema.py_schema; the JSON and msgpack outputs are serializations of the same schema. Line/column fields default to -1 when unknown.

PyApplication is the root object.

| Field | Type | Description |
| --- | --- | --- |
| symbol_table | Dict[str, PyModule] | File path → module model. The whole-project inventory. |
| call_graph | List[PyCallEdge] | Identity-keyed call edges. |
| entrypoints | Dict[str, List[PyEntrypoint]] | Framework name → detected roots. |
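As a rough illustration of the root layout, the sketch below walks a PyApplication-shaped dictionary with the standard library only. The sample data and the summarize helper are invented for illustration; a real analysis.json would be read with json.load and carries many more fields per model.

```python
import json

# Hypothetical, heavily trimmed stand-in for the contents of analysis.json.
SAMPLE = json.loads("""
{
  "symbol_table": {
    "/repo/app/views.py": {
      "module_name": "app.views",
      "classes": {"Home": {"signature": "app.views.Home", "methods": {}}},
      "functions": {"index": {"signature": "app.views.index"}}
    }
  },
  "call_graph": [
    {"source": "app.views.index", "target": "app.views.Home",
     "type": "CALL_DEP", "weight": 1, "provenance": ["jedi"], "tags": {}}
  ],
  "entrypoints": {}
}
""")


def summarize(app: dict) -> dict:
    """Count modules, top-level classes/functions, and call edges."""
    modules = app.get("symbol_table", {})
    return {
        "modules": len(modules),
        "classes": sum(len(m.get("classes", {})) for m in modules.values()),
        "functions": sum(len(m.get("functions", {})) for m in modules.values()),
        "call_edges": len(app.get("call_graph", [])),
    }


print(summarize(SAMPLE))
```

Reading the artifact as plain dictionaries keeps a consumer independent of the Pydantic models; the typed route is to validate the same JSON against the models in codeanalyzer.schema.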

PyModule: one per source file.

| Field | Type | Description |
| --- | --- | --- |
| file_path | str | Absolute path to the file. |
| module_name | str | Dotted module name. |
| imports | List[PyImport] | Import statements. |
| comments | List[PyComment] | Comments and docstrings. |
| classes | Dict[str, PyClass] | Top-level classes by name. |
| functions | Dict[str, PyCallable] | Top-level functions by name. |
| variables | List[PyVariableDeclaration] | Module-level variables. |
| content_hash, last_modified, file_size | str / float / int | Cache-invalidation metadata. |

PyClass: a class definition.

| Field | Type | Description |
| --- | --- | --- |
| name | str | Class short name. |
| signature | str | Fully-qualified identity (e.g. module.ClassName). |
| base_classes | List[str] | Names of base classes. |
| decorators | List[PyDecorator] | Class decorators. |
| methods | Dict[str, PyCallable] | Methods by name. |
| attributes | Dict[str, PyClassAttribute] | Class attributes by name. |
| inner_classes | Dict[str, PyClass] | Nested classes. |
| comments, code | List[PyComment] / str | Docstrings/comments and source. |
| start_line, end_line | int | Source span. |

PyCallable: a function or method. The richest model in the artifact.

| Field | Type | Description |
| --- | --- | --- |
| name | str | Callable short name. |
| path | str | File the callable is defined in. |
| signature | str | Fully-qualified identity (e.g. module.Class.method). The call-graph node key. |
| parameters | List[PyCallableParameter] | Declared parameters. |
| return_type | Optional[str] | Resolved return type, if known. |
| decorators | List[PyDecorator] | Applied decorators. |
| code | Optional[str] | The source body. |
| call_sites | List[PyCallsite] | Calls made from this callable. |
| accessed_symbols | List[PySymbol] | Symbols read/written in the body. |
| local_variables | List[PyVariableDeclaration] | Locals. |
| inner_callables, inner_classes | Dict[str, ...] | Nested definitions. |
| cyclomatic_complexity | int | Computed complexity. |
| is_entrypoint | bool | Whether a finder marked this an entrypoint. |
| entrypoint_framework | Optional[str] | The framework, if so. |
| start_line, end_line, code_start_line | int | Source spans. |
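Because callables nest (methods inside classes, inner callables and inner classes inside callables), a consumer usually needs a recursive walk to see all of them. The sketch below is one way to do that over the JSON dictionaries; the sample module, its signatures, and the hotspots helper are invented for illustration.

```python
from typing import Dict, Iterator, List


def iter_callables(container: Dict) -> Iterator[Dict]:
    """Yield every callable reachable from a module, class, or callable dict."""
    # Modules expose "functions", classes expose "methods",
    # and callables expose "inner_callables".
    for key in ("functions", "methods", "inner_callables"):
        for fn in container.get(key, {}).values():
            yield fn
            yield from iter_callables(fn)
    # Modules expose "classes"; classes and callables expose "inner_classes".
    for key in ("classes", "inner_classes"):
        for cls in container.get(key, {}).values():
            yield from iter_callables(cls)


# Hypothetical PyModule-shaped sample, trimmed to the fields used here.
MODULE = {
    "functions": {
        "load": {
            "signature": "pkg.io.load",
            "cyclomatic_complexity": 7,
            "inner_callables": {
                "_retry": {"signature": "pkg.io.load._retry",
                           "cyclomatic_complexity": 2}
            },
        }
    },
    "classes": {
        "Reader": {
            "signature": "pkg.io.Reader",
            "methods": {
                "read": {"signature": "pkg.io.Reader.read",
                         "cyclomatic_complexity": 12}
            },
        }
    },
}


def hotspots(module: Dict, threshold: int) -> List[str]:
    """Signatures at or above the complexity threshold, most complex first."""
    found = [
        (fn["cyclomatic_complexity"], fn["signature"])
        for fn in iter_callables(module)
        if fn.get("cyclomatic_complexity", 0) >= threshold
    ]
    return [signature for _, signature in sorted(found, reverse=True)]


print(hotspots(MODULE, 5))
```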

PyCallsite: a single call made from within a callable. It holds the rich per-call metadata behind a graph edge.

| Field | Type | Description |
| --- | --- | --- |
| method_name | str | The invoked name as written. |
| receiver_expr, receiver_type | Optional[str] | The receiver expression and its resolved type. |
| argument_types | List[str] | Resolved argument types. |
| return_type | Optional[str] | Resolved return type. |
| callee_signature | Optional[str] | The resolved target’s signature (CodeQL may backfill this). |
| is_constructor_call | bool | Whether the call constructs an instance. |
| start_line, end_line, … | int | Source location. |
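Since callee_signature is optional, a consumer can use it to separate resolved calls from unresolved ones. The following is a minimal sketch under that assumption; the sample callable, its signatures, and the split_call_sites helper are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical PyCallable-shaped sample with three call sites.
CALLABLE = {
    "signature": "app.jobs.run",
    "call_sites": [
        {"method_name": "connect", "receiver_expr": "db",
         "callee_signature": "app.db.Pool.connect", "is_constructor_call": False},
        {"method_name": "Job", "receiver_expr": None,
         "callee_signature": "app.jobs.Job", "is_constructor_call": True},
        {"method_name": "dispatch", "receiver_expr": "handler",
         "callee_signature": None, "is_constructor_call": False},
    ],
}


def split_call_sites(fn: Dict) -> Tuple[List[str], List[str]]:
    """Return (resolved, unresolved) call expressions as written in source."""
    resolved: List[str] = []
    unresolved: List[str] = []
    for site in fn.get("call_sites", []):
        name = site["method_name"]
        if site.get("receiver_expr"):
            name = f'{site["receiver_expr"]}.{name}'
        bucket = resolved if site.get("callee_signature") else unresolved
        bucket.append(name)
    return resolved, unresolved


print(split_call_sites(CALLABLE))
```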

PyCallEdge: an identity-only call-graph edge.

| Field | Type | Description |
| --- | --- | --- |
| source | str | Caller’s PyCallable.signature. |
| target | str | Callee’s PyCallable.signature. |
| type | "CALL_DEP" | Edge kind. |
| weight | int | Edge weight (default 1). |
| provenance | List[str] | Which engine(s) produced it: "jedi", "codeql", or an extension token. Open vocabulary. |
| tags | Dict[str, str] | Free-form, extension-namespaced metadata (e.g. an ORM-dispatch trigger predicate). Never interpreted by core. |
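Because edges are keyed purely by signature, the call_graph list can be folded into an adjacency map without touching the symbol table. The sketch below does that and optionally filters by provenance; the sample edges and both helper names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Hypothetical PyCallEdge-shaped sample data.
EDGES = [
    {"source": "app.main", "target": "app.db.save", "type": "CALL_DEP",
     "weight": 2, "provenance": ["jedi"], "tags": {}},
    {"source": "app.main", "target": "requests.get", "type": "CALL_DEP",
     "weight": 1, "provenance": ["jedi", "codeql"], "tags": {}},
    {"source": "app.db.save", "target": "app.db.connect", "type": "CALL_DEP",
     "weight": 1, "provenance": ["codeql"], "tags": {}},
]


def adjacency(edges: Iterable[Dict], engine: Optional[str] = None) -> Dict[str, Set[str]]:
    """Map each caller signature to its callee signatures.

    When `engine` is given, keep only edges whose provenance lists it.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for edge in edges:
        if engine is None or engine in edge.get("provenance", []):
            graph[edge["source"]].add(edge["target"])
    return dict(graph)


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Every signature transitively called from `start` (iterative DFS)."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


print(sorted(reachable(adjacency(EDGES), "app.main")))
print(sorted(reachable(adjacency(EDGES, engine="jedi"), "app.main")))
```

Filtering on provenance is a convenient way to compare what each engine contributed, since the vocabulary is open and an edge may list several producers.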

PyEntrypoint: a framework-dispatched root, referencing a callable by signature.

| Field | Type | Description |
| --- | --- | --- |
| signature | str | The PyCallable.signature this entrypoint refers to. |
| framework | str | The dispatching framework. |
| detection_source | str | How it was detected: decorator, base_class, url_resolver, router_mount, blueprint, lambda_template, typer_subapp, click_add_command, argparse_dispatch, convention, or extension. Open vocabulary. |
| route_path, http_methods | Optional[str] / List[str] | For HTTP routes. |
| celery_task_name, cli_command_name, lambda_handler_key, grpc_service_name | Optional[str] | Framework-specific identifiers, when applicable. |
| source_file | Optional[str] | File declaring the binding (urls.py, template.yaml, …). |
| tags | Dict[str, str] | Free-form, namespaced metadata for extensions. |

Supporting models:

  • PyImport: module, name, alias, and source span.
  • PyComment: content, is_docstring, and source span.
  • PyDecorator: name, resolved qualified_name, and raw positional_arguments / keyword_arguments (source-text fragments for finders to parse).
  • PyCallableParameter: name, type, default_value, source span.
  • PyClassAttribute: name, type, comments, source span.
  • PyVariableDeclaration: name, type, initializer, value, scope.
  • PySymbol: a referenced symbol with name, scope, kind, resolved type, qualified_name, is_builtin.
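Returning to the entrypoints map: since each PyEntrypoint carries route_path and http_methods for HTTP routes, the map can be flattened into a route listing. This is a minimal sketch over the JSON dictionaries; the framework keys, sample entries, and the http_routes helper are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical entrypoints map: framework name -> PyEntrypoint-shaped dicts.
ENTRYPOINTS = {
    "flask": [
        {"signature": "app.views.index", "framework": "flask",
         "detection_source": "decorator",
         "route_path": "/", "http_methods": ["GET"]},
        {"signature": "app.views.create", "framework": "flask",
         "detection_source": "decorator",
         "route_path": "/items", "http_methods": ["POST", "PUT"]},
    ],
    "celery": [
        {"signature": "app.tasks.reindex", "framework": "celery",
         "detection_source": "decorator",
         "route_path": None, "http_methods": [],
         "celery_task_name": "reindex"},
    ],
}


def http_routes(entrypoints: Dict[str, List[Dict]]) -> List[Tuple[str, str, str]]:
    """Return sorted (method, path, signature) rows for entrypoints with a route."""
    rows = []
    for roots in entrypoints.values():
        for ep in roots:
            if ep.get("route_path") is None:
                continue  # not an HTTP route (task, CLI command, ...)
            # "*" marks a route whose methods were not recorded.
            for method in ep.get("http_methods") or ["*"]:
                rows.append((method, ep["route_path"], ep["signature"]))
    return sorted(rows)


for row in http_routes(ENTRYPOINTS):
    print(*row)
```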

Every model is decorated for MessagePack support, exposing to_msgpack_bytes() / from_msgpack_bytes() (gzip-compressed) and to_msgpack_dict() / from_msgpack_dict(). PyApplication additionally exposes get_compression_ratio(). For JSON, use the Pydantic v1/v2 compatibility helpers model_dump_json / model_validate_json from codeanalyzer.schema. Models are built with a fluent builder: PyApplication.builder().symbol_table(...).call_graph(...).build().

--emit neo4j projects the same in-memory PyApplication into a labeled property graph instead of a JSON file. Where analysis.json is one self-contained blob that you load whole into memory, the graph is a persistent, queryable system of record. Many applications can live in one database, each anchored at its own :PyApplication node, so whole-monorepo or cross-service questions become a Cypher traversal rather than a pass over giant JSON files. See the CLI reference for how the two writers (the graph.cypher snapshot and the incremental Bolt push) work.

Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS), so the Java, TypeScript, and Python analyzers can share one database without label or relationship-type collisions. Declarations — classes, callables, and external symbols — are keyed by their signature and merged under a shared :PySymbol label, which is what makes the identity invariant cheap to enforce and cross-module references stable. The labels, relationships, and properties below are generated from codeanalyzer/neo4j/catalog.py and published verbatim as the machine-readable schema contract.

The key is the property the node is MERGEd on. Declaration nodes (:PyClass, :PyCallable, :PyExternal) carry the extra :PySymbol label and are merged on signature.

| Label | Merge label | Key | Notable properties |
| --- | --- | --- | --- |
| :PyApplication | :PyApplication | name | schema_version. The application anchor, named by --app-name. |
| :PyModule | :PyModule | file_key | module_name, content_hash, last_modified, file_size. |
| :PyClass | :PySymbol | signature | name, code, base_classes, docstring, start_line, end_line. |
| :PyCallable | :PySymbol | signature | name, path, return_type, cyclomatic_complexity, code, code_start_line, start_line. |
| :PyExternal | :PySymbol | signature | name, module. A ghost node for a third-party / unresolved target, mirroring the JSON call graph’s ghost-node behavior. |
| :PyPackage | :PyPackage | name | An imported package, shared across modules and applications. |
| :PyDecorator | :PyDecorator | name | A decorator, shared across callables and applications. |
| :PyCallSite | :PyCallSite | id | method_name, receiver_expr, receiver_type, argument_types, return_type, callee_signature, is_constructor_call. |
| :PyAttribute | :PyAttribute | id | name, type, docstring, start_line, end_line. |
| :PyVariable | :PyVariable | id | name, type, initializer, scope, start_line, end_line. |

| Relationship | Endpoints | Notes |
| --- | --- | --- |
| PY_HAS_MODULE | (:PyApplication)-[]->(:PyModule) | The application anchor contains each analyzed source module. |
| PY_DECLARES | (:PyModule\|PyClass\|PyCallable)-[]->(:PyClass\|PyCallable) | Declaration containment, recursive: modules declare top-level classes/functions; classes and callables declare nested ones. |
| PY_HAS_METHOD | (:PyClass)-[]->(:PyCallable) | A class owns a method callable. |
| PY_HAS_ATTRIBUTE | (:PyClass)-[]->(:PyAttribute) | A class owns an attribute. |
| PY_DECLARES_VAR | (:PyModule\|PyCallable)-[]->(:PyVariable) | A module- or function-scoped variable declaration. |
| PY_HAS_CALLSITE | (:PyCallable)-[]->(:PyCallSite) | A callable contains the call sites it makes. |
| PY_RESOLVES_TO | (:PyCallSite)-[]->(:PyCallable\|PyExternal) | A call site resolves to a concrete callable or an external (ghost) symbol. |
| PY_CALLS | (:PyCallable\|PyExternal)-[]->(:PyCallable\|PyExternal) | The call-graph edge. Properties: weight (integer), provenance (string[], e.g. jedi / codeql / an extension token). |
| PY_EXTENDS | (:PyClass)-[]->(:PyClass) | Class inheritance (self-referential). |
| PY_IMPORTS | (:PyModule)-[]->(:PyPackage) | A module imports a package. Properties: imported_names (string[]), aliases (string[]). |
| PY_DECORATED_BY | (:PyCallable)-[]->(:PyDecorator) | A callable is decorated by a decorator. |

The PY_CALLS edge is the property-graph form of PyCallEdge: the same weight and provenance carry over, and the same optional CodeQL augmentation backfills resolved call edges. PY_RESOLVES_TO preserves the finer per-call-site resolution that PyCallsite records in the JSON model.
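To make the PY_CALLS shape concrete, here is a hedged sketch of a "who calls this signature" lookup issued from Python. The query uses only the labels, relationship types, and properties listed above, but the query itself, the callers_of and run helpers, and the choice to walk both PY_DECLARES and PY_HAS_METHOD are assumptions, not something canpy ships. The run function assumes the official neo4j Python driver (5.x) and a reachable database; it is imported lazily so the rest of the sketch works without it.

```python
from typing import Any, Dict, List, Tuple

# Parameterized Cypher: callers of one signature, scoped to one application.
CALLERS_QUERY = """
MATCH (app:PyApplication {name: $app})-[:PY_HAS_MODULE]->(:PyModule)
      -[:PY_DECLARES|PY_HAS_METHOD*1..]->(caller:PyCallable)
MATCH (caller)-[call:PY_CALLS]->(callee:PySymbol {signature: $signature})
RETURN DISTINCT caller.signature AS caller,
       call.weight AS weight,
       call.provenance AS provenance
ORDER BY weight DESC
"""


def callers_of(app_name: str, signature: str) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and its parameters, ready for any Bolt client."""
    return CALLERS_QUERY.strip(), {"app": app_name, "signature": signature}


def run(uri: str, auth: Tuple[str, str], app_name: str, signature: str) -> List[Dict[str, Any]]:
    """Execute the lookup; requires the optional neo4j driver and a live database."""
    from neo4j import GraphDatabase  # lazy import: only needed when actually querying

    query, params = callers_of(app_name, signature)
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _summary, _keys = driver.execute_query(query, parameters_=params)
        return [record.data() for record in records]


query, params = callers_of("my-service", "app.db.save")
print(query)
print(params)
```

Passing the application name and signature as parameters rather than formatting them into the string lets the database reuse the query plan and avoids quoting problems in signatures.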

graph LR
    APP[":PyApplication"] -->|PY_HAS_MODULE| MOD[":PyModule"]
    MOD -->|PY_DECLARES| CLS[":PyClass"]
    MOD -->|PY_DECLARES| FN[":PyCallable"]
    MOD -->|PY_IMPORTS| PKG[":PyPackage"]
    MOD -->|PY_DECLARES_VAR| VAR[":PyVariable"]
    CLS -->|PY_HAS_METHOD| FN
    CLS -->|PY_HAS_ATTRIBUTE| ATTR[":PyAttribute"]
    CLS -->|PY_EXTENDS| CLS
    FN -->|PY_DECORATED_BY| DEC[":PyDecorator"]
    FN -->|PY_HAS_CALLSITE| CS[":PyCallSite"]
    FN -->|PY_DECLARES_VAR| VAR
    CS -->|PY_RESOLVES_TO| FN
    CS -->|PY_RESOLVES_TO| EXT[":PyExternal"]
    FN -->|PY_CALLS| FN
    FN -->|PY_CALLS| EXT

Both writers run the same DDL before any load (it is idempotent — every statement is IF NOT EXISTS) so each MERGE is an index seek rather than a label scan, and the identity invariant is enforced by the database itself.

// Uniqueness constraints
CREATE CONSTRAINT py_symbol_sig IF NOT EXISTS FOR (s:PySymbol) REQUIRE s.signature IS UNIQUE;
CREATE CONSTRAINT py_app_name IF NOT EXISTS FOR (a:PyApplication) REQUIRE a.name IS UNIQUE;
CREATE CONSTRAINT py_module_key IF NOT EXISTS FOR (m:PyModule) REQUIRE m.file_key IS UNIQUE;
CREATE CONSTRAINT py_package_name IF NOT EXISTS FOR (p:PyPackage) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT py_decorator_name IF NOT EXISTS FOR (d:PyDecorator) REQUIRE d.name IS UNIQUE;
CREATE CONSTRAINT py_callsite_id IF NOT EXISTS FOR (c:PyCallSite) REQUIRE c.id IS UNIQUE;
CREATE CONSTRAINT py_attribute_id IF NOT EXISTS FOR (a:PyAttribute) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT py_variable_id IF NOT EXISTS FOR (v:PyVariable) REQUIRE v.id IS UNIQUE;
// Lookup indexes
CREATE INDEX py_callable_name IF NOT EXISTS FOR (c:PyCallable) ON (c.name);
CREATE INDEX py_class_name IF NOT EXISTS FOR (c:PyClass) ON (c.name);
// Fulltext index for code search over callable bodies and docstrings
CREATE FULLTEXT INDEX py_code_fts IF NOT EXISTS FOR (c:PyCallable) ON EACH [c.code, c.docstring];

The py_code_fts fulltext index backs code search across everything loaded into the database — query it with db.index.fulltext.queryNodes, then filter to one application by walking back to its anchor:

CALL db.index.fulltext.queryNodes('py_code_fts', 'subprocess AND shell')
YIELD node, score
MATCH (app:PyApplication {name: 'my-service'})
-[:PY_HAS_MODULE]->(:PyModule)-[:PY_DECLARES*1..]->(node)
RETURN node.signature AS callable, score
ORDER BY score DESC
LIMIT 20;

Because every subgraph hangs off its :PyApplication anchor, any query can be scoped to one application by matching {name: '<app-name>'}, the same value passed as --app-name at emit time. That scoping is also what keeps a shared database multi-tenant: a push for one application only touches its own anchored subtree.

--emit schema serializes this catalog — node labels, relationship types, and their property types — to a version-stamped schema.json. It is a static catalog, so no project is required:

# Print the contract to stdout (no project needed)
canpy --emit schema
# Or write it to a directory
canpy --emit schema --output ./out # → ./out/schema.json

The contract carries SCHEMA_VERSION (currently 1.1.0), the same value stamped onto every graph’s :PyApplication node. It is checked in as schema.neo4j.json and shipped as a GitHub Release asset, so consumers can pin to a version and detect contract changes:

{
  "schema_version": "1.1.0",
  "generator": "codeanalyzer-python",
  "node_labels": [
    {
      "label": "PyApplication",
      "merge_label": "PyApplication",
      "key": "name",
      "properties": { "name": "string", "schema_version": "string" }
    }
  ]
}
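A consumer that pins to a contract version might check it along these lines. This is a sketch only: the compatibility rule (same major version, contract at least as new as the pin) is an assumed semantic-versioning-style policy that the contract itself does not state, and the helper names are hypothetical. The embedded document is the excerpt shown above rather than a full schema.json.

```python
import json
from typing import Dict, Tuple

# The excerpt from above; a real consumer would json.load the schema.json file.
CONTRACT = json.loads("""
{
  "schema_version": "1.1.0",
  "generator": "codeanalyzer-python",
  "node_labels": [
    {"label": "PyApplication", "merge_label": "PyApplication", "key": "name",
     "properties": {"name": "string", "schema_version": "string"}}
  ]
}
""")


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.1.0' into (1, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(contract: Dict, pinned: str) -> bool:
    """Assumed policy: same major version, and not older than the pin."""
    have = parse_version(contract["schema_version"])
    want = parse_version(pinned)
    return have[0] == want[0] and have >= want


def merge_keys(contract: Dict) -> Dict[str, str]:
    """Map each node label to the property it is merged on."""
    return {node["label"]: node["key"] for node in contract.get("node_labels", [])}


print(is_compatible(CONTRACT, "1.0.0"))
print(merge_keys(CONTRACT))
```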

The CLDK Python SDK has a read-only Neo4j backend that reconstructs these same typed models from the graph — no JDK, no native binary, and no project source on the consumer, only read-only Neo4j credentials. Pass a Neo4jConnectionConfig whose application_name matches the --app-name the graph was loaded with, and CLDK.python rebuilds the same PyClass / PyCallable objects and the same networkx call graph the in-process analyzer would produce:

from cldk import CLDK
from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

# The graph is populated out of band by `canpy --emit neo4j`; the SDK only reads it.
analysis = CLDK.python(
    backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        username="neo4j",
        password="neo4j",
        application_name="my-service",  # matches canpy --app-name
    ),
)

classes = analysis.get_classes()  # Dict[str, PyClass]
cg = analysis.get_call_graph()  # networkx.DiGraph keyed by callable signatures

The SDK’s neo4j driver is an optional extra (pip install cldk[neo4j]). See the Neo4j guide for the full read API, and the CLI reference for how producers and consumers split.