Core concepts
Every run produces one PyApplication — a typed model of a project with three top-level pieces: a symbol table, a call graph, and entrypoints. This page explains what each contains, how the pipeline builds them, the two cross-cutting ideas you’ll meet everywhere — provenance and the analysis cache — and how that same in-memory model projects into a Neo4j property graph when you ask for it.
```mermaid
flowchart TB
ST["symbol_table: Dict[str, PyModule]"]
CG["call_graph: List[PyCallEdge]"]
EP["entrypoints: Dict[str, List[PyEntrypoint]]"]
APP["PyApplication"] --> ST
APP --> CG
APP --> EP
APP -. "--emit neo4j" .-> PG["Labeled property graph<br/>(:PyApplication anchor)"]
PG --> SNAP["graph.cypher snapshot<br/>(no --neo4j-uri)"]
PG --> BOLT["live Bolt push<br/>(--neo4j-uri, incremental)"]
```
Symbol table
The symbol table is the structured inventory of the project: one PyModule per source file, each holding its imports, classes, functions, and module-level variables. It’s the foundation every other piece is built on, and it’s what you get even on the cheapest run.
```mermaid
flowchart LR
M[PyModule] --> C[PyClass]
M --> F["PyCallable (function)"]
C --> ME["PyCallable (method)"]
ME --> CS[PyCallsite]
ME --> P[PyCallableParameter]
C --> A[PyClassAttribute]
```
A PyCallable (function or method) carries its signature, source code, parameters, decorators, call_sites, accessed symbols, cyclomatic complexity, and nested callables/classes. A PyClass carries its base_classes, methods, attributes, and decorators. Each node records line/column spans so you can map any element back to source.
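The nesting above can be walked with a few lines of Python. This is a minimal sketch, not the library's API: it assumes the serialized model uses the keys "symbol_table", "functions", "classes", "methods", and "signature", and it accepts either dict or list containers because the page does not say which is used. The sample data is hand-written, and nested callables are ignored for brevity.

```python
from typing import Any, Dict, Iterable, Iterator


def _values(container: Any) -> Iterable[Dict[str, Any]]:
    """Return child entries whether they are stored as a dict or a list."""
    if isinstance(container, dict):
        return container.values()
    return container or []


def iter_callables(module: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield module-level functions first, then the methods of each class."""
    yield from _values(module.get("functions"))
    for cls in _values(module.get("classes")):
        yield from _values(cls.get("methods"))


# Hand-written stand-in for a loaded analysis.json (field names are assumed).
app = {
    "symbol_table": {
        "app/model.py": {
            "functions": {"app.model.load": {"signature": "app.model.load"}},
            "classes": {
                "app.model.Order": {
                    "methods": {
                        "app.model.Order.__init__": {
                            "signature": "app.model.Order.__init__"
                        }
                    }
                }
            },
        }
    }
}

signatures = [
    c["signature"]
    for module in app["symbol_table"].values()
    for c in iter_callables(module)
]
print(signatures)
```

Check the key names against your own analysis.json before relying on this traversal.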
The symbol table is built from a Tree-sitter / ast walk, with Jedi supplying type and reference resolution. Because Jedi resolves against the project’s own installed dependencies, canpy first builds an isolated virtual environment per project. In CI, containers, and sandboxed runs where that’s redundant, --no-venv resolves against the ambient interpreter instead.
Call graph
The call graph records who-calls-whom as a flat list of PyCallEdge objects. Each edge is identity-only: a source signature, a target signature, a weight, and a provenance list. The nodes of the graph are the PyCallable entries already in the symbol table; there’s no separate vertex type. Rich per-call detail (receiver, argument types, location) lives on the PyCallsite entries inside each callable.
```mermaid
flowchart LR
A["app.cli.main"] -->|jedi| B["app.parser.parse"]
B -->|jedi, codeql| C["app.model.Order.__init__"]
B -->|codeql| D["thirdparty.rpc.call"]
```
Because it’s a plain edge list keyed by signature, loading it into networkx is direct:
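```python
import json

import networkx as nx

with open("analysis.json") as f:
    app = json.load(f)

g = nx.DiGraph()
for e in app["call_graph"]:
    g.add_edge(e["source"], e["target"])

# Example signatures; substitute an entrypoint and a sink from your own project.
entry_sig, sink_sig = "app.cli.main", "thirdparty.rpc.call"
nx.has_path(g, entry_sig, sink_sig)  # reachability: a query, not a guess
```

How the graph is built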
Every run builds the graph in four steps; CodeQL participates only when --codeql is passed:
- CodeQL resolution (if enabled) produces resolved edges tagged provenance=["codeql"] and backfills callee_signature on call sites Jedi couldn’t resolve.
- Constructor fallback: a heuristic walks the symbol table by class short-name and scope to fill in constructor calls neither Jedi nor CodeQL resolved (common for classes nested inside functions), synthesizing <class>.__init__ targets.
- Jedi edges are derived from the now-fully-augmented symbol table, reflecting every resolution it contains.
- Merge: Jedi and CodeQL edges are unioned; an edge both engines saw carries both provenance tokens.
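The final merge step can be pictured as a union keyed by (source, target). The sketch below is an illustration under stated assumptions, not canpy's implementation: `merge_edges` is a hypothetical helper, edges are plain dicts with "source", "target", and "provenance" keys, and weight is left out because the page does not say how weights combine.

```python
from typing import Any, Dict, List, Tuple

Edge = Dict[str, Any]


def merge_edges(*edge_lists: List[Edge]) -> List[Edge]:
    """Union edge lists; an edge seen by several engines keeps every token."""
    merged: Dict[Tuple[str, str], Edge] = {}
    for edges in edge_lists:
        for e in edges:
            key = (e["source"], e["target"])
            slot = merged.setdefault(
                key,
                {"source": e["source"], "target": e["target"], "provenance": []},
            )
            for token in e["provenance"]:
                if token not in slot["provenance"]:
                    slot["provenance"].append(token)
    return list(merged.values())


jedi_edges = [
    {"source": "app.cli.main", "target": "app.parser.parse", "provenance": ["jedi"]},
    {"source": "app.parser.parse", "target": "app.model.Order.__init__",
     "provenance": ["jedi"]},
]
codeql_edges = [
    {"source": "app.parser.parse", "target": "app.model.Order.__init__",
     "provenance": ["codeql"]},
    {"source": "app.parser.parse", "target": "thirdparty.rpc.call",
     "provenance": ["codeql"]},
]

call_graph = merge_edges(jedi_edges, codeql_edges)
for edge in call_graph:
    print(edge["source"], "->", edge["target"], edge["provenance"])
```

The sample edges mirror the diagram earlier in this section: the constructor edge ends up with both tokens, the other two keep one each.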
Provenance
Every PyCallEdge carries a provenance list recording which engine(s) produced it: "jedi", "codeql", or an extension’s own token (e.g. "odoo_orm_dispatch"). It’s an open vocabulary, so a stored analysis.json round-trips no matter which engines or passes were installed when it was written. Provenance lets a consumer weigh edges by confidence, or filter to a single engine’s view. The projection carries it through: a PY_CALLS relationship keeps weight and the full provenance string array.
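Both uses mentioned above, a single engine's view and a confidence filter, reduce to simple list comprehensions over the edge list. This is a sketch with hypothetical helper names (`edges_from`, `corroborated`); treating "seen by at least two engines" as higher confidence is an assumption made for illustration, not a rule the tool prescribes.

```python
from typing import Any, Dict, List

Edge = Dict[str, Any]


def edges_from(edges: List[Edge], engine: str) -> List[Edge]:
    """Keep only the edges that the given engine reported."""
    return [e for e in edges if engine in e["provenance"]]


def corroborated(edges: List[Edge], minimum: int = 2) -> List[Edge]:
    """Keep edges reported by at least `minimum` distinct engines."""
    return [e for e in edges if len(set(e["provenance"])) >= minimum]


call_graph = [
    {"source": "app.cli.main", "target": "app.parser.parse",
     "provenance": ["jedi"]},
    {"source": "app.parser.parse", "target": "app.model.Order.__init__",
     "provenance": ["jedi", "codeql"]},
    {"source": "app.parser.parse", "target": "thirdparty.rpc.call",
     "provenance": ["codeql"]},
]

codeql_view = edges_from(call_graph, "codeql")
high_confidence = corroborated(call_graph)
print(len(codeql_view), len(high_confidence))
```

Because the vocabulary is open, the same helpers work unchanged for an extension token such as "odoo_orm_dispatch".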
Entrypoints
Entrypoints are the framework-dispatched roots of an application: the functions a framework calls that your own code never calls directly, such as a Flask route handler, a Celery task, a Click command, or a gRPC servicer method. They’re collected into entrypoints, keyed by framework name, with each PyEntrypoint referencing a callable by signature and carrying framework metadata (route path, HTTP methods, task name, …).
Entrypoints matter because reachability is only meaningful from a real root. “Is this sink reachable?” becomes answerable once you know where execution actually enters the program. See Entrypoint detection.
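As a rough illustration of rooted reachability, the sketch below runs a breadth-first search over the edge list, starting from every entrypoint. It assumes each serialized entrypoint exposes its callable under a "signature" key (the page says only that it references a callable by signature); `reachable_from` and the sample data are hypothetical.

```python
from collections import deque
from typing import Any, Dict, Iterable, List, Set


def reachable_from(roots: Iterable[str], call_graph: List[Dict[str, Any]]) -> Set[str]:
    """Return every signature reachable from the roots, roots included."""
    successors: Dict[str, Set[str]] = {}
    for e in call_graph:
        successors.setdefault(e["source"], set()).add(e["target"])

    seen = set(roots)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hand-written sample; "signature" as the entrypoint key is an assumption.
entrypoints = {"click": [{"signature": "app.cli.main"}]}
call_graph = [
    {"source": "app.cli.main", "target": "app.parser.parse"},
    {"source": "app.parser.parse", "target": "thirdparty.rpc.call"},
    {"source": "app.admin.purge", "target": "app.db.drop"},  # no root calls this
]

roots = [ep["signature"] for eps in entrypoints.values() for ep in eps]
live = reachable_from(roots, call_graph)
print("thirdparty.rpc.call" in live, "app.db.drop" in live)
```

In this sample, app.db.drop is called only from a function that no entrypoint reaches, so it falls outside the reachable set.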
The analysis cache
Analysis is lazy by default. canpy stores its results under .codeanalyzer/ and, on the next run, reuses the cached entry for any file whose mtime, size, and content hash are unchanged; only new or modified files are re-analyzed. --eager forces a full rebuild; --clear-cache deletes the cache on exit.
Crucially, only the symbol table and base call graph are cached. The pass-pipeline output — entrypoints and synthetic edges — is recomputed on every run, so it can never go stale when an extension is added, changed, or removed.
```mermaid
flowchart LR
R[analyze] --> Cache{"cached &<br/>unchanged?"}
Cache -->|yes| Reuse[reuse symbol table<br/>+ base call graph]
Cache -->|no| Build[rebuild from source]
Reuse --> Pipe[run pass pipeline<br/>always]
Build --> Pipe
Pipe --> Out[PyApplication]
```
The same per-file content hash that drives this cache also drives the incremental Neo4j push: a PyModule carries its content_hash, and a live Bolt push compares it against the hash already in the database so only changed modules are rewritten.
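The unchanged-file test described in this section could look roughly like the sketch below. It is an assumption-laden illustration: the real cache layout under .codeanalyzer/ is not documented here, SHA-256 is a stand-in for whatever hash canpy uses, and `fingerprint` and `is_unchanged` are hypothetical names.

```python
import hashlib
import os
import tempfile
from typing import Any, Dict, Optional


def fingerprint(path: str) -> Dict[str, Any]:
    """Collect the three values the cache compares: mtime, size, content hash."""
    st = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()  # hash choice is assumed
    return {"mtime": st.st_mtime, "size": st.st_size, "content_hash": digest}


def is_unchanged(path: str, cached: Optional[Dict[str, Any]]) -> bool:
    """True only when a cached entry exists and all three values still match."""
    return cached is not None and fingerprint(path) == cached


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mod.py")
    with open(path, "w") as f:
        f.write("x = 1\n")
    cached = fingerprint(path)
    fresh_before = is_unchanged(path, cached)

    with open(path, "w") as f:
        f.write("x = 22\n")  # the edit changes both size and content hash
    fresh_after = is_unchanged(path, cached)

print(fresh_before, fresh_after)
```

A production check would likely compare the cheap stat values first and hash only when they match; the sketch compares all three at once for clarity.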
From artifact to property graph
analysis.json is one self-contained file: to query it you load the whole thing into memory and walk it. That’s fine for a single project and a non-starter across a portfolio. --emit neo4j projects the same in-memory PyApplication (same symbol table, same call graph) into a labeled property graph so many applications can live in one database and you query across all of them with Cypher instead of parsing giant JSON blobs.
The projection is a faithful mapping, not a new analysis:
- Labels are namespaced. Every node label is Py-prefixed and every relationship type is PY_-prefixed (:PyModule, :PyClass, :PyCallable, PY_CALLS, PY_DECLARES) so the Java (:J* / no prefix) and TypeScript analyzers can share one database without label or relationship-type collisions.
- Declarations are keyed by signature. :PyClass, :PyCallable, and :PyExternal are all MERGEd under a shared :PySymbol label keyed by signature, the very identity used in the symbol table and call graph. That’s what lets call edges, inheritance, and declaration containment reference a symbol without duplicating it.
- Ghost nodes become :PyExternal. The third-party and RPC endpoints that the in-memory model keeps as ghost nodes are materialized authoritatively as :PyExternal nodes (carrying name and module). A PyCallSite resolves via PY_RESOLVES_TO to either a real :PyCallable or an external, and PY_CALLS edges to externals survive the projection.
- One application, one anchor. Everything hangs off a single :PyApplication node whose name is your --app-name (defaulting to the input directory’s basename). The node also carries schema_version, currently 1.1.0, so a consumer can check the contract it’s reading against.
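To make "keyed by signature" concrete, here is a hedged sketch of how call edges might be shaped into rows for a batched UNWIND … MERGE. The Cypher text is illustrative only and is not the statement canpy emits; it uses just the labels and properties named above (:PySymbol, signature, PY_CALLS, weight, provenance). `call_edge_rows` is a hypothetical helper, and the default weight of 1 is an assumption.

```python
from typing import Any, Dict, List

# Illustrative statement only; not the text canpy writes to graph.cypher.
MERGE_CALLS = """
UNWIND $rows AS row
MERGE (src:PySymbol {signature: row.source})
MERGE (dst:PySymbol {signature: row.target})
MERGE (src)-[r:PY_CALLS]->(dst)
SET r.weight = row.weight, r.provenance = row.provenance
""".strip()


def call_edge_rows(call_graph: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten edges into parameter rows for a batched UNWIND."""
    return [
        {
            "source": e["source"],
            "target": e["target"],
            "weight": e.get("weight", 1),  # default weight is an assumption
            "provenance": list(e["provenance"]),
        }
        for e in call_graph
    ]


rows = call_edge_rows(
    [
        {"source": "app.parser.parse", "target": "app.model.Order.__init__",
         "weight": 1, "provenance": ["jedi", "codeql"]},
        {"source": "app.parser.parse", "target": "thirdparty.rpc.call",
         "provenance": ["codeql"]},
    ]
)
print(MERGE_CALLS)
print(rows)
```

Because both endpoints are merged on signature, re-running the statement with the same rows matches the existing nodes and relationship instead of duplicating them. The sketch omits the scoping to the :PyApplication anchor.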
Snapshot vs. incremental Bolt
--emit neo4j has two sub-modes, decided solely by whether --neo4j-uri is set:
Without --neo4j-uri, canpy writes a self-contained graph.cypher to the output directory: constraints and indexes, a scoped wipe of this application’s prior subtree, then batched UNWIND … MERGE for nodes and edges. It needs no extra dependencies and expresses the full truth of the analysis (it is not incremental). Load it with cypher-shell:
```bash
canpy --input ./my-python-project --emit neo4j --app-name my-service --output ./out
cypher-shell < ./out/graph.cypher
```

The wipe is scoped to MATCH (a:PyApplication {name: "my-service"}) and its module subtree, so reloading one application never touches another’s data in a shared database.
With --neo4j-uri, canpy pushes to a live Neo4j over Bolt incrementally: it ensures the schema (constraints + indexes), diffs each module’s content_hash against the database, and rewrites only the modules that changed. Shared :PyExternal / :PyPackage / :PyDecorator nodes are MERGE-only and nodes are never blindly deleted, so cross-module references survive. On a full run (no --file-name), modules whose source file vanished are pruned — and that prune is scoped to this application’s :PyApplication anchor, so pushing one app can’t delete another app’s modules.
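The diff behind an incremental push can be sketched as a comparison of two hash maps. This is an illustration under assumptions, not canpy's code: modules are keyed by file path, `plan_push` is a hypothetical helper, and `full_run=False` stands for a run restricted with --file-name, where nothing is pruned.

```python
from typing import Dict, List, Tuple


def plan_push(
    local_hashes: Dict[str, str],
    db_hashes: Dict[str, str],
    full_run: bool = True,
) -> Tuple[List[str], List[str]]:
    """Return (modules to rewrite, modules to prune) for one application."""
    # New modules and modules whose content hash differs must be rewritten.
    rewrite = sorted(
        path for path, digest in local_hashes.items()
        if db_hashes.get(path) != digest
    )
    # Only a full run may prune modules whose source file vanished.
    prune = sorted(set(db_hashes) - set(local_hashes)) if full_run else []
    return rewrite, prune


local = {"a.py": "h1", "b.py": "h2-new", "d.py": "h4"}
in_db = {"a.py": "h1", "b.py": "h2", "c.py": "h3"}

rewrite, prune = plan_push(local, in_db)
print(rewrite, prune)
```

Here a.py is untouched, b.py (changed) and d.py (new) are rewritten, and c.py is pruned only because the run is full. In the real push, `db_hashes` would hold only modules under this application's :PyApplication anchor, which is what keeps the prune from reaching another application's modules.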
This path needs the optional neo4j driver extra:
```bash
pip install 'codeanalyzer-python[neo4j]'
```

Prefer the NEO4J_PASSWORD environment variable over --neo4j-password, because the flag is visible in shell history and the process list. NEO4J_URI, NEO4J_USERNAME, and NEO4J_DATABASE are read the same way (an explicit flag wins when set):
```bash
export NEO4J_PASSWORD=secret
canpy --input ./my-python-project --emit neo4j --app-name my-service \
  --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j
```
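The precedence rule (an explicit flag wins, otherwise the environment variable is used) can be expressed in a few lines. This is a sketch of the rule only; `resolve_setting` is a hypothetical helper and says nothing about how canpy parses its arguments.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    flag_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicitly passed flag; fall back to the environment."""
    if flag_value is not None:
        return flag_value
    return environ.get(env_name)


# A plain dict stands in for the process environment in this example.
env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_PASSWORD": "secret"}

uri = resolve_setting("bolt://graph.internal:7687", "NEO4J_URI", env)
password = resolve_setting(None, "NEO4J_PASSWORD", env)
database = resolve_setting(None, "NEO4J_DATABASE", env)
print(uri, password, database)
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.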