Skip to content

codellm-devkit/codeanalyzer-python

Repository files navigation

CodeLLM-DevKit

codeanalyzer-python (canpy)

A Python static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.

PyPI Python GitHub release Release License


canpy is a static analyzer for Python built on Jedi, with optional CodeQL-resolved call edges and Tree-sitter parsing. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json — a symbol table plus a call graph — and can project that same analysis into a Neo4j property graph. It is the Python backend behind CLDK, mirroring its TypeScript (cants) and Java siblings.

Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.

Table of Contents

Features

  • Symbol table — modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
  • Call graph — Jedi's lexical resolver by default, with optional CodeQL-resolved edges (--codeql) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges; CodeQL also backfills callees Jedi could not resolve.
  • Neo4j output — project the analysis into a labeled property graph: a self-contained graph.cypher snapshot, or an incremental push to a live database over Bolt.
  • Versioned schema — a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
  • Incremental cache — per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
  • Compact output — canonical analysis.json, or binary analysis.msgpack for smaller artifacts.

Installation

Prerequisites

  • Python 3.10 or newer.

  • A C toolchain and the venv / development headers — the analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:

    # Ubuntu / Debian
    sudo apt install python3-venv python3-dev build-essential
    
    # Fedora / RHEL / CentOS
    sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
    
    # macOS
    xcode-select --install

Install via pip (PyPI)

pip install codeanalyzer-python
canpy --help

For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:

pip install 'codeanalyzer-python[neo4j]'

Install via shell script

Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh

Install via Homebrew

brew install codellm-devkit/tap/codeanalyzer-python

The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).

Build from source

This project uses uv for dependency management.

git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help

Usage

canpy --input /path/to/python/project

With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with --format msgpack) in that directory.

Options

$ canpy --help

 Usage: canpy [OPTIONS] COMMAND [ARGS]...

 Static Analysis on Python source code using Jedi, CodeQL and Tree sitter.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --input           -i                     PATH                 Path to the project root directory │
│                                                               (not required for --emit schema).  │
│ --output          -o                     PATH                 Output directory for artifacts.    │
│ --format          -f                     [json|msgpack]       Output format for --emit json:     │
│                                                               json or msgpack.                   │
│                                                               [default: json]                    │
│ --emit                                   [json|neo4j|schema]  Output target: json                │
│                                                               (analysis.json, default) | neo4j   │
│                                                               (graph.cypher or live Bolt push) | │
│                                                               schema (the Neo4j schema.json      │
│                                                               contract).                         │
│                                                               [default: json]                    │
│ --app-name                               TEXT                 Logical application name for the   │
│                                                               graph :PyApplication anchor        │
│                                                               (default: input dir name).         │
│ --neo4j-uri                              TEXT                 Push the graph to a live Neo4j     │
│                                                               over Bolt (incremental); omit to   │
│                                                               write graph.cypher.                │
│                                                               [env var: NEO4J_URI]               │
│ --neo4j-user                             TEXT                 Neo4j username.                    │
│                                                               [env var: NEO4J_USERNAME]          │
│                                                               [default: neo4j]                   │
│ --neo4j-password                         TEXT                 Neo4j password. Prefer the env var │
│                                                               over the flag (the flag is visible │
│                                                               in shell history / process list).  │
│                                                               [env var: NEO4J_PASSWORD]          │
│                                                               [default: neo4j]                   │
│ --neo4j-database                         TEXT                 Neo4j database name (default:      │
│                                                               server default).                   │
│                                                               [env var: NEO4J_DATABASE]          │
│ --codeql              --no-codeql                             Enable CodeQL-based analysis.      │
│                                                               [default: no-codeql]               │
│ --ray                 --no-ray                                Enable Ray for distributed         │
│                                                               analysis.                          │
│                                                               [default: no-ray]                  │
│ --eager               --lazy                                  Enable eager or lazy analysis.     │
│                                                               Defaults to lazy.                  │
│                                                               [default: lazy]                    │
│ --skip-tests          --include-tests                         Skip test files in analysis.       │
│                                                               [default: skip-tests]              │
│ --no-venv             --venv                                  Skip virtualenv creation and       │
│                                                               dependency installation; resolve   │
│                                                               imports against the ambient Python │
│                                                               environment instead.               │
│                                                               [default: venv]                    │
│ --file-name                              PATH                 Analyze only the specified file    │
│                                                               (relative to input directory).     │
│ --cache-dir       -c                     PATH                 Directory to store analysis cache. │
│                                                               Defaults to '.codeanalyzer' in the │
│                                                               input directory.                   │
│ --clear-cache         --keep-cache                            Clear cache after analysis. By     │
│                                                               default, cache is retained.        │
│                                                               [default: keep-cache]              │
│                   -v                     INTEGER              Increase verbosity: -v, -vv, -vvv  │
│                                                               [default: 0]                       │
│ --help                                                        Show this message and exit.        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Examples

  1. Basic analysis to stdout, or to a file:

    canpy --input ./my-python-project                        # compact JSON on stdout
    canpy --input ./my-python-project --output ./out         # → ./out/analysis.json
  2. Binary output (msgpack):

    canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack
  3. Resolve extra call edges with CodeQL:

    canpy --input ./my-python-project --codeql

    By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills resolved callees Jedi could not resolve. CodeQL integration is experimental; the CLI is downloaded into <cache_dir>/codeql/ on first use.

  4. Emit a Neo4j snapshot, or push to a live database:

    canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
    canpy --input ./my-python-project --emit neo4j \
      --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
  5. Emit the Neo4j schema contract:

    canpy --emit schema                   # print schema.json to stdout (no project needed)
    canpy --emit schema --output ./out    # → ./out/schema.json
  6. Force a clean rebuild with a custom cache directory:

    canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache

Output targets

canpy builds one analysis in memory and can emit it three ways (--emit):

analysis.json (default)

A PyApplication document — the canonical CLDK contract:

{
  "symbol_table": { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph":   [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ]
}

By default this is printed to stdout in JSON; with --output it is written to analysis.json (or analysis.msgpack with --format msgpack, a more compact binary format).

Neo4j graph

--emit neo4j projects the same analysis into a labeled property graph. Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance, decorators, and call sites are relationships:

  • Without --neo4j-uri — writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
  • With --neo4j-uri — pushes to a live Neo4j over Bolt incrementally: only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra. Every graph carries a schema_version on its :PyApplication node.

Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.

The connection options also read from the standard Neo4j environment variables — NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE — when the corresponding flag is omitted (an explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the process list:

export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j     # credentials picked up from the environment

Schema contract

--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels, relationships, properties, constraints, and indexes). It needs no project and is checked into the repo as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the codeanalyzer-typescript backend.

A UML of the analysis.json schema (the PyApplication containment tree) is checked in as schema-uml.drawio, and the property-graph schema as neo4j-schema.drawio.

Development

This project uses uv.

uv sync --all-groups
uv run canpy --input /path/to/project           # run from source
uv run canpy --emit schema > schema.neo4j.json  # regenerate the checked-in schema contract
uv run python scripts/update_readme.py          # regenerate the canpy --help block above
uv run pytest                                   # run the test suite

The Neo4j schema-conformance test always runs. The Neo4j bolt integration test spins up a real Neo4j via Testcontainers and is opt-in — it needs a container runtime (Docker or Podman) and is enabled with an environment variable:

RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s

License

Apache 2.0 — see LICENSE.

About

Python Static Analysis Backend for CLDK

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors