codeanalyzer-python

The Python analysis backend powers PythonAnalysis in CLDK. It runs Jedi for semantic analysis, Tree-sitter for syntax parsing, and, optionally, CodeQL for call-graph augmentation. The output is a canonical PyApplication schema that ships with the backend and is re-exported by the CLDK Python SDK.

codeanalyzer-python is a standalone static analysis library (published to PyPI as codeanalyzer-python) that the SDK manages in a virtualenv. It converts a Python project into a queryable symbol table and call graph.

The backend produces:

  • Symbol table: All modules, classes, methods, functions, imports, parameters, and docstrings in typed PyModule objects.
  • Call graph: Inter- and intra-procedural call edges (PyCallEdge). Jedi provides the baseline; when CodeQL augmentation is enabled, the resolved callees from both engines are merged.
  • Class hierarchies: Base classes, inheritance chains, and method overrides.
  • Entrypoints: Framework-detected entry points (Flask routes, Celery tasks, Django views, gRPC servicers, etc.) linked to their callables.

flowchart LR
    A["Input: project_path"] --> B["Virtualenv<br/>+ deps"]
    B --> C["Jedi: symbol table<br/>+ Jedi call edges"]
    B --> D["CodeQL<br/>(optional)"]
    D --> E["Merge edges"]
    C --> E
    E --> F["Symbol table<br/>Call graph"]
    F --> G["PyApplication<br/>+ PyModule, PyClass,<br/>PyCallable, PyCallEdge"]
    G --> H["JSON / Msgpack"]
    G --> I["CLDK SDK<br/>cldk.models.python"]

  • codeanalyzer.core:Codeanalyzer: Orchestrates the analysis pipeline, manages virtualenv setup, caching, and invokes semantic passes.
  • codeanalyzer.syntactic_analysis:SymbolTableBuilder: Parses Python source via Tree-sitter and Jedi to extract modules, classes, methods, and call sites.
  • codeanalyzer.semantic_analysis.call_graph: Builds inter-procedural call graphs using Jedi’s resolution; merges CodeQL edges when enabled.
  • codeanalyzer.semantic_analysis.codeql: Optional CodeQL integration for resolving dynamic calls, third-party dispatch, and RPC targets.
  • codeanalyzer.schema:py_schema: Defines all Pydantic models: PyModule, PyClass, PyCallable, PyCallEdge, PyApplication, and others.

All models are defined in /codeanalyzer/schema/py_schema.py and re-exported in the CLDK SDK at cldk.models.python:

PyApplication: The root output of every analysis run.

class PyApplication(BaseModel):
    symbol_table: Dict[str, PyModule]     # file_path → PyModule
    call_graph: List[PyCallEdge] = []     # edges with source → target signature

PyModule: Represents one .py file.

class PyModule(BaseModel):
    file_path: str
    module_name: str
    imports: List[PyImport] = []
    comments: List[PyComment] = []
    classes: Dict[str, PyClass] = {}          # class_name → PyClass
    functions: Dict[str, PyCallable] = {}     # function_name → PyCallable
    variables: List[PyVariableDeclaration] = []
    content_hash: Optional[str] = None        # for cache invalidation
    last_modified: Optional[float] = None
    file_size: Optional[int] = None

PyClass: A class definition.

class PyClass(BaseModel):
    name: str
    signature: str                                  # e.g., "my_pkg.module.ClassName"
    comments: List[PyComment] = []
    code: str | None = None
    base_classes: List[str] = []                    # parent signatures
    methods: Dict[str, PyCallable] = {}             # method_name → PyCallable
    attributes: Dict[str, PyClassAttribute] = {}    # attr_name → attribute
    inner_classes: Dict[str, "PyClass"] = {}
    start_line: int
    end_line: int

PyCallable: A function or method.

class PyCallable(BaseModel):
    name: str
    path: str
    signature: str                            # e.g., "my_pkg.module.ClassName.method_name"
    comments: List[PyComment] = []
    decorators: List[str] = []
    parameters: List[PyCallableParameter] = []
    return_type: Optional[str] = None
    code: str | None = None                   # source code of the callable
    start_line: int
    end_line: int
    code_start_line: int
    accessed_symbols: List[PySymbol] = []     # local variable / import refs
    call_sites: List[PyCallsite] = []         # calls made within this callable
    inner_callables: Dict[str, "PyCallable"] = {}
    inner_classes: Dict[str, "PyClass"] = {}
    local_variables: List[PyVariableDeclaration] = []
    cyclomatic_complexity: int = 0
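Because callables and classes nest (inner_callables, inner_classes, methods), listing every callable in a project requires a recursive walk. The following is a minimal sketch over a plain dict shaped like the serialized PyApplication; the helper names (iter_callables, iter_class_callables, all_signatures) and the sample data are hypothetical and are not part of the backend or the SDK.

```python
from typing import Any, Dict, Iterator, List


def iter_callables(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a callable dict, then everything nested inside it."""
    yield node
    for inner in node.get("inner_callables", {}).values():
        yield from iter_callables(inner)
    for cls in node.get("inner_classes", {}).values():
        yield from iter_class_callables(cls)


def iter_class_callables(cls: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every method of a class, including methods of inner classes."""
    for method in cls.get("methods", {}).values():
        yield from iter_callables(method)
    for inner in cls.get("inner_classes", {}).values():
        yield from iter_class_callables(inner)


def all_signatures(application: Dict[str, Any]) -> List[str]:
    """Collect the signature of every callable in a PyApplication-shaped dict."""
    signatures: List[str] = []
    for module in application.get("symbol_table", {}).values():
        for func in module.get("functions", {}).values():
            signatures.extend(c["signature"] for c in iter_callables(func))
        for cls in module.get("classes", {}).values():
            signatures.extend(c["signature"] for c in iter_class_callables(cls))
    return sorted(signatures)


# Hand-written sample shaped like the schema; not real analyzer output.
sample = {
    "symbol_table": {
        "my_pkg/app.py": {
            "functions": {
                "main": {
                    "signature": "my_pkg.app.main",
                    "inner_callables": {
                        "helper": {"signature": "my_pkg.app.main.helper"}
                    },
                }
            },
            "classes": {
                "Service": {
                    "methods": {"run": {"signature": "my_pkg.app.Service.run"}}
                }
            },
        }
    }
}

print(all_signatures(sample))
```

The same traversal applies to the typed models by replacing dict lookups with attribute access.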

PyCallsite: A single call inside a callable.

class PyCallsite(BaseModel):
    method_name: str
    receiver_expr: Optional[str] = None       # "obj" in obj.method()
    receiver_type: Optional[str] = None
    argument_types: List[str] = []
    return_type: Optional[str] = None
    callee_signature: Optional[str] = None    # resolved target (if found)
    is_constructor_call: bool = False
    start_line: int
    start_column: int
    end_line: int
    end_column: int
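Since callee_signature is optional, consumers usually need to separate resolved call sites from unresolved ones. A small sketch of that split follows; Callsite is a stand-in dataclass carrying only two of the PyCallsite fields, and split_by_resolution is a hypothetical helper, not an SDK method.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Callsite:
    """Stand-in with two PyCallsite fields; not the real Pydantic model."""
    method_name: str
    callee_signature: Optional[str] = None


def split_by_resolution(call_sites: List[Callsite]) -> Tuple[List[str], List[str]]:
    """Return (resolved callee signatures, unresolved method names)."""
    resolved = [c.callee_signature for c in call_sites if c.callee_signature is not None]
    unresolved = [c.method_name for c in call_sites if c.callee_signature is None]
    return resolved, unresolved


sites = [
    Callsite("load", "my_pkg.io.load"),
    Callsite("dispatch"),  # e.g. a dynamic call whose target was not resolved
    Callsite("save", "my_pkg.io.save"),
]
resolved, unresolved = split_by_resolution(sites)
print(resolved, unresolved)
```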

PyCallEdge: A directed edge in the call graph.

class PyCallEdge(BaseModel):
    source: str     # caller PyCallable.signature
    target: str     # callee PyCallable.signature
    type: Literal["CALL_DEP"] = "CALL_DEP"
    weight: int = 1
    provenance: List[Literal["jedi", "codeql", "joern"]] = []
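The provenance list records which engines reported an edge. One way to picture the merge of Jedi and CodeQL edges is to key edges by (source, target) and union their provenance. This is a sketch under that assumption: CallEdge is a stand-in dataclass, merge_edges is a hypothetical helper, and the backend's actual merge rules (for example, how weight is combined) are not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class CallEdge:
    """Stand-in for PyCallEdge with only the fields needed for merging."""
    source: str
    target: str
    provenance: List[str] = field(default_factory=list)


def merge_edges(*edge_sets: Iterable[CallEdge]) -> List[CallEdge]:
    """Deduplicate edges by (source, target) and union their provenance."""
    merged: Dict[Tuple[str, str], CallEdge] = {}
    for edges in edge_sets:
        for edge in edges:
            key = (edge.source, edge.target)
            existing = merged.setdefault(key, CallEdge(edge.source, edge.target))
            for engine in edge.provenance:
                if engine not in existing.provenance:
                    existing.provenance.append(engine)
    return list(merged.values())


jedi_edges = [
    CallEdge("pkg.a", "pkg.b", ["jedi"]),
    CallEdge("pkg.b", "pkg.c", ["jedi"]),
]
codeql_edges = [
    CallEdge("pkg.a", "pkg.b", ["codeql"]),  # also found by Jedi
    CallEdge("pkg.a", "pkg.d", ["codeql"]),  # found only by CodeQL
]
merged = merge_edges(jedi_edges, codeql_edges)
for edge in merged:
    print(edge.source, "->", edge.target, edge.provenance)
```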

PyImport: An import statement.

class PyImport(BaseModel):
    module: str     # "os", "flask", "my_pkg.utils"
    name: str       # "path", "Flask", "helper"
    alias: Optional[str] = None
    start_line: int
    end_line: int
    start_column: int
    end_column: int

PyComment: A comment or docstring.

class PyComment(BaseModel):
    content: str
    start_line: int
    end_line: int
    start_column: int
    end_column: int
    is_docstring: bool = False

All Py* models support:

  • Pydantic v1 and v2 compatibility via cldk.models.python.
  • Builder pattern: PyModule.builder().module_name("x").classes({...}).build().
  • Serialization: to_msgpack_bytes(), from_msgpack_bytes(), model_dump_json().

The backend ships a command-line tool canpy, installed by pip install codeanalyzer-python:

canpy --input /path/to/my_pkg [OPTIONS]

codeanalyzer is a deprecated alias kept for backwards compatibility: it prints a deprecation warning to stderr and delegates to canpy. Prefer canpy.

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --input | -i | PATH | Required | Project root directory to analyze |
| --output | -o | PATH | None | Save analysis.json or analysis.msgpack to this directory (stdout if None) |
| --format | -f | json or msgpack | json | Output serialization format |
| --emit | | json, neo4j, or schema | json | Output target: json (analysis.json), neo4j (graph.cypher, or a live Bolt push with --neo4j-uri), or schema (Neo4j schema contract; needs no input) |
| --app-name | | str | input dir name | Application name for the graph :PyApplication anchor |
| --neo4j-uri | | str | None (NEO4J_URI) | Push the graph to a live Neo4j over Bolt (incremental); omit to write graph.cypher |
| --neo4j-user | | str | neo4j (NEO4J_USERNAME) | Neo4j username |
| --neo4j-password | | str | neo4j (NEO4J_PASSWORD) | Neo4j password (prefer the env var) |
| --neo4j-database | | str | server default (NEO4J_DATABASE) | Neo4j database name |
| --codeql / --no-codeql | | bool | false | Enable CodeQL-based call-graph augmentation (experimental) |
| --ray / --no-ray | | bool | false | Enable Ray for distributed analysis |
| --eager / --lazy | | bool | lazy | Force rebuild cache (eager) or reuse cached results (lazy) |
| --cache-dir | -c | PATH | .codeanalyzer in input dir | Where to store virtualenv, CodeQL DB, analysis cache |
| --keep-cache / --clear-cache | | bool | keep | Retain cache after analysis (default) or remove it |
| --skip-tests / --include-tests | | bool | skip | Exclude or include test_*.py / *_test.py files |
| --file-name | | PATH | None | Analyze only a single file (relative to input dir) |
| -v | | count | 0 | Verbosity: -v, -vv, -vvv for debug/trace |
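
To illustrate what the --skip-tests / --include-tests patterns select, here is a sketch that filters a file list by the two globs named in the table. It assumes the patterns are matched against the file's base name only; is_test_file and select_files are hypothetical helpers, and the backend's own matching may differ (for example, it may also consider directory names).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# The two patterns listed for --skip-tests / --include-tests.
TEST_PATTERNS = ("test_*.py", "*_test.py")


def is_test_file(path: str) -> bool:
    """True when the file's base name matches one of the test patterns."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in TEST_PATTERNS)


def select_files(paths: Iterable[str], skip_tests: bool = True) -> List[str]:
    """Drop test files when skip_tests is set; otherwise keep everything."""
    return [p for p in paths if not (skip_tests and is_test_file(p))]


files = [
    "src/app.py",
    "tests/test_app.py",
    "src/app_test.py",
    "src/contest.py",  # contains "test" but matches neither pattern
]
print(select_files(files))
print(select_files(files, skip_tests=False))
```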

Basic symbol table analysis:

canpy -i ./my_pkg

Prints the analysis JSON (symbol table plus the Jedi call graph) to stdout.

With CodeQL augmentation:

canpy -i ./my_pkg --codeql

Merges Jedi edges with CodeQL-resolved edges. CodeQL integration is experimental and can lengthen analysis time.

With Ray distributed analysis:

canpy -i ./my_pkg --ray

Enables Ray for parallel processing across available cores.

Save to file in msgpack format:

canpy -i ./my_pkg -o ./results --format msgpack

Saves analysis.msgpack to ./results; the file is 30-50% of the size of the equivalent JSON.

Custom cache, eager rebuild:

canpy -i ./my_pkg --cache-dir /tmp/analysis-cache --eager

Rebuilds the virtualenv and analysis cache from scratch and stores them in /tmp/analysis-cache/.codeanalyzer.

Single file:

canpy -i ./my_pkg --file-name src/handlers.py

Analyzes only src/handlers.py.

A call to CLDK.python(project_path="my_pkg") in the Python SDK proceeds as follows:

  1. Virtualenv provisioning: CLDK detects or installs codeanalyzer-python into a managed virtualenv in the cache directory (default: <project_dir>/.codeanalyzer/venv).

  2. CLI invocation: The SDK constructs a canpy command with options (--codeql, --eager, --cache-dir, etc.) and runs it as a subprocess. Stdout is parsed as analysis.json.

  3. Schema re-export: cldk.models.python re-exports PyApplication, PyModule, PyClass, PyCallable, and other Py* types directly from codeanalyzer.schema.py_schema, ensuring a single source of truth.

  4. In-memory analysis object: The parsed PyApplication is passed to PythonAnalysis, which wraps it with convenience methods:

    • get_symbol_table() → Dict[str, PyModule]
    • get_classes() → Dict[str, PyClass]
    • get_call_graph() → networkx.DiGraph
    • get_callers(target_class_name, target_method_declaration) → Dict
    • get_callees(source_class_name, source_method_declaration) → Dict

import networkx as nx

from cldk import CLDK
from cldk.analysis import AnalysisLevel
from cldk.analysis.commons.backend_config import PyCodeAnalyzerConfig

analysis = CLDK.python(
    project_path="my_pkg",
    analysis_level=AnalysisLevel.call_graph,
    backend=PyCodeAnalyzerConfig(use_codeql=True),  # optional; merges CodeQL edges
)

# Query the symbol table
modules = analysis.get_symbol_table()
classes = analysis.get_classes()

# Compute reachability
call_graph = analysis.get_call_graph()
is_reachable = nx.has_path(call_graph, "my_pkg.main", "my_pkg.unsafe_sink")

# Find callers
callers = analysis.get_callers("my_pkg.MyClass", "process")
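
Step 2 of the flow above (CLI invocation) can be pictured as building an argument list, running it, and parsing stdout. The sketch below is not the SDK's code: build_canpy_command, run_canpy, and parse_analysis are hypothetical names, and only flags listed in the options table are used. run_canpy is defined but not called, so the example works without canpy installed.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


def build_canpy_command(
    project_path: str,
    use_codeql: bool = False,
    eager: bool = False,
    cache_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a canpy argument list from SDK-style options."""
    cmd = ["canpy", "--input", project_path, "--format", "json"]
    cmd.append("--codeql" if use_codeql else "--no-codeql")
    cmd.append("--eager" if eager else "--lazy")
    if cache_dir is not None:
        cmd += ["--cache-dir", cache_dir]
    return cmd


def parse_analysis(stdout: str) -> Dict[str, Any]:
    """Parse the JSON document that canpy prints when --output is omitted."""
    return json.loads(stdout)


def run_canpy(cmd: List[str]) -> Dict[str, Any]:
    """Run canpy as a subprocess and parse its stdout (requires canpy on PATH)."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_analysis(completed.stdout)


cmd = build_canpy_command("my_pkg", use_codeql=True, cache_dir="/tmp/analysis-cache")
print(" ".join(cmd))

# Hand-written stand-in for canpy output, used only to exercise the parser.
fake_stdout = '{"symbol_table": {}, "call_graph": []}'
print(parse_analysis(fake_stdout))
```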

The backend is selected by the type of the backend= config passed to CLDK.python(...):

  • In-memory codeanalyzer (default): omit backend=, or pass backend=PyCodeAnalyzerConfig(...). The Python-only call-graph knobs use_codeql=... and use_ray=... live on this config, as does cache_dir=....
  • Read-only Neo4j: pass backend=Neo4jConnectionConfig(...) to query a graph populated out of band (no local analysis is run).

from cldk import CLDK
from cldk.analysis.commons.backend_config import (
    PyCodeAnalyzerConfig,
    Neo4jConnectionConfig,
)

# In-memory backend with Ray + custom cache directory
analysis = CLDK.python(
    project_path="my_pkg",
    backend=PyCodeAnalyzerConfig(
        use_codeql=True,
        use_ray=True,
        cache_dir="/tmp/analysis-cache",
    ),
)

# Read-only Neo4j backend
analysis = CLDK.python(
    project_path="my_pkg",
    backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        username="neo4j",
        password="neo4j",
        database=None,
        application_name="my_pkg",
    ),
)

Neo4jConnectionConfig is importable from cldk.analysis.commons.backend_config (and also from cldk.analysis.python.neo4j).

CLDK.python(...) keeps the project_path, analysis_level, target_files, and eager keyword arguments. The old CLDK(language="python").analysis(...) form still works but is deprecated; prefer CLDK.python(...). The from cldk import CLDK import is unchanged.

  • Cache location: A single language-keyed cache_dir (default: <project_dir>/.codeanalyzer); Python artifacts live under <cache_dir>/python/. Set it via backend=PyCodeAnalyzerConfig(cache_dir=...).
  • Virtualenv: Auto-created under the cache directory. The backend installs dependencies from requirements.txt, pyproject.toml, setup.py, Pipfile, etc.
  • Analysis cache: Indexed by file content hash; unchanged files reuse cached results.
  • CodeQL database: Stored under the cache directory if CodeQL is enabled; downloaded on first use.

To force a clean rebuild, pass eager=True to CLDK.python(...) (or --eager on the CLI), or delete the cache directory.
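
The content-hash cache and the eager override described above can be sketched as follows. This is an illustration only: AnalysisCache and content_hash are hypothetical names, SHA-256 is an assumed hash function (the backend's choice is not stated here), and the cache lives in memory rather than on disk.

```python
import hashlib
from typing import Callable, Dict, Tuple


def content_hash(source: str) -> str:
    """Hash file contents; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


class AnalysisCache:
    """Reuse a per-file result while the file's content hash is unchanged."""

    def __init__(self, analyze: Callable[[str], dict]) -> None:
        self._analyze = analyze
        self._entries: Dict[str, Tuple[str, dict]] = {}
        self.misses = 0  # number of times the analyzer actually ran

    def get(self, file_path: str, source: str, eager: bool = False) -> dict:
        digest = content_hash(source)
        cached = self._entries.get(file_path)
        if not eager and cached is not None and cached[0] == digest:
            return cached[1]  # unchanged file: reuse the cached result
        self.misses += 1
        result = self._analyze(source)
        self._entries[file_path] = (digest, result)
        return result


# A toy "analysis" that just counts lines.
cache = AnalysisCache(lambda src: {"lines": len(src.splitlines())})
cache.get("a.py", "x = 1\n")                      # first run: analyzed
cache.get("a.py", "x = 1\n")                      # unchanged: reused
cache.get("a.py", "x = 1\ny = 2\n")               # edited: re-analyzed
cache.get("a.py", "x = 1\ny = 2\n", eager=True)   # eager: forced rebuild
print(cache.misses)
```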

  • One schema: The same PyApplication schema is used across the CLI, SDK, and consuming code. All Py* types are Pydantic models with JSON/msgpack serialization.
  • Semantic over syntactic: Jedi resolves symbols and types, so queries operate on the resolved program rather than raw tokens.
  • Optional CodeQL: Jedi alone resolves approximately 80-90% of call edges. CodeQL augments dynamic and RPC calls at the cost of additional analysis time. Enable it when those edges are required.
  • Queryable interface: Reachability is a networkx query, and callers and callees are exposed through API methods over the analyzed project.
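
For readers working directly from the call_graph edge list rather than the networkx graph, reachability and caller lookup reduce to a breadth-first search over (source, target) pairs. The sketch below uses only the standard library; build_adjacency, is_reachable, and callers_of are hypothetical helpers, and the edges are invented sample data.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Tuple[Adjacency, Adjacency]:
    """Index edges by caller (successors) and by callee (predecessors)."""
    successors: Adjacency = defaultdict(set)
    predecessors: Adjacency = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    return successors, predecessors


def is_reachable(successors: Adjacency, start: str, goal: str) -> bool:
    """Breadth-first search from start; True if goal can be reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def callers_of(predecessors: Adjacency, target: str) -> List[str]:
    """Direct callers of a signature, sorted for stable output."""
    return sorted(predecessors.get(target, ()))


# Invented (source, target) pairs standing in for PyCallEdge records.
edges = [
    ("my_pkg.main", "my_pkg.handler"),
    ("my_pkg.cli", "my_pkg.handler"),
    ("my_pkg.handler", "my_pkg.unsafe_sink"),
]
successors, predecessors = build_adjacency(edges)
print(is_reachable(successors, "my_pkg.main", "my_pkg.unsafe_sink"))
print(callers_of(predecessors, "my_pkg.handler"))
```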