codeanalyzer-python
The Python analysis backend provides PythonAnalysis in CLDK. It runs Jedi for semantic analysis, optional CodeQL for call-graph augmentation, and Tree-sitter for syntax parsing. The output is a canonical PyApplication schema that ships with the backend and is re-exported by the CLDK Python SDK.
Overview
codeanalyzer-python is a standalone static analysis library (published to PyPI as codeanalyzer-python) that the SDK manages in a virtualenv. It converts a Python project into a queryable symbol table and call graph.
The backend produces:
- Symbol table: All modules, classes, methods, functions, imports, parameters, and docstrings in typed PyModule objects.
- Call graph: Inter- and intra-procedural call edges (PyCallEdge), with a Jedi-based baseline and optional CodeQL augmentation that merges resolved callees from both engines.
- Class hierarchies: Base classes, inheritance chains, and method overrides.
- Entrypoints: Framework-detected entry points (Flask routes, Celery tasks, Django views, gRPC servicers, etc.) linked to their callables.
Architecture
```mermaid
flowchart LR
    A["Input: project_path"] --> B["Virtualenv<br/>+ deps"]
    B --> C["Jedi: symbol table<br/>+ Jedi call edges"]
    B --> D["CodeQL<br/>(optional)"]
    D --> E["Merge edges"]
    C --> E
    E --> F["Symbol table<br/>Call graph"]
    F --> G["PyApplication<br/>+ PyModule, PyClass,<br/>PyCallable, PyCallEdge"]
    G --> H["JSON / Msgpack"]
    G --> I["CLDK SDK<br/>cldk.models.python"]
```
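The "Merge edges" step in the diagram can be pictured as a union keyed on (source, target) that accumulates provenance. The following is a minimal sketch under that assumption: Edge and merge_edges are hypothetical names rather than the backend's API, and because the treatment of weight during the real merge is not described here, the sketch simply keeps the first-seen weight.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Edge:
    """Hypothetical stand-in for PyCallEdge (source, target, weight, provenance)."""
    source: str
    target: str
    weight: int = 1
    provenance: List[str] = field(default_factory=list)


def merge_edges(*edge_sets: Iterable[Edge]) -> List[Edge]:
    """Union edges by (source, target) and combine their provenance lists."""
    merged: Dict[Tuple[str, str], Edge] = {}
    for edges in edge_sets:
        for edge in edges:
            key = (edge.source, edge.target)
            if key not in merged:
                # Copy so the caller's edge objects are never mutated.
                merged[key] = Edge(edge.source, edge.target, edge.weight, list(edge.provenance))
                continue
            existing = merged[key]
            for engine in edge.provenance:
                if engine not in existing.provenance:
                    existing.provenance.append(engine)
    return list(merged.values())


jedi_edges = [Edge("pkg.main", "pkg.load", provenance=["jedi"])]
codeql_edges = [
    Edge("pkg.main", "pkg.load", provenance=["codeql"]),
    Edge("pkg.main", "pkg.plugin.run", provenance=["codeql"]),
]
merged = merge_edges(jedi_edges, codeql_edges)
for edge in merged:
    print(edge.source, "->", edge.target, edge.provenance)
```

An edge found by both engines ends up with both names in its provenance list, while an edge only CodeQL resolves is added with a single entry.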
Key modules
- codeanalyzer.core:Codeanalyzer: Orchestrates the analysis pipeline, manages virtualenv setup, caching, and invokes semantic passes.
- codeanalyzer.syntactic_analysis:SymbolTableBuilder: Parses Python source via Tree-sitter and Jedi to extract modules, classes, methods, and call sites.
- codeanalyzer.semantic_analysis.call_graph: Builds inter-procedural call graphs using Jedi’s resolution; merges CodeQL edges when enabled.
- codeanalyzer.semantic_analysis.codeql: Optional CodeQL integration for resolving dynamic calls, third-party dispatch, and RPC targets.
- codeanalyzer.schema:py_schema: Defines all Pydantic models: PyModule, PyClass, PyCallable, PyCallEdge, PyApplication, and others.
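To give a feel for what a symbol-table pass extracts, here is a small sketch that uses the standard library ast module as a stand-in for Tree-sitter and Jedi. It is not how SymbolTableBuilder is implemented: it only records unresolved call names per function, whereas the backend resolves callees semantically. The helper names are hypothetical.

```python
import ast
from typing import Dict, List

SOURCE = '''
class Greeter:
    def greet(self, name):
        return format_name(name).upper()

def format_name(name):
    return name.strip()
'''


def collect_call_names(func: ast.AST) -> List[str]:
    """Return the sorted, de-duplicated names of calls made inside one function."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):         # plain call: f(x)
                names.add(target.id)
            elif isinstance(target, ast.Attribute):  # attribute call: obj.f(x)
                names.add(target.attr)
    return sorted(names)


def build_symbol_table(source: str) -> Dict[str, List[str]]:
    """Map each top-level function or method to the call names it contains."""
    table: Dict[str, List[str]] = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            table[node.name] = collect_call_names(node)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    table[f"{node.name}.{item.name}"] = collect_call_names(item)
    return table


table = build_symbol_table(SOURCE)
print(table)
```

The gap between these bare names and fully qualified callee signatures is what the semantic passes are there to close.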
Schema: the Py* models
All models are defined in /codeanalyzer/schema/py_schema.py and re-exported in the CLDK SDK at cldk.models.python:
Core application model
PyApplication: The root output of every analysis run.

```python
class PyApplication(BaseModel):
    symbol_table: Dict[str, PyModule]  # file_path → PyModule
    call_graph: List[PyCallEdge] = []  # edges with source → target signature
```

Symbol table
PyModule: Represents one .py file.

```python
class PyModule(BaseModel):
    file_path: str
    module_name: str
    imports: List[PyImport] = []
    comments: List[PyComment] = []
    classes: Dict[str, PyClass] = {}        # class_name → PyClass
    functions: Dict[str, PyCallable] = {}   # function_name → PyCallable
    variables: List[PyVariableDeclaration] = []
    content_hash: Optional[str] = None      # for cache invalidation
    last_modified: Optional[float] = None
    file_size: Optional[int] = None
```

PyClass: A class definition.
```python
class PyClass(BaseModel):
    name: str
    signature: str                                # e.g., "my_pkg.module.ClassName"
    comments: List[PyComment] = []
    code: str | None = None
    base_classes: List[str] = []                  # parent signatures
    methods: Dict[str, PyCallable] = {}           # method_name → PyCallable
    attributes: Dict[str, PyClassAttribute] = {}  # attr_name → attribute
    inner_classes: Dict[str, "PyClass"] = {}
    start_line: int
    end_line: int
```

PyCallable: A function or method.
```python
class PyCallable(BaseModel):
    name: str
    path: str
    signature: str                            # e.g., "my_pkg.module.ClassName.method_name"
    comments: List[PyComment] = []
    decorators: List[str] = []
    parameters: List[PyCallableParameter] = []
    return_type: Optional[str] = None
    code: str | None = None                   # source code of the callable
    start_line: int
    end_line: int
    code_start_line: int
    accessed_symbols: List[PySymbol] = []     # local variable / import refs
    call_sites: List[PyCallsite] = []         # calls made within this callable
    inner_callables: Dict[str, "PyCallable"] = {}
    inner_classes: Dict[str, "PyClass"] = {}
    local_variables: List[PyVariableDeclaration] = []
    cyclomatic_complexity: int = 0
```

PyCallsite: A single call inside a callable.
```python
class PyCallsite(BaseModel):
    method_name: str
    receiver_expr: Optional[str] = None     # "obj" in obj.method()
    receiver_type: Optional[str] = None
    argument_types: List[str] = []
    return_type: Optional[str] = None
    callee_signature: Optional[str] = None  # resolved target (if found)
    is_constructor_call: bool = False
    start_line: int
    start_column: int
    end_line: int
    end_column: int
```

Call graph
PyCallEdge: A directed edge in the call graph.

```python
class PyCallEdge(BaseModel):
    source: str  # caller PyCallable.signature
    target: str  # callee PyCallable.signature
    type: Literal["CALL_DEP"] = "CALL_DEP"
    weight: int = 1
    provenance: List[Literal["jedi", "codeql", "joern"]] = []
```

Supporting models
PyImport: An import statement.

```python
class PyImport(BaseModel):
    module: str  # "os", "flask", "my_pkg.utils"
    name: str    # "path", "Flask", "helper"
    alias: Optional[str] = None
    start_line: int
    end_line: int
    start_column: int
    end_column: int
```

PyComment: A comment or docstring.
```python
class PyComment(BaseModel):
    content: str
    start_line: int
    end_line: int
    start_column: int
    end_column: int
    is_docstring: bool = False
```

All Py* models support:

- Pydantic v1 and v2 compatibility via cldk.models.python.
- Builder pattern: PyModule.builder().module_name("x").classes({...}).build().
- Serialization: to_msgpack_bytes(), from_msgpack_bytes(), model_dump_json().
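Because the schema nests callables under modules and classes, a consumer of analysis.json usually needs a small walker. The sketch below runs over a hand-written sample shaped like the models above; it assumes the JSON keys mirror the model field names, is not real analyzer output, and omits most fields. The iter_callables helper is hypothetical and does not descend into inner_callables or inner_classes.

```python
import json
from typing import Iterator, Tuple

# Hand-written sample following the schema's field names (assumption: the
# serialized JSON uses the same keys as the Pydantic models).
SAMPLE = json.loads("""
{
  "symbol_table": {
    "my_pkg/app.py": {
      "file_path": "my_pkg/app.py",
      "module_name": "my_pkg.app",
      "functions": {
        "main": {"name": "main", "signature": "my_pkg.app.main", "cyclomatic_complexity": 2}
      },
      "classes": {
        "Service": {
          "name": "Service",
          "signature": "my_pkg.app.Service",
          "methods": {
            "run": {"name": "run", "signature": "my_pkg.app.Service.run", "cyclomatic_complexity": 5}
          }
        }
      }
    }
  },
  "call_graph": [
    {"source": "my_pkg.app.main", "target": "my_pkg.app.Service.run", "provenance": ["jedi"]}
  ]
}
""")


def iter_callables(application: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (signature, callable) for module-level functions and class methods."""
    for module in application["symbol_table"].values():
        for func in module.get("functions", {}).values():
            yield func["signature"], func
        for cls in module.get("classes", {}).values():
            for method in cls.get("methods", {}).values():
                yield method["signature"], method


signatures = [sig for sig, _ in iter_callables(SAMPLE)]
most_complex = max(
    iter_callables(SAMPLE), key=lambda item: item[1]["cyclomatic_complexity"]
)[0]
print(signatures, most_complex)
```

With the typed models from cldk.models.python the same traversal would use attribute access instead of dictionary lookups.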
CLI interface
The backend ships a command-line tool, canpy, installed by pip install codeanalyzer-python:

```shell
canpy --input /path/to/my_pkg [OPTIONS]
```

codeanalyzer is a deprecated alias kept for backwards compatibility: it prints a deprecation warning to stderr and delegates to canpy. Prefer canpy.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | PATH | Required | Project root directory to analyze |
| --output | -o | PATH | None | Save analysis.json or analysis.msgpack to this directory (stdout if None) |
| --format | -f | json \| msgpack | json | Output serialization format |
| --emit | | json \| neo4j \| schema | json | Output target: json (analysis.json), neo4j (graph.cypher, or a live Bolt push with --neo4j-uri), or schema (Neo4j schema contract; needs no input) |
| --app-name | | str | input dir name | Application name for the graph :PyApplication anchor |
| --neo4j-uri | | str | None (NEO4J_URI) | Push the graph to a live Neo4j over Bolt (incremental); omit to write graph.cypher |
| --neo4j-user | | str | neo4j (NEO4J_USERNAME) | Neo4j username |
| --neo4j-password | | str | neo4j (NEO4J_PASSWORD) | Neo4j password (prefer the env var) |
| --neo4j-database | | str | server default (NEO4J_DATABASE) | Neo4j database name |
| --codeql / --no-codeql | | bool | false | Enable CodeQL-based call-graph augmentation (experimental) |
| --ray / --no-ray | | bool | false | Enable Ray for distributed analysis |
| --eager / --lazy | | bool | lazy | Force rebuild cache (eager) or reuse cached results (lazy) |
| --cache-dir | -c | PATH | .codeanalyzer in input dir | Where to store virtualenv, CodeQL DB, analysis cache |
| --keep-cache / --clear-cache | | bool | keep | Retain cache after analysis (default) or remove it |
| --skip-tests / --include-tests | | bool | skip | Exclude or include test_*.py / *_test.py files |
| --file-name | | PATH | None | Analyze only a single file (relative to input dir) |
| -v | | count | 0 | Verbosity: -v, -vv, -vvv for debug/trace |
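As an illustration of the file-name patterns named for --skip-tests, the sketch below filters a file list with the two globs from the table. It is only an approximation under stated assumptions: the matching rules canpy actually applies (for example, whether directories such as tests/ are also excluded) are not documented here, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# The two patterns listed for --skip-tests / --include-tests.
TEST_PATTERNS = ("test_*.py", "*_test.py")


def is_test_file(path: str) -> bool:
    """True when the file name (not the directory) matches a test pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in TEST_PATTERNS)


def select_files(paths: Iterable[str], skip_tests: bool = True) -> List[str]:
    """Drop test files when skip_tests is set; otherwise keep everything."""
    return [p for p in paths if not (skip_tests and is_test_file(p))]


files = ["pkg/app.py", "pkg/test_app.py", "pkg/app_test.py", "pkg/contest.py"]
kept = select_files(files)
everything = select_files(files, skip_tests=False)
print(kept)
```

Note that pkg/contest.py is kept: the patterns require the test_ prefix or the _test suffix, not just the substring.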
Examples
Basic symbol table analysis:

```shell
canpy -i ./my_pkg
```

Outputs analysis.json to stdout (symbol table + Jedi call graph).

With CodeQL augmentation:

```shell
canpy -i ./my_pkg --codeql
```

Merges Jedi edges with CodeQL-resolved edges. CodeQL integration is experimental and may take longer.

With Ray distributed analysis:

```shell
canpy -i ./my_pkg --ray
```

Enables Ray for parallel processing across available cores.

Save to file in msgpack format:

```shell
canpy -i ./my_pkg -o ./results --format msgpack
```

Saves a compact analysis.msgpack, roughly 30-50% of the size of the JSON output.

Custom cache, eager rebuild:

```shell
canpy -i ./my_pkg --cache-dir /tmp/analysis-cache --eager
```

Rebuilds the virtualenv and analysis cache from scratch, storing them in /tmp/analysis-cache/.codeanalyzer.

Single file:

```shell
canpy -i ./my_pkg --file-name src/handlers.py
```

Analyzes only src/handlers.py.
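When these invocations are scripted, it can help to assemble the argument list programmatically. The following sketch builds a canpy argv from a handful of the options in the table; build_canpy_args is a hypothetical helper (not part of the package), and it only emits flags whose value differs from the documented default.

```python
from typing import List, Optional


def build_canpy_args(
    input_dir: str,
    output_dir: Optional[str] = None,
    fmt: str = "json",
    codeql: bool = False,
    ray: bool = False,
    eager: bool = False,
    cache_dir: Optional[str] = None,
    file_name: Optional[str] = None,
) -> List[str]:
    """Return an argv list for canpy using flags from the options table."""
    if fmt not in ("json", "msgpack"):
        raise ValueError(f"unsupported format: {fmt!r}")
    args = ["canpy", "--input", input_dir]
    if output_dir is not None:
        args += ["--output", output_dir]
    if fmt != "json":              # json is the documented default
        args += ["--format", fmt]
    if codeql:
        args.append("--codeql")
    if ray:
        args.append("--ray")
    if eager:
        args.append("--eager")
    if cache_dir is not None:
        args += ["--cache-dir", cache_dir]
    if file_name is not None:
        args += ["--file-name", file_name]
    return args


msgpack_run = build_canpy_args("./my_pkg", output_dir="./results", fmt="msgpack")
eager_run = build_canpy_args("./my_pkg", cache_dir="/tmp/analysis-cache", eager=True)
print(" ".join(msgpack_run))
```

Passing the list directly to subprocess.run avoids shell quoting issues with paths that contain spaces.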
How the SDK consumes it
A call to CLDK.python(project_path="my_pkg") in the Python SDK proceeds as follows:

- Virtualenv provisioning: CLDK detects or installs codeanalyzer-python into a managed virtualenv in the cache directory (default: <project_dir>/.codeanalyzer/venv).
- CLI invocation: The SDK constructs a canpy command with options (--codeql, --eager, --cache-dir, etc.) and runs it as a subprocess. Stdout is parsed as analysis.json.
- Schema re-export: cldk.models.python re-exports PyApplication, PyModule, PyClass, PyCallable, and other Py* types directly from codeanalyzer.schema.py_schema, ensuring a single source of truth.
- In-memory analysis object: The parsed PyApplication is passed to PythonAnalysis, which wraps it with convenience methods:
  - get_symbol_table() → Dict[str, PyModule]
  - get_classes() → Dict[str, PyClass]
  - get_call_graph() → networkx.DiGraph
  - get_callers(target_class_name, target_method_declaration) → Dict
  - get_callees(source_class_name, source_method_declaration) → Dict
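The CLI invocation step above boils down to running a subprocess and parsing its stdout as JSON. The sketch below shows that pattern only; it is not the SDK's code. So that it runs without canpy installed, a child Python process that prints a tiny analysis-shaped document stands in for the real command, and run_and_parse is a hypothetical helper.

```python
import json
import subprocess
import sys
from typing import List


def run_and_parse(argv: List[str]) -> dict:
    """Run a command, raise on a non-zero exit code, and parse stdout as JSON."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Stand-in for a canpy invocation: a child interpreter that prints JSON.
fake_payload = {
    "symbol_table": {"my_pkg/app.py": {"module_name": "my_pkg.app"}},
    "call_graph": [],
}
stand_in = [sys.executable, "-c", f"import json; print(json.dumps({fake_payload!r}))"]

application = run_and_parse(stand_in)
print(sorted(application["symbol_table"]))
```

With check=True a failing analyzer run surfaces as subprocess.CalledProcessError rather than as a JSON decoding error on empty output.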
```python
import networkx as nx

from cldk import CLDK
from cldk.analysis import AnalysisLevel
from cldk.analysis.commons.backend_config import PyCodeAnalyzerConfig

analysis = CLDK.python(
    project_path="my_pkg",
    analysis_level=AnalysisLevel.call_graph,
    backend=PyCodeAnalyzerConfig(use_codeql=True),  # optional; merges CodeQL edges
)

# Query the symbol table
modules = analysis.get_symbol_table()
classes = analysis.get_classes()

# Compute reachability
call_graph = analysis.get_call_graph()
is_reachable = nx.has_path(call_graph, "my_pkg.main", "my_pkg.unsafe_sink")

# Find callers
callers = analysis.get_callers("my_pkg.MyClass", "process")
```

A comparable run from the CLI:

```shell
canpy -i ./my_pkg --codeql --output ./results --format json
cat results/analysis.json | jq '.symbol_table | keys'
```

Choosing a backend
The backend is selected by the type of the backend= config passed to CLDK.python(...):

- In-memory codeanalyzer (default): omit backend=, or pass backend=PyCodeAnalyzerConfig(...). The Python-only call-graph knobs use_codeql=... and use_ray=... live on this config, as does cache_dir=....
- Read-only Neo4j: pass backend=Neo4jConnectionConfig(...) to query a graph populated out of band (no local analysis is run).
```python
from cldk import CLDK
from cldk.analysis.commons.backend_config import (
    PyCodeAnalyzerConfig,
    Neo4jConnectionConfig,
)

# In-memory backend with Ray + custom cache directory
analysis = CLDK.python(
    project_path="my_pkg",
    backend=PyCodeAnalyzerConfig(
        use_codeql=True,
        use_ray=True,
        cache_dir="/tmp/analysis-cache",
    ),
)

# Read-only Neo4j backend
analysis = CLDK.python(
    project_path="my_pkg",
    backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        username="neo4j",
        password="neo4j",
        database=None,
        application_name="my_pkg",
    ),
)
```

Neo4jConnectionConfig is importable from cldk.analysis.commons.backend_config (and also from cldk.analysis.python.neo4j).
CLDK.python(...) keeps the project_path, analysis_level, target_files, and eager keyword arguments. The old CLDK(language="python").analysis(...) form still works but is deprecated; prefer CLDK.python(...). The from cldk import CLDK import is unchanged.
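Selecting a backend by the type of its config object is a plain type-dispatch pattern. The sketch below illustrates it with hypothetical stand-in dataclasses (AnalyzerConfig and GraphConfig) in place of PyCodeAnalyzerConfig and Neo4jConnectionConfig; the returned labels are illustrative, and the SDK's actual dispatch logic may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AnalyzerConfig:
    """Hypothetical stand-in for PyCodeAnalyzerConfig."""
    use_codeql: bool = False
    use_ray: bool = False
    cache_dir: Optional[str] = None


@dataclass
class GraphConfig:
    """Hypothetical stand-in for Neo4jConnectionConfig."""
    uri: str
    username: str = "neo4j"
    password: str = "neo4j"
    database: Optional[str] = None
    application_name: Optional[str] = None


def select_backend(backend: Union[AnalyzerConfig, GraphConfig, None] = None) -> str:
    """Pick a backend kind from the config type; None means the default."""
    if backend is None:
        backend = AnalyzerConfig()  # omitting backend= selects the in-memory analyzer
    if isinstance(backend, AnalyzerConfig):
        return "in-memory"
    if isinstance(backend, GraphConfig):
        return "neo4j-read-only"
    raise TypeError(f"unsupported backend config: {type(backend).__name__}")


default_kind = select_backend()
graph_kind = select_backend(GraphConfig(uri="bolt://localhost:7687"))
print(default_kind, graph_kind)
```

Keeping backend-specific options on the config object means the entry point's own signature does not grow with each new backend.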
Caching and virtualenv management
- Cache location: A single language-keyed cache_dir (default: <project_dir>/.codeanalyzer); Python artifacts live under <cache_dir>/python/. Set it via backend=PyCodeAnalyzerConfig(cache_dir=...).
- Virtualenv: Auto-created under the cache directory. The backend installs dependencies from requirements.txt, pyproject.toml, setup.py, Pipfile, etc.
- Analysis cache: Indexed by file content hash; unchanged files reuse cached results.
- CodeQL database: Stored under the cache directory if CodeQL is enabled; downloaded on first use.
To force a clean rebuild, pass eager=True to CLDK.python(...) (or --eager on the CLI), or delete the cache directory.
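Content-hash caching with an eager override can be sketched in a few lines. This is an illustration of the idea rather than the backend's cache: the choice of SHA-256, the in-memory dictionary, and the AnalysisCache class are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, Tuple


def content_hash(data: bytes) -> str:
    """Hash file contents; SHA-256 is an assumption, not the documented algorithm."""
    return hashlib.sha256(data).hexdigest()


class AnalysisCache:
    """Reuse a per-file result while the file's content hash is unchanged."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, dict]] = {}
        self.misses = 0

    def get_or_analyze(
        self,
        path: str,
        data: bytes,
        analyze: Callable[[bytes], dict],
        eager: bool = False,
    ) -> dict:
        digest = content_hash(data)
        cached = self._entries.get(path)
        if not eager and cached is not None and cached[0] == digest:
            return cached[1]              # unchanged file: reuse cached result
        self.misses += 1
        result = analyze(data)
        self._entries[path] = (digest, result)
        return result


def count_lines(data: bytes) -> dict:
    return {"lines": data.count(b"\n")}


cache = AnalysisCache()
cache.get_or_analyze("a.py", b"x = 1\n", count_lines)                      # miss
cache.get_or_analyze("a.py", b"x = 1\n", count_lines)                      # hit
changed = cache.get_or_analyze("a.py", b"x = 1\ny = 2\n", count_lines)     # miss: new hash
cache.get_or_analyze("a.py", b"x = 1\ny = 2\n", count_lines, eager=True)   # forced rebuild
print(cache.misses)
```

Hashing content rather than relying on modification times keeps the cache valid across checkouts and copies that rewrite timestamps.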
Design principles
- One schema: The same PyApplication schema is used across the CLI, SDK, and consuming code. All Py* types are Pydantic models with JSON/msgpack serialization.
- Semantic over syntactic: Jedi resolves symbols and types, so queries operate on the resolved program rather than raw tokens.
- Optional CodeQL: Jedi alone resolves approximately 80-90% of call edges. CodeQL augments dynamic and RPC calls at the cost of additional analysis time. Enable it when those edges are required.
- Queryable interface: Reachability is a networkx query, and callers and callees are exposed through API methods over the analyzed project.
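For intuition, a reachability query amounts to a graph search over the source and target signatures of the call edges. The dependency-free sketch below uses breadth-first search; the SDK itself hands back a networkx.DiGraph, and the edge tuples and helper names here are illustrative only.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group callee signatures under each caller signature."""
    adjacency: Dict[str, List[str]] = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return adjacency


def is_reachable(adjacency: Dict[str, List[str]], start: str, goal: str) -> bool:
    """Breadth-first search from start; True if goal can be reached via calls."""
    if start == goal:
        return True
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in adjacency.get(current, []):
            if callee == goal:
                return True
            if callee not in seen:   # guard against cycles (recursion)
                seen.add(callee)
                queue.append(callee)
    return False


edges = [
    ("my_pkg.cli", "my_pkg.main"),
    ("my_pkg.main", "my_pkg.load"),
    ("my_pkg.load", "my_pkg.unsafe_sink"),
    ("my_pkg.load", "my_pkg.main"),  # cycle
]
graph = build_adjacency(edges)
forward = is_reachable(graph, "my_pkg.main", "my_pkg.unsafe_sink")
backward = is_reachable(graph, "my_pkg.unsafe_sink", "my_pkg.main")
print(forward, backward)
```

The same adjacency map, built with the edge direction reversed, gives a direct lookup of callers for a signature.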