Summary
--emit neo4j silently drops call-graph edges whose target is a bare module name that is also imported in the project (e.g. a call to os / re / json when import os is present). The in-memory PyApplication.call_graph contains these edges; the emitted Neo4j graph does not.
Environment
- codeanalyzer-python 0.2.0
Impact
Measured while projecting a real 57-module project to Neo4j: 34 of 919 call edges (~3.7%) were missing from the emitted graph, all with targets like os, re, json, subprocess, shlex, logging, codeanalyzer_typescript. Any consumer reading the call graph from Neo4j (e.g. the CLDK SDK's read-only PyNeo4jBackend) gets a call graph that diverges from the JSON / in-memory backend.
Root cause
codeanalyzer/neo4j/rows.py RowBuilder tracks every emitted node value in a single self._keys set, regardless of label or key property. codeanalyzer/neo4j/project.py project() projects module imports before the call graph, so import os creates a :PyPackage {name: "os"} node and adds "os" to _keys. Then for a call edge whose target signature is "os":
# project.py _call_endpoint()
if b.has_key("os"): # True — but it matched the :PyPackage *name*, not a :PySymbol *signature*
return _sym("os") # NodeRef("PySymbol", "signature", "os") → a node that was never created
The edge endpoint becomes a :PySymbol {signature: "os"} that does not exist (only :PyPackage {name: "os"} does). The bolt writer's _upsert_edges then runs ... MATCH (b:PySymbol {signature: "os"}) MERGE (a)-[r:PY_CALLS]->(b) ..., which matches nothing, so the edge is silently skipped. (The cypher-snapshot writer has the same blind spot.)
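The collision can be modelled in isolation. The sketch below is a simplified stand-in, not the actual RowBuilder code: FlatKeyTracker, add_node, and the :PyExternal fallback branch in call_endpoint are hypothetical names and behavior chosen for illustration. Only the single value-keyed set and the has_key check mirror the description above.

```python
from typing import NamedTuple, Set


class NodeRef(NamedTuple):
    label: str
    key_prop: str
    value: str


class FlatKeyTracker:
    """Hypothetical stand-in for a builder that tracks bare values only."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()
        self.nodes: Set[NodeRef] = set()

    def add_node(self, ref: NodeRef) -> None:
        self.nodes.add(ref)
        # Label and key property are discarded here; only the value is kept.
        self._keys.add(ref.value)

    def has_key(self, value: str) -> bool:
        return value in self._keys


def call_endpoint(b: FlatKeyTracker, target: str) -> NodeRef:
    # Assumed shape of the check: "seen before" is treated as "is a symbol".
    if b.has_key(target):
        return NodeRef("PySymbol", "signature", target)
    return NodeRef("PyExternal", "signature", target)


b = FlatKeyTracker()
b.add_node(NodeRef("PyPackage", "name", "os"))  # emitted for `import os`

endpoint = call_endpoint(b, "os")
# The endpoint refers to a :PySymbol node that was never emitted.
edge_is_dangling = endpoint not in b.nodes
```

In this model the endpoint resolves to a :PySymbol reference even though only a :PyPackage node was ever added, which is the condition under which a MATCH-based edge upsert finds nothing to connect.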
Repro
# project foo.py: `import os` + a function calling `os.getcwd()`
from codeanalyzer.neo4j.emit import emit_neo4j
# app.call_graph contains (foo.fn -> "os")
emit_neo4j(app, opts_with_neo4j_uri)
MATCH ()-[r:PY_CALLS]->(:PySymbol {signature:"os"}) RETURN count(r); // 0
MATCH (n {signature:"os"}) RETURN n; // none — only :PyPackage {name:"os"}
Suggested fix
Make node-identity tracking key-namespace-aware so a :PyPackage name can't shadow a :PySymbol signature. Either:
- key RowBuilder._keys / has_key by (merge_label, value) (or (key_prop, value)) instead of bare value; or
- in _call_endpoint, gate specifically on symbol existence (a value seen as :PySymbol/:PyExternal), materializing the :PyExternal ghost when only a same-named :PyPackage exists.
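A minimal sketch of the first option, assuming a tracker keyed by (merge_label, value). The class name NamespacedKeyTracker and the add_key method are hypothetical; the real RowBuilder interface may differ.

```python
from typing import Set, Tuple


class NamespacedKeyTracker:
    """Hypothetical tracker that keys node identity by (merge_label, value)."""

    def __init__(self) -> None:
        self._keys: Set[Tuple[str, str]] = set()

    def add_key(self, merge_label: str, value: str) -> None:
        self._keys.add((merge_label, value))

    def has_key(self, merge_label: str, value: str) -> bool:
        return (merge_label, value) in self._keys


tracker = NamespacedKeyTracker()
tracker.add_key("PyPackage", "os")  # emitted for `import os`

package_seen = tracker.has_key("PyPackage", "os")
symbol_seen = tracker.has_key("PySymbol", "os")
```

With the label in the key, a lookup for a :PySymbol named "os" no longer succeeds just because a :PyPackage with the same name was emitted earlier, so the caller can fall through to the external-symbol path.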
Chosen fix: first-class external_symbols (parity with codeanalyzer-typescript)
Rather than only patching the projection's identity tracking, adopt the model
the TypeScript backend already uses: make external call targets first-class in
the IR. This fixes the dropped edges and enriches every consumer (not just
Neo4j) in one change.
- Schema. Add PyExternalSymbol { name, module } and a top-level PyApplication.external_symbols: Dict[str, PyExternalSymbol] keyed by signature (mirrors TS external_symbols: Record<string, TSExternalSymbol>).
- Analyzer. Populate external_symbols once, where the call graph is assembled: every call-graph endpoint that is not a declared symbol is an external, recorded with its name and best-effort module. analysis.json now carries external info that today is only a bare target string.
- Neo4j. :PyExternal gains a module property; the projection emits externals authoritatively from external_symbols instead of the "not in symbol_table" heuristic, so a :PyPackage name can never shadow a call target. (Additive Neo4j schema change -> SCHEMA_VERSION minor bump.)
- Node-identity tracking is also made (merge_label, value)-keyed so deferred PY_EXTENDS / PY_RESOLVES_TO edges can't be shadowed either.
Net effect: the dropped-edge bug is fixed structurally, and external symbols are
represented consistently across the JSON and Neo4j backends.
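The analyzer step could look roughly like the sketch below. It is an illustration under stated assumptions, not the shipped implementation: the function name collect_external_symbols, the (source, target) tuple shape for call edges, and the "split on the last dot" module heuristic are all hypothetical. Only the PyExternalSymbol { name, module } shape and the signature-keyed dict come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class PyExternalSymbol:
    name: str
    module: str


def collect_external_symbols(
    call_edges: Iterable[Tuple[str, str]],
    declared: Set[str],
) -> Dict[str, PyExternalSymbol]:
    """Record every call-graph endpoint that is not a declared symbol."""
    externals: Dict[str, PyExternalSymbol] = {}
    for source, target in call_edges:
        for signature in (source, target):
            if signature in declared or signature in externals:
                continue
            # Assumed best-effort heuristic: everything before the last dot
            # is the module; a bare name such as "os" is its own module.
            module, _, name = signature.rpartition(".")
            externals[signature] = PyExternalSymbol(
                name=name, module=module or signature
            )
    return externals


edges = [("foo.fn", "os"), ("foo.fn", "json.dumps")]
externals = collect_external_symbols(edges, declared={"foo.fn"})
```

Because the externals are decided from the declared-symbol set alone, whether a same-named :PyPackage node exists has no bearing on how a call target is classified.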