
Add first-class external_symbols (codeanalyzer-typescript parity); fixes dropped call edges to imported module names #44

Description

@rahlk

Summary

--emit neo4j silently drops call-graph edges whose target is a bare module name that is also imported in the project (e.g. a call to os / re / json when import os is present). The in-memory PyApplication.call_graph contains these edges; the emitted Neo4j graph does not.

Environment

  • codeanalyzer-python 0.2.0

Impact

Measured by projecting a real 57-module project to Neo4j: 34 of 919 call edges (~3.7%) were missing from the emitted graph, all with bare module-name targets such as os, re, json, subprocess, shlex, logging, and codeanalyzer_typescript. Any consumer reading the call graph from Neo4j (e.g. the CLDK SDK's read-only PyNeo4jBackend) gets a call graph that diverges from the JSON / in-memory backend.

Root cause

codeanalyzer/neo4j/rows.py RowBuilder tracks every emitted node value in a single self._keys set, regardless of label or key property. codeanalyzer/neo4j/project.py project() projects module imports before the call graph, so import os creates a :PyPackage {name: "os"} node and adds "os" to _keys. Then for a call edge whose target signature is "os":

# project.py _call_endpoint()
if b.has_key("os"):        # True — but it matched the :PyPackage *name*, not a :PySymbol *signature*
    return _sym("os")      # NodeRef("PySymbol", "signature", "os")  → a node that was never created

The edge endpoint becomes a :PySymbol {signature: "os"} that does not exist (only :PyPackage {name: "os"} does). The bolt writer's _upsert_edges then runs ... MATCH (b:PySymbol {signature: "os"}) MERGE (a)-[r:PY_CALLS]->(b) ..., which matches nothing, so the edge is silently skipped. (The cypher-snapshot writer has the same blind spot.)
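
The collision can be reproduced in isolation with a toy model. The sketch below is not the real RowBuilder: FlatRowBuilder, its add_node method, the nodes set, and the :PyExternal fallback branch (including its key property) are assumptions made for illustration. Only the single bare-value key set and the has_key check mirror what is described above.

```python
from typing import NamedTuple, Set, Tuple


class NodeRef(NamedTuple):
    label: str
    key_prop: str
    value: str


class FlatRowBuilder:
    """Toy stand-in for RowBuilder (hypothetical): one set of bare key values."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()
        self.nodes: Set[Tuple[str, str]] = set()  # (label, value) pairs actually emitted

    def add_node(self, ref: NodeRef) -> None:
        self.nodes.add((ref.label, ref.value))
        self._keys.add(ref.value)  # label and key property are discarded here

    def has_key(self, value: str) -> bool:
        return value in self._keys


def call_endpoint(b: FlatRowBuilder, target: str) -> NodeRef:
    # Mirrors the check quoted above; the fallback branch is an assumption.
    if b.has_key(target):
        return NodeRef("PySymbol", "signature", target)
    return NodeRef("PyExternal", "signature", target)


b = FlatRowBuilder()
b.add_node(NodeRef("PyPackage", "name", "os"))  # what `import os` projects

endpoint = call_endpoint(b, "os")
# The endpoint refers to a :PySymbol node that was never emitted.
dangling = (endpoint.label, endpoint.value) not in b.nodes
```

Here endpoint resolves to a :PySymbol reference while only the :PyPackage node exists, which is the condition under which the MATCH in _upsert_edges finds nothing.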

Repro

# Project foo.py, which contains `import os` and a function calling `os.getcwd()`
from codeanalyzer.neo4j.emit import emit_neo4j
# app.call_graph contains  (foo.fn -> "os")
emit_neo4j(app, opts_with_neo4j_uri)

// Then query the emitted graph (Cypher):
MATCH ()-[r:PY_CALLS]->(:PySymbol {signature:"os"}) RETURN count(r);   // 0
MATCH (n {signature:"os"}) RETURN n;                                   // none; only :PyPackage {name:"os"} exists

Suggested fix

Make node-identity tracking key-namespace-aware so a :PyPackage name can't shadow a :PySymbol signature. Either:

  • key RowBuilder._keys / has_key by (merge_label, value) (or (key_prop, value)) instead of bare value; or
  • in _call_endpoint, gate specifically on symbol existence (a value seen as :PySymbol/:PyExternal), materializing the :PyExternal ghost when only a same-named :PyPackage exists.
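
A minimal sketch of how the two options could combine, assuming a simplified builder. NamespacedRowBuilder and the two-argument add_node / has_key signatures are hypothetical and are not the existing codeanalyzer API.

```python
from typing import Set, Tuple


class NamespacedRowBuilder:
    """Hypothetical builder that tracks identity by (merge_label, value)."""

    def __init__(self) -> None:
        self._keys: Set[Tuple[str, str]] = set()

    def add_node(self, label: str, value: str) -> None:
        self._keys.add((label, value))

    def has_key(self, label: str, value: str) -> bool:
        return (label, value) in self._keys


def call_endpoint(b: NamespacedRowBuilder, target: str) -> Tuple[str, str]:
    # Gate on symbol existence only; a same-named :PyPackage no longer matches.
    if b.has_key("PySymbol", target):
        return ("PySymbol", target)
    # Materialize the :PyExternal ghost so the edge endpoint always exists.
    if not b.has_key("PyExternal", target):
        b.add_node("PyExternal", target)
    return ("PyExternal", target)


b = NamespacedRowBuilder()
b.add_node("PyPackage", "os")      # from `import os`
b.add_node("PySymbol", "foo.fn")   # a declared function

os_endpoint = call_endpoint(b, "os")
fn_endpoint = call_endpoint(b, "foo.fn")
```

With the label in the key, the "os" package and an "os" call target occupy separate namespaces, so the call to os resolves to a :PyExternal node that is guaranteed to have been created.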

Chosen fix: first-class external_symbols (parity with codeanalyzer-typescript)

Rather than only patching the projection's identity tracking, adopt the model
the TypeScript backend already uses: make external call targets first-class in
the IR. This fixes the dropped edges and enriches every consumer (not just
Neo4j) in one change.

  • Schema. Add PyExternalSymbol { name, module } and a top-level
    PyApplication.external_symbols: Dict[str, PyExternalSymbol] keyed by
    signature (mirrors TS external_symbols: Record<string, TSExternalSymbol>).
  • Analyzer. Populate external_symbols once, where the call graph is
    assembled: every call-graph endpoint that is not a declared symbol is an
    external, recorded with its name and best-effort module. analysis.json
    now carries external info that today is only a bare target string.
  • Neo4j. :PyExternal gains a module property; the projection emits
    externals authoritatively from external_symbols instead of the
    "not in symbol_table" heuristic, so a :PyPackage name can never shadow a
    call target. (Additive Neo4j schema change -> SCHEMA_VERSION minor bump.)
  • Node-identity tracking is also made (merge_label, value)-keyed so deferred
    PY_EXTENDS / PY_RESOLVES_TO edges can't be shadowed either.
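
The schema and analyzer bullets above could look roughly like the following sketch. Only the PyExternalSymbol name and its name / module fields come from this proposal. The Optional type on module, the collect_external_symbols and endpoint_label helpers, and the dotted-prefix heuristic for the best-effort module are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class PyExternalSymbol:
    name: str
    module: Optional[str]  # best-effort; None when it cannot be derived (assumption)


def collect_external_symbols(
    edges: Iterable[Tuple[str, str]],
    declared: Set[str],
) -> Dict[str, PyExternalSymbol]:
    """Record every call-graph endpoint that is not a declared symbol."""
    externals: Dict[str, PyExternalSymbol] = {}
    for source, target in edges:
        for sig in (source, target):
            if sig in declared or sig in externals:
                continue
            # Hypothetical heuristic: treat the dotted prefix as the module.
            module, _, name = sig.rpartition(".")
            externals[sig] = PyExternalSymbol(name=name, module=module or None)
    return externals


def endpoint_label(
    sig: str, declared: Set[str], externals: Dict[str, PyExternalSymbol]
) -> str:
    """Pick the node label from the IR itself, never from previously seen key values."""
    if sig in declared:
        return "PySymbol"
    if sig in externals:
        return "PyExternal"
    raise KeyError(sig)


declared = {"foo.fn", "foo.helper"}
edges = [("foo.fn", "os"), ("foo.fn", "foo.helper"), ("foo.helper", "json.dumps")]
external_symbols = collect_external_symbols(edges, declared)
```

Because the label is decided from declared symbols and external_symbols alone, an imported package named "os" has no way to influence how a call target named "os" is projected.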

Net effect: the dropped-edge bug is fixed structurally, and external symbols are
represented consistently across the JSON and Neo4j backends.
