Skip to content

feat: In-memory rag#272

Draft
ilyabo wants to merge 4 commits intosqlrooms:mainfrom
stprnvsh:inmemory-rag
Draft

feat: In-memory rag#272
ilyabo wants to merge 4 commits intosqlrooms:mainfrom
stprnvsh:inmemory-rag

Conversation

@ilyabo
Copy link
Collaborator

@ilyabo ilyabo commented Dec 20, 2025

No description provided.

@ilyabo
Copy link
Collaborator Author

ilyabo commented Jan 15, 2026

@greptileai

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Summary

Adds in-memory RAG capabilities to @sqlrooms/ai-rag with browser-side document preparation including PDF parsing, markdown-aware chunking, and embedding generation

Key Changes:

  • New prepare module with utilities for chunking (markdown-aware and size-based), database schema initialization, metadata tracking, and PDF text extraction
  • RagSlice now supports in-memory DuckDB databases for user-uploaded documents alongside existing file-based databases
  • Example app updated with document upload UI and RAG search dialog
  • Added pdfjs-dist dependency for client-side PDF processing

Critical Issues:

  • SQL injection vulnerabilities in multiple locations where databaseName, embedding arrays, and databaseFilePathOrUrl are interpolated directly into SQL queries without proper sanitization or parameterization
  • CDN-based worker loading for PDF.js creates external dependency risk

Confidence Score: 1/5

  • This PR has critical security vulnerabilities that must be addressed before merging
  • Multiple SQL injection vulnerabilities exist in database operations where user-controlled or configuration parameters (databaseName, embedding arrays, file paths) are directly interpolated into SQL queries without sanitization. While the functionality is well-designed, these security issues pose significant risk.
  • Pay close attention to packages/ai-rag/src/prepare/database.ts, packages/ai-rag/src/RagSlice.ts, and packages/ai-rag/src/prepare/metadata.ts - all contain SQL injection vulnerabilities that must be fixed

Important Files Changed

Filename Overview
packages/ai-rag/src/prepare/database.ts Added database operations with SQL injection vulnerabilities in databaseName and embedding parameters
packages/ai-rag/src/RagSlice.ts Major refactoring with in-memory RAG support, but contains critical SQL injection vulnerabilities in database operations
packages/ai-rag/src/prepare/metadata.ts Metadata creation and storage utilities, contains SQL injection vulnerability in databaseName parameter
packages/ai-rag/src/prepare/pdf.ts PDF text extraction using pdfjs-dist with CDN worker source

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as UI Component
    participant Store as RagSlice Store
    participant Prepare as Prepare Module
    participant DB as DuckDB (In-Memory)
    participant Embedding as Embedding Provider

    Note over Store,DB: Initialization Flow
    User->>UI: Upload PDF/Markdown/Text
    UI->>Store: rag.initialize()
    Store->>DB: ATTACH DATABASE ':memory:' AS user_docs
    Store->>Prepare: initializeRagSchema(connector, 'user_docs', 1536)
    Prepare->>DB: CREATE TABLE documents
    Prepare->>DB: CREATE TABLE source_documents
    Prepare->>DB: CREATE TABLE embedding_metadata
    DB-->>Store: Schema initialized

    Note over Store,Embedding: Document Processing Flow
    UI->>Store: rag.addPdfDocument(file) or rag.addDocument(doc)
    Store->>Prepare: extractTextFromPDF(file) [if PDF]
    Prepare-->>Store: extracted text
    Store->>Prepare: chunkMarkdown(text) or chunkBySize(text)
    Prepare-->>Store: chunks[]
    Store->>Prepare: validateAndSplitChunks(chunks)
    Prepare-->>Store: validated chunks[]
    
    loop For each chunk
        Store->>Embedding: embeddingProvider(chunk.text)
        Embedding-->>Store: embedding vector (number[])
        Store->>Prepare: insertDocument(connector, dbName, {nodeId, text, embedding, metadata})
        Prepare->>DB: INSERT INTO documents (node_id, text, embedding, metadata_)
        DB-->>Store: chunk stored
    end
    
    Store->>Prepare: insertSourceDocument(connector, dbName, {docId, text, fileName})
    Prepare->>DB: INSERT INTO source_documents
    Store->>Prepare: updateMetadataStats(connector, dbName, chunks, [document])
    Prepare->>DB: UPDATE embedding_metadata
    Store-->>UI: nodeIds[] (success)

    Note over Store,Embedding: Query Flow
    User->>UI: Search query
    UI->>Store: rag.queryByText(query, {topK: 5})
    Store->>Embedding: embeddingProvider(query)
    Embedding-->>Store: query embedding vector
    Store->>DB: SELECT with array_cosine_similarity(embedding, query_embedding)
    DB-->>Store: top K results
    Store-->>UI: EmbeddingResult[] with scores
    UI-->>User: Display search results
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

embeddingDimensions: number,
): Promise<void> {
await connector.query(`
CREATE TABLE IF NOT EXISTS ${databaseName}.documents (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: databaseName parameter is directly interpolated into SQL without sanitization, creating SQL injection risk

Suggested change
CREATE TABLE IF NOT EXISTS ${databaseName}.documents (
CREATE TABLE IF NOT EXISTS "${databaseName}".documents (
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/prepare/database.ts
Line: 24:24

Comment:
**logic:** `databaseName` parameter is directly interpolated into SQL without sanitization, creating SQL injection risk

```suggestion
    CREATE TABLE IF NOT EXISTS "${databaseName}".documents (
```

How can I resolve this? If you propose a fix, please make it concise.

databaseName: string,
doc: InsertDocumentParams,
): Promise<void> {
const embeddingLiteral = `[${doc.embedding.join(', ')}]`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: embedding array values are joined directly without validation, allowing SQL injection if non-numeric values are provided

Suggested change
const embeddingLiteral = `[${doc.embedding.join(', ')}]`;
const embeddingLiteral = `[${doc.embedding.map(n => {
if (typeof n !== 'number' || !isFinite(n)) throw new Error('Invalid embedding value');
return n;
}).join(', ')}]`;
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/prepare/database.ts
Line: 101:101

Comment:
**logic:** `embedding` array values are joined directly without validation, allowing SQL injection if non-numeric values are provided

```suggestion
  const embeddingLiteral = `[${doc.embedding.map(n => {
    if (typeof n !== 'number' || !isFinite(n)) throw new Error('Invalid embedding value');
    return n;
  }).join(', ')}]`;
```

How can I resolve this? If you propose a fix, please make it concise.

if (!databaseFilePathOrUrl) {
// Create in-memory attached database
await connector.query(
`ATTACH DATABASE ':memory:' AS ${databaseName}`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: databaseName is directly interpolated into SQL query without sanitization, creating SQL injection vulnerability

Suggested change
`ATTACH DATABASE ':memory:' AS ${databaseName}`,
`ATTACH DATABASE ':memory:' AS "${databaseName}"`,
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/RagSlice.ts
Line: 262:262

Comment:
**logic:** `databaseName` is directly interpolated into SQL query without sanitization, creating SQL injection vulnerability

```suggestion
                  `ATTACH DATABASE ':memory:' AS "${databaseName}"`,
```

How can I resolve this? If you propose a fix, please make it concise.


// ATTACH DATABASE 'path/to/file.duckdb' AS database_name (READ_ONLY)
await connector.query(
`ATTACH DATABASE '${databaseFilePathOrUrl}' AS ${databaseName} (READ_ONLY)`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: databaseFilePathOrUrl and databaseName are directly interpolated without sanitization, creating SQL injection risk

Suggested change
`ATTACH DATABASE '${databaseFilePathOrUrl}' AS ${databaseName} (READ_ONLY)`,
`ATTACH DATABASE '${databaseFilePathOrUrl.replace(/'/g, "''")}' AS "${databaseName}" (READ_ONLY)`,
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/RagSlice.ts
Line: 280:280

Comment:
**logic:** `databaseFilePathOrUrl` and `databaseName` are directly interpolated without sanitization, creating SQL injection risk

```suggestion
                `ATTACH DATABASE '${databaseFilePathOrUrl.replace(/'/g, "''")}' AS "${databaseName}" (READ_ONLY)`,
```

How can I resolve this? If you propose a fix, please make it concise.

@@ -302,214 +368,258 @@ export function createRagSlice({
const embeddingDim = queryEmbedding.length;
const embeddingLiteral = `[${queryEmbedding.join(', ')}]`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: queryEmbedding array is joined without validation, allowing SQL injection if non-numeric values are provided

Suggested change
const embeddingLiteral = `[${queryEmbedding.join(', ')}]`;
const embeddingLiteral = `[${queryEmbedding.map(n => {
if (typeof n !== 'number' || !isFinite(n)) throw new Error('Invalid embedding value');
return n;
}).join(', ')}]`;
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/RagSlice.ts
Line: 369:369

Comment:
**logic:** `queryEmbedding` array is joined without validation, allowing SQL injection if non-numeric values are provided

```suggestion
          const embeddingLiteral = `[${queryEmbedding.map(n => {
            if (typeof n !== 'number' || !isFinite(n)) throw new Error('Invalid embedding value');
            return n;
          }).join(', ')}]`;
```

How can I resolve this? If you propose a fix, please make it concise.

import * as pdfjsLib from 'pdfjs-dist';

// Set worker source - uses the same version as the library
pdfjsLib.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjsLib.version}/pdf.worker.min.mjs`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Loading worker from external CDN creates security and reliability risks (CDN unavailability, tampering)

Consider bundling the worker locally or using a versioned URL with integrity checks

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/prepare/pdf.ts
Line: 9:9

Comment:
**style:** Loading worker from external CDN creates security and reliability risks (CDN unavailability, tampering)

Consider bundling the worker locally or using a versioned URL with integrity checks

How can I resolve this? If you propose a fix, please make it concise.


for (const [key, value] of Object.entries(flatMetadata)) {
await connector.query(`
INSERT INTO ${databaseName}.embedding_metadata (key, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: databaseName is directly interpolated without sanitization, creating SQL injection vulnerability

Suggested change
INSERT INTO ${databaseName}.embedding_metadata (key, value)
INSERT INTO "${databaseName}".embedding_metadata (key, value)
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/ai-rag/src/prepare/metadata.ts
Line: 168:168

Comment:
**logic:** `databaseName` is directly interpolated without sanitization, creating SQL injection vulnerability

```suggestion
      INSERT INTO "${databaseName}".embedding_metadata (key, value)
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants