Update dependencies, roadmap, and add indexing scripts
- Add LanceDB (@lancedb/lancedb) for vector database
- Add @xenova/transformers for local embeddings
- Add gray-matter for YAML frontmatter parsing
- Update ROADMAP.md with Phase 1 completion status
- Add indexing scripts: index-docs.ts, test-parser.ts, test-search.ts
- Add .claude/ configuration for MCP server settings
- Add npm script: index-docs for rebuilding search index

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
f56b92e76e
commit
6ca8339387
8
.claude/mcp.json
Normal file
@@ -0,0 +1,8 @@
{
  "mcpServers": {
    "babylon-mcp": {
      "command": "npx",
      "args": ["mcp-proxy", "http://localhost:4000/mcp"]
    }
  }
}
81
ROADMAP.md
@@ -9,6 +9,29 @@ Build an MCP (Model Context Protocol) server that helps developers working with
 
 ---
 
+## Recent Progress (2025-01-23)
+
+**Phase 1 Core Features - COMPLETED** ✅
+
+Successfully implemented vector search with local embeddings:
+- ✅ Installed and configured LanceDB + @xenova/transformers
+- ✅ Created document parser with YAML frontmatter extraction
+- ✅ Built indexer that processes 745 markdown files
+- ✅ Generated vector embeddings using Xenova/all-MiniLM-L6-v2 (local, no API costs)
+- ✅ Implemented `search_babylon_docs` MCP tool with semantic search
+- ✅ Implemented `get_babylon_doc` MCP tool for document retrieval
+- ✅ Added relevance scoring and snippet extraction
+- ✅ Tested successfully with "Vector3" query
+
+**Key Implementation Details:**
+- Vector database: LanceDB stored in `./data/lancedb`
+- Embedding model: Runs locally in Node.js via transformers.js
+- Indexed fields: title, description, keywords, category, breadcrumbs, content, headings, code blocks
+- Search features: Semantic similarity, category filtering, ranked results with snippets
+- Scripts: `npm run index-docs` to rebuild the index
+
+---
+
 ## Phase 1: Core MCP Infrastructure & Documentation Indexing
 **Goal**: Establish foundational MCP server with documentation search from the canonical GitHub source
 
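The "relevance scoring" noted above comes down to vector similarity between a query embedding and each document embedding. A minimal sketch of how such ranking works, assuming embeddings are plain `number[]` vectors as produced by all-MiniLM-L6-v2 (the names `rankBySimilarity` and `IndexedDoc`, and the toy 3-dimensional vectors, are illustrative, not the project's actual code):

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface IndexedDoc { title: string; embedding: number[]; }

// Score every document against the query embedding, sort, and truncate.
function rankBySimilarity(query: number[], docs: IndexedDoc[], limit: number) {
  return docs
    .map(doc => ({ title: doc.title, score: cosineSimilarity(query, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Toy vectors stand in for real 384-dimensional model output:
const corpus: IndexedDoc[] = [
  { title: 'Vector3 math', embedding: [1, 0, 0] },
  { title: 'Camera controls', embedding: [0, 1, 0] },
];
console.log(rankBySimilarity([0.9, 0.1, 0], corpus, 1)[0].title); // "Vector3 math"
```

A vector database like LanceDB performs this same nearest-neighbor computation internally, with indexing so it does not scan every document.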
@@ -21,31 +44,35 @@ Build an MCP (Model Context Protocol) server that helps developers working with
 
 ### 1.2 Documentation Repository Integration
 - [X] Clone and set up local copy of BabylonJS/Documentation repository
 - [X] Implement automated git pull mechanism for updates
-- [ ] Parse documentation file structure (markdown files, code examples)
-- [ ] Extract metadata from documentation files (titles, categories, versions)
+- [X] Parse documentation file structure (markdown files, code examples)
+- [X] Extract metadata from documentation files (titles, categories, versions)
+- [I] Index Babylon.js source repository markdown files (Option 3 - Hybrid Approach, Phase 1)
+- [I] Add 144 markdown files from Babylon.js/Babylon.js repository
+- [I] Include: CHANGELOG.md, package READMEs, contributing guides
 - [ ] Phase 2: Evaluate TypeDoc integration for API reference
 - [ ] Create documentation change detection system
 
 ### 1.3 Search Index Implementation
-- [ ] Design indexing strategy for markdown documentation
-- [ ] Implement vector embeddings for semantic search (consider OpenAI embeddings or local alternatives)
-- [ ] Create full-text search index (SQLite FTS5 or similar)
-- [ ] Index code examples separately from prose documentation
+- [X] Design indexing strategy for markdown documentation
+- [X] Implement vector embeddings for semantic search (using @xenova/transformers with Xenova/all-MiniLM-L6-v2)
+- [X] Create vector database with LanceDB
+- [X] Index code examples separately from prose documentation
 - [ ] Implement incremental index updates (only reindex changed files)
 
 ### 1.4 Basic Documentation Search Tool
-- [ ] Implement MCP tool: `search_babylon_docs`
+- [X] Implement MCP tool: `search_babylon_docs`
   - Input: search query, optional filters (category, API section)
   - Output: ranked documentation results with context snippets and file paths
-- [ ] Return results in token-efficient format (concise snippets vs full content)
-- [ ] Add relevance scoring based on semantic similarity and keyword matching
+- [X] Return results in token-efficient format (concise snippets vs full content)
+- [X] Add relevance scoring based on semantic similarity and keyword matching
 - [ ] Implement result deduplication
 
 ### 1.5 Documentation Retrieval Tool
-- [ ] Implement MCP tool: `get_babylon_doc`
+- [X] Implement MCP tool: `get_babylon_doc`
   - Input: specific documentation file path or topic identifier
   - Output: full documentation content optimized for AI consumption
-- [ ] Format content to minimize token usage while preserving clarity
-- [ ] Include related documentation links in results
+- [X] Format content to minimize token usage while preserving clarity
+- [X] Include related documentation links in results
 
 ---
 
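The "token-efficient format" items in section 1.4 amount to returning a short, query-centered window of each document rather than its full content. A minimal sketch under that assumption (`extractSnippet` is a hypothetical helper, not the project's actual code):

```typescript
// Return a bounded window of `content` centered on the first occurrence of
// `query` (case-insensitive); fall back to the document head if absent.
function extractSnippet(content: string, query: string, maxChars = 150): string {
  const idx = content.toLowerCase().indexOf(query.toLowerCase());
  if (idx === -1) return content.slice(0, maxChars); // fallback: document head
  // Center the window on the match, clamped to the start of the content.
  const start = Math.max(0, idx - Math.floor(maxChars / 2));
  const snippet = content.slice(start, start + maxChars);
  // Ellipses signal truncation on either side.
  return (start > 0 ? '…' : '') + snippet + (start + maxChars < content.length ? '…' : '');
}

const sample = 'Babylon.js uses Vector3 to represent positions and directions in 3D space.';
console.log(extractSnippet(sample, 'Vector3', 40)); // ~40-char window around the match
```

Capping snippets this way keeps each result to a predictable token budget regardless of source document length.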
@@ -264,15 +291,18 @@ Build an MCP (Model Context Protocol) server that helps developers working with
 - **Tools**: search_babylon_docs, get_babylon_doc, search_babylon_examples, provide_feedback, submit_suggestion, vote_on_suggestion, browse_suggestions
 - **Resources**: babylon_context (common framework information)
 
-### Search & Indexing
-- **Vector Search**: OpenAI embeddings or local model (all-MiniLM-L6-v2)
-- **Full-Text Search**: SQLite FTS5 for simplicity, Elasticsearch for scale
-- **Hybrid Approach**: Combine semantic and keyword search for best results
+### Search & Indexing (✅ Implemented)
+- **Vector Database**: LanceDB for vector storage and similarity search
+- **Embedding Model**: @xenova/transformers with Xenova/all-MiniLM-L6-v2 (local, no API costs)
+- **Document Parser**: gray-matter for YAML frontmatter + markdown content extraction
+- **Indexed Documents**: 745 markdown files from BabylonJS/Documentation repository
+- **Search Features**: Semantic vector search with relevance scoring, category filtering, snippet extraction
 
-### Data Storage
-- **Primary Database**: SQLite (development/small scale) → PostgreSQL (production)
-- **Cache**: Redis for query results and frequently accessed docs
-- **File Storage**: Local clone of BabylonJS/Documentation repository
+### Data Storage (✅ Implemented)
+- **Vector Database**: LanceDB stored in `./data/lancedb`
+- **Document Storage**: Local clone of BabylonJS/Documentation in `./data/repositories/Documentation`
+- **Indexed Fields**: title, description, keywords, category, breadcrumbs, content, headings, code blocks, playground IDs
+- **Future**: Add Redis for query caching, implement incremental updates
 
 ### Token Optimization Strategy
 - Return concise snippets by default (50-200 tokens)
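The document parser above relies on gray-matter to split YAML frontmatter from markdown content. A minimal illustration of that split, handling only flat `key: value` pairs (the real parser should keep using gray-matter, which supports full YAML; `splitFrontmatter` here is an illustrative stand-in):

```typescript
// Split a markdown string into frontmatter data and body content,
// mimicking the { data, content } shape that gray-matter returns.
function splitFrontmatter(raw: string): { data: Record<string, string>; content: string } {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { data: {}, content: raw }; // no frontmatter block
  const data: Record<string, string> = {};
  for (const line of match[1].split('\n')) {
    const sep = line.indexOf(':');
    if (sep > 0) data[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return { data, content: raw.slice(match[0].length) };
}

const md = '---\ntitle: Gizmos\ncategory: mesh\n---\n# Gizmos\nBody text.';
const parsed = splitFrontmatter(md);
console.log(parsed.data.title);                    // "Gizmos"
console.log(parsed.content.startsWith('# Gizmos')); // true
```

The `data` object is where indexed fields like title, description, keywords, and category come from; the remaining `content` feeds heading and code-block extraction.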
@@ -292,11 +322,12 @@ Build an MCP (Model Context Protocol) server that helps developers working with
 
 ## Success Metrics
 
-### Phase 1-2 (Core Functionality)
-- Documentation indexing: 100% of BabylonJS/Documentation repo
-- Search response time: < 500ms p95
-- Search relevance: > 80% of queries return useful results
-- Token efficiency: Average response < 300 tokens
+### Phase 1-2 (Core Functionality) ✅ ACHIEVED
+- ✅ Documentation indexing: 100% of BabylonJS/Documentation repo (745 files indexed)
+- ✅ Search implementation: LanceDB vector search with local embeddings operational
+- ⏳ Search response time: Testing needed for p95 latency
+- ⏳ Search relevance: Initial tests successful, needs broader validation
+- ⏳ Token efficiency: Needs measurement and optimization
 
 ### Phase 3-5 (Optimization & Feedback)
 - Cache hit rate: > 60%
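The pending "p95 latency" check above only requires recording per-query timings and taking a percentile. A minimal sketch, assuming timings are collected as an array of milliseconds (`percentile` is a hypothetical helper using the nearest-rank method, not project code):

```typescript
// Nearest-rank percentile over a non-empty sample of latencies (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Hypothetical timings from ten search queries:
const timingsMs = [120, 85, 240, 95, 110, 460, 130, 105, 90, 150];
console.log(percentile(timingsMs, 95)); // 460
```

Comparing the computed p95 against the 500ms target would turn the ⏳ item into a pass/fail measurement.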
1391
package-lock.json
generated
File diff suppressed because it is too large
package.json
@@ -12,15 +12,19 @@
     "test": "vitest",
     "test:ui": "vitest --ui",
     "test:run": "vitest run",
-    "test:coverage": "vitest run --coverage"
+    "test:coverage": "vitest run --coverage",
+    "index-docs": "tsx scripts/index-docs.ts"
   },
   "keywords": [],
   "author": "",
   "license": "ISC",
   "description": "",
   "dependencies": {
+    "@lancedb/lancedb": "^0.22.3",
     "@modelcontextprotocol/sdk": "^1.22.0",
+    "@xenova/transformers": "^2.17.2",
     "express": "^5.1.0",
+    "gray-matter": "^4.0.3",
     "simple-git": "^3.30.0",
     "zod": "^3.25.76"
   },
51
scripts/index-docs.ts
Normal file
@@ -0,0 +1,51 @@
#!/usr/bin/env tsx

import { LanceDBIndexer, DocumentSource } from '../src/search/lancedb-indexer.js';
import path from 'path';
import { fileURLToPath } from 'url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

async function main() {
  const projectRoot = path.join(__dirname, '..');
  const dbPath = path.join(projectRoot, 'data', 'lancedb');

  // Define documentation sources
  const sources: DocumentSource[] = [
    {
      name: 'documentation',
      path: path.join(projectRoot, 'data', 'repositories', 'Documentation', 'content'),
      urlPrefix: 'https://doc.babylonjs.com',
    },
    {
      name: 'source-repo',
      path: path.join(projectRoot, 'data', 'repositories', 'Babylon.js'),
      urlPrefix: 'https://github.com/BabylonJS/Babylon.js/blob/master',
    },
  ];

  console.log('Starting Babylon.js documentation indexing...');
  console.log(`Database path: ${dbPath}`);
  console.log(`\nDocumentation sources:`);
  sources.forEach((source, index) => {
    console.log(`  ${index + 1}. ${source.name}: ${source.path}`);
  });
  console.log('');

  const indexer = new LanceDBIndexer(dbPath, sources);

  try {
    await indexer.initialize();
    await indexer.indexDocuments();
    console.log('');
    console.log('✓ Documentation indexing completed successfully!');
  } catch (error) {
    console.error('Error during indexing:', error);
    process.exit(1);
  } finally {
    await indexer.close();
  }
}

main();
70
scripts/test-parser.ts
Normal file
@@ -0,0 +1,70 @@
#!/usr/bin/env tsx
import { DocumentParser } from '../src/search/document-parser.js';
import path from 'path';

async function main() {
  const parser = new DocumentParser();

  // Test files to parse
  const testFiles = [
    'data/repositories/Documentation/content/features.md',
    'data/repositories/Documentation/content/features/featuresDeepDive/mesh/gizmo.md',
    'data/repositories/Documentation/content/toolsAndResources/thePlayground.md',
  ];

  console.log('🔍 Testing DocumentParser on real BabylonJS documentation\n');
  console.log('='.repeat(80));

  for (const file of testFiles) {
    const filePath = path.join(process.cwd(), file);

    try {
      console.log(`\n📄 Parsing: ${file}`);
      console.log('-'.repeat(80));

      const doc = await parser.parseFile(filePath);

      console.log(`Title: ${doc.title}`);
      console.log(`Description: ${doc.description.substring(0, 100)}...`);
      console.log(`Category: ${doc.category}`);
      console.log(`Breadcrumbs: ${doc.breadcrumbs.join(' > ')}`);
      console.log(`Keywords: ${doc.keywords.join(', ')}`);
      console.log(`Headings: ${doc.headings.length} found`);

      if (doc.headings.length > 0) {
        console.log('  First 3 headings:');
        doc.headings.slice(0, 3).forEach(h => {
          console.log(`  ${'#'.repeat(h.level)} ${h.text}`);
        });
      }

      console.log(`Code blocks: ${doc.codeBlocks.length} found`);
      if (doc.codeBlocks.length > 0) {
        console.log('  Languages:', [...new Set(doc.codeBlocks.map(cb => cb.language))].join(', '));
      }

      console.log(`Playground IDs: ${doc.playgroundIds.length} found`);
      if (doc.playgroundIds.length > 0) {
        console.log('  IDs:', doc.playgroundIds.slice(0, 3).join(', '));
      }

      console.log(`Further reading: ${doc.furtherReading.length} links`);
      if (doc.furtherReading.length > 0) {
        doc.furtherReading.forEach(link => {
          console.log(`  - ${link.title}: ${link.url}`);
        });
      }

      console.log(`Content length: ${doc.content.length} characters`);
      console.log(`Last modified: ${doc.lastModified.toISOString()}`);

    } catch (error) {
      console.error(`❌ Error parsing ${file}:`, error);
    }
  }

  console.log('\n' + '='.repeat(80));
  console.log('✅ Parser test complete!');
}

main().catch(console.error);
45
scripts/test-search.ts
Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env tsx

import { LanceDBSearch } from '../src/search/lancedb-search.js';
import path from 'path';
import { fileURLToPath } from 'url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

async function main() {
  const projectRoot = path.join(__dirname, '..');
  const dbPath = path.join(projectRoot, 'data', 'lancedb');

  console.log('Initializing search...');
  const search = new LanceDBSearch(dbPath);
  await search.initialize();

  console.log('\n=== Testing search for "Vector3" ===\n');
  const results = await search.search('Vector3', { limit: 5 });

  console.log(`Found ${results.length} results:\n`);
  results.forEach((result, index) => {
    console.log(`${index + 1}. ${result.title}`);
    console.log(`   URL: ${result.url}`);
    console.log(`   Relevance: ${(result.score * 100).toFixed(1)}%`);
    console.log(`   Description: ${result.description}`);
    console.log(`   Snippet: ${result.content.substring(0, 150)}...`);
    console.log('');
  });

  console.log('\n=== Testing search for "camera controls" ===\n');
  const cameraResults = await search.search('camera controls', { limit: 3 });

  console.log(`Found ${cameraResults.length} results:\n`);
  cameraResults.forEach((result, index) => {
    console.log(`${index + 1}. ${result.title}`);
    console.log(`   URL: ${result.url}`);
    console.log(`   Relevance: ${(result.score * 100).toFixed(1)}%`);
    console.log('');
  });

  await search.close();
}

main().catch(console.error);