Verified Results
Benchmark Results
Real tests. Real numbers. Reproducible results from automated test suites.
Last Updated: January 2026
Test 1: Coding Hallucination Verification
Test Description
When the AI claims to have made code changes, Cortex verifies them against the actual codebase, detecting non-existent method additions, incorrect line numbers, and similar hallucinations.
Verification:
1. Hallucination Detection (F1 Score)
2. Precision
3. Recall
Key Difference
When the AI says "I fixed the bug," Cortex verifies that claim against the actual code.
Benchmark Results
| Metric | Result | Status |
|---|---|---|
| Hallucination F1 | 88.9% | Pass |
| Precision | 100% | Pass |
| Recall | 88% | Pass |
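For reference, the F1 score reported above is the harmonic mean of precision and recall. A minimal helper (illustrative only, not Cortex's own code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```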
Example
AI: "Added calculateTotal() to UserService"
Cortex: AST analysis → Not found → Hallucination
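Cortex's actual AST analysis is not shown here, but the core idea can be sketched in a few lines of Python: parse the real source, then check whether the claimed method actually exists on the claimed class. The `method_exists` helper and the sample source are hypothetical, using the class/method names from the example above:

```python
import ast

def method_exists(source: str, class_name: str, method_name: str) -> bool:
    """Parse source and check whether class_name defines method_name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return any(
                isinstance(item, ast.FunctionDef) and item.name == method_name
                for item in node.body
            )
    return False

# A claim like "Added calculateTotal() to UserService" is flagged as a
# hallucination when the method is absent from the real source tree.
code = "class UserService:\n    def get_user(self):\n        pass\n"
print(method_exists(code, "UserService", "calculateTotal"))  # False
```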
Test 2: Multi-Environment Context Sync
Test Description
Verifies context synchronization across desktop, laptop, and remote servers. Git-based sync enables seamless continuation from any machine.
Test Scenarios:
1. Create context on desktop → Pull on laptop
2. Verify conflict-free merge
3. Multi-agent concurrent sync
4. Ontology relationship data sync
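Cortex's sync protocol is not published here, but one common way to achieve a conflict-free merge of context entries is last-writer-wins keyed by timestamp (a CRDT-style register). A hedged sketch, with illustrative names and data:

```python
def merge_contexts(local: dict, remote: dict) -> dict:
    """Merge two context stores. Each value is a (timestamp, payload)
    pair; for every key, the entry with the newer timestamp wins, so
    both sides converge to the same result in any merge order."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Desktop and laptop edited the same context concurrently:
desktop = {"goal": (100, "fix auth bug"), "branch": (90, "main")}
laptop = {"goal": (120, "fix auth bug + add tests")}
merged = merge_contexts(desktop, laptop)
```

Last-writer-wins trades fine-grained conflict resolution for simplicity; the point is only that both machines converge without manual merges.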
Results
Key result: work continues with the same context regardless of environment.
Test 3: Cross-Session Memory
Architecture Feature
LLMs are stateless by design. Cortex maintains cross-session context with persistent local storage, enabling continuation of days-old conversations.
Capabilities:
1. Recall context from days/weeks ago
2. No need to re-explain project
3. Maintain ontology relationships
4. Auto-track goals/progress
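The persistence layer shown here is a sketch, not Cortex's actual schema: a SQLite file on disk is one simple way to keep context alive across sessions, since any later process can reopen the same database and recall notes written days or weeks earlier. Function and table names are hypothetical:

```python
import sqlite3

def save_context(db_path: str, session_id: str, note: str) -> None:
    """Persist a context note so later sessions can recall it."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS context (session_id TEXT, note TEXT)")
    con.execute("INSERT INTO context VALUES (?, ?)", (session_id, note))
    con.commit()
    con.close()

def recall_context(db_path: str, session_id: str) -> list:
    """Load notes for a session from local storage, however old."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT note FROM context WHERE session_id = ?", (session_id,)
        ).fetchall()
    except sqlite3.OperationalError:  # database file has no table yet
        rows = []
    con.close()
    return [note for (note,) in rows]
```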
Comparison
Standard LLM
- Session end = memory reset
- Need to re-explain the project each time
- Context window limit
With Cortex
- Unlimited session persistence
- "What we did last time..." works
- Unlimited conversation history
Test 4: Pay Attention
Test Description
Verifies that every version of a topic (A → A' → A'') is tracked across long conversations, and that context/goals are automatically re-injected periodically.
Test Cases:
1. Long conversation recall
2. Version tracking (5 versions)
3. Referential query ("that earlier")
4. Completeness validation
5. Trigger detection (8 types)
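The version-tracking behavior above can be sketched with a small tracker that appends every revision of a topic, so a referential query like "that earlier one" can be resolved against the full history. This is an illustrative data structure, not Cortex's implementation:

```python
class TopicTracker:
    """Track every version of a topic (A -> A' -> A'') so both the
    latest revision and the full history stay queryable."""

    def __init__(self):
        self.versions = {}  # topic -> list of revisions, oldest first

    def update(self, topic: str, content: str) -> None:
        self.versions.setdefault(topic, []).append(content)

    def latest(self, topic: str) -> str:
        return self.versions[topic][-1]

    def history(self, topic: str) -> list:
        return list(self.versions[topic])

# Five revisions of one topic, as in the version-tracking test case:
tracker = TopicTracker()
for draft in ["A", "A'", "A''", "A'''", "A''''"]:
    tracker.update("design", draft)
```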
Results
| Test | Without Cortex | With Cortex |
|---|---|---|
| Initial | Fail | Pass |
| Versions | 0 | 5 |
| Reference | Fail | Pass |
| Triggers | N/A | 8/8 |
Test 5: Search Performance
Local vector search with ChromaDB + sentence-transformers. Average response time: 58 ms across 500 documents.
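The production path uses ChromaDB with sentence-transformer embeddings; the core retrieval step is a nearest-neighbor search over embedding vectors. A dependency-free sketch of that step with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the document ids here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, index):
    """Return document ids ranked by similarity to the query vector."""
    return sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]),
                  reverse=True)

index = {
    "doc_sync":   [0.9, 0.1, 0.0],
    "doc_memory": [0.1, 0.9, 0.2],
    "doc_search": [0.0, 0.2, 0.9],
}
ranked = search([0.8, 0.2, 0.1], index)  # most similar id first
```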
Cortex Core Value
Verify AI coding tool reliability and maintain context across all environments.
Hallucination Check
Verify AI claims against actual code
Multi-Env Sync
Same context in any environment
Multi-Agent
Context merge & sync for concurrent work
Ontology
Efficient context retrieval via relationships
Auto Context
Auto-update context/goals periodically
Local Only
All data stays on your machine
Run the Tests Yourself
Don't just take our word for it. Join the beta and verify results yourself.