Verified Results

Benchmark Results

Real tests. Real numbers. Reproducible results from automated test suites.

Last Updated: January 2026

Test 1: Coding Hallucination Verification

Core
3/3 Passed

Test Description

When an AI assistant claims it changed code, Cortex checks the claim against the actual codebase. It detects claimed method additions that do not exist, incorrect line numbers, and similar mismatches.

Metrics:
1. Hallucination Detection (F1 Score)
2. Precision
3. Recall

Key Difference

When an AI says "I fixed the bug", Cortex verifies the claim against the actual code.

Benchmark Results

Metric             Result   Status
Hallucination F1   88.9%    Pass
Precision          100%     Pass
Recall             88%      Pass
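
For readers who want to recompute the metrics, the sketch below uses the standard definitions of precision, recall, and F1 (the harmonic mean of the two). The function names and the counts in the usage line are hypothetical and are not taken from the Cortex test suite.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 hallucinations caught, 0 false alarms, 2 missed.
p, r = precision_recall(tp=8, fp=0, fn=2)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

With these hypothetical counts, precision is 1.0, recall is 0.8, and F1 is about 0.889. For comparison, precision 1.0 with recall 0.88 gives an F1 of about 0.936.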

Example

AI: "Added calculateTotal() to UserService"

Cortex: AST analysis → Not found → Hallucination
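
The check in this example can be sketched with Python's standard `ast` module. This is only an illustration of the idea, assuming a Python codebase; it is not Cortex's actual implementation, and `class_has_method`, `check_claim`, and the sample `UserService` source are hypothetical.

```python
import ast


def class_has_method(source: str, class_name: str, method_name: str) -> bool:
    """Return True if class_name in source directly defines method_name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                is_function = isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                if is_function and item.name == method_name:
                    return True
    return False


def check_claim(source: str, class_name: str, method_name: str) -> str:
    """Classify a claimed method addition as verified or hallucinated."""
    found = class_has_method(source, class_name, method_name)
    return "verified" if found else "hallucination"


# Hypothetical file contents after the AI's claimed edit.
SOURCE = '''
class UserService:
    def get_user(self, user_id):
        return {"id": user_id}
'''

print(check_claim(SOURCE, "UserService", "calculateTotal"))  # prints "hallucination"
print(check_claim(SOURCE, "UserService", "get_user"))        # prints "verified"
```

Parsing the code rather than searching the text avoids false matches in comments or strings, which is presumably why the example relies on AST analysis.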

Test 2: Multi-Environment Context Sync

Pass

Test Description

Verifies context synchronization across desktop, laptop, and remote servers. Git-based sync lets work continue in any of these environments.

Test Scenarios:
1. Create context on desktop → Pull on laptop
2. Verify conflict-free merge
3. Multi-agent concurrent sync
4. Ontology relationship data sync
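
One way a concurrent merge can be made conflict-free is a deterministic last-writer-wins rule over keyed entries. The sketch below illustrates that technique only. It does not call Git and is not a description of how Cortex merges; `ContextEntry`, `merge_contexts`, and the logical timestamp field are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextEntry:
    entry_id: str
    updated_at: int  # hypothetical logical timestamp, higher means newer
    text: str


def merge_contexts(*replicas: dict[str, ContextEntry]) -> dict[str, ContextEntry]:
    """Merge replicas entry by entry, keeping the newest version of each entry.

    Ties on updated_at are broken by comparing text, so the result does not
    depend on the order in which replicas are merged.
    """
    merged: dict[str, ContextEntry] = {}
    for replica in replicas:
        for entry_id, entry in replica.items():
            current = merged.get(entry_id)
            if current is None or (entry.updated_at, entry.text) > (current.updated_at, current.text):
                merged[entry_id] = entry
    return merged


desktop = {
    "goal": ContextEntry("goal", 1, "Ship v1"),
    "note": ContextEntry("note", 2, "Use SQLite"),
}
laptop = {"goal": ContextEntry("goal", 3, "Ship v1.1")}

merged = merge_contexts(desktop, laptop)
print(merged["goal"].text)  # prints "Ship v1.1"
```

Because the rule is commutative, merging desktop into laptop or laptop into desktop yields the same result, which is the property a "0 conflicts" scenario depends on.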

Results

Cross-env Sync      100%
Multi-agent Merge   0 Conflicts
Ontology Data       100%

Key: Continue with the same context regardless of environment

Test 3: Cross-Session Memory

Unlimited

Architecture Feature

LLMs are stateless by design. Cortex maintains cross-session context with persistent local storage, enabling continuation of days-old conversations.

Capabilities:
1. Recall context from days/weeks ago
2. No need to re-explain project
3. Maintain ontology relationships
4. Auto-track goals/progress

Comparison

Standard LLM

  • Session end = memory reset
  • Project must be re-explained in every session
  • History limited by the context window

With Cortex

  • Unlimited session persistence
  • "What we did last time..." works
  • Unlimited conversation history

Test 4: Pay Attention

5/5 Passed

Test Description

Verifies that every version of a topic (A → A' → A'') is tracked across a long conversation. Cortex also re-injects context and goals into the conversation periodically.

Test Cases:
1. Long conversation recall
2. Version tracking (5 versions)
3. Referential query ("that earlier")
4. Completeness validation
5. Trigger detection (8 types)

Results

Test        Without Cortex   With Cortex
Initial     Fail             Pass
Versions    0                5
Reference   Fail             Pass
Triggers    N/A              8/8

Cortex Core Value

Verify AI coding tool reliability and maintain context across all environments.

1. Hallucination Check: Verify AI claims against actual code
2. Multi-Env Sync: Same context in any environment
3. Multi-Agent: Context merge & sync for concurrent work
4. Ontology: Efficient context retrieval via relationships
5. Auto Context: Auto-update context/goals periodically
6. Local Only: All data stays local

Run the Tests Yourself

Don't just take our word for it. Join the beta and verify results yourself.