AI ENGINE · v0.25.0 · Phase 3 Milestones M1–M25 (M13 deferred)

SOST AI
Engine

Internal autonomous research and validation system that generates hypotheses, validates them against local evidence, detects contradictions, scores utility and reduces overclaiming across Materials Engine, GeaSpirit, SOST protocol decisions and Useful Compute task design. Local-first. Free-first. Paid-last. No autonomous public claims.

INTERNAL RESEARCH | 1,402 / 1,402 tests · 100,000 hypotheses offline · 0 paid calls · 8 official-free source connectors · outcome learning loop · operator command center · miner support triage · scientific validation dossiers · validation campaigns · private operator console · SQLite chat persistence · speculative discovery lab · practical scientific recommendations · multi-repo project registry · deny-by-default capability gates · task-level outcome ledger · per-cycle project heartbeat · autonomous scoreboard · experiment planner · replay sandbox · counterfactual GO/MAYBE/NO-GO · canonical mission objectives · six-role AI scientist swarm · contradiction-to-discovery loop · research frontier map · strategic 40/30/20/10 allocator · weekly canonical roadmap · anti-obsession guard · localhost ops dashboard · background lab autopilot · useful-compute task classifier · private task staging queue · daily brief generator · operator approval inbox · executive summary · operator feedback loop · network OFF by default

// 01 — WHAT IT IS

An evidence-first autonomous research engine

// SUMMARY

The SOST AI Engine is an internal autonomous research system that generates hypotheses, compares evidence, detects contradictions and helps prioritize scientific and protocol decisions across SOST projects. It does not publish autonomous conclusions; all public outputs require human review.

// CORE PRINCIPLE

No conclusion is accepted just because a model said it. Every important claim is classified by evidence level — local code, local data, local doc, multi-source, model-only, speculative, contradicted or insufficient — before it can be considered for any internal promotion, and never published without an explicit human review pass.

// DESIGN BIAS — FREE / LOCAL FIRST, PAID LAST

100,000 hypotheses are generated locally and cheaply. Only the best, most uncertain or most contradictory ones reach a multi-AI council. Network access is OFF by default; paid models are OFF by default and require an explicit operator flag.

// 02 — WHAT IT DOES

Capabilities across twenty-five milestones (M13 deferred)

M1 — Evidence Core & Public Claim Guard
SHIPPED

Ten-level evidence classification (local_code_verified → do_not_publish), claim extractor, public-claim guard with safer-rewrite suggestions, eight seeded eval cases derived from real past mistakes (Useful Compute rewards postponed, avg288 vs avg1000, gold-redemption wording, GeaSpirit mineral guarantees, DFT-validated overclaim, etc.).

  • blocks public wording such as "rewards are active" while the UC trial is postponed
  • blocks "DFT validated" when no DFT artefact is found locally
  • blocks "mineral guaranteed at depth" in GeaSpirit copy
  • downgrades absolute-certainty wording to "needs human review"
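
A minimal sketch of a phrase-level guard in this spirit, with abbreviated phrase lists and illustrative verdict names (the real guard's ten evidence levels and safer-rewrite suggestions are not reproduced here):

```python
# Illustrative public-claim guard: block known-bad wording, downgrade
# absolute wording to human review. Phrase lists are abbreviated samples.
BLOCKED_PHRASES = ("rewards are active", "dft validated", "mineral guaranteed at depth")
ABSOLUTE_WORDING = ("guaranteed", "always", "proven", "risk-free")

def scan_text(text: str) -> dict:
    t = text.lower()
    blocked = [p for p in BLOCKED_PHRASES if p in t]
    absolutes = [w for w in ABSOLUTE_WORDING if w in t]
    if blocked:
        verdict = "blocked"
    elif absolutes:
        verdict = "needs_human_review"  # downgrade, never auto-approve
    else:
        verdict = "ok"
    return {"verdict": verdict, "blocked": blocked, "flagged_wording": absolutes}
```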
M2 — Real Validators (Materials, GeaSpirit, SOST)
SHIPPED

Read-only validators that cross-check claims against the actual local corpus. The unified ValidatorResult carries verdict, evidence_level, confidence, publishability, evidence_items, missing_evidence, risks and next_steps. The orchestrator merges multiple validators with the strictest verdict winning. A minimal sketch of that merge follows the list below.

  • Materials: validate_material_claim, validate_dft_status, low-cost / catalyst / photovoltaic / false-positive checks
  • GeaSpirit: depth-aware evidence required for any depth claim; satellite-only marked as surface proxy only
  • SOST: locked policies on cASERT 6210 (cancelled), avg288-only consensus, mandatory-update wording
  • Useful Compute: detects "rewards active" wording and contradicts it against the trial doc
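
A minimal sketch of the strictest-verdict merge, assuming an illustrative verdict ordering (field names follow the text; the ranking, defaults and confidence rule are assumptions):

```python
from dataclasses import dataclass, field

# Assumed severity order: higher rank = stricter; merged verdict is the max.
_VERDICT_RANK = {"supported": 0, "needs_human_review": 1, "contradicted": 2, "blocked": 3}

@dataclass
class ValidatorResult:
    verdict: str
    evidence_level: str
    confidence: float
    publishability: str
    evidence_items: list = field(default_factory=list)
    missing_evidence: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    next_steps: list = field(default_factory=list)

def merge(results: list) -> ValidatorResult:
    """Merge validator results; the strictest verdict wins."""
    strictest = max(results, key=lambda r: _VERDICT_RANK.get(r.verdict, 0))
    merged = ValidatorResult(
        verdict=strictest.verdict,
        evidence_level=strictest.evidence_level,
        confidence=min(r.confidence for r in results),  # never exceed the weakest
        publishability=strictest.publishability,
    )
    for r in results:
        merged.evidence_items += r.evidence_items
        merged.missing_evidence += r.missing_evidence
        merged.risks += r.risks
        merged.next_steps += r.next_steps
    return merged
```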
M3 — Hypothesis Factory + AI Council + Learning Loop
SHIPPED

Mass local hypothesis generation (binary, ternary, quaternary and doped compositions for materials; AOI×commodity for GeaSpirit; risky-wording and Heavy-task-design ideas for SOST/UC). Deterministic ranking with configurable weights. AI Council with validator-veto (not majority vote). Outcome-driven rule-based learning loop with append-only persistence.

  • capable of generating 100,000+ hypotheses offline in seconds
  • deduplication by stable hash and project-specific pair keys
  • eight canonical campaign templates (Materials DFT priority, GeaSpirit public safety, UC Heavy task design, ...)
  • static HTML dashboard generator — local file, no server, no public deploy
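
The dedup step can be pictured as below; the normalisation and key fields are assumptions, only the stable-hash-plus-seen-set pattern comes from the bullet above:

```python
import hashlib

def stable_hash(project: str, subject: str, statement: str) -> str:
    # Assumed normalisation: lowercase, stripped, pipe-joined fields.
    key = "|".join(s.strip().lower() for s in (project, subject, statement))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def dedupe(hypotheses: list) -> list:
    seen, unique = set(), []
    for h in hypotheses:
        k = stable_hash(h["project"], h["subject"], h["statement"])
        if k not in seen:
            seen.add(k)
            unique.append(h)
    return unique
```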
M4 — Applicability & Utility Engine
SHIPPED

Each hypothesis is enriched with a structured ApplicabilityProfile answering: what it could be useful for, why theoretically, what evidence is missing, what false-positive risks exist, what the next validation step is, and whether it's publishable.

  • materials family classifier: oxide / sulfide / nitride / phosphide / carbide / silicide / halide / metallic
  • element-risk detection: PGM cost, toxicity, rare-earth supply, cheap-only
  • application-aware validation pathway (band structure only for PV/photonic, elastic only for structural)
  • recommended actions: promote_to_dft_queue, promote_to_chgnet, literature_review, keep_internal, reject_false_positive, future_heavy_task_candidate, ...
M5 — Free Knowledge Connectors & Source Reliability
SHIPPED

Connects the engine to official / free public APIs (arXiv, OpenAlex, Crossref, PubChem, JARVIS, Materials Project, USGS) and to optional local / free AI providers (Ollama, OpenRouter, HuggingFace) — with cache, rate-limit, domain allowlist, citation tracking and source reliability scoring. Truth hierarchy: local validators > local data > official DB > peer-reviewed metadata > preprints > local LLM > free hosted > paid judge (last and opt-in).

  • HTTP layer: urllib only, explicit domain allowlist, never logs token values
  • per-source rate limits + sha256-keyed cache (default 7 days, USGS 30 days; sketched after this list)
  • 9 canonical contradictions hard-locked: rewards-active vs postponed, avg1000 consensus mismatch, cASERT 6210 cancelled, DFT-validated overclaim, mineral / depth guarantees, "trustless" overclaim, no-risk wording, guaranteed price/payout
  • research session: claim → validators → local knowledge → sources → synthesis → contradiction resolution → internal answer with provenance
  • model-answer validator: scores overclaim / hallucination, emits a corrected answer that embeds the canonical truth
  • defaults: network OFF, paid OFF, free-AI OFF, local-model OFF
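
The sha256-keyed cache with per-source TTLs might look like this sketch; the on-disk layout and key derivation are assumptions, the 7-day / 30-day TTLs come from the bullet above:

```python
import hashlib, json, time
from pathlib import Path

TTL_DAYS = {"default": 7, "usgs": 30}

def _cache_path(root: Path, source: str, url: str) -> Path:
    key = hashlib.sha256(f"{source}|{url}".encode()).hexdigest()
    return root / source / f"{key}.json"

def cache_get(root: Path, source: str, url: str):
    p = _cache_path(root, source, url)
    if not p.exists():
        return None
    ttl = TTL_DAYS.get(source, TTL_DAYS["default"]) * 86400
    if time.time() - p.stat().st_mtime > ttl:
        return None  # stale: force a re-fetch through the rate limiter
    return json.loads(p.read_text())

def cache_put(root: Path, source: str, url: str, payload: dict) -> None:
    p = _cache_path(root, source, url)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(payload))
```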
M6 — Autonomous Research Daemon & Human Review Pipeline
SHIPPED

The engine now runs as an internal autonomous research daemon: it observes local Materials Engine, GeaSpirit, SOST and Useful Compute artefacts in read-only mode, plans its own bounded tasks, executes safe local work, learns from outcomes via rule-based memory, and produces reviewable archives for human approval. Critically, it never publishes: every public claim must pass an explicit human review and approval step before being exported — and the exporter writes only to reports/ai_engine/approved_exports/, never to the public website.

  • SelfTask schema with 28 task types covering all four projects
  • read-only adapters for Materials Engine, GeaSpirit, SOST and Useful Compute (no writes, no DFT, no GIS)
  • bounded planner with 24h dedup window, per-project + total caps, network/paid gates
  • review pack: summary.md, publication_candidates.md, do_not_publish.md, manifest.json, checksums.sha256, plus a .tar.gz archive
  • publication queue with strict approval gate: do_not_publish drafts cannot be approved
  • rule-based research memory (no neural reranker): boosts validated families, demotes rejected patterns, blocks risky wording
  • manual scripts only — no systemd, no cron
  • defaults: daemon OFF, network OFF, paid OFF, free-AI OFF, local-model OFF, public publication FORBIDDEN
M7 — Local / free AI provider wiring (paid disabled in M7)
SHIPPED

M7 wires the policy gate and judge plumbing for local + free AI providers (Ollama local, OpenRouter / HuggingFace free models). Paid AI is hard-disabled in M7 — even when a caller passes --allow-paid, the policy coerces max_paid_calls to 0 and reports paid_judge as disabled. The provider answer judge runs every reply against the canonical contradictions and the public-claim guard, scores overclaim and hallucination, and embeds the canonical correction in the corrected_answer.

  • provider_policy: explicit defaults all-OFF; allow_paid=True coerced to False
  • free_ai_model_registry: Ollama prefix allowlist (qwen/llama/mistral/phi/gemma/deepseek/codellama), OpenRouter only :free suffix, HuggingFace small free-inference list
  • provider_answer_judge: deterministic JudgeReport scoring overclaim/hallucination, with corrected_answer that appends the canonical truth
  • live_research_session: validators-only by default; every provider recorded as used=False with a skipped_reason
  • Ollama provider: refuses any non-loopback URL via a compare_exchange-style check; is_available() never raises
  • token presence reported as boolean only — values never read or stored; no password field anywhere
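
The loopback refusal and the never-raising availability probe could be sketched as follows (URL handling and the probe endpoint are assumptions):

```python
import urllib.request
from urllib.parse import urlparse

_LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}

def require_loopback(base_url: str) -> str:
    host = urlparse(base_url).hostname or ""
    if host not in _LOOPBACK_HOSTS:
        raise ValueError(f"refusing non-loopback Ollama URL: {base_url!r}")
    return base_url

def is_available(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Probe the local Ollama endpoint; never raises."""
    try:
        require_loopback(base_url)
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False
```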
M8 — Internal SOST AI Ask Engine
SHIPPED

A small CLI-driven interface that lets the operator type a free-form prompt and have the AI engine search/reason over local knowledge of Materials Engine, GeaSpirit, SOST, Useful Compute and DEX, then return a cautious internal-only answer. Never publishes — even when the prompt explicitly asks to publish, the answer composer routes to needs_human_review.

  • prompt_router: 12 intent buckets (explain / search / compare / validate / generate_hypotheses / public_wording_review / useful_compute_task_design / dft_priority / geaspirit_public_safety / dex_safety / mining_help / create_report)
  • project_selector: keyword-based routing across materials / geaspirit / sost / useful_compute / dex / mining
  • answer_composer: always runs contradiction_resolver + public_claim_guard; embeds canonical corrections; downgrades publishability on blocking findings
  • internal_citation: lightweight registry for local-file references with inline [n] markers and a markdown bibliography
  • output saved to reports/ai_engine/ask/<ts>_<slug>/ with answer.md, evidence.json, files_consulted.txt, risk_report.md, manifest.json
  • defaults: internal_only, network off, paid false, no automatic publication, no website write
M9 — Public Help Center + Approved Knowledge Export
SHIPPED

The public-safe layer. Private AI thinks. Human reviews. Public site explains only approved safe knowledge. The public website (sost-help.html and sost-miner-troubleshooter.html) consumes only static JSON exported from a human-reviewed pipeline. The public site never calls the private engine, never queries Ollama / OpenRouter / HuggingFace / paid AI, and never uploads a log.

  • approved_knowledge_exporter: combines approved publication-queue items with 12 default safe FAQ templates; writes reports/ai_engine/approved_public_help/<ts>/ with index, markdown, troubleshooter rules, faq, safety + source manifests, README and sha256 checksums
  • public_help_guard: exit-gate guard hard-blocking "guaranteed profit", "passive income", "Useful Compute rewards are active", "avg1000 consensus", "confirmed/guaranteed mineral", "DFT-validated", "fully trustless DEX", any "send/paste/share private key or seed phrase" wording, the personal-email leak token, and the AI attribution leak token
  • miner_troubleshooting_knowledge: 11 deterministic log-pattern rules (rejected-block, profile-mismatch, no-peers, connection-refused, bootstrap-chain, http-zero, too-many-threads, cmake/libsecp/libssl-missing, etc.) consumed by the local-only browser troubleshooter
  • public website: /sost-help.html with client-side search and noscript fallback, /sost-miner-troubleshooter.html with paste-a-log analysis that runs entirely in the browser
  • import helper scripts/import_public_help_pack.py: validates the pack, refuses on missing safety_manifest / checksums / banned phrases, copies the JSON into website/data/; never auto-runs git-add / commit / push
  • defaults: no live AI on the public site, no chatbot, no log upload, no autopilot deploy
M10 — Outcome Learning + Operator Command Center + Miner Support Triage
SHIPPED

The closed feedback loop. Deterministic, rule-based, internal-only — no neural model, no network, no autopublish. The engine now records what actually happened to a candidate after it left review (DFT result, GeaSpirit verdict, public-wording correction, miner outcome, provider contradiction), turns each event into a clamped boost/penalty adjustment, and remembers the pattern for future ranking. The operator command center reads the same signal and recommends P0/P1/P2/P3 actions; the miner support triage classifies free-form miner text into a structured case and drafts a conservative reply that always passes through the public claim guard before being marked as low-risk. The public claim guard + contradiction resolver remain the only paths to public output.

  • M10-1 — Outcome Learning Core: typed OutcomeEvent log (28 outcome types across materials / GeaSpirit / useful compute / SOST / DEX / provider), deterministic derive(event) mapping to LearningAdjustment + PatternLesson rows, hard caps MAX_BOOST=0.30 / MAX_PENALTY=0.40 / MAX_NET_DELTA=0.50, pattern-memory upsert with +0.02 confidence bump per repeat (capped at 1.0), and a Markdown report under reports/ai_engine/learning/
  • M10-2 — Operator Command Center: read-only priority recommender that surfaces P0 (public-claim emergencies, GeaSpirit overclaim/depth guards), P1 (repeated miner issues, provider overclaims, stuck review packs), P2 (memory hygiene), P3 (periodic refreshes); work-queue snapshot; per-page dashboard text; Markdown report bundle under reports/ai_engine/operator/<ts>/ (index.md, actions.md, state.md, dashboard.txt)
  • M10-3 — Miner Support Triage + Reply Drafts: regex-based classifier across 12 categories (install / build / sync / rejected_block / orphan / stale_parent / no_peers / threads / useful_compute / wallet_safety / dex / unknown), per-case Markdown log under reports/ai_engine/support_cases/<ts>/, conservative community-reply drafter (wallet_safety replies are never auto-safe — they always require human ack), release-notice drafter, and help-refresh suggester that proposes Q/A items the operator can review (the suggester itself never publishes)
  • unified CLI: scripts/sost_ai_ops.py with subcommands status, next-actions, risks, review-packs, learning, providers, miner-support, public-help-suggestions, full-report
  • safety invariants: is_safe_for_export() blocks targets {false_positive_risk, guard, task_design, provider_reliability} from leaving the engine as such; every adjustment passes through clamp_adjustment() before insertion; the loop is replayable, auditable and deterministic by construction
  • defaults: pure Python stdlib, no torch / sklearn / numpy, no remote calls, no public publication; M10 outputs feed only the internal review pipeline that already gates everything public via M9
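
The clamp can be pictured as below; the three caps are quoted from the M10-1 bullet, while the running-net bookkeeping is an assumption about how MAX_NET_DELTA is enforced:

```python
MAX_BOOST, MAX_PENALTY, MAX_NET_DELTA = 0.30, 0.40, 0.50

def clamp_adjustment(delta: float, running_net: float = 0.0) -> float:
    """Clamp one boost/penalty and keep the cumulative effect bounded."""
    # per-event caps: boosts and penalties have different ceilings
    delta = min(delta, MAX_BOOST) if delta >= 0 else max(delta, -MAX_PENALTY)
    # cumulative cap: the net effect on a pattern stays within +/- MAX_NET_DELTA
    capped_net = max(-MAX_NET_DELTA, min(MAX_NET_DELTA, running_net + delta))
    return capped_net - running_net  # the adjustment actually applied
```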
M11 — Scientific Validation Planner + Dossier Factory
SHIPPED

From "this candidate is interesting" to "this is the concrete next step, this is the experiment, this is what would confirm or kill it". Internal-only, deterministic, human-approved. A hypothesis becomes a structured dossier with current evidence, missing evidence, recommended path (literature / CHGNet / DFT input / GeaSpirit layer review / Useful Compute task design), draft experiments, deterministic pass/fail criteria, coarse compute-cost class, P0–P4 priority and publishability tag (internal_only by default). The AI may design validation work; it never executes heavy jobs and never publishes scientific claims.

  • M11-1 — Validation Dossier Core: ValidationDossier dataclass + 3 SQLite tables (validation_dossiers, validation_experiments, dossier_index), validation-plan router, ExperimentSpec dataclass with execution_allowed=False by default, deterministic pass/fail criteria, coarse cost-class estimator, priority + publishability policy, and the dossier renderer that writes dossier.md, dossier.json, experiment_plan.json, go_no_go.md, commands_draft.sh (mode 0644, every line #-commented), README.md, checksums.sha256 (the draft-only spec shape is sketched after this list)
  • M11-2 — Materials Validation Planner: CHGNet / DFT plan builders (relaxation / static / band-structure / elastic-constants), Materials Go/No-Go (toxic / radioactive elements without strategic_value=high force HOLD; PGM catalysts always flag cost-of-deployment risk; literature + CHGNet support required for DFT GO unless P0/P1 promotion), per-experiment templates that NEVER invent energies, gaps or stability conclusions
  • M11-3 — GeaSpirit Validation Planner: layer-gap analyzer (geology, lithology, magnetics, gravity, AEM, EMIT, drilling, geochemistry, false-positive filters, spatial block validation), publication-safety reviewer (depth claims based on satellite alone are BLOCKED; depth confirmation requires magnetics / gravity / AEM / drilling + human approval), single-sensor anomaly without geology = NO-GO
  • M11-4 — Useful Compute Heavy Task Designer: HeavyTaskSpec with input/output schema, deterministic requirements, verification method, runtime target, dependency requirements, hardware requirements, and risk lists; conservative go/no-go (fake CPU burn / busy-wait / loop-forever → REJECT; too-light tasks → REJECT; paid / proprietary deps → REJECT; DFT-class without pinned version + pseudopotentials + tolerances + container → HOLD needs_reproducibility_solution); benchmark + verification plans as DRAFT-ONLY; strongest verdict the AI can issue is ready_for_benchmark — the rewarded phase remains gated by a separate human decision
  • M11-5 — Dossier Index + Operator Integration: denormalised dossier_index for fast filtered queries, substring search by subject + project + type, Markdown operator summary grouped by project / priority / publishability, integration into sost_ai_ops.py validation-dossiers and into the full-report bundle
  • safety invariants: dossiers default to internal_only; commands_draft.sh is non-executable by construction; Materials candidates and GeaSpirit AOIs always default to internal_only; Useful Compute campaigns always default to human_review_required; public exposure routes through the existing M9 export pipeline + human approval
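
The draft-only experiment shape from M11-1 can be sketched like this; the field set is trimmed and the guard helper is an illustrative assumption, but execution_allowed=False-by-default is exactly the invariant described above:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    name: str
    method: str                      # e.g. "chgnet_relaxation", "dft_static"
    pass_criteria: list = field(default_factory=list)
    fail_criteria: list = field(default_factory=list)
    cost_class: str = "unknown"      # coarse estimate only
    execution_allowed: bool = False  # draft-only by construction

def assert_draft_only(spec: ExperimentSpec) -> ExperimentSpec:
    # belt-and-braces: refuse any spec that claims execution rights
    if spec.execution_allowed:
        raise PermissionError("ExperimentSpec.execution_allowed must stay False")
    return spec
```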
M12 — Validation Campaign Orchestrator
SHIPPED

From individual dossiers to organised campaigns. Internal-only, deterministic, human-approved. M11 decides what might be valid; M12 decides what to validate first. A typed batch of dossiers is selected by a per-type policy, ranked, and packaged into a draft-only execution pack (campaign.md, manifest.json, selected_dossiers.jsonl, budget.md, risk_report.md, manual_execution_plan.md, do_not_run_automatically.md, README, checksums) that the operator reviews before any compute or publication.

  • M12-1 — Campaign Core: ValidationCampaign dataclass + 4 SQLite tables (validation_campaigns, campaign_dossiers, campaign_approvals, campaign_index), 12 campaign types (literature / CHGNet / DFT input / DFT relaxation draft / GeaSpirit layer review / GeaSpirit public safety / Useful Compute heavy benchmark design / Useful Compute reproducibility / Useful Compute fake-heavy rejection / Useful Compute schema design / public-wording safety / miner support improvement), status transitions ONLY through record_approval(...) (draft → ready_for_human_review → approved_for_manual_execution → completed), execution_allowed=False default that the system never flips, deterministic per-type selector + family/application diversification + coarse budget estimator (the status machine is sketched after this list)
  • M12-2 — Materials Validation Campaigns: M10 outcome bias re-rank (per-family boost / penalty pulled from learning_adjustments), CHGNet excludes red-flag candidates, DFT-class campaigns require literature + CHGNet support OR an explicit P0/P1 promotion, family-aware diversification (at most two same-family candidates before falling back), per-pack extras: selected_materials.csv, selected_materials.jsonl, go_no_go.md
  • M12-3 — GeaSpirit Validation Campaigns: AOI-aware re-rank (richer layer inventory first for layer-review; blocked claims first for public-safety), aggregator that turns per-AOI verdicts into a campaign-level go / hold / block_public, per-pack extras: selected_aoi_claims.jsonl, layer_gap_matrix.csv, public_safety_matrix.md, go_no_go.md
  • M12-4 — Useful Compute Heavy Campaigns: readiness-based bucketing (ready_for_benchmark / needs_reproducibility / fake_heavy / needs_schema_design), strongest verdict is "go to internal benchmark" — never "go to rewarded phase", per-pack extras: selected_heavy_tasks.jsonl, benchmark_matrix.csv, reproducibility_matrix.md, input_output_schema_needs.md, go_no_go.md
  • M12-5 — Campaign Operator Integration: campaign_index rebuild + filtered queries, substring search by title + type + project, Markdown operator summary, sost_ai_ops.py validation-campaigns and campaign-next-actions subcommands, automatic inclusion in the full-report bundle
  • safety invariants: AI does NOT generate Useful Compute task queues, NOT modify task_server / worker, NOT activate rewards, NOT run CHGNet / DFT, NOT download rasters, NOT run GIS pipelines; manual_execution_plan.md is mode 0644 with every line shell-commented (sh manual_execution_plan.md is a no-op); do_not_run_automatically.md ships in every pack as an explicit reminder
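
The status machine reduces to one linear path; this sketch keeps only the transition rule (persisting the approval row and the approver identity is omitted, and the signature is an assumption):

```python
_NEXT_STATUS = {
    "draft": "ready_for_human_review",
    "ready_for_human_review": "approved_for_manual_execution",
    "approved_for_manual_execution": "completed",
}

def record_approval(current_status: str) -> str:
    """Advance a campaign one step; no other transition path exists."""
    if current_status not in _NEXT_STATUS:
        raise ValueError(f"no transition allowed from {current_status!r}")
    return _NEXT_STATUS[current_status]
```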
M13 — Autonomous Discovery Theorist + Theory Graph
DEFERRED

Planned Phase 3 milestone. The discovery theorist will turn the M11 dossier graph + M12 campaign outcomes into typed scientific theories, with explicit pre-conditions, observable predictions, refutation paths, and a typed theory-graph that links theories to dossiers and to one another. Not yet shipped. The M14 console exposes the discover action but currently returns a safe unavailable envelope so callers can already wire the route end-to-end while the engine is being designed.

  • theory dataclass: id, statement, scope, pre-conditions, predictions, refutation_paths, evidence-links, confidence_class, publishability (always internal_only by default)
  • theory-graph SQLite tables (theories, theory_links, theory_evidence) with deterministic IDs and stable hashes for reproducible snapshots
  • local-only theory generator: deterministic, stdlib-only, no LLM by default; AI Council optional in a future milestone for free-only / paid-locked operation
  • operator integration: theory dossiers under reports/ai_engine/discovery_dossiers/<id>/ and a Markdown summary for the operator command center
  • safety invariants: same as M11/M12 — nothing publishes automatically, no network, no paid AI; theories never override an existing M11 dossier verdict
M14 — SOST AI Engine Private Console
SHIPPED

Localhost-only web console for operating the M1–M12 capabilities of the engine through prompts, quick actions, evidence panels and risk flags. Not exposed on sostcore.com. Never publishes anything. Never runs heavy compute. No paid AI. Started with python3 scripts/sost_ai_console.py, prints a one-shot URL with an ephemeral token, scrubs the token from the address bar after first paint, and keeps it in a JS closure for the lifetime of the page.

  • M14-1 — Backend + Access Control: stdlib-only ThreadingHTTPServer bound to 127.0.0.1 by default; 0.0.0.0 rejected unless both --unsafe-bind-all and --i-understand-this-exposes-the-private-console are passed; secrets.token_urlsafe(32) session token, never persisted; constant-time hmac.compare_digest bearer match (sketched after this list); positive read/write allowlists; no shell helper exists in the codebase; 50 tests
  • M14-2 — Private UI: vanilla HTML/CSS/JS, no CDN, no external fonts/scripts/styles, no cookies, no localStorage, no eval(), no Function(); CSP default-src 'self'; CSS dark command-center aesthetic with cyan/green/gold accents; sidebar with 14 sections; chat panel with project + mode selectors, safety badges (NETWORK OFF, PAID LOCKED, PUBLICATION LOCKED, LOCALHOST ONLY), 24+ canned quick prompts, evidence drawer with colour-coded publishability, miner-log triage screen, reports browser, settings panel; token never written to any DOM node; 22 tests
  • M14-3 — Action Integration: 10 actions wired to existing modules — ask/ideate via ask_engine.ask; validate/public_wording_review via public_claim_guard.scan_text; create_dossier via M11 materials/geaspirit/useful-compute planners + insert_dossier + dossier_renderer.render; create_campaign via campaign_renderer.build_and_render (execution_allowed=False unconditionally); triage_miner_log via miner_support_triage; draft_reply via community_reply_drafter; next_actions via operator_command_center; discover returns a safe-unavailable envelope until M13 lands; legacy do_not_publish publishability mapped onto canonical blocked; 48 tests
  • M14-4 — Chat History + Exports: SQLite-backed console_sessions + console_messages + console_actions tables (idempotent migrations on top of the engine's existing persistence layer); console_conversation high-level API (new_chat / rename_chat / append_user / append_assistant / list_chats / load_chat); console_export writes reports/ai_engine/console_exports/<UTC>_<sid8>/ with conversation.md, manifest.json (schema sost_ai_console_export@v1), and checksums.sha256; clearing local history requires the literal confirmation "YES_DELETE_LOCAL_HISTORY"; 16 tests
  • M14-5 — Operator Guide + Smoke: live HTTP smoke tests on a random port and per-test fresh SQLite db: token gate works, paid + publication remain locked, "Useful Compute rewards active" never returns public_safe, miner-rejected-block log triages correctly, dossier creation writes a folder under reports/ai_engine/validation_dossiers/..., no website/ writes occur during a typical action flow, discover returns the M13 unavailable envelope; comprehensive operator guide in docs/multi_ai_console_operator_guide.md; 11 tests
  • safety invariants: localhost only by default, ephemeral token, no permanent passwords, no eval() / Function() / inline scripts / inline event attributes, paid + publication + heavy execution all locked, only allowlisted directories are readable / writable, OPTIONS preflights rejected (no CORS surface), and SSH tunnel is the documented remote-access pattern (no public port)
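
The bearer check reduces to a constant-time comparison against an ephemeral in-process token; the handler wiring here is an illustrative assumption:

```python
import hmac
import secrets

SESSION_TOKEN = secrets.token_urlsafe(32)  # printed once at startup, never persisted

def bearer_ok(authorization_header: str) -> bool:
    if not authorization_header.startswith("Bearer "):
        return False
    presented = authorization_header[len("Bearer "):]
    # constant-time comparison avoids timing side channels
    return hmac.compare_digest(presented.encode(), SESSION_TOKEN.encode())
```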
M15 — Console Persistence + Operator Workflow
SHIPPED

Turns the M14 console into a daily cockpit: every action is now persisted in SQLite, conversations can be reloaded and exported, the action surface covers review-pack / outcome / daemon / operator-status / learning-report / dossier-list / campaign-list, and a strict report browser lets the operator read internal artifacts without ever leaving the safe allowlist. All security locks remain in place.

  • M15-1 — SQLite Chat Persistence: the routes layer auto-writes a turn-pair into console_messages + console_actions on every persistable action; the in-memory ring stays as a lightweight audit cache. ConsoleState.current_session_id is filled lazily on the first action. New endpoints GET /api/history, GET /api/history/load, POST /api/history/{new,rename,clear} with explicit confirm="CLEAR_CONSOLE_HISTORY" guard. The session token NEVER appears in any persisted row. (14 tests)
  • M15-2 — Full Safe Action Wiring: review_pack → project_observer.snapshot_all + review_pack.build_review_pack; outcome_record → OutcomeEvent + outcome_ingestor.ingest; daemon_once → autonomous_daemon.run_once with dry_run=True always, allow_paid=False, console-side hard caps max_tasks ≤ 25 / max_runtime ≤ 300 s; operator_status → full operator_command_center.collect; operator_risks → P0/P1 filter; learning_report → learning_report.generate; validation_dossiers / validation_campaigns → their respective M11/M12 indexes. (14 tests)
  • M15-3 — Report Browser 2.0: stdlib report browser with positive read allowlist. resolve_safe(rel) rejects absolute paths, parent-traversal, ".." inside any segment, and symlinks that escape reports/ai_engine/. New routes GET /api/reports/{tree,view,search} — HTML files are tagged html_escaped, never rendered as live HTML. Search is case-insensitive, capped at 200 hits and 64 KB per file. The path check is sketched after this list. (15 tests)
  • M15-4 — UI Workflow Polish: sidebar CHATS list backed by /api/history; + NEW button; EXPORT button (copies the exact CLI command to the clipboard — the export script writes to disk, not the JS); CLEAR button with a confirmation modal; creativity selector (conservative / speculative / wild / fantastic / red_team) sent on every prompt; per-mode placeholder hints; run-status spinner on the Send button. No new external dependencies; eval() / Function() / external scripts still absent. (17 tests)
  • M15-5 — Provider Status Panel: GET /api/providers/status returns booleans only: ollama_available (via shutil.which, never executed), paid_locked: true, publication_locked: true, network_enabled: false by default, plus six *_token_present booleans (OPENROUTER / HUGGINGFACE / ANTHROPIC / OPENAI / GROQ / TOGETHER) — only bool(env_var), never the value itself. The status reader makes no network call. (7 tests)
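
The M15-3 path check might look like the sketch below (assuming Python 3.9+ for is_relative_to; the error handling is an assumption):

```python
from pathlib import Path

ALLOWED_ROOT = Path("reports/ai_engine").resolve()

def resolve_safe(rel: str) -> Path:
    p = Path(rel)
    if p.is_absolute() or any(part == ".." for part in p.parts):
        raise PermissionError(f"unsafe path: {rel!r}")
    candidate = (ALLOWED_ROOT / p).resolve()  # also resolves symlinks
    if not candidate.is_relative_to(ALLOWED_ROOT):
        raise PermissionError(f"path escapes the allowlist: {rel!r}")
    return candidate
```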
M16 — Speculative Discovery Lab + Real Ideation Engine
SHIPPED

The console no longer answers "Generate 20 membrane ideas" with generic policy text. M16 turns the SOST AI Engine into an internal speculative scientist: it invents structured hypotheses with mechanism, falsifier and validation path; it separates plausible from speculative-but-testable from wild from fantastic; and it self-criticizes every idea. Everything stays internal_only. Nothing publishes, nothing executes DFT/CHGNet/GIS, nothing uses paid AI.

  • Idea schema: Idea dataclass with stable 16-char id (sha256 of project|domain|title), five score fields clamped to [0,1] (novelty / plausibility / utility / falsifiability / absurdity), seven evidence levels (locally_supported / plausible / speculative_but_testable / wild_but_testable / fantastic_unvalidated / non_testable_now / rejected), publishability defaulting to internal_only, and a priority() rule that drops to P3 for any non-testable idea regardless of score.
  • Five creativity dials: conservative / speculative (default) / wild / fantastic / red_team. The dial drives both the generator's score priors and the evidence-level distribution; conservative caps absurdity, fantastic admits non-testable, red_team is the self-critic.
  • Six domain generators: materials/membrane (6 family × 8 target combinatorics with 7 stock failure modes), materials/catalyst (6 family × 7 reaction), materials/photovoltaic (5 family × 5 risk), materials/ion_separation (8 ion targets), geaspirit/theory (8 cross-layer theories with public-safety risks), useful_compute/heavy_task (7 task kinds with fake-heavy guards). Each uses a sha256-derived RNG so the same prompt reproduces the same set (sketched after this list).
  • Self-critic / red-team: per-idea Critique with strongest_for / strongest_against / easiest_falsifier / most_likely_failure / promotes_if / kills_if. Every idea has a falsifier; ideas with falsifiability_score < 0.3 cannot rank above low priority.
  • Learning integration: the scorer reads M10 learning_adjustments and lowers patterns previously rejected, flagging them known_weak_pattern. Wild / fantastic creativity may still revisit weak patterns but they are clearly labelled.
  • Output folder: every batch writes reports/ai_engine/ideas/<UTC>_<slug>/ with ideas.md, ideas.jsonl, ranking.csv, falsifiers.md, validation_paths.md, risk_report.md, manifest.json (schema sost_ai_ideas@v1), and checksums.sha256 covering the other files. idea_index SQLite table for fast filtered queries.
  • DFT priority fix: the canonical "What candidates deserve DFT?" prompt now returns either a ranked candidate table built from M11 dossiers + M12 campaigns, or an explicit "no concrete candidate" message that points at the standalone scripts. Never claims DFT validation when no DFT artifact exists.
  • Console wiring: the ideate action returns the numbered idea list (mechanism / why-might-work / why-might-fail / first-test / falsifier / next step per idea). The discover action uses the same engine with creativity="fantastic" by default; the M13 typed-theory-graph deferral is surfaced as a warning, not a refusal. (38 tests)
  • CLI: scripts/multi_ai_generate_ideas.py mirrors the engine for offline use: --project, --domain, --count, --creativity, --prompt, --json, --no-render, --no-persist.
  • safety invariants: every idea is internal_only by default; fantastic / non_testable_now ideas keep that publishability; the renderer honours the M14 write allowlist; the engine never calls the network and never spawns a subprocess; all randomness is deterministic from the prompt + count + creativity tuple.
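
The two deterministic pieces named above (the stable 16-char id and the prompt-seeded RNG) can be sketched like this; the seed derivation details are assumptions:

```python
import hashlib
import random

def idea_id(project: str, domain: str, title: str) -> str:
    return hashlib.sha256(f"{project}|{domain}|{title}".encode()).hexdigest()[:16]

def seeded_rng(prompt: str, count: int, creativity: str) -> random.Random:
    seed = hashlib.sha256(f"{prompt}|{count}|{creativity}".encode()).digest()
    return random.Random(int.from_bytes(seed[:8], "big"))

# Same (prompt, count, creativity) tuple -> same idea set, by construction.
a = seeded_rng("membrane ideas", 20, "speculative")
b = seeded_rng("membrane ideas", 20, "speculative")
assert a.random() == b.random()
```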
M17 — Materials General Answerer (practical recommendations)
SHIPPED

Closes the answer-quality bug reported on M16: the console used to fall through to "General internal answer — see project context above." for practical materials questions like "que material es el mejor para hacer nanopartículas fácilmente?" ("which material is best for easily making nanoparticles?"). M17 routes those prompts through a structured answerer that returns a real ranked recommendation with mechanism, advantages, risks and first validation step per option. Pure stdlib, no network, no paid AI, no DFT claim.

  • Topic library: four practical-recommendation topics ship in M17 — nanoparticles, membranes, catalysts and abundant-element photovoltaics. Each topic carries title, summary lines, an option table (label, formula, ease, advantages, risks, first validation step) and a by-objective recommendation block.
  • Practical-cue gate: the answerer only fires when the prompt contains a recommendation cue (best / mejor / easiest / cheapest / recommend / which / how to / qué material). Bare keyword prompts still go through the standard flow, so the engine does not over-trigger. The cue check is sketched after this list.
  • Project gate: materials-only in M17. GeaSpirit, useful_compute and other projects keep their existing flow until their own answerers land.
  • Canonical nanoparticle answer: Au (citrate / Turkevich, plasmonic, expensive), SiO₂ (Stöber / sol-gel, cheap, scalable), Fe₃O₄ (coprecipitation, magnetic), TiO₂ (sol-gel, photocatalytic), Ag (Tollens / NaBH4, antimicrobial with aggregation+toxicity caveats). By objective: Au for demonstration / SiO₂ cheap-versatile / Fe₃O₄ magnetic / TiO₂ photocatalysis / Ag antimicrobial.
  • Evidence stamp: every answer states evidence_source: general_scientific_reasoning, confidence: medium, publishability: internal_only and no local DFT / CHGNet / lab artifact was consulted — this is a practical recommendation, not a validation. The engine never claims DFT validation without an actual artifact.
  • Wire-in: answer_composer.compose() consults the answerer before falling back to the legacy "see project context above" line. If the answerer matches, its body replaces the placeholder and its next_actions are merged into the result envelope.
  • safety invariants: no network call, no paid AI, no auto-publication, internal_only by default, no DFT validation claim. Every recommendation is offered as a starting point for a validation dossier or literature review — not as an answer to be published.
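
The practical-cue gate is a simple containment check; the cue list is quoted from the bullet above, the function shape is an assumption:

```python
CUES = ("best", "mejor", "easiest", "cheapest", "recommend", "which",
        "how to", "qué material")

def is_practical_recommendation(prompt: str) -> bool:
    p = prompt.lower()
    return any(cue in p for cue in CUES)
```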
M18 — Project Registry + Capability Gates (multi-repo orchestration)
SHIPPED

Closes the multi-repo gap reported on M17 / unified-lab launch: the read-only adapters used to hard-code the project root to materials-engine-private/, so the AI engine never actually saw the sibling GeaSpirit / SOST core / GeaDeep / Materials Discovery repos. M18 ships a stdlib JSON registry that declares every repo the engine may read, plus five canonical capability gates with strict deny-by-default semantics. The engine still cannot publish, run DFT, touch consensus or call the network unless the registry explicitly allows it — and the shipped registry never does.

  • Registry file: src/multi_ai_review/project_registry.json declares six projects: materials (~/SOST/materials-engine-private, the only project with can_write_repo: true), geaspirit (~/SOST/geaspirit, read-only sibling), geadeep (~/SOST/geadeep-energy-private, read-only sibling), sost (~/SOST/sostcore/sost-core, read-only; consensus surface), materials_discovery (~/SOST/materials-engine-discovery, read-only archive), useful_compute (lives inside materials-engine-private; read-only). Schema tag sost_ai_project_registry@v1.
  • Loader (stdlib only): project_registry.py with lazy + thread-safe cache. Resolution order: explicit path argument > SOST_AI_PROJECT_REGISTRY env var > the JSON next to the loader. A missing file, invalid JSON, wrong schema or unknown project all yield an empty registry — deny-by-default still applies.
  • Five canonical gates as pure functions: can_run_dft(project), can_publish(project), can_touch_consensus(project), can_use_network(project), can_write_repo(project). Plus gate_summary(project) for status responses. Every action that touches one of these axes consults the gate before running.
  • Deny-by-default semantics: unknown project → all gates False. Project present but flag absent → False (the registry-wide defaults are themselves all False). Only an explicit "can_X": true in a project block flips a gate (sketched after this list). Belt+braces in the test suite asserts that the shipped JSON never grants can_touch_consensus, can_publish or can_run_dft for any project.
  • Adapter integration (back-compat preserving): each adapters/*_readonly.py keeps the historical parents[3] as an explicit _FALLBACK_ROOT and adds a _repo_root() helper that consults the registry first. The module-level _REPO_ROOT alias is preserved so existing callers keep working. The live unified-lab daemon picks up the new resolution on next process restart without disruption mid-loop.
  • Concrete effect on the unified lab: GeaSpirit adapter now reads from ~/SOST/geaspirit (the real sibling repo, not the materials repo). SOST adapter now reads from ~/SOST/sostcore/sost-core. Useful Compute remains read-only inside materials-engine-private. Materials is the only project that can be written to from the AI engine.
  • safety invariants: no project grants consensus / publish / DFT in the shipped registry; the canonical write directories declared by console_security.ALLOWED_WRITE_SUBDIRS still apply on top of the gates; the daemon still runs with --network off. 22 new tests; 1,188 / 1,188 total green.
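
The deny-by-default semantics collapse to one lookup rule; this sketch assumes the registry shape described above, and the key names and loading details are illustrative:

```python
import json
from pathlib import Path

def load_registry(path: Path) -> dict:
    try:
        data = json.loads(path.read_text())
        if data.get("schema") != "sost_ai_project_registry@v1":
            return {}  # wrong schema -> empty registry -> deny everything
        return data.get("projects", {})
    except (OSError, json.JSONDecodeError):
        return {}  # missing/invalid file -> deny everything

def gate(registry: dict, project: str, flag: str) -> bool:
    # absent project or absent flag both fall through to False
    return bool(registry.get(project, {}).get(flag, False))

def can_run_dft(registry: dict, project: str) -> bool:
    return gate(registry, project, "can_run_dft")

def can_publish(registry: dict, project: str) -> bool:
    return gate(registry, project, "can_publish")

registry = load_registry(Path("project_registry.json"))
assert can_run_dft(registry, "unknown_project") is False  # deny-by-default
```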
M19 — Outcome Ledger + Autonomous Scoreboard + Project Heartbeat
SHIPPED

Closes the observability gap reported on the unified-lab launch: the daemon was running but its trail was not measurable. M19 adds three small layers that make autonomy verifiable — without granting it any new permissions. Every task the engine plans, dispatches or refuses leaves a row; every cycle leaves a heartbeat; the scoreboard rolls both up into a Markdown + JSON dashboard.

  • Task-level outcome ledger: outcome_ledger.py with the seven canonical statuses (planned / executed / skipped / failed / useful / wrong / repeated). Stores project, task_type, subject, summary, gate_blocked flag + reason, input/output hashes, saved_path. Helpers: record, list_outcomes, counts_by_status, positive_negative_ratio, gate_block_rate, already_seen(input_hash) for repeat detection (sketched after this list). Distinct from M10's subject-level outcome events — both layers coexist.
  • Per-cycle project heartbeat: project_heartbeat.py records one row per project per daemon cycle: tasks_generated, tasks_useful, tasks_failed, gate_blocks, routes_read, memory_updates, free-form notes. Operators can answer "is materials still learning, or just looping?" at a glance.
  • Autonomous scoreboard: autonomous_scoreboard.py rolls the ledger + last heartbeat + the eight registry gates per project up into a payload tagged sost_ai_autonomous_scoreboard@v1. Renders both autonomous_scoreboard.md and autonomous_scoreboard.json under reports/ai_engine/operator/ (already on the M14 write allowlist — no allowlist change needed).
  • Three new capability gates (deny-by-default): can_execute_heavy_task (DFT/CHGNet/GIS), can_create_public_draft (public-facing artifact prep), and can_update_memory (write into the AI engine's learning memory). Every project denies all three by default; only materials grants can_update_memory (its own memory). The full canonical list is now eight gates wide.
  • Belt+braces invariants (asserted by the test suite): no project grants can_execute_heavy_task; no project grants can_create_public_draft; only materials grants can_update_memory; unknown project denies all eight gates; the M18 invariants on the original five gates remain unchanged.
  • CLI: scripts/multi_ai_scoreboard.py renders the dashboard (--db / --out-dir / --since-hours / --json). Live smoke against an empty ledger returns the canonical schema tag, six projects, eight gates per project, and zeroed totals across the seven statuses.
  • safety invariants: M19 ships no permission grants beyond can_update_memory=true for materials. Still no auto-DFT, no auto-publication, no auto-commits, no consensus surface. The unified-lab daemon picks up the new tables on next process restart; no mid-loop disruption. 26 new tests; 1,214 / 1,214 total green.
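
Repeat detection via input hashes reduces to an existence query; the table and column names here are assumptions:

```python
import sqlite3

def already_seen(conn: sqlite3.Connection, input_hash: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM outcome_ledger WHERE input_hash = ? LIMIT 1",
        (input_hash,),
    ).fetchone()
    return row is not None
```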
M20 — Experiment Planner + Replay Sandbox (counterfactual rule evaluation)
SHIPPED

Closes the "method" gap: the engine could ideate (M16), recommend (M17), and run autonomously (M19), but it could not ask itself whether its proposed rules would have actually worked on past data. M20 ships the scientific-method layer: hypothesis → plan → counterfactual replay against M10 history → GO / MAYBE / NO-GO → memory. No new execution permissions — the engine still cannot run anything, only plan and simulate.

  • Baseline registry (auto-derived from M10): baseline_registry.py reads M10's outcome_events table and classifies each row as positive (e.g. dft_success, chgnet_stable, literature_supported, human_promoted, provider_useful_answer...) or negative (dft_failure, chgnet_red_flag, human_rejected, provider_overclaim_confirmed...). Optional baseline_overrides.json next to the engine wins on conflict. Baselines dedupe per (project, subject) — twelve "FeS2 promoted" events count as one positive signal, not twelve.
  • Experiment planner: experiment_planner.py + experiment_plans SQLite table. plan(hypothesis, project) picks project-aware step lists: photovoltaic-flagged materials hypotheses get a band_gap_estimate (DFT input draft only — never run) step; generic materials get the membrane/catalyst step list; geaspirit gets layer-gap + false-positive + publication-safety; useful_compute gets schema design + fake-heavy + reproducibility. Stable plan_id (sha256-derived) so the same hypothesis at the same time produces the same id.
  • Replay sandbox — counterfactual evaluator: replay_sandbox.py. A Rule is a callable (subject, context) -> bool. replay() applies the rule to the historical positives and negatives and returns precision, recall, accuracy, F1 plus the TP/FP/TN/FN counters and a verdict. Decision matrix: precision ≤ 0.30 OR recall ≤ 0.20 → NO-GO; precision ≥ 0.65 AND recall ≥ 0.50 → GO; otherwise MAYBE.
  • Operator's golden rule: "if there is no historical signal, there is no certainty." The sandbox enforces a strict ceiling: historical_count < min_historical (default 20) means the strongest possible verdict is MAYBE — never GO, regardless of how clean precision and recall look. The system stays honest about its own confidence. The combined decision rule is sketched after this list.
  • Renderer: writes reports/ai_engine/experiments/<UTC>_<slug>/ with plan.md, plan.json, replay_result.json, decision.md (GO/MAYBE/NO-GO + reason) and checksums.sha256. Honours the M14 write allowlist; new entry reports/ai_engine/experiments added.
  • Three new capability gates (deny-by-default): can_plan_experiment and can_replay_experiment are true for materials / geaspirit / useful_compute (the projects that produce hypotheses); false for sost / geadeep / materials_discovery. can_execute_experiment is false for every project and asserted by the test suite. The canonical gate set is now eleven wide.
  • CLIs: scripts/multi_ai_plan_experiment.py turns a hypothesis into a plan + on-disk folder; scripts/multi_ai_replay_experiment.py counterfactually evaluates a keyword-based rule against M10 history and prints the verdict. Both refuse to run on projects that lack the gate.
  • safety invariants: M20 ships no execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. Belt+braces tests assert can_execute_experiment is denied for every registered project. 25 new tests; 1,239 / 1,239 total green.
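
The decision rule, with the thresholds and the min-historical ceiling quoted above (the function shape, and the reading that the ceiling caps GO but leaves NO-GO intact, are assumptions):

```python
def verdict(precision: float, recall: float, historical_count: int,
            min_historical: int = 20) -> str:
    if precision <= 0.30 or recall <= 0.20:
        return "NO-GO"
    if historical_count < min_historical:
        return "MAYBE"  # "no historical signal, no certainty": never GO
    if precision >= 0.65 and recall >= 0.50:
        return "GO"
    return "MAYBE"
```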
M21 — Canonical Mission Engine + AI Scientist Swarm (six roles + contradiction-to-discovery)
SHIPPED

Closes the "criterion" gap: the engine could ideate, plan, replay and score, but it had no explicit compass. M21 codifies the project objectives, runs every hypothesis through a six-role scientist swarm, mines contradictions for refinement opportunities, and persists the canonical memory of the cycle. The system stops being "the AI that proposes things" and starts being "the AI that asks itself if its proposals would have worked" — guided by an operator-defined mission. No new execution permissions.

  • Canonical objectives: canonical_objectives.py codifies eight materials objectives (defensible_discovery, cost_abundance_stability, reduce_false_positives_pre_dft, catalysts_no_pgm, non_toxic_pv, hydrogen_proton_membranes, water_desalination, industrial_robust), six geaspirit objectives (real_world_systems, deep_sea_environment, cost_near_concrete, proxy_vs_proof, multi_layer_evidence, public_safety_review), two geadeep objectives (deep_sea_energy, manufacturability_at_scale) and two useful_compute objectives (determinism, reproducibility). Each objective carries align_keywords + kill_criteria + proxies as data.
  • Mission alignment scorer: mission_alignment.py scores a hypothesis against the canonical objectives for its project. Per-objective score in [0, 1], aggregate weighted score, fired-kill-criteria flag, matched-objective count. Refuses to run when can_rank_mission_alignment is denied.
  • Six-role AI scientist swarm: ai_scientist_swarm.py runs six pure-function roles on every hypothesis — Discoverer (bold expansion), Skeptic (kill-criteria + weak proxies), Engineer (fabricable / lab_only / unknown), Economist (cheap / moderate / expensive via Pt/Pd/Rh/In/Ga/Cd/Te keywords), Validator (cheapest test menu, project-aware), Historian (consults outcome_ledger for revisit_failed / consistent_with_history / contested_history). Mean of the six role scores becomes the swarm score.
  • Contradiction-to-Discovery loop: critique_loop.py turns disagreements between roles into structured opportunities: "is there a doping/sibling that resolves the contradiction? does the candidate serve another application better? is the contradiction a false positive in the predictor? does the candidate deserve a plan or an archive?". Archive policy: low swarm_score (< 0.25), or revisit_failed + weak discoverer, or kill-criterion-fired + score < 0.4 forces archive (the archive rule is sketched after this list).
  • Canonical memory: canonical_memory.py persists per-cycle summaries (mission_cycles table) and per-hypothesis records (mission_hypotheses table). Gated by can_update_memory from M19 — only materials writes here.
  • Three new capability gates (deny-by-default): can_generate_hypothesis, can_critique_hypothesis and can_rank_mission_alignment are true for materials, geaspirit, useful_compute; false for sost, geadeep, materials_discovery. The canonical gate set is now fourteen wide (5 from M18 + 3 from M19 + 3 from M20 + 3 from M21).
  • CLI: scripts/multi_ai_mission_cycle.py runs the full pipeline end to end: read objectives -> M16 generate N hypotheses -> swarm + critique loop -> mission_alignment -> top-K -> M20 plan + replay -> render reports/ai_engine/mission_cycle/<UTC>_<slug>/ with cycle_manifest.json, hypotheses.jsonl, top_picks.md, contradictions.md, next_best_actions.md; per-hypothesis row in the M19 outcome ledger + one heartbeat row.
  • safety invariants: M21 grants no execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. can_update_memory remains true only for materials. The live unified-lab daemon keeps running unchanged — M21 modules are ready but not yet wired into its hot path. 33 new tests; 1,272 / 1,272 total green.
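
The archive rule can be sketched directly from the policy above; the 0.25 and 0.4 thresholds are quoted, the "weak discoverer" cutoff is an assumption:

```python
WEAK_DISCOVERER = 0.3  # assumed cutoff for a "weak" discoverer score

def should_archive(swarm_score: float, revisit_failed: bool,
                   discoverer_score: float, kill_criterion_fired: bool) -> bool:
    if swarm_score < 0.25:
        return True
    if revisit_failed and discoverer_score < WEAK_DISCOVERER:
        return True
    if kill_criterion_fired and swarm_score < 0.4:
        return True
    return False
```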
M22 — Research Frontier Map + Strategic Autonomy (40/30/20/10 attention split + weekly roadmap)
SHIPPED

Closes the "agenda" gap. M21 gave the engine voices and a compass; M22 gives it a map and a calendar. Every recent hypothesis is grouped into a research frontier, gaps in coverage are surfaced, the next cycle's attention is split across explore / exploit / falsify / review, and a weekly roadmap is rendered alongside an anti-obsession watchdog. No new execution permissions; the system still cannot run anything autonomously.

  • Frontier map: frontier_map.py reads M21 mission_hypotheses + M19 outcome_ledger + canonical_objectives. Coarse family detection via ordered substring rules (LDH / kesterite / antimony chalcogenide / oxysulfide / single-atom / PGM / phosphide / nitride / sulfide / layered oxide / ceramic). Each Frontier carries hypothesis count, average swarm + alignment scores, three representative subjects, positive/negative outcome counts and a status label (active / thin / stagnant / contested).
  • Research-gap detector: research_gap_detector.py surfaces five canonical gap kinds: many_ideas_no_evidence (many hypotheses, zero positives), evidence_no_exploration (positives but no recent extension), recurring_contradictions (same swarm contradiction string seen ≥ 3 cycles), promising_no_validation (high alignment + swarm but no experiment plan attached), objective_uncovered (canonical objective with zero matched hypotheses).
  • Strategic allocator (40 / 30 / 20 / 10): strategic_allocator.py proposes the next cycle's attention split — 40 % exploit (active frontiers with the highest combined score), 30 % explore (uncovered objectives + thin frontiers), 20 % falsify (high-promise hypotheses without a plan), 10 % review (stagnant / contested frontiers + recurring contradictions). Each item carries weight, target, detail and rationale.
  • Anti-obsession guard: anti_obsession_guard.py watches the family distribution of the last 30 mission_hypotheses. If any one family covers more than 40 % of the window (and the sample is at least 10), it raises an ObsessionFlag with an explicit "diversify with creativity=wild or seed prompts from a different family" suggestion. Keeps the engine from locking onto one appealing chemistry family just because keywords keep matching (threshold logic sketched after this list).
  • Canonical weekly roadmap: canonical_roadmap.py combines all four pieces into one operator-facing plan: try this week (exploit + falsify), explore this week, archive / discard (contested + thin frontiers with low scores), wait — needs more data (review items), plus an anti-obsession notes block. Internal-only; the roadmap is advisory and does not authorise execution.
  • Three new capability gates (deny-by-default): can_build_frontier_map, can_allocate_research_attention and can_write_roadmap are true for materials, geaspirit, useful_compute; false for sost, geadeep, materials_discovery. The canonical gate set is now seventeen wide (5 + 3 + 3 + 3 + 3). The hard locks (run_dft / publish / touch_consensus / execute_*) remain false everywhere.
  • CLI: scripts/multi_ai_frontier_cycle.py runs the pipeline end to end and renders reports/ai_engine/frontier/<UTC>_<slug>/ with the six canonical files: frontier_map.json, frontier_map.md, research_gaps.md, strategic_allocation.json, weekly_roadmap.md, anti_obsession_report.md.
  • safety invariants: M22 grants no new execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. The live unified-lab daemon keeps running unchanged. 30 new tests; 1,302 / 1,302 total green.
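
The anti-obsession threshold logic fits in a few lines; the 40 % share, 30-hypothesis window and 10-sample floor are quoted above, the return shape is an assumption:

```python
from collections import Counter

def obsession_flag(families: list, window: int = 30,
                   share_limit: float = 0.40, min_sample: int = 10):
    recent = families[-window:]
    if len(recent) < min_sample:
        return None  # not enough signal to call anything an obsession
    family, count = Counter(recent).most_common(1)[0]
    share = count / len(recent)
    if share <= share_limit:
        return None
    return {"family": family, "share": round(share, 2),
            "suggestion": "diversify with creativity=wild or seed prompts "
                          "from a different family"}
```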
M23 — Private AI Ops Dashboard + Background Lab Autopilot (localhost cabin + token gate)
SHIPPED

Closes the "cabin" gap. M14 is a chat-style console; M23 gives operators a read-only operations dashboard at http://127.0.0.1:8766 plus a background autopilot that runs M21 mission cycles + M22 frontier cycles on a configurable interval. Localhost only. Token-gated. No public surface. No new execution permissions.

  • Token surface (separate from M14): ai_ops_token.py. issue(ttl_seconds=4h) returns the cleartext exactly once; on disk only the SHA-256 over a per-issue salt is persisted. verify() uses constant-time hmac.compare_digest plus an expiry check. revoke() unlinks the record. status() returns metadata without the cleartext. The issue/verify pair is sketched after this list.
  • Read-only state aggregator: ai_ops_state.py pulls from M19 (outcome ledger + heartbeats), M20 (experiment plans), M21 (canonical memory), M22 (frontier map) and the registry gate set. full_state() is the one-shot snapshot the dashboard's /api/ops/state endpoint serves.
  • Dashboard server: ai_ops_dashboard_server.py — stdlib ThreadingHTTPServer, refuses non-local bind with PermissionError (remote access via SSH tunnel only). Single HTML page (CSP default-src 'self', no CDN, no cookies, no localStorage), one JS file at /static/ops.js, four JSON endpoints under /api/ops/* — all /api/* require Bearer token.
  • Background autopilot: ai_ops_autopilot.py. tick() runs one M21 mission cycle plus one M22 frontier cycle for every project that holds the full pipeline gate set; disallowed projects produce a TickResult with ok=false and a clear "missing required gates" error. loop() refuses interval_seconds < 30. Best-effort: never raises mid-loop. Persists outcome ledger + heartbeats; writes canonical memory only when can_update_memory permits (materials only).
  • Three CLI scripts: scripts/sost_ai_generate_token.py issues a fresh token and prints the dashboard URL once on stdout; scripts/sost_ai_ops_dashboard.py starts the HTTP server (refuses non-local hosts); scripts/sost_ai_autopilot.py ticks once or loops indefinitely with operator-tunable creativity / count / interval.
  • safety invariants: localhost only by construction; token never appears on disk; the autopilot grants no new permissions; if a gate denies a step the project is skipped for that tick; all /api/* endpoints are read-only; no DFT, no publish, no consensus, no commit. 23 new tests; 1,325 / 1,325 total green.
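
The issue/verify pair can be sketched as below; the record layout and storage location are assumptions, while the salted-SHA-256-only persistence and constant-time verify come from the bullet above:

```python
import hashlib, hmac, json, secrets, time
from pathlib import Path

RECORD = Path("ai_ops_token.json")  # illustrative location

def issue(ttl_seconds: int = 4 * 3600) -> str:
    token = secrets.token_urlsafe(32)
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + token).encode()).hexdigest()
    RECORD.write_text(json.dumps({"salt": salt, "digest": digest,
                                  "expires": time.time() + ttl_seconds}))
    return token  # the cleartext leaves this function exactly once

def verify(presented: str) -> bool:
    try:
        rec = json.loads(RECORD.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    if time.time() > rec["expires"]:
        return False
    candidate = hashlib.sha256((rec["salt"] + presented).encode()).hexdigest()
    return hmac.compare_digest(candidate, rec["digest"])
```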
M24 — Useful Compute Task Intelligence Layer (private classifier + spec generator + staging queue)
SHIPPED

Closes the "what counts as a heavy task?" gap reported on the Useful Compute live trial. The public API and worker stay as a dry-run infrastructure; M24 ships a private lab where the AI engine classifies, spec-generates and stages heavy-task candidates. Public publishing and reward activation remain hard-locked behind gates that no project grants in the shipped registry — the operator's CTO verdict.

  • Heavy-task classifier: heavy_task_classifier.py with the five canonical accept axes the operator promised: is_useful (no busy-wait / fake-heavy), is_deterministic (declared + no race / unseeded random hint), is_auditable (declared + no not-auditable hint), is_heavy_enough (runtime ≥ 60 s AND memory ≥ 256 MB), and is_safe_to_verify (no "verifier rerun required" / "no replay possible" hint). Eight curated keyword vocabularies. A task that fails ANY axis is rejected with the offending axis listed. The all-axes rule is sketched after this list.
  • Spec generator: heavy_task_spec_generator.py. TaskSpec + per-project schema templates (materials / geaspirit / useful_compute), pinned-deps declaration, fixed-seed policy, replay = ~10 % of original runtime, explicit fake-heavy baseline ("busy-wait must NOT match the output within tolerances"). Reward class always starts at no_reward; visibility is internal_only when DFT / raw-geospatial / wallet / consensus keywords match, otherwise human_review_required.
  • Private staging queue: useful_compute_private_queue.py + SQLite useful_compute_private_queue table. stage() refuses when can_stage_private_useful_compute_task is denied — only the useful_compute project holds that gate. attempt_publish() and attempt_enable_rewards() are the documented entry points for future operator-only workflows but always return {ok: false, reason: "denied"} because every project denies the corresponding gate.
  • Orchestrator + seed templates: useful_compute_task_intelligence.py orchestrates classify -> spec -> (optional stage). seed_candidates(project) exposes operator-blessed seed examples per project (5 materials, 5 geaspirit, 3 useful_compute) — every seed is engineered to pass the classifier so operators have a known-good baseline.
  • Four new capability gates (deny-by-default): can_design_useful_compute_task (true: materials, geaspirit, useful_compute), can_stage_private_useful_compute_task (true: useful_compute ONLY), can_publish_useful_compute_task (false EVERYWHERE), can_enable_useful_compute_rewards (false EVERYWHERE). The canonical gate set is now twenty-one gates wide. Belt-and-braces tests assert that the publish and reward gates remain false for every project.
  • CLI: scripts/multi_ai_useful_compute_task_lab.py with --seeds (operator-blessed templates), --title + --description (ad-hoc candidate), --stage (private staging when the gate permits) and --json for the full pipeline payload.
  • Operator (CTO) verdict surfaced in code: public Useful Compute API stays as dry-run infrastructure; heavy reward-bearing tasks stay private until the AI knows to classify, validate and audit them; rewards stay OFF until M24+ pronounces a task family ready; publishing stays OFF until a separate human-approval workflow lands. 29 new tests; 1,354 / 1,354 total green.
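
The accept rule is pure conjunction, so it compresses to a few lines. A minimal sketch, assuming a simplified TaskCandidate shape (the shipped classifier derives the booleans from keyword vocabularies rather than taking them pre-computed; only the runtime ≥ 60 s AND memory ≥ 256 MB threshold and the fail-any-axis rule are documented):

```python
# Sketch of the five-axis accept rule from heavy_task_classifier.py.
from dataclasses import dataclass

@dataclass
class TaskCandidate:
    runtime_seconds: float
    memory_mb: float
    useful: bool            # no busy-wait / fake-heavy
    deterministic: bool     # declared, no race / unseeded-random hint
    auditable: bool         # declared, no not-auditable hint
    safe_to_verify: bool    # replay possible without a full verifier rerun

def classify(task: TaskCandidate) -> dict:
    axes = {
        "is_useful": task.useful,
        "is_deterministic": task.deterministic,
        "is_auditable": task.auditable,
        "is_heavy_enough": task.runtime_seconds >= 60 and task.memory_mb >= 256,
        "is_safe_to_verify": task.safe_to_verify,
    }
    failed = [name for name, ok in axes.items() if not ok]
    # A task that fails ANY axis is rejected, with the offending axes listed.
    return {"accepted": not failed, "failed_axes": failed}
```
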
M25 — Human Command Center + Approval Ritual (daily brief, approval inbox, executive summary, operator feedback loop)
SHIPPED

Closes the loop between the autonomous engine and the human operator. The AI may *propose* concrete decisions (campaigns, dossiers, false-positive archives, frontier promotions, useful-compute staging requests, DFT input prep); the operator approves or rejects them. Golden rule: the AI may ask for permission. The AI cannot grant permission to itself.

  • Approval request schema: approval_request.py. ApprovalRequest dataclass with deterministic request_id (SHA-256 over project / kind / subject / timestamp). Eight canonical kinds: approve_campaign, reject_hypothesis, convert_to_dossier, stage_useful_compute, prepare_dft_input, archive_false_positive, promote_frontier_family, demote_frontier_family. Lifecycle pending → approved | rejected | withdrawn. create() is gated by can_create_approval_request; approve() and reject() require a non-empty operator argument — the engine cannot generate one from inside an automated tick (see the sketch after this list).
  • Operator inbox: operator_inbox.py. Aggregates pending approvals across projects, sorts by kind priority (false-positive archive ≫ approve campaign), and renders a markdown table for the CLI plus sost_ai_operator_inbox@v1 JSON for the dashboard.
  • Daily brief: daily_brief.py. Five-section UTC report: what the engine did (recent ledger rows), what it learned (recent canonical mission cycles), what it found (top frontier families + research gaps), where it was blocked (capability-gate denials in the window), and what it recommends (open approval requests). Saved under reports/ai_engine/daily_brief/<UTC>/ as brief.json + brief.md. Default 24 h lookback.
  • Operator feedback loop: operator_feedback.py. On approve → records a useful task outcome in M19's ledger plus a positive hypothesis_learning_event. On reject → records a wrong outcome plus a negative learning event so the M21 swarm down-weights the pattern. Idempotent per request_id. Gated by can_apply_operator_feedback — only materials holds that gate.
  • Executive summary: executive_summary.py. Top-3-per-section operator view per project: opportunities (frontier families ranked by swarm × count + outcome bias), risks (research gaps + gate-block density + negative-signal pressure), next actions (oldest pending approvals).
  • Dashboard endpoints (M23 surface, extended): /api/ops/approvals, /api/ops/daily-brief, /api/ops/executive-summary, /api/ops/decision-history. All localhost-only and token-gated; payloads are read-only.
  • Four CLI scripts: scripts/sost_ai_daily_brief.py, scripts/sost_ai_operator_inbox.py, scripts/sost_ai_approve_request.py, scripts/sost_ai_reject_request.py. Approve/reject scripts require --operator; both refuse to run with an empty operator name. Optional --apply-feedback flag pipes the decision into the M19 ledger.
  • Three new gates, deny-by-default: can_create_approval_request (true for materials, geaspirit, useful_compute), can_apply_operator_feedback (true only for materials), can_execute_approval (false everywhere — even a human approval does not unlock automated execution; the engine still refuses to run side-effects from an approval row by itself).
  • Safety invariants: no DFT execution, no public publishing, no consensus surface, no rewards, no self-approval, no new network capability. The dashboard remains localhost-only. 48 new tests; 1,402 / 1,402 total green.
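
The deterministic ID and the operator requirement compress to a short sketch. The field order and separator in the hash payload are assumptions; only the SHA-256-over-four-fields rule and the non-empty-operator check are documented:

```python
# Sketch of the deterministic request_id and the no-self-approval rule
# from approval_request.py (payload layout is illustrative).
import hashlib

def request_id(project: str, kind: str, subject: str, timestamp: str) -> str:
    payload = "\n".join((project, kind, subject, timestamp))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def approve(request: dict, operator: str) -> dict:
    # approve()/reject() demand a non-empty operator name; the engine cannot
    # supply one from inside an automated tick, so self-approval is impossible.
    if not operator.strip():
        raise ValueError("operator name is required")
    return {**request, "status": "approved", "operator": operator}
```
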
// 03 — PIPELINE

From 100,000 hypotheses to a few human-reviewed ones

1. Mass local generation: ~10 s to 100,000 hypotheses, deterministic, no network
2. Deduplication: stable hash + project-specific pair keys (sketch after this list)
3. Local evidence search: read-only across docs/, reports/, outputs/
4. Validator pass: M2 validators per project + orchestrator merge
5. Deterministic ranking: weighted formula, deterministic across runs
6. Top-N selection: per-campaign keep_top_n (typically 10–200)
7. AI Council on top-K: validator-veto, not majority vote; budget-gated
8. Applicability profile: utility, FP risk, validation pathway, publishability
9. Report & backlog: JSON report + static HTML dashboard + SQLite log
10. Human review: required before any public claim or DFT promotion
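
Step 2 is the easiest to sketch. A minimal illustration of stable-hash deduplication with a project-specific pair key; the normalisation and the exact key shape are assumptions:

```python
# Sketch of pipeline step 2: stable content hash + project-specific pair key.
import hashlib

def stable_hash(text: str) -> str:
    # Normalise whitespace and case so trivially rephrased duplicates collide.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def dedupe(project: str, hypotheses: list[str]) -> list[str]:
    seen: set[tuple[str, str]] = set()
    unique: list[str] = []
    for hyp in hypotheses:
        key = (project, stable_hash(hyp))  # project-specific pair key
        if key not in seen:
            seen.add(key)
            unique.append(hyp)
    return unique
```
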
// 04 — EVIDENCE LEVELS

Every claim is classified before it can be promoted

level                                        | meaning                                                           | publishable?
local_code_verified                          | verified by reading repo source                                   | yes
local_data_verified                          | verified by reading a local DB / data file                        | yes
local_doc_supported / local_report_supported | supported by a local doc or internal report                       | yes
external_official_supported                  | supported by an official free public source (arXiv, OpenAlex, …)  | with caveat
multi_source_supported                       | supported by ≥ 2 independent source types                         | yes
model_consensus_only                         | only models agree; no data or doc backs it                        | no
speculative                                  | weak signals, not reproducible                                    | no
contradicted                                 | local evidence contradicts the claim                              | no
insufficient_evidence                        | nothing collected to back the claim                               | no
do_not_publish                               | public-facing claim with insufficient backing                     | block
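
The table compresses naturally to a lookup. A minimal sketch; the publishability strings follow the table above, while the helper and the default-to-block behaviour are assumptions:

```python
# Sketch of the evidence-level table as a publishability lookup.
PUBLISHABILITY = {
    "local_code_verified": "yes",
    "local_data_verified": "yes",
    "local_doc_supported": "yes",
    "local_report_supported": "yes",
    "external_official_supported": "with_caveat",
    "multi_source_supported": "yes",
    "model_consensus_only": "no",
    "speculative": "no",
    "contradicted": "no",
    "insufficient_evidence": "no",
    "do_not_publish": "block",
}

def may_publish(level: str) -> bool:
    # Unknown levels fall through to "block" — an assumption, but consistent
    # with the engine's deny-by-default posture.
    return PUBLISHABILITY.get(level, "block") == "yes"
```
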
// 05 — PROJECT INTEGRATIONS

How it plugs into SOST projects

Materials Engine

Generates and ranks material candidates by family, application and element-cost proxy. Only candidates with explicit DFT or CHGNet evidence in the local corpus may be recommended for promotion to a real DFT queue. Predicted-only candidates are clearly marked predicted_only — never validated.

GeaSpirit

Evaluates mineral / depth / coordinate / certainty claims with built-in conservatism. Any wording implying guaranteed mineral presence, certain depth or 100% certainty is blocked at do_not_publish. Depth claims require depth-aware geophysics (gravity / magnetics / AEM); satellite-only signals are labelled as surface proxy, never as subsurface evidence.
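
A minimal sketch of the wording rule, assuming a hypothetical pattern list (the shipped lexicon is richer, and the conservative non-match verdict below is an assumption):

```python
# Sketch of the GeaSpirit certainty-wording rule: guaranteed-presence,
# certain-depth and 100%-certainty language is forced to do_not_publish.
import re

CERTAINTY_PATTERNS = [
    r"guaranteed\s+mineral",
    r"certain\s+depth",
    r"100%\s+certain",
]

def geaspirit_verdict(text: str) -> str:
    for pattern in CERTAINTY_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return "do_not_publish"
    return "needs_human_review"  # conservative default, an assumption
```
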

SOST Protocol

Audits public wording, consensus and explorer claims. Locks: avg288 is consensus (avg600 / avg1000 are informational); cASERT 6210 fork was cancelled; "mandatory update" requires both fork wording and a documented activation height. Layered atop the Phase 1 public-claim guard.

Useful Compute

Designs candidate Heavy tasks (DFT relax, CHGNet pre-screen, GeaSpirit feature extraction, ...) and classifies each into ready_for_design, needs_benchmark, needs_reproducibility_solution, too_light, deferred or rejected. Every profile carries the explicit "rewards postponed" warning. The engine never activates rewards and never writes to a queue.

// 06 — WHAT IT DOES NOT DO

Hard guarantees

// THE ENGINE NEVER

  • publishes autonomous conclusions to the public web
  • activates Useful Compute rewards
  • modifies SOST consensus, miner code or node code
  • writes to Useful Compute queues
  • signs transactions or moves funds
  • reads wallet files or private keys
  • scrapes websites or bypasses rate limits
  • calls paid models without an explicit operator flag
  • makes any network call by default
  • treats model-only consensus as verified evidence
  • guarantees mineral discovery or material validation
// 07 — STATUS

Current numbers

Tests passing: 1,402 / 1,402 (M1–M12 + M14–M25; M13 deferred)
Free public source connectors: 8 (arXiv, OpenAlex, Crossref, PubChem, JARVIS, Materials Project, USGS, generic-official)
Hypothesis generation capacity: ≥ 100,000 candidates offline, in seconds
Network calls per run (default): 0 — network is OFF by default
Paid model calls per run (default): 0 — paid is OFF by default
Persistence: SQLite, idempotent migrations, 39 internal tables (9 added in M10 for the outcome learning loop; 3 in M11 for the dossier factory: validation_dossiers, validation_experiments, dossier_index; 4 in M12 for the campaign orchestrator: validation_campaigns, campaign_dossiers, campaign_approvals, campaign_index; 3 in M14 for the private console: console_sessions, console_messages, console_actions; 1 in M16 for the speculative discovery lab: idea_index)
Public outputs: none autonomous — human review required
Source license: private repo — outputs only released after audit

// EXPERIMENTAL STATUS

This is an experimental internal research system. It is not a product, not a financial recommendation engine, and not a guaranteed scientific oracle. Its outputs are internal-only by default, and any candidate later promoted to a public claim must pass human review and an explicit M2 validator verdict.

// 08 — SCIENTIFIC INTAKE ENGINE

10 phases · 169 tests · offline by default

// SUMMARY

The Scientific Intake Engine is the working layer inside the Materials Engine that turns free-form scientific text (papers, news, proposals, speculative ideas) into a structured, audited research pipeline: claims → evidence → multi-agent reasoning → falsification-first experiment plan → feedback loop → portfolio campaigns → knowledge graph → dossiers → review queue → autonomous orchestration. Ten phases shipped end-to-end. Every phase is local-first, deterministic, append-only, and offline by default. Zero paid API dependencies.

// HARD RULES (enforced in code, not just documented)

Impossible-physics never escapes the kitchen. Sources flagged with closed-timelike-curve / perpetual-motion / over-unity / exceeds-Carnot language are capped at REJECT, cannot be promoted to high-priority, cannot enter a publication package, and cannot be transitioned to published — even with explicit operator override.

No auto-publish. The autonomous orchestrator only ever enqueues review items; the operator must walk them through inbox → reviewing → approved → published manually.

No procedural detail, ever. A shared sanitiser scrubs grams / temperatures / pressures / ignition / detonation cues from every report and plan field before persistence.
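
A minimal sketch of such a sanitiser, assuming an illustrative cue pattern and replacement token (the shipped cue list is broader):

```python
# Sketch of the shared sanitiser: procedural cues (quantities, temperatures,
# pressures, ignition/detonation language) are scrubbed before persistence.
import re

PROCEDURAL_CUES = re.compile(
    r"\b\d+(\.\d+)?\s*(g|grams?|°C|K|bar|psi|MPa)\b|\bignit\w*|\bdetonat\w*",
    re.IGNORECASE,
)

def sanitise(field: str) -> str:
    # Every report and plan field passes through this before being written.
    return PROCEDURAL_CUES.sub("[REDACTED]", field)
```
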

Writes require auth; reads are public (with the single exception of the operational review surface, which requires auth on both sides).

// THE FULL CHAIN

Scientific text
→ Phase I     claims / entities / 6 scores / hypotheses
→ Phase II    evidence acquisition (Crossref / arXiv / Semantic Scholar)
→ Phase III   multi-agent reasoning + decision memo
→ Phase IV    falsification-first experiment plan + safety gates
→ Phase V     feedback loop + active learning + calibration
→ Phase VI    autonomous campaigns + portfolio ranking
→ Phase VII   knowledge graph + clusters + opportunities
→ Phase VIII  dossiers + reports (Markdown / JSON / HTML)
→ Phase IX    review queue + state machine + publication packages
→ Phase X     autonomous orchestration (5 modes, one endpoint)

// PHASE-BY-PHASE

Phase I — Scientific Intake (text → claims → scoring)
SHIPPED · 14 tests

Free-form text intake with claim extraction, named-entity recognition (materials, formulas, technologies, institutions, people, physical concepts), six-score evaluation (credibility / novelty / feasibility / evidence / commercial / research priority) and seed hypothesis generation. SHA-256 dedup. Auth via X-Intake-Key, sliding-window per-IP rate limit. Impossible-physics gate forces REJECT regardless of score.
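
The impossible-physics gate is a hard cap, not a score penalty, which a few lines make concrete. Flag names and the non-reject thresholds below are assumptions; the cap-at-REJECT-regardless-of-score rule is the documented one:

```python
# Sketch of the impossible-physics gate: flagged sources are capped at
# REJECT no matter how well they score elsewhere.
IMPOSSIBLE_PHYSICS_FLAGS = {
    "closed_timelike_curve", "perpetual_motion", "over_unity", "exceeds_carnot",
}

def final_verdict(score: float, flags: set[str]) -> str:
    if flags & IMPOSSIBLE_PHYSICS_FLAGS:
        return "REJECT"  # no score, override, or expert review can lift this
    return "ACCEPT" if score >= 70 else "REVIEW"  # thresholds are assumptions
```
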

Phase II — Evidence acquisition adapters
SHIPPED · 13 tests

Free-tier Crossref / arXiv / Semantic Scholar adapters with per-source dedup keyed on doi → arxiv_id → lower-title. Public-source-only. Cache + retry built in. Re-scores the source's evaluation against acquired evidence. No paid API dependency anywhere.
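
The fallback chain of the dedup key is worth making explicit. A minimal sketch, assuming a plain dict record shape:

```python
# Sketch of the Phase II dedup key: doi, falling back to arxiv_id,
# falling back to the lower-cased title.
def dedup_key(record: dict) -> str:
    if record.get("doi"):
        return f"doi:{record['doi'].lower()}"
    if record.get("arxiv_id"):
        return f"arxiv:{record['arxiv_id'].lower()}"
    return f"title:{record.get('title', '').strip().lower()}"
```
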

Phase III — Multi-agent evidence reasoning
SHIPPED · 11 tests

Five local deterministic agents (Skeptic / Evidence / Feasibility / Commercial / DecisionMemo) produce per-claim stance assessments (supports / contradicts / mixed / unrelated / insufficient) and a decision memo with six component scores. The impossible-physics gate persists through Phase III as well — the final score is capped at 15 in that case.

Phase IV — Experiment planning + safety guardrails
SHIPPED · 12 tests

Five local agents (Safety / Simulation / Falsification / Designer / Milestone) produce a falsification-first plan: literature_review → falsification → calculation → simulation → safety_review → benchmark → external_lab → bench_test → prototype. Hydrogen, combustion, high-pressure, cryogenic, radiation, toxic domains all surface as safety_review with mandatory external-lab routing. Refuses procedural detail entirely.

Phase V — Active learning feedback loop
SHIPPED · 14 tests

Append-only feedback ingestion (literature_review / simulation / benchmark / lab_result / expert_review / correction) with six outcomes (supports / contradicts / inconclusive / unsafe / failed_replication / successful_replication). Per-outcome delta tables, confidence scaling, recommendation rank ladder. Hard-coded floor: expert review can never lift an impossible-physics source above REJECT, even with confidence_score=100.

Phase VI — Autonomous research campaigns
SHIPPED · 16 tests

Group sources into named campaigns. Nine-signal portfolio ranker (final_project_score / evidence_alignment / contradiction_inverse / feasibility / novelty / commercial / safety_inverse / feedback_confidence / evidence_coverage) with configurable weights and a small domain-alignment bonus. Safe local auto-execute only — never auto-promote, never invoke external paid APIs, never auto-run dangerous experiments.

Phase VII — Knowledge graph + concept discovery
SHIPPED · 17 tests

Typed knowledge graph (13 node types × 13 edge types) per source / campaign / global. Lexicon-based theme clustering (hydrogen, combustion, battery, quantum, ctc-impossible, manufacturing, safety, …). Six opportunity types: missing_evidence, contradiction, underexplored_material, cross_domain_transfer, experiment_gap, commercialization_gap. Impossible-physics caps every opportunity priority and suppresses cross-domain transfers entirely.

Phase VIII — Research dossiers + reports
SHIPPED · 15 tests

Five report types (source_dossier / executive_summary / technical_review / opportunity_brief / campaign_dossier), two visibilities (private = full detail, public = redacted: bands instead of raw scores, no operational thresholds, no procedural detail), three export formats (Markdown / JSON / HTML). Every report carries a Reproducibility & audit section with source_id / evidence_ids / memo_id / plan_id / graph_build_id.

Phase IX — Review queue + publication packages
SHIPPED · 16 tests

Operational state machine inbox → reviewing → approved → published (plus rejected as a terminal failure path). Append-only notes + decision audit trail. Four publication package types (bct, paper, lab, investor). Internal review surface is auth-only on both reads and writes — only the curated publication packages are public. Impossible-physics items can be reviewed and rejected but never published nor packaged.
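
A minimal sketch of the state machine and the publish-edge block; the transition table follows the documented lifecycle, while the error types are assumptions:

```python
# Sketch of the Phase IX review state machine. rejected is the terminal
# failure path; impossible-physics items can never reach published.
ALLOWED = {
    "inbox": {"reviewing"},
    "reviewing": {"approved", "rejected"},
    "approved": {"published"},
    "published": set(),
    "rejected": set(),
}

def transition(state: str, target: str, impossible_physics: bool) -> str:
    if target == "published" and impossible_physics:
        raise PermissionError("impossible-physics items can never be published")
    if target not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```
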

Phase X — Autonomous research orchestration
SHIPPED · 18 tests

One endpoint, five modes: intake_only, full_private, full_review, campaign_cycle, report_only. Composes Phases I–IX into a single controlled pipeline with a per-step audit trail and a per-job artifact log. Never auto-publishes: full_review only enqueues a review item; the operator transitions it manually. Publication packages are opt-in (allow_publication_package=true) and even then enforce the impossible-physics and unsafe-feedback gates from Phase IX. Failures are captured per step and never crash the API.

// AT A GLANCE

Engineering footprint
  • 10 phases shipped end-to-end
  • 169 / 169 tests pass (146 phase + 14 api + 10 db)
  • 0 paid API dependencies
  • 0 mandatory LLM calls
  • SQLite-backed (portable to Postgres)
  • Append-only audit on every persistent write
  • ~80 HTTP endpoints under /intake/*

Hard guardrails
  • Impossible-physics → never published
  • No auto-publish, ever
  • No procedural detail in any report
  • Auth on every write surface
  • Internal review surface auth on both sides
  • Publication packages opt-in only
  • Append-only history on jobs / steps / artifacts

Public surface
  • GET /intake/sources/…
  • GET /intake/campaigns/…
  • GET /intake/reports/…
  • GET /intake/publications
  • GET /intake/autonomous/jobs
  • GET /intake/graph/…
  • All with ?include_*=true aggregations

// AUDIT TRAIL · SHIPPING COMMITS

Phase I — scientific-intake: phase 1 — text → claims/entities/scoring/hypotheses + auth
Phase II — scientific-intake: phase 2 evidence acquisition adapters
Phase III — scientific-intake: phase 3 multi-agent evidence reasoning
Phase IV — scientific-intake: phase 4 experiment planning and validation gates
Phase V — scientific-intake: phase 5 active learning feedback loop
Phase VI — scientific-intake: phase 6 autonomous research campaigns
Phase VII — scientific-intake: phase 7 knowledge graph and concept discovery
Phase VIII — scientific-intake: phase 8 research dossiers and reports
Phase IX — scientific-intake: phase 9 review workspace and publication packages
Phase X — scientific-intake: phase 10 autonomous research orchestration (commit b43d0d2)

All ten commits authored by NeoB <noreply@sostprotocol.org>. Repository: materials-engine-private · module: src/scientific_intake/.

// 09 — NEXT

What's deferred

M7 — Live AI provider wiring (Ollama / free hosted / paid judge)
DEFERRED

M5 already ships the provider interfaces. M7 will wire real HTTP calls under explicit flags (--allow-local-model, --allow-free-ai, --allow-paid) with cache, rate limits and budget logging. No automatic paid AI usage. No passwords — only API keys via env vars.

DEX Safety Layer (separate sprint)
PLANNED

A small Python layer mirroring the on-page Gold-DEX position-pricing formula, plus a public-wording guard for the DEX HTML pages and a deal-audit persistence table. It sits beside the existing browser AI Copilot — it is not a replacement for it.