AI ENGINE · v0.25.0 · Phase 3 Milestones M1–M25 (M13 deferred)

SOST AI
Engine

Internal autonomous research and validation system that generates hypotheses, validates them against local evidence, detects contradictions, scores utility and reduces overclaiming across Materials Engine, GeaSpirit, SOST protocol decisions and Useful Compute task design. Local-first. Free-first. Paid-last. No autonomous public claims.

INTERNAL RESEARCH | 1,402 / 1,402 tests · 100,000 hypotheses offline · 0 paid calls · 8 official-free source connectors · outcome learning loop · operator command center · miner support triage · scientific validation dossiers · validation campaigns · private operator console · SQLite chat persistence · speculative discovery lab · practical scientific recommendations · multi-repo project registry · deny-by-default capability gates · task-level outcome ledger · per-cycle project heartbeat · autonomous scoreboard · experiment planner · replay sandbox · counterfactual GO/MAYBE/NO-GO · canonical mission objectives · six-role AI scientist swarm · contradiction-to-discovery loop · research frontier map · strategic 40/30/20/10 allocator · weekly canonical roadmap · anti-obsession guard · localhost ops dashboard · background lab autopilot · useful-compute task classifier · private task staging queue · daily brief generator · operator approval inbox · executive summary · operator feedback loop · network OFF by default

// 01 — WHAT IT IS

An evidence-first autonomous research engine

// SUMMARY

The SOST AI Engine is an internal autonomous research system that generates hypotheses, compares evidence, detects contradictions and helps prioritize scientific and protocol decisions across SOST projects. It does not publish autonomous conclusions; all public outputs require human review.

// CORE PRINCIPLE

No conclusion is accepted just because a model said it. Every important claim is classified by evidence level — local code, local data, local doc, multi-source, model-only, speculative, contradicted or insufficient — before it can be considered for any internal promotion, and never published without an explicit human review pass.

// DESIGN BIAS — FREE / LOCAL FIRST, PAID LAST

100,000 hypotheses are generated locally and cheaply. Only the best, most uncertain or most contradictory ones reach a multi-AI council. Network access is OFF by default; paid models are OFF by default and require an explicit operator flag.

// 02 — WHAT IT DOES

Capabilities across twenty-five milestones (M13 deferred)

M1 — Evidence Core & Public Claim Guard
SHIPPED

Ten-level evidence classification (local_code_verified → do_not_publish), claim extractor, public-claim guard with safer-rewrite suggestions, eight seeded eval cases derived from real past mistakes (Useful Compute rewards postponed, avg288 vs avg1000, gold-redemption wording, GeaSpirit mineral guarantees, DFT-validated overclaim, etc.).

  • blocks public wording such as "rewards are active" while the UC trial is postponed
  • blocks "DFT validated" when no DFT artefact is found locally
  • blocks "mineral guaranteed at depth" in GeaSpirit copy
  • downgrades absolute-certainty wording to "needs human review"
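
A minimal sketch of a phrase-level guard in this spirit, with abbreviated phrase lists and illustrative verdict names (the real guard's ten evidence levels and safer-rewrite suggestions are not reproduced here):

```python
# Illustrative public-claim guard: block known-bad wording, downgrade
# absolute wording to human review. Phrase lists are abbreviated samples.
BLOCKED_PHRASES = ("rewards are active", "dft validated", "mineral guaranteed at depth")
ABSOLUTE_WORDING = ("guaranteed", "always", "proven", "risk-free")

def scan_text(text: str) -> dict:
    t = text.lower()
    blocked = [p for p in BLOCKED_PHRASES if p in t]
    absolutes = [w for w in ABSOLUTE_WORDING if w in t]
    if blocked:
        verdict = "blocked"
    elif absolutes:
        verdict = "needs_human_review"  # downgrade, never auto-approve
    else:
        verdict = "ok"
    return {"verdict": verdict, "blocked": blocked, "flagged_wording": absolutes}
```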
M2 — Real Validators (Materials, GeaSpirit, SOST)
SHIPPED

Read-only validators that cross-check claims against the actual local corpus. The unified ValidatorResult carries verdict, evidence_level, confidence, publishability, evidence_items, missing_evidence, risks and next_steps. The orchestrator merges multiple validators with the strictest verdict winning. A minimal sketch of that merge follows the list below.

  • Materials: validate_material_claim, validate_dft_status, low-cost / catalyst / photovoltaic / false-positive checks
  • GeaSpirit: depth-aware evidence required for any depth claim; satellite-only marked as surface proxy only
  • SOST: locked policies on cASERT 6210 (cancelled), avg288-only consensus, mandatory-update wording
  • Useful Compute: detects "rewards active" wording and contradicts it against the trial doc
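
A minimal sketch of the strictest-verdict merge, assuming an illustrative verdict ordering (field names follow the text; the ranking, defaults and confidence rule are assumptions):

```python
from dataclasses import dataclass, field

# Assumed severity order: higher rank = stricter; merged verdict is the max.
_VERDICT_RANK = {"supported": 0, "needs_human_review": 1, "contradicted": 2, "blocked": 3}

@dataclass
class ValidatorResult:
    verdict: str
    evidence_level: str
    confidence: float
    publishability: str
    evidence_items: list = field(default_factory=list)
    missing_evidence: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    next_steps: list = field(default_factory=list)

def merge(results: list) -> ValidatorResult:
    """Merge validator results; the strictest verdict wins."""
    strictest = max(results, key=lambda r: _VERDICT_RANK.get(r.verdict, 0))
    merged = ValidatorResult(
        verdict=strictest.verdict,
        evidence_level=strictest.evidence_level,
        confidence=min(r.confidence for r in results),  # never exceed the weakest
        publishability=strictest.publishability,
    )
    for r in results:
        merged.evidence_items += r.evidence_items
        merged.missing_evidence += r.missing_evidence
        merged.risks += r.risks
        merged.next_steps += r.next_steps
    return merged
```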
M3 — Hypothesis Factory + AI Council + Learning Loop
SHIPPED

Mass local hypothesis generation (binary, ternary, quaternary and doped compositions for materials; AOI×commodity for GeaSpirit; risky-wording and Heavy-task-design ideas for SOST/UC). Deterministic ranking with configurable weights. AI Council with validator-veto (not majority vote). Outcome-driven rule-based learning loop with append-only persistence.

  • capable of generating 100,000+ hypotheses offline in seconds
  • deduplication by stable hash and project-specific pair keys
  • eight canonical campaign templates (Materials DFT priority, GeaSpirit public safety, UC Heavy task design, ...)
  • static HTML dashboard generator — local file, no server, no public deploy
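
The dedup step can be pictured as below; the normalisation and key fields are assumptions, only the stable-hash-plus-seen-set pattern comes from the bullet above:

```python
import hashlib

def stable_hash(project: str, subject: str, statement: str) -> str:
    # Assumed normalisation: lowercase, stripped, pipe-joined fields.
    key = "|".join(s.strip().lower() for s in (project, subject, statement))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def dedupe(hypotheses: list) -> list:
    seen, unique = set(), []
    for h in hypotheses:
        k = stable_hash(h["project"], h["subject"], h["statement"])
        if k not in seen:
            seen.add(k)
            unique.append(h)
    return unique
```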
M4 — Applicability & Utility Engine
SHIPPED

Each hypothesis is enriched with a structured ApplicabilityProfile answering: what it could be useful for, why theoretically, what evidence is missing, what false-positive risks exist, what the next validation step is, and whether it's publishable.

  • materials family classifier: oxide / sulfide / nitride / phosphide / carbide / silicide / halide / metallic
  • element-risk detection: PGM cost, toxicity, rare-earth supply, cheap-only
  • application-aware validation pathway (band structure only for PV/photonic, elastic only for structural)
  • recommended actions: promote_to_dft_queue, promote_to_chgnet, literature_review, keep_internal, reject_false_positive, future_heavy_task_candidate, ...
M5 — Free Knowledge Connectors & Source Reliability
SHIPPED

Connects the engine to official / free public APIs (arXiv, OpenAlex, Crossref, PubChem, JARVIS, Materials Project, USGS) and to optional local / free AI providers (Ollama, OpenRouter, HuggingFace) — with cache, rate-limit, domain allowlist, citation tracking and source reliability scoring. Truth hierarchy: local validators > local data > official DB > peer-reviewed metadata > preprints > local LLM > free hosted > paid judge (last and opt-in).

  • HTTP layer: urllib only, explicit domain allowlist, never logs token values
  • per-source rate limits + sha256-keyed cache (default 7 days, USGS 30 days; sketched after this list)
  • 9 canonical contradictions hard-locked: rewards-active vs postponed, avg1000 consensus mismatch, cASERT 6210 cancelled, DFT-validated overclaim, mineral / depth guarantees, "trustless" overclaim, no-risk wording, guaranteed price/payout
  • research session: claim → validators → local knowledge → sources → synthesis → contradiction resolution → internal answer with provenance
  • model-answer validator: scores overclaim / hallucination, emits a corrected answer that embeds the canonical truth
  • defaults: network OFF, paid OFF, free-AI OFF, local-model OFF
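
The sha256-keyed cache with per-source TTLs might look like this sketch; the on-disk layout and key derivation are assumptions, the 7-day / 30-day TTLs come from the bullet above:

```python
import hashlib, json, time
from pathlib import Path

TTL_DAYS = {"default": 7, "usgs": 30}

def _cache_path(root: Path, source: str, url: str) -> Path:
    key = hashlib.sha256(f"{source}|{url}".encode()).hexdigest()
    return root / source / f"{key}.json"

def cache_get(root: Path, source: str, url: str):
    p = _cache_path(root, source, url)
    if not p.exists():
        return None
    ttl = TTL_DAYS.get(source, TTL_DAYS["default"]) * 86400
    if time.time() - p.stat().st_mtime > ttl:
        return None  # stale: force a re-fetch through the rate limiter
    return json.loads(p.read_text())

def cache_put(root: Path, source: str, url: str, payload: dict) -> None:
    p = _cache_path(root, source, url)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(payload))
```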
M6 — Autonomous Research Daemon & Human Review Pipeline
SHIPPED

The engine now runs as an internal autonomous research daemon: it observes local Materials Engine, GeaSpirit, SOST and Useful Compute artefacts in read-only mode, plans its own bounded tasks, executes safe local work, learns from outcomes via rule-based memory, and produces reviewable archives for human approval. Critically, it never publishes: every public claim must pass an explicit human review and approval step before being exported — and the exporter writes only to reports/ai_engine/approved_exports/, never to the public website.

  • SelfTask schema with 28 task types covering all four projects
  • read-only adapters for Materials Engine, GeaSpirit, SOST and Useful Compute (no writes, no DFT, no GIS)
  • bounded planner with 24h dedup window, per-project + total caps, network/paid gates
  • review pack: summary.md, publication_candidates.md, do_not_publish.md, manifest.json, checksums.sha256, plus a .tar.gz archive
  • publication queue with strict approval gate: do_not_publish drafts cannot be approved
  • rule-based research memory (no neural reranker): boosts validated families, demotes rejected patterns, blocks risky wording
  • manual scripts only — no systemd, no cron
  • defaults: daemon OFF, network OFF, paid OFF, free-AI OFF, local-model OFF, public publication FORBIDDEN
M7 — Local / free AI provider wiring (paid disabled in M7)
SHIPPED

M7 wires the policy gate and judge plumbing for local + free AI providers (Ollama local, OpenRouter / HuggingFace free models). Paid AI is hard-disabled in M7 — even when a caller passes --allow-paid, the policy coerces max_paid_calls to 0 and reports paid_judge as disabled. The provider answer judge runs every reply against the canonical contradictions and the public-claim guard, scores overclaim and hallucination, and embeds the canonical correction in the corrected_answer.

  • provider_policy: explicit defaults all-OFF; allow_paid=True coerced to False
  • free_ai_model_registry: Ollama prefix allowlist (qwen/llama/mistral/phi/gemma/deepseek/codellama), OpenRouter only :free suffix, HuggingFace small free-inference list
  • provider_answer_judge: deterministic JudgeReport scoring overclaim/hallucination, with corrected_answer that appends the canonical truth
  • live_research_session: validators-only by default; every provider recorded as used=False with a skipped_reason
  • Ollama provider: refuses any non-loopback URL via a compare_exchange-style check; is_available() never raises
  • token presence reported as boolean only — values never read or stored; no password field anywhere
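
The loopback refusal and the never-raising availability probe could be sketched as follows (URL handling and the probe endpoint are assumptions):

```python
import urllib.request
from urllib.parse import urlparse

_LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}

def require_loopback(base_url: str) -> str:
    host = urlparse(base_url).hostname or ""
    if host not in _LOOPBACK_HOSTS:
        raise ValueError(f"refusing non-loopback Ollama URL: {base_url!r}")
    return base_url

def is_available(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Probe the local Ollama endpoint; never raises."""
    try:
        require_loopback(base_url)
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False
```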
M8 — Internal SOST AI Ask Engine
SHIPPED

A small CLI-driven interface that lets the operator type a free-form prompt and have the AI engine search/reason over local knowledge of Materials Engine, GeaSpirit, SOST, Useful Compute and DEX, then return a cautious internal-only answer. Never publishes — even when the prompt explicitly asks to publish, the answer composer routes to needs_human_review.

  • prompt_router: 12 intent buckets (explain / search / compare / validate / generate_hypotheses / public_wording_review / useful_compute_task_design / dft_priority / geaspirit_public_safety / dex_safety / mining_help / create_report)
  • project_selector: keyword-based routing across materials / geaspirit / sost / useful_compute / dex / mining
  • answer_composer: always runs contradiction_resolver + public_claim_guard; embeds canonical corrections; downgrades publishability on blocking findings
  • internal_citation: lightweight registry for local-file references with inline [n] markers and a markdown bibliography
  • output saved to reports/ai_engine/ask/<ts>_<slug>/ with answer.md, evidence.json, files_consulted.txt, risk_report.md, manifest.json
  • defaults: internal_only, network off, paid false, no automatic publication, no website write
M9 — Public Help Center + Approved Knowledge Export
SHIPPED

The public-safe layer. Private AI thinks. Human reviews. Public site explains only approved safe knowledge. The public website (sost-help.html and sost-miner-troubleshooter.html) consumes only static JSON exported from a human-reviewed pipeline. The public site never calls the private engine, never queries Ollama / OpenRouter / HuggingFace / paid AI, and never uploads a log.

  • approved_knowledge_exporter: combines approved publication-queue items with 12 default safe FAQ templates; writes reports/ai_engine/approved_public_help/<ts>/ with index, markdown, troubleshooter rules, faq, safety + source manifests, README and sha256 checksums
  • public_help_guard: exit-gate guard hard-blocking "guaranteed profit", "passive income", "Useful Compute rewards are active", "avg1000 consensus", "confirmed/guaranteed mineral", "DFT-validated", "fully trustless DEX", any "send/paste/share private key or seed phrase" wording, the personal-email leak token, and the AI attribution leak token
  • miner_troubleshooting_knowledge: 11 deterministic log-pattern rules (rejected-block, profile-mismatch, no-peers, connection-refused, bootstrap-chain, http-zero, too-many-threads, cmake/libsecp/libssl-missing, etc.) consumed by the local-only browser troubleshooter
  • public website: /sost-help.html with client-side search and noscript fallback, /sost-miner-troubleshooter.html with paste-a-log analysis that runs entirely in the browser
  • import helper scripts/import_public_help_pack.py: validates the pack, refuses on missing safety_manifest / checksums / banned phrases, copies the JSON into website/data/; never auto-runs git-add / commit / push
  • defaults: no live AI on the public site, no chatbot, no log upload, no autopilot deploy
M10 — Outcome Learning + Operator Command Center + Miner Support Triage
SHIPPED

The closed feedback loop. Deterministic, rule-based, internal-only — no neural model, no network, no autopublish. The engine now records what actually happened to a candidate after it left review (DFT result, GeaSpirit verdict, public-wording correction, miner outcome, provider contradiction), turns each event into a clamped boost/penalty adjustment, and remembers the pattern for future ranking. The operator command center reads the same signal and recommends P0/P1/P2/P3 actions; the miner support triage classifies free-form miner text into a structured case and drafts a conservative reply that always passes through the public claim guard before being marked as low-risk. The public claim guard + contradiction resolver remain the only paths to public output.

  • M10-1 — Outcome Learning Core: typed OutcomeEvent log (28 outcome types across materials / GeaSpirit / useful compute / SOST / DEX / provider), deterministic derive(event) mapping to LearningAdjustment + PatternLesson rows, hard caps MAX_BOOST=0.30 / MAX_PENALTY=0.40 / MAX_NET_DELTA=0.50, pattern-memory upsert with +0.02 confidence bump per repeat (capped at 1.0), and a Markdown report under reports/ai_engine/learning/
  • M10-2 — Operator Command Center: read-only priority recommender that surfaces P0 (public-claim emergencies, GeaSpirit overclaim/depth guards), P1 (repeated miner issues, provider overclaims, stuck review packs), P2 (memory hygiene), P3 (periodic refreshes); work-queue snapshot; per-page dashboard text; Markdown report bundle under reports/ai_engine/operator/<ts>/ (index.md, actions.md, state.md, dashboard.txt)
  • M10-3 — Miner Support Triage + Reply Drafts: regex-based classifier across 12 categories (install / build / sync / rejected_block / orphan / stale_parent / no_peers / threads / useful_compute / wallet_safety / dex / unknown), per-case Markdown log under reports/ai_engine/support_cases/<ts>/, conservative community-reply drafter (wallet_safety replies are never auto-safe — they always require human ack), release-notice drafter, and help-refresh suggester that proposes Q/A items the operator can review (the suggester itself never publishes)
  • unified CLI: scripts/sost_ai_ops.py with subcommands status, next-actions, risks, review-packs, learning, providers, miner-support, public-help-suggestions, full-report
  • safety invariants: is_safe_for_export() blocks targets {false_positive_risk, guard, task_design, provider_reliability} from leaving the engine as such; every adjustment passes through clamp_adjustment() before insertion; the loop is replayable, auditable and deterministic by construction
  • defaults: pure Python stdlib, no torch / sklearn / numpy, no remote calls, no public publication; M10 outputs feed only the internal review pipeline that already gates everything public via M9
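
The clamp can be pictured as below; the three caps are quoted from the M10-1 bullet, while the running-net bookkeeping is an assumption about how MAX_NET_DELTA is enforced:

```python
MAX_BOOST, MAX_PENALTY, MAX_NET_DELTA = 0.30, 0.40, 0.50

def clamp_adjustment(delta: float, running_net: float = 0.0) -> float:
    """Clamp one boost/penalty and keep the cumulative effect bounded."""
    # per-event caps: boosts and penalties have different ceilings
    delta = min(delta, MAX_BOOST) if delta >= 0 else max(delta, -MAX_PENALTY)
    # cumulative cap: the net effect on a pattern stays within +/- MAX_NET_DELTA
    capped_net = max(-MAX_NET_DELTA, min(MAX_NET_DELTA, running_net + delta))
    return capped_net - running_net  # the adjustment actually applied
```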
M11 — Scientific Validation Planner + Dossier Factory
SHIPPED

From "this candidate is interesting" to "this is the concrete next step, this is the experiment, this is what would confirm or kill it". Internal-only, deterministic, human-approved. A hypothesis becomes a structured dossier with current evidence, missing evidence, recommended path (literature / CHGNet / DFT input / GeaSpirit layer review / Useful Compute task design), draft experiments, deterministic pass/fail criteria, coarse compute-cost class, P0–P4 priority and publishability tag (internal_only by default). The AI may design validation work; it never executes heavy jobs and never publishes scientific claims.

  • M11-1 — Validation Dossier Core: ValidationDossier dataclass + 3 SQLite tables (validation_dossiers, validation_experiments, dossier_index), validation-plan router, ExperimentSpec dataclass with execution_allowed=False by default, deterministic pass/fail criteria, coarse cost-class estimator, priority + publishability policy, and the dossier renderer that writes dossier.md, dossier.json, experiment_plan.json, go_no_go.md, commands_draft.sh (mode 0644, every line #-commented), README.md, checksums.sha256 (the draft-only spec shape is sketched after this list)
  • M11-2 — Materials Validation Planner: CHGNet / DFT plan builders (relaxation / static / band-structure / elastic-constants), Materials Go/No-Go (toxic / radioactive elements without strategic_value=high force HOLD; PGM catalysts always flag cost-of-deployment risk; literature + CHGNet support required for DFT GO unless P0/P1 promotion), per-experiment templates that NEVER invent energies, gaps or stability conclusions
  • M11-3 — GeaSpirit Validation Planner: layer-gap analyzer (geology, lithology, magnetics, gravity, AEM, EMIT, drilling, geochemistry, false-positive filters, spatial block validation), publication-safety reviewer (depth claims based on satellite alone are BLOCKED; depth confirmation requires magnetics / gravity / AEM / drilling + human approval), single-sensor anomaly without geology = NO-GO
  • M11-4 — Useful Compute Heavy Task Designer: HeavyTaskSpec with input/output schema, deterministic requirements, verification method, runtime target, dependency requirements, hardware requirements, and risk lists; conservative go/no-go (fake CPU burn / busy-wait / loop-forever → REJECT; too-light tasks → REJECT; paid / proprietary deps → REJECT; DFT-class without pinned version + pseudopotentials + tolerances + container → HOLD needs_reproducibility_solution); benchmark + verification plans as DRAFT-ONLY; strongest verdict the AI can issue is ready_for_benchmark — the rewarded phase remains gated by a separate human decision
  • M11-5 — Dossier Index + Operator Integration: denormalised dossier_index for fast filtered queries, substring search by subject + project + type, Markdown operator summary grouped by project / priority / publishability, integration into sost_ai_ops.py validation-dossiers and into the full-report bundle
  • safety invariants: dossiers default to internal_only; commands_draft.sh is non-executable by construction; Materials candidates and GeaSpirit AOIs always default to internal_only; Useful Compute campaigns always default to human_review_required; public exposure routes through the existing M9 export pipeline + human approval
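
The draft-only experiment shape from M11-1 can be sketched like this; the field set is trimmed and the guard helper is an illustrative assumption, but execution_allowed=False-by-default is exactly the invariant described above:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    name: str
    method: str                      # e.g. "chgnet_relaxation", "dft_static"
    pass_criteria: list = field(default_factory=list)
    fail_criteria: list = field(default_factory=list)
    cost_class: str = "unknown"      # coarse estimate only
    execution_allowed: bool = False  # draft-only by construction

def assert_draft_only(spec: ExperimentSpec) -> ExperimentSpec:
    # belt-and-braces: refuse any spec that claims execution rights
    if spec.execution_allowed:
        raise PermissionError("ExperimentSpec.execution_allowed must stay False")
    return spec
```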
M12 — Validation Campaign Orchestrator
SHIPPED

From individual dossiers to organised campaigns. Internal-only, deterministic, human-approved. M11 decides what might be valid; M12 decides what to validate first. A typed batch of dossiers is selected by a per-type policy, ranked, and packaged into a draft-only execution pack (campaign.md, manifest.json, selected_dossiers.jsonl, budget.md, risk_report.md, manual_execution_plan.md, do_not_run_automatically.md, README, checksums) that the operator reviews before any compute or publication.

  • M12-1 — Campaign Core: ValidationCampaign dataclass + 4 SQLite tables (validation_campaigns, campaign_dossiers, campaign_approvals, campaign_index), 12 campaign types (literature / CHGNet / DFT input / DFT relaxation draft / GeaSpirit layer review / GeaSpirit public safety / Useful Compute heavy benchmark design / Useful Compute reproducibility / Useful Compute fake-heavy rejection / Useful Compute schema design / public-wording safety / miner support improvement), status transitions ONLY through record_approval(...) (draft → ready_for_human_review → approved_for_manual_execution → completed), execution_allowed=False default that the system never flips, deterministic per-type selector + family/application diversification + coarse budget estimator (the status machine is sketched after this list)
  • M12-2 — Materials Validation Campaigns: M10 outcome bias re-rank (per-family boost / penalty pulled from learning_adjustments), CHGNet excludes red-flag candidates, DFT-class campaigns require literature + CHGNet support OR an explicit P0/P1 promotion, family-aware diversification (at most two same-family candidates before falling back), per-pack extras: selected_materials.csv, selected_materials.jsonl, go_no_go.md
  • M12-3 — GeaSpirit Validation Campaigns: AOI-aware re-rank (richer layer inventory first for layer-review; blocked claims first for public-safety), aggregator that turns per-AOI verdicts into a campaign-level go / hold / block_public, per-pack extras: selected_aoi_claims.jsonl, layer_gap_matrix.csv, public_safety_matrix.md, go_no_go.md
  • M12-4 — Useful Compute Heavy Campaigns: readiness-based bucketing (ready_for_benchmark / needs_reproducibility / fake_heavy / needs_schema_design), strongest verdict is "go to internal benchmark" — never "go to rewarded phase", per-pack extras: selected_heavy_tasks.jsonl, benchmark_matrix.csv, reproducibility_matrix.md, input_output_schema_needs.md, go_no_go.md
  • M12-5 — Campaign Operator Integration: campaign_index rebuild + filtered queries, substring search by title + type + project, Markdown operator summary, sost_ai_ops.py validation-campaigns and campaign-next-actions subcommands, automatic inclusion in the full-report bundle
  • safety invariants: AI does NOT generate Useful Compute task queues, NOT modify task_server / worker, NOT activate rewards, NOT run CHGNet / DFT, NOT download rasters, NOT run GIS pipelines; manual_execution_plan.md is mode 0644 with every line shell-commented (sh manual_execution_plan.md is a no-op); do_not_run_automatically.md ships in every pack as an explicit reminder
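
The status machine reduces to one linear path; this sketch keeps only the transition rule (persisting the approval row and the approver identity is omitted, and the signature is an assumption):

```python
_NEXT_STATUS = {
    "draft": "ready_for_human_review",
    "ready_for_human_review": "approved_for_manual_execution",
    "approved_for_manual_execution": "completed",
}

def record_approval(current_status: str) -> str:
    """Advance a campaign one step; no other transition path exists."""
    if current_status not in _NEXT_STATUS:
        raise ValueError(f"no transition allowed from {current_status!r}")
    return _NEXT_STATUS[current_status]
```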
M13 — Autonomous Discovery Theorist + Theory Graph
DEFERRED

Planned Phase 3 milestone. The discovery theorist will turn the M11 dossier graph + M12 campaign outcomes into typed scientific theories, with explicit pre-conditions, observable predictions, refutation paths, and a typed theory-graph that links theories to dossiers and to one another. Not yet shipped. The M14 console exposes the discover action but currently returns a safe unavailable envelope so callers can already wire the route end-to-end while the engine is being designed.

  • theory dataclass: id, statement, scope, pre-conditions, predictions, refutation_paths, evidence-links, confidence_class, publishability (always internal_only by default)
  • theory-graph SQLite tables (theories, theory_links, theory_evidence) with deterministic IDs and stable hashes for reproducible snapshots
  • local-only theory generator: deterministic, stdlib-only, no LLM by default; AI Council optional in a future milestone for free-only / paid-locked operation
  • operator integration: theory dossiers under reports/ai_engine/discovery_dossiers/<id>/ and a Markdown summary for the operator command center
  • safety invariants: same as M11/M12 — nothing publishes automatically, no network, no paid AI; theories never override an existing M11 dossier verdict
M14 — SOST AI Engine Private Console
SHIPPED

Localhost-only web console for operating the M1–M12 capabilities of the engine through prompts, quick actions, evidence panels and risk flags. Not exposed on sostcore.com. Never publishes anything. Never runs heavy compute. No paid AI. Started with python3 scripts/sost_ai_console.py, prints a one-shot URL with an ephemeral token, scrubs the token from the address bar after first paint, and keeps it in a JS closure for the lifetime of the page.

  • M14-1 — Backend + Access Control: stdlib-only ThreadingHTTPServer bound to 127.0.0.1 by default; 0.0.0.0 rejected unless both --unsafe-bind-all and --i-understand-this-exposes-the-private-console are passed; secrets.token_urlsafe(32) session token, never persisted; constant-time hmac.compare_digest bearer match (sketched after this list); positive read/write allowlists; no shell helper exists in the codebase; 50 tests
  • M14-2 — Private UI: vanilla HTML/CSS/JS, no CDN, no external fonts/scripts/styles, no cookies, no localStorage, no eval(), no Function(); CSP default-src 'self'; CSS dark command-center aesthetic with cyan/green/gold accents; sidebar with 14 sections; chat panel with project + mode selectors, safety badges (NETWORK OFF, PAID LOCKED, PUBLICATION LOCKED, LOCALHOST ONLY), 24+ canned quick prompts, evidence drawer with colour-coded publishability, miner-log triage screen, reports browser, settings panel; token never written to any DOM node; 22 tests
  • M14-3 — Action Integration: 10 actions wired to existing modules — ask/ideate via ask_engine.ask; validate/public_wording_review via public_claim_guard.scan_text; create_dossier via M11 materials/geaspirit/useful-compute planners + insert_dossier + dossier_renderer.render; create_campaign via campaign_renderer.build_and_render (execution_allowed=False unconditionally); triage_miner_log via miner_support_triage; draft_reply via community_reply_drafter; next_actions via operator_command_center; discover returns a safe-unavailable envelope until M13 lands; legacy do_not_publish publishability mapped onto canonical blocked; 48 tests
  • M14-4 — Chat History + Exports: SQLite-backed console_sessions + console_messages + console_actions tables (idempotent migrations on top of the engine's existing persistence layer); console_conversation high-level API (new_chat / rename_chat / append_user / append_assistant / list_chats / load_chat); console_export writes reports/ai_engine/console_exports/<UTC>_<sid8>/ with conversation.md, manifest.json (schema sost_ai_console_export@v1), and checksums.sha256; clearing local history requires the literal confirmation "YES_DELETE_LOCAL_HISTORY"; 16 tests
  • M14-5 — Operator Guide + Smoke: live HTTP smoke tests on a random port and per-test fresh SQLite db: token gate works, paid + publication remain locked, "Useful Compute rewards active" never returns public_safe, miner-rejected-block log triages correctly, dossier creation writes a folder under reports/ai_engine/validation_dossiers/..., no website/ writes occur during a typical action flow, discover returns the M13 unavailable envelope; comprehensive operator guide in docs/multi_ai_console_operator_guide.md; 11 tests
  • safety invariants: localhost only by default, ephemeral token, no permanent passwords, no eval() / Function() / inline scripts / inline event attributes, paid + publication + heavy execution all locked, only allowlisted directories are readable / writable, OPTIONS preflights rejected (no CORS surface), and SSH tunnel is the documented remote-access pattern (no public port)
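
The bearer check reduces to a constant-time comparison against an ephemeral in-process token; the handler wiring here is an illustrative assumption:

```python
import hmac
import secrets

SESSION_TOKEN = secrets.token_urlsafe(32)  # printed once at startup, never persisted

def bearer_ok(authorization_header: str) -> bool:
    if not authorization_header.startswith("Bearer "):
        return False
    presented = authorization_header[len("Bearer "):]
    # constant-time comparison avoids timing side channels
    return hmac.compare_digest(presented.encode(), SESSION_TOKEN.encode())
```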
M15 — Console Persistence + Operator Workflow
SHIPPED

Turns the M14 console into a daily cockpit: every action is now persisted in SQLite, conversations can be reloaded and exported, the action surface covers review-pack / outcome / daemon / operator-status / learning-report / dossier-list / campaign-list, and a strict report browser lets the operator read internal artifacts without ever leaving the safe allowlist. All security locks remain in place.

  • M15-1 — SQLite Chat Persistence: the routes layer auto-writes a turn-pair into console_messages + console_actions on every persistable action; the in-memory ring stays as a lightweight audit cache. ConsoleState.current_session_id is filled lazily on the first action. New endpoints GET /api/history, GET /api/history/load, POST /api/history/{new,rename,clear} with explicit confirm="CLEAR_CONSOLE_HISTORY" guard. The session token NEVER appears in any persisted row. (14 tests)
  • M15-2 — Full Safe Action Wiring: review_pack → project_observer.snapshot_all + review_pack.build_review_pack; outcome_record → OutcomeEvent + outcome_ingestor.ingest; daemon_once → autonomous_daemon.run_once with dry_run=True always, allow_paid=False, console-side hard caps max_tasks ≤ 25 / max_runtime ≤ 300 s; operator_status → full operator_command_center.collect; operator_risks → P0/P1 filter; learning_report → learning_report.generate; validation_dossiers / validation_campaigns → their respective M11/M12 indexes. (14 tests)
  • M15-3 — Report Browser 2.0: stdlib report browser with positive read allowlist. resolve_safe(rel) rejects absolute paths, parent-traversal, ".." inside any segment, and symlinks that escape reports/ai_engine/. New routes GET /api/reports/{tree,view,search} — HTML files are tagged html_escaped, never rendered as live HTML. Search is case-insensitive, capped at 200 hits and 64 KB per file. The path check is sketched after this list. (15 tests)
  • M15-4 — UI Workflow Polish: sidebar CHATS list backed by /api/history; + NEW button; EXPORT button (copies the exact CLI command to the clipboard — the export script writes to disk, not the JS); CLEAR button with a confirmation modal; creativity selector (conservative / speculative / wild / fantastic / red_team) sent on every prompt; per-mode placeholder hints; run-status spinner on the Send button. No new external dependencies; eval() / Function() / external scripts still absent. (17 tests)
  • M15-5 — Provider Status Panel: GET /api/providers/status returns booleans only: ollama_available (via shutil.which, never executed), paid_locked: true, publication_locked: true, network_enabled: false by default, plus six *_token_present booleans (OPENROUTER / HUGGINGFACE / ANTHROPIC / OPENAI / GROQ / TOGETHER) — only bool(env_var), never the value itself. The status reader makes no network call. (7 tests)
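
The M15-3 path check might look like the sketch below (assuming Python 3.9+ for is_relative_to; the error handling is an assumption):

```python
from pathlib import Path

ALLOWED_ROOT = Path("reports/ai_engine").resolve()

def resolve_safe(rel: str) -> Path:
    p = Path(rel)
    if p.is_absolute() or any(part == ".." for part in p.parts):
        raise PermissionError(f"unsafe path: {rel!r}")
    candidate = (ALLOWED_ROOT / p).resolve()  # also resolves symlinks
    if not candidate.is_relative_to(ALLOWED_ROOT):
        raise PermissionError(f"path escapes the allowlist: {rel!r}")
    return candidate
```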
M16 — Speculative Discovery Lab + Real Ideation Engine
SHIPPED

The console no longer answers "Generate 20 membrane ideas" with generic policy text. M16 turns the SOST AI Engine into an internal speculative scientist: it invents structured hypotheses with mechanism, falsifier and validation path; it separates plausible from speculative-but-testable from wild from fantastic; and it self-criticizes every idea. Everything stays internal_only. Nothing publishes, nothing executes DFT/CHGNet/GIS, nothing uses paid AI.

  • Idea schema: Idea dataclass with stable 16-char id (sha256 of project|domain|title), five score fields clamped to [0,1] (novelty / plausibility / utility / falsifiability / absurdity), seven evidence levels (locally_supported / plausible / speculative_but_testable / wild_but_testable / fantastic_unvalidated / non_testable_now / rejected), publishability defaulting to internal_only, and a priority() rule that drops to P3 for any non-testable idea regardless of score.
  • Five creativity dials: conservative / speculative (default) / wild / fantastic / red_team. The dial drives both the generator's score priors and the evidence-level distribution; conservative caps absurdity, fantastic admits non-testable, red_team is the self-critic.
  • Six domain generators: materials/membrane (6 family × 8 target combinatorics with 7 stock failure modes), materials/catalyst (6 family × 7 reaction), materials/photovoltaic (5 family × 5 risk), materials/ion_separation (8 ion targets), geaspirit/theory (8 cross-layer theories with public-safety risks), useful_compute/heavy_task (7 task kinds with fake-heavy guards). Each uses a sha256-derived RNG so the same prompt reproduces the same set (sketched after this list).
  • Self-critic / red-team: per-idea Critique with strongest_for / strongest_against / easiest_falsifier / most_likely_failure / promotes_if / kills_if. Every idea has a falsifier; ideas with falsifiability_score < 0.3 cannot rank above low priority.
  • Learning integration: the scorer reads M10 learning_adjustments and lowers patterns previously rejected, flagging them known_weak_pattern. Wild / fantastic creativity may still revisit weak patterns but they are clearly labelled.
  • Output folder: every batch writes reports/ai_engine/ideas/<UTC>_<slug>/ with ideas.md, ideas.jsonl, ranking.csv, falsifiers.md, validation_paths.md, risk_report.md, manifest.json (schema sost_ai_ideas@v1), and checksums.sha256 covering the other files. idea_index SQLite table for fast filtered queries.
  • DFT priority fix: the canonical "What candidates deserve DFT?" prompt now returns either a ranked candidate table built from M11 dossiers + M12 campaigns, or an explicit "no concrete candidate" message that points at the standalone scripts. Never claims DFT validation when no DFT artifact exists.
  • Console wiring: the ideate action returns the numbered idea list (mechanism / why-might-work / why-might-fail / first-test / falsifier / next step per idea). The discover action uses the same engine with creativity="fantastic" by default; the M13 typed-theory-graph deferral is surfaced as a warning, not a refusal. (38 tests)
  • CLI: scripts/multi_ai_generate_ideas.py mirrors the engine for offline use: --project, --domain, --count, --creativity, --prompt, --json, --no-render, --no-persist.
  • safety invariants: every idea is internal_only by default; fantastic / non_testable_now ideas keep that publishability; the renderer honours the M14 write allowlist; the engine never calls the network and never spawns a subprocess; all randomness is deterministic from the prompt + count + creativity tuple.
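
The two deterministic pieces named above (the stable 16-char id and the prompt-seeded RNG) can be sketched like this; the seed derivation details are assumptions:

```python
import hashlib
import random

def idea_id(project: str, domain: str, title: str) -> str:
    return hashlib.sha256(f"{project}|{domain}|{title}".encode()).hexdigest()[:16]

def seeded_rng(prompt: str, count: int, creativity: str) -> random.Random:
    seed = hashlib.sha256(f"{prompt}|{count}|{creativity}".encode()).digest()
    return random.Random(int.from_bytes(seed[:8], "big"))

# Same (prompt, count, creativity) tuple -> same idea set, by construction.
a = seeded_rng("membrane ideas", 20, "speculative")
b = seeded_rng("membrane ideas", 20, "speculative")
assert a.random() == b.random()
```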
M17 — Materials General Answerer (practical recommendations)
SHIPPED

Closes the answer-quality bug reported on M16: the console used to fall through to "General internal answer — see project context above." for practical materials questions like "que material es el mejor para hacer nanopartículas fácilmente?" ("which material is best for easily making nanoparticles?"). M17 routes those prompts through a structured answerer that returns a real ranked recommendation with mechanism, advantages, risks and first validation step per option. Pure stdlib, no network, no paid AI, no DFT claim.

  • Topic library: four practical-recommendation topics ship in M17 — nanoparticles, membranes, catalysts and abundant-element photovoltaics. Each topic carries title, summary lines, an option table (label, formula, ease, advantages, risks, first validation step) and a by-objective recommendation block.
  • Practical-cue gate: the answerer only fires when the prompt contains a recommendation cue (best / mejor / easiest / cheapest / recommend / which / how to / qué material). Bare keyword prompts still go through the standard flow, so the engine does not over-trigger. The cue check is sketched after this list.
  • Project gate: materials-only in M17. GeaSpirit, useful_compute and other projects keep their existing flow until their own answerers land.
  • Canonical nanoparticle answer: Au (citrate / Turkevich, plasmonic, expensive), SiO₂ (Stöber / sol-gel, cheap, scalable), Fe₃O₄ (coprecipitation, magnetic), TiO₂ (sol-gel, photocatalytic), Ag (Tollens / NaBH4, antimicrobial with aggregation+toxicity caveats). By objective: Au for demonstration / SiO₂ cheap-versatile / Fe₃O₄ magnetic / TiO₂ photocatalysis / Ag antimicrobial.
  • Evidence stamp: every answer states evidence_source: general_scientific_reasoning, confidence: medium, publishability: internal_only and no local DFT / CHGNet / lab artifact was consulted — this is a practical recommendation, not a validation. The engine never claims DFT validation without an actual artifact.
  • Wire-in: answer_composer.compose() consults the answerer before falling back to the legacy "see project context above" line. If the answerer matches, its body replaces the placeholder and its next_actions are merged into the result envelope.
  • safety invariants: no network call, no paid AI, no auto-publication, internal_only by default, no DFT validation claim. Every recommendation is offered as a starting point for a validation dossier or literature review — not as an answer to be published.
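
The practical-cue gate is a simple containment check; the cue list is quoted from the bullet above, the function shape is an assumption:

```python
CUES = ("best", "mejor", "easiest", "cheapest", "recommend", "which",
        "how to", "qué material")

def is_practical_recommendation(prompt: str) -> bool:
    p = prompt.lower()
    return any(cue in p for cue in CUES)
```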
M18 — Project Registry + Capability Gates (multi-repo orchestration)
SHIPPED

Closes the multi-repo gap reported on M17 / unified-lab launch: the read-only adapters used to hard-code the project root to materials-engine-private/, so the AI engine never actually saw the sibling GeaSpirit / SOST core / GeaDeep / Materials Discovery repos. M18 ships a stdlib JSON registry that declares every repo the engine may read, plus five canonical capability gates with strict deny-by-default semantics. The engine still cannot publish, run DFT, touch consensus or call the network unless the registry explicitly allows it — and the shipped registry never does.

  • Registry file: src/multi_ai_review/project_registry.json declares six projects: materials (~/SOST/materials-engine-private, the only project with can_write_repo: true), geaspirit (~/SOST/geaspirit, read-only sibling), geadeep (~/SOST/geadeep-energy-private, read-only sibling), sost (~/SOST/sostcore/sost-core, read-only; consensus surface), materials_discovery (~/SOST/materials-engine-discovery, read-only archive), useful_compute (lives inside materials-engine-private; read-only). Schema tag sost_ai_project_registry@v1.
  • Loader (stdlib only): project_registry.py with lazy + thread-safe cache. Resolution order: explicit path argument > SOST_AI_PROJECT_REGISTRY env var > the JSON next to the loader. A missing file, invalid JSON, wrong schema or unknown project all yield an empty registry — deny-by-default still applies.
  • Five canonical gates as pure functions: can_run_dft(project), can_publish(project), can_touch_consensus(project), can_use_network(project), can_write_repo(project). Plus gate_summary(project) for status responses. Every action that touches one of these axes consults the gate before running.
  • Deny-by-default semantics: unknown project → all gates False. Project present but flag absent → False (the registry-wide defaults are themselves all False). Only an explicit "can_X": true in a project block flips a gate (sketched after this list). Belt+braces in the test suite asserts that the shipped JSON never grants can_touch_consensus, can_publish or can_run_dft for any project.
  • Adapter integration (back-compat preserving): each adapters/*_readonly.py keeps the historical parents[3] as an explicit _FALLBACK_ROOT and adds a _repo_root() helper that consults the registry first. The module-level _REPO_ROOT alias is preserved so existing callers keep working. The live unified-lab daemon picks up the new resolution on next process restart without disruption mid-loop.
  • Concrete effect on the unified lab: GeaSpirit adapter now reads from ~/SOST/geaspirit (the real sibling repo, not the materials repo). SOST adapter now reads from ~/SOST/sostcore/sost-core. Useful Compute remains read-only inside materials-engine-private. Materials is the only project that can be written to from the AI engine.
  • safety invariants: no project grants consensus / publish / DFT in the shipped registry; the canonical write directories declared by console_security.ALLOWED_WRITE_SUBDIRS still apply on top of the gates; the daemon still runs with --network off. 22 new tests; 1,188 / 1,188 total green.
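
The deny-by-default semantics collapse to one lookup rule; this sketch assumes the registry shape described above, and the key names and loading details are illustrative:

```python
import json
from pathlib import Path

def load_registry(path: Path) -> dict:
    try:
        data = json.loads(path.read_text())
        if data.get("schema") != "sost_ai_project_registry@v1":
            return {}  # wrong schema -> empty registry -> deny everything
        return data.get("projects", {})
    except (OSError, json.JSONDecodeError):
        return {}  # missing/invalid file -> deny everything

def gate(registry: dict, project: str, flag: str) -> bool:
    # absent project or absent flag both fall through to False
    return bool(registry.get(project, {}).get(flag, False))

def can_run_dft(registry: dict, project: str) -> bool:
    return gate(registry, project, "can_run_dft")

def can_publish(registry: dict, project: str) -> bool:
    return gate(registry, project, "can_publish")

registry = load_registry(Path("project_registry.json"))
assert can_run_dft(registry, "unknown_project") is False  # deny-by-default
```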
M19 — Outcome Ledger + Autonomous Scoreboard + Project Heartbeat
SHIPPED

Closes the observability gap reported on the unified-lab launch: the daemon was running but its trail was not measurable. M19 adds three small layers that make autonomy verifiable — without granting it any new permissions. Every task the engine plans, dispatches or refuses leaves a row; every cycle leaves a heartbeat; the scoreboard rolls both up into a Markdown + JSON dashboard.

  • Task-level outcome ledger: outcome_ledger.py with the seven canonical statuses (planned / executed / skipped / failed / useful / wrong / repeated). Stores project, task_type, subject, summary, gate_blocked flag + reason, input/output hashes, saved_path. Helpers: record, list_outcomes, counts_by_status, positive_negative_ratio, gate_block_rate, already_seen(input_hash) for repeat detection (sketched after this list). Distinct from M10's subject-level outcome events — both layers coexist.
  • Per-cycle project heartbeat: project_heartbeat.py records one row per project per daemon cycle: tasks_generated, tasks_useful, tasks_failed, gate_blocks, routes_read, memory_updates, free-form notes. Operators can answer "is materials still learning, or just looping?" at a glance.
  • Autonomous scoreboard: autonomous_scoreboard.py rolls the ledger + last heartbeat + the eight registry gates per project up into a payload tagged sost_ai_autonomous_scoreboard@v1. Renders both autonomous_scoreboard.md and autonomous_scoreboard.json under reports/ai_engine/operator/ (already on the M14 write allowlist — no allowlist change needed).
  • Three new capability gates (deny-by-default): can_execute_heavy_task (DFT/CHGNet/GIS), can_create_public_draft (public-facing artifact prep), and can_update_memory (write into the AI engine's learning memory). Every project denies all three by default; only materials grants can_update_memory (its own memory). The full canonical list is now eight gates wide.
  • Belt+braces invariants (asserted by the test suite): no project grants can_execute_heavy_task; no project grants can_create_public_draft; only materials grants can_update_memory; unknown project denies all eight gates; the M18 invariants on the original five gates remain unchanged.
  • CLI: scripts/multi_ai_scoreboard.py renders the dashboard (--db / --out-dir / --since-hours / --json). Live smoke against an empty ledger returns the canonical schema tag, six projects, eight gates per project, and zeroed totals across the seven statuses.
  • safety invariants: M19 ships no permission grants beyond can_update_memory=true for materials. Still no auto-DFT, no auto-publication, no auto-commits, no consensus surface. The unified-lab daemon picks up the new tables on next process restart; no mid-loop disruption. 26 new tests; 1,214 / 1,214 total green.
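
Repeat detection via input hashes reduces to an existence query; the table and column names here are assumptions:

```python
import sqlite3

def already_seen(conn: sqlite3.Connection, input_hash: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM outcome_ledger WHERE input_hash = ? LIMIT 1",
        (input_hash,),
    ).fetchone()
    return row is not None
```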
M20 — Experiment Planner + Replay Sandbox (counterfactual rule evaluation)
SHIPPED

Closes the "method" gap: the engine could ideate (M16), recommend (M17), and run autonomously (M19), but it could not ask itself whether its proposed rules would have actually worked on past data. M20 ships the scientific-method layer: hypothesis → plan → counterfactual replay against M10 history → GO / MAYBE / NO-GO → memory. No new execution permissions — the engine still cannot run anything, only plan and simulate.

  • Baseline registry (auto-derived from M10): baseline_registry.py reads M10's outcome_events table and classifies each row as positive (e.g. dft_success, chgnet_stable, literature_supported, human_promoted, provider_useful_answer...) or negative (dft_failure, chgnet_red_flag, human_rejected, provider_overclaim_confirmed...). Optional baseline_overrides.json next to the engine wins on conflict. Baselines dedupe per (project, subject) — twelve "FeS2 promoted" events count as one positive signal, not twelve.
  • Experiment planner: experiment_planner.py + experiment_plans SQLite table. plan(hypothesis, project) picks project-aware step lists: photovoltaic-flagged materials hypotheses get a band_gap_estimate (DFT input draft only — never run) step; generic materials get the membrane/catalyst step list; geaspirit gets layer-gap + false-positive + publication-safety; useful_compute gets schema design + fake-heavy + reproducibility. Stable plan_id (sha256-derived) so the same hypothesis at the same time produces the same id.
  • Replay sandbox — counterfactual evaluator: replay_sandbox.py. A Rule is a callable (subject, context) -> bool. replay() applies the rule to the historical positives and negatives and returns precision, recall, accuracy, F1 plus the TP/FP/TN/FN counters and a verdict. Decision matrix: precision ≤ 0.30 OR recall ≤ 0.20 → NO-GO; precision ≥ 0.65 AND recall ≥ 0.50 → GO; otherwise MAYBE.
  • Operator's golden rule: "if there is no historical signal, there is no certainty." The sandbox enforces a strict ceiling: historical_count < min_historical (default 20) means the strongest possible verdict is MAYBE — never GO, regardless of how clean precision and recall look. The system stays honest about its own confidence. The combined decision rule is sketched after this list.
  • Renderer: writes reports/ai_engine/experiments/<UTC>_<slug>/ with plan.md, plan.json, replay_result.json, decision.md (GO/MAYBE/NO-GO + reason) and checksums.sha256. Honours the M14 write allowlist; new entry reports/ai_engine/experiments added.
  • Three new capability gates (deny-by-default): can_plan_experiment and can_replay_experiment are true for materials / geaspirit / useful_compute (the projects that produce hypotheses); false for sost / geadeep / materials_discovery. can_execute_experiment is false for every project and asserted by the test suite. The canonical gate set is now eleven wide.
  • CLIs: scripts/multi_ai_plan_experiment.py turns a hypothesis into a plan + on-disk folder; scripts/multi_ai_replay_experiment.py counterfactually evaluates a keyword-based rule against M10 history and prints the verdict. Both refuse to run on projects that lack the gate.
  • safety invariants: M20 ships no execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. Belt+braces tests assert can_execute_experiment is denied for every registered project. 25 new tests; 1,239 / 1,239 total green.
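
The decision rule, with the thresholds and the min-historical ceiling quoted above (the function shape, and the reading that the ceiling caps GO but leaves NO-GO intact, are assumptions):

```python
def verdict(precision: float, recall: float, historical_count: int,
            min_historical: int = 20) -> str:
    if precision <= 0.30 or recall <= 0.20:
        return "NO-GO"
    if historical_count < min_historical:
        return "MAYBE"  # "no historical signal, no certainty": never GO
    if precision >= 0.65 and recall >= 0.50:
        return "GO"
    return "MAYBE"
```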
M21 — Canonical Mission Engine + AI Scientist Swarm (six roles + contradiction-to-discovery)
SHIPPED

Closes the "criterion" gap: the engine could ideate, plan, replay and score, but it had no explicit compass. M21 codifies the project objectives, runs every hypothesis through a six-role scientist swarm, mines contradictions for refinement opportunities, and persists the canonical memory of the cycle. The system stops being "the AI that proposes things" and starts being "the AI that asks itself if its proposals would have worked" — guided by an operator-defined mission. No new execution permissions.

  • Canonical objectives: canonical_objectives.py codifies eight materials objectives (defensible_discovery, cost_abundance_stability, reduce_false_positives_pre_dft, catalysts_no_pgm, non_toxic_pv, hydrogen_proton_membranes, water_desalination, industrial_robust), six geaspirit objectives (real_world_systems, deep_sea_environment, cost_near_concrete, proxy_vs_proof, multi_layer_evidence, public_safety_review), two geadeep objectives (deep_sea_energy, manufacturability_at_scale) and two useful_compute objectives (determinism, reproducibility). Each objective carries align_keywords + kill_criteria + proxies as data.
  • Mission alignment scorer: mission_alignment.py scores a hypothesis against the canonical objectives for its project. Per-objective score in [0, 1], aggregate weighted score, fired-kill-criteria flag, matched-objective count. Refuses to run when can_rank_mission_alignment is denied.
  • Six-role AI scientist swarm: ai_scientist_swarm.py runs six pure-function roles on every hypothesis — Discoverer (bold expansion), Skeptic (kill-criteria + weak proxies), Engineer (fabricable / lab_only / unknown), Economist (cheap / moderate / expensive via Pt/Pd/Rh/In/Ga/Cd/Te keywords), Validator (cheapest test menu, project-aware), Historian (consults outcome_ledger for revisit_failed / consistent_with_history / contested_history). Mean of the six role scores becomes the swarm score.
  • Contradiction-to-Discovery loop: critique_loop.py turns disagreements between roles into structured opportunities: "is there a doping/sibling that resolves the contradiction? does the candidate serve another application better? is the contradiction a false positive in the predictor? does the candidate deserve a plan or an archive?". Archive policy: low swarm_score (< 0.25), or revisit_failed + weak discoverer, or kill-criterion-fired + score < 0.4 forces archive (the archive rule is sketched after this list).
  • Canonical memory: canonical_memory.py persists per-cycle summaries (mission_cycles table) and per-hypothesis records (mission_hypotheses table). Gated by can_update_memory from M19 — only materials writes here.
  • Three new capability gates (deny-by-default): can_generate_hypothesis, can_critique_hypothesis and can_rank_mission_alignment are true for materials, geaspirit, useful_compute; false for sost, geadeep, materials_discovery. The canonical gate set is now fourteen wide (5 from M18 + 3 from M19 + 3 from M20 + 3 from M21).
  • CLI: scripts/multi_ai_mission_cycle.py runs the full pipeline end to end: read objectives -> M16 generate N hypotheses -> swarm + critique loop -> mission_alignment -> top-K -> M20 plan + replay -> render reports/ai_engine/mission_cycle/<UTC>_<slug>/ with cycle_manifest.json, hypotheses.jsonl, top_picks.md, contradictions.md, next_best_actions.md; per-hypothesis row in the M19 outcome ledger + one heartbeat row.
  • safety invariants: M21 grants no execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. can_update_memory remains true only for materials. The live unified-lab daemon keeps running unchanged — M21 modules are ready but not yet wired into its hot path. 33 new tests; 1,272 / 1,272 total green.
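
The archive rule can be sketched directly from the policy above; the 0.25 and 0.4 thresholds are quoted, the "weak discoverer" cutoff is an assumption:

```python
WEAK_DISCOVERER = 0.3  # assumed cutoff for a "weak" discoverer score

def should_archive(swarm_score: float, revisit_failed: bool,
                   discoverer_score: float, kill_criterion_fired: bool) -> bool:
    if swarm_score < 0.25:
        return True
    if revisit_failed and discoverer_score < WEAK_DISCOVERER:
        return True
    if kill_criterion_fired and swarm_score < 0.4:
        return True
    return False
```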
M22 — Research Frontier Map + Strategic Autonomy (40/30/20/10 attention split + weekly roadmap)
SHIPPED

Closes the "agenda" gap. M21 gave the engine voices and a compass; M22 gives it a map and a calendar. Every recent hypothesis is grouped into a research frontier, gaps in coverage are surfaced, the next cycle's attention is split across explore / exploit / falsify / review, and a weekly roadmap is rendered alongside an anti-obsession watchdog. No new execution permissions; the system still cannot run anything autonomously.

  • Frontier map: frontier_map.py reads M21 mission_hypotheses + M19 outcome_ledger + canonical_objectives. Coarse family detection via ordered substring rules (LDH / kesterite / antimony chalcogenide / oxysulfide / single-atom / PGM / phosphide / nitride / sulfide / layered oxide / ceramic). Each Frontier carries hypothesis count, average swarm + alignment scores, three representative subjects, positive/negative outcome counts and a status label (active / thin / stagnant / contested).
  • Research-gap detector: research_gap_detector.py surfaces five canonical gap kinds: many_ideas_no_evidence (many hypotheses, zero positives), evidence_no_exploration (positives but no recent extension), recurring_contradictions (same swarm contradiction string seen ≥ 3 cycles), promising_no_validation (high alignment + swarm but no experiment plan attached), objective_uncovered (canonical objective with zero matched hypotheses).
  • Strategic allocator (40 / 30 / 20 / 10): strategic_allocator.py proposes the next cycle's attention split — 40 % exploit (active frontiers with the highest combined score), 30 % explore (uncovered objectives + thin frontiers), 20 % falsify (high-promise hypotheses without a plan), 10 % review (stagnant / contested frontiers + recurring contradictions). Each item carries weight, target, detail and rationale.
  • Anti-obsession guard: anti_obsession_guard.py watches the family distribution of the last 30 mission_hypotheses. If any one family covers more than 40 % of the window (and the sample is at least 10), it raises an ObsessionFlag with an explicit "diversify with creativity=wild or seed prompts from a different family" suggestion. Keeps the engine from locking onto one appealing chemistry family just because keywords keep matching (threshold logic sketched after this list).
  • Canonical weekly roadmap: canonical_roadmap.py combines all four pieces into one operator-facing plan: try this week (exploit + falsify), explore this week, archive / discard (contested + thin frontiers with low scores), wait — needs more data (review items), plus an anti-obsession notes block. Internal-only; the roadmap is advisory and does not authorise execution.
  • Three new capability gates (deny-by-default): can_build_frontier_map, can_allocate_research_attention and can_write_roadmap are true for materials, geaspirit, useful_compute; false for sost, geadeep, materials_discovery. The canonical gate set is now seventeen wide (5 + 3 + 3 + 3 + 3). The hard locks (run_dft / publish / touch_consensus / execute_*) remain false everywhere.
  • CLI: scripts/multi_ai_frontier_cycle.py runs the pipeline end to end and renders reports/ai_engine/frontier/<UTC>_<slug>/ with the six canonical files: frontier_map.json, frontier_map.md, research_gaps.md, strategic_allocation.json, weekly_roadmap.md, anti_obsession_report.md.
  • safety invariants: M22 grants no new execution surface; no auto-DFT, no auto-publication, no auto-commits, no consensus surface, no new network capability. The live unified-lab daemon keeps running unchanged. 30 new tests; 1,302 / 1,302 total green.
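
The anti-obsession threshold logic fits in a few lines; the 40 % share, 30-hypothesis window and 10-sample floor are quoted above, the return shape is an assumption:

```python
from collections import Counter

def obsession_flag(families: list, window: int = 30,
                   share_limit: float = 0.40, min_sample: int = 10):
    recent = families[-window:]
    if len(recent) < min_sample:
        return None  # not enough signal to call anything an obsession
    family, count = Counter(recent).most_common(1)[0]
    share = count / len(recent)
    if share <= share_limit:
        return None
    return {"family": family, "share": round(share, 2),
            "suggestion": "diversify with creativity=wild or seed prompts "
                          "from a different family"}
```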
M23 — Private AI Ops Dashboard + Background Lab Autopilot (localhost cabin + token gate)
SHIPPED

Closes the "cabin" gap. M14 is a chat-style console; M23 gives operators a read-only operations dashboard at http://127.0.0.1:8766 plus a background autopilot that runs M21 mission cycles + M22 frontier cycles on a configurable interval. Localhost only. Token-gated. No public surface. No new execution permissions.

  • Token surface (separate from M14): ai_ops_token.py. issue(ttl_seconds=4h) returns the cleartext exactly once; on disk only the SHA-256 over a per-issue salt is persisted. verify() uses constant-time hmac.compare_digest plus an expiry check. revoke() unlinks the record. status() returns metadata without the cleartext. The issue/verify pair is sketched after this list.
  • Read-only state aggregator: ai_ops_state.py pulls from M19 (outcome ledger + heartbeats), M20 (experiment plans), M21 (canonical memory), M22 (frontier map) and the registry gate set. full_state() is the one-shot snapshot the dashboard's /api/ops/state endpoint serves.
  • Dashboard server: ai_ops_dashboard_server.py — stdlib ThreadingHTTPServer, refuses non-local bind with PermissionError (remote access via SSH tunnel only). Single HTML page (CSP default-src 'self', no CDN, no cookies, no localStorage), one JS file at /static/ops.js, four JSON endpoints under /api/ops/* — all /api/* require Bearer token.
  • Background autopilot: ai_ops_autopilot.py. tick() runs one M21 mission cycle plus one M22 frontier cycle for every project that holds the full pipeline gate set; disallowed projects produce a TickResult with ok=false and a clear "missing required gates" error. loop() refuses interval_seconds < 30. Best-effort: never raises mid-loop. Persists outcome ledger + heartbeats; writes canonical memory only when can_update_memory permits (materials only).
  • Three CLI scripts: scripts/sost_ai_generate_token.py issues a fresh token and prints the dashboard URL once on stdout; scripts/sost_ai_ops_dashboard.py starts the HTTP server (refuses non-local hosts); scripts/sost_ai_autopilot.py ticks once or loops indefinitely with operator-tunable creativity / count / interval.
  • safety invariants: localhost only by construction; token never appears on disk; the autopilot grants no new permissions; if a gate denies a step the project is skipped for that tick; all /api/* endpoints are read-only; no DFT, no publish, no consensus, no commit. 23 new tests; 1,325 / 1,325 total green.
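
The issue/verify pair can be sketched as below; the record layout and storage location are assumptions, while the salted-SHA-256-only persistence and constant-time verify come from the bullet above:

```python
import hashlib, hmac, json, secrets, time
from pathlib import Path

RECORD = Path("ai_ops_token.json")  # illustrative location

def issue(ttl_seconds: int = 4 * 3600) -> str:
    token = secrets.token_urlsafe(32)
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + token).encode()).hexdigest()
    RECORD.write_text(json.dumps({"salt": salt, "digest": digest,
                                  "expires": time.time() + ttl_seconds}))
    return token  # the cleartext leaves this function exactly once

def verify(presented: str) -> bool:
    try:
        rec = json.loads(RECORD.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    if time.time() > rec["expires"]:
        return False
    candidate = hashlib.sha256((rec["salt"] + presented).encode()).hexdigest()
    return hmac.compare_digest(candidate, rec["digest"])
```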
M24 — Useful Compute Task Intelligence Layer (private classifier + spec generator + staging queue)
SHIPPED

Closes the "what counts as a heavy task?" gap reported on the Useful Compute live trial. The public API and worker stay as a dry-run infrastructure; M24 ships a private lab where the AI engine classifies, spec-generates and stages heavy-task candidates. Public publishing and reward activation remain hard-locked behind gates that no project grants in the shipped registry — the operator's CTO verdict.

  • Heavy-task classifier: heavy_task_classifier.py with the five canonical accept axes the operator promised: is_useful (no busy-wait / fake-heavy), is_deterministic (declared + no race / unseeded random hint), is_auditable (declared + no not-auditable hint), is_heavy_enough (runtime ≥ 60 s AND memory ≥ 256 MB), and is_safe_to_verify (no "verifier rerun required" / "no replay possible" hint). Eight curated keyword vocabularies. A task that fails ANY axis is rejected with the offending axis listed. The all-axes rule is sketched after this list.
  • Spec generator: heavy_task_spec_generator.py. TaskSpec + per-project schema templates (materials / geaspirit / useful_compute), pinned-deps declaration, fixed-seed policy, replay = ~10 % of original runtime, explicit fake-heavy baseline ("busy-wait must NOT match the output within tolerances"). Reward class always starts at no_reward; visibility is internal_only when DFT / raw-geospatial / wallet / consensus keywords match, otherwise human_review_required.
  • Private staging queue: useful_compute_private_queue.py + SQLite useful_compute_private_queue table. stage() refuses when can_stage_private_useful_compute_task is denied — only the useful_compute project holds that gate. attempt_publish() and attempt_enable_rewards() are the documented entry points for future operator-only workflows but always return {ok: false, reason: "denied"} because every project denies the corresponding gate.
  • Orchestrator + seed templates: useful_compute_task_intelligence.py orchestrates classify -> spec -> (optional stage). seed_candidates(project) exposes operator-blessed seed examples per project (5 materials, 5 geaspirit, 3 useful_compute) — every seed is engineered to pass the classifier so operators have a known-good baseline.
  • Four new capability gates (deny-by-default): can_design_useful_compute_task (true: materials, geaspirit, useful_compute), can_stage_private_useful_compute_task (true: useful_compute ONLY), can_publish_useful_compute_task (false EVERYWHERE), can_enable_useful_compute_rewards (false EVERYWHERE). The canonical gate set is now twenty-one gates wide. Belt-and-braces tests assert that the publish and reward gates remain false for every project.
  • CLI: scripts/multi_ai_useful_compute_task_lab.py with --seeds (operator-blessed templates), --title + --description (ad-hoc candidate), --stage (private staging when the gate permits) and --json for the full pipeline payload.
  • Operator (CTO) verdict surfaced in code: public Useful Compute API stays as dry-run infrastructure; heavy reward-bearing tasks stay private until the AI knows to classify, validate and audit them; rewards stay OFF until M24+ pronounces a task family ready; publishing stays OFF until a separate human-approval workflow lands. 29 new tests; 1,354 / 1,354 total green.
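
The accept rule is pure conjunction, so it compresses to a few lines. A minimal sketch, assuming a simplified TaskCandidate shape (the shipped classifier derives the booleans from keyword vocabularies rather than taking them pre-computed; only the runtime ≥ 60 s AND memory ≥ 256 MB threshold and the fail-any-axis rule are documented):

```python
# Sketch of the five-axis accept rule from heavy_task_classifier.py.
from dataclasses import dataclass

@dataclass
class TaskCandidate:
    runtime_seconds: float
    memory_mb: float
    useful: bool            # no busy-wait / fake-heavy
    deterministic: bool     # declared, no race / unseeded-random hint
    auditable: bool         # declared, no not-auditable hint
    safe_to_verify: bool    # replay possible without a full verifier rerun

def classify(task: TaskCandidate) -> dict:
    axes = {
        "is_useful": task.useful,
        "is_deterministic": task.deterministic,
        "is_auditable": task.auditable,
        "is_heavy_enough": task.runtime_seconds >= 60 and task.memory_mb >= 256,
        "is_safe_to_verify": task.safe_to_verify,
    }
    failed = [name for name, ok in axes.items() if not ok]
    # A task that fails ANY axis is rejected, with the offending axes listed.
    return {"accepted": not failed, "failed_axes": failed}
```
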
M25 — Human Command Center + Approval Ritual (daily brief, approval inbox, executive summary, operator feedback loop)
SHIPPED

Closes the loop between the autonomous engine and the human operator. The AI may *propose* concrete decisions (campaigns, dossiers, false-positive archives, frontier promotions, useful-compute staging requests, DFT input prep); the operator approves or rejects them. Golden rule: the AI may ask for permission. The AI cannot grant permission to itself.

  • Approval request schema: approval_request.py. ApprovalRequest dataclass with deterministic request_id (SHA-256 over project / kind / subject / timestamp). Eight canonical kinds: approve_campaign, reject_hypothesis, convert_to_dossier, stage_useful_compute, prepare_dft_input, archive_false_positive, promote_frontier_family, demote_frontier_family. Lifecycle pending → approved | rejected | withdrawn. create() is gated by can_create_approval_request; approve() and reject() require a non-empty operator argument — the engine cannot generate one from inside an automated tick (see the sketch after this list).
  • Operator inbox: operator_inbox.py. Aggregates pending approvals across projects, sorts by kind priority (false-positive archive ≫ approve campaign), and renders a markdown table for the CLI plus sost_ai_operator_inbox@v1 JSON for the dashboard.
  • Daily brief: daily_brief.py. Five-section UTC report: what the engine did (recent ledger rows), what it learned (recent canonical mission cycles), what it found (top frontier families + research gaps), where it was blocked (capability-gate denials in the window), and what it recommends (open approval requests). Saved under reports/ai_engine/daily_brief/<UTC>/ as brief.json + brief.md. Default 24 h lookback.
  • Operator feedback loop: operator_feedback.py. On approve → records a useful task outcome in M19's ledger plus a positive hypothesis_learning_event. On reject → records a wrong outcome plus a negative learning event so the M21 swarm down-weights the pattern. Idempotent per request_id. Gated by can_apply_operator_feedback — only materials holds that gate.
  • Executive summary: executive_summary.py. Top-3-per-section operator view per project: opportunities (frontier families ranked by swarm × count + outcome bias), risks (research gaps + gate-block density + negative-signal pressure), next actions (oldest pending approvals).
  • Dashboard endpoints (M23 surface, extended): /api/ops/approvals, /api/ops/daily-brief, /api/ops/executive-summary, /api/ops/decision-history. All localhost-only and token-gated; payloads are read-only.
  • Four CLI scripts: scripts/sost_ai_daily_brief.py, scripts/sost_ai_operator_inbox.py, scripts/sost_ai_approve_request.py, scripts/sost_ai_reject_request.py. Approve/reject scripts require --operator; both refuse to run with an empty operator name. Optional --apply-feedback flag pipes the decision into the M19 ledger.
  • Three new gates, deny-by-default: can_create_approval_request (true for materials, geaspirit, useful_compute), can_apply_operator_feedback (true only for materials), can_execute_approval (false everywhere — even a human approval does not unlock automated execution; the engine still refuses to run side-effects from an approval row by itself).
  • Safety invariants: no DFT execution, no public publishing, no consensus surface, no rewards, no self-approval, no new network capability. The dashboard remains localhost-only. 48 new tests; 1,402 / 1,402 total green.
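
The deterministic ID and the operator requirement compress to a short sketch. The field order and separator in the hash payload are assumptions; only the SHA-256-over-four-fields rule and the non-empty-operator check are documented:

```python
# Sketch of the deterministic request_id and the no-self-approval rule
# from approval_request.py (payload layout is illustrative).
import hashlib

def request_id(project: str, kind: str, subject: str, timestamp: str) -> str:
    payload = "\n".join((project, kind, subject, timestamp))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def approve(request: dict, operator: str) -> dict:
    # approve()/reject() demand a non-empty operator name; the engine cannot
    # supply one from inside an automated tick, so self-approval is impossible.
    if not operator.strip():
        raise ValueError("operator name is required")
    return {**request, "status": "approved", "operator": operator}
```
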
// 03 — PIPELINE

From 100,000 hypotheses to a few human-reviewed ones

1. Mass local generation: ~10 s to 100,000 hypotheses, deterministic, no network
2. Deduplication: stable hash + project-specific pair keys (sketch after this list)
3. Local evidence search: read-only across docs/, reports/, outputs/
4. Validator pass: M2 validators per project + orchestrator merge
5. Deterministic ranking: weighted formula, deterministic across runs
6. Top-N selection: per-campaign keep_top_n (typically 10–200)
7. AI Council on top-K: validator-veto, not majority vote; budget-gated
8. Applicability profile: utility, FP risk, validation pathway, publishability
9. Report & backlog: JSON report + static HTML dashboard + SQLite log
10. Human review: required before any public claim or DFT promotion
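
Step 2 is the easiest to sketch. A minimal illustration of stable-hash deduplication with a project-specific pair key; the normalisation and the exact key shape are assumptions:

```python
# Sketch of pipeline step 2: stable content hash + project-specific pair key.
import hashlib

def stable_hash(text: str) -> str:
    # Normalise whitespace and case so trivially rephrased duplicates collide.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def dedupe(project: str, hypotheses: list[str]) -> list[str]:
    seen: set[tuple[str, str]] = set()
    unique: list[str] = []
    for hyp in hypotheses:
        key = (project, stable_hash(hyp))  # project-specific pair key
        if key not in seen:
            seen.add(key)
            unique.append(hyp)
    return unique
```
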
// 04 — EVIDENCE LEVELS

Every claim is classified before it can be promoted

level                                        | meaning                                                           | publishable?
local_code_verified                          | verified by reading repo source                                   | yes
local_data_verified                          | verified by reading a local DB / data file                        | yes
local_doc_supported / local_report_supported | supported by a local doc or internal report                       | yes
external_official_supported                  | supported by an official free public source (arXiv, OpenAlex, …)  | with caveat
multi_source_supported                       | supported by ≥ 2 independent source types                         | yes
model_consensus_only                         | only models agree; no data or doc backs it                        | no
speculative                                  | weak signals, not reproducible                                    | no
contradicted                                 | local evidence contradicts the claim                              | no
insufficient_evidence                        | nothing collected to back the claim                               | no
do_not_publish                               | public-facing claim with insufficient backing                     | block
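
The table compresses naturally to a lookup. A minimal sketch; the publishability strings follow the table above, while the helper and the default-to-block behaviour are assumptions:

```python
# Sketch of the evidence-level table as a publishability lookup.
PUBLISHABILITY = {
    "local_code_verified": "yes",
    "local_data_verified": "yes",
    "local_doc_supported": "yes",
    "local_report_supported": "yes",
    "external_official_supported": "with_caveat",
    "multi_source_supported": "yes",
    "model_consensus_only": "no",
    "speculative": "no",
    "contradicted": "no",
    "insufficient_evidence": "no",
    "do_not_publish": "block",
}

def may_publish(level: str) -> bool:
    # Unknown levels fall through to "block" — an assumption, but consistent
    # with the engine's deny-by-default posture.
    return PUBLISHABILITY.get(level, "block") == "yes"
```
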
// 05 — PROJECT INTEGRATIONS

How it plugs into SOST projects

Materials Engine

Generates and ranks material candidates by family, application and element-cost proxy. Only candidates with explicit DFT or CHGNet evidence in the local corpus may be recommended for promotion to a real DFT queue. Predicted-only candidates are clearly marked predicted_only — never validated.

GeaSpirit

Evaluates mineral / depth / coordinate / certainty claims with built-in conservatism. Any wording implying guaranteed mineral presence, certain depth or 100% certainty is blocked at do_not_publish. Depth claims require depth-aware geophysics (gravity / magnetics / AEM); satellite-only signals are labelled as surface proxy, never as subsurface evidence.
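
A minimal sketch of the wording rule, assuming a hypothetical pattern list (the shipped lexicon is richer, and the conservative non-match verdict below is an assumption):

```python
# Sketch of the GeaSpirit certainty-wording rule: guaranteed-presence,
# certain-depth and 100%-certainty language is forced to do_not_publish.
import re

CERTAINTY_PATTERNS = [
    r"guaranteed\s+mineral",
    r"certain\s+depth",
    r"100%\s+certain",
]

def geaspirit_verdict(text: str) -> str:
    for pattern in CERTAINTY_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return "do_not_publish"
    return "needs_human_review"  # conservative default, an assumption
```
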

SOST Protocol

Audits public wording, consensus and explorer claims. Locks: avg288 is consensus (avg600 / avg1000 are informational); cASERT 6210 fork was cancelled; "mandatory update" requires both fork wording and a documented activation height. Layered atop the Phase 1 public-claim guard.

Useful Compute

Designs candidate Heavy tasks (DFT relax, CHGNet pre-screen, GeaSpirit feature extraction, ...) and classifies each into ready_for_design, needs_benchmark, needs_reproducibility_solution, too_light, deferred or rejected. Every profile carries the explicit "rewards postponed" warning. The engine never activates rewards and never writes to a queue.

// 06 — WHAT IT DOES NOT DO

Hard guarantees

// THE ENGINE NEVER

  • publishes autonomous conclusions to the public web
  • activates Useful Compute rewards
  • modifies SOST consensus, miner code or node code
  • writes to Useful Compute queues
  • signs transactions or moves funds
  • reads wallet files or private keys
  • scrapes websites or bypasses rate limits
  • calls paid models without an explicit operator flag
  • makes any network call by default
  • treats model-only consensus as verified evidence
  • guarantees mineral discovery or material validation
// 07 — STATUS

Current numbers

Tests passing: 1,402 / 1,402 (M1–M12 + M14–M25; M13 deferred)
Free public source connectors: 8 (arXiv, OpenAlex, Crossref, PubChem, JARVIS, Materials Project, USGS, generic-official)
Hypothesis generation capacity: ≥ 100,000 candidates offline, in seconds
Network calls per run (default): 0 — network is OFF by default
Paid model calls per run (default): 0 — paid is OFF by default
Persistence: SQLite, idempotent migrations, 39 internal tables (9 added in M10 for the outcome learning loop; 3 in M11 for the dossier factory: validation_dossiers, validation_experiments, dossier_index; 4 in M12 for the campaign orchestrator: validation_campaigns, campaign_dossiers, campaign_approvals, campaign_index; 3 in M14 for the private console: console_sessions, console_messages, console_actions; 1 in M16 for the speculative discovery lab: idea_index)
Public outputs: none autonomous — human review required
Source license: private repo — outputs only released after audit

// EXPERIMENTAL STATUS

This is an experimental internal research system. It is not a product, not a financial recommendation engine, and not a guaranteed scientific oracle. Its outputs are internal-only by default, and any candidate later promoted to a public claim must pass human review and an explicit M2 validator verdict.

// 08 — SCIENTIFIC INTAKE ENGINE

10 phases · 169 tests · offline by default

// SUMMARY

The Scientific Intake Engine is the working layer inside the Materials Engine that turns free-form scientific text (papers, news, proposals, speculative ideas) into a structured, audited research pipeline: claims → evidence → multi-agent reasoning → falsification-first experiment plan → feedback loop → portfolio campaigns → knowledge graph → dossiers → review queue → autonomous orchestration. Ten phases shipped end-to-end. Every phase is local-first, deterministic, append-only, and offline by default. Zero paid API dependencies.

// HARD RULES (enforced in code, not just documented)

Impossible-physics never escapes the kitchen. Sources flagged with closed-timelike-curve / perpetual-motion / over-unity / exceeds-Carnot language are capped at REJECT, cannot be promoted to high-priority, cannot enter a publication package, and cannot be transitioned to published — even with explicit operator override.

No auto-publish. The autonomous orchestrator only ever enqueues review items; the operator must walk them through inbox → reviewing → approved → published manually.

No procedural detail, ever. A shared sanitiser scrubs grams / temperatures / pressures / ignition / detonation cues from every report and plan field before persistence.
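
A minimal sketch of such a sanitiser, assuming an illustrative cue pattern and replacement token (the shipped cue list is broader):

```python
# Sketch of the shared sanitiser: procedural cues (quantities, temperatures,
# pressures, ignition/detonation language) are scrubbed before persistence.
import re

PROCEDURAL_CUES = re.compile(
    r"\b\d+(\.\d+)?\s*(g|grams?|°C|K|bar|psi|MPa)\b|\bignit\w*|\bdetonat\w*",
    re.IGNORECASE,
)

def sanitise(field: str) -> str:
    # Every report and plan field passes through this before being written.
    return PROCEDURAL_CUES.sub("[REDACTED]", field)
```
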

Writes require auth; reads are public (with the single exception of the operational review surface, which requires auth on both sides).

// THE FULL CHAIN

Scientific text
→ Phase I     claims / entities / 6 scores / hypotheses
→ Phase II    evidence acquisition (Crossref / arXiv / Semantic Scholar)
→ Phase III   multi-agent reasoning + decision memo
→ Phase IV    falsification-first experiment plan + safety gates
→ Phase V     feedback loop + active learning + calibration
→ Phase VI    autonomous campaigns + portfolio ranking
→ Phase VII   knowledge graph + clusters + opportunities
→ Phase VIII  dossiers + reports (Markdown / JSON / HTML)
→ Phase IX    review queue + state machine + publication packages
→ Phase X     autonomous orchestration (5 modes, one endpoint)

// PHASE-BY-PHASE

Phase I — Scientific Intake (text → claims → scoring)
SHIPPED · 14 tests

Free-form text intake with claim extraction, named-entity recognition (materials, formulas, technologies, institutions, people, physical concepts), six-score evaluation (credibility / novelty / feasibility / evidence / commercial / research priority) and seed hypothesis generation. SHA-256 dedup. Auth via X-Intake-Key, sliding-window per-IP rate limit. Impossible-physics gate forces REJECT regardless of score.
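
The impossible-physics gate is a hard cap, not a score penalty, which a few lines make concrete. Flag names and the non-reject thresholds below are assumptions; the cap-at-REJECT-regardless-of-score rule is the documented one:

```python
# Sketch of the impossible-physics gate: flagged sources are capped at
# REJECT no matter how well they score elsewhere.
IMPOSSIBLE_PHYSICS_FLAGS = {
    "closed_timelike_curve", "perpetual_motion", "over_unity", "exceeds_carnot",
}

def final_verdict(score: float, flags: set[str]) -> str:
    if flags & IMPOSSIBLE_PHYSICS_FLAGS:
        return "REJECT"  # no score, override, or expert review can lift this
    return "ACCEPT" if score >= 70 else "REVIEW"  # thresholds are assumptions
```
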

Phase II — Evidence acquisition adapters
SHIPPED · 13 tests

Free-tier Crossref / arXiv / Semantic Scholar adapters with per-source dedup keyed on doi → arxiv_id → lower-title. Public-source-only. Cache + retry built in. Re-scores the source's evaluation against acquired evidence. No paid API dependency anywhere.
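
The fallback chain of the dedup key is worth making explicit. A minimal sketch, assuming a plain dict record shape:

```python
# Sketch of the Phase II dedup key: doi, falling back to arxiv_id,
# falling back to the lower-cased title.
def dedup_key(record: dict) -> str:
    if record.get("doi"):
        return f"doi:{record['doi'].lower()}"
    if record.get("arxiv_id"):
        return f"arxiv:{record['arxiv_id'].lower()}"
    return f"title:{record.get('title', '').strip().lower()}"
```
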

Phase III — Multi-agent evidence reasoning
SHIPPED · 11 tests

Five local deterministic agents (Skeptic / Evidence / Feasibility / Commercial / DecisionMemo) produce per-claim stance assessments (supports / contradicts / mixed / unrelated / insufficient) and a decision memo with six component scores. The impossible-physics gate persists through Phase III as well — the final score is capped at 15 in that case.

Phase IV — Experiment planning + safety guardrails
SHIPPED · 12 tests

Five local agents (Safety / Simulation / Falsification / Designer / Milestone) produce a falsification-first plan: literature_review → falsification → calculation → simulation → safety_review → benchmark → external_lab → bench_test → prototype. Hydrogen, combustion, high-pressure, cryogenic, radiation, toxic domains all surface as safety_review with mandatory external-lab routing. Refuses procedural detail entirely.

Phase V — Active learning feedback loop
SHIPPED · 14 tests

Append-only feedback ingestion (literature_review / simulation / benchmark / lab_result / expert_review / correction) with six outcomes (supports / contradicts / inconclusive / unsafe / failed_replication / successful_replication). Per-outcome delta tables, confidence scaling, recommendation rank ladder. Hard-coded floor: expert review can never lift an impossible-physics source above REJECT, even with confidence_score=100.

Phase VI — Autonomous research campaigns
SHIPPED · 16 tests

Group sources into named campaigns. Nine-signal portfolio ranker (final_project_score / evidence_alignment / contradiction_inverse / feasibility / novelty / commercial / safety_inverse / feedback_confidence / evidence_coverage) with configurable weights and a small domain-alignment bonus. Safe local auto-execute only — never auto-promote, never invoke external paid APIs, never auto-run dangerous experiments.

Phase VII — Knowledge graph + concept discovery
SHIPPED · 17 tests

Typed knowledge graph (13 node types × 13 edge types) per source / campaign / global. Lexicon-based theme clustering (hydrogen, combustion, battery, quantum, ctc-impossible, manufacturing, safety, …). Six opportunity types: missing_evidence, contradiction, underexplored_material, cross_domain_transfer, experiment_gap, commercialization_gap. Impossible-physics caps every opportunity priority and suppresses cross-domain transfers entirely.

Phase VIII — Research dossiers + reports
SHIPPED · 15 tests

Five report types (source_dossier / executive_summary / technical_review / opportunity_brief / campaign_dossier), two visibilities (private = full detail, public = redacted: bands instead of raw scores, no operational thresholds, no procedural detail), three export formats (Markdown / JSON / HTML). Every report carries a Reproducibility & audit section with source_id / evidence_ids / memo_id / plan_id / graph_build_id.

Phase IX — Review queue + publication packages
SHIPPED · 16 tests

Operational state machine inbox → reviewing → approved → published (plus rejected as a terminal failure path). Append-only notes + decision audit trail. Four publication package types (bct, paper, lab, investor). Internal review surface is auth-only on both reads and writes — only the curated publication packages are public. Impossible-physics items can be reviewed and rejected but never published nor packaged.
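
A minimal sketch of the state machine and the publish-edge block; the transition table follows the documented lifecycle, while the error types are assumptions:

```python
# Sketch of the Phase IX review state machine. rejected is the terminal
# failure path; impossible-physics items can never reach published.
ALLOWED = {
    "inbox": {"reviewing"},
    "reviewing": {"approved", "rejected"},
    "approved": {"published"},
    "published": set(),
    "rejected": set(),
}

def transition(state: str, target: str, impossible_physics: bool) -> str:
    if target == "published" and impossible_physics:
        raise PermissionError("impossible-physics items can never be published")
    if target not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```
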

Phase X — Autonomous research orchestration
SHIPPED · 18 tests

One endpoint, five modes: intake_only, full_private, full_review, campaign_cycle, report_only. Composes Phases I–IX into a single controlled pipeline with a per-step audit trail and a per-job artifact log. Never auto-publishes: full_review only enqueues a review item; the operator transitions it manually. Publication packages are opt-in (allow_publication_package=true) and even then enforce the impossible-physics and unsafe-feedback gates from Phase IX. Failures are captured per step and never crash the API.

// AT A GLANCE

Engineering footprint
  • 10 phases shipped end-to-end
  • 169 / 169 tests pass (146 phase + 14 api + 10 db)
  • 0 paid API dependencies
  • 0 mandatory LLM calls
  • SQLite-backed (portable to Postgres)
  • Append-only audit on every persistent write
  • ~80 HTTP endpoints under /intake/*

Hard guardrails
  • Impossible-physics → never published
  • No auto-publish, ever
  • No procedural detail in any report
  • Auth on every write surface
  • Internal review surface auth on both sides
  • Publication packages opt-in only
  • Append-only history on jobs / steps / artifacts

Public surface
  • GET /intake/sources/…
  • GET /intake/campaigns/…
  • GET /intake/reports/…
  • GET /intake/publications
  • GET /intake/autonomous/jobs
  • GET /intake/graph/…
  • All with ?include_*=true aggregations

// AUDIT TRAIL · SHIPPING COMMITS

Phase I — scientific-intake: phase 1 — text → claims/entities/scoring/hypotheses + auth
Phase II — scientific-intake: phase 2 evidence acquisition adapters
Phase III — scientific-intake: phase 3 multi-agent evidence reasoning
Phase IV — scientific-intake: phase 4 experiment planning and validation gates
Phase V — scientific-intake: phase 5 active learning feedback loop
Phase VI — scientific-intake: phase 6 autonomous research campaigns
Phase VII — scientific-intake: phase 7 knowledge graph and concept discovery
Phase VIII — scientific-intake: phase 8 research dossiers and reports
Phase IX — scientific-intake: phase 9 review workspace and publication packages
Phase X — scientific-intake: phase 10 autonomous research orchestration (commit b43d0d2)

All ten commits authored by NeoB <noreply@sostprotocol.org>. Repository: materials-engine-private · module: src/scientific_intake/.

// 09 — NEXT

What's deferred

M7 — Live AI provider wiring (Ollama / free hosted / paid judge)
DEFERRED

M5 already ships the provider interfaces. M7 will wire real HTTP calls under explicit flags (--allow-local-model, --allow-free-ai, --allow-paid) with cache, rate limits and budget logging. No automatic paid AI usage. No passwords — only API keys via env vars.

DEX Safety Layer (separate sprint)
PLANNED

A small Python layer mirroring the on-page Gold-DEX position-pricing formula, plus a public-wording guard for the DEX HTML pages and a deal-audit persistence table. It sits beside the existing browser AI Copilot — it is not a replacement for it.