MASTER DEVELOPMENT PROMPT

Triad Coherence Data Ingestion + Normalization + Multi-Format Processing System

SYSTEM GOAL: You will design and implement a full data ingestion + cleaning + normalization + preview + export system for the Coherence Index Project. The system must:

  1. Import raw data from multiple formats:

    • CSV
    • Excel (XLS/XLSX)
    • PDF (text extraction required)
    • Markdown (.md)
    • Text (.txt)
    • HTML
  2. Automatically detect:

    • Year columns
    • Values
    • Units
    • Missing data
    • Inconsistent formatting
    • Header anomalies
    • Multi-row headers
    • Broken or shifted tables
  3. Generate a data preview for the user:

    • Show the detected columns
    • Show a cleaned 5-10 row sample
    • Ask for confirmation or corrections
    • If wrong, allow the user to specify the correct mapping
    • Re-run the cleaning pipeline with the corrected mapping
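The correction loop in step 3 amounts to applying a user-supplied column mapping and re-running the pipeline. A minimal sketch (the mapping format `{detected_name: corrected_name}` is an assumption):

```python
def apply_column_mapping(records: list[dict], mapping: dict[str, str]) -> list[dict]:
    """Rename keys in each record per a user-supplied correction mapping.

    Example mapping: {"Yr": "year", "Val": "raw_value"} (hypothetical names).
    Keys absent from the mapping pass through unchanged, so a partial
    correction from the preview panel is enough to re-run cleaning.
    """
    return [{mapping.get(k, k): v for k, v in rec.items()} for rec in records]
```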
  4. Normalize all indicators into a unified structure: Every dataset must be transformed to the schema:

    year (INT)
    indicator_name (TEXT)
    indicator_domain (TEXT) -- "society", "individual", or "physics/logos"
    raw_value (FLOAT)
    normalized_value (FLOAT)
    source (TEXT)
    notes (TEXT)
    
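Normalization to the unified schema is essentially a wide-to-long melt plus a value-scaling pass. A stdlib sketch under two assumptions: four-digit keys are treated as years, and min-max scaling is used for `normalized_value` (the spec does not fix a normalization rule, so this choice is illustrative):

```python
def to_unified_schema(wide_row: dict, indicator: str, domain: str, source: str) -> list[dict]:
    """Melt a wide row keyed by year strings into long schema records."""
    records = []
    for key, value in wide_row.items():
        if key.isdigit() and len(key) == 4:  # assumption: 4-digit keys are years
            records.append({
                "year": int(key),
                "indicator_name": indicator,
                "indicator_domain": domain,
                "raw_value": float(value),
                "normalized_value": None,  # filled in by min_max_normalize below
                "source": source,
                "notes": "",
            })
    return records

def min_max_normalize(records: list[dict]) -> list[dict]:
    """Scale raw_value into [0, 1] in place (min-max; an illustrative choice)."""
    values = [r["raw_value"] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    for r in records:
        r["normalized_value"] = (r["raw_value"] - lo) / span
    return records
```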
  5. Export results to:

    • PostgreSQL (with automated table creation)
    • Obsidian-compatible Markdown summaries
    • Clean CSV files
    • Optional: JSON bundles for analytics pipelines
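The CSV and JSON exporters in step 5 can share one schema constant; a stdlib sketch below. The PostgreSQL and Obsidian-Markdown exporters would follow the same `records -> str` interface (via SQLAlchemy and a Markdown template respectively); those are omitted here since they need external dependencies:

```python
import csv
import io
import json

# Column order mirrors the unified schema from step 4.
SCHEMA = ["year", "indicator_name", "indicator_domain",
          "raw_value", "normalized_value", "source", "notes"]

def export_csv(records: list[dict]) -> str:
    """Serialize schema records to a clean CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SCHEMA)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def export_json(records: list[dict]) -> str:
    """Serialize schema records to a JSON bundle for analytics pipelines."""
    return json.dumps({"records": records}, indent=2)
```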
  6. GUI Requirement: Build a Streamlit GUI with:

    • Drag-and-drop file upload
    • Automatic format detection
    • Step-by-step cleaning wizard
    • Preview panels
    • An export menu
    • A logging window
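The step-by-step wizard can be modeled as a small state machine that Streamlit session state would drive; each step renders its own panel and advances on user confirmation. The step names below are hypothetical, not mandated by this spec:

```python
# Hypothetical wizard step order; rejection at any step loops back to detection.
WIZARD_STEPS = ["upload", "detect", "preview", "confirm", "normalize", "export"]

def next_step(current: str, approved: bool) -> str:
    """Advance to the next wizard step, or return to detection on rejection."""
    if not approved:
        return "detect"  # re-run detection with the user's corrected mapping
    i = WIZARD_STEPS.index(current)
    return WIZARD_STEPS[min(i + 1, len(WIZARD_STEPS) - 1)]  # clamp at the end
```

In the GUI, `next_step` would be called from each panel's confirm/reject buttons, with the current step kept in `st.session_state`.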
  7. Architectural Requirements:

    • Modular Python package structure
    • Dedicated folder for format-specific parsers
    • Dedicated folder for cleaning/normalization functions
    • A “rules engine” that encodes the Triad domains
    • A configuration file for expanding indicator mappings later
    • A “human-in-the-loop” correction flow for invalid auto-detections
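The rules engine and the expandable configuration file can be sketched together: a mapping from Triad domains to indicator keywords that would live in config (e.g. a YAML file loaded at startup). All domain keyword lists below are hypothetical placeholders:

```python
# Hypothetical rules config; in the real system this dict would be loaded
# from a configuration file so indicator mappings can be expanded later.
TRIAD_RULES = {
    "society": ["gdp", "literacy", "trust"],
    "individual": ["wellbeing", "life_expectancy"],
    "physics/logos": ["energy_use", "entropy"],
}

def classify_indicator(name: str, rules: dict[str, list[str]] = TRIAD_RULES) -> str:
    """Return the Triad domain whose keyword list matches the indicator name."""
    lowered = name.lower()
    for domain, keywords in rules.items():
        if any(k in lowered for k in keywords):
            return domain
    return "unclassified"  # routed to the human-in-the-loop correction step
```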

WHAT I WANT FROM YOU FIRST:

  1. A complete high-level architecture
  2. File/folder structure
  3. Class/module design
  4. The Streamlit GUI layout plan
  5. A test plan using sample dirty CSV/PDF/MD files
  6. After approval, generate the full code base in steps.

ADDITIONAL RULES:

  • Code must be Python 3.10+ compatible.
  • Use only widely supported libraries (pandas, numpy, pdfplumber, python-docx if needed, markdown2, BeautifulSoup4, SQLAlchemy, Streamlit).
  • For PDFs: extract tables if possible; fall back to text-block parsing.
  • For Markdown: identify tables if present; fall back to YAML front matter + body parsing.
  • For HTML: extract <table>s; ignore styling.
  • All errors must be caught and displayed cleanly in the Streamlit GUI.

BEGIN

“Begin by designing the full system architecture. Do not write code yet. Produce the blueprint first.”

Canonical Hub: CANONICAL_INDEX