MASTER DEVELOPMENT PROMPT

Triad Coherence Data Ingestion + Normalization + Multi-Format Processing System

SYSTEM GOAL: You will design and implement a full data ingestion + cleaning + normalization + preview + export system for the Coherence Index Project. The system must:

  1. Import raw data from multiple formats:

    • CSV
    • Excel (XLS/XLSX)
    • PDF (text extraction required)
    • Markdown (.md)
    • Text (.txt)
    • HTML
  2. Automatically detect:

    • Year columns
    • Values
    • Units
    • Missing data
    • Inconsistent formatting
    • Header anomalies
    • Multi-row headers
    • Broken or shifted tables
  3. Generate a data preview for the user:

    • Show the detected columns
    • Show a cleaned 5-10 row sample
    • Ask for confirmation or corrections
    • If wrong, allow the user to specify the correct mapping
    • Re-run the cleaning pipeline with the corrected mapping
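The correction loop in step 3 amounts to applying a user-supplied column mapping and re-running the pipeline. A minimal sketch (the mapping format `{detected_name: corrected_name}` is an assumption):

```python
def apply_column_mapping(records: list[dict], mapping: dict[str, str]) -> list[dict]:
    """Rename keys in each record per a user-supplied correction mapping.

    Example mapping: {"Yr": "year", "Val": "raw_value"} (hypothetical names).
    Keys absent from the mapping pass through unchanged, so a partial
    correction from the preview panel is enough to re-run cleaning.
    """
    return [{mapping.get(k, k): v for k, v in rec.items()} for rec in records]
```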
  4. Normalize all indicators into a unified structure: Every dataset must be transformed to the schema:

    year (INT)
    indicator_name (TEXT)
    indicator_domain (TEXT) -- "society", "individual", or "physics/logos"
    raw_value (FLOAT)
    normalized_value (FLOAT)
    source (TEXT)
    notes (TEXT)
    
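Normalization to the unified schema is essentially a wide-to-long melt plus a value-scaling pass. A stdlib sketch under two assumptions: four-digit keys are treated as years, and min-max scaling is used for `normalized_value` (the spec does not fix a normalization rule, so this choice is illustrative):

```python
def to_unified_schema(wide_row: dict, indicator: str, domain: str, source: str) -> list[dict]:
    """Melt a wide row keyed by year strings into long schema records."""
    records = []
    for key, value in wide_row.items():
        if key.isdigit() and len(key) == 4:  # assumption: 4-digit keys are years
            records.append({
                "year": int(key),
                "indicator_name": indicator,
                "indicator_domain": domain,
                "raw_value": float(value),
                "normalized_value": None,  # filled in by min_max_normalize below
                "source": source,
                "notes": "",
            })
    return records

def min_max_normalize(records: list[dict]) -> list[dict]:
    """Scale raw_value into [0, 1] in place (min-max; an illustrative choice)."""
    values = [r["raw_value"] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    for r in records:
        r["normalized_value"] = (r["raw_value"] - lo) / span
    return records
```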
  5. Export results to:

    • PostgreSQL (with automated table creation)
    • Obsidian-compatible Markdown summaries
    • Clean CSV files
    • Optional: JSON bundles for analytics pipelines
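The CSV and JSON exporters in step 5 can share one schema constant; a stdlib sketch below. The PostgreSQL and Obsidian-Markdown exporters would follow the same `records -> str` interface (via SQLAlchemy and a Markdown template respectively); those are omitted here since they need external dependencies:

```python
import csv
import io
import json

# Column order mirrors the unified schema from step 4.
SCHEMA = ["year", "indicator_name", "indicator_domain",
          "raw_value", "normalized_value", "source", "notes"]

def export_csv(records: list[dict]) -> str:
    """Serialize schema records to a clean CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SCHEMA)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def export_json(records: list[dict]) -> str:
    """Serialize schema records to a JSON bundle for analytics pipelines."""
    return json.dumps({"records": records}, indent=2)
```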
  6. GUI Requirement: Build a Streamlit GUI with:

    • Drag-and-drop file upload
    • Automatic format detection
    • Step-by-step cleaning wizard
    • Preview panels
    • An export menu
    • A logging window
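The step-by-step wizard can be modeled as a small state machine that Streamlit session state would drive; each step renders its own panel and advances on user confirmation. The step names below are hypothetical, not mandated by this spec:

```python
# Hypothetical wizard step order; rejection at any step loops back to detection.
WIZARD_STEPS = ["upload", "detect", "preview", "confirm", "normalize", "export"]

def next_step(current: str, approved: bool) -> str:
    """Advance to the next wizard step, or return to detection on rejection."""
    if not approved:
        return "detect"  # re-run detection with the user's corrected mapping
    i = WIZARD_STEPS.index(current)
    return WIZARD_STEPS[min(i + 1, len(WIZARD_STEPS) - 1)]  # clamp at the end
```

In the GUI, `next_step` would be called from each panel's confirm/reject buttons, with the current step kept in `st.session_state`.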
  7. Architectural Requirements:

    • Modular Python package structure
    • Dedicated folder for format-specific parsers
    • Dedicated folder for cleaning/normalization functions
    • A “rules engine” that encodes the Triad domains
    • A configuration file for expanding indicator mappings later
    • A “human-in-the-loop” correction flow for invalid auto-detections
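The rules engine and the expandable configuration file can be sketched together: a mapping from Triad domains to indicator keywords that would live in config (e.g. a YAML file loaded at startup). All domain keyword lists below are hypothetical placeholders:

```python
# Hypothetical rules config; in the real system this dict would be loaded
# from a configuration file so indicator mappings can be expanded later.
TRIAD_RULES = {
    "society": ["gdp", "literacy", "trust"],
    "individual": ["wellbeing", "life_expectancy"],
    "physics/logos": ["energy_use", "entropy"],
}

def classify_indicator(name: str, rules: dict[str, list[str]] = TRIAD_RULES) -> str:
    """Return the Triad domain whose keyword list matches the indicator name."""
    lowered = name.lower()
    for domain, keywords in rules.items():
        if any(k in lowered for k in keywords):
            return domain
    return "unclassified"  # routed to the human-in-the-loop correction step
```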

WHAT I WANT FROM YOU FIRST:

  1. A complete high-level architecture
  2. File/folder structure
  3. Class/module design
  4. The Streamlit GUI layout plan
  5. A test plan using sample dirty CSV/PDF/MD files
  6. After approval, generate the full code base in steps.

ADDITIONAL RULES:

  • Code must be Python 3.10+ compatible.
  • Use only widely supported libraries (pandas, numpy, pdfplumber, python-docx if needed, markdown2, BeautifulSoup4, SQLAlchemy, Streamlit).
  • For PDFs: extract tables if possible; fall back to text-block parsing.
  • For Markdown: identify tables if present; fall back to YAML front matter + body parsing.
  • For HTML: extract <table>s; ignore styling.
  • All errors must be caught and displayed cleanly in the Streamlit GUI.

BEGIN

“Begin by designing the full system architecture. Do not write code yet. Produce the blueprint first.”

Canonical Hub: CANONICAL_INDEX