
Unstructured content, the bane of RAG systems. Or is it?

When I started my Prompt Engineering for IAs course generation experiment, it was meant to be a simple exercise to understand how context windows work. Somewhere along the way, though, I found myself staring at 20,000+ lines of unstructured text across 11 modules.

In its current form, the course content is not AI-friendly, even though, ironically, it is AI-generated. The reams of content are about as much fun as a textbook, and one without diagrams at that.

I noticed structural patterns, conceptual relationships, and opportunities to make the content more manageable. And DITA is what I know. Modular DITA topics that could serve multiple use cases (learning paths, reference lookup, AI training data, video scripts, PDFs...) seemed like the answer.

After a series of convoluted and partially successful attempts to wrangle the free-flowing course content into DITA chunks, I had to pause and ask: "Is DITA actually the right tool here?" My immediate goal was to have AI systems parse the content and thereby help learners navigate it more quickly. The current MDX format already makes it human-readable; I only need it to be machine-readable too.

Pivoting from content structure to knowledge extraction

The question I now asked was: "How do I extract this content so that AI can understand and use it effectively?" I figured an ontology would let the content be queried by AI assistants, retrieved by RAG systems, and surfaced in conversational interfaces.

An ontology defines:

  1. Entities - What types of knowledge objects exist
  2. Relationships - How they connect to each other
  3. Attributes - Metadata for filtering and retrieval
  4. Retrieval contexts - What questions each entity answers
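
To make those four pieces concrete, here is a minimal Python sketch of what one knowledge entity could look like once it is captured this way (the class and field names are my own illustration, not a fixed schema):

    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeEntity:
        id: str        # stable identifier, e.g. "validation-pyramid"
        entity: str    # entity type: Framework, PromptPattern, Workflow, ...
        title: str
        # Relationships to other entities, keyed by relationship type
        relationships: dict[str, list[str]] = field(default_factory=dict)
        # Attributes: metadata for filtering and retrieval (module, difficulty, ...)
        attributes: dict[str, str] = field(default_factory=dict)
        # Retrieval contexts: the questions this entity answers
        retrieval_contexts: list[str] = field(default_factory=list)

    validation_pyramid = KnowledgeEntity(
        id="validation-pyramid",
        entity="Framework",
        title="Validation Pyramid",
        relationships={"CONTAINS": ["level-1-quick-checks", "level-2-structural"]},
        retrieval_contexts=["How do I validate AI-generated content?"],
    )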

Instead of asking "Is this a concept or a task?" (DITA thinking), I asked:

  • "What knowledge does this represent?"
  • "What other knowledge does it connect to?"
  • "When would an AI need to surface this?"

This shifted the focus entirely:

Old Focus (DITA)                        | New Focus (Ontology)
Content structure                       | Knowledge structure
Topic types (concept, task, reference)  | Knowledge entities (framework, principle, pattern)
Reuse via embedding                     | Retrieval via semantic search
Consumer: publishing toolchain          | Consumer: AI systems
Goal: maintain once, publish everywhere | Goal: AI understands meaning, not just text

Audit agent design

I created an ontology audit agent in Copilot with:

  • Entity identification rules (sketched in Python after this list)

    IF content teaches a reusable mental model → Framework or Principle
    IF content is an AI prompt meant to be copied → PromptPattern
    IF content lists steps to follow → Workflow
    IF content lists items to verify → Checklist
    IF content helps choose between options → DecisionMatrix
    IF content shows how something works → Example
    IF content analyzes a real system → CaseStudy
    IF content defines a term → Term
    IF content warns against something → Warning
  • Relationship mapping process

    1. Identify all entities in module
    2. Map CONTAINS relationships (parent-child)
    3. Map IMPLEMENTS relationships (pattern → principle)
    4. Map PREREQUISITES (learning sequence)
    5. Identify cross-module REFERENCES
  • Output structure

    Each audit produces:

    • Entity inventory by type
    • Relationship map (visual hierarchy)
    • Prerequisite chain
    • Cross-module reference table
    • Retrieval contexts for each entity
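
To give a flavour of the first bullet, here is how the entity identification rules could be collapsed into one naive classification function (a crude keyword sketch for illustration; the actual agent applies the rules with far more judgment):

    def classify_entity(text: str) -> str:
        """Map a content chunk to a knowledge entity type using crude keyword heuristics."""
        t = text.lower()
        if "prompt" in t and "copy" in t:
            return "PromptPattern"      # an AI prompt meant to be copied
        if any(kw in t for kw in ("step 1", "first,", "next,", "workflow")):
            return "Workflow"           # steps to follow
        if any(kw in t for kw in ("verify", "check that", "checklist")):
            return "Checklist"          # items to verify
        if any(kw in t for kw in ("choose between", "criteria", "trade-off")):
            return "DecisionMatrix"     # helps choose between options
        if any(kw in t for kw in ("for example", "example:")):
            return "Example"            # shows how something works
        if any(kw in t for kw in ("avoid", "anti-pattern", "warning")):
            return "Warning"            # warns against something
        if " is defined as " in t or " refers to " in t:
            return "Term"               # defines a term
        return "Framework"              # default: a reusable mental model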

New audit (knowledge-focused)

    - id: validation-pyramid
      entity: Framework
      title: "Validation Pyramid"
      lines: 1065-1350

      contains:
        - level-1-quick-checks
        - level-2-structural
        - level-3-semantic
        - level-4-user-validation

      validates: any-ai-output

      retrieval_contexts:
        - "How do I validate AI-generated content?"
        - "What checks should I run before using AI output?"

      prerequisites:
        - ai-human-partnership-basics

      cross_module: [1-2, 2-1, 3-1]

This captured the knowledge structure, which is what AI systems need.
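
Because the audit output is plain YAML, it can be loaded and indexed with a few lines of Python. A minimal sketch, assuming each module's audit is saved to a file (the filename here is made up):

    import yaml  # PyYAML

    def load_audit(path: str) -> dict:
        """Load one audit file and index its knowledge entities by id."""
        with open(path, encoding="utf-8") as f:
            entries = yaml.safe_load(f)
        return {entry["id"]: entry for entry in entries}

    entities = load_audit("module-3-audit.yaml")
    print(entities["validation-pyramid"]["retrieval_contexts"])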

The ontology layer of JSON files

The entity taxonomy

The course content analysis revealed these knowledge entity types:

Entity type    | Description                       | Examples
Framework      | Multi-part conceptual model       | Validation Pyramid, IA Decision Framework
Principle      | Core idea or guideline            | AI-First zone, Progressive disclosure
PromptPattern  | Reusable AI prompt with structure | Taxonomy generation, Refinement pattern
Workflow       | Multi-step process                | 4-round taxonomy refinement
Checklist      | Actionable verification items     | Quick checks, Structural validation
DecisionMatrix | Criteria for choosing options     | Organizational approach matrix
Example        | Illustrative scenario             | Card sorting analysis
CaseStudy      | Real-world application            | Kubernetes docs structure
Term           | Definition or glossary item       | Polyhierarchy, Faceted classification
Warning        | Critical caution or anti-pattern  | Strategic red flags
Tool           | Software or method                | Tree testing, Card sorting
Role           | Persona or user type              | DevOps engineer, Content strategist

This is fundamentally different from content type classification. It's about what knowledge exists, not how content is formatted.

Relationships

Relationships spell out how knowledge connects:

Relationship   | What it means         | Example
CONTAINS       | Parent includes child | Framework CONTAINS Principle
IMPLEMENTS     | Applies a concept     | PromptPattern IMPLEMENTS Principle
VALIDATES      | Checks correctness    | Checklist VALIDATES Workflow
DEMONSTRATES   | Shows how it works    | Example DEMONSTRATES Principle
PRECEDES       | Must come before      | Module 1 PRECEDES Module 2
CONFLICTS_WITH | Trade-off or tension  | AI-First CONFLICTS_WITH Human-First

When an AI retrieves content, these relationships enable context:

  • "This is part of a larger framework. Here's the context"
  • "Before understanding this, you need to know..."
  • "Here's an example that demonstrates this principle"
  • "This conflicts with what we discussed earlier. Here's how to think about the trade-off"

Retrieval contexts: QnA

Each entity gets mapped to the questions it answers:

    - id: validation-pyramid
      entity: Framework
      title: "Validation Pyramid"
      retrieval_contexts:
        - "How do I validate AI-generated content?"
        - "What checks should I run on a taxonomy?"
        - "How thorough should my AI review be?"

This inverts the traditional content-first model. Instead of "here's what we have, hope you can find it," it's "here's what you might ask, and here's what answers it."
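
Mechanically, that inversion is just an index from questions to entity ids. A naive sketch using fuzzy string matching (a real system would use embeddings, but the shape is identical):

    from difflib import SequenceMatcher

    def build_question_index(entities: dict) -> list[tuple[str, str]]:
        """Flatten every retrieval_contexts entry into (question, entity_id) pairs."""
        return [(q, eid)
                for eid, e in entities.items()
                for q in e.get("retrieval_contexts", [])]

    def best_match(user_question: str, index: list[tuple[str, str]]) -> str:
        """Return the entity id whose mapped question is closest to the user's."""
        similarity = lambda q: SequenceMatcher(None, user_question.lower(), q.lower()).ratio()
        _question, entity_id = max(index, key=lambda pair: similarity(pair[0]))
        return entity_id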

What this layer enables

1. Smarter RAG retrieval

Without ontology: "How do I validate a taxonomy?" → keyword search returns any chunk mentioning "validate" and "taxonomy"

With ontology:

  • Retrieves the Validation Pyramid framework
  • Includes child entities (Level 1-4 checklists)
  • Notes that this validates AI output
  • Excludes examples unless asked
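
That expansion step amounts to following the matched entity's CONTAINS edges and filtering by entity type. A sketch over the same entity dicts (the Example filter is my own illustration of "excludes examples unless asked"):

    def expand_for_retrieval(entity_id: str, entities: dict,
                             include_examples: bool = False) -> list[dict]:
        """Return the matched entity plus its contained children for the RAG context."""
        root = entities[entity_id]
        bundle = [root]
        for child_id in root.get("contains", []):
            child = entities.get(child_id)
            if child is None:
                continue  # child not audited yet
            if child.get("entity") == "Example" and not include_examples:
                continue  # keep examples out unless the user asked for one
            bundle.append(child)
        return bundle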

2. Prerequisite-aware responses

AI can:

  • Check whether the learner has understood the prerequisite concepts
  • If not, surface the foundational content first
  • Build explanations that connect to prior knowledge
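
With the prerequisites field in place, that check is a short graph walk. A sketch, where `seen` stands in for whatever the assistant knows the learner has already covered:

    def missing_prerequisites(entity_id: str, entities: dict, seen: set[str]) -> list[str]:
        """Walk the prerequisite chain and return anything the learner hasn't covered yet."""
        missing, stack = [], [entity_id]
        while stack:
            current = entities.get(stack.pop(), {})
            for prereq_id in current.get("prerequisites", []):
                if prereq_id not in seen and prereq_id not in missing:
                    missing.append(prereq_id)
                    stack.append(prereq_id)  # prerequisites can have prerequisites
        return missing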

3. Relationship-aware explanations

Instead of isolated chunks:

  • "The Validation Pyramid is a framework containing 4 levels..."
  • "This relates to the AI-First/Human-First decision you learned earlier..."
  • "Here's a prompt pattern that implements this principle..."

4. Conflict/trade-off awareness

When principles conflict, AI can explain:

  • "AI-First is appropriate here, but note the Human-First concerns about..."
  • "This is a trade-off between efficiency and oversight..."

What this project demonstrates

Knowledge architecture

Evidence                      | Competency
12-type entity taxonomy       | Ontology design
Relationship type definitions | Knowledge modeling
Retrieval context mapping     | AI-ready content design
Prerequisite chains           | Learning sequence design

Process design

Element                          | Value
Audit agent with methodology     | Repeatable, documentable process
Entity identification heuristics | Systematic classification
Relationship mapping protocol    | Consistent knowledge capture
Output template                  | Structured, usable artifacts

What I've learned

About content architecture

  • Knowledge structure ≠ content structure
  • Relationships are as important as entities
  • User questions, not author convenience, drive good content organization
  • Human-readable content with explicit structure can become AI-ready content

About AI collaboration

  • AI is excellent at identifying patterns and generating structure
  • Strategic decisions (ontology design) require careful human judgment
  • Documenting methodology makes work reproducible

About ontology implementation paths

  • Enhanced frontmatter

    The lightest implementation is to add structured metadata to existing MDX:

    ---
    entity: Framework
    id: validation-pyramid
    contains: [level-1-checks, level-2-structural, level-3-semantic, level-4-user]
    validates: any-ai-output
    retrieval_contexts:
    - "How do I validate AI output?"
    - "What checks should I run?"
    prerequisites: [ai-human-partnership-basics]
    ---

    ### The Validation Pyramid
    ...

    This can be parsed by AI systems or CI/CD tools without changing the authoring workflow (a parsing sketch follows this list).

  • Structured knowledge base

    Export ontology as structured data:

      ontology/
      ├── entities/
      │   ├── frameworks.yaml
      │   ├── principles.yaml
      │   ├── prompt-patterns.yaml
      │   └── checklists.yaml
      ├── relationships.yaml
      ├── prerequisites.yaml
      └── retrieval-index.yaml
  • Knowledge graph

    For complex querying and AI integration:

      CREATE (vp:Framework {id: 'validation-pyramid'})
      CREATE (l1:Checklist {id: 'level-1-checks'})
      CREATE (vp)-[:CONTAINS]->(l1)
      CREATE (l1)-[:VALIDATES]->(:Output {type: 'ai-generated'})
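
To show how light the first path (enhanced frontmatter) really is, here is a sketch that pulls the metadata block out of an MDX file with nothing but PyYAML (the filename is illustrative):

    import yaml  # PyYAML

    def read_frontmatter(path: str) -> dict:
        """Return the YAML frontmatter of an MDX file as a dict."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        if not text.startswith("---"):
            return {}
        # The frontmatter sits between the first two '---' delimiters
        _, frontmatter, _body = text.split("---", 2)
        return yaml.safe_load(frontmatter)

    meta = read_frontmatter("validation-pyramid.mdx")
    print(meta["entity"], meta["retrieval_contexts"])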

Summary

What started as a prompt engineering exercise became a knowledge architecture project. By shifting focus from content structure (DITA) to knowledge structure (ontology), I could:

  • Design an entity taxonomy for learning content
  • Create relationship types that capture how knowledge connects
  • Build retrieval contexts that map user questions to content
  • Develop an audit methodology focused on meaning, not format
  • Position the content for AI consumption

Crucial takeaway: This experiment does NOT prove that unstructured content is A-okay!
None of the hoops that Gemini and I jumped through (multiple ontology audits, building JSON files with detailed contexts, or even scripts) would have been necessary if I had structured and modular content to begin with.
Right now, any update to the source files also means updating the ontology files and the retrieval index, which is not a process that would hold up in an enterprise setting.

A distant goal of this experiment is to convert the 11 modules into structured data. How to do that is another project in itself, but I already have a draft plan in place:

START: MDX Source File

STEP 1: INTELLIGENT CHUNKING (Split by H2)

STEP 2: TWO-PASS TRANSFORMATION (The "Expert" Process)

├─ INPUT: Single MDX Chunk + [Context: Diataxis Defs, ID Rules]
│
├─ PASS 1: ANALYSIS AGENT (The "Architect")
│   │  "Analyze intent. Don't write content yet."
│   │
│   └─ OUTPUT: Metadata Object
│       ├─ Intent: "Teach a concept" vs "Guide a procedure"
│       ├─ Diataxis Mode: "Explanation"
│       └─ Prereqs: "Refers to [Previous Topic]"
│
├─ PASS 2: TRANSFORMATION AGENT (The "Writer")
│   │  "Convert text to Schema using Pass 1 metadata."
│   │
│   ├─ CONSTRAINT: Allow structural reformatting (Paragraph -> Steps)
│   ├─ CONSTRAINT: Ban fact fabrication
│   │
│   └─ OUTPUT: Structured JSON
│       ├─ type: "Task"
│       ├─ body: {
│       │     "context": "...",
│       │     "steps": ["1. Click...", "2. Type..."],
│       │     "result": "..."
│       │   }
│       └─ artifacts: [ { "type": "example", "content": "..." } ]

STEP 3: SEMANTIC VALIDATION

└─ Check: Does JSON "steps" count match source "action verbs"?
    ├─ YES ──► Generate .dita file
    └─ NO  ──► Flag for Review
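
The chunking and validation ends of that pipeline are easy to prototype even before the agent passes exist. A rough sketch of Step 1 and Step 3 (the action-verb list and the tolerance are assumptions of mine; the two agent passes would be LLM calls):

    import re

    def chunk_by_h2(mdx_text: str) -> list[str]:
        """Step 1: split an MDX source file into chunks at each H2 heading."""
        parts = re.split(r"(?m)^## ", mdx_text)
        return ["## " + p for p in parts[1:]]  # parts[0] is any preamble before the first H2

    ACTION_VERBS = ("click", "type", "select", "run", "open", "enter")

    def semantic_check(source_chunk: str, structured: dict) -> bool:
        """Step 3: does the JSON 'steps' count roughly match the source's action verbs?"""
        verb_count = sum(source_chunk.lower().count(v) for v in ACTION_VERBS)
        step_count = len(structured.get("body", {}).get("steps", []))
        return abs(verb_count - step_count) <= 1  # within tolerance → generate .dita, else flag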

Stay tuned!