
Unstructured content, the bane of RAG systems. Or is it?

When I started my Prompt Engineering for IAs course generation experiment, it was meant to be a simple exercise to understand how context windows work. Somewhere along the way, though, I found myself staring at 20,000+ lines of unstructured text across 11 modules.

In its current form, the course content is not AI-friendly, even though, ironically, it is AI-generated. The reams of content are about as much fun as a textbook, and one without diagrams at that.

I noticed structural patterns, conceptual relationships, and opportunities to make the content more manageable. And DITA is what I know. Modular DITA topics that could serve multiple use cases (learning paths, reference lookup, AI training data, video scripts, PDFs...) seemed like the answer.

After a series of convoluted and partially successful attempts to wrangle the free-flowing course content into DITA chunks, I had to pause and ask: "Is DITA actually the right tool here?" My immediate goal was to have AI systems parse the content and thereby help learners navigate it more quickly. The current MDX format already makes it human-readable; I only need it to be machine-readable too.

Pivoting from content structure to knowledge extraction

The question I now asked was: "How do I extract this content so that AI can understand and use it effectively?" I figured an ontology would let the content be queried by AI assistants, retrieved by RAG systems, and surfaced in conversational interfaces.

An ontology defines:

  1. Entities - What types of knowledge objects exist
  2. Relationships - How they connect to each other
  3. Attributes - Metadata for filtering and retrieval
  4. Retrieval contexts - What questions each entity answers
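
To make those four pieces concrete, here is a minimal Python sketch of what one knowledge entity could look like once it is captured this way (the class and field names are my own illustration, not a fixed schema):

    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeEntity:
        id: str        # stable identifier, e.g. "validation-pyramid"
        entity: str    # entity type: Framework, PromptPattern, Workflow, ...
        title: str
        # Relationships to other entities, keyed by relationship type
        relationships: dict[str, list[str]] = field(default_factory=dict)
        # Attributes: metadata for filtering and retrieval (module, difficulty, ...)
        attributes: dict[str, str] = field(default_factory=dict)
        # Retrieval contexts: the questions this entity answers
        retrieval_contexts: list[str] = field(default_factory=list)

    validation_pyramid = KnowledgeEntity(
        id="validation-pyramid",
        entity="Framework",
        title="Validation Pyramid",
        relationships={"CONTAINS": ["level-1-quick-checks", "level-2-structural"]},
        retrieval_contexts=["How do I validate AI-generated content?"],
    )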

Instead of asking "Is this a concept or a task?" (DITA thinking), I asked:

  • "What knowledge does this represent?"
  • "What other knowledge does it connect to?"
  • "When would an AI need to surface this?"

This shifted the focus entirely:

Old Focus (DITA)                        | New Focus (Ontology)
Content structure                       | Knowledge structure
Topic types (concept, task, reference)  | Knowledge entities (framework, principle, pattern)
Reuse via embedding                     | Retrieval via semantic search
Consumer: publishing toolchain          | Consumer: AI systems
Goal: maintain once, publish everywhere | Goal: AI understands meaning, not just text

Audit agent design

I created an ontology audit agent in Copilot with:

  • Entity identification rules (sketched in Python after this list)

    IF content teaches a reusable mental model → Framework or Principle
    IF content is an AI prompt meant to be copied → PromptPattern
    IF content lists steps to follow → Workflow
    IF content lists items to verify → Checklist
    IF content helps choose between options → DecisionMatrix
    IF content shows how something works → Example
    IF content analyzes a real system → CaseStudy
    IF content defines a term → Term
    IF content warns against something → Warning
  • Relationship mapping process

    1. Identify all entities in module
    2. Map CONTAINS relationships (parent-child)
    3. Map IMPLEMENTS relationships (pattern → principle)
    4. Map PREREQUISITES (learning sequence)
    5. Identify cross-module REFERENCES
  • Output structure

    Each audit produces:

    • Entity inventory by type
    • Relationship map (visual hierarchy)
    • Prerequisite chain
    • Cross-module reference table
    • Retrieval contexts for each entity
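
To give a flavour of the first bullet, here is how the entity identification rules could be collapsed into one naive classification function (a crude keyword sketch for illustration; the actual agent applies the rules with far more judgment):

    def classify_entity(text: str) -> str:
        """Map a content chunk to a knowledge entity type using crude keyword heuristics."""
        t = text.lower()
        if "prompt" in t and "copy" in t:
            return "PromptPattern"      # an AI prompt meant to be copied
        if any(kw in t for kw in ("step 1", "first,", "next,", "workflow")):
            return "Workflow"           # steps to follow
        if any(kw in t for kw in ("verify", "check that", "checklist")):
            return "Checklist"          # items to verify
        if any(kw in t for kw in ("choose between", "criteria", "trade-off")):
            return "DecisionMatrix"     # helps choose between options
        if any(kw in t for kw in ("for example", "example:")):
            return "Example"            # shows how something works
        if any(kw in t for kw in ("avoid", "anti-pattern", "warning")):
            return "Warning"            # warns against something
        if " is defined as " in t or " refers to " in t:
            return "Term"               # defines a term
        return "Framework"              # default: a reusable mental model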

New audit (knowledge-focused)

    - id: validation-pyramid
      entity: Framework
      title: "Validation Pyramid"
      lines: 1065-1350

      contains:
        - level-1-quick-checks
        - level-2-structural
        - level-3-semantic
        - level-4-user-validation

      validates: any-ai-output

      retrieval_contexts:
        - "How do I validate AI-generated content?"
        - "What checks should I run before using AI output?"

      prerequisites:
        - ai-human-partnership-basics

      cross_module: [1-2, 2-1, 3-1]

This captured the knowledge structure, which is what AI systems need.
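
Because the audit output is plain YAML, it can be loaded and indexed with a few lines of Python. A minimal sketch, assuming each module's audit is saved to a file (the filename here is made up):

    import yaml  # PyYAML

    def load_audit(path: str) -> dict:
        """Load one audit file and index its knowledge entities by id."""
        with open(path, encoding="utf-8") as f:
            entries = yaml.safe_load(f)
        return {entry["id"]: entry for entry in entries}

    entities = load_audit("module-3-audit.yaml")
    print(entities["validation-pyramid"]["retrieval_contexts"])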

The ontology layer of JSON files

The entity taxonomy

The course content analysis revealed these knowledge entity types:

Entity type    | Description                       | Examples
Framework      | Multi-part conceptual model       | Validation Pyramid, IA Decision Framework
Principle      | Core idea or guideline            | AI-First zone, Progressive disclosure
PromptPattern  | Reusable AI prompt with structure | Taxonomy generation, Refinement pattern
Workflow       | Multi-step process                | 4-round taxonomy refinement
Checklist      | Actionable verification items     | Quick checks, Structural validation
DecisionMatrix | Criteria for choosing options     | Organizational approach matrix
Example        | Illustrative scenario             | Card sorting analysis
CaseStudy      | Real-world application            | Kubernetes docs structure
Term           | Definition or glossary item       | Polyhierarchy, Faceted classification
Warning        | Critical caution or anti-pattern  | Strategic red flags
Tool           | Software or method                | Tree testing, Card sorting
Role           | Persona or user type              | DevOps engineer, Content strategist

This is fundamentally different from content type classification. It's about what knowledge exists, not how content is formatted.

Relationships

Relationships spell out how knowledge connects:

Relationship   | What it means         | Example
CONTAINS       | Parent includes child | Framework CONTAINS Principle
IMPLEMENTS     | Applies a concept     | PromptPattern IMPLEMENTS Principle
VALIDATES      | Checks correctness    | Checklist VALIDATES Workflow
DEMONSTRATES   | Shows how it works    | Example DEMONSTRATES Principle
PRECEDES       | Must come before      | Module 1 PRECEDES Module 2
CONFLICTS_WITH | Trade-off or tension  | AI-First CONFLICTS_WITH Human-First

When an AI retrieves content, these relationships enable context:

  • "This is part of a larger framework. Here's the context"
  • "Before understanding this, you need to know..."
  • "Here's an example that demonstrates this principle"
  • "This conflicts with what we discussed earlier. Here's how to think about the trade-off"

Retrieval contexts: QnA

Each entity gets mapped to the questions it answers:

    - id: validation-pyramid
      entity: Framework
      title: "Validation Pyramid"
      retrieval_contexts:
        - "How do I validate AI-generated content?"
        - "What checks should I run on a taxonomy?"
        - "How thorough should my AI review be?"

This inverts the traditional content-first model. Instead of "here's what we have, hope you can find it," it's "here's what you might ask, and here's what answers it."
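
Mechanically, that inversion is just an index from questions to entity ids. A naive sketch using fuzzy string matching (a real system would use embeddings, but the shape is identical):

    from difflib import SequenceMatcher

    def build_question_index(entities: dict) -> list[tuple[str, str]]:
        """Flatten every retrieval_contexts entry into (question, entity_id) pairs."""
        return [(q, eid)
                for eid, e in entities.items()
                for q in e.get("retrieval_contexts", [])]

    def best_match(user_question: str, index: list[tuple[str, str]]) -> str:
        """Return the entity id whose mapped question is closest to the user's."""
        similarity = lambda q: SequenceMatcher(None, user_question.lower(), q.lower()).ratio()
        _question, entity_id = max(index, key=lambda pair: similarity(pair[0]))
        return entity_id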

What this layer enables

1. Smarter RAG retrieval

Without ontology: "How do I validate a taxonomy?" → keyword search returns any chunk mentioning "validate" and "taxonomy"

With ontology:

  • Retrieves the Validation Pyramid framework
  • Includes child entities (Level 1-4 checklists)
  • Notes that this validates AI output
  • Excludes examples unless asked
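
That expansion step amounts to following the matched entity's CONTAINS edges and filtering by entity type. A sketch over the same entity dicts (the Example filter is my own illustration of "excludes examples unless asked"):

    def expand_for_retrieval(entity_id: str, entities: dict,
                             include_examples: bool = False) -> list[dict]:
        """Return the matched entity plus its contained children for the RAG context."""
        root = entities[entity_id]
        bundle = [root]
        for child_id in root.get("contains", []):
            child = entities.get(child_id)
            if child is None:
                continue  # child not audited yet
            if child.get("entity") == "Example" and not include_examples:
                continue  # keep examples out unless the user asked for one
            bundle.append(child)
        return bundle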

2. Prerequisite-aware responses

AI can:

  • Check whether the learner has understood the prerequisite concepts
  • If not, surface the foundational content first
  • Build explanations that connect to prior knowledge
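
With the prerequisites field in place, that check is a short graph walk. A sketch, where `seen` stands in for whatever the assistant knows the learner has already covered:

    def missing_prerequisites(entity_id: str, entities: dict, seen: set[str]) -> list[str]:
        """Walk the prerequisite chain and return anything the learner hasn't covered yet."""
        missing, stack = [], [entity_id]
        while stack:
            current = entities.get(stack.pop(), {})
            for prereq_id in current.get("prerequisites", []):
                if prereq_id not in seen and prereq_id not in missing:
                    missing.append(prereq_id)
                    stack.append(prereq_id)  # prerequisites can have prerequisites
        return missing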

3. Relationship-aware explanations

Instead of isolated chunks:

  • "The Validation Pyramid is a framework containing 4 levels..."
  • "This relates to the AI-First/Human-First decision you learned earlier..."
  • "Here's a prompt pattern that implements this principle..."

4. Conflict/trade-off awareness

When principles conflict, AI can explain:

  • "AI-First is appropriate here, but note the Human-First concerns about..."
  • "This is a trade-off between efficiency and oversight..."

What this project demonstrates

Knowledge architecture

Evidence                      | Competency
12-type entity taxonomy       | Ontology design
Relationship type definitions | Knowledge modeling
Retrieval context mapping     | AI-ready content design
Prerequisite chains           | Learning sequence design

Process design

Element                          | Value
Audit agent with methodology     | Repeatable, documentable process
Entity identification heuristics | Systematic classification
Relationship mapping protocol    | Consistent knowledge capture
Output template                  | Structured, usable artifacts

What I've learned

About content architecture

  • Knowledge structure ≠ content structure
  • Relationships are as important as entities
  • User questions, not author convenience, drive good content organization
  • Human-readable content with explicit structure can become AI-ready content

About AI collaboration

  • AI is excellent at identifying patterns and generating structure
  • Strategic decisions (ontology design) require careful human judgment
  • Documenting methodology makes work reproducible

About ontology implementation paths

  • Enhanced frontmatter

    The lightest implementation is to add structured metadata to existing MDX:

    ---
    entity: Framework
    id: validation-pyramid
    contains: [level-1-checks, level-2-structural, level-3-semantic, level-4-user]
    validates: any-ai-output
    retrieval_contexts:
    - "How do I validate AI output?"
    - "What checks should I run?"
    prerequisites: [ai-human-partnership-basics]
    ---

    ### The Validation Pyramid
    ...

    This can be parsed by AI systems or CI/CD tools without changing the authoring workflow (a parsing sketch follows this list).

  • Structured knowledge base

    Export ontology as structured data:

      ontology/
      ├── entities/
      │   ├── frameworks.yaml
      │   ├── principles.yaml
      │   ├── prompt-patterns.yaml
      │   └── checklists.yaml
      ├── relationships.yaml
      ├── prerequisites.yaml
      └── retrieval-index.yaml
  • Knowledge graph

    For complex querying and AI integration:

      CREATE (vp:Framework {id: 'validation-pyramid'})
      CREATE (l1:Checklist {id: 'level-1-checks'})
      CREATE (vp)-[:CONTAINS]->(l1)
      CREATE (l1)-[:VALIDATES]->(:Output {type: 'ai-generated'})
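
To show how light the first path (enhanced frontmatter) really is, here is a sketch that pulls the metadata block out of an MDX file with nothing but PyYAML (the filename is illustrative):

    import yaml  # PyYAML

    def read_frontmatter(path: str) -> dict:
        """Return the YAML frontmatter of an MDX file as a dict."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        if not text.startswith("---"):
            return {}
        # The frontmatter sits between the first two '---' delimiters
        _, frontmatter, _body = text.split("---", 2)
        return yaml.safe_load(frontmatter)

    meta = read_frontmatter("validation-pyramid.mdx")
    print(meta["entity"], meta["retrieval_contexts"])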

Summary

What started as a prompt engineering exercise became a knowledge architecture project. By shifting focus from content structure (DITA) to knowledge structure (ontology), I could:

  • Design an entity taxonomy for learning content
  • Create relationship types that capture how knowledge connects
  • Build retrieval contexts that map user questions to content
  • Develop an audit methodology focused on meaning, not format
  • Position the content for AI consumption

Crucial takeaway: This experiment does NOT prove that unstructured content is A-okay!
None of the hoops that Gemini and I jumped through (multiple ontology audits, building JSON files with detailed contexts, or even scripts) would have been necessary if I had structured and modular content to begin with.
Right now, any update to the source files also means updating the ontology files and the retrieval index, which is not a process that would hold up in an enterprise setting.

A distant goal of this experiment is to convert the 11 modules into structured data. How to do that is another project in itself, but I already have a draft plan in place:

START: MDX Source File

STEP 1: INTELLIGENT CHUNKING (Split by H2)

STEP 2: TWO-PASS TRANSFORMATION (The "Expert" Process)

├─ INPUT: Single MDX Chunk + [Context: Diataxis Defs, ID Rules]
│
├─ PASS 1: ANALYSIS AGENT (The "Architect")
│   │  "Analyze intent. Don't write content yet."
│   │
│   └─ OUTPUT: Metadata Object
│       ├─ Intent: "Teach a concept" vs "Guide a procedure"
│       ├─ Diataxis Mode: "Explanation"
│       └─ Prereqs: "Refers to [Previous Topic]"
│
├─ PASS 2: TRANSFORMATION AGENT (The "Writer")
│   │  "Convert text to Schema using Pass 1 metadata."
│   │
│   ├─ CONSTRAINT: Allow structural reformatting (Paragraph -> Steps)
│   ├─ CONSTRAINT: Ban fact fabrication
│   │
│   └─ OUTPUT: Structured JSON
│       ├─ type: "Task"
│       ├─ body: {
│       │     "context": "...",
│       │     "steps": ["1. Click...", "2. Type..."],
│       │     "result": "..."
│       │   }
│       └─ artifacts: [ { "type": "example", "content": "..." } ]

STEP 3: SEMANTIC VALIDATION

└─ Check: Does JSON "steps" count match source "action verbs"?
    ├─ YES ──► Generate .dita file
    └─ NO  ──► Flag for Review
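
The chunking and validation ends of that pipeline are easy to prototype even before the agent passes exist. A rough sketch of Step 1 and Step 3 (the action-verb list and the tolerance are assumptions of mine; the two agent passes would be LLM calls):

    import re

    def chunk_by_h2(mdx_text: str) -> list[str]:
        """Step 1: split an MDX source file into chunks at each H2 heading."""
        parts = re.split(r"(?m)^## ", mdx_text)
        return ["## " + p for p in parts[1:]]  # parts[0] is any preamble before the first H2

    ACTION_VERBS = ("click", "type", "select", "run", "open", "enter")

    def semantic_check(source_chunk: str, structured: dict) -> bool:
        """Step 3: does the JSON 'steps' count roughly match the source's action verbs?"""
        verb_count = sum(source_chunk.lower().count(v) for v in ACTION_VERBS)
        step_count = len(structured.get("body", {}).get("steps", []))
        return abs(verb_count - step_count) <= 1  # within tolerance → generate .dita, else flag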

Stay tuned!