Unstructured content, the bane of RAG systems. Or is it?
When I started my Prompt Engineering for IAs course generation experiment, it was meant to be a simple exercise in understanding how context windows work. Somewhere along the way, though, I found myself staring at 20,000+ lines of unstructured text across 11 modules.
In its current form, the course content is not AI-friendly, even though, ironically, it is AI-generated. The reams of content are about as much fun as a textbook, and a textbook without diagrams at that.
I noticed structural patterns, conceptual relationships, and opportunities to make the content more manageable. And DITA is what I know. Modular DITA topics that could serve multiple use cases (learning paths, reference lookup, AI training data, video scripts, PDFs...) seemed like the answer.
After a series of convoluted and partially successful attempts to wrangle the free-flowing course content into DITA chunks, I had to pause and ask: "Is DITA actually the right tool here?" My immediate goal was to have AI systems parse the content and thereby help learners navigate it faster. The current MDX format already makes it human-readable; I only need it to be machine-readable too.
## Pivoting from content structure to knowledge extraction
The question I now asked was: "How do I extract this content so AI can understand and use it effectively?" I figured an ontology would let this content be queried by AI assistants, retrieved by RAG systems, and surfaced in conversational interfaces.
An ontology defines:
- Entities - What types of knowledge objects exist
- Relationships - How they connect to each other
- Attributes - Metadata for filtering and retrieval
- Retrieval contexts - What questions each entity answers
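To make this concrete, here is a minimal sketch of what a single record in such an ontology could look like. This is illustrative Python, not a fixed schema; the field names simply mirror the four parts listed above.

```python
# Minimal sketch of one ontology record: entity, relationships, attributes,
# and retrieval contexts. Field names are illustrative, not a fixed schema.
from dataclasses import dataclass, field

@dataclass
class Relationship:
    kind: str        # e.g. CONTAINS, IMPLEMENTS, PRECEDES
    target_id: str   # id of the related entity

@dataclass
class KnowledgeEntity:
    id: str                      # stable identifier, e.g. "validation-pyramid"
    entity_type: str             # Framework, Principle, PromptPattern, ...
    title: str
    attributes: dict = field(default_factory=dict)                # metadata for filtering
    relationships: list[Relationship] = field(default_factory=list)
    retrieval_contexts: list[str] = field(default_factory=list)   # questions it answers
```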
Instead of asking "Is this a concept or a task?" (DITA thinking), I asked:
- "What knowledge does this represent?"
- "What other knowledge does it connect to?"
- "When would an AI need to surface this?"
This shifted the focus entirely:
| Old Focus (DITA) | New Focus (Ontology) |
|---|---|
| Content structure | Knowledge structure |
| Topic types (concept, task, reference) | Knowledge entities (framework, principle, pattern) |
| Reuse via embedding | Retrieval via semantic search |
| Consumer: publishing toolchain | Consumer: AI systems |
| Goal: maintain once, publish everywhere | Goal: AI understands meaning, not just text |
## Audit agent design
I created an ontology audit agent in Copilot with:
- **Entity identification rules** (a naive code approximation appears after this list):
  - IF content teaches a reusable mental model → Framework or Principle
  - IF content is an AI prompt meant to be copied → PromptPattern
  - IF content lists steps to follow → Workflow
  - IF content lists items to verify → Checklist
  - IF content helps choose between options → DecisionMatrix
  - IF content shows how something works → Example
  - IF content analyzes a real system → CaseStudy
  - IF content defines a term → Term
  - IF content warns against something → Warning
- **Relationship mapping process**:
  - Identify all entities in the module
  - Map CONTAINS relationships (parent-child)
  - Map IMPLEMENTS relationships (pattern → principle)
  - Map PREREQUISITES (learning sequence)
  - Identify cross-module REFERENCES
- **Output structure**. Each audit produces:
  - Entity inventory by type
  - Relationship map (visual hierarchy)
  - Prerequisite chain
  - Cross-module reference table
  - Retrieval contexts for each entity
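To show how mechanical the first component is, here is a deliberately naive, keyword-based approximation of the entity identification rules in Python. The real classification relies on the agent's LLM judgment; the cue strings below are placeholders that only illustrate the rule ordering.

```python
# Toy approximation of the entity identification rules. The audit agent applies
# these rules with LLM judgment; the keyword cues here are placeholders.
def classify_chunk(text: str) -> str:
    t = text.lower()
    if "prompt" in t and ("copy" in t or "paste" in t):
        return "PromptPattern"            # an AI prompt meant to be copied
    if any(cue in t for cue in ("step 1", "first,", "then,", "finally,")):
        return "Workflow"                 # steps to follow
    if any(cue in t for cue in ("checklist", "verify that", "confirm that")):
        return "Checklist"                # items to verify
    if "choose" in t and ("option" in t or "criteria" in t):
        return "DecisionMatrix"           # helps choose between options
    if "warning" in t or "avoid" in t or "anti-pattern" in t:
        return "Warning"                  # warns against something
    if "case study" in t:
        return "CaseStudy"                # analyzes a real system
    if "for example" in t or "example:" in t:
        return "Example"                  # shows how something works
    if "is defined as" in t or "refers to" in t:
        return "Term"                     # defines a term
    return "Framework"                    # default: reusable mental model
```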
### New audit (knowledge-focused)
```yaml
- id: validation-pyramid
  entity: Framework
  title: "Validation Pyramid"
  lines: 1065-1350
  contains:
    - level-1-quick-checks
    - level-2-structural
    - level-3-semantic
    - level-4-user-validation
  validates: any-ai-output
  retrieval_contexts:
    - "How do I validate AI-generated content?"
    - "What checks should I run before using AI output?"
  prerequisites:
    - ai-human-partnership-basics
  cross_module: [1-2, 2-1, 3-1]
```
This captured the knowledge structure, which is what AI systems need.
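Because the audit output is plain data, it can also be sanity-checked mechanically. A small sketch, assuming PyYAML and a hypothetical audits/module-2.yaml file holding a list of records like the one above:

```python
# Check that every audit record carries the fields downstream tooling expects.
# The file path and the required-field set are assumptions for illustration.
import yaml

REQUIRED_FIELDS = {"id", "entity", "title", "retrieval_contexts"}

with open("audits/module-2.yaml", encoding="utf-8") as f:
    records = yaml.safe_load(f)

for record in records:
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        print(f"{record.get('id', '<no id>')}: missing {sorted(missing)}")
```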
## The ontology layer of JSON files
### The entity taxonomy
The course content analysis revealed these knowledge entity types:
| Entity type | Description | Examples |
|---|---|---|
| Framework | Multi-part conceptual model | Validation Pyramid, IA Decision Framework |
| Principle | Core idea or guideline | AI-First zone, Progressive disclosure |
| PromptPattern | Reusable AI prompt with structure | Taxonomy generation, Refinement pattern |
| Workflow | Multi-step process | 4-round taxonomy refinement |
| Checklist | Actionable verification items | Quick checks, Structural validation |
| DecisionMatrix | Criteria for choosing options | Organizational approach matrix |
| Example | Illustrative scenario | Card sorting analysis |
| CaseStudy | Real-world application | Kubernetes docs structure |
| Term | Definition or glossary item | Polyhierarchy, Faceted classification |
| Warning | Critical caution or anti-pattern | Strategic red flags |
| Tool | Software or method | Tree testing, Card sorting |
| Role | Persona or user type | DevOps engineer, Content strategist |
This is fundamentally different from content type classification. It's about what knowledge exists, not how content is formatted.
### Relationships
Relationships describe how knowledge connects:
| Relationship | What it means | Example |
|---|---|---|
| CONTAINS | Parent includes child | Framework CONTAINS Principle |
| IMPLEMENTS | Applies a concept | PromptPattern IMPLEMENTS Principle |
| VALIDATES | Checks correctness | Checklist VALIDATES Workflow |
| DEMONSTRATES | Shows how it works | Example DEMONSTRATES Principle |
| PRECEDES | Must come before | Module 1 PRECEDES Module 2 |
| CONFLICTS_WITH | Trade-off or tension | AI-First CONFLICTS_WITH Human-First |
When an AI retrieves content, these relationships enable context:
- "This is part of a larger framework. Here's the context"
- "Before understanding this, you need to know..."
- "Here's an example that demonstrates this principle"
- "This conflicts with what we discussed earlier. Here's how to think about the trade-off"
### Retrieval contexts: Q&A
Each entity gets mapped to the questions it answers:
```yaml
- id: validation-pyramid
  entity: Framework
  title: "Validation Pyramid"
  retrieval_contexts:
    - "How do I validate AI-generated content?"
    - "What checks should I run on a taxonomy?"
    - "How thorough should my AI review be?"
```
This inverts the traditional content-first model. Instead of "here's what we have, hope you can find it," it's "here's what you might ask, and here's what answers it."
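As a sketch of what question-first lookup could mean in practice, here is a toy matcher that indexes entities by their retrieval contexts and scores a user question by token overlap. In a real system this would be an embedding search; the names ENTITIES and find_entity are mine, not the project's.

```python
# Toy question-first lookup: match a user question against the questions each
# entity is declared to answer. Token overlap stands in for embedding search.
ENTITIES = {
    "validation-pyramid": {
        "title": "Validation Pyramid",
        "retrieval_contexts": [
            "How do I validate AI-generated content?",
            "What checks should I run on a taxonomy?",
        ],
    },
}

def find_entity(question: str) -> str | None:
    q_tokens = set(question.lower().split())
    best_id, best_score = None, 0
    for entity_id, entity in ENTITIES.items():
        for context in entity["retrieval_contexts"]:
            score = len(q_tokens & set(context.lower().split()))
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id

print(find_entity("How should I validate a taxonomy?"))  # -> validation-pyramid
```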
## What this layer enables
### 1. Smarter RAG retrieval
Without ontology: "How do I validate a taxonomy?" → keyword search returns any chunk mentioning "validate" and "taxonomy"
With ontology:
- Retrieves the Validation Pyramid framework
- Includes child entities (Level 1-4 checklists)
- Notes that this validates AI output
- Excludes examples unless asked
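A rough sketch of that expansion step, assuming the audited records have been loaded into a dict keyed by entity id (the shapes follow the YAML shown earlier; the function name is illustrative):

```python
# Expand a matched entity with its CONTAINS children, what it validates, and
# its prerequisites, so the retrieved chunk arrives with its context attached.
def expand_hit(entity_id: str, ontology: dict) -> dict:
    entity = ontology[entity_id]
    return {
        "id": entity_id,
        "title": entity["title"],
        "children": [ontology[child]["title"] for child in entity.get("contains", [])],
        "validates": entity.get("validates"),
        "prerequisites": entity.get("prerequisites", []),
    }
```

Entities linked only by DEMONSTRATES (examples, case studies) would stay out of this payload unless the user asks for them.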
### 2. Prerequisite-aware responses
The AI can:
- Check whether the learner already understands the prerequisite concepts
- If not, surface the foundational content first
- Build explanations that connect to prior knowledge
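A minimal prerequisite walk over the same ontology dict might look like the following (assumes the prerequisite graph has no cycles; again a sketch, not the project's code):

```python
# Collect everything the learner should already know before this entity,
# returned with the most foundational items first.
def prerequisite_chain(entity_id: str, ontology: dict, seen: list | None = None) -> list:
    if seen is None:
        seen = []
    for prereq in ontology[entity_id].get("prerequisites", []):
        if prereq not in seen:
            prerequisite_chain(prereq, ontology, seen)  # foundations go in first
            seen.append(prereq)
    return seen
```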
### 3. Relationship-aware explanations
Instead of isolated chunks:
- "The Validation Pyramid is a framework containing 4 levels..."
- "This relates to the AI-First/Human-First decision you learned earlier..."
- "Here's a prompt pattern that implements this principle..."
### 4. Conflict/trade-off awareness
When principles conflict, AI can explain:
- "AI-First is appropriate here, but note the Human-First concerns about..."
- "This is a trade-off between efficiency and oversight..."
## What this project demonstrates
### Knowledge architecture
| Evidence | Competency |
|---|---|
| 12-type entity taxonomy | Ontology design |
| Relationship type definitions | Knowledge modeling |
| Retrieval context mapping | AI-ready content design |
| Prerequisite chains | Learning sequence design |
### Process design
| Element | Value |
|---|---|
| Audit agent with methodology | Repeatable, documentable process |
| Entity identification heuristics | Systematic classification |
| Relationship mapping protocol | Consistent knowledge capture |
| Output template | Structured, usable artifacts |
## What I've learned
### About content architecture
- Knowledge structure ≠ content structure
- Relationships are as important as entities
- User questions, not author convenience, drive good content organization
- Human-readable content with explicit structure can become AI-ready content
### About AI collaboration
- AI is excellent at identifying patterns and generating structure
- Strategic decisions (ontology design) require careful human judgment
- Documenting methodology makes work reproducible
### About ontology implementation paths
#### Enhanced frontmatter

The lightest implementation is to add structured metadata to the existing MDX:

```mdx
---
entity: Framework
id: validation-pyramid
contains: [level-1-checks, level-2-structural, level-3-semantic, level-4-user]
validates: any-ai-output
retrieval_contexts:
  - "How do I validate AI output?"
  - "What checks should I run?"
prerequisites: [ai-human-partnership-basics]
---

### The Validation Pyramid
...
```

This can be parsed by AI systems or CI/CD tools without changing the authoring workflow.
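For illustration, a CI job or ingestion script could read that metadata with nothing more than PyYAML and the `---` delimiter convention shown above (the module path is hypothetical):

```python
# Read the YAML frontmatter block that sits between the first two '---' markers.
import yaml

def read_frontmatter(path: str) -> dict:
    text = open(path, encoding="utf-8").read()
    if not text.startswith("---"):
        return {}
    _, meta, _body = text.split("---", 2)
    return yaml.safe_load(meta) or {}

meta = read_frontmatter("modules/2-1-validation.mdx")  # hypothetical file
print(meta["entity"], meta["retrieval_contexts"])
```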
#### Structured knowledge base

Export the ontology as structured data:

```text
ontology/
├── entities/
│   ├── frameworks.yaml
│   ├── principles.yaml
│   ├── prompt-patterns.yaml
│   └── checklists.yaml
├── relationships.yaml
├── prerequisites.yaml
└── retrieval-index.yaml
```
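One payoff of this layout is that consistency becomes scriptable. A sketch of a referential check, assuming PyYAML and that relationship records carry `source` and `target` ids (those key names are my assumption):

```python
# Verify that every id referenced in relationships.yaml exists in the entity files.
import glob
import yaml

known_ids = set()
for path in glob.glob("ontology/entities/*.yaml"):
    with open(path, encoding="utf-8") as f:
        for record in yaml.safe_load(f) or []:
            known_ids.add(record["id"])

with open("ontology/relationships.yaml", encoding="utf-8") as f:
    relationships = yaml.safe_load(f) or []

for rel in relationships:
    for key in ("source", "target"):
        if rel.get(key) not in known_ids:
            print(f"Dangling reference in relationship: {rel}")
```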
#### Knowledge graph

For complex querying and AI integration:

```cypher
CREATE (vp:Framework {id: 'validation-pyramid'})
CREATE (l1:Checklist {id: 'level-1-checks'})
CREATE (vp)-[:CONTAINS]->(l1)
CREATE (l1)-[:VALIDATES]->(:Output {type: 'ai-generated'})
```
## Summary
What started as a prompt engineering exercise became a knowledge architecture project. By shifting focus from content structure (DITA) to knowledge structure (ontology), I could:
- Design an entity taxonomy for learning content
- Create relationship types that capture how knowledge connects
- Build retrieval contexts that map user questions to content
- Develop an audit methodology focused on meaning, not format
- Position the content for AI consumption
Crucial takeaway: This experiment does NOT prove that unstructured content is A-okay!
None of the hoops that Gemini and I jumped through (multiple ontology audits, building JSON files with detailed contexts, or even scripts) would have been necessary if I had structured and modular content to begin with.
Right now, any update to the source files means updating the ontology files and the retrieval index by hand, which is not a process that would work in an enterprise setting.
A distant goal of this experiment is to convert the 11 modules into structured data. How I get there is another project in itself, but I already have a draft plan in place:
```text
START: MDX Source File
│
▼
STEP 1: INTELLIGENT CHUNKING (Split by H2)
│
▼
STEP 2: TWO-PASS TRANSFORMATION (The "Expert" Process)
│
├─ INPUT: Single MDX Chunk + [Context: Diataxis Defs, ID Rules]
│
├─ PASS 1: ANALYSIS AGENT (The "Architect")
│  │  "Analyze intent. Don't write content yet."
│  │
│  └─ OUTPUT: Metadata Object
│     ├─ Intent: "Teach a concept" vs "Guide a procedure"
│     ├─ Diataxis Mode: "Explanation"
│     └─ Prereqs: "Refers to [Previous Topic]"
│
├─ PASS 2: TRANSFORMATION AGENT (The "Writer")
│  │  "Convert text to Schema using Pass 1 metadata."
│  │
│  ├─ CONSTRAINT: Allow structural reformatting (Paragraph -> Steps)
│  ├─ CONSTRAINT: Ban fact fabrication
│  │
│  └─ OUTPUT: Structured JSON
│     ├─ type: "Task"
│     ├─ body: {
│     │    "context": "...",
│     │    "steps": ["1. Click...", "2. Type..."],
│     │    "result": "..."
│     │  }
│     └─ artifacts: [ { "type": "example", "content": "..." } ]
│
▼
STEP 3: SEMANTIC VALIDATION
│
└─ Check: Does JSON "steps" count match source "action verbs"?
   ├─ YES ──► Generate .dita file
   └─ NO ──► Flag for Review
```
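The STEP 3 check in this plan could start out as something as crude as comparing counts. A toy sketch, with a placeholder verb list and tolerance (a real check would need better linguistics):

```python
# Compare the number of steps in the transformed JSON against a rough count of
# imperative action verbs in the source chunk; flag big mismatches for review.
ACTION_VERBS = {"click", "type", "select", "open", "run", "save", "enter", "choose"}

def plausible_step_count(source_text: str, topic_json: dict, tolerance: int = 2) -> bool:
    words = (w.strip(".,:;()") for w in source_text.lower().split())
    verb_hits = sum(1 for w in words if w in ACTION_VERBS)
    steps = topic_json.get("body", {}).get("steps", [])
    return abs(len(steps) - verb_hits) <= tolerance

# Chunks that fail the check are routed to human review instead of generating .dita.
```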
Stay tuned!