- Depth of architectural reasoning — 40
- Trade-off analysis — 30
- Ability to generalize beyond RAGFlow — 15
- Precision and clarity — 15
Overview and learning objectives
RAGFlow is an end-to-end RAG platform integrating deep document understanding, hybrid retrieval, and agent-based reasoning into a unified pipeline. In this assignment you will evaluate the architectural design decisions behind RAGFlow and reason about trade-offs that arise in production RAG + Agent systems. You will find a good description of what a baseline RAG is in this book. By completing this assignment, you will:
- Analyze document parsing and chunking strategies and their impact on retrieval quality.
- Compare retrieval architectures (sparse, dense, hybrid) with concrete failure cases.
- Reason about knowledge representation, query understanding, and memory design.
- Design a microservices decomposition for a real-world RAG platform.
Instructions
For each question:
- Provide a technical justification.
- Analyze trade-offs and failure modes.
- Ground your reasoning in systems, IR, or distributed architecture principles.
Questions
1. Deep document understanding vs naive chunking (10 pts)
RAGFlow emphasizes layout-aware document parsing (tables, structure, metadata) through its DeepDoc engine and configurable PDF parsers. Why does deep document understanding outperform fixed-size chunking in enterprise RAG? Discuss implications for:
- Retrieval fidelity
- Index design
- Preprocessing cost
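To make the retrieval-fidelity point concrete, here is a minimal sketch; it is not RAGFlow's DeepDoc implementation. The `Block` type, its `kind` labels, and both chunking functions are hypothetical names introduced for illustration, and a real parser would emit much richer layout information. The sketch contrasts cutting raw text at a fixed character count with grouping parsed layout blocks under their heading.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # hypothetical layout label: "heading", "paragraph", or "table"
    text: str


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the raw text every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def layout_aware_chunks(blocks: list[Block]) -> list[str]:
    """Start a new chunk at each heading and never split a block."""
    chunks: list[list[str]] = []
    for block in blocks:
        if block.kind == "heading" or not chunks:
            chunks.append([])
        chunks[-1].append(block.text)
    return ["\n".join(parts) for parts in chunks]


blocks = [
    Block("heading", "Revenue by region"),
    Block("table", "Region | Q1 | Q2\nEMEA | 10 | 12\nAPAC | 7 | 9"),
    Block("heading", "Outlook"),
    Block("paragraph", "Growth is expected to continue."),
]
raw = "\n".join(b.text for b in blocks)

naive = fixed_size_chunks(raw, 40)   # window is smaller than the table
aware = layout_aware_chunks(blocks)  # one chunk per heading section
```

With a 40-character window the 44-character table cannot fit in any single chunk, so a query about EMEA revenue may retrieve a fragment without its header row. The layout-aware variant keeps the heading and the whole table together, at the cost of an upstream parsing step.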
2. Chunking strategy: template vs semantic (10 pts)
RAGFlow supports configurable chunking strategies rather than a single method. Compare:
- Template-based chunking
- Embedding-driven semantic segmentation

Consider how each performs on:
- Highly structured documents (e.g., financial reports)
- Loosely structured corpora (e.g., chat logs)
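As a rough illustration of embedding-driven segmentation, the sketch below starts a new segment whenever the similarity between adjacent sentences drops below a threshold. It is an assumption-laden toy: `embed` is a bag-of-words stand-in for a real embedding model, and the function names and the threshold value are hypothetical, not taken from RAGFlow.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_segments(sentences: list[str], threshold: float) -> list[list[str]]:
    """Append a sentence to the current segment if it is similar enough
    to the previous sentence; otherwise open a new segment."""
    segments: list[list[str]] = []
    for sentence in sentences:
        if segments and cosine(embed(segments[-1][-1]), embed(sentence)) >= threshold:
            segments[-1].append(sentence)
        else:
            segments.append([sentence])
    return segments


sentences = [
    "the invoice total is due in march",
    "the invoice total includes tax",
    "lunch is at noon today",
    "lunch today is pizza",
]
segments = semantic_segments(sentences, threshold=0.3)
```

Here the topic shift between the second and third sentence produces a boundary without any template. The same mechanism has no notion of tables or section headings, which is where template-based chunking tends to do better.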
3. Hybrid retrieval architecture (10 pts)
RAGFlow combines lexical (BM25), vector similarity, and re-ranking. Formally analyze why hybrid retrieval improves recall and precision. Provide concrete failure cases for:
- Lexical-only
- Vector-only
- Hybrid (edge case)
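One common way to merge a lexical ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely for illustration; this assignment does not state which fusion method RAGFlow applies, and the document IDs are made up. The constant `k = 60` is the value conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by the sum of 1 / (k + rank) over all rankings."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings: the lexical retriever finds the exact error code,
# the dense retriever finds a paraphrase that shares no keywords.
lexical = ["doc_err_code", "doc_faq", "doc_release"]
dense = ["doc_paraphrase", "doc_faq", "doc_err_code"]

fused = reciprocal_rank_fusion([lexical, dense])
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept. An edge case worth analyzing: a document ranked first by one retriever and missing from the other can be outranked by a mediocre document that both retrievers return.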
4. Multi-stage retrieval pipeline (10 pts)
RAGFlow decomposes retrieval into candidate generation, re-ranking, and query refinement. Why is a multi-stage pipeline superior to a single-pass ANN search? Discuss:
- Recall vs latency trade-off
- Cascading error propagation
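The two discussion points can be seen in a small sketch of a two-stage pipeline. Everything here is a toy under stated assumptions: `cheap_score` stands in for BM25 or ANN search, `ToyReranker` stands in for a cross-encoder and only "knows" a hypothetical synonym table, and the corpus is invented.

```python
# Hypothetical synonym table standing in for what a learned re-ranker captures.
SYNONYMS = {"recover": "reset", "passphrase": "password"}


def cheap_score(query: str, doc: str) -> int:
    """Stage 1: plain token overlap, cheap enough to run on the whole corpus."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


class ToyReranker:
    """Stage 2: synonym-aware overlap; counts calls to expose its cost."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, query: str, doc: str) -> int:
        self.calls += 1
        tokens = {SYNONYMS.get(t, t) for t in doc.lower().split()}
        return len(set(query.lower().split()) & tokens)


def retrieve(query, corpus, rerank, n_candidates, top_k):
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, d), reverse=True
    )[:n_candidates]
    # The expensive scorer only ever sees the stage-1 survivors.
    return sorted(candidates, key=lambda d: rerank(query, d), reverse=True)[:top_k]


corpus = [
    "reset your password from the login page",
    "password strength rules and password expiry",
    "factory reset for the office router",
    "recover access after a forgotten passphrase",
]
narrow_reranker, wide_reranker = ToyReranker(), ToyReranker()
narrow = retrieve("reset password", corpus, narrow_reranker, n_candidates=2, top_k=2)
wide = retrieve("reset password", corpus, wide_reranker, n_candidates=4, top_k=2)
```

With `n_candidates=2` the re-ranker runs twice but never sees the paraphrased document, so the stage-1 miss propagates to the final result. With `n_candidates=4` the paraphrase is recovered at twice the re-ranking cost.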
5. Indexing strategy and storage backends (10 pts)
RAGFlow builds retrieval-optimized indexes rather than relying on generic storage, with support for switching between doc engines including Elasticsearch and Infinity. Define design criteria for selecting a backend:
- Elasticsearch-like hybrid store
- Vector-native DB
- Graph-augmented store
6. Query understanding and reformulation (10 pts)
RAGFlow incorporates query rewriting and semantic gap handling in its pipeline via its multi-turn optimization feature. Why is query transformation (e.g., expansion, decomposition) critical in RAG? Compare:
- Static query to retrieval
- Iterative query refinement (agent-driven)
7. Knowledge representation layer (10 pts)
RAGFlow can construct embeddings, metadata layers, and knowledge graphs. Compare three representations:
- Dense vector space
- Relational schema
- Knowledge graph

Evaluate each with respect to:
- Compositional reasoning
- Retrieval explainability
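As one illustration of how a graph representation supports both criteria, the sketch below answers a two-hop question by following typed edges and returns the chain of triples it used. The triples, entity names, and the `multi_hop` helper are hypothetical and do not reflect RAGFlow's knowledge graph API.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)


def multi_hop(
    triples: list[Triple], start: str, relations: list[str]
) -> list[list[Triple]]:
    """Return every chain of triples that follows `relations` in order from `start`."""
    frontier: list[tuple[str, list[Triple]]] = [(start, [])]
    for relation in relations:
        next_frontier = []
        for node, path in frontier:
            for s, r, o in triples:
                if s == node and r == relation:
                    next_frontier.append((o, path + [(s, r, o)]))
        frontier = next_frontier
    return [path for _, path in frontier]


# Invented facts for the question:
# "Where is the company that AcmeCorp acquired headquartered?"
triples = [
    ("AcmeCorp", "acquired", "WidgetCo"),
    ("WidgetCo", "headquartered_in", "Berlin"),
    ("AcmeCorp", "headquartered_in", "Austin"),
]
paths = multi_hop(triples, "AcmeCorp", ["acquired", "headquartered_in"])
answer = paths[0][-1][2]
```

The returned path doubles as an explanation of how the answer was reached. A nearest-neighbour lookup in a dense vector space could just as easily surface the Austin fact, since it mentions both AcmeCorp and a headquarters, and it would offer no trace of the reasoning.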
8. Data ingestion pipeline architecture (10 pts)
RAGFlow provides an ingestion pipeline that converts heterogeneous data into indexed knowledge. Design a robust ingestion system. Address:
- Schema normalization across sources
- Incremental indexing
- Consistency vs throughput trade-offs
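For incremental indexing, one simple approach is to store a content hash per document and re-process only what changed. The sketch below assumes documents are identified by a stable `doc_id` and that the index keeps the hash from the last run; `plan_incremental_update` is a hypothetical helper, not a RAGFlow function.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(
    indexed: dict[str, str], source: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare stored hashes with current source text.

    indexed: doc_id -> hash recorded at the last indexing run
    source:  doc_id -> current document text
    Returns (ids to parse and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in source.items()
        if indexed.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed) - set(source))
    return to_upsert, to_delete


indexed = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
source = {"a": "alpha", "b": "beta v2", "d": "delta"}

to_upsert, to_delete = plan_incremental_update(indexed, source)
```

Only the changed document `b` and the new document `d` go back through parsing and embedding, and `c` is removed. Applying such a plan in batches favors throughput but leaves a window in which the index lags the source, which is the consistency side of the trade-off.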
9. Memory design in RAG systems (10 pts)
RAGFlow introduces memory components for long-running interactions, with evolving support across v0.23 and v0.24. Compare memory architectures:
- Vector memory (semantic recall)
- Structured memory (SQL/graph)
- Episodic logs (temporal traces)
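The difference between structured memory and an episodic log can be sketched in a few lines. Both classes are hypothetical illustrations and not RAGFlow's memory components; vector memory is left out because it would need an embedding model.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredMemory:
    """Keyed facts: a later write overwrites the earlier value."""
    facts: dict[str, str] = field(default_factory=dict)

    def record(self, key: str, value: str) -> None:
        self.facts[key] = value


@dataclass
class EpisodicLog:
    """Append-only trace of (turn, key, value) events."""
    events: list[tuple[int, str, str]] = field(default_factory=list)

    def record(self, turn: int, key: str, value: str) -> None:
        self.events.append((turn, key, value))

    def history(self, key: str) -> list[tuple[int, str]]:
        return [(turn, value) for turn, k, value in self.events if k == key]


structured, log = StructuredMemory(), EpisodicLog()
for turn, key, value in [(1, "destination", "Paris"), (5, "destination", "Rome")]:
    structured.record(key, value)
    log.record(turn, key, value)

current = structured.facts["destination"]
trace = log.history("destination")
```

Structured memory answers "what is the destination now?" in one lookup but has forgotten that it was ever Paris. The episodic log can answer "what did the user say before turn 5?" but grows without bound and needs a scan or an index to query.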
10. End-to-end system decomposition (10 pts)
RAGFlow spans ingestion, indexing, retrieval, reasoning, and serving (see system architecture). Design a microservices architecture for RAGFlow. Specify:
- Stateless vs stateful services
- Scaling strategy per component
- Failure isolation boundaries
Deliverables
A written report of 3-5 pages in Markdown (md or mdx) addressing all 10 questions. Ensure that any Mermaid diagrams parse correctly on GitHub.
Evaluation criteria
| Criterion | Description |
|---|---|
| Depth of reasoning | Architectural justification, not surface-level description |
| Trade-off analysis | Clear articulation of alternatives, failure modes, and when each applies |
| Generalization | Ability to reason beyond RAGFlow to general RAG/agent systems |
| Clarity | Precise language, well-structured arguments, readable diagrams |

