  • Depth of architectural reasoning — 40
  • Trade-off analysis — 30
  • Ability to generalize beyond RAGFlow — 15
  • Precision and clarity — 15

Overview and learning objectives

RAGFlow is an end-to-end RAG platform that integrates deep document understanding, hybrid retrieval, and agent-based reasoning into a unified pipeline. In this assignment you will evaluate the architectural design decisions behind RAGFlow and reason about the trade-offs that arise in production RAG + Agent systems. For a good description of a baseline RAG system, see this book. By completing this assignment, you will:
  • Analyze document parsing and chunking strategies and their impact on retrieval quality.
  • Compare retrieval architectures (sparse, dense, hybrid) with concrete failure cases.
  • Reason about knowledge representation, query understanding, and memory design.
  • Design a microservices decomposition for a real-world RAG platform.

Instructions

For each question:
  • Provide a technical justification.
  • Analyze trade-offs and failure modes.
  • Ground your reasoning in systems, IR, or distributed architecture principles.

Questions

1. Deep document understanding vs naive chunking (10 pts)

RAGFlow emphasizes layout-aware document parsing (tables, structure, metadata) through its DeepDoc engine and configurable PDF parsers. Why does deep document understanding outperform fixed-size chunking in enterprise RAG? Discuss implications for:
  • Retrieval fidelity
  • Index design
  • Preprocessing cost
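
To make the contrast in Question 1 concrete, the following is a minimal sketch, not RAGFlow's DeepDoc implementation. The toy document, the 40-character chunk size, and both function names are assumptions chosen for illustration. It shows how a fixed-size splitter can cut a table header in half, while a splitter that respects block boundaries keeps the table in one chunk.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def block_aware_chunks(text: str) -> list[str]:
    """Split on blank lines so each paragraph or table stays whole.

    This is a crude stand-in for layout-aware parsing (assumption):
    a real parser would detect tables and headings from the layout itself.
    """
    return [block.strip() for block in text.split("\n\n") if block.strip()]


# Hypothetical toy document: a paragraph, a small table, and a footnote.
doc = (
    "Revenue summary for FY2023.\n\n"
    "| Region | Revenue |\n"
    "| EMEA   | 4.2M    |\n"
    "| APAC   | 3.1M    |\n\n"
    "Figures are unaudited."
)

fixed = fixed_size_chunks(doc, 40)
aware = block_aware_chunks(doc)

header = "| Region | Revenue |"
# The header straddles a 40-character boundary, so no fixed chunk holds it whole.
header_intact_fixed = any(header in chunk for chunk in fixed)
# The block-aware splitter keeps the header and every row in one chunk.
table_intact_aware = any(
    header in chunk and "EMEA" in chunk and "APAC" in chunk for chunk in aware
)
```

A chunk that contains a row without its header loses the column semantics, which is one way to frame the retrieval fidelity question. The block-aware version costs an extra parsing step, which relates to the preprocessing cost bullet.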

2. Chunking strategy: template vs semantic (10 pts)

RAGFlow supports configurable chunking strategies rather than a single method. Compare:
  • Template-based chunking
  • Embedding-driven semantic segmentation
Which one fails under:
  • Highly structured documents (e.g., financial reports)
  • Loosely structured corpora (e.g., chat logs)

3. Hybrid retrieval architecture (10 pts)

RAGFlow combines lexical (BM25), vector similarity, and re-ranking. Formally analyze why hybrid retrieval improves recall and precision. Provide concrete failure cases for:
  • Lexical-only
  • Vector-only
  • Hybrid (edge case)
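
As a starting point for Question 3, here is a small sketch of one common way to fuse a lexical ranking with a vector ranking: reciprocal rank fusion (RRF). This is an illustrative technique, not a statement about how RAGFlow combines scores. The document IDs, the two input rankings, and the constant k=60 are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (BM25-style) and a dense retriever.
lexical_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers (d1, d2) rise above documents found by only one (d3, d4). The hybrid edge case to consider is a relevant document that only one retriever finds, ranked low: fusion alone does not promote it.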

4. Multi-stage retrieval pipeline (10 pts)

RAGFlow decomposes retrieval into candidate generation, re-ranking, and query refinement. Why is a multi-stage pipeline superior to a single-pass ANN search? Discuss:
  • Recall vs latency trade-off
  • Cascading error propagation
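
The cascading error bullet can be illustrated with a toy two-stage search. This is a sketch under stated assumptions: the corpus, the query, and both scoring functions are hypothetical, with term overlap standing in for a cheap first-stage retriever and bigram overlap standing in for a more expensive re-ranker.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: number of shared terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def bigram_overlap(query: str, doc: str) -> float:
    """Stand-in for a costly re-ranker: number of shared adjacent word pairs."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


def two_stage_search(query: str, corpus: list[str], first_stage: Scorer,
                     reranker: Scorer, k_candidates: int, k_final: int) -> list[str]:
    """Generate candidates with a cheap scorer, then re-rank only those."""
    candidates = sorted(corpus, key=lambda d: first_stage(query, d),
                        reverse=True)[:k_candidates]
    reranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return reranked[:k_final]


corpus = [
    "reset password steps for admin account",
    "password policy and account reset rules",
    "how to reset your admin password",
    "billing account overview",
]
query = "reset admin password"

# A wide candidate set lets the re-ranker see the best document.
wide = two_stage_search(query, corpus, term_overlap, bigram_overlap, 2, 1)
# A narrow candidate set drops it before re-ranking, and it cannot be recovered.
narrow = two_stage_search(query, corpus, term_overlap, bigram_overlap, 1, 1)
```

The re-ranker runs on `k_candidates` documents rather than the whole corpus, which is the latency side of the trade-off. The `narrow` call shows the recall side: a first-stage miss propagates to the final result regardless of re-ranker quality.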

5. Indexing strategy and storage backends (10 pts)

RAGFlow builds retrieval-optimized indexes rather than relying on generic storage, with support for switching between doc engines including Elasticsearch and Infinity. Define design criteria for selecting a backend:
  • Elasticsearch-like hybrid store
  • Vector-native DB
  • Graph-augmented store
What workloads favor each?

6. Query understanding and reformulation (10 pts)

RAGFlow incorporates query rewriting and semantic gap handling in its pipeline via its multi-turn optimization feature. Why is query transformation (e.g., expansion, decomposition) critical in RAG? Compare:
  • Static query-to-retrieval (single pass)
  • Iterative query refinement (agent-driven)

7. Knowledge representation layer (10 pts)

RAGFlow can construct embeddings, metadata layers, and knowledge graphs. Compare three representations:
  • Dense vector space
  • Relational schema
  • Knowledge graph
How does each affect:
  • Compositional reasoning
  • Retrieval explainability
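
For the knowledge graph option in Question 7, the sketch below shows what compositional (two-hop) retrieval over explicit triples can look like. The entities, predicates, and function names are invented for illustration and do not come from RAGFlow.

```python
Triple = tuple[str, str, str]

# Hypothetical facts stored as (subject, predicate, object) triples.
triples: list[Triple] = [
    ("Acme", "acquired", "Beta"),
    ("Beta", "headquartered_in", "Berlin"),
    ("Acme", "headquartered_in", "Austin"),
]


def objects(subject: str, predicate: str) -> list[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def two_hop(subject: str, first: str, second: str) -> list[list[Triple]]:
    """Compose two predicates and return each answer with its supporting path."""
    paths = []
    for middle in objects(subject, first):
        for target in objects(middle, second):
            paths.append([(subject, first, middle), (middle, second, target)])
    return paths


# "Where is the company that Acme acquired headquartered?"
paths = two_hop("Acme", "acquired", "headquartered_in")
answers = [path[-1][2] for path in paths]
```

Each answer comes with the chain of triples that produced it, which is one form of retrieval explainability. A dense vector index would return passages near the query without exposing such a path, which is a useful contrast to develop in your answer.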

8. Data ingestion pipeline architecture (10 pts)

RAGFlow provides an ingestion pipeline that converts heterogeneous data into indexed knowledge. Design a robust ingestion system. Address:
  • Schema normalization across sources
  • Incremental indexing
  • Consistency vs throughput trade-offs
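
One way to approach the incremental indexing bullet is to skip documents whose content has not changed since the last run. The sketch below assumes an in-memory store and uses a SHA-256 content hash as the change detector. The class name and the whitespace tokenizer (a placeholder for parsing, chunking, and embedding) are hypothetical.

```python
import hashlib


class IncrementalIndexer:
    """Re-index a document only when its content hash changes."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}     # doc_id -> last indexed content hash
        self.index: dict[str, list[str]] = {}  # doc_id -> indexed tokens

    def upsert(self, doc_id: str, text: str) -> bool:
        """Index `text` under `doc_id`. Return True if work was done."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return False  # unchanged: skip the expensive pipeline
        # Placeholder for parse + chunk + embed (assumption).
        self.index[doc_id] = text.lower().split()
        self._hashes[doc_id] = digest
        return True

    def delete(self, doc_id: str) -> None:
        """Remove a document that no longer exists at the source."""
        self._hashes.pop(doc_id, None)
        self.index.pop(doc_id, None)


indexer = IncrementalIndexer()
first = indexer.upsert("report-1", "Q1 revenue grew")
repeat = indexer.upsert("report-1", "Q1 revenue grew")
changed = indexer.upsert("report-1", "Q1 revenue grew 4%")
```

In a distributed setting the hash table and the index would live in separate stores, so updating one without the other is where the consistency vs throughput question arises.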

9. Memory design in RAG systems (10 pts)

RAGFlow introduces memory components for long-running interactions, with evolving support across v0.23 and v0.24. Compare memory architectures:
  • Vector memory (semantic recall)
  • Structured memory (SQL/graph)
  • Episodic logs (temporal traces)

10. End-to-end system decomposition (10 pts)

RAGFlow spans ingestion, indexing, retrieval, reasoning, and serving (see system architecture). Design a microservices architecture for RAGFlow. Specify:
  • Stateless vs stateful services
  • Scaling strategy per component
  • Failure isolation boundaries

Deliverables

A written report of 3-5 pages in Markdown (.md or .mdx) addressing all 10 questions. Ensure that any Mermaid diagrams render correctly on GitHub.

Evaluation criteria

| Criterion | Description |
| --- | --- |
| Depth of reasoning | Architectural justification, not surface-level description |
| Trade-off analysis | Clear articulation of alternatives, failure modes, and when each applies |
| Generalization | Ability to reason beyond RAGFlow to general RAG/agent systems |
| Clarity | Precise language, well-structured arguments, readable diagrams |