How to Build an AI Study System That Actually Works (And Replace Your Broken PDF Tools)
Build a citation-grounded AI study system that ingests massive PDFs whole. A complete technical guide to vector databases, reranking strategies, and LLM orchestration for students tired of compromised answers.

The Problem Isn't That We Lack Tools—It's That We're Using Them Wrong
Let me start by saying this: I've watched too many smart people get gaslit by AI tools that promise everything and deliver vibes. You know the pattern. You upload your 500MB pharmacology textbook—the one you need for the exam that'll determine whether you get to keep pursuing the thing you actually care about—and the tool cheerfully tells you it's "ready." Then you ask it a question about drug interactions that requires synthesizing information from chapters 3, 11, and 27, and what you get back is either a beautifully formatted hallucination or a technically accurate response so fragmented it's useless.
NotebookLM does this. Claude does this when you try to brute-force massive context windows. ChatGPT definitely does this. And I want to be clear about something: this isn't because the underlying technology is fundamentally broken. It's because we're trying to use general-purpose conversational interfaces to solve a specific, structurally complex problem that requires a different architecture entirely.
The student who inspired this post needed something straightforward: ingest a massive textbook without splitting it (because the information they need doesn't respect chapter boundaries), generate theory-focused answers that are exam-ready, include proper inline citations, and ideally produce flowcharts, tables, and diagram references. When they enabled citations in NotebookLM, answer quality tanked. When they disabled citations, the answers were great but totally unverifiable. This is not a feature tradeoff. This is a fundamental architectural mismatch.
So here's what I'm going to do: I'm going to show you how to build a system that actually solves this. Not a hack, not a workaround, but a properly architected workflow that treats your PDF like the complex knowledge graph it actually is. And I'm going to assume you're a vibe coder—you know your way around full-stack development, you're comfortable in the terminal, you understand APIs and databases, but you're not trying to write a PhD thesis on retrieval-augmented generation. You just want something that works.

Why This Problem Is Actually Hard (And Why Most Tools Fail)
Before we build the solution, let's talk about why this is legitimately difficult, because understanding the constraints makes the architecture make sense.
The Context Window Trap
The naive approach—just throw the whole PDF into Claude's 200K token context window—sounds elegant until you realize that attention mechanisms don't distribute evenly across massive contexts. Research (and my own frustrating empirical experience) shows that LLMs struggle with "lost in the middle" problems: information buried in the middle of a huge context gets significantly less attention weight than stuff at the beginning or end. So even if you can technically fit your textbook into the context window, the model effectively forgets the middle chapters when answering questions.
And here's the thing that makes me furious about how this gets marketed: companies know this. They know their models perform worse on retrieval tasks as context length increases. But they're incentivized to advertise the maximum theoretical context window as if it's uniformly useful, which it absolutely is not.
The Citation Problem Is Actually a Retrieval Problem
When NotebookLM gives you citations, it's doing retrieval under the hood—finding relevant chunks, ranking them, then trying to ground the answer in those specific passages. The quality drops because now the model is working with fragmented context instead of the full narrative flow of the textbook. But when you disable citations, you're back to the context window trap, and the model is just vibing based on whatever it half-remembers from the entire document.
What you actually need is a system that:
- Breaks the PDF into semantically meaningful chunks (not arbitrary page splits)
- Stores those chunks in a way that preserves their relationships
- Retrieves the right chunks based on your question
- Reranks them for relevance
- Reconstructs enough context around those chunks that the answer makes narrative sense
- Generates citations that point back to specific locations
That's not a single tool. That's an orchestrated workflow.
The Diagram/Table Problem
Most PDF parsing treats tables and diagrams as second-class citizens. They get OCR'd into text (badly) or ignored entirely. But if you're studying medicine, engineering, or anything technical, those visual elements are load-bearing. You can't just skip them. You need a system that recognizes them, extracts them, indexes their captions and surrounding context, and includes them in retrieval.
Why NotebookLM (and Similar Tools) Fall Short
I don't want to just dunk on NotebookLM—it's actually a clever product that works well for certain use cases. But it's optimized for general knowledge synthesis, not deep, citation-grounded study of massive technical documents. Here's what's happening under the hood and why it doesn't fit this use case:
The Good: NotebookLM uses a retrieval-augmented generation (RAG) approach, which is fundamentally correct. It chunks your documents, embeds them, stores them in a vector database, and retrieves relevant passages when you ask questions.
The Problem: The chunking strategy, embedding model, and retrieval parameters are all black-boxed. You can't tune them. When you enable citations, it's retrieving smaller, more precise chunks to make grounding easier—but that sacrifices the contextual richness needed for complex synthesis. When you disable citations, it's probably pulling larger chunks or relying more heavily on the LLM's parametric memory, which improves coherence but loses verifiability.
You need control over this tradeoff. And you need to be able to inspect, debug, and iterate on the retrieval pipeline. Closed tools don't let you do that.
The Solution: A Recursive Agent Architecture with Layered Retrieval
Alright, here's the actual system we're building. I'm going to describe the architecture first at a high level, then walk through implementation step by step.
Conceptual Overview
We're building a multi-stage RAG pipeline with the following components:
- Document Ingestion & Intelligent Chunking: Parse the PDF, extract text/tables/diagrams, and chunk it in a way that preserves semantic coherence.
- Vector Database with Metadata: Store chunks with rich metadata (page numbers, section headers, proximity to diagrams/tables).
- Hybrid Retrieval: Combine semantic search (vector similarity) with keyword search (BM25) to catch both conceptual matches and specific terminology.
- Reranking Layer: Use a cross-encoder model to rerank retrieved chunks by relevance to the specific query.
- Context Reconstruction: Pull not just the top chunk, but also its neighbors (the chunks immediately before and after) to preserve narrative flow.
- LLM Orchestration with Structured Output: Feed the reconstructed context to a local LLM with a prompt that enforces citation formatting, encourages tables/flowcharts, and references diagrams.
- Iterative Refinement (Optional): Let the agent ask follow-up retrieval queries if the initial context is insufficient.
This sounds complicated, but each piece is conceptually simple. The magic is in how they compose.
Step-by-Step Implementation Guide
1. Choose Your Stack
Here's what I recommend for vibe coders who want something robust but not enterprise-overcomplicated:
Core Components:
- Docling (for PDF parsing): Handles complex layouts, tables, and diagrams better than PyPDF2 or pdfplumber. Outputs structured markdown with metadata.
- LangChain or LlamaIndex (for orchestration): Provides abstractions for chunking, embedding, retrieval, and LLM chaining. LlamaIndex is slightly more opinionated toward RAG use cases.
- Qdrant or Weaviate (for vector database): Qdrant has a great local-first Docker setup; Weaviate has excellent hybrid search out of the box.
- Sentence-Transformers (for embeddings): Use `all-MiniLM-L6-v2` for speed or `bge-large-en-v1.5` for quality.
- Cohere Rerank API or a local cross-encoder (for reranking): Cohere has a generous free tier; if you want fully local, use `cross-encoder/ms-marco-MiniLM-L-6-v2`.
- Ollama + Qwen2.5 or Llama 3.1 (for LLM): Qwen2.5 14B is shockingly good at structured output and reasoning; Llama 3.1 8B is faster but slightly less reliable on complex queries.
Alternative Stack (More Minimalist):
- Marker (PDF parsing): Lighter than Docling, still good.
- ChromaDB (vector database): Stupidly easy to set up, runs entirely in-process.
- Replicate or Groq API (hosted LLM): If you don't want to run models locally.
I'm biasing toward local-first because I deeply distrust putting your study materials—your intellectual labor—into someone else's cloud where they can train on it, rate-limit you, or change pricing.
2. Parse and Chunk the PDF
Install Docling:
```bash
pip install docling
```
Parse your PDF:
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("massive_textbook.pdf")

# Export to markdown with metadata
markdown_text = result.document.export_to_markdown()
```
Docling gives you structured output with headings, tables, and figure captions preserved. Now chunk it intelligently:
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Create Document object
doc = Document(text=markdown_text)

# Chunk with overlap to preserve context
splitter = SentenceSplitter(
    chunk_size=512,     # tokens
    chunk_overlap=128,  # overlap to prevent cutting mid-concept
)
nodes = splitter.get_nodes_from_documents([doc])
```
Key insight: The 128-token overlap is crucial. It means that if a concept spans a chunk boundary, the next chunk will include enough context to make sense of it. This is what most tools get wrong.
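One gap worth closing before moving on: the snippet above chunks the whole book as a single Document, so the chunks carry no page numbers for the citation step later. Here's a minimal sketch of one way to attach that metadata, assuming you can recover page boundaries from your parser; `page_texts` is a hypothetical list of `(page_number, text)` pairs, not a Docling API:
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Hypothetical: page_texts = [(1, "..."), (2, "..."), ...] recovered from your parser
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=128)

page_docs = [
    Document(text=page_text, metadata={"page_num": page_number})
    for page_number, page_text in page_texts
]
nodes = splitter.get_nodes_from_documents(page_docs)

# Each node now carries metadata["page_num"], which the prompt step in
# section 6 uses to render [Page X] citations.
```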

3. Set Up Your Vector Database
Let's use Qdrant because it's straightforward:
```bash
docker run -p 6333:6333 qdrant/qdrant
```
Index your chunks:
```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from qdrant_client import QdrantClient

# Connect to Qdrant
client = QdrantClient(url="http://localhost:6333")

# Create vector store
vector_store = QdrantVectorStore(
    client=client,
    collection_name="textbook_chunks",
)

# Build index
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Now your chunks are embedded and queryable.
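One caveat the snippet above glosses over: unless you configure an embedding model, LlamaIndex falls back to its default hosted embeddings, which defeats the local-first goal. Here's a minimal sketch of pointing it at the Sentence-Transformers model from the stack above, assuming the llama-index-embeddings-huggingface package is installed; set this before building the index:
```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a local Sentence-Transformers model for both indexing and querying,
# so your textbook text never leaves your machine.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
```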
4. Build the Hybrid Retrieval + Reranking Layer
LlamaIndex supports hybrid search natively if your vector store does (Weaviate is easiest for this). With Qdrant, you can combine vector search and BM25 manually; the snippet below sets up the semantic retriever and reranker, and a fusion sketch follows it:
```python
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank

# Semantic retrieval
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20,  # Cast a wide net
)

# Reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,  # Final top-k after reranking
)
```
This retrieves 20 candidates based on vector similarity, then reranks them using a more expensive but accurate cross-encoder model, keeping the top 5.
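And here's a minimal sketch of the manual vector + BM25 combination mentioned above, using LlamaIndex's fusion retriever; it assumes the llama-index-retrievers-bm25 package is installed and reuses the `nodes` and `retriever` objects from earlier:
```python
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Keyword retrieval over the same chunks
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=20)

# Fuse the semantic and keyword result lists with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    [retriever, bm25_retriever],
    similarity_top_k=20,
    num_queries=1,             # no LLM query expansion, just fusion
    mode="reciprocal_rerank",
    use_async=False,
)

# Swap hybrid_retriever in wherever `retriever` is used below.
```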
5. Reconstruct Context Around Retrieved Chunks
Here's where we do something most tutorials skip: we don't just use the top 5 chunks. We also grab their immediate neighbors to restore narrative flow.
```python
def get_chunk_with_context(retrieved_node, all_nodes, window=1):
    """Retrieve a chunk plus its immediate neighbors to restore narrative flow."""
    # Retrieved results come back as NodeWithScore wrappers, so match on the
    # underlying node ID instead of object identity.
    target_id = retrieved_node.node.node_id
    node_idx = next(i for i, n in enumerate(all_nodes) if n.node_id == target_id)
    start_idx = max(0, node_idx - window)
    end_idx = min(len(all_nodes), node_idx + window + 1)
    return "\n\n".join(n.text for n in all_nodes[start_idx:end_idx])
```
This gives the LLM enough surrounding context to generate coherent, narrative answers instead of fragmented factoids.
6. Orchestrate the LLM with Structured Prompting
Now we feed the reconstructed context to our local LLM with a carefully designed prompt:
```python
from llama_index.llms.ollama import Ollama

llm = Ollama(model="qwen2.5:14b", request_timeout=120.0)

query = "Explain the mechanism of action for beta-blockers in heart failure, including compensatory changes."

# Retrieve and rerank
retrieved_nodes = retriever.retrieve(query)
reranked_nodes = reranker.postprocess_nodes(retrieved_nodes, query_str=query)

# Reconstruct context
full_context = "\n\n---\n\n".join([
    f"[Source: Page {node.metadata.get('page_num', 'unknown')}]\n{get_chunk_with_context(node, nodes)}"
    for node in reranked_nodes
])

# Prompt engineering
prompt = f"""You are an expert study assistant helping a medical student prepare for exams. Answer the following question using ONLY the provided context from the textbook.

**Requirements:**
- Provide a comprehensive, theory-focused answer suitable for exam preparation
- Include inline citations in the format [Page X] after each claim
- If relevant, create a table or flowchart to organize information
- Reference any diagrams or figures mentioned in the context
- If the context doesn't contain enough information, explicitly state what's missing

**Context:**
{full_context}

**Question:**
{query}

**Answer:**"""

response = llm.complete(prompt)
print(response.text)
```
7. Add Iterative Refinement (Optional but Powerful)
For complex questions, the first retrieval might miss something. Build a simple agent loop:
```python
def recursive_query(query, max_iterations=3):
    """Simple agent loop: retry retrieval when the answer flags missing info."""
    for _ in range(max_iterations):
        # Retrieve and generate (generate_answer wraps steps 4-6; see the sketch below)
        response = generate_answer(query)

        # Check if the LLM indicates missing info
        if "insufficient information" in response.lower():
            # Extract what's missing and query again
            # (extract_missing_topic is a helper you supply, e.g. a second LLM call)
            follow_up = extract_missing_topic(response)
            query = f"{query} [Additional context needed: {follow_up}]"
        else:
            return response
    return response
```
This lets the system recursively fetch more context if the initial retrieval wasn't comprehensive enough.
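To make that loop concrete, here's a minimal sketch of what generate_answer could look like, wiring together the retriever, reranker, context reconstruction, and prompt from steps 4-6. It assumes retriever, reranker, nodes, and llm are already in scope; extract_missing_topic is left to you (a second LLM call that names the missing topic works fine):
```python
def generate_answer(query: str) -> str:
    """Minimal sketch: retrieve, rerank, reconstruct context, and prompt the LLM."""
    retrieved = retriever.retrieve(query)
    reranked = reranker.postprocess_nodes(retrieved, query_str=query)

    context = "\n\n---\n\n".join(
        f"[Source: Page {n.metadata.get('page_num', 'unknown')}]\n"
        f"{get_chunk_with_context(n, nodes)}"
        for n in reranked
    )

    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite pages as [Page X] after each claim. If the context is insufficient, "
        "reply with 'insufficient information' and name the missing topic.\n\n"
        f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
    )
    return llm.complete(prompt).text
```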
Why This Architecture Works
Let me be explicit about what we've accomplished here, because I think the underlying philosophy matters.
We've separated concerns: Parsing, embedding, retrieval, reranking, and generation are all discrete, inspectable stages. If something breaks, you can debug it. If you want to swap out Qwen for Llama, you change one function. This is software engineering, not magic.
We've preserved context at every layer: Overlapping chunks during ingestion, neighbor retrieval during context reconstruction, and explicit citation requirements in the prompt. The system is designed to maintain narrative coherence.
We've made the retrieval observable: You can log which chunks were retrieved, why they were ranked highly, and what context was fed to the LLM. This is crucial for iterating and improving the system.
We've kept it local: Your textbook never leaves your machine. You're not rate-limited. You're not paying per token. You own the infrastructure.
Example Workflow: A Day in the Life
Let's say you're studying for a pharmacology exam. Here's how you'd use this system:
Morning: Ingest your 500MB textbook. Run the parsing and indexing script. Go make coffee. Come back to a fully indexed knowledge base.
Afternoon: Ask complex questions:
- "Compare the pharmacokinetics of lipophilic vs hydrophilic beta-blockers"
- "Generate a table of antiarrhythmic drug classifications with mechanisms and side effects"
- "Explain why ACE inhibitors cause a dry cough, citing the specific biochemical pathway"
Each answer comes back with inline citations, relevant tables, and references to diagrams you can look up in the PDF.
Evening: When you're reviewing, ask for comparisons:
- "Create a flowchart comparing the sympathetic and parasympathetic effects on heart rate"
- "What are the contraindications for beta-blockers mentioned in chapters 8 and 15?"
The system pulls from multiple chapters, synthesizes the information, and gives you exam-ready summaries.

Limitations and Honest Tradeoffs
I need to be straight with you about what this system doesn't do:
It won't make you understand the material. This is a retrieval and synthesis tool. If you don't already have some baseline understanding of the domain, the answers—while accurate—might go over your head. Use it to reinforce and test your knowledge, not replace active learning.
The first-time setup takes time. Parsing a 500MB PDF, embedding it, and indexing it might take 30-60 minutes depending on your hardware. This is a one-time cost, but it's real.
It's not perfect at complex visual reasoning. If your textbook has intricate diagrams that require spatial understanding (like anatomy or circuit diagrams), the system can reference them but can't truly "see" them the way a multimodal model would. You'll still need to look at the actual images.
Reranking adds latency. Each query takes a few seconds because of the cross-encoder reranking step. This is the price of accuracy. If you need instant responses, you can skip reranking, but quality will drop.
You need decent hardware. Running a 14B parameter model locally requires at least 16GB of RAM and ideally a GPU. If you don't have that, use the Groq/Replicate API alternative I mentioned.
Frequently Asked Questions
Q: Can I use this for non-PDF documents?
Yes. The architecture works for any text-heavy document. Word docs, EPUBs, HTML exports, even well-structured Notion exports. Just swap out the PDF parser for the appropriate converter.
Q: How do I handle multiple textbooks in one system?
Create separate collections in your vector database (one per textbook) or use metadata filtering. LlamaIndex supports namespacing, so you can query "only search the cardiology textbook" or "search across all my medicine textbooks."
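For example, a minimal sketch with LlamaIndex metadata filters, assuming each chunk was ingested with a (hypothetical) `textbook` metadata key:
```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.core.retrievers import VectorIndexRetriever

# Scope retrieval to one textbook; the "textbook" key is an assumption about
# how you tagged chunks at ingestion time.
cardiology_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20,
    filters=MetadataFilters(filters=[
        ExactMatchFilter(key="textbook", value="cardiology"),
    ]),
)
```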
Q: What if I want to generate flashcards or quizzes?
Add a post-processing step after answer generation. Use a second LLM call with a prompt like "Convert this explanation into 5 flashcard Q&A pairs" or "Generate 3 multiple-choice questions testing this concept." The structured output from Qwen makes this straightforward.
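For example, reusing the `llm` object and `response` from step 6 (the prompt wording is just a starting point):
```python
# Minimal sketch: a second LLM call that turns a generated answer into flashcards.
flashcard_prompt = f"""Convert the following explanation into 5 flashcard Q&A pairs.
Format each pair as "Q: ..." on one line and "A: ..." on the next line.

{response.text}
"""
flashcards = llm.complete(flashcard_prompt)
print(flashcards.text)
```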
Q: Can I run this on a laptop without a GPU?
Yes, but use smaller models. Qwen2.5 7B or Llama 3.1 8B will run on CPU, just slower. Alternatively, use Groq's API (fast and cheap) and keep only the vector database local.
Q: How do I update the database when I get new lecture notes?
Just re-run the ingestion script on the new document and add it to the existing collection. Vector databases support incremental updates. You don't need to reindex everything.
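In code, reusing the converter, splitter, and index objects from the earlier steps, that might look like this sketch (the file name is just a placeholder):
```python
from llama_index.core import Document

# Minimal sketch: parse the new document and add its chunks to the existing index.
new_result = converter.convert("new_lecture_notes.pdf")
new_doc = Document(text=new_result.document.export_to_markdown())
new_nodes = splitter.get_nodes_from_documents([new_doc])

# Incremental add; existing chunks stay put, no full reindex needed.
index.insert_nodes(new_nodes)
```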
Q: What about handwritten notes or scanned PDFs?
You'll need an OCR step. Docling has some OCR support, but for heavily handwritten content, use something like Tesseract or Azure Form Recognizer first, then pipe the text into the normal workflow.
Conclusion: Building Tools That Respect Your Intelligence
Here's what frustrates me about most AI "study tools": they're built for people who want to feel productive, not people who want to actually learn. They're optimized for engagement metrics and viral demos, not deep, citation-grounded understanding.
What I've described here is not sexy. There's no slick interface, no "one-click magic," no subscription tier that promises to make you smarter. It's infrastructure. It's plumbing. It's the boring, difficult work of building systems that actually do what they claim to do.
But here's the thing: once you've built it, it's yours. You control the retrieval parameters. You can inspect why it gave you a certain answer. You can iterate and improve it as you learn more about your own study habits. You're not locked into someone else's product roadmap or pricing scheme.
And most importantly, you're engaging with the material on your terms. The system serves you; you don't serve the system.
If you're tired of tools that gaslight you with half-baked answers and broken promises, build this. It'll take an afternoon, maybe a weekend if you're learning as you go. But at the end of it, you'll have something real—a tool that treats your intelligence with respect and your education with seriousness.
Ready to build your own citation-grounded AI study system? Start with the Docling and LlamaIndex documentation, spin up a local Qdrant instance, and experiment with one chapter of your textbook first. Once you see it working—really working, with proper citations and coherent answers—you'll understand why this approach is worth the effort.
And if you build something cool with this, or run into interesting problems, I genuinely want to hear about it. This is how we make AI tools that are actually useful instead of just impressively mediocre.