Chat with Multiple PDFs
An end-to-end Gen AI project that processes 1,000+ pages. Built with LangChain, FAISS, and Streamlit.
The Challenge
Why read 1,000 pages when AI can do it?
"I’m a huge Harry Potter fan!" I declared to the class. A skeptical student (the class Nerd) muttered, "Yeah... like that's real."
To prove the power of RAG (Retrieval-Augmented Generation), I uploaded a 1,065-page Harry Potter fanfiction PDF into my system and asked it to summarize the battle in the Chamber of Secrets.
The result? A precise, instant answer about Basilisks, the Sword of Gryffindor, and Fawkes the Phoenix. No magic required, just good architecture.
RAG in Action
"Summarize the battle in the Chamber of Secrets."
"In the Chamber of Secrets, Harry Potter battles a gigantic basilisk summoned by Tom Riddle. By dodging the basilisk’s lethal gaze and wielding the Sword of Gryffindor..."
The Architecture
How do we process 1,000+ pages instantly?
PDF Processing
- Extract text from PDFs
- Split into chunks (10k chars)
- Create Embeddings (Google GenAI)
- Store in Vector DB (FAISS)
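To make the chunking step concrete, here is a minimal sketch of splitting with overlap. It is a plain fixed-size sliding window, a simplified stand-in for LangChain's recursive splitter rather than the project's actual code, and the function name `split_with_overlap` is hypothetical. Tiny sizes are used so the overlap is easy to see.

```python
def split_with_overlap(text: str, chunk_size: int = 10_000, chunk_overlap: int = 1_000) -> list[str]:
    """Fixed-size sliding window (illustrative stand-in, not the LangChain splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Tiny sizes so the shared characters between neighbours are visible.
chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. At the real settings (10,000 characters with 1,000 of overlap), the same idea keeps a sentence that straddles a chunk boundary visible in both neighbours.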
Query Handling
- Convert query to embedding
- Similarity Search (FAISS)
- Retrieve Context Chunks
- Generate Answer (Gemini Pro)
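The last two steps above meet in a prompt: the retrieved chunks are pasted in as context and the model is asked to answer from them. The sketch below shows one way that assembly could look. The template wording, the `build_prompt` name, and the sample chunks are all illustrative assumptions, and the actual call to Gemini is left out.

```python
# Hypothetical template; the project's real prompt wording is not shown in the source.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the answer is not in the context, say so.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n---\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for what a similarity search might return.
prompt = build_prompt(
    "Summarize the battle in the Chamber of Secrets.",
    [
        "Harry drew the Sword of Gryffindor from the Sorting Hat.",
        "Fawkes blinded the basilisk.",
    ],
)
print(prompt)
```

Telling the model to stay within the supplied context is a common way to reduce answers invented from outside the uploaded PDFs.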
Why Vector Stores?
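Imagine a 1,000-page document. Rereading every page for each question would be slow, so we convert each text chunk into an "embedding" (a numerical vector) and store the vectors in FAISS. A question is embedded the same way, and FAISS returns the chunks whose vectors lie closest to it, so only the most relevant passages reach the model.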
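A toy version of that idea fits in a few lines of plain Python. In this sketch a word-count vector stands in for a learned embedding and a brute-force scan stands in for a FAISS index, so it illustrates the principle, not how FAISS or Google's embeddings work internally. The names `toy_embed` and `ToyVectorStore` and the sample sentences are made up for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Word-count vector; a crude stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores (chunk, vector) pairs and ranks them by similarity to a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def search(self, query, k=2):
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Placeholder chunks, invented for the example.
store = ToyVectorStore([
    "The basilisk lunged at Harry in the chamber.",
    "Hermione spent the afternoon in the library.",
    "Quidditch practice was cancelled because of rain.",
])
top = store.search("What happened with the basilisk in the chamber?", k=1)
print(top)
```

A real embedding model places texts with similar meaning close together even when they share no words, which word counts cannot do. The scan here also compares the query against every chunk, whereas a vector index is built to keep that lookup fast as the collection grows.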
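# Import path for classic LangChain releases; newer ones use langchain_text_splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_text_chunks(text):
    # Split the extracted text into 10,000-character chunks that overlap
    # by 1,000 characters, so context is not lost at chunk boundaries.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=10000,
        chunk_overlap=1000
    )
    chunks = text_splitter.split_text(text)
    return chunks
The Code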
Integrating LangChain, Gemini, and FAISS.
TECH 01
PyPDF2
Extract raw text from PDF files.
TECH 02
LangChain
Text splitting & Prompt templates.
TECH 03
Google Gemini
Embeddings & Generation (1.0 Pro).
TECH 04
FAISS
Local vector storage for retrieval.
Deployment
Going live with Streamlit Cloud.
Streamlit Cloud
Connected directly to the GitHub repository for continuous deployment. Any push to `main` triggers a rebuild.
Secret Management
Secured sensitive data like `GEMINI_API_KEY` using Streamlit's secrets management (`secrets.toml`).
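As a rough sketch of how the key might be looked up at runtime, the helper below checks a secrets mapping first and falls back to environment variables. The `resolve_api_key` function and the fallback order are assumptions for illustration, not the project's code. Plain dicts are used here so the example runs without Streamlit installed.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    secrets: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
    name: str = "GEMINI_API_KEY",
) -> str:
    """Return the key from `secrets`, else from the environment, else fail loudly."""
    env = os.environ if env is None else env
    key = secrets.get(name) or env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not configured")
    return key


# Dicts stand in for the real secrets store and environment.
key = resolve_api_key({"GEMINI_API_KEY": "demo-key"}, env={})
print(key)
```

Inside a Streamlit app with a secrets file in place, the dict-like `st.secrets` object could presumably be passed as the `secrets` argument. Failing loudly when the key is missing makes a misconfigured deployment obvious at startup, and keeping the key out of the repository means it never lands in a public commit.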
Live Demo
Interact with the interface.
Demo Mode
The live Streamlit app is currently restricted for security. Enjoy this interactive preview of the interface.