What Happened
This tutorial chapter covers LangChain's document loading and text splitting system, the two foundational steps before building any RAG (Retrieval-Augmented Generation) pipeline. LangChain provides purpose-built document loaders for PDF (PyPDFLoader, UnstructuredPDFLoader), Word (Docx2txtLoader), HTML (UnstructuredHTMLLoader for local files, UnstructuredURLLoader for web pages), Markdown (UnstructuredMarkdownLoader), plain text (TextLoader), and Excel (UnstructuredExcelLoader). Every loader's load() method returns a list of Document objects, each carrying page_content (a string) and metadata (a dict holding the source and, for paginated formats, the page number).
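As a rough illustration of the loader interface described above, the sketch below picks a loader class by file extension and returns its documents. It assumes the loaders are importable from langchain_community.document_loaders, which is where recent LangChain releases keep them. LOADER_BY_SUFFIX, loader_name_for, and load_any are hypothetical helper names, not part of LangChain.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to a LangChain loader class name.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
    ".xlsx": "UnstructuredExcelLoader",
}


def loader_name_for(path):
    """Return the loader class name for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader configured for {suffix!r}")
    return LOADER_BY_SUFFIX[suffix]


def load_any(path):
    """Load a file into a list of Document objects."""
    # Imported lazily so the mapping above works without LangChain installed.
    # Assumption: a release where loaders live in langchain_community.
    import langchain_community.document_loaders as loaders

    loader_cls = getattr(loaders, loader_name_for(path))
    docs = loader_cls(str(path)).load()
    for doc in docs:
        # Each Document exposes page_content (str) and metadata (dict).
        print(doc.metadata, len(doc.page_content))
    return docs
```

Calling load_any on a PDF path would print one metadata dict per page, provided pypdf is installed alongside LangChain.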
Why It Matters
Most LLMs have hard context window limits: the original GPT-3.5-turbo, for example, caps at roughly 4,000 tokens. A 50-page PDF fed directly into a prompt is either rejected for exceeding that limit or has to be truncated, which discards most of the document. Indie developers and SMEs building internal tools (HR policy bots, contract analyzers, support knowledge bases) need a reliable way to:
- Parse heterogeneous file formats without writing custom parsers
- Preserve metadata (page number, source file) for citation and debugging
- Chunk documents into token-safe segments before embedding and retrieval
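LangChain's loaders take care of the parsing layer, so there is no format-specific extraction code to write. Beyond LangChain itself, the only setup is installing the backing parser (pypdf for PyPDFLoader, unstructured for the Unstructured* loaders). After that, every file loads into the same Document structure, ready for splitting and embedding.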
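To make the chunking step from the list above concrete, here is a minimal plain-Python stand-in for a text splitter. It is not LangChain's implementation (in practice a class such as RecursiveCharacterTextSplitter would do this job); it measures size in characters rather than tokens, and chunk_text, the "start" metadata key, and the default sizes are all illustrative assumptions. It shows the two properties that matter: overlapping windows, and metadata copied onto every chunk so citations survive.

```python
def chunk_text(text, metadata, chunk_size=500, overlap=50):
    """Split text into overlapping character windows, copying metadata to each chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far each window advances
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Keep the source metadata and record where the chunk began.
        chunks.append({"page_content": piece, "metadata": {**metadata, "start": start}})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A 1,200-character page split with the defaults yields three chunks, each still tagged with the original source and page.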
Asia-Pacific Angle
Chinese and Southeast Asian developers building document Q&A products face an additional challenge: most open-source loaders are optimized for Latin-script text. When UnstructuredPDFLoader is run on Chinese-language PDFs, character encoding and multi-column layouts can break extraction. Practical mitigations include:
- Using PyMuPDF (fitz) as an alternative PDF backend, which handles CJK fonts more reliably
- Setting chunk_size conservatively (512–800 tokens) when splitting Chinese text, since tokenizers count CJK characters differently than English; note that LangChain's character-based splitters measure chunk_size in characters by default, so a token-based length function is needed for this limit to mean tokens
- Pairing loaders with Qwen or GLM-series models, whose tokenizers are trained on Chinese corpora, which reduces token miscounts on Chinese text
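One way to act on these mitigations is sketched below, under stated assumptions. cjk_ratio and pick_chunk_size are hypothetical helpers: the 0.3 threshold and the 1000 default are illustrative guesses, and 512 is simply the low end of the range suggested above. load_pdf_with_pymupdf assumes PyMuPDFLoader is importable from langchain_community.document_loaders and that the pymupdf package is installed.

```python
def cjk_ratio(text):
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # CJK Unified Ideographs plus Extension A.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff" or "\u3400" <= c <= "\u4dbf")
    return cjk / len(chars)


def pick_chunk_size(text, cjk_size=512, default_size=1000, threshold=0.3):
    """Choose a smaller chunk size when the text is mostly CJK (heuristic)."""
    return cjk_size if cjk_ratio(text) >= threshold else default_size


def load_pdf_with_pymupdf(path):
    """Load a PDF through the PyMuPDF backend instead of unstructured."""
    # Lazy import: requires langchain-community and pymupdf.
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(str(path)).load()
```

The ratio check is only a cheap routing signal; an accurate token budget still requires counting with the target model's own tokenizer.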
Action Item This Week
Run pip install langchain langchain-community pypdf unstructured (recent releases ship the loaders in the separate langchain-community package), load one internal PDF with PyPDFLoader, print the metadata dict for each page, and confirm that page numbers are preserved (PyPDFLoader numbers pages from 0) before building any retrieval index. Missing metadata is a common cause of broken citations in production RAG apps.
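The check described in this action item could look something like the sketch below. pages_missing_metadata and audit_pdf are hypothetical helper names; the code assumes PyPDFLoader is importable from langchain_community.document_loaders and that it sets "source" and "page" keys in each page's metadata.

```python
def pages_missing_metadata(metadatas, required=("source", "page")):
    """Return indexes of metadata dicts that lack any required key."""
    return [
        i for i, meta in enumerate(metadatas)
        if any(key not in meta for key in required)
    ]


def audit_pdf(path):
    """Load a PDF, print per-page metadata, and fail loudly if keys are missing."""
    # Lazy import: requires langchain-community and pypdf.
    from langchain_community.document_loaders import PyPDFLoader

    docs = PyPDFLoader(str(path)).load()
    for doc in docs:
        print(doc.metadata)
    missing = pages_missing_metadata([doc.metadata for doc in docs])
    if missing:
        raise ValueError(f"Pages missing source/page metadata: {missing}")
    return docs
```

Running audit_pdf on one internal file before indexing gives a quick pass/fail signal on whether citations can later point back to a page.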