Text chunking is a fundamental technique in natural language processing where long documents are split into smaller, manageable pieces called chunks. This process is essential for systems with token limits (like language models), search engines that need to index content in segments, and document processing pipelines.
Many NLP systems have practical limits on input size. Large language models have fixed context windows (token limits), and processing very long texts can be computationally expensive. Chunking keeps each piece of input within those limits.
Fixed-size chunking splits text into chunks of N tokens each (the final chunk may be shorter), but it can break a sentence or a thought partway through. Semantic chunking instead cuts at meaningful boundaries such as sentences or paragraphs, producing more coherent pieces at the cost of uneven chunk sizes.
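To make the contrast concrete, here is a minimal sketch of both strategies. It assumes whitespace-separated words as a stand-in for tokens and a naive regex sentence splitter; the function names are illustrative, not taken from any library. A sentence longer than the budget is emitted as its own oversized chunk in this sketch.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def fixed_size_chunks(text, max_words):
    """Fixed-size: cut every max_words words, ignoring sentence boundaries."""
    words = text.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def sentence_chunks(text, max_words):
    """Sentence-level: pack whole sentences until the word budget is reached."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

sample = "Chunking splits text. It keeps inputs small. Boundaries matter a lot."
print(fixed_size_chunks(sample, 5))
print(sentence_chunks(sample, 5))
```

With a budget of five words, the fixed-size version cuts through the middle of sentences, while the sentence-level version returns each sentence as its own chunk.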
Overlap between chunks helps maintain context. When chunk A ends with "The company announced new policies" and chunk B starts with "new policies will take effect", the overlap ensures we don't lose the connection between these related ideas.
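The overlap in that example can be reproduced with a sliding window. The sketch below is one possible approach, again assuming words stand in for tokens; `overlapping_chunks` and its parameters are hypothetical names chosen for illustration.

```python
def overlapping_chunks(text, chunk_size, overlap):
    """Sliding-window chunking: consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text, so we do
        # not emit a trailing chunk made only of already-covered words.
        if start + chunk_size >= len(words):
            break
    return chunks

text = "The company announced new policies will take effect next month"
for chunk in overlapping_chunks(text, chunk_size=5, overlap=2):
    print(chunk)
```

Here the first chunk ends with "new policies" and the second begins with the same two words, so the link between the announcement and its effect survives the split. Larger overlaps preserve more context but store more duplicated text.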
Effective chunking requires balancing multiple constraints: maximum size limits, semantic boundaries, and overlap requirements. The chunker must handle edge cases like individual sentences that exceed the token limit, or very short documents that don't need chunking.
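The two edge cases mentioned above can be handled as in the following sketch. It is a simplified illustration under stated assumptions: words approximate tokens, sentences are found with a naive regex, and an oversized sentence is hard-split at the word limit. The name `chunk_with_limits` is hypothetical, and overlap is left out to keep the example short.

```python
import re

def chunk_with_limits(text, max_words):
    """Pack sentences into chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")

    words = text.split()
    if len(words) <= max_words:
        # Short document: it already fits, so no chunking is needed.
        return [' '.join(words)] if words else []

    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        sent_words = sentence.split()
        if len(sent_words) > max_words:
            # Oversized sentence: flush the pending chunk, then fall back
            # to fixed-size splitting for this sentence only.
            if current:
                chunks.append(' '.join(current))
                current = []
            for i in range(0, len(sent_words), max_words):
                chunks.append(' '.join(sent_words[i:i + max_words]))
            continue
        if len(current) + len(sent_words) > max_words:
            chunks.append(' '.join(current))
            current = []
        current.extend(sent_words)
    if current:
        chunks.append(' '.join(current))
    return chunks

doc = ("Short one. This sentence is definitely much longer than "
       "the limit allows. Tiny end.")
print(chunk_with_limits(doc, max_words=4))
print(chunk_with_limits("Just a few words.", max_words=10))
```

Falling back to a hard split is only one option for oversized sentences; splitting at commas or clause boundaries would keep more meaning intact at the cost of extra logic.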
Metadata tracking is crucial: knowing where each chunk originated in the source document enables reconstruction, highlighting, and preservation of document structure in downstream processing.
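One way to track origin is to store character offsets with every chunk. The sketch below assumes word-based chunks and uses hypothetical field names (`chunk_id`, `start_char`, `end_char`); it relies on `re.finditer` because `str.split()` discards word positions.

```python
import re

def chunk_with_offsets(text, chunk_size):
    """Word-based chunks that remember their character span in the source."""
    matches = list(re.finditer(r'\S+', text))  # each match knows its position
    chunks = []
    for i in range(0, len(matches), chunk_size):
        group = matches[i:i + chunk_size]
        start, end = group[0].start(), group[-1].end()
        chunks.append({
            'chunk_id': len(chunks),
            'text': text[start:end],   # exact slice, original spacing kept
            'start_char': start,
            'end_char': end,           # exclusive, like Python slices
        })
    return chunks

source = "The quick brown fox jumps over the lazy dog in the meadow"
chunks = chunk_with_offsets(source, chunk_size=4)

# Each chunk can be located in (or highlighted within) the original text.
for c in chunks:
    assert source[c['start_char']:c['end_char']] == c['text']
    print(c['chunk_id'], c['start_char'], c['end_char'], repr(c['text']))
```

Because each chunk is an exact slice of the source, a downstream step can highlight a match in the original document or reassemble neighboring chunks in order.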
def simple_chunker(text, chunk_size):
    """Basic word-based chunking example"""
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size):
        chunk_words = words[i:i + chunk_size]
        chunk_text = ' '.join(chunk_words)

        chunks.append({
            'text': chunk_text,
            'word_count': len(chunk_words),
            'start_word': i,
            'end_word': i + len(chunk_words)
        })

    return chunks

# Example usage
text = "The quick brown fox jumps over the lazy dog in the meadow"
result = simple_chunker(text, chunk_size=4)