Class HierarchicalDocumentSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
- All Implemented Interfaces:
dev.langchain4j.data.document.DocumentSplitter
- Direct Known Subclasses:
DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter
public abstract class HierarchicalDocumentSplitter
extends Object
implements dev.langchain4j.data.document.DocumentSplitter
Base class for hierarchical document splitters.
Implements DocumentSplitter and provides the machinery for sub-splitting a document
when a single segment is too long.
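The sub-splitting idea can be illustrated with a small standalone sketch. It does not use langchain4j and is not the library's implementation: SketchSplitter, SketchByLine, SketchByWord and splitText are hypothetical names, sizes are counted in characters only, and overlap is ignored. It assumes parts are packed greedily into a segment, and that a part too large on its own is handed to a finer-grained sub-splitter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the langchain4j class. It only mirrors the shape
// of the three hooks: split, joinDelimiter and defaultSubSplitter.
abstract class SketchSplitter {
    protected final int maxSegmentSize; // counted in characters in this sketch

    protected SketchSplitter(int maxSegmentSize) {
        this.maxSegmentSize = maxSegmentSize;
    }

    protected abstract String[] split(String text);

    protected abstract String joinDelimiter();

    // Returns null when there is no finer-grained level to fall back to.
    protected abstract SketchSplitter defaultSubSplitter();

    public List<String> splitText(String text) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : split(text)) {
            if (part.isEmpty()) {
                continue;
            }
            int joinedLength = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter().length() + part.length();
            if (joinedLength <= maxSegmentSize) {
                // The part still fits: re-join it with the delimiter.
                if (current.length() > 0) {
                    current.append(joinDelimiter());
                }
                current.append(part);
                continue;
            }
            // The part does not fit: close the segment built so far.
            if (current.length() > 0) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (part.length() <= maxSegmentSize) {
                current.append(part); // starts the next segment
            } else {
                // A single part is too long: delegate to the sub-splitter.
                SketchSplitter sub = defaultSubSplitter();
                if (sub == null) {
                    throw new IllegalStateException("Part cannot be split further: " + part);
                }
                segments.addAll(sub.splitText(part));
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}

final class SketchByLine extends SketchSplitter {
    SketchByLine(int maxSegmentSize) { super(maxSegmentSize); }
    @Override protected String[] split(String text) { return text.split("\n"); }
    @Override protected String joinDelimiter() { return "\n"; }
    @Override protected SketchSplitter defaultSubSplitter() { return new SketchByWord(maxSegmentSize); }
}

final class SketchByWord extends SketchSplitter {
    SketchByWord(int maxSegmentSize) { super(maxSegmentSize); }
    @Override protected String[] split(String text) { return text.split(" "); }
    @Override protected String joinDelimiter() { return " "; }
    @Override protected SketchSplitter defaultSubSplitter() { return null; }
}

final class SketchSplitterDemo {
    public static void main(String[] args) {
        // The third line exceeds 10 characters, so it is split by word instead.
        System.out.println(new SketchByLine(10).splitText("ab cd\nef\nabcdefgh ijklmnop"));
    }
}
```

In this sketch the first two lines fit together in one segment, while the long third line is passed down to the word-level splitter. The real class additionally handles overlap, token-based sizes and segment metadata, none of which are modelled here.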
-
Field Summary
Fields
protected final int maxOverlapSize
protected final int maxSegmentSize
protected final dev.langchain4j.data.document.DocumentSplitter subSplitter
protected final dev.langchain4j.model.Tokenizer tokenizer
-
Constructor Summary
Constructors
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
  Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
  Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
  Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer, dev.langchain4j.data.document.DocumentSplitter subSplitter)
  Creates a new instance of HierarchicalDocumentSplitter.
-
Method Summary
protected abstract dev.langchain4j.data.document.DocumentSplitter defaultSubSplitter()
  The default sub-splitter to use when a single segment is too long.
protected abstract String joinDelimiter()
  Delimiter string to use to re-join the parts.
List<dev.langchain4j.data.segment.TextSegment> split(dev.langchain4j.data.document.Document document)
protected abstract String[] split(String text)
  Splits the provided text into parts.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll
-
Field Details
-
maxSegmentSize
protected final int maxSegmentSize
-
maxOverlapSize
protected final int maxOverlapSize
-
tokenizer
protected final dev.langchain4j.model.Tokenizer tokenizer
-
subSplitter
protected final dev.langchain4j.data.document.DocumentSplitter subSplitter
-
-
Constructor Details
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
subSplitter - The sub-splitter to use when a single segment is too long.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenizer - The tokenizer to use to estimate the number of tokens in a text.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer, dev.langchain4j.data.document.DocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenizer - The tokenizer to use to estimate the number of tokens in a text.
subSplitter - The sub-splitter to use when a single segment is too long.
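The constructors come in two families, one sized in characters and one sized in tokens, yet the class exposes a single maxSegmentSize field. One plausible way to serve both with one limit is to choose the measuring function by whether a tokenizer was supplied. The standalone sketch below illustrates that idea only. SizeEstimatorSketch and its methods are hypothetical, a ToIntFunction stands in for dev.langchain4j.model.Tokenizer, and the word counter is a crude placeholder for real token estimation. This page does not confirm how the library measures size internally.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper, NOT part of langchain4j.
final class SizeEstimatorSketch {
    // Stand-in for a tokenizer; null means "measure in characters".
    private final ToIntFunction<String> tokenCounter;

    SizeEstimatorSketch(ToIntFunction<String> tokenCounter) {
        this.tokenCounter = tokenCounter;
    }

    // Size of the text in the unit implied by the configuration.
    int estimateSize(String text) {
        return tokenCounter == null ? text.length() : tokenCounter.applyAsInt(text);
    }

    // True when the text fits into a segment of the given maximum size.
    boolean fits(String text, int maxSegmentSize) {
        return estimateSize(text) <= maxSegmentSize;
    }

    // Crude placeholder for token estimation: counts whitespace-separated words.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

final class SizeEstimatorDemo {
    public static void main(String[] args) {
        SizeEstimatorSketch byChars = new SizeEstimatorSketch(null);
        SizeEstimatorSketch byWords = new SizeEstimatorSketch(SizeEstimatorSketch::countWords);
        System.out.println(byChars.fits("hello world", 5)); // 11 characters against a limit of 5
        System.out.println(byWords.fits("hello world", 5)); // 2 words against a limit of 5
    }
}
```

The same numeric limit means very different things in the two units, so a value tuned for the character-based constructors should not be reused unchanged with the token-based ones.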
-
-
Method Details
-
split
protected abstract String[] split(String text)
Splits the provided text into parts. Implementation API.
- Parameters:
text - The text to be split.
- Returns:
An array of parts.
-
joinDelimiter
protected abstract String joinDelimiter()
Delimiter string to use to re-join the parts.
- Returns:
The delimiter.
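The split and joinDelimiter hooks work as a pair: one cuts the text into parts, the other supplies the string used to glue neighbouring parts back together inside a segment. The standalone sketch below shows such a pair for paragraph-like splitting. ParagraphHooksSketch is a hypothetical class, and its regular expression is an assumption chosen for illustration, not the pattern used by DocumentByParagraphSplitter.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a split/joinDelimiter pair, NOT langchain4j code.
final class ParagraphHooksSketch {
    // A newline, optional whitespace, then another newline marks a paragraph break.
    private static final Pattern PARAGRAPH_BREAK = Pattern.compile("\\n\\s*\\n");

    // Counterpart of split(String text): cuts the text at paragraph breaks.
    static String[] split(String text) {
        return PARAGRAPH_BREAK.split(text);
    }

    // Counterpart of joinDelimiter(): a canonical paragraph break.
    static String joinDelimiter() {
        return "\n\n";
    }
}

final class ParagraphHooksDemo {
    public static void main(String[] args) {
        String text = "a\nb\n\nc\n \nd";
        String[] parts = ParagraphHooksSketch.split(text);
        // Re-joining normalizes the irregular break between "c" and "d".
        System.out.println(String.join(ParagraphHooksSketch.joinDelimiter(), parts));
    }
}
```

Re-joining is not guaranteed to reproduce the input exactly: in this sketch any run of whitespace between paragraphs is replaced by the canonical delimiter.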
-
defaultSubSplitter
protected abstract dev.langchain4j.data.document.DocumentSplitter defaultSubSplitter()
The default sub-splitter to use when a single segment is too long.
- Returns:
The default sub-splitter.
-
split
public List<dev.langchain4j.data.segment.TextSegment> split(dev.langchain4j.data.document.Document document)
- Specified by:
split in interface dev.langchain4j.data.document.DocumentSplitter
-