Class DocumentByWordSplitter
- All Implemented Interfaces:
dev.langchain4j.data.document.DocumentSplitter
Splits the provided Document into words and attempts to fit as many words as possible
into a single TextSegment, adhering to the limit set by maxSegmentSize.
The maxSegmentSize can be defined in terms of characters (default) or tokens.
For token-based limit, a Tokenizer must be provided.
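The difference between the two kinds of limit can be illustrated with a small sketch. It does not use the library's Tokenizer; `CHARS`, `NAIVE_TOKENS`, and `fits` are hypothetical names, and the whitespace-based token count is only a stand-in (real tokenizers usually produce sub-word tokens and report different counts).

```java
import java.util.function.ToIntFunction;

class SegmentSizeSketch {

    // Character-based measure: the size of a segment is its length in chars.
    static final ToIntFunction<String> CHARS = String::length;

    // Hypothetical stand-in for a tokenizer: counts whitespace-separated tokens.
    static final ToIntFunction<String> NAIVE_TOKENS =
            s -> s.trim().isEmpty() ? 0 : s.trim().split("\\s+").length;

    // Returns true if the candidate segment stays within the limit under the given measure.
    static boolean fits(String candidate, int maxSegmentSize, ToIntFunction<String> sizeOf) {
        return sizeOf.applyAsInt(candidate) <= maxSegmentSize;
    }

    public static void main(String[] args) {
        String candidate = "hello world";
        System.out.println(fits(candidate, 5, CHARS));        // 11 chars against a limit of 5
        System.out.println(fits(candidate, 5, NAIVE_TOKENS)); // 2 tokens against a limit of 5
    }
}
```

The same numeric limit can therefore accept or reject the same text depending on the unit in which it is measured.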
Word boundaries are detected by one or more whitespace characters; any additional whitespace before or after is ignored. For example, " ", "  ", and "\n" are all valid word separators.
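A minimal sketch of this kind of boundary detection, assuming a regular-expression split on runs of whitespace. `splitIntoWords` is a hypothetical helper, not a method of this class, and the actual implementation may differ in detail.

```java
import java.util.Arrays;

class WordBoundarySketch {

    // Hypothetical helper: treats any run of whitespace as a single word separator.
    static String[] splitIntoWords(String text) {
        // trim() avoids an empty leading element when the text starts with whitespace
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = splitIntoWords("one two  three\nfour \t five");
        System.out.println(Arrays.toString(words)); // [one, two, three, four, five]
    }
}
```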
If multiple words fit within maxSegmentSize, they are joined together using a space (" ").
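The fitting and joining described so far can be sketched as a greedy loop. This is an illustration under stated assumptions, not the library's code: the limit is measured in characters, overlap is ignored, no single word exceeds the limit, and `pack` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyWordPackingSketch {

    // Hypothetical helper: packs words into segments of at most maxSegmentSize characters.
    // Assumes every individual word fits within maxSegmentSize.
    static List<String> pack(String text, int maxSegmentSize) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // +1 accounts for the joining space when the segment already has content
            int needed = current.length() == 0
                    ? word.length()
                    : current.length() + 1 + word.length();
            if (needed > maxSegmentSize && current.length() > 0) {
                segments.add(current.toString()); // close the full segment
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            segments.add(current.toString()); // flush the last segment
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(pack("the quick brown fox jumps", 10));
        // [the quick, brown fox, jumps]
    }
}
```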
Although this should not happen, if a single word is too long and exceeds maxSegmentSize,
the subSplitter (DocumentByCharacterSplitter by default) is used to split it into smaller parts and
place them into multiple segments.
Such segments contain only the parts of the split long word.
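As a rough illustration of this fallback, the sketch below cuts an over-long word into fixed-width pieces. It is a simplified stand-in for a character-level sub-splitter, not the behavior of DocumentByCharacterSplitter itself; `splitLongWord` is a hypothetical helper and the limit is assumed to be in characters.

```java
import java.util.ArrayList;
import java.util.List;

class LongWordFallbackSketch {

    // Hypothetical helper: cuts a word into consecutive pieces of at most maxSegmentSize chars.
    static List<String> splitLongWord(String word, int maxSegmentSize) {
        if (maxSegmentSize <= 0) {
            throw new IllegalArgumentException("maxSegmentSize must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < word.length(); start += maxSegmentSize) {
            int end = Math.min(word.length(), start + maxSegmentSize);
            parts.add(word.substring(start, end)); // each part holds only a piece of the word
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitLongWord("abcdefghij", 4)); // [abcd, efgh, ij]
    }
}
```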
Each TextSegment inherits all metadata from the Document and includes an "index" metadata key
representing its position within the document (starting from 0).
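The metadata rule can be pictured with plain maps. This sketch does not use the library's Metadata type; `metadataFor` is a hypothetical helper, and storing the index as a string is an assumption made only for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SegmentMetadataSketch {

    // Hypothetical helper: builds one metadata map per segment of a document.
    static List<Map<String, String>> metadataFor(Map<String, String> documentMetadata,
                                                 int segmentCount) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            // Copy, so each segment inherits all document metadata independently
            Map<String, String> copy = new LinkedHashMap<>(documentMetadata);
            copy.put("index", String.valueOf(i)); // position within the document, starting from 0
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("source", "a.txt");
        System.out.println(metadataFor(doc, 2));
        // [{source=a.txt, index=0}, {source=a.txt, index=1}]
    }
}
```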
-
Field Summary
Fields inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
maxOverlapSize, maxSegmentSize, subSplitter, tokenizer
-
Constructor Summary
Constructors
DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, dev.langchain4j.data.document.DocumentSplitter subSplitter)
DocumentByWordSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
DocumentByWordSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer, dev.langchain4j.data.document.DocumentSplitter subSplitter)
-
Method Summary
Modifier and Type, Method, Description
protected dev.langchain4j.data.document.DocumentSplitter defaultSubSplitter(): The default sub-splitter to use when a single segment is too long.
String joinDelimiter(): Delimiter string to use to re-join the parts.
String[] split(String text): Splits the provided text into parts.
Methods inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
split
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll
-
Constructor Details
-
DocumentByWordSplitter
public DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
-
DocumentByWordSplitter
public DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, dev.langchain4j.data.document.DocumentSplitter subSplitter)
-
DocumentByWordSplitter
public DocumentByWordSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
-
DocumentByWordSplitter
public DocumentByWordSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer, dev.langchain4j.data.document.DocumentSplitter subSplitter)
-
-
Method Details
-
split
public String[] split(String text)
Description copied from class: HierarchicalDocumentSplitter
Splits the provided text into parts. Implementation API.
- Specified by:
split in class HierarchicalDocumentSplitter
- Parameters:
text - The text to be split.
- Returns:
An array of parts.
-
joinDelimiter
public String joinDelimiter()
Description copied from class: HierarchicalDocumentSplitter
Delimiter string to use to re-join the parts.
- Specified by:
joinDelimiter in class HierarchicalDocumentSplitter
- Returns:
The delimiter.
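Because parts are re-joined with this delimiter rather than with the whitespace originally found between them, a split followed by a join normalizes spacing. A minimal sketch, assuming the delimiter is a single space as the class description states; `splitAndRejoin` is a hypothetical helper.

```java
class RejoinSketch {

    // Assumed delimiter: a single space, per the class description.
    static final String JOIN_DELIMITER = " ";

    // Hypothetical helper: splits on whitespace runs, then re-joins with the delimiter.
    static String splitAndRejoin(String text) {
        String[] parts = text.trim().split("\\s+");
        return String.join(JOIN_DELIMITER, parts);
    }

    public static void main(String[] args) {
        System.out.println(splitAndRejoin("first\nsecond   third")); // first second third
    }
}
```

Line breaks and repeated spaces in the input do not survive the round trip in this sketch.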
-
defaultSubSplitter
protected dev.langchain4j.data.document.DocumentSplitter defaultSubSplitter()
Description copied from class: HierarchicalDocumentSplitter
The default sub-splitter to use when a single segment is too long.
- Specified by:
defaultSubSplitter in class HierarchicalDocumentSplitter
- Returns:
The default sub-splitter.
-