Class DocumentSplitters

java.lang.Object
dev.langchain4j.data.document.splitter.DocumentSplitters

public class DocumentSplitters extends Object
  • Constructor Summary

    Constructors
    Constructor
    Description
    DocumentSplitters()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static dev.langchain4j.data.document.DocumentSplitter
    recursive(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
    This is a recommended DocumentSplitter for generic text, with segment and overlap sizes measured in characters.
    static dev.langchain4j.data.document.DocumentSplitter
    recursive(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
    This is a recommended DocumentSplitter for generic text, with segment and overlap sizes measured in tokens.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • DocumentSplitters

      public DocumentSplitters()
  • Method Details

    • recursive

      public static dev.langchain4j.data.document.DocumentSplitter recursive(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, dev.langchain4j.model.Tokenizer tokenizer)
      This is a recommended DocumentSplitter for generic text, with segment and overlap sizes measured in tokens. It first splits the document into paragraphs and fits as many paragraphs as possible into a single TextSegment. Paragraphs that are too long are recursively split into lines, then sentences, then words, and finally characters, until each piece fits into a segment.
      Parameters:
      maxSegmentSizeInTokens - The maximum size of the segment, defined in tokens.
      maxOverlapSizeInTokens - The maximum size of the overlap, defined in tokens. Only full sentences are considered for the overlap.
      tokenizer - The tokenizer that is used to count tokens in the text.
      Returns:
      recursive document splitter
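      The rule that only full sentences are considered for the overlap can be pictured with a small sketch. This is not the library's implementation: the class OverlapSketch, the method overlapFrom, the word-counting WORD_COUNTER (a stand-in for a real Tokenizer), and the sentence-boundary regex are all hypothetical and chosen only for illustration.

```java
import java.util.function.ToIntFunction;

// Hypothetical illustration; none of these names exist in the library.
class OverlapSketch {

    // Assumed stand-in for a tokenizer: counts whitespace-separated words.
    static final ToIntFunction<String> WORD_COUNTER =
            text -> text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;

    /**
     * Returns the longest run of whole trailing sentences of previousSegment
     * whose token count does not exceed maxOverlapTokens.
     * Returns an empty string when even the last sentence is too large.
     */
    static String overlapFrom(String previousSegment,
                              int maxOverlapTokens,
                              ToIntFunction<String> counter) {
        // Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
        String[] sentences = previousSegment.trim().split("(?<=[.!?])\\s+");
        String overlap = "";
        // Walk backwards, prepending one whole sentence at a time.
        for (int i = sentences.length - 1; i >= 0; i--) {
            String candidate = overlap.isEmpty()
                    ? sentences[i]
                    : sentences[i] + " " + overlap;
            if (counter.applyAsInt(candidate) > maxOverlapTokens) {
                break; // adding this sentence would exceed the limit
            }
            overlap = candidate;
        }
        return overlap;
    }
}
```

      Under these assumptions, a sentence is either carried into the next segment whole or not at all, so the overlap can be shorter than maxOverlapSizeInTokens and may be empty.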
    • recursive

      public static dev.langchain4j.data.document.DocumentSplitter recursive(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
      This is a recommended DocumentSplitter for generic text, with segment and overlap sizes measured in characters. It first splits the document into paragraphs and fits as many paragraphs as possible into a single TextSegment. Paragraphs that are too long are recursively split into lines, then sentences, then words, and finally characters, until each piece fits into a segment.
      Parameters:
      maxSegmentSizeInChars - The maximum size of the segment, defined in characters.
      maxOverlapSizeInChars - The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.
      Returns:
      recursive document splitter
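      The paragraph, line, sentence, word, character fallback described above can be sketched in plain Java. This is a simplified illustration, not the library's code: the class RecursiveSplitSketch, its separator regexes, and its packing rules are assumptions, and overlap handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the library.
class RecursiveSplitSketch {

    // Assumed separators, tried in order: paragraphs, lines, sentences, words.
    private static final String[] SEPARATORS =
            {"\\n\\s*\\n", "\\n", "(?<=[.!?])\\s+", "\\s+"};
    // Text used to re-join pieces that were split at the same level.
    private static final String[] JOINERS = {"\n\n", "\n", " ", " "};

    static List<String> split(String text, int maxChars) {
        List<String> out = new ArrayList<>();
        split(text.trim(), maxChars, 0, out);
        return out;
    }

    private static void split(String text, int maxChars, int level, List<String> out) {
        if (text.isEmpty()) {
            return;
        }
        if (text.length() <= maxChars) {
            out.add(text); // already fits into one segment
            return;
        }
        if (level == SEPARATORS.length) {
            // Last resort: cut into fixed-size character chunks.
            for (int i = 0; i < text.length(); i += maxChars) {
                out.add(text.substring(i, Math.min(text.length(), i + maxChars)));
            }
            return;
        }
        StringBuilder current = new StringBuilder();
        for (String part : text.split(SEPARATORS[level])) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;
            }
            if (part.length() > maxChars) {
                // Emit what has been packed so far, then split the oversized
                // part with the next, finer separator.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
                split(part, maxChars, level + 1, out);
                continue;
            }
            int joinedLength = current.length() == 0
                    ? part.length()
                    : current.length() + JOINERS[level].length() + part.length();
            if (joinedLength > maxChars) {
                // The part does not fit next to the packed pieces: start a new segment.
                out.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(JOINERS[level]);
            }
            current.append(part);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
    }
}
```

      In this sketch, calling RecursiveSplitSketch.split(text, 30) keeps short paragraphs together in one segment and breaks up only the paragraph that exceeds 30 characters, so no returned segment is longer than the limit.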