Class Icu4jEncodingDetector

java.lang.Object
org.apache.tika.parser.txt.Icu4jEncodingDetector
All Implemented Interfaces:
Serializable, org.apache.tika.detect.EncodingDetector

public class Icu4jEncodingDetector extends Object implements org.apache.tika.detect.EncodingDetector
  • Constructor Details

    • Icu4jEncodingDetector

      public Icu4jEncodingDetector()
  • Method Details

    • detect

      public Charset detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
      Specified by:
      detect in interface org.apache.tika.detect.EncodingDetector
      Throws:
      IOException
    • isStripMarkup

      public boolean isStripMarkup()
    • setStripMarkup

      @Field public void setStripMarkup(boolean stripMarkup)
      Sets whether to attempt to strip HTML-like markup from the stream before passing it to the underlying detector.

      The underlying detector may still apply its own stripping if this is set to false.

      Parameters:
      stripMarkup - true to attempt to strip markup before passing the stream to the underlying detector, false otherwise
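
To make the idea behind setStripMarkup concrete, here is a minimal sketch of what removing HTML-like markup from a byte buffer could look like before the remaining bytes are handed to a charset detector. It is not the code this class uses. The class name StripMarkupSketch and the method stripMarkup are hypothetical, and the rule applied (drop everything between '<' and '>') is a simplifying assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika or ICU4J.
final class StripMarkupSketch {

    // Drops every byte from '<' up to and including the next '>'.
    // Assumption: tags are ASCII-delimited, which holds for ASCII-compatible encodings.
    static byte[] stripMarkup(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        boolean inTag = false;
        for (byte b : input) {
            if (b == '<') {
                inTag = true;           // start skipping
            } else if (b == '>' && inTag) {
                inTag = false;          // end of the tag, resume copying
            } else if (!inTag) {
                out.write(b);           // plain content byte
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "<html><body>Hello, world</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] text = stripMarkup(raw);
        System.out.println(new String(text, StandardCharsets.US_ASCII));
    }
}
```

Stripping tags first leaves the detector with bytes that come mostly from the document's text rather than from ASCII-only tag names, which is the usual motivation for a filter of this kind.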
    • getMarkLimit

      public int getMarkLimit()
    • setMarkLimit

      @Field public void setMarkLimit(int markLimit)
      How far into the stream to read, in bytes, for charset detection. The default is 12000.
      Parameters:
      markLimit - the maximum number of bytes to read from the stream for charset detection
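
A mark limit of this kind is usually applied through the standard java.io mark/reset contract: the caller marks the stream, reads at most that many bytes, and resets so later consumers see the stream from the start. The sketch below shows that pattern with JDK classes only. It assumes the limit is used this way and is not taken from this class's source; MarkLimitSketch and peek are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a bounded, non-destructive read; not Tika code.
final class MarkLimitSketch {

    // Reads at most markLimit bytes from the stream, then rewinds it.
    static byte[] peek(InputStream input, int markLimit) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(markLimit);
        try {
            // Returns fewer bytes only if the stream ends first (Java 11+).
            return input.readNBytes(markLimit);
        } finally {
            input.reset(); // leave the stream positioned where it was
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "charset detection sample".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(data);
        byte[] head = peek(in, 7);
        System.out.println(new String(head, StandardCharsets.UTF_8)); // prints "charset"
    }
}
```

A larger limit gives a detector more evidence at the cost of more buffering. A stream that does not support mark/reset would first need wrapping, for example in a java.io.BufferedInputStream.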
    • getMarkLimt

      public int getMarkLimt()
    • setIgnoreCharsets

      @Field public void setIgnoreCharsets(List<String> charsetsToIgnore)
    • getIgnoreCharsets

      public List<String> getIgnoreCharsets()