Package org.apache.tika.parser.txt
Class Icu4jEncodingDetector
java.lang.Object
org.apache.tika.parser.txt.Icu4jEncodingDetector
- All Implemented Interfaces:
Serializable, org.apache.tika.detect.EncodingDetector
public class Icu4jEncodingDetector
extends Object
implements org.apache.tika.detect.EncodingDetector
Constructor Summary
Constructors
Icu4jEncodingDetector()
Method Summary
Modifier and Type    Method
                     Description
Charset              detect(InputStream input, org.apache.tika.metadata.Metadata metadata)
                     getIgnoreCharsets
int                  getMarkLimit()
int                  getMarkLimt()
boolean              isStripMarkup()
void                 setIgnoreCharsets(List<String> charsetsToIgnore)
void                 setMarkLimit(int markLimit)
                     How far into the stream to read for charset detection.
void                 setStripMarkup(boolean stripMarkup)
                     Whether or not to attempt to strip html-ish markup from the stream before sending it to the underlying detector.
-
Constructor Details
-
Icu4jEncodingDetector
public Icu4jEncodingDetector()
-
-
Method Details
-
detect
public Charset detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
- Specified by:
detect in interface org.apache.tika.detect.EncodingDetector
- Throws:
IOException
-
isStripMarkup
public boolean isStripMarkup() -
setStripMarkup
@Field public void setStripMarkup(boolean stripMarkup)
Whether or not to attempt to strip html-ish markup from the stream before sending it to the underlying detector. The underlying detector may still apply its own stripping if this is set to false.
- Parameters:
stripMarkup - whether or not to attempt to strip markup before sending the stream to the underlying detector
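To illustrate what stripping html-ish markup before detection can look like, here is a minimal sketch that uses only the JDK. It is not Tika's or ICU4J's actual implementation; the class MarkupStripSketch and the method stripTags are hypothetical names introduced for this example. The assumed motivation is that tag names and attributes are plain ASCII and can dilute the byte statistics a charset detector relies on.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for illustration only; not part of Tika. */
final class MarkupStripSketch {

    /**
     * Returns a copy of the input with every '<' ... '>' run removed.
     * Works on raw bytes because the charset is not known yet, so it only
     * makes sense for encodings where '<' and '>' are single ASCII bytes.
     * An unclosed '<' causes the rest of the input to be dropped.
     */
    static byte[] stripTags(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        boolean inTag = false;
        for (byte b : raw) {
            if (b == '<') {
                inTag = true;           // start skipping at the tag opener
            } else if (b == '>' && inTag) {
                inTag = false;          // tag closed, resume copying
            } else if (!inTag) {
                out.write(b);           // ordinary content byte
            }
        }
        return out.toByteArray();
    }
}
```

With the input bytes of `<p>café</p>` encoded as ISO-8859-1, only the four bytes of `café` are kept, so a detector would see content bytes rather than markup.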
-
getMarkLimit
public int getMarkLimit() -
setMarkLimit
@Field public void setMarkLimit(int markLimit)
How far into the stream to read for charset detection. Default is 12000.
- Parameters:
markLimit - how far into the stream to read for charset detection
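The following sketch shows one way a mark limit can bound how much of a stream is inspected, using the standard InputStream mark/reset mechanism. It assumes the limit is counted in bytes and is not taken from Tika's source; MarkLimitSketch and peek are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Tika. */
final class MarkLimitSketch {

    /**
     * Reads at most markLimit bytes from the stream for inspection, then
     * rewinds the stream so the caller can still read it from the start.
     * The stream must support mark/reset (for example BufferedInputStream).
     */
    static byte[] peek(InputStream input, int markLimit) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(markLimit);
        try {
            byte[] buffer = new byte[markLimit];
            int total = 0;
            int n;
            // Keep reading until the limit is reached or the stream ends.
            while (total < markLimit
                    && (n = input.read(buffer, total, markLimit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            input.reset(); // restore the position recorded by mark()
        }
    }
}
```

A larger limit gives a detector more evidence at the cost of buffering more data; a smaller one is cheaper but may miss distinguishing bytes that appear later in the stream.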
-
getMarkLimt
public int getMarkLimt() -
setIgnoreCharsets
public void setIgnoreCharsets(List<String> charsetsToIgnore) -
getIgnoreCharsets
-