public final class GptBytePairEncodingParams
extends java.lang.Object
This library supports the encodings listed in EncodingType out of the box.
To use a custom encoding instead, pass its parameters to the library with this class.
Use EncodingRegistry.registerGptBytePairEncoding(GptBytePairEncodingParams) to register your custom encoding
with the registry, so that you can use it alongside the predefined ones.
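As a minimal sketch of what a caller might prepare before registration, the following builds the four constructor arguments using only the standard library. The class name CustomEncodingArguments, the encoding name, the toy vocabulary, and the token ids are all hypothetical. The assumption that special token ids should sit outside the range of the regular ids is not stated on this page. The library calls appear only as comments, because they need the library on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; the class name, encoding name, and token ids are illustrative only.
final class CustomEncodingArguments {

    // Must not collide with the name of any encoding that is already registered.
    static final String NAME = "my-toy-encoding";

    // Toy pattern: runs of letters, runs of digits, or any single other character.
    static final Pattern PATTERN = Pattern.compile("\\p{L}+|\\p{N}+|.", Pattern.DOTALL);

    // Maps the UTF-8 bytes of each token to its id (toy vocabulary, not a usable one).
    static Map<byte[], Integer> buildEncoder() {
        Map<byte[], Integer> encoder = new HashMap<>();
        String[] vocabulary = {"h", "i", "hi"};
        for (int id = 0; id < vocabulary.length; id++) {
            encoder.put(vocabulary[id].getBytes(StandardCharsets.UTF_8), id);
        }
        return encoder;
    }

    // Maps special tokens to ids; assumed here to lie outside the ids used above.
    static Map<String, Integer> buildSpecialTokensEncoder() {
        Map<String, Integer> specialTokens = new HashMap<>();
        specialTokens.put("<|endoftext|>", 3);
        return specialTokens;
    }

    public static void main(String[] args) {
        Map<byte[], Integer> encoder = buildEncoder();
        Map<String, Integer> specialTokens = buildSpecialTokensEncoder();
        System.out.println(encoder.size() + " tokens, " + specialTokens.size() + " special token");

        // With the library on the classpath, the values would then be handed over
        // through the constructor and registration method documented on this page
        // ('registry' stands for an EncodingRegistry instance obtained elsewhere):
        //
        // GptBytePairEncodingParams params =
        //         new GptBytePairEncodingParams(NAME, PATTERN, encoder, specialTokens);
        // registry.registerGptBytePairEncoding(params);
    }
}
```

Keeping the argument construction in small static methods makes it possible to check the vocabulary on its own before it is handed to the registry.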
The encoding parameters are name, pattern, encoder, and specialTokensEncoder; each is described with the constructor below.
| Constructor and Description |
|---|
| GptBytePairEncodingParams(java.lang.String name, java.util.regex.Pattern pattern, java.util.Map<byte[],java.lang.Integer> encoder, java.util.Map<java.lang.String,java.lang.Integer> specialTokensEncoder) Creates a new instance of GptBytePairEncodingParams. |
| Modifier and Type | Method and Description |
|---|---|
| java.util.Map<byte[],java.lang.Integer> | getEncoder() |
| java.lang.String | getName() |
| java.util.regex.Pattern | getPattern() |
| java.util.Map<java.lang.String,java.lang.Integer> | getSpecialTokensEncoder() |
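The encoder map is keyed by byte[]. Java arrays inherit identity-based equals and hashCode from Object, so a content-equal array is not found by a hash-based lookup. The sketch below shows the effect on a plain HashMap and a content-based scan as a workaround. The class ByteArrayKeyLookup and its method are hypothetical. Which Map implementation getEncoder() actually returns is not stated here, so treat this as a general caution, not a description of the library's behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating content-based lookup in a Map<byte[], Integer>.
final class ByteArrayKeyLookup {

    // Linear scan that compares key contents; returns null if no key matches.
    static Integer findByContent(Map<byte[], Integer> encoder, byte[] token) {
        for (Map.Entry<byte[], Integer> entry : encoder.entrySet()) {
            if (Arrays.equals(entry.getKey(), token)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<byte[], Integer> encoder = new HashMap<>();
        encoder.put("hi".getBytes(StandardCharsets.UTF_8), 2);

        byte[] probe = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(encoder.get(probe));            // null: different array instance
        System.out.println(findByContent(encoder, probe)); // 2
    }
}
```

The scan is linear in the size of the map, so it suits inspection and tests more than hot paths.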
public GptBytePairEncodingParams(java.lang.String name,
java.util.regex.Pattern pattern,
java.util.Map<byte[],java.lang.Integer> encoder,
java.util.Map<java.lang.String,java.lang.Integer> specialTokensEncoder)
Creates a new instance of GptBytePairEncodingParams.
Parameters:
name - the name of the encoding. This is used to identify the encoding and must be unique.
pattern - the pattern that is used to split the input text into tokens.
encoder - the encoder that maps the tokens to their ids.
specialTokensEncoder - the encoder that maps the special tokens to their ids.
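To illustrate the role of the pattern parameter, the sketch below splits a string into pieces with java.util.regex. The pattern is a simplified stand-in chosen for readability, not one of the library's predefined patterns. PatternSplitDemo is a hypothetical class. How the library applies the pattern internally is assumed to resemble repeated Matcher.find() calls, which this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of splitting text into pieces with a regular expression.
final class PatternSplitDemo {

    // Simplified stand-in: runs of letters, digits, whitespace, or other characters.
    static final Pattern PATTERN =
            Pattern.compile("\\p{L}+|\\p{N}+|\\s+|[^\\s\\p{L}\\p{N}]+");

    // Collects every successive match of PATTERN in the given text.
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Pieces: "Hello", " ", "world", " ", "42", "!"
        System.out.println(split("Hello world 42!"));
    }
}
```

Because the four alternatives together cover every character, concatenating the pieces reproduces the input. A custom pattern that leaves some characters unmatched would silently drop them in this sketch.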