public interface Encoding
| Modifier and Type | Method and Description |
|---|---|
| int | countTokens(java.lang.String text): Encodes the given text into a list of token ids and returns the amount of tokens. |
| int | countTokensOrdinary(java.lang.String text): Encodes the given text into a list of token ids and returns the amount of tokens. |
| java.lang.String | decode(java.util.List<java.lang.Integer> tokens): Decodes the given list of token ids into a text. |
| byte[] | decodeBytes(java.util.List<java.lang.Integer> tokens): Decodes the given list of token ids into a byte array. |
| java.util.List<java.lang.Integer> | encode(java.lang.String text): Encodes the given text into a list of token ids. |
| EncodingResult | encode(java.lang.String text, int maxTokens): Encodes the given text into a list of token ids. |
| java.util.List<java.lang.Integer> | encodeOrdinary(java.lang.String text): Encodes the given text into a list of token ids, ignoring special tokens. |
| EncodingResult | encodeOrdinary(java.lang.String text, int maxTokens): Encodes the given text into a list of token ids, ignoring special tokens. |
| java.lang.String | getName(): Returns the name of this encoding. |
java.util.List<java.lang.Integer> encode(java.lang.String text)
Special tokens are artificial tokens used to unlock capabilities from a model,
such as fill-in-the-middle. There is currently no support for parsing special tokens
in a text, so if the text contains special tokens, this method will throw an
UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String).
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encode("hello world");
// returns [15339, 1917]
encoding.encode("hello <|endoftext|> world");
// raises an UnsupportedOperationException
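For text of unknown origin, a caller might try the strict method first and fall back to the ordinary variant when special tokens are present. The sketch below is not part of this interface: SpecialTokenFallback and encodeLenient are hypothetical names, and the two encoders are passed in as plain functions so that the example runs without the library.

```java
import java.util.List;
import java.util.function.Function;

final class SpecialTokenFallback {

    // Hypothetical helper: tries the strict encoder and, if it rejects the text
    // with an UnsupportedOperationException, encodes the text as ordinary text.
    static List<Integer> encodeLenient(String text,
                                       Function<String, List<Integer>> strict,
                                       Function<String, List<Integer>> ordinary) {
        try {
            return strict.apply(text);
        } catch (UnsupportedOperationException e) {
            return ordinary.apply(text);
        }
    }
}
```

Assuming the signatures listed in the method summary, a call with a real Encoding instance would presumably look like encodeLenient(text, encoding::encode, encoding::encodeOrdinary).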
Parameters:
text - the text to encode
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens which are not supported for now
EncodingResult encode(java.lang.String text, int maxTokens)
Special tokens are artificial tokens used to unlock capabilities from a model,
such as fill-in-the-middle. There is currently no support for parsing special tokens
in a text, so if the text contains special tokens, this method will throw an
UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String, int).
This method truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of those tokens, the first two tokens of that character are dropped as well. Therefore, the actual number of tokens may be less than the given maxTokens parameter.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encode("hello world", 100);
// returns [15339, 1917]
encoding.encode("hello <|endoftext|> world", 100);
// raises an UnsupportedOperationException
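The truncation rule described above can be illustrated with a toy model. This is only a sketch under a loud assumption: it uses a hypothetical tokenizer that emits one token per UTF-8 byte, which is not how the library's encodings work, and TruncationSketch, toyEncode and truncate are made-up names. It shows why the result can be shorter than maxTokens when the cut would land inside a multi-token character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TruncationSketch {

    // Hypothetical toy tokenizer: one token per UTF-8 byte (id = unsigned byte value).
    static List<Integer> toyEncode(String text) {
        List<Integer> tokens = new ArrayList<>();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(b & 0xFF);
        }
        return tokens;
    }

    // Keeps at most maxTokens tokens and never ends in the middle of a character.
    static List<Integer> truncate(List<Integer> tokens, int maxTokens) {
        if (tokens.size() <= maxTokens) {
            return new ArrayList<>(tokens);
        }
        int end = Math.max(maxTokens, 0);
        // If the first dropped token is a UTF-8 continuation byte (10xxxxxx),
        // the cut splits a character: back up to where that character starts.
        while (end > 0 && (tokens.get(end) & 0xC0) == 0x80) {
            end--;
        }
        return new ArrayList<>(tokens.subList(0, end));
    }

    public static void main(String[] args) {
        // "a" is 1 token and the euro sign is 3 tokens in this toy model.
        List<Integer> tokens = toyEncode("a\u20AC");
        System.out.println(tokens);              // prints [97, 226, 130, 172]
        System.out.println(truncate(tokens, 3)); // prints [97]
    }
}
```

With maxTokens set to 3, the last token of the euro sign would be cut off, so its first two tokens are dropped too and only one token remains.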
Parameters:
text - the text to encode
maxTokens - the maximum number of tokens to encode
Returns:
EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens which are not supported for now
java.util.List<java.lang.Integer> encodeOrdinary(java.lang.String text)
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encodeOrdinary("hello world");
// returns [15339, 1917]
encoding.encodeOrdinary("hello <|endoftext|> world");
// returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
text - the text to encode
EncodingResult encodeOrdinary(java.lang.String text, int maxTokens)
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
It truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of those tokens, the first two tokens of that character are dropped as well. Therefore, the actual number of tokens may be less than the given maxTokens parameter.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encodeOrdinary("hello world", 100);
// returns [15339, 1917]
encoding.encodeOrdinary("hello <|endoftext|> world", 100);
// returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
text - the text to encode
maxTokens - the maximum number of tokens to encode
Returns:
EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
int countTokens(java.lang.String text)
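Encodes the given text into a list of token ids and returns the amount of tokens.
This is a convenience method for encode(String), if all you want is to
know the amount of tokens. It is not more performant than encode(String),
so prefer to use encode(String) if you actually need the tokens.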
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.countTokens("hello world");
// returns 2
encoding.countTokens("hello <|endoftext|> world");
// raises an UnsupportedOperationException
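Since the method is described as no faster than encoding, it can be thought of as encoding followed by taking the size of the result. The sketch below illustrates that equivalence only; it is not the library's implementation. CountTokensSketch is a hypothetical name, and the encoder is passed in as a function so the example is self-contained.

```java
import java.util.List;
import java.util.function.Function;

final class CountTokensSketch {

    // Counts tokens by running the given encoder and returning the list size.
    // Any exception thrown by the encoder propagates to the caller unchanged.
    static int countTokens(String text, Function<String, List<Integer>> encoder) {
        return encoder.apply(text).size();
    }
}
```

With a real Encoding instance, countTokens(text, encoding::encode) would presumably return the same value as encoding.countTokens(text), assuming the signatures listed in the method summary.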
Parameters:
text - the text to count tokens for
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens which are not supported for now
int countTokensOrdinary(java.lang.String text)
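Encodes the given text into a list of token ids and returns the amount of tokens.
This is a convenience method for encodeOrdinary(String), if all you want is to
know the amount of tokens. It is not more performant than encodeOrdinary(String),
so prefer to use encodeOrdinary(String) if you actually need the tokens.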
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.countTokensOrdinary("hello world");
// returns 2
encoding.countTokensOrdinary("hello <|endoftext|> world");
// returns 8
Parameters:
text - the text to count tokens for
java.lang.String decode(java.util.List<java.lang.Integer> tokens)
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.decode(List.of(15339, 1917));
// returns "hello world"
encoding.decode(List.of(15339, 1917, Integer.MAX_VALUE));
// raises an IllegalArgumentException
Parameters:
tokens - the list of token ids
Throws:
java.lang.IllegalArgumentException - if the list contains invalid token ids
byte[] decodeBytes(java.util.List<java.lang.Integer> tokens)
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.decodeBytes(List.of(15339, 1917));
// returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
encoding.decodeBytes(List.of(15339, 1917, Integer.MAX_VALUE));
// raises an IllegalArgumentException
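The byte array in the example above is the text "hello world", assuming the bytes are UTF-8. The short sketch below makes that relationship concrete using only the standard library. DecodeBytesSketch and bytesToText are hypothetical names, and the byte values are copied from the example rather than produced by the library.

```java
import java.nio.charset.StandardCharsets;

final class DecodeBytesSketch {

    // Interprets raw bytes as UTF-8 text (assumption: the decoded bytes are UTF-8).
    static String bytesToText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100};
        System.out.println(bytesToText(bytes)); // prints "hello world"
    }
}
```

Working with the raw bytes can be useful when a token list ends in the middle of a multi-byte character, since such a fragment may not convert cleanly to a String.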
Parameters:
tokens - the list of token ids
Throws:
java.lang.IllegalArgumentException - if the list contains invalid token ids
java.lang.String getName()
Returns the name of this encoding. This is the name used to identify the encoding in the EncodingRegistry.