String Functions¶

Note

These functions assume that the input strings contain valid UTF-8 encoded Unicode code points. There are no explicit checks for valid UTF-8 and the functions may return incorrect results on invalid UTF-8.

Additionally, the functions operate on Unicode code points and not user visible characters (or grapheme clusters). Some languages combine multiple code points into a single user-perceived character, the basic unit of a writing system for a language, but the functions will treat each code point as a separate unit.

The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.

chr(n) → varchar¶: Returns the Unicode code point n as a single character string.

codepoint(string) → integer¶: Returns the Unicode code point of the only character of string.

concat(string1, ..., stringN) → varchar¶: Returns the concatenation of string1, string2, ..., stringN. This function provides the same functionality as the SQL-standard concatenation operator (||).

ends_with(string, substring) → boolean¶: Returns whether string ends_with with substring.

from_utf8(binary) → varchar¶: Decodes a UTF-8 encoded string from binary. Invalid UTF-8 sequences are replaced with the Unicode replacement character U+FFFD. Each invalid UTF-8 codepoint, including those of multi-byte long, is replaced with one replacement character.

from_utf8(binary, replace) → varchar: Decodes a UTF-8 encoded string from binary. Invalid UTF-8 sequences are replaced with replace. Each invalid UTF-8 codepoint, including those of multi-byte long, is replaced with one replacement character. The replace argument can be either Unicode code point (bigint), a single character or empty string. When replace is an empty string invalid characters are removed.

hamming_distance(string1, string2) → bigint¶: Returns the Hamming distance of string1 and string2, i.e. the number of positions at which the corresponding characters are different. Note that the two strings must have the same length.

length(string) → bigint¶: Returns the length of string in characters.

levenshtein_distance(string_1, string_2) → bigint¶: Returns the Levenshtein edit distance of 2 strings. I.e. the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert string_1 to string_2.

lower(string) → varchar¶: Converts string to lowercase.

lpad(string, size, padstring) → varchar¶: Left pads string to size characters with padstring. If size is less than the length of string, the result is truncated to size characters. size must not be negative and padstring must be non-empty.

ltrim(string) → varchar¶: Removes leading whitespace from string. See trim() for the set of recognized whitespace characters.

ltrim(string, chars) → varchar

Removes the longest substring containing only characters in chars from the beginning of the string.

SELECT ltrim('test', 't'); -- est
SELECT ltrim('tetris', 'te'); -- ris

replaceFirst(string, search, replace) → varchar¶

Removes the first instances of search with replace in string.

If search is an empty string, inserts replace in front of string.

replace(string, search) → varchar¶: Removes all instances of search from string.

replace(string, search, replace) → varchar

Replaces all instances of search with replace in string.

If search is an empty string, inserts replace in front of every character and at the end of the string.

reverse(string) → varchar: Returns input string with characters in reverse order.

reverse(varbinary) → varbinary: Returns input binary with bytes in reversed order.

rpad(string, size, padstring) → varchar¶: Right pads string to size characters with padstring. If size is less than the length of string, the result is truncated to size characters. size must not be negative and padstring must be non-empty.

rtrim(string) → varchar¶: Removes trailing whitespace from string. See trim() for the set of recognized whitespace characters.

rtrim(string, chars) → varchar

Removes the longest substring containing only characters in chars from the end of the string.

SELECT rtrim('test', 't'); -- tes
SELECT rtrim('test...', '.'); -- test

split(string, delimiter)¶: Splits string on delimiter and returns an array.

split(string, delimiter, limit)

Splits string on delimiter and returns an array of size at most limit.

The last element in the array always contains everything left in the string. limit must be a positive number.

split_part(string, delimiter, index) → string¶

Splits string on delimiter and returns the part at index index.

Field indexes start with 1. If the index is larger than the number of fields, then null is returned.

split_to_map(string, entryDelimiter, keyValueDelimiter) → map<varchar, varchar>¶

Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs. keyValueDelimiter splits each pair into key and value. Note that entryDelimiter and keyValueDelimiter are interpreted literally, i.e., as full string matches.

entryDelimiter and keyValueDelimiter must not be empty and must not be the same. entryDelimiter is allowed to be the trailing character.

Raises an error if there are duplicate keys.

split_to_map(string, entryDelimiter, keyValueDelimiter, function(K, V1, V2, R)) → map<varchar, varchar>¶

Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs. keyValueDelimiter splits each pair into key and value. Note that entryDelimiter and keyValueDelimiter are interpreted literally, i.e., as full string matches.

function(K,V1,V2,R) is used to decide whether to keep first or last value for duplicate keys. (k, v1, v2) -> v1 keeps first value. (k, v1, v2) -> v2 keeps last value. Arbitrary functions are not supported.

SELECT(split_to_map('a:1;b:2;a:3', ';', ':', (k, v1, v2) -> v1)); -- {"a": "1", "b": "2"}
SELECT(split_to_map('a:1;b:2;a:3', ';', ':', (k, v1, v2) -> v2)); -- {"a": "3", "b": "2"}

split_to_multimap(string, entryDelimiter, keyValueDelimiter) → map<varchar, array<varchar>>¶

Splits string by entryDelimiter and keyValueDelimiter and returns a map containing an array of values for each unique key. entryDelimiter splits string into key-value pairs. keyValueDelimiter splits each pair into key and value. The values for each key will be in the same order as they appeared in string.

Note that entryDelimiter and keyValueDelimiter are interpreted literally, i.e., as full string matches.

starts_with(string, substring) → boolean¶: Returns whether string starts with substring.

strpos(string, substring) → bigint¶: Returns the starting position of the first instance of substring in string. Positions start with 1. If not found, 0 is returned.

strpos(string, substring, instance) → bigint

Returns the position of the N-th instance of substring in string. instance must be a positive number. Positions start with 1. If not found, 0 is returned. It takes into account overlapping strings when counting occurrences.

SELECT strpos('aaa', 'aa', 2); -- 2

strrpos(string, substring) → bigint¶: Returns the starting position of the last instance of substring in string. Positions start with 1. If not found, 0 is returned.

strrpos(string, substring, instance) → bigint

Returns the position of the N-th instance of substring in string starting from the end of the string. instance must be a positive number. Positions start with 1. If not found, 0 is returned. It takes into account overlapping strings when counting occurrences.

SELECT strrpos('aaa', 'aa', 2); -- 1

trail(string, N) → varchar¶: Returns the last N characters of the input string up to at most the length of string.

substr(string, start) → varchar¶

substring(string, start) → varchar: Returns the rest of string from the starting position start. Positions start with 1. A negative starting position is interpreted as being relative to the end of the string. Returns empty string if absolute value of start is greater then length of the string.

substr(string, start, length) → varchar

substring(string, start, length) → varchar: Returns a substring from string of length length from the starting position start. Positions start with 1. A negative starting position is interpreted as being relative to the end of the string. Returns empty string if absolute value of 'start is greater then length of the string.

trim(string) → varchar¶

Removes starting and ending whitespaces from string.

Recognized whitespace characters:

Code	Description	Code	Description
9	TAB (horizontal tab)	U+1680	Ogham Space Mark
10	LF (NL line feed, new line)	U+2000	En Quad
11	VT (vertical tab)	U+2001	Em Quad
12	FF (NP form feed, new page)	U+2002	En Space
13	CR (carriage return)	U+2003	Em Space
28	FS (file separator)	U+2004	Three-Per-Em Space
29	GS (group separator)	U+2005	Four-Per-Em Space
30	RS (record separator)	U+2006	Four-Per-Em Space
31	US (unit separator)	U+2008	Punctuation Space
32	Space	U+2009	Thin Space
_	_	U+200a	Hair Space
_	_	U+200a	Hair Space
_	_	U+2028	Line Separator
_	_	U+2029	Paragraph Separator
_	_	U+205f	Medium Mathematical Space
_	_	U+3000	Ideographic Space

trim(string, chars) → varchar

Removes the longest substring containing only characters in chars from the beginning and end of the string.

SELECT trim('test', 't'); -- es
SELECT trim('.t.e.s.t.', '.t'); -- e.s

upper(string) → varchar¶: Converts string to uppercase.

word_stem(word) → varchar¶: Returns the stem of word in the English language. If the word is not an English word, the word in lowercase is returned.

word_stem(word, lang) → varchar¶

Returns the stem of word in the lang language. This function supports the following languages:

lang	Language
`ca`	`Catalan`
`da`	`Danish`
`de`	`German`
`en`	`English`
`es`	`Spanish`
`eu`	`Basque`
`fi`	`Finnish`
`fr`	`French`
`hu`	`Hungarian`
`hy`	`Armenian`
`ir`	`Irish`
`it`	`Italian`
`lt`	`Lithuanian`
`nl`	`Dutch`
`no`	`Norwegian`
`pt`	`Portuguese`
`ro`	`Romanian`
`ru`	`Russian`
`sv`	`Swedish`
`tr`	`Turkish`

If the specified lang is not supported, this function throws a user error.

Unicode Functions¶

normalize(string) → varchar¶: Transforms string with NFC normalization form.

normalize(string, form) → varchar¶

Reference: https://unicode.org/reports/tr15/#Norm_Forms Transforms string with the specified normalization form. form must be be one of the following keywords:

Form	Description
`NFD`	Canonical Decomposition
`NFC`	Canonical Decomposition, followed by Canonical Composition
`NFKD`	Compatibility Decomposition
`NFKC`	Compatibility Decomposition, followed by Canonical Composition

Note

This SQL-standard function has special syntax and requires specifying form as a keyword, not as a string.

to_utf8(string) → varbinary¶: Encodes string into a UTF-8 varbinary representation.

String Functions¶

Unicode Functions¶

Table of Contents

Previous topic

Next topic

This Page