String Functions¶
Note
Unless specified otherwise, all functions return NULL if at least one of the arguments is NULL.
These functions assume that input strings contain valid UTF-8 encoded Unicode code points. The behavior is undefined if they are not.
- ascii(string) integer¶
 Returns unicode code point of the first character of
string. Returns 0 ifstringis empty.
- base64(expr) varchar¶
 Converts
exprto a base 64 string using RFC2045 Base64 transfer encoding for MIME.SELECT base64('Spark SQL'); -- 'U3BhcmsgU1FM'
- bit_length(string/binary) integer¶
 Returns the bit length for the specified string column.
SELECT bit_length('123'); -- 24
- char_type_write_side_check(string, limit) varchar¶
 Ensures that input
stringfits within the specified lengthlimitin characters by padding or trimming spaces as needed. If the length ofstringis less thanlimit, it is padded with trailing spaces (ASCII 32) to reachlimit. If the length ofstringis greater thanlimit, trailing spaces are trimmed to fit withinlimit. Throws exception whenstringstill exceedslimitafter trimming trailing spaces or whenlimitis not greater than 0. Note: This function is not directly callable in Spark SQL, but internally used for length check when writing char type columns.-- Function call examples (this function is not directly callable in Spark SQL). char_type_write_side_check("abc", 3) -- "abc" char_type_write_side_check("ab", 3) -- "ab " char_type_write_side_check("a", 3) -- "a " char_type_write_side_check("abc ", 3) -- "abc" char_type_write_side_check("abcd", 3) -- VeloxUserError: "Exceeds allowed length limitation: '3'" char_type_write_side_check("世界", 2) -- "世界" char_type_write_side_check("世", 3) -- "世 " char_type_write_side_check("世界人", 2) -- VeloxUserError: "Exceeds allowed length limitation: '2'" char_type_write_side_check("abc", 0) -- VeloxUserError: "The length limit must be greater than 0."
- chr(n) varchar¶
 Returns the Unicode code point
nas a single character string. Ifn < 0, the result is an empty string. Ifn >= 256, the result is equivalent to chr(n % 256).
- concat_ws(separator, [string/array<string>, ]...) varchar¶
 Returns the concatenation result for
stringand all elements inarray<string>, separated byseparator. The first argument isseparatorwhose type is VARCHAR. Then, this function can take variable number of remaining arguments , and it allows mixed use ofstringtype andarray<string>type. Skips NULL argument or NULL array element during the concatenation. Ifseparatoris NULL, returns NULL, regardless of the following inputs. For non-NULLseparator, if no remaining input exists or all remaining inputs are NULL, returns an empty string.SELECT concat_ws('~', 'a', 'b', 'c'); -- 'a~b~c' SELECT concat_ws('~', ['a', 'b', 'c'], ['d']); -- 'a~b~c~d' SELECT concat_ws('~', 'a', ['b', 'c']); -- 'a~b~c' SELECT concat_ws('~', '', [''], ['a', '']); -- '~~a~' SELECT concat_ws(NULL, 'a'); -- NULL SELECT concat_ws('~'); -- '' SELECT concat_ws('~', NULL, [NULL], 'a', 'b'); -- 'a~b' SELECT concat_ws('~', NULL, NULL); -- '' SELECT concat_ws('~', [NULL]); -- ''
- contains(left, right) boolean¶
 Returns true if ‘right’ is found in ‘left’. Otherwise, returns false.
SELECT contains('Spark SQL', 'Spark'); -- true SELECT contains('Spark SQL', 'SPARK'); -- false SELECT contains('Spark SQL', null); -- NULL SELECT contains(x'537061726b2053514c', x'537061726b'); -- true
- conv(number, fromBase, toBase) varchar¶
 Converts
numberrepresented as a string fromfromBasetotoBase.fromBasemust be an INTEGER value between 2 and 36 inclusively.toBasemust be an INTEGER value between 2 and 36 inclusively or between -36 and -2 inclusively. Otherwise, returns NULL. Returns a signed number iftoBaseis negative. Otherwise, returns an unsigned one. Returns NULL ifnumberis empty. Skips leading spaces.numbermay contain other characters not valid forfromBase. All characters starting from the first invalid character till the end of the string are ignored. Only converts valid characters even thoughfromBase=toBase. Returns ‘0’ if no valid character is found.SELECT conv('100', 2, 10); -- '4' SELECT conv('-10', 16, -10); -- '-16' SELECT conv("-1", 10, 16); -- 'FFFFFFFFFFFFFFFF' SELECT conv("123", 10, 39); -- NULL SELECT conv('', 16, 10); -- NULL SELECT conv(' ', 2, 10); -- NULL SELECT conv("11", 10, 16); -- 'B' SELECT conv("11ABC", 10, 16); -- 'B' SELECT conv("11abc", 10, 10); -- '11' SELECT conv('H016F', 16, 10); -- '0'
- empty2null(input) varchar¶
 Returns NULL if
inputis empty. Otherwise, returnsinput. Note: it’s an internal Spark function used to convert empty value of a partition column, which is then converted to Hive default partition value__HIVE_DEFAULT_PARTITION__.SELECT empty2null(''); -- NULL SELECT empty2null('abc'); -- 'abc'
- endswith(left, right) boolean¶
 Returns true if ‘left’ ends with ‘right’. Otherwise, returns false.
SELECT endswith('js SQL', 'SQL'); -- true SELECT endswith('js SQL', 'js'); -- false SELECT endswith('js SQL', NULL); -- NULL
- find_in_set(str, strArray) integer¶
 Returns 1-based index of the given string
strin the comma-delimited liststrArray. Returns 0, if the string was not found or if the given stringstrcontains a comma.SELECT find_in_set('ab', 'abc,b,ab,c,def'); -- 3 SELECT find_in_set('ab,', 'abc,b,ab,c,def'); -- 0 SELECT find_in_set('dfg', 'abc,b,ab,c,def'); -- 0 SELECT find_in_set('', ''); -- 1 SELECT find_in_set('', '123,'); -- 2 SELECT find_in_set('', ',123'); -- 1 SELECT find_in_set(NULL, ',123'); -- NULL SELECT find_in_set("abc", NULL); -- NULL
- initcap(string) varchar¶
 The
initcapfunction converts the first character of each word to uppercase and all other characters in the word to lowercase. It supports UTF-8 multibyte characters, up to four bytes per character.A word is defined as a sequence of characters separated by whitespace.
SELECT initcap('spark sql'); -- Spark Sql SELECT initcap('spARK sQL'); -- Spark Sql SELECT initcap('123abc DEF!ghi'); -- 123abc Def!ghi SELECT initcap('élan vital für alle'); -- Élan Vital Für Alle SELECT initcap('hello-world test_case'); -- Hello-world Test_case
- instr(string, substring) integer¶
 Returns the starting position of the first instance of
substringinstring. Positions start with1. Returns 0 if ‘substring’ is not found.
- left(string, length) string¶
 Returns the leftmost length characters from the
string. Iflengthis less or equal than 0 the result is an empty string.
- length(string) integer¶
 Returns the length of
stringin characters.
- levenshtein(string1, string2[, threshold]) integer¶
 Returns the Levenshtein distance between the two given strings. If the provided
thresholdis negative, or the levenshtein distance exceedsthreshold, returns -1.SELECT levenshtein('kitten', 'sitting'); -- 3 SELECT levenshtein('kitten', 'sitting', 10); -- 3 SELECT levenshtein('kitten', 'sitting', 2); -- -1
- locate(substring, string, start) integer¶
 Returns the 1-based position of the first occurrence of
substringin givenstringafter positionstart. The search is from the beginning ofstringto the end.startis the starting character position instringto search for thesubstring.startis 1-based and must be at least 1 and at most the characters number ofstring. The following rules on special values are applied to follow Spark’s implementation. They are listed in order of priority:Returns 0 if
startis NULL. Returns NULL ifsubstringorstringis NULL. Returns 0 ifstartis less than 1. Returns 1 ifsubstringis empty. Returns 0 ifstartis greater than the characters number ofstring. Returns 0 ifsubstringis not found instring.SELECT locate('aa', 'aaads', 1); -- 1 SELECT locate('aa', 'aaads', -1); -- 0 SELECT locate('aa', 'aaads', 2); -- 2 SELECT locate('aa', 'aaads', 6); -- 0 SELECT locate('aa', 'aaads', NULL); -- 0 SELECT locate('', 'aaads', 1); -- 1 SELECT locate('', 'aaads', 9); -- 1 SELECT locate('', 'aaads', -1); -- 0 SELECT locate('', '', 1); -- 1 SELECT locate('aa', '', 1); -- 0 SELECT locate(NULL, NULL, NULL); -- 0 SELECT locate(NULL, NULL, 1); -- NULL SELECT locate('\u4FE1', '\u4FE1\u5FF5,\u4FE1\u7231,\u4FE1\u5E0C\u671B', 2); -- 4
- lower(string) string¶
 Returns string with all characters changed to lowercase.
SELECT lower('SparkSql'); -- sparksql
- lpad(string, len, pad) string¶
 Returns
string, left-padded with pad to a length oflen. Ifstringis longer thanlen, the return value is shortened tolencharacters or bytes. Ifpadis not specified,stringwill be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence.SELECT lpad('hi', 5, '??'); -- ???hi SELECT lpad('hi', 1, '??'); -- h SELECT lpad('hi', 4); -- hi
- ltrim(string) varchar¶
 Removes leading 0x20(space) characters from
string.SELECT ltrim(' data '); -- "data "
- ltrim(trimCharacters, string) varchar
 Removes specified leading characters from
string. The specified character is any character contained intrimCharacters.trimCharacterscan be empty and may contain duplicate characters.SELECT ltrim('ps', 'spark'); -- "ark"
- luhn_check(string) boolean¶
 Returns true if
stringpasses the Luhn algorithm check. Otherwise, returns false. The Luhn algorithm is a simple check digit formula used to validate a variety of identification numbers, defined in US patent 2950048A. Returns NULL ifstringis NULL.SELECT luhn_check('4111111111111111'); -- true SELECT luhn_check('378282246310006'); -- false SELECT luhn_check(NULL); -- NULL
- mask(string[, upperChar, lowerChar, digitChar, otherChar]) string¶
 Returns a masked version of the input
string.string: string value to mask.upperChar: A single character string used to substitute upper case characters. The default is ‘X’. If NULL, upper case characters remain unmasked.lowerChar: A single character string used to substitute lower case characters. The default is ‘x’. If NULL, lower case characters remain unmasked.digitChar: A single character string used to substitute digits. The default is ‘n’. If NULL, digits remain unmasked.otherChar: A single character string used to substitute any other character. The default is NULL, which leaves these characters unmasked. Any invalid UTF-8 characters present in the input string will be treated as a single other character.SELECT mask('abcd-EFGH-8765-4321'); -- "xxxx-XXXX-nnnn-nnnn" SELECT mask('abcd-EFGH-8765-4321', 'Q'); -- "xxxx-QQQQ-nnnn-nnnn" SELECT mask('AbCD123-@$#'); -- "XxXXnnn-@$#" SELECT mask('AbCD123-@$#', 'Q'); -- "QxQQnnn-@$#" SELECT mask('AbCD123-@$#', 'Q', 'q'); -- "QqQQnnn-@$#" SELECT mask('AbCD123-@$#', 'Q', 'q', 'd'); -- "QqQQddd-@$#" SELECT mask('AbCD123-@$#', 'Q', 'q', 'd', 'o'); -- "QqQQdddoooo" SELECT mask('AbCD123-@$#', NULL, 'q', 'd', 'o'); -- "AqCDdddoooo" SELECT mask('AbCD123-@$#', NULL, NULL, 'd', 'o'); -- "AbCDdddoooo" SELECT mask('AbCD123-@$#', NULL, NULL, NULL, 'o'); -- "AbCD123oooo" SELECT mask(NULL, NULL, NULL, NULL, 'o'); -- NULL SELECT mask(NULL); -- NULL SELECT mask('AbCD123-@$#', NULL, NULL, NULL, NULL); -- "AbCD123-@$#"
- overlay(input, replace, pos, len) same as input¶
 Replace a substring of
inputstarting atposcharacter withreplaceand going for restlencharacters ofinput. Types ofinputandreplacemust be the same. Supported types are: VARCHAR and VARBINARY. Wheninputtypes are VARCHAR,lenandposare specified in characters, otherwise, bytes. Result is constructed from three parts. First part is first pos - 1 characters ofinputwhenposif greater then zero, otherwise, empty string. Second part isreplace. Third part is rest ofinputfrom indices pos + len which is 1-based, iflenis negative, it will be set to size ofreplace, if pos + len is negative, it refers to -(pos + len)th element before the end ofinput.SELECT overlay('Spark SQL', '_', 6, -1); -- "Spark_SQL" SELECT overlay('Spark SQL', 'CORE', 7, -1); -- "Spark CORE" SELECT overlay('Spark SQL', 'ANSI ', 7, 0); -- "Spark ANSI SQL" SELECT overlay('Spark SQL', 'tructured', 2, 4); -- "Structured SQL" SELECT overlay('Spark SQL', '_', -6, 3); -- "_Sql"
- read_side_padding(string, limit) varchar¶
 Right-pads the given string with spaces to the specified length
limit. If the string’s length is already greater than or equal tolimit, it is returned as-is. Throws an exception iflimitis not greater than 0. Note: This function is not directly callable in Spark SQL, but is used internally for reading CHAR type columns.-- Function call examples (this function is not directly callable in Spark SQL). read_side_padding("a", 3) -- "a " read_side_padding("abc", 3) -- "abc" read_side_padding("abcd", 3) -- "abcd" read_side_padding("世", 3) -- "世 " read_side_padding("世界", 2) -- "世界" read_side_padding("Привет", 8) -- "Привет " read_side_padding("Γειά", 5) -- "Γειά " read_side_padding("Приветик", 6) -- "Приветик" read_side_padding("a", 0) -- VeloxUserError: "The length limit must be greater than 0."
- repeat(input, n) varchar¶
 Returns the string which repeats
inputntimes. Result size must be less than or equal to 1MB. Ifnis less than or equal to 0, empty string is returned.SELECT repeat('123', 2); -- 123123
- replace(input, replaced) varchar¶
 Removes all instances of
replacedfrominput. Ifreplacedis an empty string, returns the originalinputstring.SELECT replace('ABCabc', ''); -- ABCabc SELECT replace('ABCabc', 'bc'); -- ABCc
- replace(input, replaced, replacement) varchar¶
 Replaces all instances of
replacedwithreplacementininput. Ifreplacedis an empty string, returns the originalinputstring.SELECT replace('ABCabc', '', 'DEF'); -- ABCabc SELECT replace('ABCabc', 'abc', ''); -- ABC SELECT replace('ABCabc', 'abc', 'DEF'); -- ABCDEF
- reverse(string) varchar¶
 Returns input string with characters in reverse order.
- rpad(string, len, pad) string¶
 Returns
string, right-padded withpadto a length oflen. Ifstringis longer thanlen, the return value is shortened tolencharacters. Ifpadis not specified,stringwill be padded to the right with space characters if it is a character string, and with zeros if it is a binary string.SELECT lpad('hi', 5, '??'); -- ???hi SELECT lpad('hi', 1, '??'); -- h SELECT lpad('hi', 4); -- hi
- rtrim(string) varchar¶
 Removes trailing 0x20(space) characters from
string.SELECT rtrim(' data '); -- " data"
- rtrim(trimCharacters, string) varchar
 Removes specified trailing characters from
string. The specified character is any character contained intrimCharacters.trimCharacterscan be empty and may contain duplicate characters.SELECT rtrim('kr', 'spark'); -- "spa"
- soundex(string) string¶
 Returns Soundex code of the string. If first character of
stringis not a letter,stringis returned.SELECT soundex('Miller'); -- "M460"
- split(string, delimiter[, limit]) -> array(string)¶
 Splits
stringaround occurrences that matchdelimiterand returns an array with a length of at mostlimit.delimiteris a string representing regular expression.limitis an integer which controls the number of times the regex is applied. By default,limitis -1. Whenlimit> 0, the resulting array’s length will not be more thanlimit, and the resulting array’s last entry will contain all input beyond the last matched regex. Whenlimit<= 0,regexwill be applied as many times as possible, and the resulting array can be of any size. Whendelimiteris empty, iflimitis smaller than the size ofstring, the resulting array only containslimitnumber of single characters splitting fromstring, iflimitis not provided or is larger than the size ofstring, the resulting array contains all the single characters ofstringand does not include an empty tail character. The split function align with vanilla spark 3.4+ split function.SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""] SELECT split('oneAtwoBthreeC', '[ABC]', 2); -- ["one","twoBthreeC"] SELECT split('oneAtwoBthreeC', '[ABC]', 5); -- ["one","two","three",""] SELECT split('one', '1'); -- ["one"] SELECT split('abcd', ''); -- ["a","b","c","d"] SELECT split('abcd', '', 3); -- ["a","b","c"] SELECT split('abcd', '', 5); -- ["a","b","c","d"]
- startswith(left, right) boolean¶
 Returns true if ‘left’ starts with ‘right’. Otherwise, returns false.
SELECT startswith('js SQL', 'js'); -- true SELECT startswith('js SQL', 'SQL'); -- false SELECT startswith('js SQL', null); -- NULL
- str_to_map(string, entryDelimiter, keyValueDelimiter)¶
 Returns a map by splitting
stringinto entries withentryDelimiterand splitting each entry into key/value withkeyValueDelimiter.entryDelimiterandkeyValueDelimitermust be constant strings with single ascii character. AllowskeyValueDelimiternot found when splitting an entry. Throws exception when duplicate map keys are found for single row’s result, consistent with Spark’s default behavior.SELECT str_to_map('a:1,b:2,c:3', ',', ':'); -- {"a":"1","b":"2","c":"3"} SELECT str_to_map('a', ',', ':'); -- {"a":NULL} SELECT str_to_map('', ',', ':'); -- {"":NULL} SELECT str_to_map('a:1,b:2,c:3', ',', ','); -- {"a:1":NULL,"b:2":NULL,"c:3":NULL}
- substring(string, start) varchar¶
 Returns the rest of
stringfrom the starting positionstart. Positions start with1. A negative starting position is interpreted as being relative to the end of the string. When the starting position is 0, the meaning is to refer to the first character.Type of ‘start’ must be an INTEGER.
- substring(string, start, length) varchar
 Returns a substring from
stringof lengthlengthfrom the starting positionstart. Positions start with1. A negative starting position is interpreted as being relative to the end of the string. When the starting position is 0, the meaning is to refer to the first character. Type of ‘start’ must be an INTEGER.SELECT substring('Spark SQL', 0, 2); -- Sp SELECT substring('Spark SQL', 1, 2); -- Sp SELECT substring('Spark SQL', 5, 0); -- "" SELECT substring('Spark SQL', 5, -1); -- "" SELECT substring('Spark SQL', 5, 10000); -- "k SQL" SELECT substring('Spark SQL', -9, 3); -- "Spa" SELECT substring('Spark SQL', -10, 3); -- "Sp" SELECT substring('Spark SQL', -20, 3); -- ""
- substring_index(string, delim, count) [same as string]¶
 Returns the substring from
stringbeforecountoccurrences of the delimiterdelim. Here thestringcan be VARCHAR or VARBINARY and return type matches type ofstring. Ifcountis positive, returns everything to the left of the final delimiter (counting from the left). Ifcountis negative, returns everything to the right of the final delimiter (counting from the right). Ifcountis 0, returns empty string. Ifdelimis not found or found fewer times thancount, returns the original input string.delimis case-sensitive. It also takes into account overlapping strings.SELECT substring_index('Spark.SQL', '.', 1); -- "Spark" SELECT substring_index('Spark.SQL', '.', 0); -- "" SELECT substring_index('Spark.SQL', '.', -1); -- "SQL" SELECT substring_index('TEST.Spark.SQL', '.',2); -- "TEST.Spark" SELECT substring_index('TEST.Spark.SQL', '', 0); -- "" SELECT substring_index('TEST.Spark.SQL', '.', -2); -- "Spark.SQL" SELECT substring_index('TEST.Spark.SQL', '.', 10); -- "TEST.Spark.SQL" SELECT substring_index('TEST.Spark.SQL', '.', -12); -- "TEST.Spark.SQL" SELECT substring_index('aaaaa', 'aa', 2); -- "a" SELECT substring_index('aaaaa', 'aa', -4); -- "aaa" SELECT substring_index('aaaaa', 'aa', 0); -- "" SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa" SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa"
- translate(string, match, replace) varchar¶
 Returns a new translated string. It translates the character in
stringby a character inreplace. The character inreplaceis corresponding to the character inmatch. The translation will happen when any character instringmatching with a character inmatch. Ifmatch'scharacter size is larger thanreplace's, the extra characters inmatchwill be removed fromstring. In addition, this function only considers the first occurrence of a character inmatchand uses its corresponding character inreplacefor translation. Any invalid UTF-8 characters present in the input string will be treated as a single character.:SELECT translate('spark', 'sa', '12'); -- "1p2rk" SELECT translate('spark', 'sa', '1'); -- "1prk" SELECT translate('spark', 'ss', '12'); -- "1park"
- trim(string) varchar¶
 Removes leading and trailing 0x20(space) characters from
string.SELECT trim(' data '); -- "data"
- trim(trimCharacters, string) varchar
 Removes specified leading and trailing characters from
string. The specified character is any character contained intrimCharacters.trimCharacterscan be empty and may contain duplicate characters.SELECT trim('sprk', 'spark'); -- "a"
- unbase64(expr) varbinary¶
 Returns a decoded base64 string as binary.
SELECT cast(unbase64('U3BhcmsgU1FM') AS STRING); -- 'Spark SQL'
- upper(string) string¶
 Returns string with all characters changed to uppercase.
SELECT upper('SparkSql'); -- SPARKSQL
- varchar_type_write_side_check(string, limit) varchar¶
 Removes trailing space characters (ASCII 32) that exceed the length
limitfrom the end of inputstring.limitis the maximum length of characters that can be allowed. Throws exception whenstringstill exceedslimitafter trimming trailing spaces or whenlimitis not greater than 0. Empty strings are returned as-is since they always satisfy any lengthlimitgreater than 0. Note: This function is not directly callable in Spark SQL, but internally used for length check when writing string type columns.-- Function call examples (this function is not directly callable in Spark SQL). varchar_type_write_side_check("abc", 3) -- "abc" varchar_type_write_side_check("abc ", 3) -- "abc" varchar_type_write_side_check("abcd", 3) -- VeloxUserError: "Exceeds allowed length limitation: '3'" varchar_type_write_side_check("中国", 3) -- "中国" varchar_type_write_side_check("中文中国", 3) -- VeloxUserError: "Exceeds allowed length limitation: '3'" varchar_type_write_side_check(" ", 0) -- VeloxUserError: "The length limit must be greater than 0." varchar_type_write_side_check("", 3) -- ""