String
A String in Elixir is a UTF-8 encoded binary.
String and binary operations
The functions in this module act according to the
Unicode Standard, version 6.3.0. For example,
capitalize/1, downcase/1, strip/1 are provided by this
module.
In addition to this module, Elixir provides more low-level
operations that work directly with binaries. Some
of those can be found in the Kernel module, such as:
- Kernel.binary_part/3 - retrieves part of the binary
- Kernel.bit_size/1 and Kernel.byte_size/1 - size related functions
- Kernel.is_bitstring/1 and Kernel.is_binary/1 - type checking functions
- Plus a number of conversion functions, like Kernel.binary_to_atom/1, Kernel.binary_to_integer/2, Kernel.binary_to_term/1 and their inverses, like Kernel.integer_to_binary/2
Finally, the :binary module
provides a few other functions that work on the byte level.
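As a rough illustration of the two levels, the sketch below runs byte-oriented Kernel functions and their String counterparts on the same value. The sample string is arbitrary, and the "é" is built from its codepoint so the example does not depend on how the source file is encoded.

```elixir
# "olé": three characters, the last of which takes two bytes in UTF-8.
string = "ol" <> <<0xE9::utf8>>

# Kernel functions count and slice raw bytes.
bytes = byte_size(string)
prefix = binary_part(string, 0, 2)

# String functions count characters as a reader perceives them.
chars = String.length(string)
first = String.first(string)

IO.inspect({bytes, prefix, chars, first})
```

Here `bytes` is 4 while `chars` is 3, because the final character occupies two bytes.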
Codepoints and graphemes
As per the Unicode Standard, a codepoint is a Unicode character, which may be represented by one or more bytes. For example, the character "é" is represented with two bytes:
iex> byte_size("é")
2
However, this module returns the proper length:
iex> String.length("é")
1
Furthermore, this module also presents the concept of graphemes, which are multiple characters that may be "perceived as a single character" by readers. For example, the same "é" character written above could be represented by the letter "e" followed by the accent ́:
iex> string = "\x{0065}\x{0301}"
...> byte_size(string)
3
iex> String.length(string)
1
Although the example above is made of two characters, it is perceived by users as one.
Graphemes can also be two characters that are interpreted as one by some languages. For example, some languages may consider "ch" as a grapheme. However, since this information depends on the locale, it is not taken into account by this module.
In general, the functions in this module rely on the Unicode Standard, but do not contain any locale-specific behaviour.
More information about graphemes can be found in the Unicode Standard Annex #29. The current Elixir version implements the Extended Grapheme Cluster algorithm.
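The distinction between codepoints and graphemes can be sketched as follows. The variable names are illustrative only, and both strings are built from explicit codepoints so that the decomposed and precomposed forms cannot be confused.

```elixir
# "e" (U+0065) followed by a combining acute accent (U+0301).
decomposed = <<0x0065::utf8, 0x0301::utf8>>

# The same visible character as a single precomposed codepoint (U+00E9).
precomposed = <<0x00E9::utf8>>

# Two codepoints, but one grapheme.
IO.inspect(length(String.codepoints(decomposed)))
IO.inspect(length(String.graphemes(decomposed)))

# One codepoint and one grapheme.
IO.inspect(length(String.codepoints(precomposed)))

# The two strings look alike but hold different bytes, so they are not equal.
IO.inspect(decomposed == precomposed)
```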
Integer codepoints
Although codepoints could be represented as integers, this module represents all codepoints as strings. For example:
iex> String.codepoints("josé")
["j", "o", "s", "é"]
There are a couple of ways to retrieve a character's integer
codepoint. One may use the ? special macro:
iex> ?j
106
iex> ?é
233
Or use pattern matching:
iex> << eacute :: utf8 >> = "é"
...> eacute
233
As we have seen above, codepoints can be inserted into a string by their hexadecimal code:
"jos\x{0065}\x{0301}" #=> "josé"
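Going between strings and integer codepoints can be sketched with the `utf8` binary modifier. `CodepointSketch` and its two functions are hypothetical helpers written for this illustration, not part of the String module.

```elixir
defmodule CodepointSketch do
  # Hypothetical helper: turns a string into a list of integer codepoints by
  # repeatedly matching one UTF-8 codepoint off the front of the binary.
  # Raises FunctionClauseError if the binary is not valid UTF-8.
  def to_integers(<<codepoint::utf8, rest::binary>>) do
    [codepoint | to_integers(rest)]
  end

  def to_integers(<<>>), do: []

  # Hypothetical helper: builds a string back from integer codepoints.
  def from_integers(codepoints) do
    codepoints
    |> Enum.map(fn codepoint -> <<codepoint::utf8>> end)
    |> Enum.join()
  end
end

# "josé", with the "é" given by its codepoint (233).
integers = CodepointSketch.to_integers("jos" <> <<233::utf8>>)
IO.inspect(integers)
IO.inspect(CodepointSketch.from_integers(integers))
```

The list held in `integers` is 106, 111, 115 and 233, matching the values returned by `?j` and `?é` above.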
Self-synchronization
The UTF-8 encoding is self-synchronizing. This means that if malformed data (i.e., data that is not possible according to the definition of the encoding) is encountered, only one codepoint needs to be rejected.
This module relies on this behaviour to ignore such invalid
characters. For example, length/1 is going to return
a correct result even if an invalid codepoint is fed into it.
In other words, this module expects invalid data to be detected when retrieving data from the external source. For example, a driver that reads strings from a database is responsible for checking the validity of the encoding.
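One way to perform that check at the boundary is sketched below. `ExternalInput.decode/1` and the `:invalid_encoding` reason are hypothetical names chosen for this example; only `String.valid?/1` comes from this module.

```elixir
defmodule ExternalInput do
  # Hypothetical boundary function: accepts raw bytes from an external
  # source and only passes them on if they are valid UTF-8.
  def decode(bytes) when is_binary(bytes) do
    if String.valid?(bytes) do
      {:ok, bytes}
    else
      {:error, :invalid_encoding}
    end
  end
end

IO.inspect(ExternalInput.decode("elixir"))

# 0xFF and 0xFE never occur in well-formed UTF-8.
IO.inspect(ExternalInput.decode(<<0xFF, 0xFE>>))
```

Validating once at the point of entry means the rest of the program can call String functions without re-checking the encoding.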
Summary
| at(string, position) | Returns the grapheme in the position of the given UTF-8 string |
| capitalize(string) | Converts the first character in the given string to uppercase and the remainder to lowercase |
| codepoints(string) | Returns all codepoints in the string |
| contains?(string, contents) | Checks if string contains any of the given contents |
| downcase(binary) | Converts all characters in the given string to lowercase |
| duplicate(subject, n) | Returns a binary subject duplicated n times |
| ends_with?(string, suffixes) | Returns true if string ends with any of the suffixes given, otherwise false |
| first(string) | Returns the first grapheme from a UTF-8 string, nil if the string is empty |
| from_char_list!(list) | Converts a list of integer codepoints to a string |
| from_char_list(list) | Converts a list of integer codepoints to a string |
| graphemes(string) | Returns Unicode graphemes in the string as per the Extended Grapheme Cluster algorithm outlined in the Unicode Standard Annex #29, Unicode Text Segmentation |
| last(string) | Returns the last grapheme from a UTF-8 string, nil if the string is empty |
| length(string) | Returns the number of Unicode graphemes in a UTF-8 string |
| ljust(subject, len) | Returns a new string of length len with subject left justified and padded with padding |
| ljust(subject, len, padding) | |
| lstrip(binary) | Returns a string where leading Unicode whitespace has been removed |
| lstrip(other, char) | Returns a string where leading occurrences of char have been removed |
| match?(string, regex) | Checks if string matches the given regular expression |
| next_codepoint(string) | Returns the next codepoint in a String |
| next_grapheme(string) | Returns the next grapheme in a String |
| printable?(b) | Checks if a string is printable considering it is encoded as UTF-8. Returns true if so, false otherwise |
| replace(subject, pattern, replacement, options \\ []) | Returns a new binary based on subject by replacing the parts matching pattern by replacement |
| reverse(string) | Reverses the given string. Works on graphemes |
| rjust(subject, len) | Returns a new string of length len with subject right justified and padded with padding |
| rjust(subject, len, padding) | |
| rstrip(binary) | Returns a string where trailing Unicode whitespace has been removed |
| rstrip(string, char) | Returns a string where trailing occurrences of char have been removed |
| slice(string, range) | Returns a substring from the offset given by the start of the range to the offset given by the end of the range |
| slice(string, start, len) | Returns a substring starting at the offset given by the first, and a length given by the second. If the offset is greater than string length, then it returns nil |
| split(binary) | Divides a string into substrings at each Unicode whitespace occurrence with leading and trailing whitespace ignored |
| split(binary, pattern, options \\ []) | Divides a string into substrings based on a pattern, returning a list of these substrings. The pattern can be a string, a list of strings or a regular expression |
| starts_with?(string, prefixes) | Returns true if string starts with any of the prefixes given, otherwise false |
| strip(string) | Returns a string where leading/trailing Unicode whitespace has been removed |
| strip(string, char) | Returns a string where leading/trailing occurrences of char have been removed |
| to_char_list!(string) | Converts a string into a char list converting each codepoint to its respective integer value |
| to_char_list(string) | Converts a string into a char list converting each codepoint to its respective integer value |
| upcase(binary) | Converts all characters in the given string to uppercase |
| valid?(arg1) | Checks whether str contains only valid characters |
| valid_character?(codepoint) | Checks whether str is a valid character |
Functions
at(string, position)
Returns the grapheme in the position of the given UTF-8 string.
If position is greater than string length, then it returns nil.
Examples
iex> String.at("elixir", 0)
"e"
iex> String.at("elixir", 1)
"l"
iex> String.at("elixir", 10)
nil
iex> String.at("elixir", -1)
"r"
iex> String.at("elixir", -10)
nil
capitalize(string)
Converts the first character in the given string to uppercase and the remainder to lowercase.
This relies on the titlecase information provided by the Unicode Standard. Note this function makes no attempt to capitalize all words in the string (usually known as titlecase).
Examples
iex> String.capitalize("abcd")
"Abcd"
iex> String.capitalize("fin")
"Fin"
iex> String.capitalize("josé")
"José"
codepoints(string)
Returns all codepoints in the string.
Examples
iex> String.codepoints("josé")
["j", "o", "s", "é"]
iex> String.codepoints("оптими зации")
["о","п","т","и","м","и"," ","з","а","ц","и","и"]
iex> String.codepoints("ἅἪῼ")
["ἅ","Ἢ","ῼ"]
contains?(string, contents)
Checks if string contains any of the given contents.
contents can be either a single string or a list of strings.
Examples
iex> String.contains? "elixir of life", "of"
true
iex> String.contains? "elixir of life", ["life", "death"]
true
iex> String.contains? "elixir of life", ["death", "mercury"]
false
downcase(binary)
Converts all characters in the given string to lowercase.
Examples
iex> String.downcase("ABCD")
"abcd"
iex> String.downcase("AB 123 XPTO")
"ab 123 xpto"
iex> String.downcase("JOSÉ")
"josé"
duplicate(subject, n)
Returns a binary subject duplicated n times.
Examples
iex> String.duplicate("abc", 0)
""
iex> String.duplicate("abc", 1)
"abc"
iex> String.duplicate("abc", 2)
"abcabc"
ends_with?(string, suffixes)
Returns true if string ends with any of the suffixes given, otherwise
false. suffixes can be either a single suffix or a list of suffixes.
Examples
iex> String.ends_with? "language", "age"
true
iex> String.ends_with? "language", ["youth", "age"]
true
iex> String.ends_with? "language", ["youth", "elixir"]
false
first(string)
Returns the first grapheme from a UTF-8 string, nil if the string is empty.
Examples
iex> String.first("elixir")
"e"
iex> String.first("եոգլի")
"ե"
from_char_list(list)
Specs:
- from_char_list(char_list) :: {:ok, String.t} | {:error, binary, binary} | {:incomplete, binary, binary}
Converts a list of integer codepoints to a string.
Examples
iex> String.from_char_list([0x00E6, 0x00DF])
{ :ok, "æß" }
iex> String.from_char_list([0x0061, 0x0062, 0x0063])
{ :ok, "abc" }
from_char_list!(list)
Specs:
- from_char_list!(char_list) :: String.t | no_return
Converts a list of integer codepoints to a string.
In case the conversion fails, it raises a String.UnicodeConversionError.
Examples
iex> String.from_char_list!([0x00E6, 0x00DF])
"æß"
iex> String.from_char_list!([0x0061, 0x0062, 0x0063])
"abc"
graphemes(string)
Returns Unicode graphemes in the string as per the Extended Grapheme Cluster algorithm outlined in the Unicode Standard Annex #29, Unicode Text Segmentation.
Examples
iex> String.graphemes("Ā̀stute")
["Ā̀","s","t","u","t","e"]
last(string)
Returns the last grapheme from a UTF-8 string,
nil if the string is empty.
Examples
iex> String.last("elixir")
"r"
iex> String.last("եոգլի")
"ի"
length(string)
Specs:
- length(t) :: non_neg_integer
Returns the number of Unicode graphemes in a UTF-8 string.
Examples
iex> String.length("elixir")
6
iex> String.length("եոգլի")
5
ljust(subject, len) and ljust(subject, len, padding)
Returns a new string of length len with subject left justified and padded
with padding. If padding is not present, it defaults to whitespace. When
len is less than the length of subject, subject is returned.
Examples
iex> String.ljust("abc", 5)
"abc  "
iex> String.ljust("abc", 5, ?-)
"abc--"
lstrip(binary)
Returns a string where leading Unicode whitespace has been removed.
Examples
iex> String.lstrip(" abc ")
"abc "
lstrip(other, char)
Returns a string where leading occurrences of char have been removed.
Examples
iex> String.lstrip("_ abc _", ?_)
" abc _"
match?(string, regex)
Checks if string matches the given regular expression.
Examples
iex> String.match?("foo", ~r/foo/)
true
iex> String.match?("bar", ~r/foo/)
false
next_codepoint(string)
Returns the next codepoint in a String.
The result is a tuple with the codepoint and the
remainder of the string, or nil in case
the string reached its end.
As with other functions in the String module, this function does not check for the validity of the codepoint. That said, if an invalid codepoint is found, it will be returned by this function.
Examples
iex> String.next_codepoint("josé")
{ "j", "osé" }
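Because the result is either a tuple or nil, this function lends itself to a simple recursive walk over a string. The sketch below counts codepoints that way. `CodepointWalk` is a hypothetical module written for this illustration.

```elixir
defmodule CodepointWalk do
  # Hypothetical helper: counts codepoints by calling
  # String.next_codepoint/1 until it returns nil.
  def count(string), do: count(string, 0)

  defp count(string, acc) do
    case String.next_codepoint(string) do
      {_codepoint, rest} -> count(rest, acc + 1)
      nil -> acc
    end
  end
end

# "josé", with the "é" given by its codepoint (233).
IO.inspect(CodepointWalk.count("jos" <> <<233::utf8>>))
IO.inspect(CodepointWalk.count(""))
```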
next_grapheme(string)
Returns the next grapheme in a String.
The result is a tuple with the grapheme and the
remainder of the string, or nil in case
the string reached its end.
Examples
iex> String.next_grapheme("josé")
{ "j", "osé" }
printable?(b)
Specs:
- printable?(t) :: boolean
Checks if a string is printable considering it is encoded
as UTF-8. Returns true if so, false otherwise.
Examples
iex> String.printable?("abc")
true
replace(subject, pattern, replacement, options \\ [])
Returns a new binary based on subject by replacing the parts
matching pattern by replacement. By default, it replaces
all entries, except if the global option is set to false.
A pattern may be a string or a regex.
Examples
iex> String.replace("a,b,c", ",", "-")
"a-b-c"
iex> String.replace("a,b,c", ",", "-", global: false)
"a-b,c"
The pattern can also be a regex. In those cases, one can give \N
in the replacement string to access a specific capture in the regex:
iex> String.replace("a,b,c", ~r/,(.)/, ",\\1\\1")
"a,bb,cc"
Notice we had to escape the escape character \. By giving &,
one can inject the whole matched pattern in the replacement string.
When strings are used as a pattern, a developer can also use the
replaced part inside the replacement via the :insert_replaced option:
iex> String.replace("a,b,c", "b", "[]", insert_replaced: 1)
"a,[b],c"
iex> String.replace("a,b,c", ",", "[]", insert_replaced: 2)
"a[],b[],c"
iex> String.replace("a,b,c", ",", "[]", insert_replaced: [1, 1])
"a[,,]b[,,]c"
reverse(string)
Reverses the given string. Works on graphemes.
Examples
iex> String.reverse("abcd")
"dcba"
iex> String.reverse("hello world")
"dlrow olleh"
iex> String.reverse("hello ∂og")
"go∂ olleh"
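Working on graphemes matters when a string contains combining characters. The sketch below, which uses illustrative variable names, contrasts String.reverse/1 with a naive reversal of the codepoint list.

```elixir
# "a" followed by a decomposed "é": the letter "e" plus a combining accent.
string = <<?a, 0x0065::utf8, 0x0301::utf8>>

# Reversing grapheme by grapheme keeps the accent attached to the "e".
by_grapheme = String.reverse(string)

# Reversing codepoint by codepoint moves the accent in front of the "e".
by_codepoint =
  string
  |> String.codepoints()
  |> Enum.reverse()
  |> Enum.join()

IO.inspect(by_grapheme == <<0x0065::utf8, 0x0301::utf8, ?a>>)
IO.inspect(by_codepoint == <<0x0301::utf8, 0x0065::utf8, ?a>>)
```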
rjust(subject, len) and rjust(subject, len, padding)
Returns a new string of length len with subject right justified and
padded with padding. If padding is not present, it defaults to
whitespace. When len is less than the length of subject, subject is
returned.
Examples
iex> String.rjust("abc", 5)
"  abc"
iex> String.rjust("abc", 5, ?-)
"--abc"
rstrip(binary)
Returns a string where trailing Unicode whitespace has been removed.
Examples
iex> String.rstrip(" abc ")
" abc"
rstrip(string, char)
Returns a string where trailing occurrences of char have been removed.
Examples
iex> String.rstrip(" abc _", ?_)
" abc "
slice(string, range)
Returns a substring from the offset given by the start of the range to the offset given by the end of the range.
If the start of the range is not a valid offset for the given
string or if the range is in reverse order, returns nil.
Examples
iex> String.slice("elixir", 1..3)
"lix"
iex> String.slice("elixir", 1..10)
"lixir"
iex> String.slice("elixir", 10..3)
nil
iex> String.slice("elixir", -4..-1)
"ixir"
iex> String.slice("elixir", 2..-1)
"ixir"
iex> String.slice("elixir", -4..6)
"ixir"
iex> String.slice("elixir", -1..-4)
nil
iex> String.slice("elixir", -10..-7)
nil
iex> String.slice("a", 0..1500)
"a"
iex> String.slice("a", 1..1500)
""
iex> String.slice("a", 2..1500)
nil
slice(string, start, len)
Returns a substring starting at the offset given by the first, and
a length given by the second.
If the offset is greater than string length, then it returns nil.
Examples
iex> String.slice("elixir", 1, 3)
"lix"
iex> String.slice("elixir", 1, 10)
"lixir"
iex> String.slice("elixir", 10, 3)
nil
iex> String.slice("elixir", -4, 4)
"ixir"
iex> String.slice("elixir", -10, 3)
nil
iex> String.slice("a", 0, 1500)
"a"
iex> String.slice("a", 1, 1500)
""
iex> String.slice("a", 2, 1500)
nil
split(binary)
Divides a string into substrings at each Unicode whitespace occurrence with leading and trailing whitespace ignored.
Examples
iex> String.split("foo bar")
["foo", "bar"]
iex> String.split("foo" <> <<194, 133>> <> "bar")
["foo", "bar"]
iex> String.split(" foo bar ")
["foo", "bar"]
split(binary, pattern, options \\ [])
Divides a string into substrings based on a pattern, returning a list of these substrings. The pattern can be a string, a list of strings or a regular expression.
The string is split into as many parts as possible by
default, unless the global option is set to false.
Empty strings are only removed from the result if the
trim option is set to true.
Examples
Splitting with a string pattern:
iex> String.split("a,b,c", ",")
["a", "b", "c"]
iex> String.split("a,b,c", ",", global: false)
["a", "b,c"]
iex> String.split(" a b c ", " ", trim: true)
["a", "b", "c"]
A list of patterns:
iex> String.split("1,2 3,4", [" ", ","])
["1", "2", "3", "4"]
A regular expression:
iex> String.split("a,b,c", ~r{,})
["a", "b", "c"]
iex> String.split("a,b,c", ~r{,}, global: false)
["a", "b,c"]
iex> String.split(" a b c ", ~r{\s}, trim: true)
["a", "b", "c"]
Splitting on empty patterns returns codepoints:
iex> String.split("abc", ~r{})
["a", "b", "c", ""]
iex> String.split("abc", "")
["a", "b", "c", ""]
iex> String.split("abc", "", trim: true)
["a", "b", "c"]
iex> String.split("abc", "", global: false)
["a", "bc"]
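Calls to split/2 compose naturally when input has more than one level of separators. As a small sketch, the following breaks a line of key/value fields into pairs. The input format and variable names are made up for this example.

```elixir
# A hypothetical line of comma-separated key=value fields.
line = "name=elixir,kind=language"

pairs =
  line
  |> String.split(",")
  |> Enum.map(fn field -> String.split(field, "=") end)

IO.inspect(pairs)
```

Each element of `pairs` is a two-element list holding the key and the value of one field.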
starts_with?(string, prefixes)
Returns true if string starts with any of the prefixes given, otherwise
false. prefixes can be either a single prefix or a list of prefixes.
Examples
iex> String.starts_with? "elixir", "eli"
true
iex> String.starts_with? "elixir", ["erlang", "elixir"]
true
iex> String.starts_with? "elixir", ["erlang", "ruby"]
false
strip(string)
Returns a string where leading/trailing Unicode whitespace has been removed.
Examples
iex> String.strip(" abc ")
"abc"
strip(string, char)
Returns a string where leading/trailing occurrences of char have been
removed.
Examples
iex> String.strip("a abc a", ?a)
" abc "
to_char_list(string)
Specs:
- to_char_list(String.t) :: {:ok, char_list} | {:error, [], binary} | {:incomplete, [], binary}
Converts a string into a char list converting each codepoint to its respective integer value.
Examples
iex> String.to_char_list("æß")
{ :ok, 'æß' }
iex> String.to_char_list("abc")
{ :ok, 'abc' }
to_char_list!(string)
Specs:
- to_char_list!(String.t) :: char_list | no_return
Converts a string into a char list converting each codepoint to its respective integer value.
In case the conversion fails or is incomplete,
it raises a String.UnicodeConversionError.
Examples
iex> String.to_char_list!("æß")
'æß'
iex> String.to_char_list!("abc")
'abc'
upcase(binary)
Converts all characters in the given string to uppercase.
Examples
iex> String.upcase("abcd")
"ABCD"
iex> String.upcase("ab 123 xpto")
"AB 123 XPTO"
iex> String.upcase("josé")
"JOSÉ"
valid?(arg1)
Specs:
- valid?(t) :: boolean
Checks whether str contains only valid characters.
Examples
iex> String.valid?("a")
true
iex> String.valid?("ø")
true
iex> String.valid?(<<0xffff :: 16>>)
false
iex> String.valid?("asd" <> <<0xffff :: 16>>)
false
valid_character?(codepoint)
Specs:
- valid_character?(t) :: boolean
Checks whether str is a valid character.
All characters are codepoints, but some codepoints are not valid characters. They may be reserved, private, or other.
More info at: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
Examples
iex> String.valid_character?("a")
true
iex> String.valid_character?("ø")
true
iex> String.valid_character?("\x{ffff}")
false