String Normalization
Syntax
- normalize_string(s::String, …)
Parameters
Parameter | Details |
---|---|
casefold=true | Fold the string to a canonical case according to the Unicode standard. |
stripmark=true | Strip diacritical marks (such as accents) from characters in the input string. |
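As a quick illustration of the two keyword arguments, here is a minimal sketch. It assumes Julia 0.7 or later, where normalize_string was renamed to Unicode.normalize and moved to the Unicode standard library; on older versions, normalize_string takes the same keywords. The variable names are only for illustration.

```julia
using Unicode  # assumption: Julia 0.7+, where normalize_string is Unicode.normalize

# Case folding maps the string to a canonical case.
folded = Unicode.normalize("Hello", casefold=true)

# Mark stripping removes diacritical marks such as accents.
stripped = Unicode.normalize("résumé", stripmark=true)

println(folded)    # hello
println(stripped)  # resume
```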
Case-Insensitive String Comparison
Strings can be compared with the == operator in Julia, but this is sensitive to differences in case. For instance, "Hello" and "hello" are considered different strings.
julia> "Hello" == "Hello"
true
julia> "Hello" == "hello"
false
To compare strings in a case-insensitive manner, normalize the strings by case-folding them first. For example,
equals_ignore_case(s, t) =
    normalize_string(s, casefold=true) == normalize_string(t, casefold=true)
This approach also handles non-ASCII Unicode correctly:
julia> equals_ignore_case("Hello", "hello")
true
julia> equals_ignore_case("Weierstraß", "WEIERSTRASS")
true
Note that case folding maps ß to ss, reflecting that in German the uppercase form of the ß character is traditionally written SS.
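The ß example also shows why case folding, rather than simply lowercasing both sides, is the safer choice. The sketch below contrasts the two. It assumes Julia 0.7 or later and therefore redefines equals_ignore_case with Unicode.normalize in place of normalize_string; naive_equal and folded_equal are illustrative names.

```julia
using Unicode  # assumption: Julia 0.7+, where normalize_string is Unicode.normalize

# Same comparison as above, written against the newer API name.
equals_ignore_case(s, t) =
    Unicode.normalize(s, casefold=true) == Unicode.normalize(t, casefold=true)

# lowercase maps each character to a single character, so ß stays ß
# and the two lowercased strings have different lengths.
naive_equal = lowercase("Weierstraß") == lowercase("WEIERSTRASS")

# Case folding expands ß to ss, so both strings fold to the same form.
folded_equal = equals_ignore_case("Weierstraß", "WEIERSTRASS")

println(naive_equal)   # false
println(folded_equal)  # true
```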
Diacritic-Insensitive String Comparison
Sometimes one wants strings like "resume" and "résumé" to compare equal, that is, to treat graphemes as the same when they share a base glyph and differ only in the marks added to it. Such a comparison can be accomplished by stripping diacritical marks.
equals_ignore_mark(s, t) =
    normalize_string(s, stripmark=true) == normalize_string(t, stripmark=true)
This makes the above example compare equal. It also works for scripts other than Latin, such as Greek.
julia> equals_ignore_mark("resume", "résumé")
true
julia> equals_ignore_mark("αβγ", "ὰβ̂γ̆")
true
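The two options can plausibly be combined when a comparison should ignore both case and diacritical marks. The following is a minimal sketch, not part of the original examples: equals_ignore_case_and_mark is a hypothetical helper name, and the code assumes Julia 0.7 or later, where normalize_string is available as Unicode.normalize.

```julia
using Unicode  # assumption: Julia 0.7+, where normalize_string is Unicode.normalize

# Hypothetical helper: compare strings ignoring both case and diacritical marks.
equals_ignore_case_and_mark(s, t) =
    Unicode.normalize(s, casefold=true, stripmark=true) ==
    Unicode.normalize(t, casefold=true, stripmark=true)

println(equals_ignore_case_and_mark("RÉSUMÉ", "resume"))   # true
println(equals_ignore_case_and_mark("résumé", "resumes"))  # false
```

Passing both keywords in a single call keeps the normalization in one pass, instead of chaining two separate normalizations.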