String manipulation functions
Syntax
- index(big, little)
- length or length()
- length(string)
- match(string, regex)
- split(string, array, separator)
- split(string, array)
- sprintf(format, …)
- sub(regex, subst, string)
- sub(regex, subst)
- gsub(regex, subst)
- gsub(regex, subst, string)
- substr(string, start, length)
- substr(string, start)
- tolower(string)
- toupper(string)
Parameters
Parameter | Details |
---|---|
array | The array that receives the pieces produced by split. |
big | The string which is scanned for “little”. |
format | A printf format string. |
length | The maximum number of characters in the sub-string. |
little | The string to scan for in “big”. |
regex | An Extended-Regular-Expression. |
separator | The separator used by split; if omitted, FS is used. |
start | The index (counting from 1) at which to start the sub-string. |
string | A string. |
subst | The string to substitute in for the matched portion. |
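Several of the functions listed above (index, length, match, split and sprintf) have no example of their own on this page. The following is a minimal sketch of how they might be combined; the file name string_tour.awk and the sample log line are made up for illustration.

```shell
cat > string_tour.awk <<'EOF'
BEGIN {
    s = "2024-01-15 error: disk full"

    # index() returns the 1-based position of the first occurrence, or 0.
    print index(s, "error")

    # length() returns the number of characters in its argument.
    print length(s)

    # match() returns the start of the match and sets RSTART and RLENGTH.
    if (match(s, /[0-9]+-[0-9]+-[0-9]+/))
        print substr(s, RSTART, RLENGTH)

    # split() fills the array (indexed from 1) and returns the piece count.
    n = split(s, parts, " ")
    print n, parts[n]

    # sprintf() formats like printf but returns the string instead of printing it.
    print sprintf("%-6s|%03d", "ab", 7)
}
EOF
awk -f string_tour.awk
```

This should print 12, 27, 2024-01-15, "4 full" and "ab    |007", one per line.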
Converting string to upper case
The function toupper will convert a string to upper case (capital letters). For example:
BEGIN {
    greeting = "hello"
    loud_greeting = toupper(greeting)
    print loud_greeting
}
This code will output “HELLO” when run.
String Concatenation
String concatenation is done simply by writing expressions next to one another without any operator. For example:
BEGIN {
    user = "root"
    print "Hello "user "!"
}
will print: Hello root!
Note that expressions do not have to be separated by whitespace.
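Because concatenation has no operator of its own, its interaction with arithmetic can be surprising. The sketch below (the file name concat_demo.awk is hypothetical) illustrates that arithmetic binds more tightly than concatenation, and that parentheses make the intended grouping explicit.

```shell
cat > concat_demo.awk <<'EOF'
BEGIN {
    a = 2; b = 3
    # Arithmetic binds tighter than concatenation, so a + b is evaluated first.
    print "sum: " a + b
    # With no operator between them, the two values are simply joined.
    print "pair: " a b
    # Parentheses make the intended grouping explicit.
    print "diff: " (a - b)
}
EOF
awk -f concat_demo.awk
```

This prints "sum: 5", "pair: 23" and "diff: -1". When mixing concatenation with other operators, parenthesising each operand is a cheap way to avoid ambiguity.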
Computing a hash of a string
While implementing one of the standard hashing algorithms in awk is probably a tedious task, defining a hash function that can be used as a handle to text documents is much more tractable. Such a function is useful, for instance, to assign short ids to items given their description (test cases, for example), so that the user can refer to an item by its short id instead of supplying its long description.
The hash function needs to convert characters to numeric codes, which is accomplished by using a lookup table initialised at the beginning of the script. The hash function is then computed using modular arithmetic transformations, a very classical approach to the computation of hashes.
For demonstration purposes, we add a rule to decorate input lines with their hash, but this rule is not needed to use the function:
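BEGIN {
    # Lookup table mapping each character to its numeric code.
    for (n = 0; n < 256; n++) {
        ord[sprintf("%c", n)] = n
    }
}
# The extra parameters after "text" serve as local variables.
function hash(text,    _prime, _modulo, _ax, _chars, _i)
{
    _prime = 104729;
    _modulo = 1048576;
    _ax = 0;
    # Splitting on "" yields one character per element in gawk and most
    # other awks, although POSIX leaves this case unspecified.
    split(text, _chars, "");
    for (_i = 1; _i <= length(text); _i++) {
        _ax = (_ax * _prime + ord[_chars[_i]]) % _modulo;
    }
    return sprintf("%05x", _ax)
}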
# Rule to demonstrate the function
# These comments and the following line are not relevant
# to the definition of the hash function but illustrate
# its use.
{ printf("%s|%s\n", hash($0), $0) }
We save the program above to the file hash.awk and demonstrate it on a short list of classic English book titles:
awk -f hash.awk <<EOF
Wuthering Heights
Jane Eyre
Pride and Prejudice
The Mayor of Casterbridge
The Great Gatsby
David Copperfield
Great Expectations
The Return of the Soldier
Alice's Adventures in Wonderland
Animal Farm
EOF
The output is
6d6b1|Wuthering Heights
7539b|Jane Eyre
d8fba|Pride and Prejudice
fae95|The Mayor of Casterbridge
17fae|The Great Gatsby
c0005|David Copperfield
7492a|Great Expectations
12871|The Return of the Soldier
c3ab6|Alice's Adventures in Wonderland
46dc0|Animal Farm
When applied to each of the 6948 non-blank lines of my favourite novel, this hash function does not generate any collisions.
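A collision check of this kind could be scripted roughly as follows. This is only a sketch: the file name collisions.awk, the arrays seen and owner, and the four-line sample input are assumptions made for illustration, and the hash function is repeated so that the script is self-contained.

```shell
cat > collisions.awk <<'EOF'
BEGIN {
    for (n = 0; n < 256; n++)
        ord[sprintf("%c", n)] = n
}
function hash(text,    _prime, _modulo, _ax, _chars, _i, _n) {
    _prime = 104729
    _modulo = 1048576
    _ax = 0
    _n = split(text, _chars, "")
    for (_i = 1; _i <= _n; _i++)
        _ax = (_ax * _prime + ord[_chars[_i]]) % _modulo
    return sprintf("%05x", _ax)
}
# Consider each distinct non-blank line once and remember which text
# first produced each hash value.
NF && !($0 in seen) {
    seen[$0] = 1
    h = hash($0)
    if (h in owner) {
        # Two different texts share a hash: report both.
        collisions++
        printf("%s|%s|%s\n", h, owner[h], $0)
    } else
        owner[h] = $0
}
END { printf("%d collision(s)\n", collisions) }
EOF
printf 'a\nb\na\nab\n' | awk -f collisions.awk
```

On this tiny sample the script reports "0 collision(s)"; the repeated line "a" is not counted, because identical texts are expected to share a hash.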
Convert string to lower case
AWK is often used to process entire files containing a list of strings. Let’s say the file awk_test_file.txt contains:
First String
Second String
Third String
To convert all the strings to lower case, execute:
awk '{ print tolower($0) }' awk_test_file.txt
This produces:
first string
second string
third string
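The same function is a common way to match text regardless of case: lower-case the line first, then compare it against a lower-case pattern. A minimal sketch, which recreates the sample file from above so that it can be run on its own:

```shell
# Recreate the sample file used in this section.
printf 'First String\nSecond String\nThird String\n' > awk_test_file.txt

# Print the original line whenever its lower-cased form matches the pattern.
awk 'tolower($0) ~ /second/' awk_test_file.txt
```

This prints "Second String". Only the comparison uses the lower-cased copy; the line itself is printed unchanged.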
String text substitution
The sub function replaces the first match of a regular expression within a string:
sub(regexp, replacement, target)
where regexp can be any regular expression. If target is omitted, $0 is used.
$ cat file
AAAAA
BBBB
CCCC
DDDD
EEEE
FFFF
GGGG
$ awk '{sub("AAA","XXX", $0); print}' file
XXXAA
BBBB
CCCC
DDDD
EEEE
FFFF
GGGG
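The difference between sub and gsub, their return values, and the special meaning of & in the replacement text can be illustrated with a short sketch (the file name sub_demo.awk and the variable names are made up for illustration):

```shell
cat > sub_demo.awk <<'EOF'
BEGIN {
    # sub() replaces only the first match and returns 1 if it replaced, else 0.
    copy = "AAAAA"
    n = sub(/A/, "x", copy)
    print n, copy

    # gsub() replaces every non-overlapping match and returns the count.
    copy = "AAAAA"
    n = gsub(/AA/, "xy", copy)
    print n, copy

    # In the replacement text, & stands for the text that was matched.
    copy = "BBBB"
    sub(/B+/, "<&>", copy)
    print copy
}
EOF
awk -f sub_demo.awk
```

This prints "1 xAAAA", "2 xyxyA" and "<BBBB>". Both functions modify the target in place, so a copy is used here to keep the original value.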
Substring extraction
GNU awk supports a sub-string extraction function that returns a sequence of characters from a string. The syntax is
*substr(string, start [, length ])*
where string is the source string, start is the position at which the sub-string begins, and the optional length is the number of characters to extract. If length is not specified, the extraction runs to the end of the string.
The first character of the string is treated as character number one.
awk '
BEGIN {
    testString = "MyTESTstring"
    substring = substr(testString, 3, 4) # Start at character 3 for a length of 4 characters
    print substring
}'
will output the sub-string TEST.
awk '
BEGIN {
    testString = "MyTESTstring"
    substring = substr(testString, 3) # Start at character 3 till end of the string
    print substring
}'
This extracts the sub-string from character position 3 to the end of the string, returning TESTstring.
Note:
- If start is given a negative value and length is omitted, GNU awk returns the whole string. If length is also given, GNU awk may return a truncated or null string. This behaviour varies among different implementations of awk.
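Since out-of-range start values are handled differently across implementations, one defensive option is to normalise start before calling substr. The sketch below uses a hypothetical helper, safe_substr, which clamps start to 1; this is just one possible convention, not standard awk behaviour. It also shows how to take the last characters of a string by computing start from length.

```shell
cat > substr_demo.awk <<'EOF'
# Hypothetical helper: clamp start to 1 so that the result does not depend
# on how a particular awk treats start values below 1.
function safe_substr(s, start, len) {
    if (start < 1)
        start = 1
    return substr(s, start, len)
}
BEGIN {
    testString = "MyTESTstring"
    # A negative start is treated as 1, so this yields the first 4 characters.
    print safe_substr(testString, -2, 4)
    # The last 6 characters: start at length - 6 + 1.
    print substr(testString, length(testString) - 5)
}
EOF
awk -f substr_demo.awk
```

This prints "MyTE" and "string".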