Regular Expressions

Word Boundary

Syntax#

  • POSIX style, end of word: [[:>:]]
  • POSIX style, start of word: [[:<:]]
  • POSIX style, word boundary: [[:<:][:>:]]
  • SVR4/GNU, end of word: \>
  • SVR4/GNU, start of word: \<
  • Perl/GNU, word boundary: \b
  • Tcl, end of word: \M
  • Tcl, start of word: \m
  • Tcl, word boundary: \y
  • Portable ERE, start of word: (^|[^[:alnum:]_])
  • Portable ERE, end of word: ([^[:alnum:]_]|$)

Remarks#

Additional Resources

Match complete word

\bfoo\b

will match the complete word with no alphanumeric and _ preceding or following by it.

Taking from regularexpression.info

There are three different positions that qualify as word boundaries:

  1. Before the first character in the string, if the first character is a word character.
  2. After the last character in the string, if the last character is a word character.
  3. Between two characters in the string, where one is a word character and the other is not a word character.

The term word character here means any of the following

  1. Alphabet([a-zA-Z])
  2. Number([0-9])
  3. Underscore _

In short, word character = \w = [a-zA-Z0-9_]

Find patterns at the beginning or end of a word

Examine the following strings:

foobarfoo
bar
foobar
barfoo
  • the regular expression bar will match all four strings,
  • \bbar\b will only match the 2nd,
  • bar\b will be able to match the 2nd and 3rd strings, and
  • \bbar will match the 2nd and 4th strings.

Word boundaries

The \b metacharacter

To make it easier to find whole words, we can use the metacharacter \b. It marks the beginning and the end of an alphanumeric sequence*. Also, since it only serves to mark this locations, it actually matches no character on its own.

*: It is common to call an alphanumeric sequence a word, since we can catch it’s characters with a \w (the word characters class). This can be misleading, though, since \w also includes numbers and, in most flavors, the underscore.

Examples:

Regex Input Matches?
\bstack\b stackoverflow No, since there’s no ocurrence of the whole word stack
\bstack\b foo stack bar Yes, since there’s nothing before nor after stack
\bstack\b stack!overflow Yes: there’s nothing before stack and !is not a word character
\bstack stackoverflow Yes, since there’s nothing before stack
overflow\b stackoverflow Yes, since there’s nothing after overflow

The \B metacharacter

This is the opposite of \b, matching against the location of every non-boundary character. Like \b, since it matches locations, it matches no character on its own. It is useful for finding non whole words.

Examples:

Regex Input Matches?
\Bb\B abc Yes, since b is not surrounded by word boundaries.
\Ba\B abc No, a has a word boundary on its left side.
a\B abc Yes, a does not have a word boundary on its right side.
\B,\B a,,,b Yes, it matches the second comma because \B will also match the space between two non-word characters (it should be noted that there is a word boundary to the left of the first comma and to the right of the second).

Make text shorter but don’t break last word

To make long text at most N characters long but leave last word intact, use .{0,N}\b pattern:

^(.{0,N})\b.*

This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow