- .\" $OpenBSD: re_format.7,v 1.14 2007/05/31 19:19:30 jmc Exp $
- .\"
- .\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
- .\"
- .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
- .\" Copyright (c) 1992, 1993, 1994
- .\" The Regents of the University of California. All rights reserved.
- .\"
- .\" This code is derived from software contributed to Berkeley by
- .\" Henry Spencer.
- .\"
- .\" Redistribution and use in source and binary forms, with or without
- .\" modification, are permitted provided that the following conditions
- .\" are met:
- .\" 1. Redistributions of source code must retain the above copyright
- .\" notice, this list of conditions and the following disclaimer.
- .\" 2. Redistributions in binary form must reproduce the above copyright
- .\" notice, this list of conditions and the following disclaimer in the
- .\" documentation and/or other materials provided with the distribution.
- .\" 3. Neither the name of the University nor the names of its contributors
- .\" may be used to endorse or promote products derived from this software
- .\" without specific prior written permission.
- .\"
- .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- .\" SUCH DAMAGE.
- .\"
- .\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
- .\"
- .Dd $Mdocdate: May 31 2007 $
- .Dt RE_FORMAT 7
- .Os
- .Sh NAME
- .Nm re_format
- .Nd POSIX regular expressions
- .Sh DESCRIPTION
- Regular expressions (REs),
- as defined in
- .St -p1003.1-2004 ,
- come in two forms:
- basic regular expressions
- (BREs)
- and extended regular expressions
- (EREs).
- Both forms of regular expressions are supported
- by the interfaces described in
- .Xr regex 3 .
- Applications dealing with regular expressions
- may use one or the other form
- (or indeed both).
- For example,
- .Xr ed 1
- uses BREs,
- whilst
- .Xr egrep 1
- uses EREs.
- Consult the manual page for the specific application to find out which
- it uses.
- .Pp
- POSIX leaves some aspects of RE syntax and semantics open;
- .Sq **
- marks decisions on these aspects that
- may not be fully portable to other POSIX implementations.
- .Pp
- This manual page first describes regular expressions in general,
- specifically extended regular expressions,
- and then discusses differences between them and basic regular expressions.
- .Sh EXTENDED REGULAR EXPRESSIONS
- An ERE is one** or more non-empty**
- .Em branches ,
- separated by
- .Sq \*(Ba .
- It matches anything that matches one of the branches.
- .Pp
- A branch is one** or more
- .Em pieces ,
- concatenated.
- It matches a match for the first, followed by a match for the second, etc.
- .Pp
- A piece is an
- .Em atom
- possibly followed by a single**
- .Sq * ,
- .Sq + ,
- .Sq ?\& ,
- or
- .Em bound .
- An atom followed by
- .Sq *
- matches a sequence of 0 or more matches of the atom.
- An atom followed by
- .Sq +
- matches a sequence of 1 or more matches of the atom.
- An atom followed by
- .Sq ?\&
- matches a sequence of 0 or 1 matches of the atom.
- .Pp
- A bound is
- .Sq {
- followed by an unsigned decimal integer,
- possibly followed by
- .Sq ,\&
- possibly followed by another unsigned decimal integer,
- always followed by
- .Sq } .
- The integers must lie between 0 and
- .Dv RE_DUP_MAX
- (255**) inclusive,
- and if there are two of them, the first may not exceed the second.
- An atom followed by a bound containing one integer
- .Ar i
- and no comma matches
- a sequence of exactly
- .Ar i
- matches of the atom.
- An atom followed by a bound
- containing one integer
- .Ar i
- and a comma matches
- a sequence of
- .Ar i
- or more matches of the atom.
- An atom followed by a bound
- containing two integers
- .Ar i
- and
- .Ar j
- matches a sequence of
- .Ar i
- through
- .Ar j
- (inclusive) matches of the atom.
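- .Pp
- The bound rules above can be exercised through
- .Xr regex 3 .
- The following is a minimal sketch, not a definitive implementation;
- the helper name
- .Fn ere_matches
- is hypothetical and introduced only for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of regex(3)): returns 1 if the ERE
 * matches somewhere in str, 0 if it does not, and -1 if the pattern
 * is rejected by regcomp().
 */
int
ere_matches(const char *pattern, const char *str)
{
	regex_t re;
	int rc;

	/* REG_EXTENDED selects ERE syntax; REG_NOSUB skips offset reporting. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, str, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

- Under these assumptions,
- .Sq ^a{2,3}$
- accepts
- .Sq aa
- and
- .Sq aaa
- but rejects
- .Sq a
- and
- .Sq aaaa ,
- and a bound whose first integer exceeds the second, such as
- .Sq a{3,2} ,
- should fail to compile.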
- .Pp
- An atom is a regular expression enclosed in
- .Sq ()
- (matching a part of the regular expression),
- an empty set of
- .Sq ()
- (matching the null string)**,
- a
- .Em bracket expression
- (see below),
- .Sq .\&
- (matching any single character),
- .Sq ^
- (matching the null string at the beginning of a line),
- .Sq $
- (matching the null string at the end of a line),
- a
- .Sq \e
- followed by one of the characters
- .Sq ^.[$()|*+?{\e
- (matching that character taken as an ordinary character),
- a
- .Sq \e
- followed by any other character**
- (matching that character taken as an ordinary character,
- as if the
- .Sq \e
- had not been present**),
- or a single character with no other significance (matching that character).
- A
- .Sq {
- followed by a character other than a digit is an ordinary character,
- not the beginning of a bound**.
- It is illegal to end an RE with
- .Sq \e .
- .Pp
- A bracket expression is a list of characters enclosed in
- .Sq [] .
- It normally matches any single character from the list (but see below).
- If the list begins with
- .Sq ^ ,
- it matches any single character
- .Em not
- from the rest of the list
- (but see below).
- If two characters in the list are separated by
- .Sq - ,
- this is shorthand for the full
- .Em range
- of characters between those two (inclusive) in the
- collating sequence, e.g.\&
- .Sq [0-9]
- in ASCII matches any decimal digit.
- It is illegal** for two ranges to share an endpoint, e.g.\&
- .Sq a-c-e .
- Ranges are very collating-sequence-dependent,
- and portable programs should avoid relying on them.
- .Pp
- To include a literal
- .Sq ]\&
- in the list, make it the first character
- (following a possible
- .Sq ^ ) .
- To include a literal
- .Sq - ,
- make it the first or last character,
- or the second endpoint of a range.
- To use a literal
- .Sq -
- as the first endpoint of a range,
- enclose it in
- .Sq [.
- and
- .Sq .]
- to make it a collating element (see below).
- With the exception of these and some combinations using
- .Sq [
- (see next paragraphs),
- all other special characters, including
- .Sq \e ,
- lose their special significance within a bracket expression.
- .Pp
- Within a bracket expression, a collating element
- (a character,
- a multi-character sequence that collates as if it were a single character,
- or a collating-sequence name for either)
- enclosed in
- .Sq [.
- and
- .Sq .]
- stands for the sequence of characters of that collating element.
- The sequence is a single element of the bracket expression's list.
- A bracket expression containing a multi-character collating element
- can thus match more than one character,
- e.g.\& if the collating sequence includes a
- .Sq ch
- collating element,
- then the RE
- .Sq [[.ch.]]*c
- matches the first five characters of
- .Sq chchcc .
- .Pp
- Within a bracket expression, a collating element enclosed in
- .Sq [=
- and
- .Sq =]
- is an equivalence class, standing for the sequences of characters
- of all collating elements equivalent to that one, including itself.
- (If there are no other equivalent collating elements,
- the treatment is as if the enclosing delimiters were
- .Sq [.
- and
- .Sq .] . )
- For example, if
- .Sq x
- and
- .Sq y
- are the members of an equivalence class,
- then
- .Sq [[=x=]] ,
- .Sq [[=y=]] ,
- and
- .Sq [xy]
- are all synonymous.
- An equivalence class may not** be an endpoint of a range.
- .Pp
- Within a bracket expression, the name of a
- .Em character class
- enclosed
- in
- .Sq [:
- and
- .Sq :]
- stands for the list of all characters belonging to that class.
- Standard character class names are:
- .Bd -literal -offset indent
- alnum digit punct
- alpha graph space
- blank lower upper
- cntrl print xdigit
- .Ed
- .Pp
- These stand for the character classes defined in
- .Xr ctype 3 .
- A locale may provide others.
- A character class may not be used as an endpoint of a range.
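- .Pp
- As an illustration of character classes inside bracket expressions,
- the sketch below checks whether a string looks like a C-style identifier.
- The function name
- .Fn is_identifier
- and the choice of pattern are assumptions made for this example only.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical example: returns 1 if str consists of a letter or
 * underscore followed by any number of letters, digits or underscores,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int
is_identifier(const char *str)
{
	/* [:alpha:] and [:alnum:] are only valid inside an outer []. */
	static const char pattern[] = "^[[:alpha:]_][[:alnum:]_]*$";
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, str, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

- Which characters belong to each class depends on the current locale,
- so results for non-ASCII input may differ between systems.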
- .Pp
- There are two special cases** of bracket expressions:
- the bracket expressions
- .Sq [[:<:]]
- and
- .Sq [[:>:]]
- match the null string at the beginning and end of a word, respectively.
- A word is defined as a sequence of
- characters starting and ending with a word character
- which is neither preceded nor followed by
- word characters.
- A word character is an
- .Em alnum
- character (as defined by
- .Xr ctype 3 )
- or an underscore.
- This is an extension,
- compatible with but not specified by POSIX,
- and should be used with
- caution in software intended to be portable to other systems.
- .Pp
- In the event that an RE could match more than one substring of a given
- string,
- the RE matches the one starting earliest in the string.
- If the RE could match more than one substring starting at that point,
- it matches the longest.
- Subexpressions also match the longest possible substrings, subject to
- the constraint that the whole match be as long as possible,
- with subexpressions starting earlier in the RE taking priority over
- ones starting later.
- Note that higher-level subexpressions thus take priority over
- their lower-level component subexpressions.
- .Pp
- Match lengths are measured in characters, not collating elements.
- A null string is considered longer than no match at all.
- For example,
- .Sq bb*
- matches the three middle characters of
- .Sq abbbc ;
- .Sq (wee|week)(knights|nights)
- matches all ten characters of
- .Sq weeknights ;
- when
- .Sq (.*).*
- is matched against
- .Sq abc ,
- the parenthesized subexpression matches all three characters;
- and when
- .Sq (a*)*
- is matched against
- .Sq bc ,
- both the whole RE and the parenthesized subexpression match the null string.
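- .Pp
- The leftmost-longest rule can be observed by asking
- .Xr regexec 3
- for the offsets of the overall match.
- This is a minimal sketch;
- .Fn match_span
- is a hypothetical helper, not part of the library.

```c
#include <regex.h>

/*
 * Hypothetical helper: compiles pattern as an ERE, matches it against
 * str and stores the starting offset and length of the overall match.
 * Returns 0 on a match, -1 if the pattern does not compile, and the
 * nonzero regexec() status otherwise.
 */
int
match_span(const char *pattern, const char *str, int *start, int *len)
{
	regex_t re;
	regmatch_t m[1];
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	/* m[0] receives the offsets of the whole match. */
	rc = regexec(&re, str, 1, m, 0);
	regfree(&re);
	if (rc != 0)
		return rc;
	*start = (int)m[0].rm_so;
	*len = (int)(m[0].rm_eo - m[0].rm_so);
	return 0;
}
```

- With a conforming implementation,
- .Sq bb*
- against
- .Sq abbbc
- should report offset 1 and length 3, and
- .Sq (wee|week)(knights|nights)
- against
- .Sq weeknights
- should report offset 0 and length 10.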
- .Pp
- If case-independent matching is specified,
- the effect is much as if all case distinctions had vanished from the
- alphabet.
- When an alphabetic character that exists in multiple cases appears as an
- ordinary character outside a bracket expression, it is effectively
- transformed into a bracket expression containing both cases,
- e.g.\&
- .Sq x
- becomes
- .Sq [xX] .
- When it appears inside a bracket expression,
- all case counterparts of it are added to the bracket expression,
- so that, for example,
- .Sq [x]
- becomes
- .Sq [xX]
- and
- .Sq [^x]
- becomes
- .Sq [^xX] .
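- .Pp
- Case-independent matching is requested from
- .Xr regcomp 3
- with the
- .Dv REG_ICASE
- flag.
- The sketch below, built around the hypothetical helper
- .Fn matches_with_flags ,
- shows the effect on an ordinary character and on a negated
- bracket expression.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if pattern, compiled with the given
 * regcomp() flags, matches somewhere in str; 0 if it does not; and -1
 * if the pattern fails to compile.
 */
int
matches_with_flags(const char *pattern, const char *str, int cflags)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, str, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

- With
- .Dv REG_EXTENDED | REG_ICASE ,
- the pattern
- .Sq ^x$
- is expected to match
- .Sq X ,
- while
- .Sq ^[^x]$
- is expected to reject it, since the bracket expression behaves like
- .Sq [^xX] .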
- .Pp
- No particular limit is imposed on the length of REs**.
- Programs intended to be portable should not employ REs longer
- than 256 bytes,
- as an implementation can refuse to accept such REs and remain
- POSIX-compliant.
- .Pp
- The following is a list of extended regular expressions:
- .Bl -tag -width Ds
- .It Ar c
- Any character
- .Ar c
- not listed below matches itself.
- .It \e Ns Ar c
- Any backslash-escaped character
- .Ar c
- matches itself.
- .It \&.
- Matches any single character that is not a newline
- .Pq Sq \en .
- .It Bq Ar char-class
- Matches any single character in
- .Ar char-class .
- To include a
- .Ql \&]
- in
- .Ar char-class ,
- it must be the first character.
- A range of characters may be specified by separating the end characters
- of the range with a
- .Ql - ;
- e.g.\&
- .Ar a-z
- specifies the lower case characters.
- The following literal expressions can also be used in
- .Ar char-class
- to specify sets of characters:
- .Bd -unfilled -offset indent
- [:alnum:] [:cntrl:] [:lower:] [:space:]
- [:alpha:] [:digit:] [:print:] [:upper:]
- [:blank:] [:graph:] [:punct:] [:xdigit:]
- .Ed
- .Pp
- If
- .Ql -
- appears as the first or last character of
- .Ar char-class ,
- then it matches itself.
- All other characters in
- .Ar char-class
- match themselves.
- .Pp
- Patterns in
- .Ar char-class
- of the form
- .Eo [.
- .Ar col-elm
- .Ec .]\&
- or
- .Eo [=
- .Ar col-elm
- .Ec =]\& ,
- where
- .Ar col-elm
- is a collating element, are interpreted according to
- .Xr setlocale 3
- .Pq not currently supported .
- .It Bq ^ Ns Ar char-class
- Matches any single character, other than newline, not in
- .Ar char-class .
- .Ar char-class
- is defined as above.
- .It ^
- Anchors the regular expression to the beginning of a line.
- In an ERE,
- .Sq ^
- is an anchor wherever it appears outside a bracket expression;
- to match a literal
- .Sq ^ ,
- precede it with
- .Sq \e .
- .It $
- Anchors the regular expression to the end of a line.
- In an ERE,
- .Sq $
- is an anchor wherever it appears outside a bracket expression;
- to match a literal
- .Sq $ ,
- precede it with
- .Sq \e .
- .It [[:<:]]
- Anchors the single character regular expression or subexpression
- immediately following it to the beginning of a word.
- .It [[:>:]]
- Anchors the single character regular expression or subexpression
- immediately following it to the end of a word.
- .It Pq Ar re
- Defines a subexpression
- .Ar re .
- Any set of characters enclosed in parentheses
- matches whatever the set of characters without parentheses matches
- (that is a long-winded way of saying the constructs
- .Sq (re)
- and
- .Sq re
- match identically).
- .It *
- Matches the single character regular expression or subexpression
- immediately preceding it zero or more times.
- If
- .Sq *
- is the first character of a regular expression or subexpression,
- then it matches itself.
- The
- .Sq *
- operator sometimes yields unexpected results.
- For example, the regular expression
- .Ar b*
- matches the beginning of the string
- .Qq abbb
- (as opposed to the substring
- .Qq bbb ) ,
- since a null match is the only leftmost match.
- .It +
- Matches the single character regular expression
- or subexpression immediately preceding it
- one or more times.
- .It ?
- Matches the single character regular expression
- or subexpression immediately preceding it
- 0 or 1 times.
- .Sm off
- .It Xo
- .Pf { Ar n , m No }\ \&
- .Pf { Ar n , No }\ \&
- .Pf { Ar n No }
- .Xc
- .Sm on
- Matches the single character regular expression or subexpression
- immediately preceding it at least
- .Ar n
- and at most
- .Ar m
- times.
- If
- .Ar m
- is omitted, then it matches at least
- .Ar n
- times.
- If the comma is also omitted, then it matches exactly
- .Ar n
- times.
- .It \*(Ba
- Used to separate patterns.
- For example,
- the pattern
- .Sq cat\*(Badog
- matches either
- .Sq cat
- or
- .Sq dog .
- .El
- .Sh BASIC REGULAR EXPRESSIONS
- Basic regular expressions differ in several respects:
- .Bl -bullet -offset 3n
- .It
- .Sq \*(Ba ,
- .Sq + ,
- and
- .Sq ?\&
- are ordinary characters and there is no equivalent
- for their functionality.
- .It
- The delimiters for bounds are
- .Sq \e{
- and
- .Sq \e} ,
- with
- .Sq {
- and
- .Sq }
- by themselves ordinary characters.
- .It
- The parentheses for nested subexpressions are
- .Sq \e(
- and
- .Sq \e) ,
- with
- .Sq (
- and
- .Sq )\&
- by themselves ordinary characters.
- .It
- .Sq ^
- is an ordinary character except at the beginning of the
- RE or** the beginning of a parenthesized subexpression.
- .It
- .Sq $
- is an ordinary character except at the end of the
- RE or** the end of a parenthesized subexpression.
- .It
- .Sq *
- is an ordinary character if it appears at the beginning of the
- RE or the beginning of a parenthesized subexpression
- (after a possible leading
- .Sq ^ ) .
- .It
- Finally, there is one new type of atom, a
- .Em back-reference :
- .Sq \e
- followed by a non-zero decimal digit
- .Ar d
- matches the same sequence of characters matched by the
- .Ar d Ns th
- parenthesized subexpression
- (numbering subexpressions by the positions of their opening parentheses,
- left to right),
- so that, for example,
- .Sq \e([bc]\e)\e1
- matches
- .Sq bb\&
- or
- .Sq cc
- but not
- .Sq bc .
- .El
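- .Pp
- The differences listed above can be checked by compiling a pattern
- without
- .Dv REG_EXTENDED ,
- which selects BRE syntax.
- This is a minimal sketch;
- .Fn bre_matches
- is a hypothetical helper introduced for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the BRE matches somewhere in str,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int
bre_matches(const char *pattern, const char *str)
{
	regex_t re;
	int rc;

	/* A cflags value of 0 (no REG_EXTENDED) means BRE syntax. */
	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	rc = regexec(&re, str, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

- In C source each backslash must be doubled, so the back-reference example
- is written as
- .Qq ^\e\e([bc]\e\e)\e\e1$ .
- It is expected to match
- .Sq bb
- and
- .Sq cc
- but not
- .Sq bc .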
- .Pp
- The following is a list of basic regular expressions:
- .Bl -tag -width Ds
- .It Ar c
- Any character
- .Ar c
- not listed below matches itself.
- .It \e Ns Ar c
- Any backslash-escaped character
- .Ar c ,
- except for
- .Sq { ,
- .Sq } ,
- .Sq \&( ,
- and
- .Sq \&) ,
- matches itself.
- .It \&.
- Matches any single character that is not a newline
- .Pq Sq \en .
- .It Bq Ar char-class
- Matches any single character in
- .Ar char-class .
- To include a
- .Ql \&]
- in
- .Ar char-class ,
- it must be the first character.
- A range of characters may be specified by separating the end characters
- of the range with a
- .Ql - ;
- e.g.\&
- .Ar a-z
- specifies the lower case characters.
- The following literal expressions can also be used in
- .Ar char-class
- to specify sets of characters:
- .Bd -unfilled -offset indent
- [:alnum:] [:cntrl:] [:lower:] [:space:]
- [:alpha:] [:digit:] [:print:] [:upper:]
- [:blank:] [:graph:] [:punct:] [:xdigit:]
- .Ed
- .Pp
- If
- .Ql -
- appears as the first or last character of
- .Ar char-class ,
- then it matches itself.
- All other characters in
- .Ar char-class
- match themselves.
- .Pp
- Patterns in
- .Ar char-class
- of the form
- .Eo [.
- .Ar col-elm
- .Ec .]\&
- or
- .Eo [=
- .Ar col-elm
- .Ec =]\& ,
- where
- .Ar col-elm
- is a collating element, are interpreted according to
- .Xr setlocale 3
- .Pq not currently supported .
- .It Bq ^ Ns Ar char-class
- Matches any single character, other than newline, not in
- .Ar char-class .
- .Ar char-class
- is defined as above.
- .It ^
- If
- .Sq ^
- is the first character of a regular expression, then it
- anchors the regular expression to the beginning of a line.
- Otherwise, it matches itself.
- .It $
- If
- .Sq $
- is the last character of a regular expression,
- it anchors the regular expression to the end of a line.
- Otherwise, it matches itself.
- .It [[:<:]]
- Anchors the single character regular expression or subexpression
- immediately following it to the beginning of a word.
- .It [[:>:]]
- Anchors the single character regular expression or subexpression
- immediately following it to the end of a word.
- .It \e( Ns Ar re Ns \e)
- Defines a subexpression
- .Ar re .
- Subexpressions may be nested.
- A subsequent backreference of the form
- .Pf \e Ns Ar n ,
- where
- .Ar n
- is a number in the range [1,9], expands to the text matched by the
- .Ar n Ns th
- subexpression.
- For example, the regular expression
- .Ar \e(.*\e)\e1
- matches any string consisting of identical adjacent substrings.
- Subexpressions are ordered relative to their left delimiter.
- .It *
- Matches the single character regular expression or subexpression
- immediately preceding it zero or more times.
- If
- .Sq *
- is the first character of a regular expression or subexpression,
- then it matches itself.
- The
- .Sq *
- operator sometimes yields unexpected results.
- For example, the regular expression
- .Ar b*
- matches the beginning of the string
- .Qq abbb
- (as opposed to the substring
- .Qq bbb ) ,
- since a null match is the only leftmost match.
- .Sm off
- .It Xo
- .Pf \e{ Ar n , m No \e}\ \&
- .Pf \e{ Ar n , No \e}\ \&
- .Pf \e{ Ar n No \e}
- .Xc
- .Sm on
- Matches the single character regular expression or subexpression
- immediately preceding it at least
- .Ar n
- and at most
- .Ar m
- times.
- If
- .Ar m
- is omitted, then it matches at least
- .Ar n
- times.
- If the comma is also omitted, then it matches exactly
- .Ar n
- times.
- .El
- .Sh SEE ALSO
- .Xr ctype 3 ,
- .Xr regex 3
- .Sh STANDARDS
- .St -p1003.1-2004 :
- Base Definitions, Chapter 9 (Regular Expressions).
- .Sh BUGS
- Having two kinds of REs is a botch.
- .Pp
- The current POSIX spec says that
- .Sq )\&
- is an ordinary character in the absence of an unmatched
- .Sq ( ;
- this was an unintentional result of a wording error,
- and change is likely.
- Avoid relying on it.
- .Pp
- Back-references are a dreadful botch,
- posing major problems for efficient implementations.
- They are also somewhat vaguely defined
- (does
- .Sq a\e(\e(b\e)*\e2\e)*d
- match
- .Sq abbbd ? ) .
- Avoid using them.
- .Pp
- POSIX's specification of case-independent matching is vague.
- The
- .Dq one case implies all cases
- definition given above
- is the current consensus among implementors as to the right interpretation.
- .Pp
- The syntax for word boundaries is incredibly ugly.