Regular expression is a useful string process utility for handling both pure text and markup language text like html. There are a few useful tools that are designed to use regular expression, such as grep and sed. In development language, most script languages, for example, javascript or perl, support regular expression. Regular expression has also been introduced into java as module for a long time. In C, the most famous regular expression library among many of those libraries developed would be POSIX regular expression API. As for C++, boost regular expression library was developed a few years ago. Then in C++11, regular expression was introduced into the standard library as <regex>. This blog is going to explain what is regular expression and how it is supported in c++11/14.
Regular expression syntax mode. For various existing tools and languages which support regular expressions, there is slight syntax difference between them. Two major regular expression relation standards exist in the industry before c++11: one is defined in POSIX and the other is part of ECMA-262(ECMAScript). Below listed major regular expression syntax:
- POSIX basic : This syntax is used by grep and sed and has the following features:
- Only characters: .[\*^$ are special; other feature characters need to be escaped, for example, grouping: \( and \).
- Back-references to previously marked sub-expressions are allowed. This is an NP-complete task.
- Using "leftmost longest" matching policy. That is when the matcher is matching a pattern against given text, it matches the beginning part as early as possible (which is the so-called leftmost). Among those matched candidates, it chooses the longest one as the regular expression result (which is what longest stands for).
- POSIX extended that is used by egrep and awk. This syntax is similar to POSIX basic but with the following differences:
- characters .[]+*?|\^$() are special characters. If these characters are escaped by “\” then the become normal characters. [Not quite understand: "escaped" means "not specified"? and then what become normal characters?]
- back-reference is not allowed.
- ECMAscript, which is known as JavaScript:
- Use depth first search matching policy.
- Non-greedy repeats.
To explain the policies of "leftmost longest" and "depth first", here is an example. If the given pattern "a|ab" is to match the subject text "ab", the result of leftmost-longest is "ab" and the result of depth-first is "a".
Because of these features of ECMAscript, the syntax is used widely for markup languages and for web text.
For example, there is often "<tab>...</tab>" in markup text. To match this text, ECMAscript syntax could be "<tab[^>]*>.*?</tab>", but this pattern in POSIX matches from the first <tab> to the last </tab>, including those "</tab>" in the middle of text. To achieve expected match, use POSIX regular expression syntax. It could be very complicated, like using "not" expression to disable middle <tab> and middle </tab>.
In C++11, when constructing an regex object, we can switch syntax mode with a flag in the constructor.
Regular expression library in c++
Below shows how c++11 standardizes regular expression.
Regular expression is introduced as a library <regex>. Through this header, a series of operations(functions) are included. All these operations use typical regex parameters. These parameters correspond to the concepts of regular expression below.
- Subject text: serial characters sequence[I'm not sure whether my revision of "serial characters" is correct. Need to check again the term.] which are checked to match the expression pattern.
- Pattern: the regular expression which is used to be matched.
- Match result: it saves the match result information
- Replacement string: It is in the replace function. The replacement is specified with special format. The format indicates how to replace the matched text.
In <regex> library, three kinds of regular expression operations are defined:
- match: checks whether the subject sequence matches the given pattern.
- search: checks whether some sub sequences in the subject match given pattern.
- replace: copies the subject with all matches of pattern replaced according to the replacement string.
Below are details of the operations above.
* regex_match. This function takes at least two parameters: subject sequence and pattern. It could take match_result as an output parameter, and it could take an flag parameter to specify the options of matching process.
The subject sequence could be specified through "const char *" or "std::string" or "['begin_iterator' and 'end_iterator').
The pattern is an object of std::regex (a.k.a const basic_regex<charT,traits>&).
The match_result is an object with the type corresponding to subject sequence, like std::cmatch (a.k.a std::match_results<const char*>), std::smatch (a.k.a std::match_results<string::const_iterator>).
Here are some examples which show the usage of this interface:
Example 1:
std::string sequence ("regular-expression");
std::regex expr ("regular(.*)");
if (std::regex_match (sequence.begin(), sequence.end(), exp)) { .... }
Example 2: form with cmatch result:
std::smatch str_match_result;
std::regex_match (sequence, str_match_result, expr);
for (auto& x : str_match_result) {
std::cout << x << ' ';
}
// will list "regular-expression -expression" for match the pattern and sub-pattern-group
NOTE: This operation will return true only if the entire subject sequence matches the pattern. If only part of the subject matches, use the regex_search operation.
** std::basic_regex :
In regex_match interface, two kinds of object are introduced. One is std::regex, which is a template instantiation of std::basic_regex with char type template parameter std::basic_regex<char>. std::wregex is the instantiation of std::basic_regex<wchar_t>.
The constructor functions of std::regex take pattern (string, const char* or iterator-range) parameter and optional type flag parameter. The type flag could set the regular expression syntax, which could be one of the following syntax:
ECMAScript, basic (posix basic), extended (posix extended), awk, grep, egrep
Apart from syntax, the type flag can also be specified with some other attributes, like icase(case insensitive),nosubs(no subexpression) or collate(local affect character range [a-z])
For example:
std::regex ex ("\\bword$", ECMAScript | icase );
In c++14, there is the move constructor of basic_regex.
** std::match_result
The other type which is used by regex_match interface is std::match_result. The standard of C++ defines 4 instantiations of template std::match_result:
std::smatch (a.k.a match_results<string::const_iterator> ), std::cmatch (a.k.a match_result<const char*>) and std::wsmatch, std::wcmatch for wide characters.
This match_result is used to contain the match/search result of a regular expression pattern against subject sequence. To retrieve the match sequence result, member function match_result::operator[] (or match_result::str(sizetype n = 0)) can be used. The index 0 indicates the entire match result and the indexes followed are sub-match results. The match result operator[] could return an object of std::sub_match template which represents sub-group of pattern.
The match_result also provides match_result::prefix() and match_result::suffix() to retrieve the text sequence leading ahead or followed the match text.
* regex_search, this function prototypes are similar to regex_match. However, this function returns true if anywhere of the subject matches the pattern, and this function will stop once a match is detected. See an example below.
std::string subject ("this text contains sub-text as sub-target to be searched");
std::smatch m_res;
std::regex expr (" sub-([^ ]*)");
while (std::regex_search (subject,m_res,expr)) {
for (auto& x:m_res) std::cout << x << " ";
std::cout << std::endl;
subject = m_res.suffix().str();
}
This example have the output below:
sub-text sub -text
sub-target sub -target
* regex_replace: this function takes these three parameters: subject sequence; regular expression pattern; and replacement string;. The output of this function would be std::string type object (or an output iterator if the input subject is specified through iterator). The example below shows how to utilize this function.
std::string subject ("this text contains sub-text as sub-target to be searched");
std::regex expr (" (sub-)([^ ]*) ");
std::cout << std::regex_replace (subject, expr, " $2 ");
//std::string res; std::regex_replace (std::back_insert(res), subject.begin(), subject.end(), expr, " $2 "); std::cout << res; //iterator form
The output of this example: "this text contains text as target to be searched"
** Replacement string fmt
In this function, a replacement string is required. The replacement string may need formatting the specifiers below:
$n : n>0; the n-th sub match
$& : entire match
$. : prefix of a match
$' : suffix of a match
$$: single $
regex_match, regex_search, regex_replace also take a flag parameter whose default value is "match_default" which indicates the default match behavior. Some potential value of this flag are match_not_bol ("^" not match), match_not_eol("$" not match), match_not_bow("\b" not match begin of word), match_not_eow("\b" not match end of word).
Algorithms of regular expression
The explains and examples above are a brief summary of the regular expression usage in c++. Readers may notice that there could be different regular algorithms specified (ECMAscript, POSIX-basic, POSIX-entended). How each algorithm works could not be covered in this paper, buttwo algorithms(state machine) are listed here for reference:
Backtracking NFA: typically used as algorithm for POSIX-basic, ECMAscript
Nonbacktracking NFA: used for POSIX-extended
Regular expression syntax example For regular expression syntax, users can refer to ECMAscript specification or POSIX for detailed information. Here is a simple list of common used syntax:
. match any single character (exclude newlines)
[] bracket expression, matches a single character that is in the bracket [abc] matches 'a', 'b' or 'c'. [a.b] matches 'a', '.' or 'b', here dot in bracket is dot itself
[^] match a single character that is not contained in brackets
^ matches start point of a line
$ matches end point of a line
* matches preceding element zero or more times
| choice operator: matches either the expression before "|" or the expression after "|"
{m,n} matches repeating preceding element at least "m" times and no more than "n" times
() Defines a sub-expression,
range: [a-z],[1-9] represent any single character 'a', 'b'...'z', or single digital character '1','2'...'9'
Reference:
http://www.cplusplus.com/reference/regex/ECMAScript/
http://en.wikipedia.org/wiki/Regular_expression
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1429.htm#regex_iterator_discussio
English Editor: Jun Qian Zhou (Ashley) - Many thanks!