Regular expressions entered popular use from 1968 in two uses: pattern matching in a text editor[9] and lexical analysis in a compiler.[10] Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files.[15] Around the same time that Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that was used for lexical analysis in compiler design.
These rules maintain existing features of Perl 5.x regexes, but also allow BNF-style definition of a recursive descent parser via sub-rules.
Prior to the use of regular expressions, many search languages allowed simple wildcards, for example "*" to match any sequence of characters, and "?" to match a single character.
The phrase regular expressions, or regexes, is often used to mean the specific, standard textual syntax for representing patterns for matching text, as distinct from the mathematical notation described below.
The metacharacter syntax is designed specifically to represent prescribed targets in a concise and flexible way to direct the automation of text processing of a variety of input data, in a form easy to type using a standard ASCII keyboard.
A regex processor translates a regular expression in the above syntax into an internal representation that can be executed and matched against a string representing the text being searched.
One possible approach is Thompson's construction algorithm, which builds a nondeterministic finite automaton (NFA); the NFA is then made deterministic, and the resulting deterministic finite automaton (DFA) is run on the target text string to recognize substrings that match the regular expression.
For instance, determining the validity of a given ISBN requires computing the modulus of the integer base 11, and can be easily implemented with an 11-state DFA.[30][31] Every regular expression can be written solely in terms of the Kleene star and set unions over finite words.
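The mod-11 ISBN-10 check mentioned above can be sketched in Python. The helper name isbn10_valid and the prefix-sum trick are illustrative choices, not from the source; the trick keeps only two accumulators reduced mod 11, mirroring the constant-state, one-pass flavor of the DFA:

```python
def isbn10_valid(isbn: str) -> bool:
    """One-pass ISBN-10 check keeping only two mod-11 accumulators."""
    if len(isbn) != 10:
        return False
    t = s = 0
    for ch in isbn:
        if ch in "Xx":
            d = 10          # 'X' stands for the digit ten (sketch: accepted in any position)
        elif ch.isdigit():
            d = int(ch)
        else:
            return False
        t = (t + d) % 11    # running digit sum
        s = (s + t) % 11    # sum of prefix sums == the 10..1 weighted checksum, mod 11
    return s == 0
```

Because the two accumulators are always reduced mod 11, the scan visits each character once with a bounded amount of state, which is exactly what a finite automaton does.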
In 1991, Dexter Kozen axiomatized regular expressions as a Kleene algebra, using equational and Horn clause axioms.[32] Already in 1964, Redko had proved that no finite set of purely equational axioms can characterize the algebra of regular languages.
A similar convention is used in sed, where search and replace is given by s/re/replacement/ and patterns can be joined with a comma to specify a range of lines as in /re1/,/re2/.
Additional functionality includes lazy matching, backreferences, named capture groups, and recursive patterns.
According to Russ Cox, the POSIX specification requires ambiguous subexpressions to be handled in a way different from Perl's.[37] The meaning of metacharacters escaped with a backslash is reversed for some characters in the POSIX Extended Regular Expression (ERE) syntax.
Additionally, support for \n backreferences is removed, and several new metacharacters are added. POSIX Extended Regular Expressions can often be used with modern Unix utilities by including the command-line flag -E.

The character class is the most basic regex concept after a literal match.
Because of its expressive power and (relative) ease of reading, many other utilities and programming languages have adopted syntax similar to Perl's—for example, Java, JavaScript, Julia, Python, Ruby, Qt, Microsoft's .NET Framework, and XML Schema.
The regex ".+" (including the double-quotes), applied to the string "Ganymede," he continued, "is the largest moon in the Solar System." matches the entire line (because the line begins and ends with a double-quote) instead of matching only the first part, "Ganymede,".[39] The aforementioned quantifiers may, however, be made lazy or minimal or reluctant, matching as few characters as possible, by appending a question mark: ".+?" matches only "Ganymede,".[39] In Java and Python 3.11+,[40] quantifiers may be made possessive by appending a plus sign, which disables backing off (in a backtracking engine), even if doing so would allow the overall match to succeed.[41]
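Assuming a Perl-style engine such as Python's re module, the three behaviors can be compared directly (the sample line is the one from the text; the sketch guards the possessive form because it requires Python 3.11+):

```python
import re
import sys

line = '"Ganymede," he continued, "is the largest moon in the Solar System."'

greedy = re.search(r'".+"', line).group()   # greedy: expands to the last double-quote
lazy = re.search(r'".+?"', line).group()    # lazy: stops at the first closing quote

possessive = None
if sys.version_info >= (3, 11):             # possessive quantifiers arrived in 3.11
    # .++ consumes to the end of the string and never backs off,
    # so the final double-quote can never be matched.
    possessive = re.search(r'".++"', line)
```

Here greedy is the whole line, lazy is just "Ganymede," (quotes included), and the possessive search finds no match at all.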
This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
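A minimal sketch of such a square-matching pattern with Python's re module (the anchors ^…$ are added here so the whole string must be a square):

```python
import re

# The backreference \1 repeats whatever group 1 captured, so (.+)\1
# matches exactly the "squares" ww of formal language theory.
square = re.compile(r'^(.+)\1$')
```

For example, "papa" matches as "pa" + "pa" and "WikiWiki" as "Wiki" + "Wiki", while a string of odd length can never match.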
However, pattern matching with an unbounded number of backreferences, as supported by numerous modern tools, is still context sensitive.
This has led to a nomenclature where the term regular expression has different meanings in formal language theory and pattern matching.
An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step.
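As a sketch of this on-demand simulation, consider a tiny hand-built NFA (the regex ab*a and its state numbering are invented for illustration). Tracking the set of live NFA states per input character amounts to computing one DFA state at a time and then discarding it:

```python
# A hand-built, epsilon-free NFA for the regex ab*a:
#   state 0 --a--> 1,  state 1 --b--> 1,  state 1 --a--> 2 (accepting).
NFA = {
    (0, 'a'): {1},
    (1, 'b'): {1},
    (1, 'a'): {2},
}
ACCEPT = {2}

def nfa_match(s: str) -> bool:
    """Track the set of live NFA states; each step builds one DFA state on demand."""
    states = {0}
    for ch in s:
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
        if not states:              # the on-demand DFA reached its dead state
            return False
    return bool(states & ACCEPT)
```

Each step costs time proportional to the number of NFA states rather than the exponential number of possible DFA states, which is the point of building states lazily.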
Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power.[53] GNU grep, which supports a wide variety of POSIX syntaxes and extensions, uses the Boyer–Moore algorithm for a first-pass prefilter, and then uses an implicit DFA.
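The exponential worst case of a backtracking engine mentioned above is easy to provoke with nested quantifiers; the pattern below is a standard illustrative example run against Python's backtracking re engine, not something from the source:

```python
import re

# Nested quantifiers force a backtracking engine to try every way of
# splitting the run of a's between the inner and outer +.
trap = re.compile(r'(a+)+b')

# This can never match (there is no 'b'), but the number of splits tried
# grows roughly as 2**n, so each extra 'a' about doubles the running time.
assert trap.match('a' * 18 + 'c') is None
```

A DFA-based engine such as the one in grep rejects the same input in linear time, which is why the trade-off in the text matters in practice.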
In most respects it makes no difference what the character set is, but some issues do arise when extending regexes to support Unicode.
Although in many cases system administrators can run regex-based queries internally, most search engines do not offer regex support to the public.
/r[aeiou]+/g (lower-case "r" followed by one or more lower-case vowels).
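The same pattern can be tried with Python's re module, where findall plays the role of the JavaScript-style g flag (the sample text here is invented for illustration):

```python
import re

# [aeiou] is a character class matching any one lower-case vowel;
# + asks for one or more of them in a row.
matches = re.findall(r'r[aeiou]+', "rob rat rest ruin around")
# matches: ['ro', 'ra', 're', 'rui', 'rou']
```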