Introduction

This new regex implementation is intended eventually to replace Python's current re module implementation.

For testing and comparison with the current 're' module, the new implementation is provided as a module called 'regex'.

Also included are the compiled binary .pyd files for Python 2.5-2.7 and Python 3.1-3.2 on 32-bit Windows.

Change of behaviour

Old vs new behaviour

This module has two behaviours:

Version 0 behaviour (old behaviour, compatible with the current 're' module):

Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.

.split won't split a string at a zero-width match.

Inline flags apply to the entire pattern, and they can't be turned off.

Only simple sets are supported.

Case-insensitive matches in Unicode use simple case-folding by default.

Version 1 behaviour (new behaviour, different from the current 're' module):

Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.

.split will split a string at a zero-width match.

Inline flags apply to the end of the group or pattern, and they can be turned off.

Nested sets and set operations are supported.

Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to regex.DEFAULT_VERSION. In the short term this will be VERSION0, but in the longer term it will be VERSION1.
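The difference in how inline flags are handled can be sketched as follows. This is a minimal illustration, assuming the 'regex' module is installed and importable; the pattern a(?i)b and the test strings are made up for the example.

```python
import regex

# Version 0: the inline flag (?i) applies to the entire pattern,
# so the leading "a" is matched case-insensitively as well.
v0 = regex.match(r"a(?i)b", "AB", flags=regex.VERSION0)

# Version 1: the inline flag applies only from where it appears to the
# end of the pattern, so the leading "a" stays case-sensitive.
v1_upper = regex.match(r"a(?i)b", "AB", flags=regex.VERSION1)
v1_lower = regex.match(r"a(?i)b", "aB", flags=regex.VERSION1)

# The version can also be selected inside the pattern itself.
v1_inline = regex.match(r"(?V1)a(?i)b", "aB")

print(v0 is not None)        # version 0 against "AB"
print(v1_upper is not None)  # version 1 against "AB"
print(v1_lower is not None)  # version 1 against "aB"
```

Passing the version explicitly, as here, keeps the result independent of the current value of regex.DEFAULT_VERSION.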

Note: the VERSION1 flag replaces the NEW flag in previous versions of this module. The NEW flag is still supported in this release, but will be removed. The decision to introduce versions and to make this change was made after discussion on the python-dev list.

Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Use of full case-folding can be turned on using the FULLCASE or F flag, or (?f) in the pattern. Please note that this flag affects how the IGNORECASE flag works; the FULLCASE flag itself does not turn on case-insensitive matching.

In the version 0 behaviour, the flag is off by default.

In the version 1 behaviour, the flag is on by default.
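A short sketch of how FULLCASE interacts with IGNORECASE, assuming the 'regex' module is installed. The German word used as test data is only an illustration.

```python
import regex

text = "stra\N{LATIN SMALL LETTER SHARP S}e"  # "strasse" spelled with a sharp s

# Version 0 with IGNORECASE only: simple case-folding is used,
# so "ss" in the pattern does not match the sharp s in the text.
simple = regex.match(r"strasse", text, flags=regex.V0 | regex.IGNORECASE)

# Adding FULLCASE switches IGNORECASE to full case-folding.
full = regex.match(
    r"strasse", text, flags=regex.V0 | regex.IGNORECASE | regex.FULLCASE
)

# Version 1 uses full case-folding by default once IGNORECASE is set.
v1 = regex.match(r"strasse", text, flags=regex.V1 | regex.IGNORECASE)

# FULLCASE on its own does not make the match case-insensitive.
fullcase_only = regex.match(r"STRASSE", "strasse", flags=regex.FULLCASE)

print(simple, full, v1, fullcase_only)
```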

Nested sets and set operations

Previous releases supported both simple and nested sets at the same time. This occasionally led to problems in practice when an unescaped "[" in a simple set was interpreted as the start of a nested set. Therefore nested sets and set operations are now supported only in the version 1 behaviour.

Version 0 behaviour: only simple sets are supported.

Version 1 behaviour: nested sets and set operations are supported.
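As a brief illustration of nested sets under the version 1 behaviour, assuming the 'regex' module is installed: the operators "--" (difference) and "&&" (intersection) are not described in this section and are taken from the module's set-operation syntax; the test string is arbitrary.

```python
import regex

# Set difference: lowercase letters except the vowels.
consonants = regex.findall(r"(?V1)[[a-z]--[aeiou]]", "regex")

# Set intersection: lowercase letters that are also vowels.
vowels = regex.findall(r"(?V1)[[a-z]&&[aeiou]]", "regex")

print(consonants)
print(vowels)
```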

Fuzzy matching

When performing fuzzy matching, the previous release of this module searched for the best match. This proved to be confusing in practice when used with .findall, for example.

Therefore, the default behaviour has been changed to search for the first match which meets the constraints, and a new BESTMATCH flag has been added to force it to search for the best match instead (the previous behaviour).
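The two search strategies can be compared with a small sketch, assuming the 'regex' module is installed. The fuzzy constraint {e<=1} (at most one error) and the fuzzy_counts attribute of the match object are not described in this section and may not be available in older releases; the test string is illustrative.

```python
import regex

text = "cat and dog"

# Default: stop at the first match that meets the constraint
# (here, at most one error of any kind).
first = regex.search(r"(dog){e<=1}", text)

# BESTMATCH: keep searching for the match with the fewest errors.
best = regex.search(r"(dog){e<=1}", text, flags=regex.BESTMATCH)

# fuzzy_counts is (substitutions, insertions, deletions).
print(repr(first.group(1)), first.fuzzy_counts)
print(repr(best.group(1)), best.fuzzy_counts)
```

The first match only has to satisfy the constraint, so it may contain an error even though an exact match exists later in the string.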

Flags

There are two kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

The scoped flags are: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

The global flags are: ASCII, BESTMATCH, LOCALE, REVERSE, UNICODE, VERSION0, VERSION1.

If none of the ASCII, LOCALE and UNICODE flags is specified, the default is UNICODE if the regex pattern is a Unicode string and ASCII if it is a bytestring.

The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match which meets the given constraints.
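A small sketch of scoped flags and of the default choice between UNICODE and ASCII, assuming the 'regex' module is installed. The patterns and strings are illustrative only, and the version 1 behaviour is selected explicitly so that flags can be turned off.

```python
import regex

# Scoped flag: IGNORECASE applies only inside the (?i:...) group.
scoped_hit = regex.match(r"(?i:ab)c", "ABc", flags=regex.V1)
scoped_miss = regex.match(r"(?i:ab)c", "ABC", flags=regex.V1)

# In version 1 a scoped flag can be turned off again part-way through.
toggled_hit = regex.match(r"(?i)ab(?-i)c", "ABc", flags=regex.V1)
toggled_miss = regex.match(r"(?i)ab(?-i)c", "ABC", flags=regex.V1)

# No ASCII, LOCALE or UNICODE flag given: a str pattern defaults to
# UNICODE, while a bytes pattern defaults to ASCII.
unicode_words = regex.findall(r"\w+", "caf\u00e9")
ascii_words = regex.findall(br"\w+", b"caf\xe9")

print(unicode_words)
print(ascii_words)
```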

Notes on named capture groups

All capture groups have a group number, starting from 1.

Groups with the same group name will have the same group number, and groups with a different group name will have a different group number.

The same group name can be used on different branches of an alternation because they are mutually exclusive, e.g. (?P<foo>first)|(?P<foo>second). They will, of course, have the same group number.

Group numbers will be reused, where possible, across different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").
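The numbering rules above can be checked with a short sketch, assuming the 'regex' module is installed. It uses the patterns quoted in this section and inspects the compiled pattern's groups and groupindex attributes.

```python
import regex

# Same name on both branches of an alternation: one shared group.
same_name = regex.compile(r"(?P<foo>first)|(?P<foo>second)")

# Branch reset: the group number is reused on each branch.
branch_reset = regex.compile(r"(?|(first)|(second))")

# Branch reset with different names: the numbers cannot be shared.
named_reset = regex.compile(r"(?|(?P<foo>first)|(?P<bar>second))")

print(same_name.groups, dict(same_name.groupindex))
print(branch_reset.groups)
print(named_reset.groups, dict(named_reset.groupindex))

# Whichever branch matched, the text is available under the shared group.
print(same_name.match("second").group("foo"))
print(branch_reset.match("second").group(1))
```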

Multithreading

The regex module releases the GIL during matching on instances of the built-in (immutable) string classes, enabling other Python threads to run concurrently. It is also possible to force the regex module to release the GIL during matching by calling the matching methods with the keyword argument concurrent=True. The behaviour is undefined if the string changes during matching, so use this option only when the string is guaranteed not to change.
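A minimal sketch of matching from several threads, assuming the 'regex' module is installed. The helper find_addresses, the pattern and the sample texts are hypothetical. For built-in strings the GIL is released anyway; concurrent=True is passed here only to show where the keyword goes.

```python
import regex
from concurrent.futures import ThreadPoolExecutor

# Deliberately simple pattern, used only for illustration.
pattern = regex.compile(r"\b\w+@\w+\.\w+\b")


def find_addresses(text):
    # concurrent=True asks the module to release the GIL while matching.
    # This is safe here because `text` is never modified during the call.
    return pattern.findall(text, concurrent=True)


# Hypothetical sample data.
texts = [
    "alice@example.com wrote to bob@example.org",
    "no address here",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(find_addresses, texts))

print(results)
```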

Building for 64-bit

If the source files are built for a 64-bit target then the string positions will also be 64-bit. (The 're' module appears to limit string positions to 32 bits, even on a 64-bit build.)

Unicode

This module supports Unicode 6.0.0.

Full Unicode case-folding is supported.

Additional features

The issue numbers relate to the Python bug tracker, except where listed as "Hg issue".