Package spade :: Module pyparsing

Module pyparsing

pyparsing module - Classes and methods to define and execute parsing grammars

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. With pyparsing, you don't need to learn a new syntax for defining grammars or matching expressions - the pyparsing module provides a library of classes that you use to construct the grammar directly in Python.

Here is a program to parse "Hello, World!" (or any greeting of the form "<salutation>, <addressee>!"):

   from pyparsing import Word, alphas
   
   # define grammar of a greeting
   greet = Word( alphas ) + "," + Word( alphas ) + "!" 
   
   hello = "Hello, World!"
   print hello, "->", greet.parseString( hello )

The program outputs the following:

   Hello, World! -> ['Hello', ',', 'World', '!']

The Python representation of the grammar is quite readable, owing to the self-explanatory class names, and the use of '+', '|' and '^' operators.
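
As a rough illustration of those operators, the sketch below builds one expression with each. It assumes Python 3 and a pyparsing release that still accepts the camelCase method names used on this page; the sample inputs are made up for the example.

```python
from pyparsing import Literal, Word, alphas, nums

# '+' builds an And: every piece must match, in the given order.
assignment = Word(alphas) + "=" + Word(nums)
and_tokens = assignment.parseString("x = 42").asList()

# '|' builds a MatchFirst: the first alternative that matches wins.
first_match = (Literal("in") | Literal("int")).parseString("int").asList()

# '^' builds an Or: the alternative with the longest match wins.
longest_match = (Literal("in") ^ Literal("int")).parseString("int").asList()

print(and_tokens)     # ['x', '=', '42']
print(first_match)    # ['in']
print(longest_match)  # ['int']
```

The last two lines show why the choice between '|' and '^' matters when one alternative is a prefix of another.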

The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.
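
A minimal sketch of those access styles, reusing the greeting grammar from above. The result names "salutation" and "addressee" are chosen here for illustration, and the code assumes Python 3 with a pyparsing release that still provides setResultsName() and asList().

```python
from pyparsing import Word, alphas

# Attach names to the two words so they can be looked up later.
greet = (Word(alphas).setResultsName("salutation") + ","
         + Word(alphas).setResultsName("addressee") + "!")

result = greet.parseString("Hello, World!")

as_list = result.asList()       # plain Python list of all tokens
by_index = result[0]            # list-style access by position
by_key = result["addressee"]    # dictionary-style access by name
by_attr = result.salutation     # attribute-style access by name

print(as_list)   # ['Hello', ',', 'World', '!']
print(by_index)  # Hello
print(by_key)    # World
print(by_attr)   # Hello
```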

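The pyparsing module handles some of the problems that are typically vexing when writing text parsers:

  extra or missing whitespace (the above program will also handle "Hello,World!", "Hello  ,  World  !", etc.)
  quoted strings
  embedded comments
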
Version: 1.3.3

Author: Paul McGuire <ptmcg@users.sourceforge.net>

Classes
  ParseBaseException
base exception class for all parsing runtime exceptions
  ParseException
exception thrown when parse expressions don't match class
  ParseFatalException
user-throwable exception thrown when inconsistent parse content is found; stops all parsing immediately
  RecursiveGrammarException
exception thrown by validate() if the grammar could be improperly recursive
  ParseResults
Structured parse results, to provide multiple means of access to the parsed data:
  ParserElement
Abstract base level parser element class.
  Token
Abstract ParserElement subclass, for defining atomic matching patterns.
  Empty
An empty token, will always match.
  NoMatch
A token that will never match.
  Literal
Token to exactly match a specified string.
  Keyword
Token to exactly match a specified string as a keyword, that is, it must be immediately followed by a non-keyword character.
  CaselessLiteral
Token to match a specified string, ignoring case of letters.
  Word
Token for matching words composed of allowed character sets.
  CharsNotIn
Token for matching words composed of characters *not* in a given set.
  White
Special matching class for matching whitespace.
  PositionToken
  GoToColumn
Token to advance to a specific column of input text; useful for tabular report scraping.
  LineStart
Matches if current position is at the beginning of a line within the parse string.
  LineEnd
Matches if current position is at the end of a line within the parse string.
  StringStart
Matches if current position is at the beginning of the parse string.
  StringEnd
Matches if current position is at the end of the parse string.
  ParseExpression
Abstract subclass of ParserElement, for combining and post-processing parsed tokens.
  And
Requires all given ParseExpressions to be found in the given order.
  Or
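Requires that at least one ParseExpression is found; if several match, the one that matches the longest string is used (constructed with the '^' operator).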
  MatchFirst
Requires that at least one ParseExpression is found; if several match, the first one listed is used (constructed with the '|' operator).
  Each
Requires all given ParseExpressions to be found, but in any order.
  ParseElementEnhance
Abstract subclass of ParserElement, for combining and post-processing parsed tokens.
  FollowedBy
Lookahead matching of the given parse expression.
  NotAny
Lookahead to disallow matching with the given parse expression.
  ZeroOrMore
Optional repetition of zero or more of the given expression.
  OneOrMore
Repetition of one or more of the given expression.
  Optional
Optional matching of the given expression.
  SkipTo
Token for skipping over all undefined text until the matched expression is found.
  Forward
Forward declaration of an expression to be defined later - used for recursive grammars, such as algebraic infix notation.
  _ForwardNoRecurse
  TokenConverter
Abstract subclass of ParseExpression, for converting parsed results.
  Upcase
Converter to upper case all matching tokens.
  Combine
Converter to concatenate all matching tokens to a single string.
  Group
Converter to return the matched tokens as a list - useful for returning tokens of ZeroOrMore and OneOrMore expressions.
  Dict
Converter to return a repetitive expression as a list, but also as a dictionary.
  Suppress
Converter for ignoring the results of a parsed expression.
Functions
 
_ustr(obj)
Drop-in replacement for str(obj) that tries to be Unicode friendly.
 
_str2dict(strg)
 
col(loc, strg)
Returns current column within a string, counting newlines as line separators. The first column is number 1.
 
lineno(loc, strg)
Returns current line number within a string, counting newlines as line separators. The first line is number 1.
 
line(loc, strg)
Returns the line of text containing loc within a string, counting newlines as line separators. The first line is number 1.
 
_defaultStartDebugAction(instring, loc, expr)
 
_defaultSuccessDebugAction(instring, startloc, endloc, expr, toks)
 
_defaultExceptionDebugAction(instring, loc, expr, exc)
 
nullDebugAction(*args)
'Do-nothing' debug action, to suppress debugging output during parsing.
 
delimitedList(expr, delim=',', combine=False)
Helper to define a delimited list of expressions - the delimiter defaults to ','.
 
oneOf(strs, caseless=False)
Helper to quickly define a set of alternative Literals. It makes sure to do longest-first testing when there is a conflict, regardless of the input order, but returns a MatchFirst for best performance.
 
dictOf(key, value)
Helper to easily and clearly define a dictionary by specifying the respective patterns for the key and value.
 
_expanded(p)
 
srange(s)
Helper to easily define string ranges for use in Word construction.
 
replaceWith(replStr)
Helper method for common parse actions that simply return a literal value.
 
removeQuotes(s, l, t)
Helper parse action for removing quotation marks from parsed quoted strings.
 
upcaseTokens(s, l, t)
Helper parse action to convert tokens to upper case.
 
downcaseTokens(s, l, t)
Helper parse action to convert tokens to lower case.
 
_makeTags(tagStr, xml)
Internal helper to construct opening and closing tag expressions, given a tag name.
 
makeHTMLTags(tagStr)
Helper to construct opening and closing tag expressions for HTML, given a tag name.
 
makeXMLTags(tagStr)
Helper to construct opening and closing tag expressions for XML, given a tag name.
Variables
  __doc__ = "...
  __versionTime__ = '12 September 2005 22:50'
  alphas = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  nums = '0123456789'
  hexnums = '0123456789ABCDEFabcdef'
  alphanums = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVW...
  _bslash = '\\'
  printables = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKL...
  empty = empty
  _escapedPunc = W:(\,\[]-...)
  _printables_less_backslash = '0123456789abcdefghijklmnopqrstuv...
  _escapedHexChar = Combine:({Suppress:("\0x") W:(0123...)})
  _escapedOctChar = Combine:({Suppress:("\") W:(0,0123...)})
  _singleChar = {W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0...
  _charRange = Group:({{W:(\,\[]-...) | Combine:({Suppress:("\0x...
  _reBracketExpr = {"[" ["^"] Group:({{Group:({{W:(\,\[]-...) | ...
  alphas8bit = u'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîï...
  _escapables = 'tnrfbacdeghijklmopqsuvwxyz \\\'"'
  _octDigits = '01234567'
  _escapedChar = {W:(\,tnrf...) | W:(\,0123...)}
  _sglQuote = "'"
  _dblQuote = """
  dblQuotedString = string enclosed in double quotes
  sglQuotedString = string enclosed in single quotes
  quotedString = quotedString using single or double quotes
  cStyleComment = cStyleComment enclosed in /* ... */
  htmlComment = htmlComment enclosed in <!-- ... -->
  restOfLine = rest of line up to \n
  dblSlashComment = {"//" rest of line up to \n}
  cppStyleComment = {FollowedBy:("/") {{"//" rest of line up to ...
  javaStyleComment = {FollowedBy:("/") {{"//" rest of line up to...
  pythonStyleComment = {"#" rest of line up to \n}
  _noncomma = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLM...
  _commasepitem = commaItem
  commaSeparatedList = commaSeparatedList
  __package__ = 'spade'
  c = '~'
Function Details

_ustr(obj)

Drop-in replacement for str(obj) that tries to be Unicode friendly. It first tries str(obj). If that fails with a UnicodeEncodeError, then it tries unicode(obj). It then < returns the unicode object | encodes it with the default encoding | ... >.

delimitedList(expr, delim=',', combine=False)

Helper to define a delimited list of expressions - the delimiter defaults to ','. By default, the list elements and delimiters can have intervening whitespace, and comments, but this can be overridden by passing 'combine=True' in the constructor. If combine is set to True, the matching tokens are returned as a single token string, with the delimiters included; otherwise, the matching tokens are returned as a list of tokens, with the delimiters suppressed.
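
The two modes described above can be compared side by side. This is a small sketch, assuming Python 3 and a pyparsing release that still exports delimitedList under this name; the input strings are invented for the example.

```python
from pyparsing import Word, alphas, delimitedList

# Default: whitespace around delimiters is allowed, delimiters are suppressed.
items = delimitedList(Word(alphas))
as_tokens = items.parseString("aa, bb ,cc").asList()

# combine=True: one token string is returned, delimiters included.
joined = delimitedList(Word(alphas), combine=True)
as_string = joined.parseString("aa,bb,cc").asList()

print(as_tokens)  # ['aa', 'bb', 'cc']
print(as_string)  # ['aa,bb,cc']
```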

dictOf(key, value)

Helper to easily and clearly define a dictionary by specifying the respective patterns for the key and value. Takes care of defining the Dict, ZeroOrMore, and Group tokens in the proper order. The key pattern can include delimiting markers or punctuation, as long as they are suppressed, thereby leaving the significant key text. The value pattern can include named results, so that the Dict results can include named token fields.
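
A minimal sketch of the idea, assuming Python 3 and a pyparsing release that still exports dictOf. The "name: number" input format is made up for illustration; the colon is suppressed in the key pattern so that only the significant key text remains.

```python
from pyparsing import Suppress, Word, alphas, nums, dictOf

# The key pattern carries its delimiter, suppressed so only the key text is kept.
key = Word(alphas) + Suppress(":")
value = Word(nums)

settings = dictOf(key, value)
result = settings.parseString("width: 80 height: 24")

width = result["width"]     # dictionary-style lookup by key
pairs = result.asList()     # the same data as a list of [key, value] groups

print(width)  # 80
print(pairs)  # [['width', '80'], ['height', '24']]
```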

srange(s)

Helper to easily define string ranges for use in Word construction. Borrows syntax from regexp '[]' string range definitions:

  srange("[0-9]")   -> "0123456789"
  srange("[a-z]")   -> "abcdefghijklmnopqrstuvwxyz"
  srange("[a-z$_]") -> "abcdefghijklmnopqrstuvwxyz$_"

The input string must be enclosed in []'s, and the returned string is the expanded character set joined into a single string. The values enclosed in the []'s may be:

  a single character
  an escaped character with a leading backslash (such as \- or \])
  an escaped hex character with a leading '\0x' (\0x21, which is a '!' character)
  an escaped octal character with a leading '\0' (\041, which is a '!' character)
  a range of any of the above, separated by a dash ('a-z', etc.)
  any combination of the above ('aeiouy', 'a-zA-Z0-9_$', etc.)
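
A short sketch of srange feeding a Word, assuming Python 3 and a pyparsing release that still exports srange. The identifier grammar is only an illustration: the first argument to Word gives the allowed leading characters and the second gives the allowed characters for the rest of the word.

```python
from pyparsing import Word, srange

digits = srange("[0-9]")
ident_start = srange("[a-zA-Z_]")

# Identifiers start with a letter or underscore, then may also contain digits.
identifier = Word(ident_start, srange("[a-zA-Z0-9_]"))
tokens = identifier.parseString("max_len2 = 10").asList()

print(digits)  # 0123456789
print(tokens)  # ['max_len2']
```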

replaceWith(replStr)

Helper method for common parse actions that simply return a literal value. Especially useful when used with transformString().
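
One possible use with transformString(), sketched under the assumption of Python 3 and a pyparsing release that still accepts these camelCase names. The "&nbsp;" entity and the sample text are chosen only for illustration.

```python
from pyparsing import Literal, replaceWith

# Every match of the literal is replaced by a single space.
nbsp = Literal("&nbsp;").setParseAction(replaceWith(" "))
cleaned = nbsp.transformString("a&nbsp;b&nbsp;c")

print(cleaned)  # a b c
```

Text that does not match the expression passes through transformString() unchanged.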

removeQuotes(s, l, t)

Helper parse action for removing quotation marks from parsed quoted strings. To use, add this parse action to quoted string using:

 quotedString.setParseAction( removeQuotes )
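
The effect can be seen by parsing the same text with and without the action. This sketch assumes Python 3 and a pyparsing release that still exports quotedString and removeQuotes; it works on copies so that the module-level quotedString is not modified, which is a choice made for the example rather than a requirement.

```python
from pyparsing import quotedString, removeQuotes

text = '"hello world"'

# Without the parse action, the quotation marks stay in the token.
raw = quotedString.copy().parseString(text).asList()

# With the parse action attached, the surrounding quotes are stripped.
unquoted = quotedString.copy().setParseAction(removeQuotes)
clean = unquoted.parseString(text).asList()

print(raw)    # ['"hello world"']
print(clean)  # ['hello world']
```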

Variables Details

__doc__

Value:
"""
pyparsing module - Classes and methods to define and execute parsing grammars

The pyparsing module is an alternative approach to creating and executing simple grammars, 
vs. the traditional lex/yacc approach, or the use of regular expressions.  With pyparsing, you
...

alphanums

Value:
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

printables

Value:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

_printables_less_backslash

Value:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[^_`{|}~'

_singleChar

Value:
{W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)}

_charRange

Value:
Group:({{W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)} Suppress:("-") {W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)}})

_reBracketExpr

Value:
{"[" ["^"] Group:({{Group:({{W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)} Suppress:("-") {W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)}}) | W:(\,\[]-...) | Combine:({Suppress:("\0x") W:(0123...)}) | Combine:({Suppress:("\") W:(0,0123...)}) | W:(0123...)}}...) "]"}

alphas8bit

Value:
u'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ'

cppStyleComment

Value:
{FollowedBy:("/") {{"//" rest of line up to \n} | cStyleComment enclosed in /* ... */}}

javaStyleComment

Value:
{FollowedBy:("/") {{"//" rest of line up to \n} | cStyleComment enclosed in /* ... */}}

_noncomma

Value:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~'