EarleyParseBuilder
(derived from Dick Grune and Ceriel
J.H. Jacob's "Parsing Techniques -- A Practical Guide" -- my
compliments to the authors! See page 149 in Edition 1 -- which is the
139th page in the .pdf book file.)
See also:
EarleyParseFunctionAlgorithm.html
EarleyParseCompletedItemSetsExample.html
Overview/Context: EarleyTreeBuilder creates one or more unique ParseTree
s from the pCompletedItem
Completed ItemSets generated by the EarleyParser
function algorithm.
- ParseTrees are returned to the calling function in
parseTreeArray
, whose length indicates the maximum number of ParseTree
s desired.
- A
parseCnt
is returned to inform the caller whether 0, 1, ... or n ParseTree
s were created. A parseCnt
of zero indicates no parse was possible, while a parseCnt
greater than one indicates an ambiguous parse since each output ParseTree
is "guaranteed" to be different.
- For a
Grammar
that is known to be unambiguous, the fastest result may be obtained by specifying a parseTreeArray
with length = 1.
- The
input expr is a rewritten version of the original parse Expression,
written as an array of ParseNodeHolder, with
the first symbol in expr[1]. This is done to simplify indexing by
coordinating with the Earley Parse numbering scheme. In addition the
EarleyParser has
"pre-parsed" certain symbols, including: Variable to corresponding
VariableHypothesis, and Cnst to matching "gimme" length 1 Constant
Syntax Axiom (AKA "Named, Typed Constant Syntax Axiom").
- There are
expr.length - 1
Completed ItemSets, one for each symbol in expr
. Each Completed ItemSet, "n
" contains pCompletedItemSetCnt[n]
EarleyItem
s. (each Completed ItemSet array may contain unused items at the end.)
expr
, pCompletedItem
and pCompletedItemSetCnt
have index values begin at 1
-- array position 0
is unused.Thus, expr[1]
is the first symbol and pCompletedItem[1]
is the first Completed ItemSet. These index values correspond to the EarleyItem.atIndex
values: an atIndex
value of m
inside a Completed ItemSet[n]
-- where m <= n
-- indicates that the EarleyItem
has a GrammarRule
matching symbols m
through n
in expr
.- Start Rule Type is derived in
EarleyParser
from the Formula's Type Code. For the purpose of tree building using
Earley Completed Item Sets, the Start Rule Type is the target and will
be the Type Code of the Stmt
in the root of the ParseTree
. In order to enable use of an arbitrary Start Rule Type, EarleyParse
seeds Predictor(0) with all Type Codes that convert to the derived Start Rule Type. And, in EarleyTreeBuilder
, accounting for the fact that a Start Rule Type might not have any NotationRule
s
of its own, the search for a top level Rule includes Type Codes that
convert to the Start Rule; a tree whose root type is not the Start Rule
Type, but which does have a Type converting to the Start Type, will be
supplemented with a new root converting the tree's root to the Start
Type. As a practical matter, this has no effect on set.mm
, but it might (somehow) be used with Type Codes that convert to "wff
" (however, Type Codes that convert to the Provable Logic Stmt Type Code (i.e. "|-
") are prohibited elsewhere via editing.)
Input:
- Cnst startRuleTyp;
- EarleyItem[][] pCompletedItem;
- int[] pCompletedItemSetCnt;
- ParseNodeHolder[] expr;
- Grammar grammar (for error messages)
I/O:
- ParseTree[] parseTreeArray;
Output:
- int parseCnt (via return)
Algorithm "Scheme" (- 1) -- proposed and then revised:
EarleyMapSet: EarleyTreeBuilder
creates a mapping from the Earley Parse Completed Item Sets, Start Rule
Type and input expr to the corresponding Parse Tree(s). We will call
this mapping the Earley Map Set. It contains 0, 1 or more
EarleyRuleMaps.
EarleyRuleMap:
Each EarleyRuleMap corresponds ("maps to") a single ParseTree (or
sub-tree -- more on this below in "EarleyRuleMap Recursion". The unique
identifying key of an EarleyRuleMap is a GrammarRule and 'n', where n
is the nth position in the input expr. Thus, an EarleyRuleMap for an
input expr with n symbols and rule = XXXX, will have an EarleyItem in
pCompletedItemSet[n] with item.rule == XXXX and item.atIndex = 1. Given
a valid EarleyRuleMap we know that we can obtain hypotheses from the
input expression and feed them to the map's rule (XXXX(a,b,c), for
example), and recreate the input expression exactly -- if not, there is
a bug in the code somewhere!
EarleyRuleMap Twin Pointers:
Because there can be more than one EarleyRuleMap in an EarleyMapSet, we
include twin pointer references, forward and backward in the
EarleyRuleMap object. This allows the data structure to accomodate an
arbitrary number of output ParseTrees, though of course multiple
ParseTrees indicate an Ambiguous Grammar! A given expression may have
0, 1 or many ParseTrees, with 0 indicating that the expression is
syntactically invalid (not part of the gramar), and more than 1
indicating ambiguity. Thus, the existence of a twin EarleyRuleMap
indicates ambiguity. During processing a global counter of the number
of twin EarleyRuleMaps is maintained, and this counter is used to limit
tree building as follows: if the user desires 'n' trees, then if Twin
Counter >= Max Twins, no further processing to create twin
EarleyRuleMaps will be done.
An interesting point about the EarleyRuleMap "twins": because each is
generated from a different Rule at the same item.atIndex position in
the input expr -- and the fact that EarleyParse prevents any duplicate
items in pCompletedItem[][] -- every output ParseTree is "guaranteed"
(subject to program bugs,) to be unique; there is no need to
mechanically check parseTree.isDup(), which was done in
BottomUpParse.java (that algorithm does generate duplicates.)
EarleyRuleMap ParseNodeHolder:
in some cases, such as with an expression of length 1, there is not a
matching GramarRule, only a Variable Hypothesis statment that was
"pre-parsed" by EarleyParse. In these cases, a ParseNodeHolder is
stored in EarleyRuleMap instead of a Rule -- which is not a problem
because eventually each EarleyRuleMap is converted to a ParseNodeHolder
with a root ParseNode stored in the ParseNodeHolder. (This is the point
at which the present scheme varies from Grune's grammatical parsing --
we pre-parse variables plus NullsPermitted and TypeConversion
GrammarRules into the final set of NotationRules, so the input
expression contains only Cnst terminals and (pre)ParseNodes
representing Variable Hypotheses, Type Conversions, etc.)
EarleyRuleMap Recursion: a
typical expression is built syntactically using a composite function of
multiple GrammarRules. A subsequence of an input expression may have
its own EarleyRuleMap, which will generate a hypothesis for the next
higher level rule in the ParseTree. Thus, the EarleyRuleMap has a
recursive structure.
EarleyRuleMap "HypMap": Each
EarleyRuleMap contains 'n' HypMap elements (in an array), one for each
hypothesis used by the Rule. These HypMaps differ from EarleyRuleMaps
only in that during processing a HypMap may be generated that does not
end up in the final ParseTree for its parent EarleyRuleMap. This is a
result of the fact that a Completed Item is found to be "completed" at
the end of an expression sub-sequence and its associated items in the
previous Completed Item Set are comingled with other completed items --
some of which do not end up in any ParseTree, but are simply
by-products of the Earley Parse algorithm and the Grammar itself.
So...to build an EarleyRuleMap there is a process of deductive
elimination, or "detective work", which is required to find the
EarleyRuleMap's hypotheses' completed items.
However, once an EarleyRuleMap is successfully constructed, each HypMap
is itself a valid EarleyRuleMap in its own right -- and that means that
it covers a subsequence of the input expr just as the top level
EarleyRuleMap covers the entire expr. And it means that there can be
layers and layers below; some of those lower level EarleyRuleMaps may
even have twins! That is because a given subsequence of an expr with a
"parse sub-tree", may have more than one equivalent parsing -- if the
grammar is ambiguous.
EarleyRuleMap Data Structure "-1 version" -- which is subsequently revised for efficiency:
public class EarleyRuleMap {
/**
* item - index into completedItemSet[exprThru] of EarleyItem
* that maps from a GrammarRule to a portion of an
* expression being parsed:
*
*
rule = completedItemSet[exprThru][itemIndex].rule;
*/
int itemIndex;
/*
* typ - typ code from EarleyItem.rule, here for convenience
*/
Cnst typ;
/**
* exprFrom - index location of start of parameter in the
* expression being parsed.
*/
int exprFrom;
/**
* thru - index location of end of parameter in the
* expression being parsed.
*/
int exprThru;
/**
* hypMap -- array of EarleyRuleMaps, one element for each
* hypothesis of the GrammarRule.
*/
EarleyRuleMap[] hypMap;
/**
* parseNodeHolder -- if not null, used instead of rule
* to generate the output ParseTree. Example: a
* VarHyp in the input expr (there are no GrammarRules
* for variable hypotheses, these are "pre-parsed".)
*/
ParseNodeHolder parseNodeHolder;
/**
* twinFwd -- pointer reference to "twin" EarleyRuleMap
* for a given subsequence of the input expr.
* Each twin will result in an additional output
* ParseTree.
*/
EarleyRuleMap twinFwd;
/**
* twinBwd -- pointer reference to "twin" EarleyRuleMap
* for a given subsequence of the input expr.
* NOT SURE A BACKWARD POINTER IS NEEDED.
*/
EarleyRuleMap twinBwd;
public EarleyRuleMapping() {
itemIndex = -1;
}
Note that this structure can contain a condensed set of data for
multiple ParseTrees, but it does not solve the problem of exactly how
to generate the multiple trees. A brute force method for generating 'n'
top-level trees is to keep track of which tree we're working on,
TwinTree[0], TwinTree[1], TwinTree[2], etc. and then traverse the
EarleyRuleMap structure 'n' times keeping track of how many twins have
been seen and "consumed" so far: when working on Twin[0] don't use any
twin data; when working on Twin[1], use only the first twin reference,
etc. -- and that gets very awkward.
The other humongous problem with this structure is that it is merely an
intermediate structure between the Earley Completed Itemsets and Parse
Trees -- which are themselves a byproduct used to generate RPN's for
Stmt's! This structure, so carefully built consumes the maximum amount
of space and still does not deliver an easy way to generate the
ParseTrees. It is however instructive of the main issue: once we
have an EarleyRuleMap we are guaranteed to have a Parse Tree, but there
can be many sets of HypMaps that can "fit" the EarleyRuleMap. The key
to the problem is to immediately generate a root ParseNode from the
EarleyRuleMap and then if there are alternative sets of VarHyps, to
iterate through those one-by-one, adding new "twin" root ParseNodes to
the EarleyRuleMap. This approach treats an EarleyRuleMap as a
disposable item whose contents are used and disposed of, except for the
chain of rootParseNodes -- which are the essential output. The work of
assembling the chain of root ParseNodes is performed by the
EarleyRuleMap object based on the sets of HypMaps for the rule. The
chain of root ParseNodes is then used by the next higher level
EarleyRuleMap to perform the same process, until finally, the top
EarleyRuleMap for the top rule contains a chain of root ParseNodes,
each corresponding to a ParseTree. So we save space and we minimize the
complications of traversing the set of EarleyRuleMaps by building the
ParseNodes as the EarleyRuleMaps are built; we don't need to store the
whole forest of them, just use them temporarily and discard them once
we have the needed ParseNodes!
As a sidenote, the EarleyRuleMap class can be embedded in the
EarleyTreeBuilder class. This provides access to variables in the outer
class and hides the EarleyRuleMap (since it is totally custom for
EarleyTreeBuilder).
EarleyRuleMap Data Structure "Version 0" -- the ultimate :)
public class EarleyRuleMap {
/**
* item - index into completedItemSet[exprThru] of EarleyItem
* that maps from a GrammarRule to a portion of an
* expression being parsed:
*
*
rule = completedItemSet[exprThru][itemIndex].rule;
*
* A value of '-1' indicates search of Completed Item
* Set should proceed from the beginning at the thru
* set,
*
* A value of '-2' indicates that the search has already
* been (totally) performed and nothing more should or
* can be done.
*/
int itemIndex;
/*
* typ - typ code from EarleyItem.rule, here for convenience
*/
Cnst typ;
/**
* exprFrom - index location of start of parameter in the
* expression being parsed.
*/
int exprFrom;
/**
* thru - index location of end of parameter in the
* expression being parsed.
*/
int exprThru;
/**
* hypMap -- array of EarleyRuleMaps, one element for each
* hypothesis of the GrammarRule.
*/
EarleyRuleMap[] hypMap;
/**
* parseNodeHolder -- after full loading of EarleyRuleMap
* contains reference to ParseNodeHolder of root
* of parse tree for the rule map. The ParseNodeHolder
* contains "fwd" and "bwd" references forming a
* twin chain that enables multiple parse trees
* to be generated (for ambiguous grammar parses).
* Note that a hypMap can contain a parseNodeHolder
* to just a VarHyp with no corresponding rule
* from EarleyParse in the Completed ItemSet; this
* is because only NotationRules are input directly
* to the parser, with NullsPermitted and TypeConversion
* rules being "pre-parsed" into the input expr.
*/
ParseNodeHolder parseNodeHolder;
}
Algorithm:
The implementation's efficiency may be impacted
significantly by the need to search for multiple parse trees within the
Completed ItemSets. Therefore, we might consider using two functions:
the first, the fast one, is called to find at most one parse tree, and
the second is called if parseTreeArray.length > 1. These functions
may share some code and be similar in many respects, so here is a
(draft) description of the first. The top level call uses searchTyp = startRuleTyp
, exprThru = expr.length - 1
and exprFrom = 1
.
Variation 1 -- single tree
protected ParseNode buildOneParseNodeRootAtMost(
Cnst searchTyp,
int exprFrom,
int exprThru) {}
1
. Search pCompletedItem[exprThru]
backward, starting at [pCompletedItemSetCnt[exprThru] - 1]
, through item 1 looking for an EarleyItem
, "x
" with x.rule.getGrammarRuleTyp() == searchTyp
. If no item is found, return null, there are no parse trees. On the other hand, if such an x
is found, then that means we do have a parse tree available, and if the remainder of the code fails to produce a parse tree then IllegalStateException
should be thrown because there is a bug.)
2
. call "produceRuleParamArrayFromExpr(x.rule, exprFrom, exprThru)
" which builds x.rule.getNbrHypParamsUsed()
ParseNode
s from the pCompletedItem[][]
arrays, one ParseNode
for each variable hypothesis required by the GrammarRule
, x.rule
. These ParseNode
s should be returned in an array of ParseNodeHolder
s we will call, "paramArray
" (note that each parseNodeHolder.parseNode may be the root of an extensive sub-tree!).
3
. return x.rule.buildGrammaticalParseNode(paramArray));
protected ParseNodeHolder[] produceOneRuleParamArrayFromExpr(
GrammarRule rule,
int exprFrom,
int exprThru) {}
(assume ParseNodeHolder[] expr
is in global storage)
Context/Overview:
1. exprFrom and exprThru are passed because this function is used for
lower level nodes as well as the top level, so we may be pulling a
paramArray from just a portion of expr.
2. x.rule.getRuleFormatExpr() returns a Cnst[] we'll call "ruleExpr".
An example from set.mm is axiom "wb" with ruleExpr = "( wff <->
wff )", a Cnst[5] array.
3. ruleExpr is the template we overlay upon expr[exprFrom] through
expr[exprThru] to find the variable hypothesis parameters for item x.
4. ruleExpr.length may be less than or
equal to [1 + exprThru - exprFrom], but if it is greater than -- or if
for any other reason we cannot derive paramArray, we should return null
to the caller -- he must have passed us the wrong rule to work with!
5. In example from file:///C:/astevel3/mmj2/EarleyParseFunctionAlgorithm.html,
notice how the two wff parameters in ruleExpr must be produced from
their corresponding parts in the parse expr -- the "wff" to "wff" and
"wff" to "set = class". The problem is that we do not know which
portion of expr is overlaid by each wff parameter in ruleExpr!
- top parse rule (x) in completed item set 7, atIndex=1, ruleExpr=
"( wff <-> wff )"
Expr (in rule format)="
( wff <-> set = class )
"
To solve this problem we have to work right-to-left in ruleExpr and use
detective work to fight out the mappings. And, as can be seen, once we
have determined the mappings to use (i.e. ruleExpr wff #2 maps to
expr[4] through expr[6]) we can then make the recursive call to
buildOneParseNodeRootAtMost(mapTyp[i], mapFrom[i], mapThru[i]);
Note: the case of wff#1 is
illustrative of the problem our EarleyParser poses, as opposed to Dick
Grune's parser. We have already pre-parsed variables to their
corresponding variable hypotheses. That means that the mapping for wff
#1 in the example above will not have a rule to match to in Completed
ItemSet. However, we will know that mapFrom[i] = mapThru[i] and we can
just go directly to expr[i] to obtain the parseNode, which will have
typ == wff (if it doesn't then our detective screwed up or he needs to
keep searching.)
So, we do all this for each parameter in ruleExpr,
loading the returned ParseNodes into our ParseNodeHolder paramArray,
and voila!
6.
Also, if x.rule.getNbrHypParamsUsed()= 0
,
then we can just return a length=0 paramArray -- there cannot be any
substitutions to make except those implicit in GrammarRule x and its
paramTransformationTree (which hides the details of Nulls and Type
Conversions).