EarleyTreeBuilder.html

EarleyParseBuilder

(derived from Dick Grune and Ceriel J.H. Jacob's "Parsing Techniques -- A Practical Guide" -- my compliments to the authors! See page 149 in Edition 1 -- which is the 139th page in the .pdf book file.)

Overview/Context: EarleyTreeBuilder creates one or more unique ParseTrees from the pCompletedItem Completed ItemSets generated by the EarleyParser function algorithm.

ParseTrees are returned to the calling function in parseTreeArray, whose length indicates the maximum number of ParseTrees desired.
A parseCnt is returned to inform the caller whether 0, 1, ... or n ParseTrees were created. A parseCnt of zero indicates no parse was possible, while a parseCnt greater than one indicates an ambiguous parse since each output ParseTree is "guaranteed" to be different.
For a Grammar that is known to be unambiguous, the fastest result may be obtained by specifying a parseTreeArray with length = 1.
The input expr is a rewritten version of the original parse Expression, written as an array of ParseNodeHolder, with the first symbol in expr[1]. This is done to simplify indexing by coordinating with the Earley Parse numbering scheme. In addition the EarleyParser has "pre-parsed" certain symbols, including: Variable to corresponding VariableHypothesis, and Cnst to matching "gimme" length 1 Constant Syntax Axiom (AKA "Named, Typed Constant Syntax Axiom").
There are expr.length - 1 Completed ItemSets, one for each symbol in expr. Each Completed ItemSet, "n" contains pCompletedItemSetCnt[n] EarleyItems. (each Completed ItemSet array may contain unused items at the end.)
expr, pCompletedItem and pCompletedItemSetCnt have index values begin at 1 -- array position 0 is unused.Thus, expr[1] is the first symbol and pCompletedItem[1] is the first Completed ItemSet. These index values correspond to the EarleyItem.atIndex values: an atIndex value of m inside a Completed ItemSet[n] -- where m <= n -- indicates that the EarleyItem has a GrammarRule matching symbols m through n in expr.
Start Rule Type is derived in EarleyParser from the Formula's Type Code. For the purpose of tree building using Earley Completed Item Sets, the Start Rule Type is the target and will be the Type Code of the Stmt in the root of the ParseTree. In order to enable use of an arbitrary Start Rule Type, EarleyParse seeds Predictor(0) with all Type Codes that convert to the derived Start Rule Type. And, in EarleyTreeBuilder, accounting for the fact that a Start Rule Type might not have any NotationRules of its own, the search for a top level Rule includes Type Codes that convert to the Start Rule; a tree whose root type is not the Start Rule Type, but which does have a Type converting to the Start Type, will be supplemented with a new root converting the tree's root to the Start Type. As a practical matter, this has no effect on set.mm, but it might (somehow) be used with Type Codes that convert to "wff" (however, Type Codes that convert to the Provable Logic Stmt Type Code (i.e. "|-") are prohibited elsewhere via editing.)

Input:

Cnst startRuleTyp;
EarleyItem[][] pCompletedItem;
int[] pCompletedItemSetCnt;
ParseNodeHolder[] expr;
Grammar grammar (for error messages)

I/O:

ParseTree[] parseTreeArray;

Output:

int parseCnt (via return)

Algorithm "Scheme" (- 1) -- proposed and then revised:

EarleyMapSet: EarleyTreeBuilder creates a mapping from the Earley Parse Completed Item Sets, Start Rule Type and input expr to the corresponding Parse Tree(s). We will call this mapping the Earley Map Set. It contains 0, 1 or more EarleyRuleMaps.

EarleyRuleMap: Each EarleyRuleMap corresponds ("maps to") a single ParseTree (or sub-tree -- more on this below in "EarleyRuleMap Recursion". The unique identifying key of an EarleyRuleMap is a GrammarRule and 'n', where n is the nth position in the input expr. Thus, an EarleyRuleMap for an input expr with n symbols and rule = XXXX, will have an EarleyItem in pCompletedItemSet[n] with item.rule == XXXX and item.atIndex = 1. Given a valid EarleyRuleMap we know that we can obtain hypotheses from the input expression and feed them to the map's rule (XXXX(a,b,c), for example), and recreate the input expression exactly -- if not, there is a bug in the code somewhere!

EarleyRuleMap Twin Pointers: Because there can be more than one EarleyRuleMap in an EarleyMapSet, we include twin pointer references, forward and backward in the EarleyRuleMap object. This allows the data structure to accomodate an arbitrary number of output ParseTrees, though of course multiple ParseTrees indicate an Ambiguous Grammar! A given expression may have 0, 1 or many ParseTrees, with 0 indicating that the expression is syntactically invalid (not part of the gramar), and more than 1 indicating ambiguity. Thus, the existence of a twin EarleyRuleMap indicates ambiguity. During processing a global counter of the number of twin EarleyRuleMaps is maintained, and this counter is used to limit tree building as follows: if the user desires 'n' trees, then if Twin Counter >= Max Twins, no further processing to create twin EarleyRuleMaps will be done.

An interesting point about the EarleyRuleMap "twins": because each is generated from a different Rule at the same item.atIndex position in the input expr -- and the fact that EarleyParse prevents any duplicate items in pCompletedItem[][] -- every output ParseTree is "guaranteed" (subject to program bugs,) to be unique; there is no need to mechanically check parseTree.isDup(), which was done in BottomUpParse.java (that algorithm does generate duplicates.)

EarleyRuleMap ParseNodeHolder: in some cases, such as with an expression of length 1, there is not a matching GramarRule, only a Variable Hypothesis statment that was "pre-parsed" by EarleyParse. In these cases, a ParseNodeHolder is stored in EarleyRuleMap instead of a Rule -- which is not a problem because eventually each EarleyRuleMap is converted to a ParseNodeHolder with a root ParseNode stored in the ParseNodeHolder. (This is the point at which the present scheme varies from Grune's grammatical parsing -- we pre-parse variables plus NullsPermitted and TypeConversion GrammarRules into the final set of NotationRules, so the input expression contains only Cnst terminals and (pre)ParseNodes representing Variable Hypotheses, Type Conversions, etc.)

EarleyRuleMap Recursion: a typical expression is built syntactically using a composite function of multiple GrammarRules. A subsequence of an input expression may have its own EarleyRuleMap, which will generate a hypothesis for the next higher level rule in the ParseTree. Thus, the EarleyRuleMap has a recursive structure.

EarleyRuleMap "HypMap": Each EarleyRuleMap contains 'n' HypMap elements (in an array), one for each hypothesis used by the Rule. These HypMaps differ from EarleyRuleMaps only in that during processing a HypMap may be generated that does not end up in the final ParseTree for its parent EarleyRuleMap. This is a result of the fact that a Completed Item is found to be "completed" at the end of an expression sub-sequence and its associated items in the previous Completed Item Set are comingled with other completed items -- some of which do not end up in any ParseTree, but are simply by-products of the Earley Parse algorithm and the Grammar itself. So...to build an EarleyRuleMap there is a process of deductive elimination, or "detective work", which is required to find the EarleyRuleMap's hypotheses' completed items.

However, once an EarleyRuleMap is successfully constructed, each HypMap is itself a valid EarleyRuleMap in its own right -- and that means that it covers a subsequence of the input expr just as the top level EarleyRuleMap covers the entire expr. And it means that there can be layers and layers below; some of those lower level EarleyRuleMaps may even have twins! That is because a given subsequence of an expr with a "parse sub-tree", may have more than one equivalent parsing -- if the grammar is ambiguous.

EarleyRuleMap Data Structure "-1 version" -- which is subsequently revised for efficiency:

public class EarleyRuleMap {

    /**

     *  item - index into completedItemSet[exprThru] of EarleyItem

     *         that maps from a GrammarRule to a portion of an

     *         expression being parsed:

     *

    
*            
rule = completedItemSet[exprThru][itemIndex].rule;

     */

    int             itemIndex;



    /*

     *  typ  - typ code from EarleyItem.rule, here for convenience

     */

    Cnst            typ;



    /**

     *  exprFrom - index location of start of parameter in the

     *         expression being parsed.

     */

    int             exprFrom;



    /**

     *  thru - index location of end of parameter in the

     *         expression being parsed.

     */

    int             exprThru;



    /**

     *  hypMap -- array of EarleyRuleMaps, one element for each

     *         hypothesis of the GrammarRule.

     */

    EarleyRuleMap[] hypMap;



    /**

     *  parseNodeHolder -- if not null, used instead of rule 

     *         to generate the output ParseTree. Example: a

     *         VarHyp in the input expr (there are no GrammarRules

     *         for variable hypotheses, these are "pre-parsed".)

     */

    ParseNodeHolder parseNodeHolder;

    /**

     *  twinFwd -- pointer reference to "twin" EarleyRuleMap 

     *         for a given subsequence of the input expr.

     *         Each twin will result in an additional output

     *         ParseTree. 
     */

    EarleyRuleMap twinFwd;

    /**

     *  twinBwd -- pointer reference to "twin" EarleyRuleMap 

     *         for a given subsequence of the input expr.

     *         NOT SURE A BACKWARD POINTER IS NEEDED.
     */

    EarleyRuleMap twinBwd;



    public EarleyRuleMapping() {

        itemIndex = -1;

    }

Note that this structure can contain a condensed set of data for multiple ParseTrees, but it does not solve the problem of exactly how to generate the multiple trees. A brute force method for generating 'n' top-level trees is to keep track of which tree we're working on, TwinTree[0], TwinTree[1], TwinTree[2], etc. and then traverse the EarleyRuleMap structure 'n' times keeping track of how many twins have been seen and "consumed" so far: when working on Twin[0] don't use any twin data; when working on Twin[1], use only the first twin reference, etc. -- and that gets very awkward.

The other humongous problem with this structure is that it is merely an intermediate structure between the Earley Completed Itemsets and Parse Trees -- which are themselves a byproduct used to generate RPN's for Stmt's! This structure, so carefully built consumes the maximum amount of space and still does not deliver an easy way to generate the ParseTrees. It is however instructive of the main issue: once we have an EarleyRuleMap we are guaranteed to have a Parse Tree, but there can be many sets of HypMaps that can "fit" the EarleyRuleMap. The key to the problem is to immediately generate a root ParseNode from the EarleyRuleMap and then if there are alternative sets of VarHyps, to iterate through those one-by-one, adding new "twin" root ParseNodes to the EarleyRuleMap. This approach treats an EarleyRuleMap as a disposable item whose contents are used and disposed of, except for the chain of rootParseNodes -- which are the essential output. The work of assembling the chain of root ParseNodes is performed by the EarleyRuleMap object based on the sets of HypMaps for the rule. The chain of root ParseNodes is then used by the next higher level EarleyRuleMap to perform the same process, until finally, the top EarleyRuleMap for the top rule contains a chain of root ParseNodes, each corresponding to a ParseTree. So we save space and we minimize the complications of traversing the set of EarleyRuleMaps by building the ParseNodes as the EarleyRuleMaps are built; we don't need to store the whole forest of them, just use them temporarily and discard them once we have the needed ParseNodes!

As a sidenote, the EarleyRuleMap class can be embedded in the EarleyTreeBuilder class. This provides access to variables in the outer class and hides the EarleyRuleMap (since it is totally custom for EarleyTreeBuilder).

EarleyRuleMap Data Structure "Version 0" -- the ultimate :)

public class EarleyRuleMap {

    /**

     *  item - index into completedItemSet[exprThru] of EarleyItem

     *         that maps from a GrammarRule to a portion of an

     *         expression being parsed:

     *

    
*            
rule = completedItemSet[exprThru][itemIndex].rule;

     *        

     *         A value of '-1' indicates search of Completed Item

     *         Set should proceed from the beginning at the thru

     *         set,

     *     

     *         A value of '-2' indicates that the search has already 

     *         been (totally) performed and nothing more should or 

     *         can be done.

     */

    int             itemIndex;



    /*

     *  typ  - typ code from EarleyItem.rule, here for convenience

     */

    Cnst            typ;



    /**

     *  exprFrom - index location of start of parameter in the

     *         expression being parsed.

     */

    int             exprFrom;



    /**

     *  thru - index location of end of parameter in the

     *         expression being parsed.

     */

    int             exprThru;



    /**

     *  hypMap -- array of EarleyRuleMaps, one element for each

     *         hypothesis of the GrammarRule.

     */

    EarleyRuleMap[] hypMap;



    /**

     *  parseNodeHolder -- after full loading of EarleyRuleMap

     *         contains reference to ParseNodeHolder of root 

     *         of parse tree for the rule map. The ParseNodeHolder

     *         contains "fwd" and "bwd" references forming a

     *         twin chain that enables multiple parse trees

     *         to be generated (for ambiguous grammar parses).

     *         Note that a hypMap can contain a parseNodeHolder

     *         to just a VarHyp with no corresponding rule

     *         from EarleyParse in the Completed ItemSet; this

     *         is because only NotationRules are input directly 

     *         to the parser, with NullsPermitted and TypeConversion

     *         rules being "pre-parsed" into the input expr.


     */

    ParseNodeHolder parseNodeHolder;

}

Algorithm: The implementation's efficiency may be impacted significantly by the need to search for multiple parse trees within the Completed ItemSets. Therefore, we might consider using two functions: the first, the fast one, is called to find at most one parse tree, and the second is called if parseTreeArray.length > 1. These functions may share some code and be similar in many respects, so here is a (draft) description of the first. The top level call uses searchTyp = startRuleTyp, exprThru = expr.length - 1 and exprFrom = 1.

Variation 1 -- single tree

protected ParseNode buildOneParseNodeRootAtMost(

    Cnst       searchTyp,

    int        exprFrom,

    int        exprThru) {}

1. Search pCompletedItem[exprThru] backward, starting at [pCompletedItemSetCnt[exprThru] - 1], through item 1 looking for an EarleyItem, "x" with x.rule.getGrammarRuleTyp() == searchTyp. If no item is found, return null, there are no parse trees. On the other hand, if such an x is found, then that means we do have a parse tree available, and if the remainder of the code fails to produce a parse tree then IllegalStateException should be thrown because there is a bug.)

2. call "produceRuleParamArrayFromExpr(x.rule, exprFrom, exprThru)" which builds x.rule.getNbrHypParamsUsed() ParseNodes from the pCompletedItem[][] arrays, one ParseNode for each variable hypothesis required by the GrammarRule, x.rule. These ParseNodes should be returned in an array of ParseNodeHolders we will call, "paramArray" (note that each parseNodeHolder.parseNode may be the root of an extensive sub-tree!).

3. return

x.rule.buildGrammaticalParseNode(paramArray));

protected ParseNodeHolder[] produceOneRuleParamArrayFromExpr(

    GrammarRule rule,

    int         exprFrom,

    int         exprThru) {}

(assume ParseNodeHolder[] expr is in global storage)

Context/Overview:

1. exprFrom and exprThru are passed because this function is used for lower level nodes as well as the top level, so we may be pulling a paramArray from just a portion of expr.

2. x.rule.getRuleFormatExpr() returns a Cnst[] we'll call "ruleExpr". An example from set.mm is axiom "wb" with ruleExpr = "( wff <-> wff )", a Cnst[5] array.

3. ruleExpr is the template we overlay upon expr[exprFrom] through expr[exprThru] to find the variable hypothesis parameters for item x.

4. ruleExpr.length may be less than or equal to [1 + exprThru - exprFrom], but if it is greater than -- or if for any other reason we cannot derive paramArray, we should return null to the caller -- he must have passed us the wrong rule to work with!

5. In example from file:///C:/astevel3/mmj2/EarleyParseFunctionAlgorithm.html, notice how the two wff parameters in ruleExpr must be produced from their corresponding parts in the parse expr -- the "wff" to "wff" and "wff" to "set = class". The problem is that we do not know which portion of expr is overlaid by each wff parameter in ruleExpr!

top parse rule (x) in completed item set 7, atIndex=1, ruleExpr= "( wff <-> wff )"
Expr (in rule format)="( wff <-> set = class )"

To solve this problem we have to work right-to-left in ruleExpr and use detective work to fight out the mappings. And, as can be seen, once we have determined the mappings to use (i.e. ruleExpr wff #2 maps to expr[4] through expr[6]) we can then make the recursive call to

    buildOneParseNodeRootAtMost(mapTyp[i], mapFrom[i], mapThru[i]);



Note: the case of wff#1 is
illustrative of the problem our EarleyParser poses, as opposed to Dick
Grune's parser. We have already pre-parsed variables to their
corresponding variable hypotheses. That means that the mapping for wff
#1 in the example above will not have a rule to match to in Completed
ItemSet. However, we will know that mapFrom[i] = mapThru[i] and we can
just go directly to expr[i] to obtain the parseNode, which will have
typ == wff (if it doesn't then our detective screwed up or he needs to
keep searching.)

So, we do all this for each parameter in ruleExpr, loading the returned ParseNodes into our ParseNodeHolder paramArray, and voila!

6. Also, if x.rule.getNbrHypParamsUsed()= 0, then we can just return a length=0 paramArray -- there cannot be any substitutions to make except those implicit in GrammarRule x and its paramTransformationTree (which hides the details of Nulls and Type Conversions).