Cnst.earleyFIRST
and Cnst.earleyRules
items have been set to null
prior to construction of EarleyParser
(which happens at the end of Grammar.initializeGrammar()
.EarleyParser.parseExpr
consists of: public int parseExpr(ParseTree[] parseTreeArrayIn,
Cnst
formulaTypIn,
ParseNodeHolder[] parseNodeHolderExprIn,
int
highestSeqIn)
parseTreeArrayIn
-- ParseTree[]
of length 0, 1, ... parseTreeArrayIn.length
. The array will be
filled with 0, 1 ... to n ParseTree
s, depending on the number of
ParseTree
s generated
for the input expression and the size of the input array (zero means no
parse tree found, one means
unique parse tree, and two means two or more equivalent parse trees are
possible and the first two were returned.) Any other errors will result
in a thrown Exception
(example might be bogus VarHyp
array or some
other runtime error caused by bad data or a programming booboo.) If the
user does not care to check for ambiguity then the ParseTree[]
may have
length = 1, so that needless time is not wasted looking for, or
generating trees 2 and beyond; the parser will simpy ignore all secondary ParseTree
s and go directly for the first one it finds..formulaTypIn
-- Cnst
Formula Type
code where Type code is for "parsing" if the Expression's length = zero
(special case involving Nulls Permitted) and for determination of the
parse Start Rule Type: If formulaTypIn
= *the* Provable
Logic Stmt Type Code (i.e. "|-"), then the Start Rule Type is *the*
Logical Stmt Type (i.e. "wff"); otherwise, the Start Rule Type is the
Statement's Type Code itself (i.e. "class" -- this would generally
occur on either a Syntax Axiom or, rarely, on a Theorem representing a
grammatical parse RPN as a "proof", in essence, deriving the
Statement's Formula's Expression from the RPN.) Note: some puzzlement
may be in order, but consider the case of a 1 symbol Constant
expression, such as "c0" that is input on a Formula with Type Code =
"class": no problem, because "c0" is a Syntax Axiom of type class.
However, if the Formula Type Code = "|-" (as on a Theorem), then the
parse will fail because the Start Rule Type will be "wff" and there is
no "c0" rule for Type = "wff". parseNodeHolderExprIn
--
a Formula
's Expression (symbols 2 through 'n') written as an array of ParseNodeHolder
s. wherein variables have been replaced by their corresponding VarHyp's (Variable Hypotheses). These
contain mObj
and parseNode
. If (mObj.isCnst())
then the mObj
is a terminal symbol (i.e. "(
"), and ParseNode
is null
; otherwise, ParseNode
will contain something already parsed -- such as a VarHyp
or a Named-Typed Constant Syntax Axiom (i.e. "c0"). Note: For optimization,
especially with set.mm, immediately upon input, the parseNodeHolderExprIn will be
pre-loaded with ParseNode
s
for the "Gimme" Named-Typed Constant Syntax Axioms (see
GRForest.findLen1CnstNotationRule() and
NotationRule.getIsGimmeMatchNbr()) -- this allows us to minimize the
size or Cnst.earleyRules
and Cnst.earleyFIRST
,
which should dramatically improve performance. (the parser can
process a parseNodeHolderExpr that is already partially parsed.)highestSeqIn
-- Maximum Sequence Number used to ensure that parse uses
only Syntax Axioms occurring prior to the input Formula
in the Metamath
Datbase; default is java.lang.Integer.MAX_VALUE
if this check is
unnecessary). This variable is used during initialization before the
earleyLoop() to build predictedItemset[0] -- only Cnst.earleyRules
<= highestSeqIn are loaded, a nice little optimization (which will
be even better if we ensure that Cnst.earleyRules is stored in sequence
by GrammarRule.MAX_SEQ_NBR
, thus allowing the loader that is reading Cnst.earleyRules
to stop reading when highestSeqIn
is reached.)EarleyParser.parseExpr
is int
-- the number of generated ParseTree
s.
The input parseTreeArrayIn is treated as I/O and is loaded with the
output parseTrees, starting at index 0; unused array entries are not
rei-initialized, so the contents beyond the index = (output count - 1)
is indeterminate.
Cnst.getLen1CnstNotationRule()
- Grammar Rules for the "Named Typed Constant" Rules, such as "c0".Cnst.getIsGrammaticalTyp()
- identifies Cnst
in expr as a "non-terminal".Cnst.findFromTypConversionRule()
- obtain Type Conversion Rule for a grammatical Type.Grammar.provableLogicStmtTypArray[0]
-- *the* Provable Logic Stmt Type Code, used to derive Start Rule Type Code.Grammar.logicStmtTypArray[0]
-- *the* Logic Stmt Type Code, used to derive Start Rule Type Code.Grammar.getNullsPermittedGRList
-- Grammar Rules for Type Codes that allow nulls, used in Special Cases (see below).Grammar.getNotationGRList()
-- List of all Notation
Grammar Rules, used in EarleyParser to build cnst.earleyFIRST and
cnst.earleyRules. Also may be useful in determining allocation sizes
for pre-allocated data structures.Grammar.maxFormulaCnt
-- length of longest Formula
in LogicalSystem (as computed during Grammar Initialization). May be
used as an aid in determining allocation sizes for pre-allocated data
structures.GrammarRule / NotationRule / TypeConversionRule / NullsPermittedRule
-- incorporated by reference :)Axiom
, including Axiom.getSyntaxAxiomVarHypReseq()
-- used indirectly via GrammarRule.buildGrammaticalParseNode()
, which resequences input parameters as needed and builds a tree node with sub-nodes (a key component when building a ParseTree
!)grammar.accumErrorMsgInList()
-- parse error and informational messages returned to invoker.|-
").
In both cases, the
parse should return a Nulls Permitted result, but the question is, for
which Type Code? Answer: if the parsed Formula Type Code is a Provable
Logic Statement Type code, then return a Nulls Permitted parse for one
of the Logical
Statement Type Codes (and if 2 results are possible then that is an
ambiguity); otherwise, if the parsed Formula Type Code is not a Provable
Logic Statement Type Code, then return a Nulls Permitted parse result
for any of the
non-Provable Logic Statement Type Codes, including the Logical Type
Codes (ex. if
Formula Type Code is "wff
" then if there is a Nulls Permitted Rule for "wff
", return that, else no parse is possible.) NOTE:
At this time, the input is restricted to allow a maximum of ONE
Provable Logic Statement Type Code and ONE Logic Statement Type Code. (!parseNodeHolderExprIn[0].mObj.isCnst()) and (parseNodeHolderExprIn[0].getStmt().isVarHyp():
This occurs on Type Conversion Syntax
Axioms, on Variable Hypothesis Statements and on Logical Statements
(see ax-mp
in set.mm
!)
If the parsed Formula's Type Code is a Provable Logic Statement Type
Code then the parse result
should be just a Variable Hypothesis parse tree -- otherwise, the parse
result should be a Type Conversion parse tree. (NOTE: when parsing a
Metamath database in its entirety, we do not send variable hypotheses
through the parser, therefore weignore that scenario in this special case. Thus, if someone purposely sends, say, " xyz $f class x $
."
through the parser it will be treated as a type conversion, from class
to class, and it will generate an error message becase there can be no
conversion from a Type Code to itself.) Cnst.len1CnstNotationRule not = null
and NotationRule.isGimmeNbr == 1)
. At the same time, increase the
length of the rewritten expr so that the first symbol occurs in array
index 1; that simplifies a fair amount of code, leaving only the -1
adjustment -- from position to index -- for accessing GrammarRule
expression symbols.n
' loops, one for each input symbol in the RuleFormatExpr
. Note: if TheScanner()
(see below) cannot find any items to load into the CompletedItemset[i]
or Itemset[i]
then the parse loop halts -- the input cannot be parsed. n + 1
ItemSet
s plus n
Completed ItemSet
s (for simplicity we'll allocate n + 1 Completed Itemsets
and leave index 0
empty.). ItemSet[0]
is derived from the
Grammar and consists of NotationRules with Type Code equal to the Start
Rule Type, excluding those Notation Rules with Sequence Number greater
than the input Maximum Sequence Number (the Lookahead optimization, using Cnst.earleyFIRST and expr[0] may
further reduce the size of Itemset[0]
-- and, in any case, we may want
to keep separate sets of NotationRules by Type Code for initialization
of ItemSet[0]
.)
Itemset[0]
. Each FIRST set contains the grammar's allowable first rule symbols -- for example, a "wff
" can begin with a "(
", a "-.
" and a class can begin with "(
", "*
",
..., etc. These will be stored in Cnst.earleyFIRST
as TreeSets and are
used by the Predictor with Lookahead to eliminate bogus predictions.
Lookahead also requires the first symbol in each Rule, so for
expediency we will add NotationRule.exprFirst
(a Cnst -- see
NotationRule.getExprFirst()
).cnst.earleyRules
is a subset of the NotationRules for a given Type (VarHyp
Types and LogicalStmtTypeCode Type), we eliminate Gimme Notation Rules with
expression Length = 1. And, to speed up the Predictor, earleyRules is
sorted in GrammarRule.MAX_SEQ_NBR order -- that allows the Predictor to
stop reading once the highest sequence number needed for the input
expression is reached.mmj.verify.Grammar
generates a combinatorial explosion of varieties of base Syntax Axioms,
altered by Nulls Permitted and Type Conversion Rules to eliminate this
as a consideration for the actual parsing process. It may be that total
processing time would be shorter if the Earley Parser were allowed to
handle Nulls and Type Conversion on its own -- the combinatorial
explosion will actually increase the number of Predicted Earley Items
rather dramatically, though of course, also reducing the number of
Completed Items. The primary purpose for the extreme machinations and
finagling in Grammar involving the combinatorial explosions is to build
a unambiguous Grammar -- especially for the main "customer", set.mm,
which contains technical
Ambiguities involving overloading (see weq and wceq, among others). The
combinatorial explosion is also helpful in detection of other
Ambiguities since it allows us to explore all possible resulting rules,
including those derived from indirect Type Conversions and Nulls
Permitted operations. In theory, assuming the ambiguities were deemed
acceptable (subject, perhaps, to post-parse processing), this
EarleyParse implementation could be modified to handle its own Nulls
Permitted and Type Conversion Rules. We would need to feed it the base
Syntax Axioms Rules, as is, for generation of its cnst.earleyFIRST and
cnst.earleyRules sets. Also, for proper processing of Nulls Permitteds,
there needs to be a loop of Predictor/Completor that operates prior to
the first, and after the last symbol in each expression being parsed;
these loops operate until no more items are added by either, and serve
to 'honor" empty productions at the start and end of the expression. It
remains to be seen whether this will be necessary. Hopefully, use of
Max Sequence Numbers will significantly cuts down on the excessive
predictions -- and the pre-processing optimization of eliminating one
symbol Gimme constant Syntax Axioms will also help. NOTE: It *is* theorized that because the Earley Itemsets will be maintained in rule.ruleNbr
(and atIndex
) order, the first output ParseTree *should* correspond to the ParseTree resulting from the current implementation -- thus, weq
's "set = set
" will be chosen prior to wceq
's "class = class
" if weq
precedes wceq
in the set.mm
file.EarleyItem
Data Structure -- stored in Itemset
spublic class EarleyItem {
/**
* rule - NotationRule reference used to avoid copying expr
* symbols over and over again.
*
* EarleyItem.rule.ruleNbr is the high order sub-key
* of an Earley Item. Together, rule.ruleNbr and
* atIndex make a unique key, which is essential for
* maintaining *set* Earley Itemsets.
*/
NotationRule rule;
/**
* atIndex - index to next Cnst (Sym) in expr, equal to the
* Earley loop counter, "i"; 1 is first, not 0, which
* is left empty in expr for simplicity.
*
* EarleyItem.atIndex is the low-order sub-key of an
* Earley Item. Together, rule.ruleNbr and atIndex make
* a unique key, which is essential for maintaining
* Earley Itemsets.
*/
int atIndex;
/**
* dotindex - index of the Earley "dot" (or "gap") used instead
* of physically moving
the dot around in the EarleyItem. A
* value of 1 means that the dot is *before* the first
* sym in the rule's expr, while a value equal to
* expr.length means that the EarleyItem is completed.
*/
int dotIndex;
/**
* afterDot - if item completed equal to null, else equal to
* ruleExpr[dotIndex]. Used to avoid repetitious and
* annoying "hunting trips" into the rule's forest.
*/
Cnst afterDot;
/*
* Compare for equality with another EarleyItem.
* Equal if and only if Rule and atIndexSym are
* identical.
*
* @return returns true if equal, otherwise false.
*/
public boolean equals(Object obj) {
return (this == obj) ? true
: !(obj instanceof EarleyItem) ? false
: rule != ((EarleyItem)obj).rule ? false
: atIndex != ((EarleyItem)obj).atIndex? false
: true;
}
public String toString() {
StringBuilder s = new StringBuilder();
s.append(GrammarConstants.ERRMSG_AT_CAPTION);
s.append(atIndex);
s.append(GrammarConstants.ERRMSG_DOT_CAPTION);
s.append(dotIndex);
s.append(GrammarConstants.ERRMSG_AFTER_DOT_CAPTION);
if (afterDot == null) {
s.append(" ");
}
else {
s.append(afterDot);
}
s.append(" ");
s.append(rule.ruleNbr);
s.append(GrammarConstants.ERRMSG_COLON_CAPTION);
s.append(rule.getBaseSyntaxAxiom().getTyp());
s.append(GrammarConstants.ERRMSG_COLON_CAPTION);
s.append(rule.getBaseSyntaxAxiom().getLabel());
s.append(" ");
s.append(GrammarConstants.ERRMSG_RULE_EXPR_CAPTION);
s.append(rule.getRuleFormatExprAsString());
return s.toString();
}
}
xyz $a |- ( ph <-> x = A ) $.
" ( wff <-> set = class )
"wph vx cv cA wceq wb"
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[0] (reduced to minimum rules for ease of exposition)
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb |
wff |
1 |
1 |
( |
Itemset[0]
preloaded by Predictor with all rules that derive directly from the
Start Rule Type Code (and that have Max Sequence Number <= the input
formula Max Sequence Number.) Input to Predictor(0) can be the
precomputed Earley Itemset[0] that is based on just the grammar, not
the input formula. The Predictor can also be enhanced with Lookahead
using the EarleyFIRST set, which is also pre-computed using just the
grammar -- this will reduce the size of both Itemset[0] and subsequent
predictions. |
18:wceq |
wff |
1 |
1 |
set |
NOTE:
With Lookahead, this item will not be loaded because set.earleyFirst
does not contain "(", the first symbol of the parsed expression. |
|
|
|
|
|
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[1]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb |
wff |
1 |
2 |
wff |
Scanner copied from Itemset[i -
1] because it contained afterDot = "(". dotIndex moved forward, after
the "(" to the next sym, which is "wff". |
3:wb |
wff |
2 |
1 |
( |
Predictor copied all rules for wff because of previous "Active" item in ItemSet[1], which contains Non-Terminal afterDot = wff |
18:wceq |
wff |
2 |
1 |
set |
Predictor: ditto. Note: there are no rules present for class or set, so Predictor stops here! |
CompletedItemSet[1]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
empty? yep. |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[2]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb | wff | 1 | 3 | <-> | Scanner copied from
Itemset[i - 1] because it contained afterDot = "wff". dotIndex moved
forward, after the "wff" to the next sym, which is "<->". |
|
|
|
|
|
Predictor: does not add rules for terminal "<->"...done. |
|
|
|
|
|
|
CompletedItemSet[2]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
empty? yep. |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[3]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb | wff | 1 | 4 | wff | Scanner copied from
Itemset[i - 1] because it contained afterDot = "<->". dotIndex moved
forward, after the "<->" to the next sym in the rule, which is "wff". |
3:wb | wff | 4 | 1 | ( | Predictor copied all
rules for wff because of previous "Active" item in ItemSet[3], which
contains Non-Terminal afterDot = wff. |
18:wceq | wff | 4 | 1 | set | Predictor: ditto. Note: there are no rules present for class or set, so Predictor stops here! |
CompletedItemSet[3]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
empty? yep. |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[4]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
18:wceq | wff | 4 | 2 | = | Scanner copied from
Itemset[i - 1] because it contained afterDot = "set". dotIndex moved
forward, after the "set" to the next sym in the rule, which is "=", a terminal. |
|
|
|
|
|
Predictor: does not add rules for afterDot terminal, "="...done. |
CompletedItemSet[4]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
empty? yep. |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[5]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
18:wceq | wff | 4 | 3 | class | Scanner copied from
Itemset[i - 1] because it contained afterDot = "=". dotIndex moved
forward, after the "=" to the next sym in the rule, which is "class", a non-terminal. |
|
|
|
|
|
Predictor: does not add rules for afterDot non-terminal, "class" because there are no grammar rules for "class". |
CompletedItemSet[5]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
empty? yep. still empty. sigh!!! |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[6]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb | wff | 1 | 5 | ) | Completer copied
copied from
Itemset[3] because it contained afterDot = "wff" and the
CompletedItemSet item below has rule type "wff" atIndex 4 (it goes to
Itemset[m - 1]. It then moved dotIndex forward and stores the item in
Active because it is not completed, yet. |
|
|
|
|
|
Predictor: does not add rules for afterDot terminal, nothing for it to do. |
CompletedItemSet[6]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
18:wceq | wff | 4 | 4 |
| Scanner copied from
Itemset[i - 1] because it contained afterDot = "class". dotIndex moved
forward, after the "class" to the end! Therefore this is completed and should be in CompletedItemSet! Yay! |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
( |
wff |
<-> |
set |
= |
class
|
) |
ItemSet[7]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
|
|
|
|
|
Predictor:
nothing to do, Active ItemSet is empty. And on top of that, I don't
think we need to run Predictor on the final symbol. |
CompletedItemSet[7]
___________rule___________ |
rule |
at |
dot |
after |
_______________comments_______________ |
3:wb | wff | 1 | 6 |
| Scanner: copied
copied from
Itemset[6] because it contained afterDot = ")". It then moved
dotIndex forward and sets afterDot to null because this item is
completed -- so it is stored in CompletedItemSet instead of Active. |