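I have downloaded the UPDT Persian treebank (Uppsala Persian Dependency Treebank) and am trying to build a dependency parser model from it with Stanford NLP. I have tried training the model both from the command line and from Java code, but I get an exception in both cases.
1- Training the model from the command line: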
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train UPDT\train.conll 0 -saveToSerializedFile UPDT\updt.model.ser.gz
When I run the command above, I get this exception:
done [read 26 trees]. Time elapsed: 0 ms
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
smooth=false
PA=true
GPA=false
selSplit=false (0.0)
mUnary=0
mUnaryTags=false
sPPT=false
tagPA=false
tagSelSplit=false (0.0)
rightRec=false
leftRec=false
collinsPunc=false
markov=false
mOrd=1
hSelSplit=false (10)
compactGrammar=0
postPA=false
postGPA=false
selPSplit=false (0.0)
tagSelPSplit=false (0.0)
postSplitWithBase=false
fractionBeforeUnseenCounting=0.5
openClassTypesThreshold=50
preTransformer=null
taggedFiles=null
predictSplits=false
splitCount=1
splitRecombineRate=0.0
simpleBinarizedLabels=false
noRebinarization=false
trainingThreads=1
dvKBest=100
trainingIterations=40
batchSize=25
regCost=1.0E-4
qnIterationsPerBatch=1
qnEstimates=15
qnTolerance=15.0
debugOutputFrequency=0
randomSeed=0
learningRate=0.1
deltaMargin=0.1
unknownNumberVector=true
unknownDashedWordVectors=true
unknownCapsVector=true
unknownChineseYearVector=true
unknownChineseNumberVector=true
unknownChinesePercentVector=true
dvSimplifiedModel=false
scalingForInit=0.5
maxTrainTimeSeconds=0
unkWord=*UNK*
lowercaseWordVectors=false
transformMatrixType=DIAGONAL
useContextWords=false
trainWordVectors=true
stalledIterationLimit=12
markStrahler=false
Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 12 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
??????
_
N
N_SING
SING
13
appos
_
_
16
???????
_
ADJ
ADJ
ADJ
15
amod
_
_
17
??
_
P
P
P
15
prep
_
_
18
???
_
N
N_SING
SING
17
pobj
_
_
19
?
_
CON
CON
CON
18
cc
_
_
20
????
_
N
N_SING
SING
18
conj
_
_
21
????
_
N
N_SING
SING
20
poss/pc
_
_
22)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1394)
2- Training the model from Java code:
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Treebank;
import edu.stanford.nlp.trees.TreebankLanguagePack;

public class FromTreeBank {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        String treebankPathUPDT = "src/model/UPDT.1.2/train.conll";
        String persianFilePath = "src/txt/persianSentences.txt";
        File file = new File(treebankPathUPDT);
        Options op = new Options();
        Treebank tr = op.tlpParams.diskTreebank();
        tr.loadPath(file);
        LexicalizedParser lpc = LexicalizedParser.trainFromTreebank(tr, op);
        // Once the lpc is trained, use it to parse a file which contains Persian text
        // demoDP(lpc, persianFilePath);
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        TreebankLanguagePack tlp = lp.treebankLanguagePack(); // a PennTreebankLanguagePack for English
        GrammaticalStructureFactory gsf = null;
        if (tlp.supportsGrammaticalStructures()) {
            gsf = tlp.grammaticalStructureFactory();
        }
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            if (gsf != null) {
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                Collection tdl = gs.typedDependenciesCCprocessed();
                System.out.println(tdl);
                System.out.println();
            }
        }
    }
}
The Java program above throws this exception as well:
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
smooth=false
PA=true
GPA=false
selSplit=false (0.0)
mUnary=0
mUnaryTags=false
sPPT=false
tagPA=false
tagSelSplit=false (0.0)
rightRec=false
leftRec=false
collinsPunc=false
markov=false
mOrd=1
hSelSplit=false (10)
compactGrammar=0
postPA=false
postGPA=false
selPSplit=false (0.0)
tagSelPSplit=false (0.0)
postSplitWithBase=false
fractionBeforeUnseenCounting=0.5
openClassTypesThreshold=50
preTransformer=null
taggedFiles=null
predictSplits=false
splitCount=1
splitRecombineRate=0.0
simpleBinarizedLabels=false
noRebinarization=false
trainingThreads=1
dvKBest=100
trainingIterations=40
batchSize=25
regCost=1.0E-4
qnIterationsPerBatch=1
qnEstimates=15
qnTolerance=15.0
debugOutputFrequency=0
randomSeed=0
learningRate=0.1
deltaMargin=0.1
unknownNumberVector=true
unknownDashedWordVectors=true
unknownCapsVector=true
unknownChineseYearVector=true
unknownChineseNumberVector=true
unknownChinesePercentVector=true
dvSimplifiedModel=false
scalingForInit=0.5
maxTrainTimeSeconds=0
unkWord=*UNK*
lowercaseWordVectors=false
transformMatrixType=DIAGONAL
useContextWords=false
trainWordVectors=true
stalledIterationLimit=12
markStrahler=false
Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 122 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
تلفیقی
_
N
N_SING
SING
13
appos
_
_
16
طنزآمیز
_
ADJ
ADJ
ADJ
15
amod
_
_
17
از
_
P
P
P
15
prep
_
_
18
اسم
_
N
N_SING
SING
17
pobj
_
_
19
و
_
CON
CON
CON
18
cc
_
_
20
شیوه
_
N
N_SING
SING
18
conj
_
_
21
کارش
_
N
N_SING
SING
20
poss/pc
_
_
22)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:267)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:278)
at FromTreeBank.main(FromTreeBank.java:46)
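Actually, I am not sure whether the command line or the Java code is correct. I can't figure out what is missing from either of them. I would be grateful if someone could tell me why these exceptions occur and what is wrong, or suggest a better way to train a model from the treebank.
Thanks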
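The biggest problem here is that you are trying to train a constituency parser (also known as a phrase-structure parser) on a dependency treebank, which cannot work.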
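To make the mismatch concrete: judging by the exception text, each token in train.conll is one row of ten fields (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), not a bracketed tree. The trace also suggests that a literal "(" token was read as the start of a tree whose label became the next field, "_", which would explain "No head rule defined for _". The following is only a minimal sketch of reading such a row. The class name ConllToken is hypothetical, the fields are assumed to be tab-separated, and the ID 14 in the sample row is inferred from the next token being 15 rather than copied from the log.

```java
/** One token row of a CoNLL-X style file (hypothetical helper, not part of CoreNLP). */
final class ConllToken {
    final int id;          // field 1: position of the token in the sentence
    final String form;     // field 2: surface word
    final String cpostag;  // field 4: coarse part-of-speech tag
    final int head;        // field 7: id of the governing token (0 would mean root)
    final String deprel;   // field 8: dependency relation to the head

    private ConllToken(int id, String form, String cpostag, int head, String deprel) {
        this.id = id;
        this.form = form;
        this.cpostag = cpostag;
        this.head = head;
        this.deprel = deprel;
    }

    /** Parses one row; assumes exactly ten tab-separated fields. */
    static ConllToken parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 fields, got " + f.length);
        }
        return new ConllToken(Integer.parseInt(f[0]), f[1], f[3],
                              Integer.parseInt(f[6]), f[7]);
    }

    @Override
    public String toString() {
        return "token " + id + " '" + form + "' [" + cpostag + "] attaches to "
                + head + " as " + deprel;
    }

    public static void main(String[] args) {
        // The "(" token from the failing sentence; its id (14) is an inference.
        String row = "14\t(\t_\tDELM\tDELM\tDELM\t13\tpunct\t_\t_";
        System.out.println(parse(row));
    }
}
```

Each row carries a head index and a relation label, which is dependency information. A constituency trainer such as LexicalizedParser expects bracketed phrase-structure trees instead, so it has nothing sensible to extract from this file.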
CoreNLP also ships with a neural-network-based dependency parser, which you can train on the UPDT data. See the parser's project page for instructions on how to train a model.
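As a rough illustration of what such a training run might look like, the sketch below only assembles a command line for the neural dependency parser's main class, edu.stanford.nlp.parser.nndep.DependencyParser, using the -trainFile, -devFile and -model flags as I recall them from the project page. It does not call CoreNLP itself. The helper class NndepTrainCommand, the dev file path and the output model name are all hypothetical, so please check the flags against the project page before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds (but does not run) a training command for the nndep parser. Hypothetical helper. */
final class NndepTrainCommand {

    static List<String> build(String trainFile, String devFile, String modelFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("edu.stanford.nlp.parser.nndep.DependencyParser");
        cmd.add("-trainFile");
        cmd.add(trainFile);   // CoNLL-formatted training data
        cmd.add("-devFile");
        cmd.add(devFile);     // held-out data used during training
        cmd.add("-model");
        cmd.add(modelFile);   // where the trained model is written
        return cmd;
    }

    public static void main(String[] args) {
        // Paths other than train.conll are placeholders for illustration.
        List<String> cmd = build("UPDT/train.conll", "UPDT/dev.conll",
                                 "UPDT/updt.nndep.model.txt.gz");
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed command would still need the CoreNLP jar on the classpath. The project page also describes options for pretrained word embeddings, which are left out here.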