java - How to train a new parser model for Stanford NLP from a treebank?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 12:51:45
I have downloaded the UPDT (Uppsala Persian Dependency Treebank), and I am trying to build a dependency parser model from it using Stanford NLP. I tried training the model both from the command line and from Java code, but I get an exception in both cases.

1- Training the model from the command line:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train UPDT\train.conll 0 -saveToSerializedFile UPDT\updt.model.ser.gz

When I run the command above, I get this exception:

done [read 26 trees]. Time elapsed: 0 ms
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
smooth=false
PA=true
GPA=false
selSplit=false (0.0)
mUnary=0
mUnaryTags=false
sPPT=false
tagPA=false
tagSelSplit=false (0.0)
rightRec=false
leftRec=false
collinsPunc=false
markov=false
mOrd=1
hSelSplit=false (10)
compactGrammar=0
postPA=false
postGPA=false
selPSplit=false (0.0)
tagSelPSplit=false (0.0)
postSplitWithBase=false
fractionBeforeUnseenCounting=0.5
openClassTypesThreshold=50
preTransformer=null
taggedFiles=null
predictSplits=false
splitCount=1
splitRecombineRate=0.0
simpleBinarizedLabels=false
noRebinarization=false
trainingThreads=1
dvKBest=100
trainingIterations=40
batchSize=25
regCost=1.0E-4
qnIterationsPerBatch=1
qnEstimates=15
qnTolerance=15.0
debugOutputFrequency=0
randomSeed=0
learningRate=0.1
deltaMargin=0.1
unknownNumberVector=true
unknownDashedWordVectors=true
unknownCapsVector=true
unknownChineseYearVector=true
unknownChineseNumberVector=true
unknownChinesePercentVector=true
dvSimplifiedModel=false
scalingForInit=0.5
maxTrainTimeSeconds=0
unkWord=*UNK*
lowercaseWordVectors=false
transformMatrixType=DIAGONAL
useContextWords=false
trainWordVectors=true
stalledIterationLimit=12
markStrahler=false

Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 12 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
تلفیقی
_
N
N_SING
SING
13
appos
_
_
16
طنزآمیز
_
ADJ
ADJ
ADJ
15
amod
_
_
17
از
_
P
P
P
15
prep
_
_
18
اسم
_
N
N_SING
SING
17
pobj
_
_
19
و
_
CON
CON
CON
18
cc
_
_
20
شیوه
_
N
N_SING
SING
18
conj
_
_
21
کارش
_
N
N_SING
SING
20
poss/pc
_
_
22)
        at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
        at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
        at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
        at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
        at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
        at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
        at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
        at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
        at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
        at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
        at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
        at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
        at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
        at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1394)

2- Training the model with Java code:

import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Treebank;
import edu.stanford.nlp.trees.TreebankLanguagePack;


public class FromTreeBank {

    public static void main(String[] args) throws IOException {
        String treebankPathUPDT = "src/model/UPDT.1.2/train.conll";
        String persianFilePath = "src/txt/persianSentences.txt";

        File file = new File(treebankPathUPDT);

        Options op = new Options();
        Treebank tr = op.tlpParams.diskTreebank();
        tr.loadPath(file);
        LexicalizedParser lpc = LexicalizedParser.trainFromTreebank(tr, op);

        // Once lpc is trained, use it to parse a file containing Persian text
        //demoDP(lpc, persianFilePath);
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        TreebankLanguagePack tlp = lp.treebankLanguagePack();
        GrammaticalStructureFactory gsf = null;
        if (tlp.supportsGrammaticalStructures()) {
            gsf = tlp.grammaticalStructureFactory();
        }
        // You could also create a tokenizer here and pass it
        // to DocumentPreprocessor
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            if (gsf != null) {
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                Collection tdl = gs.typedDependenciesCCprocessed();
                System.out.println(tdl);
                System.out.println();
            }
        }
    }

}

The Java program above also throws this exception:

(Options parameters and Train parameters output identical to the command-line run above.)

Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 122 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
تلفیقی
_
N
N_SING
SING
13
appos
_
_
16
طنزآمیز
_
ADJ
ADJ
ADJ
15
amod
_
_
17
از
_
P
P
P
15
prep
_
_
18
اسم
_
N
N_SING
SING
17
pobj
_
_
19
و
_
CON
CON
CON
18
cc
_
_
20
شیوه
_
N
N_SING
SING
18
conj
_
_
21
کارش
_
N
N_SING
SING
20
poss/pc
_
_
22)


at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:267)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:278)
at FromTreeBank.main(FromTreeBank.java:46)

Actually, I am not sure whether the command line or the Java code is correct. I cannot figure out what is missing in either one. I would appreciate it if someone could tell me why these exceptions occur and what is going wrong, or suggest a better way to train a model from a treebank.

Thanks

Best Answer

The biggest problem here is that you are trying to train a constituency parser (also known as a phrase-structure parser) on a dependency treebank, and that won't work.
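To illustrate the mismatch: LexicalizedParser expects Penn Treebank-style bracketed trees, while train.conll is in the tab-separated CoNLL dependency format, which is why PennTreeReader emits the "extra non-matching right parenthesis" warning before the head finder fails. The tokens below are hypothetical, shown only to contrast the two formats:

```text
# Penn Treebank-style bracketed tree (what LexicalizedParser expects):
(S (NP (DT the) (NN cat)) (VP (VBZ sits)))

# CoNLL dependency format (what train.conll contains): one token per line,
# tab-separated columns ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, ...
1	the	_	DT	DT	_	2	det	_	_
2	cat	_	N	N_SING	SING	3	nsubj	_	_
3	sits	_	V	V_PRS	PRS	0	root	_	_
```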

CoreNLP also ships with a neural-network-based dependency parser, which you can train on the UPDT data. See the parser's project page for instructions on how to train a model.
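As a rough sketch of what that training invocation looks like (the jar name, file paths, and the Persian embeddings file here are assumptions; check the project page for the authoritative options):

```shell
# Train the CoreNLP neural dependency parser on the UPDT CoNLL data.
# -embedFile points at pre-trained word vectors; -devFile is used for
# model selection during training.
java -mx4g -cp stanford-corenlp.jar edu.stanford.nlp.parser.nndep.DependencyParser \
  -trainFile UPDT/train.conll \
  -devFile UPDT/dev.conll \
  -embedFile persian-vectors.txt \
  -model UPDT/nndep.persian.model.txt.gz
```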

Regarding "java - How to train a new parser model for Stanford NLP from a treebank?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35761744/
