I have implemented Hobbs' algorithm for anaphora resolution, together with the Lappin & Leass ranking of the alternative candidates.
What bothers me is that the description of the algorithm is completely informal, and since my implementation does not resolve some sentences correctly, I am not sure whether the limitation lies in my implementation or in the algorithm itself.
Here is the version I used, found in Jurafsky & Martin (a condensed sketch of the first two steps follows the list):
- Begin at the noun phrase (NP) node immediately dominating the pronoun.
- Go up the tree to the first NP or sentence (S) node encountered. Call this node X, and call the path used to reach it p.
- Traverse all branches below node X to the left of path p in a left-to-right, breadth-first fashion. Propose as the antecedent any NP node that is encountered which has an NP or S node between it and X.
- If node X is the highest S node in the sentence, traverse the surface parse trees of previous sentences in the text in order of recency, the most recent first; each tree is traversed in a left-to-right, breadth-first manner, and when an NP node is encountered, it is proposed as antecedent. If X is not the highest S node in the sentence, continue to step 5.
- From node X, go up the tree to the first NP or S node encountered. Call this new node X, and call the path traversed to reach it p.
- If X is an NP node and if the path p to X did not pass through the Nominal node that X immediately dominates, propose X as the antecedent.
- Traverse all branches below node X to the left of path p in a left-to-right, breadth-first manner. Propose any NP node encountered as the antecedent.
- If X is an S node, traverse all branches of node X to the right of path p in a left-to-right, breadth-first manner, but do not go below any NP or S node encountered. Propose any NP node encountered as the antecedent.
- Go to Step 4.
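To keep the discussion concrete, here is a condensed sketch of steps 1 and 2 as I implemented them. It mirrors the beginning of HobbsResolve in the full listing below; climbToX is just an illustrative name, and Tree is the Stanford parse-tree class:
// Step 1: begin at the NP node immediately dominating the pronoun (the parent of the PRP node).
// Step 2: from there, go up until the first NP or S node is encountered; that node is X.
static Tree climbToX(Tree wholetree, Tree pronoun) {
    Tree node = pronoun.parent(wholetree);
    node = node.parent(wholetree);
    while (!node.label().value().equals("NP") && !node.label().value().equals("S"))
        node = node.parent(wholetree);
    return node; // this is X; the path p is the chain of parent() steps just followed
}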
Look at step 3: "to the left of path p". The way I interpret it is to traverse the subtrees from left to right until I find the branch that contains the pronoun (and is therefore part of the path from the pronoun to X). In Java:
for (Tree relative : X.children()) {
for (Tree candidate : relative) {
if (candidate.contains(pronoun)) break; // I only want the nodes to the LEFT of (i.e. coming before) the path leading to X; contains(pronoun) <-> node is on the path
...
However, doing this does not handle sentences like "The house is of King Arthur himself". This is because "King Arthur" contains "himself", so it is never taken into account.
Is this a limitation of Hobbs' algorithm, or am I getting something wrong here?
For reference, the full Java code (using the Stanford Parser) is here:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.StringEscapeUtils;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
class ParseAllXMLDocuments {
static final int MAXPREVSENTENCES = 4; // How many previous sentences to search for antecedents
/**
* @throws IOException
* @throws ParserConfigurationException
* @throws SAXException
* @throws TransformerException */
public static void main(String[] args) throws IOException, SAXException, ParserConfigurationException, TransformerException {
// File dataFolder = new File("DataToPort");
// File[] documents;
String grammar = "grammar/englishPCFG.ser.gz";
String[] options = { "-maxLength", "100", "-retainTmpSubcategories" };
LexicalizedParser lp =
new LexicalizedParser(grammar, options);
//
// if (dataFolder.isDirectory()) {
// documents = dataFolder.listFiles();
// } else {
// documents = new File[] {dataFolder};
// }
// int currfile = 0;
// int totfiles = documents.length;
// for (File paper : documents) {
// currfile++;
// if (paper.getName().equals(".DS_Store")||paper.getName().equals(".xml")) {
// currfile--;
// totfiles--;
// continue;
// }
// System.out.println("Working on "+paper.getName()+" (file "+currfile+" out of "+totfiles+").");
//
// DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance(); // This is for XML
// DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
// Document doc = docBuilder.parse(paper.getAbsolutePath());
//
// NodeList textlist = doc.getElementsByTagName("text");
// for(int i=0; i < textlist.getLength(); i++) {
// Node currentnode = textlist.item(i);
// String wholetext = textlist.item(i).getTextContent();
String wholetext = "The house of King Arthur himself. You live in it all the day.";
//System.out.println(wholetext);
//Iterable<List<? extends HasWord>> sentences;
System.out.println(wholetext);
ArrayList<Tree> parseTrees = new ArrayList<Tree>();
String asd = "";
int j = 0;
StringReader stringreader = new StringReader(wholetext);
DocumentPreprocessor dp = new DocumentPreprocessor(stringreader);
@SuppressWarnings("rawtypes")
ArrayList<List> sentences = preprocess(dp);
for (List sentence : sentences) {
parseTrees.add( lp.apply(sentence) ); // Parsing a new sentence and adding it to the parsed tree
ArrayList<Tree> PronounsList = findPronouns(parseTrees.get(j)); // Locating all pronouns to resolve in the sentence
Tree corefedTree;
for (Tree pronounTree : PronounsList) {
parseTrees.set(parseTrees.size()-1, HobbsResolve(pronounTree, parseTrees)); // Resolving the coref and modifying the tree for each pronoun
}
StringWriter strwr = new StringWriter();
PrintWriter prwr = new PrintWriter(strwr);
TreePrint tp = new TreePrint("penn");
tp.printTree(parseTrees.get(j), prwr);
prwr.flush();
asd += strwr.toString();
j++;
}
String armando = "";
for (Tree sentence : parseTrees) {
for (Tree leaf : Trees.leaves(sentence))
armando += leaf + " ";
}
System.out.println(armando);
System.out.println("All done.");
// currentnode.setTextContent(asd);
// }
// TransformerFactory transformerFactory = TransformerFactory.newInstance();
// Transformer transformer = transformerFactory.newTransformer();
// DOMSource source = new DOMSource(doc);
// StreamResult result = new StreamResult(paper);
// transformer.transform(source, result);
//
// System.out.println("Done");
// }
}
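// Hobbs' search: collects candidate NPs from the current tree (the last element of "forest")
// and from up to MAXPREVSENTENCES previous trees, ranks them with a Lappin & Leass-style
// score, and splices the winning candidate in place of the NP that dominates the pronoun.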
public static Tree HobbsResolve(Tree pronoun, ArrayList<Tree> forest) {
Tree wholetree = forest.get(forest.size()-1); // The last one is the one I am going to start from
ArrayList<Tree> candidates = new ArrayList<Tree>();
List<Tree> path = wholetree.pathNodeToNode(wholetree, pronoun);
System.out.println(path);
// Step 1
Tree ancestor = pronoun.parent(wholetree); // This one locates the NP the pronoun is in, therefore we need one more "parenting" !
// Step 2
ancestor = ancestor.parent(wholetree);
//System.out.println("LABEL: "+pronoun.label().value() + "\n\tVALUE: "+pronoun.firstChild());
while ( !ancestor.label().value().equals("NP") && !ancestor.label().value().equals("S") )
ancestor = ancestor.parent(wholetree);
Tree X = ancestor;
path = X.pathNodeToNode(wholetree, pronoun);
System.out.println(path);
// Step 3
for (Tree relative : X.children()) {
for (Tree candidate : relative) {
if (candidate.contains(pronoun)) break; // I only want the nodes to the LEFT of (i.e. coming before) the path leading to X; contains(pronoun) <-> node is on the path
//System.out.println("LABEL: "+relative.label().value() + "\n\tVALUE: "+relative.firstChild());
if ( (candidate.parent(wholetree) != X) && (candidate.parent(wholetree).label().value().equals("NP") || candidate.parent(wholetree).label().value().equals("S")) )
if (candidate.label().value().equals("NP")) // "Propose as the antecedent any NP node that is encountered which has an NP or S node between it and X"
candidates.add(candidate);
}
}
// Step 9 is a GOTO to step 4, hence I wrap steps 4 to 8 inside a while loop.
while (true) { // It is NOT an infinite loop.
// Step 4
if (X.parent(wholetree) == wholetree) {
for (int q=1 ; q < MAXPREVSENTENCES; ++q) {// I am looking for the prev sentence (hence we start with 1)
if (forest.size()-1 < q) break; // If I don't have it, break
Tree prevTree = forest.get(forest.size()-1-q); // go to previous tree
// Now we look for each S subtree, in order of recency (hence right-to-left, hence opposite order of that of .children() ).
ArrayList<Tree> backlist = new ArrayList<Tree>();
for (Tree child : prevTree.children()) {
for (Tree subtree : child) {
if (subtree.label().value().equals("S")) {
backlist.add(child);
break;
}
}
}
for (int i = backlist.size()-1 ; i >=0 ; --i) {
Tree Treetovisit = backlist.get(i);
for (Tree relative : Treetovisit.children()) {
for (Tree candidate : relative) {
if (candidate.contains(pronoun)) continue; // I only want the nodes to the LEFT of (i.e. coming before) the path leading to X; contains(pronoun) <-> node is on the path
//System.out.println("LABEL: "+relative.label().value() + "\n\tVALUE: "+relative.firstChild());
if (candidate.label().value().equals("NP")) { // "Propose as the antecedent any NP node that you find"
if (!candidates.contains(candidate)) candidates.add(candidate);
}
}
}
}
}
break; // It will always come here eventually
}
// Step 5
ancestor = X.parent(wholetree);
//System.out.println("LABEL: "+pronoun.label().value() + "\n\tVALUE: "+pronoun.firstChild());
while ( !ancestor.label().value().equals("NP") && !ancestor.label().value().equals("S") )
ancestor = ancestor.parent(wholetree);
X = ancestor;
// Step 6
if (X.label().value().equals("NP")) { // If X is an NP
for (Tree child : X.children()) { // Find the nominal nodes that X directly dominates
if (child.label().value().equals("NN") || child.label().value().equals("NNS") || child.label().value().equals("NNP") || child.label().value().equals("NNPS") )
if (! child.contains(pronoun)) candidates.add(X); // If one of them is not in the path between X and the pronoun, add X to the antecedents
}
}
// Step 7
for (Tree relative : X.children()) {
for (Tree candidate : relative) {
if (candidate.contains(pronoun)) continue; // I only want the nodes to the LEFT of (i.e. coming before) the path leading to X; contains(pronoun) <-> node is on the path
//System.out.println("LABEL: "+relative.label().value() + "\n\tVALUE: "+relative.firstChild());
if (candidate.label().value().equals("NP")) { // "Propose as the antecedent any NP node that you find"
boolean contains = false;
for (Tree oldercandidate : candidates) {
if (oldercandidate.contains(candidate)) {
contains=true;
break;
}
}
if (!contains) candidates.add(candidate);
}
}
}
// Step 8
if (X.label().value().equals("S")) {
boolean right = false;
// Now we want all branches to the RIGHT of the path pronoun -> X.
for (Tree relative : X.children()) {
if (relative.contains(pronoun)) {
right = true;
continue;
}
if (!right) continue;
for (Tree child : relative) { // Go in but do not go below any NP or S node. Go below the rest
if (child.label().value().equals("NP")) {
candidates.add(child);
break; // not sure if this means: avoid going below the NP but continue with the remaining non-NP children. It should, since the iteration is depth-first.
}
if (child.label().value().equals("S")) break; // Same
}
}
}
}
// Step 9 is a GOTO, so we use a while.
System.out.println(pronoun + ": CHAIN IS " + candidates.toString());
ArrayList<Integer> scores = new ArrayList<Integer>();
for (int j=0; j < candidates.size(); ++j) {
Tree candidate = candidates.get(j);
Tree parent = null;
int parent_index = 0;
for (Tree tree : forest) {
if (tree.contains(candidate)) {
parent = tree;
break;
}
++parent_index;
}
scores.add(0);
if (parent_index == 0)
scores.set(j, scores.get(j)+100); // If in the last sentence, +100 points
scores.set(j, scores.get(j) + syntacticScore(candidate, parent));
if (existentialEmphasis(candidate)) // Example: "There was a dog standing outside"
scores.set(j, scores.get(j)+70);
if (!adverbialEmphasis(candidate, parent))
scores.set(j, scores.get(j)+50);
if (headNounEmphasis(candidate, parent))
scores.set(j, scores.get(j)+80);
int sz = forest.size()-1;
// System.out.println("pronoun in sentence " + sz + "(sz). Candidate in sentence "+parent_index+" (parent_index)");
int dividend = 1;
for (int u=0; u < sz - parent_index; ++u)
dividend *= 2;
//System.out.println("\t"+dividend);
scores.set(j, scores.get(j)/dividend);
System.out.println(candidate + " -> " + scores.get(j) );
}
int max = -1;
int max_index = -1;
for (int i=0; i < scores.size(); ++i) {
if (scores.get(i) > max) {
max_index = i;
max = scores.get(i);
}
}
Tree final_candidate = candidates.get(max_index);
System.out.println("My decision for " + pronoun + " is: " + final_candidate);
// Decide what candidate, with both gender resolution and Lappin and Leass ranking.
Tree pronounparent = pronoun.parent(wholetree).parent(wholetree); // 1 parent gives me the NP of the pronoun
int pos = 0;
for (Tree sibling : pronounparent.children()) {
System.out.println("Sibling "+pos+": " + sibling);
if (sibling.contains(pronoun)) break;
++pos;
}
System.out.println("Before setchild: " + pronounparent);
@SuppressWarnings("unused")
Tree returnval = pronounparent.setChild(pos, final_candidate);
System.out.println("After setchild: " + pronounparent);
return wholetree; // wholetree is already modified, since it contains pronounparent
}
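// Grammatical-role weight in the Lappin & Leass spirit: 80 for a subject (NP under S),
// 50 for a direct object (NP inside a VP), 40 for an indirect object (NP inside a PP inside a VP).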
private static int syntacticScore(Tree candidate, Tree root) {
// We will check whether the NP is inside an S (hence it would be a subject)
// a VP (direct object)
// a PP inside a VP (an indirect obj)
Tree parent = candidate;
while (! parent.label().value().equals("S")) {
if (parent.label().value().equals("VP")) return 50; // direct obj
if (parent.label().value().equals("PP")) {
Tree grandparent = parent.parent(root);
while (! grandparent.label().value().equals("S")) {
if (grandparent.label().value().equals("VP")) // indirect obj is a PP inside a VP
return 40;
grandparent = grandparent.parent(root);
}
}
parent = parent.parent(root);
}
return 80; // If nothing remains, it must be the subject
}
private static boolean existentialEmphasis(Tree candidate) {
// We want to check whether our NP's Dets are "a" or "an".
for (Tree child : candidate) {
if (child.label().value().equals("DT")) {
for (Tree leaf : child) {
if (leaf.value().equals("a")||leaf.value().equals("an")
||leaf.value().equals("A")||leaf.value().equals("An") ) {
//System.out.println("Existential emphasis!");
return true;
}
}
}
}
return false;
}
private static boolean headNounEmphasis(Tree candidate, Tree root) {
Tree parent = candidate.parent(root);
while (! parent.label().value().equals("S")) { // If it is the head NP, it is not contained in another NP (that's exactly how the original algorithm does it)
if (parent.label().value().equals("NP")) return false;
parent = parent.parent(root);
}
return true;
}
private static boolean adverbialEmphasis(Tree candidate, Tree root) { // Like in "Inside the castle, King Arthur was invincible". "Castle" has the adv emph.
Tree parent = candidate;
while (! parent.label().value().equals("S")) {
if (parent.label().value().equals("PP")) {
for (Tree sibling : parent.siblings(root)) {
if ( (sibling.label().value().equals(","))) {
//System.out.println("adv Emph!");
return true;
}
}
}
parent = parent.parent(root);
}
return false;
}
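// Collects every PRP node in the tree, skipping the first- and second-person pronouns
// ("I", "you", "You"), which are not resolved.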
public static ArrayList<Tree> findPronouns(Tree t) {
ArrayList<Tree> pronouns = new ArrayList<Tree>();
if (t.label().value().equals("PRP") && !t.children()[0].label().value().equals("I") && !t.children()[0].label().value().equals("you") && !t.children()[0].label().value().equals("You")) {
pronouns.add(t);
}
else
for (Tree child : t.children())
pronouns.addAll(findPronouns(child));
return pronouns;
}
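// Filters the tokenized sentences before parsing: drops sentences that are not ASCII-printable,
// sentences that are mostly non-alphabetical junk, and sentences longer than 100 words.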
@SuppressWarnings("rawtypes")
public static ArrayList<List> preprocess(DocumentPreprocessor strarray) {
ArrayList<List> Result = new ArrayList<List>();
for (List<HasWord> sentence : strarray) {
if (!StringUtils.isAsciiPrintable(sentence.toString())) {
continue; // Removing non ASCII printable sentences
}
//string = StringEscapeUtils.escapeJava(string);
//string = string.replaceAll("([^A-Za-z0-9])", "\\s$1");
int nonwords_chars = 0;
int words_chars = 0;
for (HasWord hasword : sentence ) {
String next = hasword.toString();
if ((next.length() > 30)||(next.matches("[^A-Za-z]+"))) nonwords_chars += next.length(); // Words that are too long or entirely non-alphabetical count as junk
else words_chars += next.length();
}
if ( ((double) nonwords_chars / (nonwords_chars+words_chars)) > 0.5) // If more than 50% of the characters are non-alphabetical, the sentence is junk
continue; // Working on a character-basis because some sentences may contain a single, very long word
if (sentence.size() > 100) {
System.out.println("\tString longer than 100 words!\t" + sentence.toString());
continue;
}
Result.add(sentence);
}
return Result;
}
}
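As a postscript on the wording of step 3 (and step 7): below is the kind of strictly left-to-right, breadth-first traversal I believe the text is asking for. It is only a sketch under my reading of "branches below X to the left of path p": like my step-3 loop, it never descends into the child of X that dominates the pronoun, but it visits the remaining branches in breadth-first order rather than in the depth-first order of the Tree iterator used in the listing. collectLeftOfPath is a hypothetical helper name, not part of the Stanford API.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import edu.stanford.nlp.trees.Tree;
// Breadth-first, left-to-right traversal of everything below X that lies to the left of
// the child dominating the pronoun (i.e. to the left of path p). The caller still applies
// the step-3 / step-7 conditions to decide which of the returned NP nodes to propose.
static List<Tree> collectLeftOfPath(Tree X, Tree pronoun) {
    List<Tree> visited = new ArrayList<Tree>();
    Deque<Tree> queue = new ArrayDeque<Tree>();
    for (Tree child : X.children()) {
        if (child.contains(pronoun)) break; // this child and everything after it is on or to the right of p
        queue.addLast(child);
    }
    while (!queue.isEmpty()) {
        Tree node = queue.removeFirst();
        visited.add(node);
        for (Tree grandchild : node.children())
            queue.addLast(grandchild);
    }
    return visited;
}
Whether "to the left of p" should also include branches further down, inside the child that dominates the pronoun, is exactly the kind of ambiguity I am unsure about.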