java - Parsing Stackoverflow's posts.xml on hadoop

I am following this article by Anoop Madhusudanan on codeproject to build a recommendation engine, not on a cluster but on my own machine.

The problem comes when I try to parse posts.xml, whose rows are structured like this:

 <row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />

Now I need to parse this file (1.4 GB in size) on hadoop. I have written the code in java and created its jar. The java class is as follows:

import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;

import java.io.File;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;


public class Recommend {

    static class Map extends Mapper<Text, Text, Text, Text> {
        Path path;
        String fXmlFile;
        DocumentBuilderFactory dbFactory;
        DocumentBuilder dBuilder;
        Document doc;

        /**
         * Parse the xml file named by the input value and emit
         * OwnerUserId as the key and "OwnerUserId ParentId" as the value.
         */
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            try {
                fXmlFile = value.toString();
                dbFactory = DocumentBuilderFactory.newInstance();
                dBuilder = dbFactory.newDocumentBuilder();
                doc = dBuilder.parse(fXmlFile);

                doc.getDocumentElement().normalize();
                NodeList nList = doc.getElementsByTagName("row");

                for (int temp = 0; temp < nList.getLength(); temp++) {
                    Node nNode = nList.item(temp);
                    Element eElement = (Element) nNode;

                    Text keyWords = new Text(eElement.getAttribute("OwnerUserId"));
                    Text valueWords = new Text(eElement.getAttribute("ParentId"));
                    String val = keyWords.toString() + " " + valueWords.toString();
                    // Write the pair
                    if (keyWords != null && valueWords != null) {
                        output.collect(keyWords, new Text(val));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * @throws IOException
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        /*if (args.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }*/
        // FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "Recommend");
        job.setJarByClass(Recommend.class);

        // output types for the job
        job.setOutputKeyClass(Text.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(Map.class);
        //conf.setReducerClass(Reduce.class);

        // Remove any previous output so the job does not fail on an existing directory
        Path outPath = new Path(args[1]);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

I want the output to be a file in hadoop containing OwnerUserId ParentId pairs, but instead I get output like:

1599788   <row Id="2292" PostTypeId="2" ParentId="2284" CreationDate="2008-08-05T13:28:06.700" Score="0" ViewCount="0" Body="&lt;p&gt;The first thing you should do is contact the main people who run the open source project. Ask them if it is ok to contribute to the code and go from there.&lt;/p&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;Simply writing your improved code and then giving it to them may result in your code being rejected.&lt;/p&gt;" OwnerUserId="383" LastActivityDate="2008-08-05T13:28:06.700" />

I have no idea where 1599788 comes from as the key emitted by the mapper.

I don't really understand how to write a mapper class for hadoop, and I need help modifying my code to get the desired output.

Thanks in advance.

Best Answer

After a lot of research and experimentation, I finally learned how to write a mapper for parsing xml files with the structure I posted. I also found where the stray key came from: my map method did not match the new-API Mapper.map(LongWritable, Text, Context) signature, so it never overrode anything; the default identity mapper ran instead and emitted each line's byte offset (e.g. 1599788) as the key and the raw line as the value. I changed my approach; here is my new mapper code, and it works for my use case.

Hope it helps someone and saves them some time :)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, NullWritable, Text> {
    // Use the NullWritable singleton; a null key would fail if the output
    // ever went through the shuffle.
    NullWritable obj = NullWritable.get();

    @Override
    public void map(LongWritable key, Text value, Context context) throws InterruptedException {
        // Each input value is one <row .../> line; tokenize on whitespace and
        // pick the attributes out of the key="value" tokens.
        StringTokenizer tok = new StringTokenizer(value.toString());
        String pa = null, ow = null, pi = null, v;
        while (tok.hasMoreTokens()) {
            String[] arr;
            String val = tok.nextToken();
            if (val.contains("PostTypeId")) {
                // The attribute value is the last piece after splitting on quotes
                arr = val.split("[\"]");
                pi = arr[arr.length - 1];
                if (pi.equals("2")) {
                    continue;   // an answer: keep scanning this row
                } else {
                    break;      // not an answer: skip the rest of the row
                }
            }
            if (val.contains("ParentId")) {
                arr = val.split("[\"]");
                pa = arr[arr.length - 1];
            } else if (val.contains("OwnerUserId")) {
                arr = val.split("[\"]");
                ow = arr[arr.length - 1];
                try {
                    if (pa != null && ow != null) {
                        // Java format specifiers are %s; the "{0},{1}" style
                        // would be written out literally.
                        v = String.format("%s,%s", ow, pa);
                        context.write(obj, new Text(v));
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
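
For completeness, here is a minimal driver sketch for wiring this mapper into a job. The RecommendDriver class name and the argument handling are placeholders for illustration; the sketch assumes the default TextInputFormat, which hands the mapper one line at a time with the line's byte offset as the key, which is exactly what this mapper expects. Since no reducer is needed, the reduce phase is disabled:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecommendDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Recommend");
        job.setJarByClass(RecommendDriver.class);

        job.setMapperClass(Map.class);
        // Map-only job: mapper output is written directly as the final output
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // posts.xml on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reduce tasks the map output goes straight to part-m-* files in the output directory, so each line of the result is an OwnerUserId,ParentId pair.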

For more on java - Parsing Stackoverflow's posts.xml on hadoop, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/19445528/
