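I am trying to set up a Twitter stream using the Apache Spark Java API. When I save the Twitter stream to Elasticsearch, I get an exception. I think the problem is that I am trying to save the raw tweets. Please let me know what I can try to resolve this exception.
Here is the code: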
package com.twitter.streaming;

import com.twitter.util.TwitterStreamUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import twitter4j.Status;

/**
 * Created by Manali on 1/28/2017.
 */
public class TwitterStream {
    // Note: declared but never passed to createStream(), so the stream is unfiltered.
    private static final String[] filters = {"#football"};

    public static void main(String[] args) throws InterruptedException {
        // create the Spark configuration
        System.setProperty("hadoop.home.dir", "C:\\winutil\\");
        SparkConf conf = new SparkConf().setAppName("SparkTwitterStreamExample").setMaster("local[2]")
                .set("spark.serializer", KryoSerializer.class.getName())
                .set("es.nodes", "localhost:9200")
                .set("es.index.auto.create", "true");
        // create a Java streaming context with a 15-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));
        System.out.println("Initializing Twitter stream...");
        // create a DStream (sequence of RDDs). twitterStream is a DStream of tweet statuses:
        // - the Status class contains all information of a tweet
        // See http://twitter4j.org/javadoc/twitter4j/Status.html
        // and fill in the keys and tokens in the TwitterStreamUtils class!
        JavaDStream<Status> twitterStream = TwitterUtils.createStream(jssc, TwitterStreamUtils.getAuth());
        // Status.toString() returns a debug dump (StatusJSONImpl{...}), not JSON
        JavaDStream<String> statuses = twitterStream.map(
                new Function<Status, String>() {
                    public String call(Status status) { return status.toString(); }
                }
        );
        statuses.print();
        statuses.foreachRDD(tweets -> {
            // save tweet to Elasticsearch
            JavaEsSpark.saveJsonToEs(tweets, "spark/tweets");
            return null;
        });
        jssc.start();
        jssc.awaitTermination();
    }
}
Stack trace:
-------------------------------------------
Time: 1486397175000 ms
-------------------------------------------
StatusJSONImpl{createdAt=Mon Feb 06 10:06:11 CST 2017, id=828635913144016896, text='夢王國超強大的XDDD
托托大愛( ´▽` )ノ
發棉花糖的執事超高超帥wwwww
#夢100 #CWT45', rel="nofollow">Twitter for Android</a>', isTruncated=false, inReplyToStatusId=-1, inReplyToUserId=-1, isFavorited=false, isRetweeted=false, favoriteCount=0, inReplyToScreenName='null', geoLocation=null, place=null, retweetCount=0, isPossiblySensitive=false, lang='ja', contributorsIDs=[], retweetedStatus=null, userMentionEntities=[], urlEntities=[], hashtagEntities=[HashtagEntityJSONImpl{text='夢100'}, HashtagEntityJSONImpl{text='CWT45'}], mediaEntities=[MediaEntityJSONImpl{id=828635824715505665, symbolEntities=[], currentUserRetweetId=-1, user=UserJSONImpl{id=4298859732, name='草加美燕', screenName='mU7oEb6DVbCda4S', location='臺灣 新北市中和', description='17歲的高
17/02/06 10:06:16 INFO BlockGenerator: Pushed block input-0-1486397175800
17/02/06 10:06:16 ERROR TaskContextImpl: Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Invalid UTF-8 start byte 0x89
at [Source: [B@25c68cc; line: 1, column: 3]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:478)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:426)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:153)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:225)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:248)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:130)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:42)
at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:68)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/02/06 10:06:16 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 6)
org.apache.spark.util.TaskCompletionListenerException: Invalid UTF-8 start byte 0x89
at [Source: [B@25c68cc; line: 1, column: 3]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/02/06 10:06:16 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, NODE_LOCAL, 1943 bytes)
17/02/06 10:06:16 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, localhost): org.apache.spark.util.TaskCompletionListenerException: Invalid UTF-8 start byte 0x89
at [Source: [B@25c68cc; line: 1, column: 3]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/02/06 10:06:16 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
17/02/06 10:06:16 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
17/02/06 10:06:16 INFO BlockManager: Found block input-0-1486397172800 locally
17/02/06 10:06:16 INFO TaskSchedulerImpl: Cancelling stage 3
17/02/06 10:06:16 INFO Executor: Executor is trying to kill task 1.0 in stage 3.0 (TID 7)
17/02/06 10:06:16 INFO TaskSchedulerImpl: Stage 3 was cancelled
17/02/06 10:06:16 INFO DAGScheduler: ResultStage 3 (foreachRDD at TwitterStream.java:47) failed in 0.589 s
17/02/06 10:06:16 INFO DAGScheduler: Job 3 failed: foreachRDD at TwitterStream.java:47, took 0.608443 s
17/02/06 10:06:16 INFO JobScheduler: Finished job streaming job 1486397175000 ms.1 from job set of time 1486397175000 ms
17/02/06 10:06:16 INFO JobScheduler: Total delay: 1.086 s for time 1486397175000 ms (execution: 1.001 s)
17/02/06 10:06:16 ERROR JobScheduler: Error running job streaming job 1486397175000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 6, localhost): org.apache.spark.util.TaskCompletionListenerException: Invalid UTF-8 start byte 0x89
at [Source: [B@25c68cc; line: 1, column: 3]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
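Best answer
There was a problem parsing the tweet value. saveJsonToEs expects every element of the RDD to already be a JSON document, but Status.toString() produces the StatusJSONImpl{...} dump seen in the output above, which is not JSON. I used Jackson's ObjectMapper instead; below is working code that saves a Twitter stream to Elasticsearch with Apache Spark.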
package com.twitter.streaming;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.twitter.util.TwitterStreamUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import twitter4j.Status;

/**
 * Created by Manali on 1/28/2017.
 */
public class TwitterStream {
    // Note: declared but never passed to createStream(), so the stream is unfiltered.
    private static final String[] filters = {"#trumph", "#happy"};

    public static void main(String[] args) throws InterruptedException {
        // create the Spark configuration
        System.setProperty("hadoop.home.dir", "C:\\winutil\\");
        SparkConf conf = new SparkConf().setAppName("SparkTwitterStreamExample").setMaster("local[2]")
                .set("spark.serializer", KryoSerializer.class.getName())
                .set("es.nodes", "localhost:9200")
                .set("es.index.auto.create", "true");
        // create a Java streaming context with a 15-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));
        System.out.println("Initializing Twitter stream...");
        // create a DStream (sequence of RDDs). twitterStream is a DStream of tweet statuses:
        // - the Status class contains all information of a tweet
        // See http://twitter4j.org/javadoc/twitter4j/Status.html
        // and fill in the keys and tokens in the TwitterStreamUtils class!
        JavaDStream<Status> twitterStream = TwitterUtils.createStream(jssc, TwitterStreamUtils.getAuth());
        /* JavaDStream<String> statuses = twitterStream.map(
                new Function<Status, String>() {
                    public String call(Status status) {
                        return status.toString();
                    }
                }
        );*/
        //statuses.print();
        // Jackson ObjectMapper, used to serialize each Status to a JSON string
        ObjectMapper mapper = new ObjectMapper();
        // serialize and save the Twitter stream to Elasticsearch
        twitterStream//.map(t -> new Tweet(t.getUser().getName(), t.getText()))
                .map(t -> mapper.writeValueAsString(t))
                .foreachRDD(tweets -> {
                    JavaEsSpark.saveJsonToEs(tweets, "spark/tweets");
                    return null;
                });
        jssc.start();
        jssc.awaitTermination();
    }
}
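To make the difference concrete, here is a minimal, dependency-free sketch of what "one JSON document per tweet" could look like if you only wanted the two fields from the commented-out Tweet mapping (user name and text). Everything in it is an assumption for illustration: TweetJsonSketch, escape, toJson and looksLikeJsonObject are hypothetical names that do not exist in Spark, twitter4j or elasticsearch-hadoop, and real code would normally let Jackson's ObjectMapper do the serialization as in the code above.

```java
// Hypothetical helper, not part of the original answer: builds one JSON
// document per tweet from just the user name and the tweet text.
final class TweetJsonSketch {

    // Escape a string so it can sit inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // remaining control characters need a hex escape
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // non-ASCII text (CJK, emoji) is legal in JSON as-is
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Assumed document shape: {"user": ..., "text": ...}
    static String toJson(String user, String text) {
        return "{\"user\":\"" + escape(user) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // Cheap sanity check only: a JSON object starts with '{' and ends with '}'.
    static boolean looksLikeJsonObject(String s) {
        String t = s.trim();
        return t.startsWith("{") && t.endsWith("}");
    }

    public static void main(String[] args) {
        // The toString() dump from the question starts with "StatusJSONImpl",
        // so it is not a JSON object: prints false
        System.out.println(looksLikeJsonObject("StatusJSONImpl{createdAt=Mon Feb 06 10:06:11 CST 2017}"));
        // A multi-line tweet becomes a single-line JSON document
        System.out.println(toJson("草加美燕", "line one\nline two"));
    }
}
```

In the streaming job this would be plugged in as .map(t -> TweetJsonSketch.toJson(t.getUser().getName(), t.getText())) before foreachRDD. Hand-rolled escaping is only reasonable for a couple of flat fields; for the full Status object, ObjectMapper remains the simpler choice.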
Regarding "java - Apache Spark Java API + Twitter4j: exception when saving a Twitter stream to Elasticsearch", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/42073415/