I'm currently using Spark to write data into Neo4j. When a write starts, it creates 200 threads that write into Neo4j simultaneously. Is there a way to limit how many threads are created and used at once, or is the only option to decrease the cluster and instance size?
I know this kind of goes against what Spark is meant to do, but I would love to get any feedback.
I have tried
spark.conf.set("spark.executor.cores", 4)
with no luck.
edges.write.format("org.neo4j.spark.DataSource") \
    .option("url", "neo4j://url:7687") \
    .mode("overwrite") \
    .option("relationship", "connected") \
    .option("batch.size", 1000) \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.node.keys", "id:id") \
    .option("relationship.target.node.keys", "id:id") \
    .option("relationship.source.labels", "node") \
    .option("relationship.target.labels", "node") \
    .save()
Comments:
spark.executor.cores defines the number of threads (concurrent tasks) per executor. So if you set it to 1, you reduce the number of writers to one per executor. This change is equivalent to reducing the cluster size, though, so it doesn't make much sense in this scenario.
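As a rough illustration of the arithmetic behind this comment: the number of tasks that can write at the same time is bounded both by the number of partitions in the DataFrame and by the total number of task slots in the cluster. The helper below is a hypothetical sketch, not a Spark API, and the example numbers are assumptions rather than values from the question.

```python
def max_concurrent_writers(num_partitions, num_executors, cores_per_executor, task_cpus=1):
    """Upper bound on tasks (and so Neo4j writers) running at the same time.

    Hypothetical helper for illustration only; it is not part of Spark.
    task_cpus mirrors spark.task.cpus (cores reserved per task, default 1).
    """
    slots_per_executor = cores_per_executor // task_cpus
    total_slots = num_executors * slots_per_executor
    # A stage never runs more tasks than it has partitions.
    return min(num_partitions, total_slots)


# Assumed example: 200 partitions on 50 executors with 4 cores each.
print(max_concurrent_writers(200, 50, 4))  # 200
# With only 8 partitions, the writers are capped at 8 whatever the cluster size.
print(max_concurrent_writers(8, 50, 4))    # 8
```

Under these assumptions, lowering the partition count limits the writers without touching the cluster, whereas lowering the cores per executor shrinks the usable cluster for every stage.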
Answers:
Try to repartition your dataframe before the write:
edges.repartition(parallelism)
...
.option("url", "neo4j://url:7687") \
.mode("overwrite")\
...
Where parallelism is the number of tasks that will be writing concurrently.
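For reference, here is a minimal sketch of how the repartition call might be combined with the write options from the question. The function name write_edges and its url parameter are hypothetical, and edges is assumed to be a PySpark DataFrame shaped as in the question.

```python
def write_edges(edges, parallelism, url="neo4j://url:7687"):
    """Write `edges` to Neo4j using at most `parallelism` concurrent tasks.

    Hypothetical helper: `edges` is assumed to be a PySpark DataFrame and the
    options are copied from the question, not checked against a live database.
    """
    (
        edges.repartition(parallelism)  # one write task per resulting partition
        .write.format("org.neo4j.spark.DataSource")
        .option("url", url)
        .mode("overwrite")
        .option("relationship", "connected")
        .option("batch.size", 1000)
        .option("relationship.save.strategy", "keys")
        .option("relationship.source.node.keys", "id:id")
        .option("relationship.target.node.keys", "id:id")
        .option("relationship.source.labels", "node")
        .option("relationship.target.labels", "node")
        .save()
    )


# Example call (needs a running Spark session and a reachable Neo4j instance):
# write_edges(edges, parallelism=4)
```

Note that repartition triggers a shuffle, so the data is redistributed across the cluster before the write starts.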
The common solution is to coalesce (which is similar to repartition but much more efficient, as it does not require a shuffle). Something like:
edges.coalesce(4).write...
One problem with this solution is that it assumes that a quarter of the data (edges/4) fits in the memory of an executor. If it does, that works great; if not, I don't think there is a way to limit the number of writers except by reducing the cluster.
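Since coalesce can only reduce the number of partitions, a small guard around it may be useful. The sketch below is a hypothetical helper (the name limit_partitions is not part of Spark) and assumes df is a PySpark DataFrame.

```python
def limit_partitions(df, max_partitions):
    """Return `df` with at most `max_partitions` partitions, without a shuffle.

    Hypothetical helper: coalesce is applied only when the DataFrame
    currently has more partitions than requested.
    """
    current = df.rdd.getNumPartitions()
    if current <= max_partitions:
        return df  # already within the limit; nothing to do
    return df.coalesce(max_partitions)


# Example call (needs a running Spark session):
# limit_partitions(edges, 4).write.format("org.neo4j.spark.DataSource")...
```

The returned DataFrame would then be written with the same options as in the question.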
Comments:
Repartition worked for me. I now only see the specified number of connections created to Neo4j. Thank you.