
python - Python architecture question


I'm creating a distributed crawling Python application. It consists of a master server and associated client apps that will run on client servers. The purpose of the client app is to run across a targeted site and extract specific data. The clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site.

Each client app looks something like:

main:
    parse the initial url
    call level1(data1)

function level1(data1):
    parse the url for data1
    use the required XPath to get the DOM elements
    call the next function: level2(data2)

function level2(data2):
    parse the url for data2
    use the required XPath to get the DOM elements
    call the next function: level3(data3)

function level3(data3):
    parse the url for data3
    use the required XPath to get the DOM elements
    call the next function: level4(data4)

function level4(data4):
    parse the url for data4
    use the required XPath to get the DOM elements

at the final function:
    -- all the data is output, and eventually returned to the server
    -- at this point the data has elements from each function
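To make the shape of this concrete, here is a minimal sketch of how such a chain might look in Python, assuming requests and lxml are used for fetching and XPath; the URLs, XPath expressions, and field names below are placeholders, not taken from any real site:

import requests
from lxml import html

def fetch(url):
    # Fetch a page and parse it into an lxml DOM tree.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return html.fromstring(response.content)

def level1(url, data):
    tree = fetch(url)
    # Hypothetical XPath: collect the links that lead to the next level of forms.
    data["level1"] = tree.xpath("//a[@class='category']/@href")
    for next_url in data["level1"]:
        level2(next_url, data)
    return data

def level2(url, data):
    tree = fetch(url)
    data.setdefault("level2", []).extend(tree.xpath("//div[@id='detail']//text()"))
    # ... level3 and level4 would follow the same pattern ...
    return data

if __name__ == "__main__":
    result = level1("http://example.com/start", {})
    print(result)  # at the end, result carries elements gathered by every level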

My question: given that the number of calls made from the current function into its child functions varies, I'm trying to figure out the best approach.

Each function essentially fetches a page of content, and then parses the page using a number of different XPath expressions, combined with different regex expressions depending on the site/page.
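As an illustration of mixing XPath with regexes, something along these lines is typical; the expression and the pattern below are invented for the example:

import re
from lxml import html

PRICE_RE = re.compile(r"\$([\d,]+\.\d{2})")  # hypothetical pattern, for illustration only

def extract_prices(page_content):
    # XPath narrows the search to the relevant nodes; the regex pulls the value out of the text.
    tree = html.fromstring(page_content)
    prices = []
    for cell in tree.xpath("//td[@class='price']/text()"):
        match = PRICE_RE.search(cell)
        if match:
            prices.append(match.group(1))
    return prices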

If I run a client on a single box as a sequential process, it'll take a while, but the load on the box is rather small. I've thought of attempting to implement the child functions as threads from the current function, but that could be a nightmare, as well as quickly bring the "box" to its knees!

I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way that allows each client/function to be run directly from the master. This process requires a bit of a rewrite, but it has a number of advantages: a bunch of redundancy, and speed. It would detect if a section of the process was crashing and restart from that point. But I'm not sure if it would be any faster...
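One way to shape that master-to-client hand-off is to make each unit of work a self-describing packet (stage name, URL, and the data gathered so far), so any client box can pick up any stage and the master can re-queue a packet when a stage crashes. A stdlib-only sketch of the master side, with the actual transport (sockets or a message queue) left out as an assumption:

import json
import queue

work_queue = queue.Queue()  # stand-in for whatever transport carries packets to client boxes

def make_packet(stage, url, data):
    # Everything a client needs to run one stage, serialized so it can cross the wire.
    return json.dumps({"stage": stage, "url": url, "data": data})

def dispatch(initial_urls):
    for url in initial_urls:
        work_queue.put(make_packet("level1", url, {}))

def handle_result(packet_json):
    # When a client reports back, queue the next stage or keep the finished result.
    packet = json.loads(packet_json)
    next_stage = {"level1": "level2", "level2": "level3", "level3": "level4"}.get(packet["stage"])
    if next_stage:
        work_queue.put(make_packet(next_stage, packet["url"], packet["data"]))
    else:
        print("finished:", packet["data"])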

I'm writing the parsing scripts in Python.

So... any thoughts/comments would be greatly appreciated...

I can get into a lot more detail, but didn't want to bore anyone!!

Thanks!

Tom

Best Answer

This sounds like a use case for MapReduce on Hadoop.

Hadoop Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. In your case, it would be a smaller cluster.

A Map/Reduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner.
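One concrete way to plug Python into this model (not something the answer spells out, just an illustration) is Hadoop Streaming, where the map task is an ordinary script that reads one URL per line from stdin and writes tab-separated key/value pairs to stdout; the parse_site call below stands in for whatever a level function actually does:

#!/usr/bin/env python
# A minimal Hadoop Streaming style mapper: one input URL per line,
# one "url<TAB>json-result" pair per output line.
import json
import sys

def parse_site(url):
    # Placeholder for the real fetch/XPath/regex work done per URL.
    return {"url": url, "fields": {}}

for line in sys.stdin:
    url = line.strip()
    if url:
        print("%s\t%s" % (url, json.dumps(parse_site(url))))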

You mentioned,

I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way that allows each client/function to be run directly from the master.

As I understand it, you want one machine (box) to act as the master and have client boxes run the other functions. For example, you could run the main() function on it and parse the initial URLs there. The nice part is that you can parallelize the work for each URL across different machines, since the URLs appear to be independent of one another.

Since level4 depends on level3, which depends on level2, and so on, you can pipe the output of each one into the next rather than calling one from inside the other.
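In code, "piping one level into the next" can be as simple as treating the levels as a list of stages that each take the accumulated data and return it enriched, which also makes it easy to restart from whichever stage failed; a small sketch, assuming each level function takes and returns a dict:

def run_pipeline(url, stages):
    # Each stage receives the data gathered so far and returns it enriched;
    # if a stage fails, you can re-run from that stage with the saved data.
    data = {"url": url}
    for stage in stages:
        data = stage(data)
    return data

# Usage, assuming level1..level4 each have the signature level(data) -> data:
# result = run_pipeline("http://example.com/start", [level1, level2, level3, level4])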

For examples of how to do this, I'd suggest working through the following tutorials in the given order.

Hope this helps.

Regarding "python - Python architecture question", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2670323/
