
erlang - Is it possible to develop a powerful web search engine using Erlang, Mnesia, and Yaws?

Reposted. Author: 行者123. Updated: 2023-12-02 07:06:13

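I am thinking of developing a web search engine using Erlang, Mnesia, and Yaws. Is it possible to build a powerful, very fast web search engine with this software? What would it take to accomplish this, and how do I get started?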

Best answer

Erlang can make the most powerful web crawler today. Let me take you through my simple crawler.

Step 1. I create a simple parallelism module, which I call mapreduce

-module(mapreduce).
-export([compute/2]).
%%=====================================================================
%% usage example
%% Module = string
%% Function = tokens
%% List_of_arg_lists = [["file\r\nfile","\r\n"],["muzaaya_joshua","_"]]
%% Ans = [["file","file"],["muzaaya","joshua"]]
%% Job being done by two processes
%% i.e no. of processes spawned = length(List_of_arg_lists)
%% Note: results are gathered as they arrive, so the order of Ans
%% is not guaranteed to match the order of List_of_arg_lists.

compute({Module,Function},List_of_arg_lists)->
    S = self(),
    Ref = erlang:make_ref(),
    PJob = fun(Arg_list) -> erlang:apply(Module,Function,Arg_list) end,
    Spawn_job = fun(Arg_list) ->
                    spawn(fun() -> execute(S,Ref,PJob,Arg_list) end)
                end,
    lists:foreach(Spawn_job,List_of_arg_lists),
    gather(length(List_of_arg_lists),Ref,[]).

%% Collect one reply per spawned job; failed jobs are dropped.
gather(0, _, L) -> L;
gather(N, Ref, L) ->
    receive
        {Ref,{'EXIT',_}} -> gather(N-1,Ref,L);
        {Ref, Result} -> gather(N-1, Ref, [Result|L])
    end.

execute(Parent,Ref,Fun,Arg)->
    Parent ! {Ref,(catch Fun(Arg))}.
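
Because the module above collects replies in arrival order, a caller that needs results lined up with their inputs has to do a little more work. The following is a minimal sketch of one way to do that, not the answer's own code: the module name pmap_sketch and the functions pmap/2 and demo/0 are hypothetical, and each job gets its own reference so replies can be received in input order.

```erlang
-module(pmap_sketch).
-export([pmap/2, demo/0]).

%% Run {Module, Function} on every argument list in its own process.
%% Each job gets its own reference, and replies are received in the
%% order the jobs were started, so results line up with the inputs.
%% A failing call yields {error, {Class, Reason}} in its slot.
pmap({Module, Function}, ArgLists) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() ->
                          Result = try
                                       {ok, apply(Module, Function, Args)}
                                   catch
                                       Class:Reason -> {error, {Class, Reason}}
                                   end,
                          Parent ! {Ref, Result}
                      end),
                Ref
            end || Args <- ArgLists],
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Same inputs as the usage example in the mapreduce module.
demo() ->
    pmap({string, tokens},
         [["file\r\nfile", "\r\n"], ["muzaaya_joshua", "_"]]).
```

With this variant, pmap_sketch:demo() wraps each result in an ok tuple, which makes a failed job visible instead of silently dropping it.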

Step 2. The HTTP client

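One would normally use either the inets httpc module built into Erlang or <a href="https://github.com/cmullaparthi/ibrowse" rel="noreferrer noopener nofollow">ibrowse</a>. However, for memory management and speed (keeping the memory footprint as low as possible), a good Erlang programmer would choose to use <a href="http://curl.haxx.se/docs/manual.html" rel="noreferrer noopener nofollow">curl</a>. By applying <a href="http://www.erlang.org/doc/man/os.html#cmd-1" rel="noreferrer noopener nofollow">os:cmd/1</a>, which takes the curl command line, one gets the output directly in the calling Erlang function. Still, it is better to make curl write its output to files and have another thread (process) in our application read and parse those files.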

Command: curl "http://www.erlang.org" -o "/downloaded_sites/erlang/file1.html"
In Erlang
os:cmd("curl \"http://www.erlang.org\" -o \"/downloaded_sites/erlang/file1.html\"").
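So you can spawn many processes. Remember to escape the URL as well as the output file path when you execute that command. On the other hand, there is a process whose job is to watch the directory of downloaded pages. It reads and parses these pages, and may then delete them after parsing, save them in a different location, or, even better, archive them using the zip module.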
folder_check()->
    spawn(fun() -> check_and_report() end),
    ok.

-define(CHECK_INTERVAL,5).

check_and_report()->
    %% avoid using
    %% file:list_dir/1
    %% if files are many, memory !!!
    case os:cmd("ls | wc -l") of
        "0\n" -> ok;
        "0" -> ok;
        _ -> ?MODULE:new_files_found()
    end,
    %% timer:sleep/1 takes milliseconds
    timer:sleep(timer:seconds(?CHECK_INTERVAL)),
    %% keep checking
    check_and_report().

new_files_found()->
    %% inform our parser to pick files
    %% once it parses a file, it has to
    %% delete it or save it some
    %% where else
    gen_server:cast(?MODULE,files_detected).
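
The advice about escaping the URL and the output file path before handing them to os:cmd/1 can be sketched as follows. This is a minimal sketch, assuming a POSIX shell runs the command; the module name fetch_sketch and the functions shell_quote/1 and curl_command/2 are hypothetical, and -s is curl's silent flag, added here so the progress meter does not end up in the command output.

```erlang
-module(fetch_sketch).
-export([shell_quote/1, curl_command/2]).

%% Wrap a string in single quotes for a POSIX shell. An embedded single
%% quote is closed, escaped, and reopened: ' becomes '\''.
shell_quote(Str) ->
    "'" ++ lists:flatmap(fun($') -> "'\\''";
                            (C)  -> [C]
                         end, Str) ++ "'".

%% Build a curl command line that downloads Url into OutFile.
%% Both arguments are quoted so shell metacharacters in them are inert.
curl_command(Url, OutFile) ->
    "curl -s " ++ shell_quote(Url) ++ " -o " ++ shell_quote(OutFile).
```

A call such as os:cmd(fetch_sketch:curl_command("http://www.erlang.org", "/downloaded_sites/erlang/file1.html")) would then run the same download as the command shown earlier, with both arguments quoted.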

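Step 3. The HTML parser
It is best to use <a href="http://ppolv.wordpress.com/2008/05/09/fun-with-mochiwebs-html-parser-and-xpath/" rel="noreferrer noopener nofollow">mochiweb's html parser and XPATH</a>. This will help you parse the page, get all your favorite HTML tags, and extract their contents, and then you are good to go. In the examples below, I focused only on the keywords, description, and title in the markup.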


Module testing in the shell... awesome results!

2> spider_bot:parse_url("http://erlang.org").
[[[],[],
  {"keywords",
   "erlang, functional, programming, fault-tolerant, distributed, multi-platform, portable, software, multi-core, smp, concurrency "},
  {"description","open-source erlang official website"}],
 {title,"erlang programming language, official website"}]

3> spider_bot:parse_url("http://facebook.com").
[[{"description",
   " facebook is a social utility that connects people with friends and others who work, study and live around them. people use facebook to keep up with friends, upload an unlimited number of photos, post links and videos, and learn more about the people they meet."},
  {"robots","noodp,noydir"},
    [],[],[],[]],
 {title,"incompatible browser | facebook"}]

4> spider_bot:parse_url("http://python.org").
[[{"description",
   "      home page for python, an interpreted, interactive, object-oriented, extensible\n      programming language. it provides an extraordinary combination of clarity and\n      versatility, and is free andcomprehensively ported."},
  {"keywords",
   "python programming language object oriented web free source"},
  []],
 {title,"python programming language – official website"}]

5> spider_bot:parse_url("http://www.house.gov/").
[[[],[],[],
  {"description",
   "home page of the united states house of representatives"},
  {"description",
   "home page of the united states house of representatives"},
  [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
  [],[],[]|...],
 {title,"united states house of representatives, 111th congress, 2nd session"}]
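
The spider_bot module itself is not shown in the answer. As a rough sketch of how such an extraction could work, the code below walks an already parsed HTML tree and collects named meta tags and the title. It assumes the tree has the shape {Tag, Attrs, Children} with binary tags, {Name, Value} binary attribute pairs, and binaries for text nodes, which is my understanding of what mochiweb_html:parse/1 returns; the module name meta_sketch and its functions are hypothetical, and demo/0 uses a hand-written tree so the sketch runs without mochiweb.

```erlang
-module(meta_sketch).
-export([summary/1, demo/0]).

%% Return {NamedMetaTags, Title} for a parsed HTML tree.
summary(Tree) ->
    {metas(Tree), title(Tree)}.

%% Collect {Name, Content} for every <meta> that has both attributes.
metas({<<"meta">>, Attrs, _Children}) ->
    Name = proplists:get_value(<<"name">>, Attrs),
    Content = proplists:get_value(<<"content">>, Attrs),
    case is_binary(Name) andalso is_binary(Content) of
        true  -> [{Name, Content}];
        false -> []
    end;
metas({_Tag, _Attrs, Children}) ->
    lists:flatmap(fun metas/1, Children);
metas(_TextOrOther) ->
    [].

%% Return the text of the first <title> element, or undefined.
title({<<"title">>, _Attrs, Children}) ->
    iolist_to_binary([C || C <- Children, is_binary(C)]);
title({_Tag, _Attrs, Children}) ->
    first_title(Children);
title(_TextOrOther) ->
    undefined.

first_title([]) -> undefined;
first_title([Node | Rest]) ->
    case title(Node) of
        undefined -> first_title(Rest);
        Found -> Found
    end.

%% Hand-written tree standing in for parser output.
demo() ->
    Tree = {<<"html">>, [],
            [{<<"head">>, [],
              [{<<"title">>, [], [<<"Erlang Programming Language">>]},
               {<<"meta">>, [{<<"name">>, <<"keywords">>},
                             {<<"content">>, <<"erlang, functional">>}], []},
               {<<"meta">>, [{<<"charset">>, <<"utf-8">>}], []}]},
             {<<"body">>, [], [<<"hello">>]}]},
    summary(Tree).
```

If the assumption about the parser's output holds, meta_sketch:summary(mochiweb_html:parse(Body)) would give a similar keywords, description, and title summary for a downloaded page.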


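You can now see that we can index pages against their keywords, together with a good schedule of page revisits. Another challenge was how to make a crawler (something that moves around the entire web, from domain to domain), but that one is easy. It is possible by parsing an HTML file for href attributes. Make the HTML parser extract all href attributes, and then you might need some regular expressions here and there to get the links under a given domain.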
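
The link-extraction step can be sketched roughly as follows. This is a minimal sketch rather than the answer's spider_connect module, which is not shown: the module name links_sketch and its functions are hypothetical, the regular expression only handles double-quoted href values, and uri_string:parse/1 assumes OTP 21 or later.

```erlang
-module(links_sketch).
-export([hrefs/1, same_domain/2]).

%% Pull the value of every double-quoted href attribute out of raw HTML.
hrefs(Html) ->
    Pattern = "href\\s*=\\s*\"([^\"]*)\"",
    case re:run(Html, Pattern, [global, caseless, {capture, [1], list}]) of
        {match, Matches} -> [Url || [Url] <- Matches];
        nomatch -> []
    end.

%% Keep only absolute links whose host is Domain or a subdomain of it.
same_domain(Links, Domain) ->
    [Link || Link <- Links, host_matches(Link, Domain)].

host_matches(Link, Domain) ->
    case uri_string:parse(Link) of
        #{host := Host} ->
            Host =:= Domain orelse lists:suffix("." ++ Domain, Host);
        _NoHostOrError ->
            false
    end.
```

Relative links have no host and are dropped by same_domain/2; a fuller crawler would resolve them against the page URL first.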

Running the crawler

7> spider_connect:conn2("http://erlang.org").
Links: ["http://www.erlang.org/index.html",
        "http://www.erlang.org/rss.xml",
        "http://erlang.org/index.html",
        "http://erlang.org/about.html",
        "http://erlang.org/download.html",
        "http://erlang.org/links.html",
        "http://erlang.org/faq.html",
        "http://erlang.org/eep.html",
        "http://erlang.org/starting.html",
        "http://erlang.org/doc.html",
        "http://erlang.org/examples.html",
        "http://erlang.org/user.html",
        "http://erlang.org/mirrors.html",
        "http://www.pragprog.com/titles/jaerlang/programming-erlang",
        "http://oreilly.com/catalog/9780596518189",
        "http://erlang.org/download.html",
        "http://www.erlang-factory.com/conference/ErlangUserConference2010/speakers",
        "http://erlang.org/download/otp_src_R14B.readme",
        "http://erlang.org/download.html",
        "https://www.erlang-factory.com/conference/ErlangUserConference2010/register",
        "http://www.erlang-factory.com/conference/ErlangUserConference2010/submit_talk",
        "http://www.erlang.org/workshop/2010/",
        "http://erlangcamp.com",
        "http://manning.com/logan",
        "http://erlangcamp.com",
        "http://twitter.com/erlangcamp",
        "http://www.erlang-factory.com/conference/London2010/speakers/joearmstrong/",
        "http://www.erlang-factory.com/conference/London2010/speakers/RobertVirding/",
        "http://www.erlang-factory.com/conference/London2010/speakers/MartinOdersky/",
        "http://www.erlang-factory.com/",
        "http://erlang.org/download/otp_src_R14A.readme",
        "http://erlang.org/download.html",
        "http://www.erlang-factory.com/conference/London2010",
        "http://github.com/erlang/otp",
        "http://erlang.org/download.html",
        "http://erlang.org/doc/man/erl_nif.html",
        "http://github.com/erlang/otp",
        "http://erlang.org/download.html",
        "http://www.erlang-factory.com/conference/ErlangUserConference2009",
        "http://erlang.org/doc/efficiency_guide/drivers.html",
        "http://erlang.org/download.html",
        "http://erlang.org/workshop/2009/index.html",
        "http://groups.google.com/group/erlang-programming",
        "http://www.erlang.org/eeps/eep-0010.html",
        "http://erlang.org/download/otp_src_R13B.readme",
        "http://erlang.org/download.html",
        "http://oreilly.com/catalog/9780596518189",
        "http://www.erlang-factory.com",
        "http://www.manning.com/logan",
        "http://www.erlang.se/euc/08/index.html",
        "http://erlang.org/download/otp_src_R12B-5.readme",
        "http://erlang.org/download.html",
        "http://erlang.org/workshop/2008/index.html",
        "http://www.erlang-exchange.com",
        "http://erlang.org/doc/highlights.html",
        "http://www.erlang.se/euc/07/",
        "http://www.erlang.se/workshop/2007/",
        "http://erlang.org/eep.html",
        "http://erlang.org/download/otp_src_R11B-5.readme",
        "http://pragmaticprogrammer.com/titles/jaerlang/index.html",
        "http://erlang.org/project/test_server",
        "http://erlang.org/download-stats/",
        "http://erlang.org/user.html#smtp_client-1.0",
        "http://erlang.org/user.html#xmlrpc-1.13",
        "http://erlang.org/EPLICENSE",
        "http://erlang.org/project/megaco/",
        "http://www.erlang-consulting.com/training_fs.html",
        "http://erlang.org/old_news.html"]
ok

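Storage is one of the most important concepts for a search engine. It is a big mistake to store search engine data in an RDBMS such as MySQL, Oracle, or MS SQL. Such systems are highly complex, and the applications that interface with them employ heuristic algorithms. This brings us to key-value stores, of which my two favorites are <a href="http://www.couchbase.com/" rel="noreferrer noopener nofollow">Couch Base Server</a> and <a href="http://basho.com/" rel="noreferrer noopener nofollow">Riak</a>. These are great cloud file systems. Another important parameter is caching. Caching is attained using, say, <b><a href="http://memcached.org/" rel="noreferrer noopener nofollow">Memcached</a></b>, which both of the storage systems mentioned above support. Storage systems for search engines ought to be schemaless DBMSs that focus on availability rather than consistency. Read more on search engines here: http://en.wikipedia.org/wiki/Web_search_engine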
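
To make the key-value idea concrete, here is a minimal sketch of a keyword index in which each keyword is a key and the value is the list of URLs that mention it. It is only an in-memory stand-in built on an Erlang map, not a client for Couchbase or Riak; the module name index_sketch and its functions are hypothetical, and splitting keywords on commas and spaces is an assumption about how the meta content is tokenized.

```erlang
-module(index_sketch).
-export([new/0, keywords/1, add_page/3, lookup/2]).

%% An empty index: a map from keyword to a sorted list of URLs.
new() ->
    #{}.

%% Split the content of a keywords meta tag into lowercase terms.
%% Commas and spaces are both treated as separators.
keywords(Content) ->
    string:lexemes(string:lowercase(Content), ", ").

%% Record Url under every keyword, without duplicates.
add_page(Url, Keywords, Index) ->
    lists:foldl(
      fun(Keyword, Acc) ->
              maps:update_with(Keyword,
                               fun(Urls) -> lists:usort([Url | Urls]) end,
                               [Url],
                               Acc)
      end, Index, Keywords).

%% Return the URLs stored under Keyword, or [] if there are none.
lookup(Keyword, Index) ->
    maps:get(Keyword, Index, []).
```

In a real deployment the map would be replaced by reads and writes against the chosen store, with the keyword as the key.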

Regarding "erlang - Is it possible to develop a powerful web search engine using Erlang, Mnesia, and Yaws?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/195809/
