xidel - Can we use Xidel to extract data from an entire site into search files?


Reposted by: 行者123 · Updated: 2023-12-05 06:35:41


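Background: we are aggregating content from a few websites (with permission) to power a supplementary search feature in another application. One example is the news section of https://centenary.bahai.us. For this we are considering xidel, because its template-file paradigm seems like an elegant way to extract data from HTML. For example, given the template: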

<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
  {$url:=$url}
  <div class="field-field-audio">?
    <audio src="{$audio:='https://' || $host || .}"></audio>?
  </div>?
  <div class="field-field-clip-img">
    <a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
  </div>?
  <div class="field-field-pubname">{$publication}</div>?
  <div class="field-field-historical-date">{$date}</div>?
  <div class="location"><div class="adr">{$location}</div>?</div>?
  <div class="node-body">{$text}</div>
</div>?

...we can run a command such as:

xidel "https://centenary.bahai.us" -e "$(< template.html)" -f "//a[contains(@href, '/news/')]" --silent --color=never --output-format=json-wrapped > index.json

...which gives us JSON-formatted data from all the news pages on centenary.bahai.us. A sample article looks like this:

{
"title": "Bahá’ísm the Religion of Brotherhood", 
"url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood", 
"audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3", 
"image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png", 
"publication": "The New York Times", 
"date": "March 31, 1912", 
"location": "New York, NY", 
"text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},

This is beautiful, and much easier than some mashup of httrack and pup or (God forbid) sed and regex, but there are a few problems:

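We would like a separate file for each document, but this gives us one big JSON file.
Even with the --silent flag, we still get status messages in the output that make the JSON invalid, such as **** Retrieving (GET): https://centenary.bahai.us ****, **** Processing: https://centenary.bahai.us/ ****, or ** Current variable state: **
The process seems too fragile: if there is any discrepancy between the template and the actual HTML, the whole run fails and we get nothing. We would like it to report an error for just that page and then move on to the next URL.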

Xidel seems like a game-changing tool that should be able to do this job with a one-line command and a simple extraction template file; what am I missing here?

Best answer

Judging from the use of $(< template.html), I guess you are on a Linux distribution. In that case your quoting is wrong. See #9 and #10 here.

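Since you are using an extraction template file, I would say --extract-file=template.html is the parameter to use, but your -e "$(< template.html)" seems to work as well. That is new to me. Thanks.
Thanks to BeniBela's answer, I have learned that -e @template.html works too.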

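Next, your parameters are in the wrong order. I have to admit that Xidel's readme is not very clear about this.
--silent --color=never should come right after xidel, and you obviously have to "follow" a URL first before you can extract anything from it. So this should work: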

$ xidel --silent --color=never "https://centenary.bahai.us" \
  -f '//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href' \
  --extract-file=template.html \
  --output-format=json-wrapped \
  > index.json
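
The command above still writes everything to a single index.json, while the question asks for one file per document. One possible post-processing step is sketched below in Python. It assumes that index.json parses as a JSON array of objects and that each object carries a "url" key as in the sample article from the question; the function name split_index and the output directory are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def split_index(index_path, out_dir):
    """Write one JSON file per article found in a combined index file.

    Assumption: the index is a JSON array of objects, each with a "url" key.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    articles = json.loads(Path(index_path).read_text(encoding="utf-8"))
    written = []
    for article in articles:
        # Use the last path segment of the article URL as the file name.
        slug = Path(urlparse(article["url"]).path).name or "index"
        target = out / f"{slug}.json"
        target.write_text(
            json.dumps(article, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        written.append(target)
    return written
```

Calling split_index("index.json", "articles") would then produce files such as articles/bahaism-religion-brotherhood.json. Two articles that share the same final URL segment would overwrite each other, so this only fits sites where that segment is unique.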

I hardly ever use templates myself, so I would do it a bit differently and build the JSON myself:

$ xidel -s "https://centenary.bahai.us" -e '
  for $x in //div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href return
  file:write(
    substring-after($x,"/news/")||".json",
    doc($x)/{
      "title"://h1/text(),
      "url":resolve-uri($x),
      "audio"://audio/resolve-uri(@src),
      "image"://div[ends-with(@class,"clip-img")]//img/resolve-uri(@src),
      "publication"://div[ends-with(@class,"pubname")]/div/normalize-space(div[@class="field-item odd"]),
      "date"://div[ends-with(@class,"historical-date")]//span/text(),
      "location"://span[@class="locality"]/text(),
      "text":string-join(//div[@class="node-body"]//text())
    },
    {"method":"json","indent":true()}
  )
'

//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href returns the relative paths of the current news articles:

/news/visit-abdul-baha-abbas
/news/abdul-baha-prays-ascension-church
/news/bahaist-leader-here-interest-world-peace
/news/abdul-baha-abbas-coming-lewis-g-gregory

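For each news article, the URL is opened, the HTML source is parsed, and the extracted information is saved as an indented/prettified JSON file. The first one, visit-abdul-baha-abbas.json, for example: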

{
  "title": "A Visit to ‘Abdu’l-Bahá Abbas",
  "url": "https:\/\/centenary.bahai.us\/news\/visit-abdul-baha-abbas",
  "audio": "https:\/\/centenary.bahai.us\/sites\/default\/files\/257_0.mp3",
  "image": "https:\/\/centenary.bahai.us\/sites\/default\/files\/imagecache\/page-secondary-images\/images\/press_clippings\/04-17-1912%20Utica%20NY%20Press%20A%20Visit%20to%20Abdul%20Baha%20Abbas.png",
  "publication": "Utica New York Press",
  "date": "April 17, 1912",
  "location": "Acca",
  "text": "An American Girl Tells of a Memorable Experience in Her Life.[...]"
}
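
The question also asks for a run that reports an error for a single page and then carries on with the next URL. One way to approximate that, sketched here in Python rather than inside Xidel itself, is to call xidel once per article URL and collect the failures. This is a minimal sketch under several assumptions: xidel is on the PATH, it exits with a non-zero status when the template does not match a page, and the flags are the ones already used in this thread. The function name scrape_each and the injectable run parameter are hypothetical and exist only for illustration and testing.

```python
import subprocess
from pathlib import Path


def scrape_each(urls, out_dir, template="template.html", run=subprocess.run):
    """Call xidel once per URL and record failures instead of aborting.

    Assumption: xidel returns a non-zero exit status when extraction fails.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        cmd = [
            "xidel", "--silent", "--color=never", url,
            f"--extract-file={template}",
            "--output-format=json-wrapped",
        ]
        try:
            result = run(cmd, capture_output=True, encoding="utf-8", check=True)
        except (subprocess.CalledProcessError, OSError) as exc:
            # Remember the failing page and move on to the next URL.
            failed.append((url, str(exc)))
            continue
        # Name the output file after the last path segment of the URL.
        slug = url.rstrip("/").rsplit("/", 1)[-1]
        (out / f"{slug}.json").write_text(result.stdout, encoding="utf-8")
    return failed
```

The returned list of (url, error) pairs can be logged or retried later. Starting one xidel process per page is slower than a single run, which is the price paid here for isolating failures.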

Regarding "xidel - Can we use Xidel to extract data from an entire site into search files?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49616244/
