
xml - Using sed, awk, cat or grep to pipe URLs from XML into a separate file on Linux

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:57:47

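I have an XML file containing many products, as in the XML sample below.

I want to grep out all the URLs in this document and pipe them into a new document. For example, I want to get the URLs inside these tags: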

<url></url>

and pipe them into a new txt file, with each URL on its own line. The output should therefore look like a list of URLs, for example:

http://www.example.com/nav/rooms/kitchens/kitchen-worktops/gemstone_solid_surface_worktops/-specificproducttype-worktops/Cooke-and-Lewis-Gemstone-Triassic-Worktop-3050mm-13128613
http://www.example.com/nav/fix/nails-screws-fixings-hardware/furniture-hardware/legs___supports/-specificproducttype-furniture_legs/Rothley-Furniture-Leg-Angled-L501XN-Brushed-Nickel-Effect-H128mm-9281999
http://www.example.com/nav/fix/electrical/cable-management/cable_clips/Corelectric-Clips-Cable-Round-Polybag-Pk20-11348134
http://www.example.com/nav/fix/power-tool-accessories/router-bits/jointing_biscuits/Trend-T-Tech-Beech-Biscuit-No-10-TT-BSC-10-100-Pack-9288386
etc...

Here is a sample of the XML; this structure is repeated many times, once per product:

<product>
<id>13128613</id>
<name>Cooke &amp; Lewis Gemstone Triassic Worktop 3050mm</name>
<categoryId>9372151</categoryId>
<features>Edged 1 long, 2 short sides, No templating required reducing fitting complexities, time and cost, This stunning design is made from 85% recycled material including glass and shell, supporting environmental sustainability, A 6mm solid material bonded to a 28mm solid chipboard core, backed with a moisture resistant balance paper for complete water resistance, A hard surface that is resistant to daily wear and tear</features>
<url>http://www.example.com/nav/rooms/kitchens/kitchen-worktops/gemstone_solid_surface_worktops/-specificproducttype-worktops/Cooke-and-Lewis-Gemstone-Triassic-Worktop-3050mm-13128613</url>
<productHierarchy>Rooms &gt; Kitchens &gt; Kitchen Worktops &gt; Gemstone Solid Surface Worktops &gt; Worktops</productHierarchy>
<quantity/>
<sku>
<id>13619319</id>
<name>Cooke &amp; Lewis Gemstone Triassic Worktop 3050mm</name>
<description>A 6mm solid material bonded to a 28mm high performance chipboard core, Cooke &amp; Lewis Gemstone is the perfect green choice, formulated with 85% recycled material.</description>
<ean>5397007119039</ean>
<condition>new</condition>
<price>582.00</price>
<wasPrice/>
<deliveryCost>0.0</deliveryCost>
<deliveryTime>Delivery usually within 5 weeks</deliveryTime>
<stockAvailability>1</stockAvailability>
<skuAvailableInStore>0</skuAvailableInStore>
<skuAvailableOnline>1</skuAvailableOnline>
<channel>Home Delivery Only</channel>
<buyerCats>
<catLevel0>KITCHENS</catLevel0>
<catLevel1>SOLID SURFACE WORKTOPS</catLevel1>
<catLevel2>SPEEDSTONE SOLID SURFACE</catLevel2>
</buyerCats>
<affiliateCats>
<affiliateCat0>Home &amp; Garden</affiliateCat0>
</affiliateCats>
<manufacturersPartNumber/>
<specificationsModelNumber/>
<featuresBrand>Cooke &amp; Lewis Gemstone</featuresBrand>
<imageUrl>http://example.com/is/image/5397007119039_001c_v001_zp</imageUrl>
<thumbnailUrl>http://example.com/is/image/5397007119039_001c_v001_zp?$75x75_generic$=</thumbnailUrl>
<skuNavAttributes>
<ecoGrowFoods>false</ecoGrowFoods>
<ecoDLME>false</ecoDLME>
<ecoRecycle>false</ecoRecycle>
<ecoSavesWater>false</ecoSavesWater>
<ecoHealthyHomes>false</ecoHealthyHomes>
<ecoNurtureNature>false</ecoNurtureNature>
<ecoSavesEnergy>false</ecoSavesEnergy>
</skuNavAttributes>
</sku>
</product>

I only want the main url of each product; I don't care about the other URLs in the XML structure, such as imageUrl and thumbnailUrl.

I have already tried:

sed -rn '/<url>([^"]*)<\/url>/' file.xml > file.txt

but so far the output is empty.

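Best answer

You can first grep for the lines containing <url> (this works if the XML file is formatted as you show, with one element per line), and then strip the XML tags: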

grep '<url>' file.xml | sed 's/.*>\([^<]*\)<.*/\1/' >> file.txt
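The attempt in the question most likely printed nothing because its sed expression is only an address (`/.../`) with no command after it, which GNU sed rejects as a missing command. Below is a minimal sed-only sketch of the same idea, assuming each <url> element sits on its own line as in the sample. The cut-down sample data and the temporary directory are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: a cut-down record with the same tag layout as the question.
workdir=$(mktemp -d)
cat > "$workdir/file.xml" <<'EOF'
<product>
<id>1</id>
<url>http://www.example.com/nav/a-1</url>
<sku>
<imageUrl>http://example.com/is/image/1</imageUrl>
<thumbnailUrl>http://example.com/is/image/1?thumb</thumbnailUrl>
</sku>
</product>
<product>
<id>2</id>
<url>http://www.example.com/nav/b-2</url>
</product>
EOF

# -n turns off automatic printing; the trailing p prints only the lines
# on which the substitution matched, i.e. lines containing <url>...</url>.
# ':' is used as the s delimiter so the '/' in </url> needs no escaping.
sed -n 's:.*<url>\([^<]*\)</url>.*:\1:p' "$workdir/file.xml" > "$workdir/file.txt"

cat "$workdir/file.txt"
```

Because the pattern is case-sensitive and requires the literal tag <url>, the imageUrl and thumbnailUrl lines are skipped.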

Alternatively, you can delete the tags outright:

grep '<url>' file.xml | sed 's/<\/*url>//g'
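The line-based commands here depend on the file being pretty-printed. If several products share one line, a variant using grep's -o option (an extension available in both GNU and BSD grep, not part of POSIX) prints each match on its own line before the tags are stripped. This is a sketch under that assumption; the one-line sample file is invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample: two products squeezed onto a single line.
workdir=$(mktemp -d)
printf '%s\n' '<product><id>1</id><url>http://www.example.com/nav/a-1</url><sku><imageUrl>http://example.com/is/image/1</imageUrl></sku></product><product><id>2</id><url>http://www.example.com/nav/b-2</url></product>' > "$workdir/file.xml"

# -o prints only the matched text, one match per output line,
# so each <url>...</url> element ends up on a line of its own.
# The sed step then removes the opening and closing url tags.
grep -o '<url>[^<]*</url>' "$workdir/file.xml" | sed 's/<\/*url>//g' > "$workdir/file.txt"

cat "$workdir/file.txt"
```

For the sample above this writes the two product URLs, one per line, and leaves out the imageUrl value.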

After replacing < and > with spaces, you can select the second column:

grep '<url>' file.xml | tr '<>' ' ' | awk '{print $2}'
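The column-selection idea can also be done in awk alone by treating < and > as field separators, which avoids the tr step and still works if a URL happens to contain a space. This is a sketch assuming one element per line; the sample file, including its indentation, is made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: indented records with the same tag names as the question.
workdir=$(mktemp -d)
cat > "$workdir/file.xml" <<'EOF'
<product>
  <id>1</id>
  <url>http://www.example.com/nav/a-1</url>
  <sku>
    <imageUrl>http://example.com/is/image/1</imageUrl>
  </sku>
</product>
<product>
  <id>2</id>
  <url>http://www.example.com/nav/b-2</url>
</product>
EOF

# With < and > as separators, a line like "  <url>X</url>" splits into:
#   $1 = leading whitespace, $2 = "url", $3 = X, $4 = "/url"
# Comparing $2 to "url" exactly keeps imageUrl and thumbnailUrl out.
awk -F'[<>]' '$2 == "url" { print $3 }' "$workdir/file.xml" > "$workdir/file.txt"

cat "$workdir/file.txt"
```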

You can also use xpath instead of grep to select the right tags, for example:

xpath -q -e '//product/url' file.xml | ... > file.txt

Regarding xml - using sed, awk, cat or grep to pipe URLs from XML into a separate file on Linux, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24086107/
