
Perl: extracting the sentences between </SPAN> and <br>

Repost. Author: 行者123. Updated: 2023-12-02 04:56:50

I want to extract the sentences between the SPAN and br tags. I am trying to use HTML::TreeBuilder. I am new to Perl. Any help would be appreciated.

<p>
<SPAN class="verse" id="1">1 </SPAN> ଆରମ୍ଭରେ ପରମେଶ୍ବର ଆକାଶ ଓ ପୃଥିବୀକୁ ସୃଷ୍ଟି କଲେ।
<br><SPAN class="verse" id="2">2 </SPAN> ପୃଥିବୀ ସେତବେେଳେ ସଂପୂରନ୍ଭାବେ ଶୂନ୍ଯ ଓ କିଛି ନଥିଲା। ଜଳଭାଗ ଉପରେ ଅନ୍ଧକାର ଘାଡ଼ଇେେ ରଖିଥିଲା ଏବଂ ପରମେଶ୍ବରଙ୍କର ଆତ୍ମା ଜଳଭାଗ
<br><SPAN class="verse" id="3">3 </SPAN> ଉପରେ ବ୍ଯାପ୍ତ ଥିଲା।
<br><SPAN class="verse" id="4">4 </SPAN> ପରମେଶ୍ବର ଆଲୋକକୁ ଦେଖିଲେ ଏବଂ ସେ ଜାଣିଲେ, ତାହା ଉତ୍ତମ, ଏହାପ ରେ ପରମେଶ୍ବର ଆଲୋକକୁ ଅନ୍ଧକାରରୁ ଅଲଗା କଲେ।
</p>

What I tried:

foreach my $line (@lines) {
    # Create a new tree by parsing the HTML in the string $line.
    my $tr = HTML::TreeBuilder->new_from_content($line);

    # Find all <p> tags and collect their child nodes in an array.
    my @lists =
        map { $_->content_list }
        $tr->find_by_tag_name('p');

    # Loop through the array, printing each value to STDOUT and to FILE1.
    foreach my $val (@lists) {
        print $val, "\n";
        printf FILE1 "\n%s", $val;
    }
}

I cannot skip the HTML tags nested inside the p tag. I only want to extract the Unicode text and skip the nested tags.

Best answer

I would use XML::Twig, simply because I am familiar with it. Under the hood, it uses HTML::TreeBuilder to convert the HTML to XHTML.

A simple solution to your problem would be:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

binmode( STDOUT, ':utf8'); # to avoid warnings when printing out wide (multi-byte) characters


my $file= shift @ARGV;

my $t= XML::Twig->new->parsefile_html( $file);

foreach my $p ($t->descendants( 'p'))
  { $p->cut_children( 'span');              # HTML::TreeBuilder lowercases tags
    my @texts= $p->children_text( '#TEXT'); # just get the text
    print join "---\n", @texts;             # or do whatever with the text
  }
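
For comparison, the same result can probably be reached with HTML::TreeBuilder alone, which is what the question started from. The sketch below relies on the fact that content_list returns child elements as objects and text as plain strings, so keeping only the non-reference items drops the span and br nodes. The helper name verse_texts and the ASCII stand-in sentences are assumptions made for illustration; real input read from a file would also need to be decoded to characters first.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

binmode( STDOUT, ':encoding(UTF-8)' );   # avoid "Wide character" warnings on real data

# Hypothetical helper: returns the text pieces that are direct children
# of every <p>, skipping nested elements such as <span> and <br>.
sub verse_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @texts;
    for my $p ( $tree->find_by_tag_name('p') ) {
        for my $node ( $p->content_list ) {
            next if ref $node;                      # an HTML::Element, not text
            ( my $text = $node ) =~ s/^\s+|\s+$//g; # trim surrounding whitespace
            push @texts, $text if length $text;
        }
    }
    $tree->delete;                                  # release the parse tree
    return @texts;
}

# Stand-in markup with the same shape as the sample in the question.
my $html = <<'HTML';
<p>
<SPAN class="verse" id="1">1 </SPAN> first sentence.
<br><SPAN class="verse" id="2">2 </SPAN> second sentence.
</p>
HTML

print "$_\n" for verse_texts($html);
```

This only collects text sitting directly under the p element. Text nested deeper, for example inside a b or i tag, would be skipped along with the tag, so the XML::Twig version above may be the safer starting point for less regular markup.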

On extracting the sentences between </SPAN> and <br> in Perl, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21015777/
