
perl - Converting RTF to TEXT using Perl

Reposted. Author: 行者123. Updated: 2023-12-02 04:27:43

Can someone tell me how to use Perl to convert an RTF file to text while keeping all the tags, tables, and formatted data?

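@Ahmad Bilal, @petersergeant: I have been using the code below to convert RTF to TXT, and the conversion to text works. The problem is that I can't capture table or image formatting, and the program does not even capture all the entities in the input file.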

use 5.8.0;
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;
use RTF::HTMLConverter;

#-------------------------------------------------------------------
# Variable Declarations
#-------------------------------------------------------------------
my $tempfile = "";
my $Outfile = "";
my $txtfile = "";
my $URL = "";
my $Format = "";
my $TreeBuilder = "";
my $Parsed = "";
my $line = "";


my %opts;
GetOptions(
    "help|h|?"     => \$opts{help},
    "man|m"        => \$opts{man},
    "dom=s"        => \$opts{dom},
    "noimages|n"   => \$opts{noimages},
    "imagedir|d=s" => \$opts{imagedir},
    "imageuri|u=s" => \$opts{imageuri},
    "encoding|e=s" => \$opts{encoding},
    "indented|i=i" => \$opts{indented},
);

pod2usage(-verbose => 1, -exitval => 0) if $opts{help};
pod2usage(-verbose => 2, -exitval => 0) if $opts{man};

my %params;
if($opts{dom}){
    # Load the DOM implementation named on the command line
    eval "require $opts{dom}";
    die $@ if $@;
    $params{DOMImplementation} = $opts{dom};
}else{
    # Prefer XML::GDOME; fall back to XML::DOM if it is not installed
    eval { require XML::GDOME };
    if($@){
        eval { require XML::DOM };
        die "Can't load either XML::GDOME or XML::DOM\n" if $@;
        $params{DOMImplementation} = 'XML::DOM';
    }
}

if($opts{noimages}){
    $params{discard_images} = 1;
}else{
    $params{image_dir} = $opts{imagedir} if defined $opts{imagedir};
    $params{image_uri} = $opts{imageuri} if defined $opts{imageuri};
}

$params{codepage}   = $opts{encoding} if $opts{encoding};
$params{formatting} = $opts{indented} if defined $opts{indented};

#-----------------------------------------------
# Converting RTF to HTML
#-----------------------------------------------

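if(defined $ARGV[0]){
    open(FR, "<", $ARGV[0]) or die "Can't open '$ARGV[0]': $!\n";
    $params{in} = \*FR;
    $tempfile = $ARGV[0];
    # Derive the output names from the input name; only the final ".rtf" is replaced
    $tempfile =~ /^(.*\.)rtf$/i or die "Input file name must end in .rtf\n";
    $Outfile = $1."html";
    $txtfile = $1."txt";

    open(FW, ">", $Outfile) or die "Can't open '$Outfile': $!\n";
    $params{out} = \*FW;
    print "\n$Outfile - HTML Created\n";
}else{
    die "Usage: $0 [options] file.rtf\n";
}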

my $parser = RTF::HTMLConverter->new(%params);
$parser->parse();


close FW;

#-----------------------------------------------
# Opening HTML and TXT files
#-----------------------------------------------

open (FILE1, ">", $txtfile) or die "Can't open '$txtfile': $!\n";
open (FILE2, "<", $Outfile) or die "Can't open '$Outfile': $!\n";

#-----------------------------------------------
# Converting HTML to TXT file
#-----------------------------------------------

local $/ = undef;    # slurp mode: the loop below reads the whole file in one pass
while ($line = <FILE2>) {
    $line =~ s/\n//g;
    $line =~ s/(<!DOCTYPE HTML.*><html><head>.*<\/style>)/<sectd>/;
    # \b[^>]* keeps these patterns from also matching tags such as <pre> or <link>
    $line =~ s/<font\b[^>]*>//g;
    $line =~ s/<\/font>//g;
    $line =~ s/<table .*?>/\n<table>\n/g;
    $line =~ s/<\/table>/\n<\/table>/g;
    $line =~ s/<td .*?>/\n<td>/g;
    $line =~ s/<tr>/\n<tr>/g;
    $line =~ s/<\/tr>/\n<\/tr>/g;
    $line =~ s/<ul\b[^>]*>/\n<ul>/g;
    $line =~ s/<li\b[^>]*>/\n<li>/g;
    $line =~ s/<\/ul>/\n<\/ul>/g;
    $line =~ s/<\/body><\/html>//g;
    $line =~ s/<p\b[^>]*>/\n<p>/g;
    $line =~ s/<p>(&nbsp;|\*|\s)+<\/p>//g;
    $line =~ s/&nbsp;//g;
    $line =~ s/(<sectd>\n?.*?)<\/head><body>/$1/g;

    #-------------------
    # Entity Conversion
    #-------------------
    $line =~ s/&rsquo;/&#x2019;/g;    # right single quote is U+2019, not U+2018
    $line =~ s/“/&#x201C;/g;
    $line =~ s/”/&#x201D;/g;
    $line =~ s/¶/&para;/g;

    print FILE1 $line;
}

print "$txtfile - TXT file Created \n";

close FILE1;
close FILE2;

unlink($Outfile) or warn "Can't remove '$Outfile': $!\n";
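
The script above keeps the table markup as tags in the .txt file and only handles a few entities. As a rough sketch of an alternative for the HTML-to-text step (not part of the original question), the following uses core Perl only. It turns each table row into one tab-separated line and decodes numeric entities plus a small set of named ones. The names `%ENTITY`, `decode_entities_basic`, and `html_to_text` are hypothetical, and regex-based tag stripping like this assumes simple, well-formed HTML with no nested tables.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only the named entities listed here are decoded.
my %ENTITY = (
    nbsp  => ' ',
    amp   => '&',
    lt    => '<',
    gt    => '>',
    quot  => '"',
    lsquo => "\x{2018}",
    rsquo => "\x{2019}",
    ldquo => "\x{201C}",
    rdquo => "\x{201D}",
    para  => "\x{00B6}",
);

# Decode hex, decimal, and known named entities in a single pass, so that
# text produced by one replacement is never decoded a second time.
sub decode_entities_basic {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z]+));}{
        defined($1)         ? chr(hex($1))
      : defined($2)         ? chr($2)
      : exists $ENTITY{$3}  ? $ENTITY{$3}
      :                       "&$3;"
    }ge;
    return $text;
}

# Convert simple HTML to plain text: one line per table row or paragraph,
# with table cells separated by tabs.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>//gis;                # drop non-content blocks
    $html =~ s/<\/t[dh]>\s*<t[dh]\b[^>]*>/\t/gi;               # cell boundary -> tab
    $html =~ s/<\/(?:tr|p|li|div|h[1-6])>|<br\b[^>]*>/\n/gi;   # block end -> newline
    $html =~ s/<[^>]+>//g;                                     # strip remaining tags
    my $text = decode_entities_basic($html);
    $text =~ s/\n{2,}/\n/g;                                    # collapse blank lines
    $text =~ s/^\s+|\s+$//g;                                   # trim both ends
    return $text;
}

# Small demonstration, run only when this file is executed directly.
unless (caller) {
    binmode STDOUT, ':encoding(UTF-8)';
    my $html = '<table><tr><td>Name</td><td>Qty</td></tr>'
             . '<tr><td>Tea &amp; milk</td><td>2</td></tr></table>'
             . '<p>It&rsquo;s done.</p>';
    print html_to_text($html), "\n";
}
```

Images are not handled at all here. Plain text has no way to carry them, so the most that could be kept is a placeholder line per image.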

Best answer

I'm the author of the linked module. Don't use it. If at all possible, use a dedicated RTF-to-text converter such as Pandoc instead.
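
To make the Pandoc suggestion concrete, here is a minimal sketch of calling it from Perl. It assumes `pandoc` is on the PATH and is recent enough to include an RTF reader (`pandoc --list-input-formats` should list `rtf`). The helper names `pandoc_command` and `rtf_to_text` are hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Build the pandoc argument list. Passing a list to system() bypasses the
# shell, so file names containing spaces or quotes need no escaping.
sub pandoc_command {
    my ($rtf_file, $txt_file) = @_;
    return ('pandoc', '--from=rtf', '--to=plain', "--output=$txt_file", $rtf_file);
}

# Run pandoc and die with a readable message if it cannot be started
# or exits with a non-zero status.
sub rtf_to_text {
    my ($rtf_file, $txt_file) = @_;
    my @cmd = pandoc_command($rtf_file, $txt_file);
    my $status = system(@cmd);
    die "Could not run pandoc: $!\n" if $status == -1;
    die "pandoc failed (exit status " . ($status >> 8) . ")\n" if $status != 0;
    return $txt_file;
}

# Usage: perl rtf2txt.pl input.rtf output.txt   (the script name is an assumption)
unless (caller) {
    if (@ARGV == 2) {
        my $out = rtf_to_text(@ARGV);
        print "$out - TXT file created\n";
    }
}
```

How tables come out in the plain-text output depends on the Pandoc version, so it is worth checking the result on a sample file first.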

Regarding "perl - Converting RTF to TEXT using Perl", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25710250/
