
mysql - Using Perl to extract Yahoo Finance corporate bond data into MySQL

Reposted. Author: 行者123. Updated: 2023-11-30 00:47:51

I want to extract the profile information for every row, across all pages, of the table listed here:

http://reports.finance.yahoo.com/z1?b=1&so=a&sf=m&tc=1&stt=-&pr=0&cpl=-1&cpu=-1&yl=-1&yu=-1&ytl=-1&ytu=-1&mtl=-1&mtu=-1&rl=5&ru=-1&cll=0

Here is an example of the link for one of the rows in the table (the links are all in the "Issue" column):

http://reports.finance.yahoo.com/z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000

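I would like to store all the information for every issue, across all rows and pages, in a MySQL database. I think Perl would be a good tool for this, but my experience with it is very limited.

I think I need to collect every link in the Issue column across all pages of the table (more than 2,600 pages at the time) and then somehow extract the information from the page behind each link.

Any help would be greatly appreciated.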

Best answer

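This should get you started and shows the general technique for doing this with regular expressions (it may be hard to follow if you are not very familiar with Perl and regex matching).

I have only done this for the first page, and I added as many comments to the code as I could to help you understand it. If you cannot work out what the code actually does, I suggest trying a different tool (or trying modules such as Web::Scraper or Mojo::DOM). If you really want to get the job done in Perl, read some of the Perl documentation: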

http://perldoc.perl.org/perlre.html
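
As a warm-up for the regex technique used in the script below, here is a minimal sketch of the two idioms it relies on: a match in list context returning its captures, and non-greedy quantifiers. The $row string is made up for illustration and is only assumed to resemble the markup of the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up row, assumed to be shaped like the cleaned-up markup the script works on.
my $row = '<td>Corp</td><td><a href="/z2?ce=123">HESS CORP</a></td><td>100.92</td>';

# In list context, a /g match returns the captured group of every match at once.
my @cells = $row =~ /<td>(.*?)<\/td>/g;

# Without /g, list context returns the capture groups of the first match only.
my ($href, $name) = $cells[1] =~ /<a[^>]*?href="([^"]*?)"[^>]*?>(.+?)<\/a>/;

# A greedy (.*) runs on to the last </td>; the non-greedy (.*?) stops at the first one.
my ($greedy) = $row =~ /<td>(.*)<\/td>/;

print scalar(@cells), " cells; link=$href; name=$name\n";
print "greedy capture: $greedy\n";
```

The greedy capture swallows all three cells in one go, which is why the script uses (.*?) and (.+?) throughout.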

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use feature 'say';

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Oops, something went wrong!" unless defined $page_content;

process_bond_results_page($page_content);

sub process_bond_results_page {
    my $content = shift;

    # Iterate over $content for as long as the <tr class="yfnc_tabledata1">...</tr> regex matches.
    # Each match puts the row content (the text between <tr ...> and </tr>) in the special variable $1.
    while ($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
        # uncomment the line below to see what $1 contains
        # say $1;

        # strip the HTML tags we do not need
        my $tr_data = cleanup_html_tags($1);

        # match the content between <td> and </td> tags and put it in the @tds list
        my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;

        # skip rows that have no issue cell at all
        next unless defined $tds[1];

        # The 2nd element of @tds contains <a href="link_to_issue">ISSUE NAME</a>.
        # The line below extracts the link and the issue name into their own variables.
        my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/g;

        # skip rows whose issue cell does not contain a link
        next unless defined $link_to_issue;

        # Replace the 2nd element, which holds <a href="link_to_issue">ISSUE NAME</a>,
        # with just ISSUE NAME
        $tds[1] = $issue_name;

        # Append $link_to_issue to the end of the @tds list
        push(@tds, $link_to_issue);

        # Print the @tds array with values separated by TABs
        say join("\t", @tds);
    }

    # Does the page have a Next link?
    my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/g;
    say 'NEXT: ' . $next_link if $next_link;

    return;
}

sub cleanup_html_tags {
    my $html = shift;
    $html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>
    $html =~ s/<td[^>]*?>/<td>/g;        # replace all <td...> with just <td>
    $html =~ s/<\/?nobr>//g;             # remove <nobr> and </nobr>
    return $html;
}

The code above prints:

Corp MERRILL LYNCH CO INC MTN BE 100.63 5.000 3-Feb-2014 -19.649 4.969 A No /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CME GROUP INC 100.84 5.750 15-Feb-2014 -8.334 5.702 AA No /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CAPITAL ONE BK MTN BE 100.80 5.125 15-Feb-2014 -8.334 5.084 A No /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HESS CORP 100.92 7.000 15-Feb-2014 -8.351 6.937 BBB No /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp PACCAR INC 100.90 6.875 15-Feb-2014 -8.295 6.813 A No /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WACHOVIA CORP NEW 100.78 4.875 15-Feb-2014 -8.337 4.837 A No /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CATERPILLAR FINL SVCS MTNS BE 100.89 6.125 17-Feb-2014 -7.597 6.071 A No /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp KRAFT FOODS INC 100.97 6.750 19-Feb-2014 -6.921 6.685 BBB No /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WESTERN UN CO 101.05 6.500 26-Feb-2014 -5.154 6.432 BBB No /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp AMERICA MOVIL SAB DE CV 101.06 5.500 1-Mar-2014 -4.615 5.443 A No /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HARTFORD FINL SVCS GROUP INC 100.96 4.750 1-Mar-2014 -4.454 4.705 BBB No /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HEWLETT PACKARD CO 101.12 6.125 1-Mar-2014 -4.599 6.057 BBB No /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp RYDER SYS MTN BE 101.08 5.850 1-Mar-2014 -4.495 5.788 BBB No /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HSBC FIN CORP HSBC FIN 100.72 2.000 15-Mar-2014 -3.011 1.986 A No /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp SYSCO CORP 101.06 4.600 15-Mar-2014 -2.772 4.552 A No /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
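
Both the issue links and the NEXT link in this output are relative, so they have to be turned into absolute URLs before they can be fetched. The following is a minimal sketch of how the remaining pages could be walked, assuming every relative link lives at the site root shown in the start URL. The helper names absolute_url, next_page_url and crawl, and the page limit and delay parameters, are hypothetical and not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: all relative links on the results pages resolve against this root.
use constant BASE => 'http://reports.finance.yahoo.com/';

# Turn a scraped link such as "/z2?ce=..." or "z1?b=2..." into an absolute URL.
sub absolute_url {
    my ($link) = @_;
    return $link if $link =~ m{^https?://}i;    # already absolute
    $link =~ s{^/}{};                           # drop a leading slash, BASE ends with one
    return BASE . $link;
}

# Return the absolute URL behind the "Next" link of a results page, or undef on the last page.
# The regex is the one the script uses to find the link.
sub next_page_url {
    my ($content) = @_;
    my ($next) = $content =~ /<a[^>]*?href="([^"]+?)">Next<\/a><\/b>/;
    return unless defined $next;
    return absolute_url($next);
}

# Fetch page after page, handing each page's HTML to $handle_page.
# $fetch is a code reference that takes a URL and returns the HTML, or undef on failure.
# Stops when there is no Next link, a fetch fails, or $max_pages is reached.
sub crawl {
    my ($start_url, $fetch, $handle_page, $max_pages, $delay) = @_;
    my $url   = $start_url;
    my $pages = 0;
    while (defined $url && $pages < $max_pages) {
        my $content = $fetch->($url);
        last unless defined $content;
        $handle_page->($content);
        $pages++;
        $url = next_page_url($content);
        sleep $delay if $delay && defined $url;    # be polite between requests
    }
    return $pages;    # number of pages processed
}
```

With LWP::Simple loaded as in the script, a call such as crawl($start_url, \&LWP::Simple::get, \&process_bond_results_page, 5, 1) would process at most five pages with a one-second pause between requests. Passing the fetcher as a code reference also makes it easy to try the loop against saved HTML first.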

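The question asks for the rows to end up in MySQL, which the script does not cover. Here is a rough sketch of one way to prepare a row for insertion with DBI placeholders. The column names are guesses read off the sample output, and the table name bonds, the DSN and the credentials are all placeholders. DBI and DBD::mysql would need to be installed for the usage shown in the closing comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed column names, read off the sample output; the real table headers may differ.
my @COLUMNS = qw(type issue price coupon maturity ytm current_yield rating callable link);

my %MONTH = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
             Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12);

# Convert a date such as "3-Feb-2014" to the "2014-02-03" form a MySQL DATE column accepts.
# Returns undef when the text does not look like a date.
sub to_mysql_date {
    my ($text) = @_;
    my ($day, $mon, $year) = $text =~ /^\s*(\d{1,2})-([A-Za-z]{3})-(\d{4})\s*$/
        or return;
    my $month = $MONTH{ ucfirst lc $mon } or return;
    return sprintf '%04d-%02d-%02d', $year, $month, $day;
}

# Turn one @tds list (as built by the script) into a hash reference keyed by column name.
# Returns undef when the number of cells is not the expected one.
sub row_to_record {
    my @tds = @_;
    return unless @tds == @COLUMNS;
    my %rec;
    @rec{@COLUMNS} = @tds;
    for my $value (values %rec) { $value =~ s/^\s+|\s+$//g }    # trim each value in place
    $rec{maturity} = to_mysql_date($rec{maturity});
    $rec{callable} = $rec{callable} eq 'Yes' ? 1 : 0;
    return \%rec;
}

# Build an INSERT statement with one placeholder per column.
sub insert_sql {
    my ($table) = @_;
    return sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table, join(', ', @COLUMNS), join(', ', ('?') x @COLUMNS);
}

# Insert records through an already connected DBI handle; "bonds" is a hypothetical table.
sub store_records {
    my ($dbh, @records) = @_;
    my $sth = $dbh->prepare(insert_sql('bonds'));
    $sth->execute(@{$_}{@COLUMNS}) for @records;
    return scalar @records;
}

# Usage sketch (database name, user and password are placeholders):
#   use DBI;
#   my $dbh = DBI->connect('DBI:mysql:database=bonds_db;host=localhost',
#                          'user', 'password', { RaiseError => 1 });
#   store_records($dbh, row_to_record(@tds));
```

Placeholders leave the quoting of issue names and links to the database driver, which is safer than interpolating scraped text into the SQL string.
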
Regarding "mysql - Using Perl to extract Yahoo Finance corporate bond data into MySQL", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21211064/
