
linux - Scraping information from a web-based PDF with R

Reposted. Author: 行者123. Updated: 2023-12-05 07:57:18

I am trying to scrape text from the following web-hosted PDF: http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf

Any suggestions on how to do this? I tried the tm package without success (it does not recognize the path):

> pdf.loader <- readPDF(control= list(text = "-layout"))
> rr <- pdf.loader(elem=list(uri="http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf"),language="en",id="id1")
Error: Cannot handle URI 'http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf'.
Error: Cannot handle URI 'http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf'.
Warning messages:
1: In normalizePath(file) :
path[1]="http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf": No such file or directory
2: running command ''pdftotext' -layout 'http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf' -' had status 1

I also tried passing different "engine" arguments to readPDF(), without success.
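
The warnings above suggest that the URL was handed to pdftotext as if it were a local file path. One possible workaround, sketched here under the assumption that the tm package and the pdftotext command-line tool are installed, is to download the PDF to a temporary file first and pass that path instead. The helper name fetch_pdf is hypothetical and not part of tm.

```r
# Hypothetical helper: download a remote PDF to a temporary file and
# return the local path. download.file() ships with base R (utils).
fetch_pdf <- function(url, dest = tempfile(fileext = ".pdf")) {
  # mode = "wb" keeps the binary PDF intact on Windows
  status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  dest
}

# Usage sketch (needs network access, the tm package, and pdftotext on PATH):
# library(tm)
# local_path <- fetch_pdf("http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf")
# pdf.loader <- readPDF(engine = "xpdf", control = list(text = "-layout"))
# rr <- pdf.loader(elem = list(uri = local_path), language = "en", id = "id1")
```

The reader then receives an ordinary file path, which is what normalizePath() in the warning was expecting.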

Best answer

You could consider the following code:

library(pdftools)
pdf_text('http://www.cmegroup.com/delivery_reports/IssuesAndStopsReport.pdf')

[1] " CME CLEARING - CHICAGO BOARD OF TRADE\nDLV600-T\nBUSINESS DATE: 09/23/2022 DAILY ISSUES AND STOPS RUN DATE: 09/23/2022\nPRODUCT GROUP: FINANCIAL RUN TIME: 08:30:43PM\n\n\n\n CONTRACT: SEPTEMBER 2022 30 YR U.S. TREASURY BOND FUTURES\n SETTLEMENT: 130.218750000 USD\n NEXT AVAILABLE DATE: 08/26/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n\n TOTAL: 0 0\n MONTH TO DATE: 4,671\n\n CONTRACT: SEPTEMBER 2022 10Y TREASURY NOTE FUTURES\n SETTLEMENT: 113.625000000 USD\n NEXT AVAILABLE DATE: 09/20/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n\n TOTAL: 0 0\n MONTH TO DATE: 17,671\n\n CONTRACT: SEPTEMBER 2022 5 YR TREASURY NOTE FUTURES\n SETTLEMENT: 107.789062500 USD\n NEXT AVAILABLE DATE: 09/01/2022\n INTENT DATE: 09/23/2022 DELIVERY DATE: 09/27/2022\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n660 CUST JP MORGAN 500\n825 CUST PTG DIV SGAS 500\n______________________________________________________________________________\n\n TOTAL: 500 500\n MONTH TO DATE: 28,664\n\n CONTRACT: SEPTEMBER 2022 2 YEAR TREASURY NOTE FUTURES\n SETTLEMENT: 102.855468750 USD\n NEXT AVAILABLE DATE: 08/26/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n"
[2] " CME CLEARING - CHICAGO BOARD OF TRADE\nDLV600-T\nBUSINESS DATE: 09/23/2022 DAILY ISSUES AND STOPS RUN DATE: 09/23/2022\nPRODUCT GROUP: FINANCIAL RUN TIME: 08:30:43PM\n\n\n TOTAL: 0 0\n MONTH TO DATE: 14,954\n\n CONTRACT: SEPTEMBER 2022 3 YEAR TREASURY NOTE FUTURE\n SETTLEMENT: 104.578125000 USD\n NEXT AVAILABLE DATE: 09/23/2022\n INTENT DATE: 09/23/2022 DELIVERY DATE: 09/27/2022\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n617 HOUS MORGAN STANLEY 1\n709 CUST BARCLAYS 95\n825 CUST PTG DIV SGAS 656 560\n______________________________________________________________________________\n\n TOTAL: 656 656\n MONTH TO DATE: 656\n\n CONTRACT: SEPTEMBER 2022 ULTRA 10-YEAR U S TREASURY NOTE FUT\n SETTLEMENT: 121.437500000 USD\n NEXT AVAILABLE DATE: 09/20/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n\n TOTAL: 0 0\n MONTH TO DATE: 43,088\n\n CONTRACT: SEPTEMBER 2022 20-YEAR U.S. TREASURY BOND FUTURES\n SETTLEMENT: 133.281250000 USD\n NEXT AVAILABLE DATE: 08/25/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n\n TOTAL: 0 0\n MONTH TO DATE: 17\n"
[3] " CME CLEARING - CHICAGO BOARD OF TRADE\nDLV600-T\nBUSINESS DATE: 09/23/2022 DAILY ISSUES AND STOPS RUN DATE: 09/23/2022\nPRODUCT GROUP: FINANCIAL RUN TIME: 08:30:43PM\n\n CONTRACT: SEPTEMBER 2022 LONG TERM U.S. TREASURY BOND FUTURE\n SETTLEMENT: 142.750000000 USD\n NEXT AVAILABLE DATE: 09/09/2022\n INTENT DATE: DELIVERY DATE:\n\nFIRM NBR ORIG FIRM NAME ISSUED STOPPED\n______________________________________________________________________________\n______________________________________________________________________________\n\n TOTAL: 0 0\n MONTH TO DATE: 10,416\n\n\n\n <<< End of Report >>>\n"
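
pdf_text() returns one long string per page, so some post-processing is usually needed before the figures are usable. The following is a minimal sketch, assuming the report keeps the "CONTRACT:" and "MONTH TO DATE:" labels seen in the output above; the helper name parse_issues_stops is hypothetical. It walks the lines in order, so a contract whose totals spill onto the next page (as the 2 YEAR note does above) is still matched to the right month-to-date figure.

```r
# Hypothetical helper: pull contract names and month-to-date totals
# out of the character vector returned by pdf_text().
parse_issues_stops <- function(pages) {
  # One string per page -> one trimmed string per line
  lines <- trimws(unlist(strsplit(pages, "\n", fixed = TRUE)))

  contract <- NA_character_
  rows <- list()
  for (ln in lines) {
    if (startsWith(ln, "CONTRACT:")) {
      # Remember the contract currently being reported
      contract <- trimws(sub("^CONTRACT:", "", ln))
    } else if (startsWith(ln, "MONTH TO DATE:")) {
      # Keep digits only, which drops thousands separators such as "28,664"
      value <- gsub("[^0-9]", "", sub("^MONTH TO DATE:", "", ln))
      rows[[length(rows) + 1]] <- data.frame(
        contract = contract,
        month_to_date = as.numeric(value),
        stringsAsFactors = FALSE
      )
    }
  }
  # Returns NULL when no "MONTH TO DATE:" line was found
  do.call(rbind, rows)
}

# Small inline sample shaped like the output above, so no download is needed
sample_page <- paste0(
  " CONTRACT: SEPTEMBER 2022 5 YR TREASURY NOTE FUTURES\n",
  " SETTLEMENT: 107.789062500 USD\n\n",
  " TOTAL: 500 500\n",
  " MONTH TO DATE: 28,664\n"
)
result <- parse_issues_stops(sample_page)
print(result)
```

With the real report, the call would be parse_issues_stops(pdf_text(url)). The per-firm rows (firm number, origin, name, issued, stopped) are not handled here; they would need column positions from the original layout, which the labels alone do not provide.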

Regarding linux - Scraping information from a web-based PDF with R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27280234/
