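I have been doing research for a very important personal project. I want to build a Flask search application that lets me search the contents of more than 100 PDF files. I found some information about an Elasticsearch library that works well with Flask.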
#!/usr/bin/env python3
#-*- coding: utf-8 -*-
# import libraries to help read and create PDF
import PyPDF2
from fpdf import FPDF
import base64
import json
from flask import Flask, jsonify, request, render_template, json
from datetime import datetime
import pandas as pd
# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch
# create a new client instance of Elasticsearch
elastic_client = Elasticsearch(hosts=["localhost"])
es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)
# create a new PDF object with FPDF
pdf = FPDF()
# use an iterator to create 10 pages
for page in range(10):
    pdf.add_page()
    pdf.set_font("Arial", size=14)
    pdf.cell(150, 12, txt="Object Rocket ROCKS!!", ln=1, align="C")
# output all of the data to a new PDF file
pdf.output("object_rocket.pdf")
'''
read_pdf = PyPDF2.PdfFileReader("object_rocket.pdf")
page = read_pdf.getPage(0)
page_mode = read_pdf.getPageMode()
page_text = page.extractText()
print (type(page_text))
'''
#with open(path, 'rb') as file:
# get the PDF path and read the file
file = "Sheet3.pdf"
read_pdf = PyPDF2.PdfFileReader(file, strict=False)
#print (read_pdf)
# get the read object's meta info
pdf_meta = read_pdf.getDocumentInfo()
# get the page numbers
num = read_pdf.getNumPages()
print ("PDF pages:", num)
# create a dictionary object for page data
all_pages = {}
# put meta data into a dict key
all_pages["meta"] = {}
# Use 'iteritems()' instead of 'items()' for Python 2
for meta, value in pdf_meta.items():
    print (meta, value)
    all_pages["meta"][meta] = value
# iterate the page numbers
for page in range(num):
    data = read_pdf.getPage(page)
    #page_mode = read_pdf.getPageMode()
    # extract the page's text
    page_text = data.extractText()
    # put the text data into the dict
    all_pages[page] = page_text
# create a JSON string from the dictionary
json_data = json.dumps(all_pages)
#print ("\nJSON:", json_data)
# convert JSON string to bytes-like obj
bytes_string = bytes(json_data, 'utf-8')
#print ("\nbytes_string:", bytes_string)
# convert bytes to base64 encoded string
encoded_pdf = base64.b64encode(bytes_string)
encoded_pdf = str(encoded_pdf)
#print ("\nbase64:", encoded_pdf)
# put the PDF data into a dictionary body to pass to the API request
body_doc = {"data": encoded_pdf}
# call the index() method to index the data
result = elastic_client.index(index="pdf", doc_type="_doc", id="42", body=body_doc)
# print the returned results
#print ("\nindex result:", result['result'])
# make another Elasticsearch API request to get the indexed PDF
result = elastic_client.get(index="pdf", doc_type='_doc', id=42)
# print the data to terminal
result_data = result["_source"]["data"]
#print ("\nresult_data:", result_data, '-- type:', type(result_data))
# decode the base64 data (use to [:] to slice off
# the 'b and ' in the string)
decoded_pdf = base64.b64decode(result_data[2:-1]).decode("utf-8")
#print ("\ndecoded_pdf:", decoded_pdf)
# take decoded string and make into JSON object
json_dict = json.loads(decoded_pdf)
#print ("\njson_str:", json_dict, "\n\ntype:", type(json_dict))
result2 = elastic_client.index(index="pdftext", doc_type="_doc", id="42", body=json_dict)
# create new FPDF object
pdf = FPDF()
# build the new PDF from the Elasticsearch dictionary
# Use 'iteritems()' instead of 'items()' for Python 2
""" for page, value in json_dict.items():
    if page != "meta":
        # create new page
        pdf.add_page()
        pdf.set_font("Arial", size=14)
        # add content to page
        output = value + " -- Page: " + str(int(page)+1)
        pdf.cell(150, 12, txt=output, ln=1, align="C")
    else:
        # create the meta data for the new PDF
        for meta, meta_val in json_dict["meta"].items():
            if "title" in meta.lower():
                pdf.set_title(meta_val)
            elif "producer" in meta.lower() or "creator" in meta.lower():
                pdf.set_creator(meta_val)
"""
# output the PDF object's data to a PDF file
#pdf.output("object_rocket_from_elaticsearch.pdf" )
@app.route('/', methods=['GET'])
def index():
    return jsonify(json_dict)
@app.route('/<id>', methods=['GET'])
def index_by_id(id):
    return jsonify(json_dict[id])
""" @app.route('/insert_data', methods=['PUT'])
def insert_data():
    slug = request.form['slug']
    title = request.form['title']
    content = request.form['content']
    body = {
        'slug': slug,
        'title': title,
        'content': content,
        'timestamp': datetime.now()
    }
    result = es.index(index='contents', doc_type='title', id=slug, body=body)
    return jsonify(result) """
app.run(port=5003, debug=True)
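The script above stores the page dictionary as base64 by calling str() on a bytes object and later slicing off the leading b' and trailing ' characters. A minimal sketch of the same round trip, using only the standard library, decodes the base64 bytes to ASCII text instead, so no slicing is needed. The helper names encode_pages and decode_pages and the sample data are hypothetical.

```python
import base64
import json


def encode_pages(all_pages):
    """Serialize a page dictionary to a base64 text string (hypothetical helper)."""
    json_data = json.dumps(all_pages)
    # b64encode returns bytes; decode to str so it can go straight into a request body
    return base64.b64encode(json_data.encode("utf-8")).decode("ascii")


def decode_pages(encoded):
    """Reverse encode_pages: base64 text back to a dictionary."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


# sample data standing in for the text extracted from a PDF
sample = {"meta": {"/Title": "Sheet3"}, 0: "first page", 1: "second page"}
encoded = encode_pages(sample)
restored = decode_pages(encoded)
# JSON turns the integer page numbers into string keys
print(restored["0"])
```

Because JSON object keys are always strings, the restored dictionary is keyed by "0", "1", and so on, which is why a string taken from a URL can be used to look up a page.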
# Load_single_PDF_BY_PAGE_TO_index.py
#!/usr/bin/env python3
#-*- coding: utf-8 -*-
# import libraries to help read and create PDF
import PyPDF2
from fpdf import FPDF
import base64
from flask import Flask, jsonify, request, render_template, json
from datetime import datetime
import pandas as pd
# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch
# create a new client instance of Elasticsearch
elastic_client = Elasticsearch(hosts=["localhost"])
es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)
#with open(path, 'rb') as file:
# get the PDF path and read the file
file = "Sheet3.pdf"
read_pdf = PyPDF2.PdfFileReader(file, strict=False)
#print (read_pdf)
# get the read object's meta info
pdf_meta = read_pdf.getDocumentInfo()
# get the page numbers
num = read_pdf.getNumPages()
print ("PDF pages:", num)
# create a dictionary object for page data
all_pages = {}
# put meta data into a dict key
all_pages["meta"] = {}
# Use 'iteritems()' instead of 'items()' for Python 2
for meta, value in pdf_meta.items():
    print (meta, value)
    all_pages["meta"][meta] = value
x = 44
# iterate the page numbers
for page in range(num):
    data = read_pdf.getPage(page)
    #page_mode = read_pdf.getPageMode()
    # extract the page's text
    page_text = data.extractText()
    # put the text data into the dict
    all_pages[page] = page_text
    # index each page as its own document, with IDs starting at 44
    body_doc2 = {"data": page_text}
    result3 = elastic_client.index(index="pdfclearn", doc_type="_doc", id=x, body=body_doc2)
    x += 1
from flask import Flask, jsonify, request,render_template
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)
@app.route('/pdf', methods=['GET'])
def index():
    # return the first indexed page (ID 44)
    results = es.get(index='pdfclearn', doc_type='_doc', id='44')
    return jsonify(results['_source'])
@app.route('/pdf/<id>', methods=['GET'])
def index_by_id(id):
    results = es.get(index='pdfclearn', doc_type='_doc', id=id)
    return jsonify(results['_source'])
@app.route('/search/<keyword>', methods=['POST','GET'])
def search(keyword):
    # full-text search for the keyword in the page text
    body = {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["data"]
            }
        }
    }
    res = es.search(index="pdfclearn", doc_type="_doc", body=body)
    return jsonify(res['hits']['hits'])
@app.route("/searhbar")
def searhbar():
    return render_template("index.html")
@app.route("/searhbar/<string:box>")
def process(box):
    # read the search term from the query string, e.g. /searhbar/names?query=test
    query = request.args.get('query')
    if box != 'names' or not query:
        # nothing to search for
        return jsonify([])
    body = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["data"]
            }
        }
    }
    res = es.search(index="pdfclearn", doc_type="_doc", body=body)
    return jsonify(res['hits']['hits'])
app.run(port=5003, debug=True)
curl http://127.0.0.1:5003/search/test  # it works!!
curl "http://localhost:9200/pdftext/_doc/42"
curl -X POST "http://localhost:9200/pdf/_search?q=*"
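Because the goal is to search across more than 100 PDF files, one option is to give every page its own document that also records the file name and page number, so a hit can be traced back to its source. The sketch below only builds the action dictionaries and assumes the page texts have already been extracted (for example with the PyPDF2 loop above). The function name build_page_actions, the field names file and page, and the ID scheme are assumptions for illustration, not part of the original code.

```python
def build_page_actions(pdf_pages, index_name="pdfclearn"):
    """Yield one bulk action per page.

    pdf_pages maps a file name to the list of page texts extracted from it.
    """
    for filename, pages in pdf_pages.items():
        for page_number, text in enumerate(pages, start=1):
            yield {
                "_index": index_name,
                # hypothetical ID scheme: file name plus 1-based page number
                "_id": f"{filename}-{page_number}",
                "_source": {
                    "file": filename,
                    "page": page_number,
                    "data": text,
                },
            }


# sample input standing in for text extracted from real PDFs
sample = {
    "Sheet3.pdf": ["page one text", "page two text"],
    "other.pdf": ["only page"],
}
actions = list(build_page_actions(sample))
print(len(actions), actions[0]["_id"])
```

With a running cluster, a list in this shape could be sent in one request with helpers.bulk(es, actions) from the elasticsearch package, instead of calling index() once per page.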
Best answer
- - - Progress - - -
I now have a working solution, though it has no front-end search feature yet. It is the Load_single_PDF_BY_PAGE_TO_index.py loader and Flask app already listed in the question above, checked with:
curl http://127.0.0.1:5003/search/test  # it works!!
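For the missing front-end piece, one possible approach is to ask Elasticsearch for highlighted snippets and reduce each hit to the few values a template needs. The sketch below builds such a request body and flattens a response. The function names and the sample response are made up for illustration; the response shape follows the hits.hits structure that the /search route already returns.

```python
def build_search_body(keyword, field="data"):
    """Build a match query with highlighting for one field."""
    return {
        "query": {"match": {field: keyword}},
        # ask Elasticsearch to return matching fragments of the field
        "highlight": {"fields": {field: {}}},
    }


def summarize_hits(response, field="data"):
    """Reduce a search response to ID, score and snippet per hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        snippets = hit.get("highlight", {}).get(field, [])
        rows.append({
            "id": hit["_id"],
            "score": hit["_score"],
            # fall back to the start of the stored text if there is no highlight
            "snippet": snippets[0] if snippets else hit["_source"][field][:80],
        })
    return rows


# made-up response in the shape returned by es.search()
fake_response = {
    "hits": {"hits": [
        {"_id": "44", "_score": 1.2, "_source": {"data": "a test page"},
         "highlight": {"data": ["a <em>test</em> page"]}},
        {"_id": "45", "_score": 0.7, "_source": {"data": "another test"}},
    ]}
}
body = build_search_body("test")
rows = summarize_hits(fake_response)
print(rows[0]["snippet"])
```

In the Flask app, the rows could then be handed to a template, for example render_template("index.html", rows=rows), assuming index.html is written to loop over them.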
Regarding "python - How do I make PDFs searchable in a Flask search application?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60031112/