
python - Trouble separating text extracted from PowerPoint files

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:58:04

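I have a function that extracts text from PowerPoint files. However, the output is one big list containing all the text from all of the PowerPoint files. How can I separate the text so that I end up with two lists of text, one for each of the two PowerPoint files I am extracting from?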

text_runs = []

def pptx_collect(x):
    for file in pptx_files:
        prs = Presentation(file)
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text_runs.append(run.text)
    return(text_runs)

def Powerpoint(pptx_files):
    for name in pptx_files:
        #print(name)
        IP_list = (pptx_collect(name))
        for item in IP_list:
            #print(item)
            keyword = re.findall(inp,item)
            keyword1 = re.findall(inp1,item)
            keyword2 = re.findall(word_search,item)
            #print(ip_test)
        file_dict['keyword'].append(keyword+keyword1+keyword2)
        file_dict['name'].append(name.name[0:])
        file_dict['created'].append(time.ctime(name.stat().st_ctime))
        file_dict['modified'].append(time.ctime(name.stat().st_mtime))
        file_dict['path'].append(name)
        file_dict["content"].append(IP_list) #<--- This is where the problem is.
        #print(file_dict)
    return(file_dict)

Powerpoint(pptx_files)

The output I get is:

['Billy’s ', 'pii', 'Just a test', '04/15/1991', '04.15.1991', '234-23-6456-billys ', 'SSN', 'Address: 58 bonnie ', 'rd', ', Boston, mass 07037', 'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)', 'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)', 'FOUO Test', 'Secret', 'This is a test to check ', 'for keywords']

What I want to get is:

['Billy’s ', 'pii', 'Just a test', '04/15/1991', '04.15.1991', '234-23-6456-billys ', 'SSN', 'Address: 58 bonnie ', 'rd', ', Boston, mass 07037', 'Text from second 2 '] 

['Text from second ', 'powerpoint', ' ', '(second page)', 'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)', 'FOUO Test', 'Secret', 'This is a test to check ', 'for keywords']

Best answer

The pptx_collect() function iterates over all of the files every time it is called, ignoring its argument. Try this:

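from pptx import Presentation

def pptx_collect(x):
    # Start a fresh list on every call so each file gets its own text;
    # appending to a shared module-level list would merge the files again.
    text_runs = []
    prs = Presentation(x)
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs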
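
To end up with one list per file, it may also help to key the results by file path instead of appending everything to one accumulator. The following is only a sketch under some assumptions: the names `collect_runs` and `collect_per_file` and the `collector` parameter are hypothetical and not part of the original question or answer. python-pptx is imported lazily so that the grouping logic can be tried without the library installed.

```python
def collect_runs(path):
    """Return the text runs of a single .pptx file as a new list."""
    # Imported here so the rest of the module works without python-pptx.
    from pptx import Presentation

    runs = []  # local list: nothing is shared between files
    prs = Presentation(str(path))
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    runs.append(run.text)
    return runs


def collect_per_file(paths, collector=collect_runs):
    """Map each path (as a string) to its own list of text runs.

    `collector` is a hypothetical hook; it defaults to collect_runs and
    can be replaced with a stub for testing.
    """
    return {str(path): collector(path) for path in paths}
```

Calling `collect_per_file(pptx_files)` would then return a dictionary with one entry per presentation, so the text of the two files stays in two separate lists.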

Regarding python - Trouble separating text extracted from PowerPoint files, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55882853/
