python - Separating joined words at capital letters

Reposted. Author: 行者123. Updated: 2023-12-01 04:04:45

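Using Python, I have to write a script that essentially "cleans" a text file of data. So far I have removed all unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I have reached the point where I need to separate words that are joined together. Here is a snippet of the first 15 lines of the text file: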

AccessibleComputing  Computer accessibility
AfghanistanHistory History of Afghanistan
AfghanistanGeography Geography of Afghanistan
AfghanistanPeople Demographics of Afghanistan
AfghanistanCommunications Communications in Afghanistan
AfghanistanMilitary Afghan Armed Forces
AfghanistanTransportations Transport in Afghanistan
AfghanistanTransnationalIssues Foreign relations of Afghanistan
AssistiveTechnology Assistive technology
AmoeboidTaxa Amoeba
AsWeMayThink As We May Think
AlbaniaHistory History of Albania
AlbaniaPeople Demographics of Albania
AlbaniaEconomy Economy of Albania
AlbaniaGovernment Politics of Albania

What I want to do is separate the joined words wherever a capital letter occurs. For example, I want the first line to look like this:

Accessible Computing  Computer accessibility

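The script must read its input from a file and write the result to an output file. This is what I have so far, but it doesn't work at all! (I'm not sure I'm even on the right track.)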

import re

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r')
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w')

for line in input_file:
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'):
        newline = line.

    output_file.write(newline)

input_file.close()
output_file.close()

Best answer

I suggest splitting the words with the following regular expression:

import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capitalized word ([A-Z][a-z]+), or else any run of non-whitespace characters.
p = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
    for line in f_in:
        matches = p.findall(line)
        # Text mode translates '\n' to the platform line ending, so os.linesep is not needed.
        f_out.write(' '.join(matches) + '\n')
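
To see what the pattern does to a single line, here is a minimal sketch that wraps the same regular expression in a helper. The names `WORD` and `split_joined` are made up for illustration and do not appear in the original answer.

```python
import re

# Same pattern as in the answer: first try a capital letter followed by
# lowercase letters; if that fails, take any run of non-whitespace characters.
WORD = re.compile(r'[A-Z][a-z]+|\S+')

def split_joined(line):
    """Return the tokens found in `line`, joined by single spaces."""
    return ' '.join(WORD.findall(line))

print(WORD.findall("AsWeMayThink As We May Think"))
print(split_joined("AccessibleComputing  Computer accessibility"))
```

Because the tokens are re-joined with one space, runs of whitespace in the input collapse to a single space. A token that does not start with a capital followed by lowercase letters, such as `USAToday`, falls through to `\S+` and is left unsplit.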

Assuming input.txt contains the text you pasted in your post, the script writes the following to output.txt:

Accessible Computing Computer accessibility
Afghanistan History History of Afghanistan
Afghanistan Geography Geography of Afghanistan
Afghanistan People Demographics of Afghanistan
Afghanistan Communications Communications in Afghanistan
Afghanistan Military Afghan Armed Forces
Afghanistan Transportations Transport in Afghanistan
Afghanistan Transnational Issues Foreign relations of Afghanistan
Assistive Technology Assistive technology
Amoeboid Taxa Amoeba
As We May Think As We May Think
Albania History History of Albania
Albania People Demographics of Albania
Albania Economy Economy of Albania
Albania Government Politics of Albania
...
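
Note that the first output line has a single space before "Computer", whereas the desired line in the question keeps the original double space. If preserving the existing whitespace matters, one alternative (not part of the original answer) is to insert a space at every lowercase-to-uppercase boundary instead of tokenizing. A minimal sketch, with the hypothetical names `BOUNDARY` and `split_at_capitals`:

```python
import re

# Zero-width position that has a lowercase letter before it and an
# uppercase letter after it; nothing is consumed, so no text is lost.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

def split_at_capitals(line):
    """Insert a space wherever a lowercase letter is directly followed by a capital."""
    return BOUNDARY.sub(' ', line)

print(split_at_capitals("AccessibleComputing  Computer accessibility"))
print(split_at_capitals("AsWeMayThink As We May Think"))
```

This variant leaves everything else on the line untouched, including the trailing newline, so each line can be written back to the output file as-is.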

Regarding python - separating joined words at capital letters, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35820759/
