
python-3.x - Two Python loops look like they should do the same thing, but produce different results?

Reposted. Author: 行者123. Updated: 2023-11-30 08:39:33

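Yesterday I was trying to complete Udacity's Lesson 11, on text vectorization. I went through the code and everything seemed to work fine: I take some emails, open them, remove some signature words, and append the stemmed words of each email to a list.

Here is loop 1: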

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email

            email_stemmed = parseOutText(email)

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]

            email_stemmed.replace("sara","")
            email_stemmed.replace("shackleton","")
            email_stemmed.replace("chris","")
            email_stemmed.replace("germani","")

            ### append the text to word_data

            word_data.append(email_stemmed.replace('\n', ' ').strip())

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == "sara":
                from_data.append(0)
            elif from_person == "chris":
                from_data.append(1)

            email.close()

Here is loop 2:

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            stemmed_email = parseOutText(email)

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            signature_words = ["sara", "shackleton", "chris", "germani"]
            for each_word in signature_words:
                stemmed_email = stemmed_email.replace(each_word, '') #careful here, dont use another variable, I did and broke my head to solve it

            ### append the text to word_data
            word_data.append(stemmed_email)

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == "sara":
                from_data.append(0)
            else: # its chris
                from_data.append(1)

            email.close()

The next part of the code works as expected:

print("emails processed")
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )
pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )


print("Answer to Lesson 11 quiz 19: ")
print(word_data[152])


### in Part 4, do TfIdf vectorization here

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import stop_words
print("SKLearn has this many Stop Words: ")
print(len(stop_words.ENGLISH_STOP_WORDS))

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
vectorizer.fit_transform(word_data)

feature_names = vectorizer.get_feature_names()

print('Number of different words: ')
print(len(feature_names))

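But when I count the total number of words using loop 1, I get the wrong result. When I do it with loop 2, I get the correct result.

I have been staring at this code for too long and I cannot spot the difference. What am I doing wrong in loop 1?

For the record, the wrong answer I keep getting is 38825. The correct answer should be 38757.

Many thanks for your help, kind stranger!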

Best answer

These lines do nothing:

email_stemmed.replace("sara","")
email_stemmed.replace("shackleton","")
email_stemmed.replace("chris","")
email_stemmed.replace("germani","")

replace returns a new string and does not modify email_stemmed. Instead, you should assign the return value back to email_stemmed:

email_stemmed = email_stemmed.replace("sara", "")

And so on for the other words.
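
To see this in isolation, here is a minimal sketch using a made-up sample string (the text and the names unchanged and cleaned are illustrative, not from the lesson data). Python strings are immutable, so str.replace can only hand back a new string:

```python
# Hypothetical sample text standing in for one stemmed email.
email_stemmed = "hi sara pleas call chris germani"

# Calling replace without using its return value leaves the string untouched.
email_stemmed.replace("sara", "")
unchanged = email_stemmed

# Assigning the return value is what actually removes the word.
cleaned = email_stemmed.replace("sara", "")

print(unchanged)  # hi sara pleas call chris germani
print(cleaned)    # hi  pleas call chris germani
```

Note that removing a word this way leaves the surrounding whitespace behind, which is harmless here because the vectorizer splits on it anyway.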

Loop 2 actually assigns the return value inside the for loop:

for each_word in signature_words:
    stemmed_email = stemmed_email.replace(each_word, '')

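The snippets above are not equivalent: at the end of the first one, email_stemmed is completely unchanged because replace is used incorrectly, whereas at the end of the second, stemmed_email has actually had each of the words removed.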
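
A rough way to check the difference on a toy input is sketched below. The helpers remove_words_buggy, remove_words and vocabulary, as well as the sample emails, are hypothetical and not part of the Udacity starter code. The leftover signature words enlarge the vocabulary, which is consistent with loop 1 reporting more distinct words than loop 2:

```python
def remove_words_buggy(text, words):
    """Mimics loop 1: the result of replace is discarded."""
    for word in words:
        text.replace(word, "")  # return value thrown away
    return text


def remove_words(text, words):
    """Mimics loop 2: the result of replace is assigned back."""
    for word in words:
        text = text.replace(word, "")
    return text


def vocabulary(texts):
    """Collect the distinct whitespace-separated tokens across all texts."""
    return {token for text in texts for token in text.split()}


signature_words = ["sara", "shackleton", "chris", "germani"]

# Made-up stand-ins for stemmed emails.
emails = ["thank sara shackleton", "regard chris germani", "thank for the updat"]

vocab_buggy = vocabulary(remove_words_buggy(e, signature_words) for e in emails)
vocab_fixed = vocabulary(remove_words(e, signature_words) for e in emails)

print(len(vocab_buggy))  # 9
print(len(vocab_fixed))  # 5
```

This counts plain whitespace tokens rather than using TfidfVectorizer, so the numbers only illustrate the direction of the effect, not the exact counts from the quiz.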

Regarding "python-3.x - Two Python loops look like they should do the same thing, but produce different results?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54316993/
