Python: removing punctuation from emails for spam classification

Reposted by: 行者123 | Last updated: 2023-12-04 09:38:14

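I am trying to remove punctuation from a list of words. I am new to Python programming, so any help would be great. The goal is email spam classification. Previously I joined the words back together after checking each one for punctuation, but that gave me single characters rather than whole words. After changing the code to get whole words, I ended up with what is shown below, so now I am trying to remove the punctuation again, because it no longer works the way it did before.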

import os
import string
from collections import Counter
from os import listdir  # return all files and folders in the directory

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# used for importing the lingspam dataset
def importLingspamDataset(dir):
    allEmails = [] # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read()) # appends the read emails to the emails array
        f.close()
    return allEmails, fileNames

def importEnronDataset(dir):
    allEmails = []  # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read())  # appends the read emails to the emails array
        f.close()
        return allEmails, fileNames

# used to remove punctuation from the emails as this is of no use for detecting spam
def removePunctuation(cleanedEmails):
    punc = set(string.punctuation)
    for word, line in enumerate(cleanedEmails):
        words = line.split()
        x = [''.join(c for c in words if c not in string.punctuation)]
        allWords = []
        allWords += x
        return allWords

# used to remove stopwords i.e. words of no use in detecting spam
def removeStopwords(cleanedEmails):
    removeWords = set(stopwords.words('english')) # sets all the stopwords to be removed
    for stopw in removeWords: # for each word in remove words
        if stopw not in removeWords: # if the word is not in the stopwords to be removed
            cleanedEmails.append(stopw) # add this word to the cleaned emails
    return(cleanedEmails)

# function to reduce each word to its root form, for simplicity
def lemmatizeEmails(cleanedEmails):
    lemma = WordNetLemmatizer() # to be used for returning each word to its root form
    lemmaEmails = [lemma.lemmatize(i) for i in cleanedEmails] # lemmatize each word in the cleaned emails
    return lemmaEmails

# function to systematically eliminate the undesired elements within the emails
def cleanAllEmails(cleanedEmails):
    cleanPunc = removePunctuation(cleanedEmails)
    cleanStop = removeStopwords(cleanPunc)
    cleanLemma = lemmatizeEmails(cleanStop)
    return cleanLemma

def createDictionary(email):
    allWords = []
    allWords.extend(email)
    dictionary = Counter(allWords)
    dictionary.most_common(3000)
    word_cloud = WordCloud(width=400, height=400, background_color='white',
              min_font_size=12).generate_from_frequencies(dictionary)
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()
    word_cloud.to_file('test1.png')

def featureExtraction(email):
     emailFiles = []
     emailFiles.extend(email)
     featureMatrix = np.zeros((len(emailFiles), 3000))


def classifyLingspamDataset(email):
    classifications = []
    for name in email:
         classifications.append("spmsg" in name)
    return classifications

# Lingspam dataset
trainingDataLingspam, trainingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/train-mails") # extract the training emails from the dataset
#testingDataLingspam, testingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/test-mails") # extract the testing emails from the dataset

trainingDataLingspamClean = cleanAllEmails(trainingDataLingspam)
#testingDataLingspamClean = cleanAllEmails(testingDataLingspam)

#trainClassifyLingspam = classifyLingspamDataset(trainingDataLingspam)
#testClassifyLingspam = classifyLingspamDataset(testingDataLingspam)

trainDictionary = createDictionary(trainingDataLingspamClean)
#createDictionary(testingDataLingspamClean)

#trainingDataEnron, trainingEnronFilename = importEnronDataset("spam-non-spam-dataset-enron/bigEmailDump/training/")

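Best answer

Based on your question, I assume that you have a list of emails and want to remove the punctuation from each one. This answer is based on the first revision of the code you posted.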

import string


def removePunctuation(emails):

    # I am using a list comprehension here to iterate over the emails.
    # For each iteration, translate the email to remove the punctuation marks.
    # Translate only allows a translation table as an argument.
    # This is why str.maketrans is used to create the translation table.

    cleaned_emails = [email.translate(str.maketrans('', '', string.punctuation))
                      for email in emails]

    return cleaned_emails


if __name__ == '__main__':

    # Assuming cleanedEmails is a list of emails,
    # I renamed the input to emails and
    # used cleaned_emails for the result.

    emails = ["This is a, test!", "This is another#@! \ntest"]
    cleaned_emails = removePunctuation(emails)
    print(cleaned_emails)

input: ["This is a, test!", "This is another#@! \ntest"]
output: ['This is a test', 'This is another \ntest']
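
The answer above strips punctuation from each email as one string. Since the question talks about a list of words, a word-level variant may also be useful. The following is only a minimal sketch, assuming that splitting on whitespace is an acceptable way to tokenize; the name `remove_punctuation_per_word` is hypothetical and does not appear in the original post. It also hints at why the version in the question misbehaved: there, `''.join(...)` runs over whole words rather than characters, so the surviving words are glued together without spaces, and the `return` inside the loop stops after the first email.

```python
import string

# Build the translation table once; it deletes every punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation_per_word(emails):
    """Return one list of punctuation-free words per email (hypothetical helper)."""
    cleaned = []
    for email in emails:
        # Strip punctuation from each whitespace-separated token.
        words = [word.translate(_PUNCT_TABLE) for word in email.split()]
        # Drop tokens that consisted of punctuation only and are now empty.
        cleaned.append([word for word in words if word])
    return cleaned  # returned after the loop, so every email is processed


if __name__ == '__main__':
    emails = ["This is a, test!", "This is another#@! \ntest"]
    print(remove_punctuation_per_word(emails))
```

Keeping one list of words per email, rather than one flat list, preserves the link between each email and its label, which a classifier would presumably need later.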

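Edit:

The problem was resolved after a conversation with the OP. The OP was having trouble with WordCloud, while the solution I provided was working. I guided the OP through getting WordCloud to work, and the OP is now fine-tuning the WordCloud results.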
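
For readers who want to see what feeding WordCloud might look like, here is a hedged sketch of building the word-frequency mapping from already tokenized emails. It is not the code from the conversation with the OP. The function `build_frequencies` and the small `STOPWORDS` set are assumptions made for illustration; the set stands in for `stopwords.words('english')` from NLTK so that the example runs without downloading any corpus. Two details from the question are handled differently here: stopwords are actually filtered out of the words, and the result of `most_common` is kept instead of being discarded.

```python
from collections import Counter

# Tiny stand-in for set(stopwords.words('english')); an assumption for this sketch.
STOPWORDS = {"a", "an", "the", "is", "this", "of", "to", "and"}


def build_frequencies(tokenized_emails, stopwords=STOPWORDS, top_n=3000):
    """Count words across all emails, skipping stopwords, and keep the top_n."""
    counts = Counter(
        word.lower()
        for words in tokenized_emails
        for word in words
        if word.lower() not in stopwords
    )
    # most_common returns a new list of (word, count) pairs; keep the result.
    return dict(counts.most_common(top_n))


if __name__ == '__main__':
    tokens = [["This", "is", "a", "test"], ["This", "is", "another", "test"]]
    frequencies = build_frequencies(tokens)
    print(frequencies)
    # With the wordcloud package installed, this mapping could be passed to
    # WordCloud(...).generate_from_frequencies(frequencies), as in the question.
```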

This question and answer are based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/62448491/
