python - How can I condense text-cleaning steps into a single Python function?

Reposted by: 行者123 Updated: 2023-12-01 07:20:22

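New programmer here. I'd really appreciate any help this knowledgeable community is willing to offer.

I have a column of 140,000 text strings (company names) in a pandas DataFrame. I want to strip all whitespace in and around each string, remove all punctuation, replace specific substrings, and convert everything to lowercase. Then I want to take the first 10 characters of each string (slice 0:10) and store them in a new DataFrame column.

Here is a reproducible example.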

import string
import pandas as pd

data = ["West Georgia Co", 
        "W.B. Carell Clockmakers", 
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech@NYCityScape"]

df = pd.DataFrame(data, columns = ['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]

print(df)

                     co_name        co_name_transform
0            West Georgia Co               westgeorgi
1    W.B. Carell Clockmakers               wbcarellcl
2     Spine & Orthopedic LLC               spineortho
3  LRHS Saint Jose's Grocery               lrhsstjose
4       Optitech@NYCityScape               optitechny

How can I put all of these steps into a single function, something like this pseudocode?

def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]

Thanks

Best answer

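You don't need a function for this. Try the one-liner below. It passes regex=True explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]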

The final output will be:

                     co_name co_name_transform
0            West Georgia Co        westgeorgi
1    W.B. Carell Clockmakers        wbcarellcl
2     Spine & Orthopedic LLC        spineortho
3  LRHS Saint Jose's Grocery        lrhsstjose
4       Optitech@NYCityScape        optitechny
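
If a reusable function is still preferred, as the question asked, the same steps can be wrapped up roughly as follows. This is only a sketch: the name clean_name, the DEFAULT_SUBSTITUTIONS table, and the length parameter are hypothetical and do not come from the original answer. It also assumes the names are ASCII-only, since the regular expression drops any character that is not an ASCII letter or digit, including hyphens and accented letters.

```python
import re

# Hypothetical substitution table; add more pairs as needed.
DEFAULT_SUBSTITUTIONS = {"Saint": "st"}


def clean_name(text, substitutions=None, length=10):
    """Return a normalized, truncated key for one company name."""
    if substitutions is None:
        substitutions = DEFAULT_SUBSTITUTIONS
    # Replace specific substrings first, while the original casing is intact.
    for old, new in substitutions.items():
        text = text.replace(old, new)
    # Drop everything that is not an ASCII letter or digit; this removes
    # whitespace and punctuation in a single pass.
    text = re.sub(r"[^A-Za-z0-9]+", "", text)
    # Lowercase, then keep only the first `length` characters.
    return text.lower()[:length]


names = [
    "West Georgia Co",
    "W.B. Carell Clockmakers",
    "Spine & Orthopedic LLC",
    "LRHS Saint Jose's Grocery",
    "Optitech@NYCityScape",
]
cleaned = [clean_name(name) for name in names]
print(cleaned)
```

With the DataFrame from the question, this could be applied as df['co_name_transform'] = df['co_name'].apply(clean_name). For 140,000 rows the vectorized one-liner is likely to be faster, but a function keeps the substitution list in one place.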

For more on "python - How can I condense text-cleaning steps into a single Python function?", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/57730415/