
python - How to combine text and numerical features in a machine learning training set?

Reposted. Author: 行者123. Updated: 2023-11-30 08:59:07

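I am trying to build a supervised machine learning model that predicts the probability that a given URL is benign or malicious, based on both numerical and text features.

Numerical features -

  • URL length
  • Primary domain length
  • Number of dots
  • Whether it contains an IP, etc.

Text features -

  • Registrar name
  • Registrant name
  • Country
  • List of words in the URL, etc.

I have a data frame with the required features, but I don't know how to handle the text data. Can someone guide me?

Below is a sample of the data frame I have -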

   url_length  length_domain  is_ip  registrar  registrants         tokens_in_url
0          50             18      0         a1           z1  [abc, def, ghi, jkl]
1          98             23      0         a2           z2       [mno, pqr, stu]
2         146              8      0         a3           z3             [vwx, yz]

Thanks in advance.

Best answer

Consider the following demo:

Source DF:

In [113]: df
Out[113]:
    registrar   registrant   country
0  registrar1  registrant1  country1
1  registrar8  registrant2  country2
2  registrar1  registrant3  country1
3  registrar5  registrant4  country3
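The session above starts from an existing df. To reproduce it, the frame can be rebuilt along these lines. This is only a sketch, with the values copied from the printed output:

```python
import pandas as pd

# Rebuild the four-row demo frame; values are copied from the output shown above.
df = pd.DataFrame({
    "registrar": ["registrar1", "registrar8", "registrar1", "registrar5"],
    "registrant": ["registrant1", "registrant2", "registrant3", "registrant4"],
    "country": ["country1", "country2", "country1", "country3"],
})
print(df)
```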

Encoding:

In [114]: from sklearn.preprocessing import LabelEncoder

In [115]: str_cols = df.columns[df.dtypes.eq('object')]

In [116]: clfs = {c:LabelEncoder() for c in str_cols}

In [117]: for col, clf in clfs.items():
     ...:     df[col] = clf.fit_transform(df[col])
     ...:

In [118]: df
Out[118]:
   registrar  registrant  country
0          0           0        0
1          2           1        1
2          0           2        0
3          1           3        2

Inverse transform:

In [119]: clfs['country'].inverse_transform(df['country'])
Out[119]: array(['country1', 'country2', 'country1', 'country3'], dtype=object)
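One caveat with LabelEncoder is that transform raises an error on values it did not see during fit, which can happen with new registrars or countries at prediction time. A possible alternative, not part of the original answer, is OrdinalEncoder, which encodes several columns at once and can map unseen values to a sentinel. The sketch below assumes scikit-learn 0.24 or later; the new_rows frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "registrar": ["registrar1", "registrar8", "registrar1", "registrar5"],
    "registrant": ["registrant1", "registrant2", "registrant3", "registrant4"],
    "country": ["country1", "country2", "country1", "country3"],
})
cols = ["registrar", "registrant", "country"]

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded = enc.fit_transform(df[cols])

# Hypothetical row containing a registrar that was not in the training data.
new_rows = pd.DataFrame({
    "registrar": ["registrar9"],
    "registrant": ["registrant1"],
    "country": ["country2"],
})
unseen = enc.transform(new_rows)

print(encoded)
print(unseen)
```

The integer codes match the LabelEncoder output above, because both encoders sort the categories before numbering them. enc.inverse_transform recovers the original strings for known categories.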

UPDATE:

Is it possible to use TF-IDF (List of words in URL) with your given answer?

In [86]: from sklearn.feature_extraction.text import TfidfVectorizer

In [87]: vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')

In [88]: X = vect.fit_transform(df['tokens_in_url'].str.join(' '))

In [89]: X
Out[89]:
<3x9 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>

In [90]: X.toarray()
Out[90]:
array([[ 0.5       ,  0.5       ,  0.5       ,  0.5       ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.57735027,  0.57735027,  0.57735027,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.70710678,  0.70710678]])


In [91]: vect.get_feature_names_out()
Out[91]:
array(['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr', 'stu', 'vwx', 'yz'],
      dtype=object)

In [92]: tok = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=vect.get_feature_names_out())

In [93]: tok
Out[93]:
   abc  def  ghi  jkl      mno      pqr      stu       vwx        yz
0  0.5  0.5  0.5  0.5  0.00000  0.00000  0.00000  0.000000  0.000000
1  0.0  0.0  0.0  0.0  0.57735  0.57735  0.57735  0.000000  0.000000
2  0.0  0.0  0.0  0.0  0.00000  0.00000  0.00000  0.707107  0.707107
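The answer shows how to encode each kind of feature, but not how to feed them to a model together. One way to do it, sketched here under several assumptions, is to stack the numeric columns, the integer-coded categorical columns and the TF-IDF matrix side by side with scipy.sparse.hstack. The labels y, the choice of LogisticRegression and the use of OrdinalEncoder are illustrative only and do not come from the original post.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

# Sample frame from the question.
df = pd.DataFrame({
    "url_length": [50, 98, 146],
    "length_domain": [18, 23, 8],
    "is_ip": [0, 0, 0],
    "registrar": ["a1", "a2", "a3"],
    "registrants": ["z1", "z2", "z3"],
    "tokens_in_url": [["abc", "def", "ghi", "jkl"], ["mno", "pqr", "stu"], ["vwx", "yz"]],
})
# Hypothetical labels (1 = malicious, 0 = benign); the question does not show them.
y = np.array([0, 1, 0])

# Numeric columns as a sparse block.
numeric = csr_matrix(df[["url_length", "length_domain", "is_ip"]].to_numpy(dtype=float))

# Categorical columns as integer codes, as in the answer's label-encoding step.
categorical = csr_matrix(OrdinalEncoder().fit_transform(df[["registrar", "registrants"]]))

# Token lists joined into strings, then vectorized with TF-IDF.
vect = TfidfVectorizer(sublinear_tf=True, analyzer="word")
tokens = vect.fit_transform(df["tokens_in_url"].str.join(" "))

# One feature matrix: 3 numeric + 2 categorical + 9 token columns.
X = hstack([numeric, categorical, tokens]).tocsr()

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # estimated probability of the "malicious" class
print(X.shape, proba)
```

Integer codes impose an arbitrary order on the categories, so for a linear model such as this one a one-hot encoding (for example OneHotEncoder) may be a better fit, while tree-based models tend to tolerate integer codes more readily. Scaling the numeric columns is also worth considering before fitting.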

Regarding python - How to combine text and numerical features in a machine learning training set?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47511376/
