
python - Creating TF_IDF vectors from a Spark DataFrame with text columns


I have a Spark DataFrame with three columns: ['id', 'title', 'desc']. Here 'title' and 'desc' are both text, and 'id' is the document ID. A few sample rows look like this:

[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
Row(id=-761323061, title=u'Teen sexting is prompting an overhaul in child pornography laws', desc=u"Rampant teen sexting has left politicians and law enforcement authorities around the country struggling to find some kind of legal middle ground between prosecuting students for child porn and letting them off the hook.Most states consider sexually explicit images of minors to be child pornography, meaning even teenagers who share nude selfies among themselves can, in theory at least, be hit with felony charges that can carry heavy prison sentences and require lifetime registration as a sex offender.Many authorities consider that overkill, however, and at least 20 states have adopted sexting laws with less-serious penalties, mostly within the past five years. Eleven states have made sexting between teens a misdemeanor; in some of those places, prosecutors can require youngsters to take courses on the dangers of social media instead of charging them with a crime.Hawaii passed a 2012 law saying youths can escape conviction if they take steps to delete explicit photos. Arkansas adopted a 2013 law sentencing first-time youth sexters to eight hours of community service. New Mexico last month removed criminal penalties altogether in such cases.At least 12 other states are considering sexting laws this year, many to create new a category of crime that would apply to young people.But one such proposal in Colorado has revealed deep divisions about how to treat the phenomenon. Though prosecutors and researchers agree that felony sex crimes shouldn't apply to a pair of 16-year-olds sending each other selfies, they disagree about whether sexting should be a crime at all.Colorado's bill was prompted by a scandal last year at a Canon City high school where more than 100 students were found with explicit images of other teens. The news sent shockwaves through the city of 16,000. Dozens of students were suspended, and the football team forfeited the final game of the season.Fremont County prosecutors ultimately decided against filing any criminal charges, saying Colorado law doesn't properly distinguish between adult sexual predators and misbehaving teenagers.In a similar case last year out Fayetteville, North Carolina, two dating teens who exchanged nude selfies at age 16 were charged as adults with a felony \u2014 sexual exploitation of a minor. After an uproar, the cha"),

I want to convert this 'desc' column (the actual document text) into TF-IDF vectors in Spark.

Here is what I did:

def tfIdf(df):
    """This function takes the text data and converts it into a term frequency-inverse document frequency (TF-IDF) vector.

    parameter: df - dataframe with a 'desc' text column
    returns: dataframe with tf-idf vectors
    """

    # Importing the feature transformation classes for doing TF-IDF
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

    # Carrying out the tokenization of the text documents (splitting into words)
    tokenizer = Tokenizer(inputCol="desc", outputCol="tokenised_text")
    tokensDf = tokenizer.transform(df)

    # Carrying out the stop-words removal for TF-IDF
    stopwordsremover = StopWordsRemover(inputCol='tokenised_text', outputCol='words')
    swremovedDf = stopwordsremover.transform(tokensDf)

    # Creating the term-frequency vector for each document
    cv = CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=3, minDF=2.0)
    cvModel = cv.fit(swremovedDf)
    tfDf = cvModel.transform(swremovedDf)

    # Carrying out inverse document frequency on the TF data
    idf = IDF(inputCol="tf_features", outputCol="tf-idf_features")
    idfModel = idf.fit(tfDf)
    tfidfDf = idfModel.transform(tfDf)

    tfidfDf.cache().count()

    return tfidfDf


tfidfDf = tfIdf(sdf_cleaned)

I first tokenized each text document in the 'desc' column using the Tokenizer class, then removed stop words with the StopWordsRemover class. After that, I converted the result into a bag-of-words model and obtained term frequencies with the CountVectorizer class.

Finally, I applied IDF weights to the term-frequency vectors using the IDF class.
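(For reference, assuming the standard behavior documented for Spark's IDF estimator: the IDF weight of a term t is computed with the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the total number of documents and df(t) is the number of documents containing t; the tf-idf value is then the term frequency multiplied by this weight, so terms that occur in nearly every document get weights close to zero.)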

The final result in the returned DataFrame is as follows (showing only the first row):

[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. 
Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before.", tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,\xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequently\xa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.\xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], words=[u'hate', u'dealing', u'bank', u'tellers', u'customer', u'service', u'representatives,', u'royal', u'bank', u'scotland', 
u'solution', u'you.if', u'program', u'successful,', u'big', u'step', u'forward', u'road', u'automated', u'customer', u'service', u'use', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'luvo', u'does', u'operate', u'third-party', u'app', u'facebook', u'messenger,', u'wechat,', u'kik,', u'currently', u'trying', u'create', u'bots', u'assist', u'customer', u'service', u'respective', u'platforms.luvo', u'available', u'web', u'smartphones.', u'use', u'machine', u'learning', u'learn', u'mistakes,', u'ultimately', u'help', u'response', u'accuracy.down', u'road,', u'luvo', u'supplement', u'human', u'staff.', u'currently', u'answer', u'20', u'set', u'questions', u'number', u'grows,', u'allow', u'human', u'employees', u'complicated', u'issues.', u'problem', u"luvo's", u'comprehension,', u'refer', u'customer', u'bank', u'employee;', u'however,\xa0a', u'user', u'choose', u'speak', u'human', u'instead', u'luvo', u'anyway.ai', u'luvo,', u'successful,', u'help', u'businesses', u'efficient', u'increase', u'productivity,', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'consequently\xa0save', u'money', u'manpower.and', u'trend', u'starting.', u'google,', u'microsoft,', u'ibm', u'investing', u'significantly', u'ai', u'research.', u'furthermore,', u'global', u'ai', u'market', u'estimated', u'grow', u'approximately', u'$420', u'million', u'2014', u'$5.05', u'billion', u'2020,', u'according', u'forecast', u'research', u'markets.\xa0the', u'ai', u'just', u'way', u'digital', u'age', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'increasingly', u'moving', u'digital', u'banking,', u'result,', u"they're", u'walking', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'before.'], tf_features=SparseVector(3, {}), tf-idf_features=SparseVector(3, {})),

So the first three columns are the original ['id', 'title', 'desc'], and a new column is added for each transformation applied. As you can see from their output columns, the Tokenizer and StopWordsRemover steps work fine.

But I am not sure why the tf_features column from CountVectorizer and the tf-idf_features column from the IDF class are empty, with no vectors of TF-IDF values.

Also, in Spark we pass one document per cell of the column, so how does Spark find the vocabulary for the TF vector? The vocabulary should be the unique words appearing across the whole corpus (all documents), not just one document, while the TF count is how often a term occurs in each individual document. So does Spark accumulate all the cell values in 'desc', derive the unique vocabulary from them, and then count term frequencies per document in each cell?

Please advise.

Edit 1:

I changed the vocabulary size, since 3 obviously makes no sense, and now I do get tf_features, as shown below:

tf_features=SparseVector(2000, {6: 1.0, 8: 1.0, 14: 1.0, 17: 2.0, 18: 1.0, 20: 1.0, 32: 1.0, 35: 2.0, 42: 1.0, 52: 1.0, 53: 3.0, 54: 1.0, 62: 1.0, 65: 1.0, 68: 1.0, 79: 1.0, 93: 4.0, 95: 2.0, 98: 1.0, 118: 1.0, 132: 1.0, 133: 1.0, 149: 1.0, 157: 1.0, 167: 5.0, 202: 3.0, 215: 1.0, 219: 1.0, 224: 1.0, 232: 1.0, 265: 3.0, 302: 1.0, 303: 1.0, 324: 2.0, 330: 1.0, 355: 1.0, 383: 1.0, 395: 1.0, 405: 1.0, 432: 1.0, 456: 1.0, 466: 1.0, 472: 1.0, 501: 1.0, 525: 1.0, 537: 1.0, 548: 1.0, 620: 1.0, 630: 1.0, 639: 1.0, 657: 1.0, 662: 1.0, 674: 1.0, 720: 1.0, 734: 1.0, 975: 1.0, 1003: 1.0, 1057: 1.0, 1148: 1.0, 1187: 1.0, 1255: 1.0, 1273: 1.0, 1294: 1.0, 1386: 1.0, 1400: 1.0, 1463: 1.0, 1477: 1.0, 1491: 1.0, 1724: 1.0, 1898: 1.0, 1937: 3.0, 1954: 1.0})

I am trying to understand this output. The first value is the number of features (terms), which is also the vocabSize I passed in. But what is the dictionary that follows? Are its keys the indices of the terms (words) and its values the term frequencies? If so, how do I map these word indices back to the original words? For example, if I want to know which words have the counts shown above, how do I map the dictionary keys back to strings?

Secondly, this output does not look like a vector (it looks like a dictionary). Is it consumable by ML algorithms? I need a feature vector rather than a dictionary. How does this work?

Best Answer

In the example you've shown, tf_features and tf-idf_features are not empty. They are valid SparseVectors in which all features are equal to 0.0 ([0.0, 0.0, 0.0]).

I think the configuration of CountVectorizer makes no sense here. With vocabSize equal to 3, you consider only the three most common terms ("CountVectorizer will build a vocabulary that only considers the top vocabSize terms ordered by term frequency across the corpus."). If none of these terms occurs in a particular text, you get exactly the output you observed:

from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (["a", ], ),  # 'a' occurs only once
    (["b", "c"], ), (["c", "c", "d"], ), (["b", "d"], )
]).toDF(["tokens"])

vectorizer = CountVectorizer(
    inputCol="tokens", outputCol="features", vocabSize=3
).fit(df)

# With vocabulary size 3, 'a' is not in the vocabulary (the 3 most common words)
vectorizer.vocabulary
# ['c', 'd', 'b']

vectorizer.transform(df).take(3)
# [Row(tokens=['a'], features=SparseVector(3, {})),
#  Row(tokens=['b', 'c'], features=SparseVector(3, {0: 1.0, 2: 1.0})),
#  Row(tokens=['c', 'c', 'd'], features=SparseVector(3, {0: 2.0, 1: 1.0}))]

As you can see, the first document does not contain any token from the vocabulary, so all of its features are equal to 0:

SparseVector(3, {}).toArray()
# array([ 0.,  0.,  0.])

For comparison, take the third document, which contains two occurrences of c and one d:

v = SparseVector(3, {0: 2.0, 1: 1.0})

{vectorizer.vocabulary[i]: cnt for (i, cnt) in zip(v.indices, v.values)}
# {'c': 2.0, 'd': 1.0}

A more detailed explanation of the CountVectorizer behavior can be found in Handle unseen categorical string Spark CountVectorizer.

Depending on the application, vocabSize should be at least in the hundreds, and values of hundreds of thousands or more are not uncommon, especially if you consider applying some dimensionality reduction technique afterwards.
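Regarding your second Edit 1 question: yes, a SparseVector is a regular pyspark.ml Vector; the dictionary-like display is just its string representation, and Spark ML estimators consume such a vector column directly. Below is a minimal sketch of a more realistic configuration, not your exact method: it assumes a hypothetical 'label' column on sdf_cleaned and an illustrative vocabSize; adjust both to your data.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression

# Hypothetical pipeline: a much larger vocabulary and a downstream estimator
# that consumes the sparse tf-idf vectors directly.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="desc", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="tf", vocabSize=50000, minDF=2.0),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),  # 'label' is assumed
])
model = pipeline.fit(sdf_cleaned)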

Regarding "python - Creating TF_IDF vectors from a Spark DataFrame with text columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41352267/
