mapreduce - 良好的 MapReduce 示例-6ren

mapreduce - 良好的 MapReduce 示例

转载作者：行者123 更新时间：2023-12-03 04:10:52

25

4

除了“如何使用 MapReduce 计算长文本中的单词数”任务之外，我想不出任何好的例子。我发现这并不是让其他人了解这个工具有多强大的最佳示例。

我不是在寻找代码片段，实际上只是“文本”示例。

最佳答案

MapReduce是一个为高效处理海量数据而开发的框架。例如，如果数据集中有 100 万条记录，并且它存储在关系表示中 - 派生值并对这些记录执行任何类型的转换都非常昂贵。

例如，在 SQL 中，给定出生日期，要找出一百万条记录中有多少人年龄 > 30 岁，这需要一段时间，而且当查询复杂性增加时，这只会按数量级增加。MapReduce提供了基于集群的实现，其中数据以分布式方式处理

这里有一篇维基百科文章解释了什么 map-reduce is all about

另一个很好的例子是通过 MapReduce 寻找 friend 可以是理解这个概念的有力例子，并且一个很好用的用例。

个人发现this link对于理解这个概念非常有用

复制博客中提供的说明(以防链接失效)

Finding Friends

MapReduce is a framework originally developed at Google that allows for easy large scale distributed computing across a number of domains. Apache Hadoop is an open source implementation.

I'll gloss over the details, but it comes down to defining two functions: a map function and a reduce function. The map function takes a value and outputs key:value pairs. For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value then map(steve) would return 5:steve and map(savannah) would return 8:savannah. You may have noticed that the map function is stateless and only requires the input value to compute it's output value. This allows us to run the map function against values in parallel and provides a huge advantage. Before we get to the reduce function, the mapreduce framework groups all of the values together by key, so if the map functions output the following key:value pairs:
3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research
They get grouped as:
3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]
Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values. In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:
3 : 3
4 : 3
5 : 2
8 : 2
The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in our corpus, etc...

The most common example of mapreduce is for counting the number of times words occur in a corpus. Suppose you had a copy of the internet (I've been fortunate enough to have worked in such a situation), and you wanted a list of every word on the internet as well as how many times it occurred.

The way you would approach this would be to tokenize the documents you have (break it into words), and pass each word to a mapper. The mapper would then spit the word back out along with a value of 1. The grouping phase will take all the keys (in this case words), and make a list of 1's. The reduce phase then takes a key (the word) and a list (a list of 1's for every time the key appeared on the internet), and sums the list. The reducer then outputs the word, along with it's count. When all is said and done you'll have a list of every word on the internet, along with how many times it appeared.

Easy, right? If you've ever read about mapreduce, the above scenario isn't anything new... it's the "Hello, World" of mapreduce. So here is a real world use case (Facebook may or may not actually do the following, it's just an example):

Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of millions of requests everyday. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature. When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently so it'd be wasteful to recalculate it every time you visited the profile (sure you could use a decent caching strategy, but then I wouldn't be able to continue writing about mapreduce for this problem). We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk, it's cheap.

Assume the friends are stored as Person->[List of Friends], our friends list is then:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
Each line will be an argument to a mapper. For every friend in the list of friends, the mapper will output a key-value pair. The key will be a friend along with the person. The value will be the list of friends. The key will be sorted so that the friends are in order, causing all pairs of friends to go to the same reducer. This is hard to explain with text, so let's just do it and see if you can see the pattern. After all the mappers are done running, you'll have a list like this:
For map(A -> B C D) :

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)

(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
For map(C -> A B D E) :

(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E
For map(D -> A B C E) :

(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E
And finally for map(E -> B C D):

(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)
Each line will be passed as an argument to a reducer. The reduce function will simply intersect the lists of values and output the same key with the result of the intersection. For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that friends A and B have C and D as common friends.

The result after reduction is:
(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)
Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).

关于mapreduce - 良好的 MapReduce 示例，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/12375761/

25

4

0

文章推荐： ruby-on-rails - helper 和 helper_method 的作用是什么？

文章推荐： .net - Response.End() 被认为有害吗？

文章推荐： webpack - 如何在 webpack.config.js 中使用 ES6？

c# - 良好、安全的加密
嗨，我正在考虑开发一种文件传输程序，想知道我是否想要尽可能好的加密，我应该使用什么？我会用 C# 开发它，所以我可以访问 .net 库 :P在我的 usb 上有一个证书来访问服务器是没有问题的，如果
java - 复杂性良好、中等和最差
我创建的这个计算两个数组的交集是线性的方法的复杂度(在良好、平均、最差的情况下)？ O(n) public void getInt(int[] a,int[] b){ int i=0; int
webrtc - 是否有将 WebRTC 音频质量评定为优秀、良好、一般或差的公式？
我已经能够使用 RTCPeerConnection.getStats() API 获得 WebRTC 音频调用的各种统计信息(抖动、RTT、丢包等)。我需要将整体通话质量评为优秀、良好、一般或差。
c++ - 通过条目维护/更新/访问文件的“良好”编程形式
基本问题: 如果我正在讲述/修改数据，我应该通过索引硬编码索引访问文件的元素，即 targetFile.getElement(5);通过硬编码标识符(内部翻译成索引)，即 target.getElem
c - 我如何像 "top"命令那样获取每个 CPU 的统计信息(系统、空闲、良好...)？
在 Linux 上，我想知道要调用什么“C”API 来获取每个 CPU 的统计信息。我知道并且可以从我的应用程序中读取 /proc/loadavg，但这是系统范围的负载平均值，而不是每个 CPU 的
javascript - 使用 DELETE 获取 api 问题 - 即使 cors 良好，也会更改 OPTIONS
在客户端浏览器中使用 fetch api，GET 或 POST 没有问题，但 fetch 和 DELETE 有问题。它似乎将 DELETE 请求方法更改为 OPTIONS。大多数研究表明是一个cor

首页

博学

6Ren·AI

商城

mapreduce - 良好的 MapReduce 示例

Finding Friends