<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>mapreduce &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/mapreduce/</link>
	<description>Feed of posts on WordPress.com tagged "mapreduce"</description>
	<pubDate>Sun, 27 Dec 2009 14:18:08 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Top five reasons you should adopt Cloud MapReduce]]></title>
<link>http://huanliu.wordpress.com/2009/12/17/top-five-reasons-you-should-adopt-cloud-mapreduce/</link>
<pubDate>Thu, 17 Dec 2009 21:52:20 +0000</pubDate>
<dc:creator>huanliu</dc:creator>
<guid>http://huanliu.wordpress.com/2009/12/17/top-five-reasons-you-should-adopt-cloud-mapreduce/</guid>
<description><![CDATA[There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why would you want to adopt a new implementation like Cloud MapReduce? I list the top five reasons here.</p>
<p><strong>1. No single failure point</strong>.</p>
<p>Almost all other MapReduce implementations adopted a master/slave architecture as described in Google&#8217;s MapReduce paper. The master node presents a single point of failure. Even though there are secondary nodes, failure recovery is still a hassle at best. For example, in the Hadoop implementation, the secondary node only keeps a log. When the primary master fails, you have to bring the primary back up, then replay the log file kept by the secondary master. Many enterprise clients we work with simply cannot accept a single point of failure for their critical data.</p>
<p><strong>2. Single storage location</strong>.</p>
<p>When running MapReduce in a cloud, most people store their data permanently in cloud storage (e.g., Amazon S3) and copy it to the Hadoop file system before they start the analysis. The copy stage not only wastes valuable time, but it is also a hassle to maintain two copies of the same data. In comparison, Cloud MapReduce stores everything in a single location (e.g., Amazon S3), and all accesses during analysis go directly to that storage location. In our tests, Amazon S3 sustained a high throughput and was not a bottleneck in the analysis.</p>
<p><strong>3. No cluster configuration</strong>.</p>
<p>Unlike other MapReduce implementations, you do not have to set up a cluster first, e.g., set up a master and then add slaves. You simply launch a number of machines and each starts working on the job. Further, there is no hassle when you need to dynamically reconfigure your cluster. If you feel the job is progressing too slowly, you can simply launch more machines, and they will join the computation right away. No complicated cluster reconfiguration is needed.</p>
<p><strong>4. Simple to change</strong>.</p>
<p>Some applications do not fit the MapReduce programming model. One can try to change the application to fit the rigid programming model, but that results in either inefficiency or complicated changes to, or setup of, the framework (e.g., Hadoop). With Cloud MapReduce, you can easily change the framework to suit your needs. Since there are only 3,000 lines of code, it is easy to change.</p>
<p><strong>5. Higher performance</strong>.</p>
<p>Cloud MapReduce is faster than Hadoop in our study. The exact speedup depends on the application. In one representative case, we saw a 60x speedup. This is neither the maximum nor the minimum speedup you can get. We could massage the data (e.g., use more and even smaller files) to show a much bigger speedup, but we decided to make the experiment more realistic: it uses the &#8220;reverse index&#8221; application, the application the MapReduce framework was designed for, and a public data set to enable easy replication. One may argue that the comparison is unfair because Hadoop is not designed to handle small files. It is true that we could apply band-aids to Hadoop to close the gap, but the experiment is really a scaled-down version of a large-scale test with many large files and many slave nodes. The experiment highlights a bottleneck in the master/slave architecture that you will eventually encounter. Even without hitting the scalability bottleneck, Cloud MapReduce is faster than Hadoop. The detailed reasons are listed in the paper.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Cloud Computing Service - Amazon EC2 vs Google GAE]]></title>
<link>http://setandbma.wordpress.com/2009/11/25/cloud-computing-service-amazon-ec2-vs-google-gae/</link>
<pubDate>Wed, 25 Nov 2009 07:09:05 +0000</pubDate>
<dc:creator>Udayan Banerjee</dc:creator>
<guid>http://setandbma.wordpress.com/2009/11/25/cloud-computing-service-amazon-ec2-vs-google-gae/</guid>
<description><![CDATA[Cloud Computing Service Service provider with large number of networked computer systems Allowing yo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><div>
<table style="border-collapse:collapse;" border="0">
<col>
<col>
<tbody valign="top">
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:solid black .5pt;border-left:solid black .5pt;border-bottom:solid black .5pt;border-right:solid black .5pt;">
<p><strong><em>Cloud Computing Service</em></strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:solid black .5pt;border-left:none;border-bottom:solid black .5pt;border-right:solid black .5pt;">
<ol>
<li><em>Service provider with large number of networked computer systems<br />
</em></li>
<li><em>Allowing you to use a slice of that processing power and storage<br />
</em></li>
<li><em>Shielding your program and data from others sharing the same service, and<br />
</em></li>
<li><em>Charging you for your actual usage</em></li>
</ol>
</td>
</tr>
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid black .5pt;border-bottom:solid black .5pt;border-right:solid black .5pt;">
<p><strong><em>Value Proposition of Cloud Computing</em></strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid black .5pt;border-right:solid black .5pt;">
<ul>
<li><em>Elastic Capacity – Pay for what you actually use<br />
</em></li>
<li><em>Economy of Scale – Of hardware, Infrastructure and Management</em></li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<h1><span style="font-size:14pt;">Comparison between Amazon EC2 and Google GAE<br />
</span></h1>
<p>Though different cloud service providers follow different strategies, these two represent distinctly different approaches. The others either resemble one of them or fall somewhere in between.
</p>
<p>I have excluded SaaS from this discussion – you can see the comparison between IaaS, PaaS and SaaS on this post on <a href="http://setandbma.wordpress.com/2009/09/03/cloud-strategy/">Cloud Strategy</a>.</p>
<table style="border-collapse:collapse;" border="0">
<col>
<col>
<col>
<tbody valign="top">
<tr style="background:#b8cce4;">
<td style="padding-left:7px;padding-right:7px;border-top:solid #4bacc6 1pt;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 2.25pt;border-right:solid #4bacc6 1pt;"> </td>
<td style="padding-left:7px;padding-right:7px;border-top:solid #4bacc6 1pt;border-left:none;border-bottom:solid #4bacc6 2.25pt;border-right:solid #4bacc6 1pt;">
<p><a href="http://aws.amazon.com/ec2/">Amazon EC2</a><strong> (<em>Elastic Compute Cloud)</em></strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:solid #4bacc6 1pt;border-left:none;border-bottom:solid #4bacc6 2.25pt;border-right:solid #4bacc6 1pt;">
<p><a href="http://code.google.com/appengine/docs/whatisgoogleappengine.html">Google GAE</a><strong> (<em>Google App Engine)</em></strong></p>
</td>
</tr>
<tr style="background:#d2eaf1;">
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Base Technology</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Virtualization</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Existing Google infrastructure</p>
</td>
</tr>
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Unit of Scalability</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Dynamically instantiated virtual machines</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Every transaction initiated by a user
</p>
<p>Every scheduled or queued task</p>
</td>
</tr>
<tr style="background:#d2eaf1;">
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Persistence</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Reserved Virtual Machine using standard RDBMS</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>By directly persisting objects on to <a href="http://labs.google.com/papers/bigtable.html">Google BigTable</a>
						</p>
<p>No need for any object-relational mapping</p>
</td>
</tr>
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Software License</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>All software licenses required
</p>
<p>OS, RDBMS, Web Server, App Server …</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Behaves like a Service Bus of infinite capacity
</p>
<p>Application code can be directly deployed</p>
</td>
</tr>
<tr style="background:#d2eaf1;">
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Readiness</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Reasonably mature
</p>
<p>Can be viewed as an extension to existing hosting services</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Very much in Beta
</p>
<p>Will take a couple of years to mature</p>
</td>
</tr>
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Best for …</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Application requiring heavy processing power for short duration</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Startups wanting to start free and have the ability to scale when the venture succeeds</p>
</td>
</tr>
<tr style="background:#d2eaf1;">
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Economics</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Saving potential of 30-70% for the right type of application</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Not clear
</p>
<p>However, it can be an order of magnitude improvement</p>
</td>
</tr>
<tr>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:solid #4bacc6 1pt;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Innovativeness</strong></p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p>Incremental</p>
</td>
<td style="padding-left:7px;padding-right:7px;border-top:none;border-left:none;border-bottom:solid #4bacc6 1pt;border-right:solid #4bacc6 1pt;">
<p><strong>Potentially Disruptive</strong></p>
</td>
</tr>
</tbody>
</table>
<h1><span style="font-size:14pt;">Why is GAE potentially disruptive?<br />
</span></h1>
<ul>
<li>Over the last decade, Google has built a huge cloud infrastructure for its search and other services
</li>
</ul>
<ul style="margin-left:54pt;">
<li>The infrastructure has been built using very cost-effective hardware
</li>
<li>Fault tolerance is designed into the architecture
</li>
<li>They have perfected technologies and algorithms like <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> and <a href="http://en.wikipedia.org/wiki/BigTable">BigTable</a> created for such infrastructure
</li>
<li>It is highly scalable
</li>
</ul>
<ul>
<li>Google is following a strategy of opening up their infrastructure for developers to use – for example <a href="http://news.cnet.com/8301-30685_3-10391002-264.html">Closure JavaScript Toolset</a>
		</li>
<li>They will optimize cloud access through <a href="http://en.wikipedia.org/wiki/Google_Chrome_OS">Chrome OS</a> and <a href="http://en.wikipedia.org/wiki/Android_(operating_system)">Android Mobile OS</a>
		</li>
</ul>
<p><strong>Their economy of scale will be difficult to match.<br />
</strong></p>
<p><span style="color:red;"><strong>What about Microsoft? They are constrained by the fact that they have to defend their desktop business, which will prevent them from following an optimal cloud strategy!<br />
</strong></span></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[M2009 Interview Peter Pawlowski AsterData]]></title>
<link>http://decisionstats.wordpress.com/2009/11/24/m2009-interview-peter-pawloski-asterdata/</link>
<pubDate>Tue, 24 Nov 2009 17:46:01 +0000</pubDate>
<dc:creator>Ajay Ohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/11/24/m2009-interview-peter-pawloski-asterdata/</guid>
<description><![CDATA[Here is an interview with Peter Pawlowski, who is the MTS for Data Mining at Aster Data. I ran into ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Here is an interview with Peter Pawlowski, who is the MTS for Data Mining at Aster Data. I ran into Peter at the AsterData booth during M2009, and followed up with an email interview. Also included is a presentation that he co-authored.</p>

<div>
<blockquote><p>Ajay- Describe your career in Science leading up till today.</p></blockquote>
</div>
<div>Peter- Went to Stanford, where I got a BS &#38; MS in Computer Science. I did some work on automated bug-finding tools while at Stanford.</div>
<div><em>(Note: that sums up the career of almost 60% of CS scientists.)</em></div>
<div>
<blockquote><p>Ajay- How is life working at Aster Data- what are the challenges and the great stuff</p></blockquote>
</div>
<div>Peter- Working at Aster is great fun, due to the sheer breadth and variety of the technical challenges. We have problems to solve in optimization, languages, networking, databases, operating systems, etc. It&#8217;s been great to think about problems end-to-end &#38; consider the impact of a change on all aspects of the system. I worked on SQL/MR in particular, which had lots of interesting challenges: How do you define the API? How do you integrate with SQL? How do you make it run fast? How do you make it scale?</div>
<div>
<blockquote><p>Ajay- Do you think Universities offer adequate preparation for in demand skills like Mapreduce, Hadoop and Business Intelligence</p></blockquote>
</div>
<div>Peter- Probably not BI; I learned everything I know about BI while at Aster. In terms of M/R, it&#8217;d be useful to have more hands-on experience with distributed systems while at school. We read the MapReduce paper but didn&#8217;t get a chance to actually play with M/R. I think that sort of exposure would be useful. We recently made our software available to some students taking a data mining class at Stanford, and they came up with some fascinating use cases for our system, esp. around the Netflix challenge dataset.</div>
<div>
<blockquote><p>Ajay- Describe some of the recent engineering products that you have worked with at Aster</p></blockquote>
</div>
<div>Peter- SQL/MR is the main aspect of nCluster that I&#8217;ve worked with; the interesting challenges are described in #2.</div>
<div>
<blockquote><p>Ajay- All BI companies claim to crunch data the fastest at the lowest price at highest quality as per their marketing brochure- How would you validate your product&#8217;s performance scientifically and transparently.</p></blockquote>
</div>
<div>Peter- I&#8217;ve found that the hardest part of judging performance is to come up with a realistic workload. There are public benchmarks out there, but they may or may not reflect the kinds of workloads that our customers want to run. Our goal is to make our customers&#8217; experience as good as possible, so we focus on speeding up the sorts of workloads they ask about.</div>
<div>And here is a presentation at Slideshare.net on more of what Peter works on.</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Dynamic Offline Reports]]></title>
<link>http://developeraspirations.wordpress.com/2009/11/23/dynamic-offline-reports/</link>
<pubDate>Tue, 24 Nov 2009 00:38:41 +0000</pubDate>
<dc:creator>jearil</dc:creator>
<guid>http://developeraspirations.wordpress.com/2009/11/23/dynamic-offline-reports/</guid>
<description><![CDATA[Many applications have the primary concern of storing and retrieving data. The raw data by itself is]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Many applications have the primary concern of storing and retrieving data. The raw data by itself is often not very useful, so an additional process is put into place to turn that data into useful information. Many applications generate these reports from the data quickly at the user&#8217;s request through either a narrow SQL select statement, in-application data processing, or both. However, in larger applications where the data is too large to hold in memory at one time, processing is too heavy, and report customization is too varied, information generation needs to be pushed to its own scalable system.</p>
<p>A lot of data naturally has various pieces of metadata associated with it: which user added the record, the date it was added or modified (or both), maybe the size of the record, categories, tags, keywords, or other pieces of associated data that are used to break it up into more manageable chunks. This metadata is useful, but only if we can aggregate on it to generate specific information about groups of items.</p>
<p>Sometimes generating these more specific reports is as easy as adding additional WHERE or GROUP BY clauses in SQL. However, when more advanced business rules are taking place where there isn&#8217;t an easy or succinct way of extracting this information via a query, or if the query returns such a large amount of data as to cause memory issues, a different approach can be taken.</p>
<p>For instance, in an application I am currently working on, we need to generate reports based on a table with about 5 million rows. The SQL queries we use can limit the number of rows returned to perhaps a few hundred thousand for some of our larger reports. However, a lot of the data needs to be processed in application code rather than by the database itself due to some special business rules. Because of this, we end up creating a large number of objects in Java to hold these result rows. If multiple users are generating different reports, we might end up holding too many of these objects in memory at a time and receive an OOM error. Also, the processing on this data can be intense enough that, if the server is slammed with report requests, the entire system slows down, causing difficulties for people wanting to insert or modify data. This is the situation I am in as I contemplate offline report generation.</p>
<p>The basic idea is that the main application should be concerned purely with manipulating the data of the system. That is, basic CRUD stuff such as creating new records, updating them, the rare deletions, and showing single records to the user (so they can edit or delete them). We want that part of the application to remain fast, and not be affected by the purely read-only needs imposed by report generation. To nullify the impact, we move reporting to its own system that reads from a separate read-only replica of our production database.</p>
<p>When a report request comes in to our application, we send a request to the separate reporting system. This can be done either as a web service or maybe an RPC. The reporting system uses its read-only copy of the data to generate the report numbers and sends them back, causing no delay for insertions or regular operation of the main application.</p>
<p>This doesn&#8217;t solve our OOM issues, however, as many drivers for our database (MySQL) return ResultSet objects holding the entire contents of the results, which might be too large to fit into memory. However, since we&#8217;re using a read-only copy anyway, we can convert the table or tables we use into flat files that can be read line by line: read some lines, perform intermediate result processing, deallocate those lines, and move on to the next ones. Since our reports mostly generate statistical data over a large data set, we can process that data set in parallel using multiple threads, or possibly multiple computers in a Hadoop cluster.</p>
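<p>As a rough illustration of the line-by-line idea, here is a minimal sketch in Python (not the Java used in the application). It assumes the flat file is CSV with a header row; the function name <code>count_by_column</code> and the column names are hypothetical. Only a running tally stays in memory, never the full set of rows.</p>

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Stream rows one at a time and keep only a running tally per value."""
    totals = Counter()
    for row in csv.DictReader(lines):
        # Each row is dropped as soon as the tally is updated.
        totals[row[column]] += 1
    return totals


# Hypothetical sample data standing in for an exported table.
sample = io.StringIO("id,user,category\n1,ann,books\n2,bob,music\n3,ann,books\n")
print(count_by_column(sample, "category"))
```

<p>Any iterable of lines works as input, so the same function can be handed an open file object for a real export.</p>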
<p>By making report generation asynchronous to our application&#8217;s general workflow, we free up the processing power and the connection pool used to handle requests. Users either poll for the finished result or are notified by the system when a report is finished, so we avoid the cases where all of our available connections or resources are tied up processing reports. There is also the added possibility of continuously generating all possible report data on a separate machine or cluster, decreasing access time at the cost of increased storage requirements.</p>
<p>I&#8217;m currently researching the use of MapReduce in Hadoop for processing this flat file of all data for reports. In addition, I&#8217;m researching a few languages that are reported to be good at concurrent processing so that I can benefit from multiple cores when generating reports from our raw data. My current focus is on Scala, Erlang, and Clojure, but I do not have much to report on those areas yet. If anyone has any information on those languages with regard to report generation based on a largish data set (currently 5 million rows, but the rate of growth is fairly alarming), let me know.</p>
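<p>The submit-then-poll pattern could look roughly like the following Python sketch. The class <code>ReportService</code> and its methods are hypothetical names for illustration, and a bounded thread pool stands in for the separate reporting system; a real deployment would sit behind a web service or RPC.</p>

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ReportService:
    """Runs report jobs on a bounded worker pool and lets callers poll for results."""

    def __init__(self, workers=2):
        # The pool size caps how many reports are processed at once.
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}

    def submit(self, build_report, *args):
        """Queue a report and return a job id immediately."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(build_report, *args)
        return job_id

    def poll(self, job_id):
        """Return ("pending", None) or ("done", result) without blocking."""
        future = self._jobs[job_id]
        if not future.done():
            return ("pending", None)
        return ("done", future.result())


service = ReportService()
job = service.submit(sum, [1, 2, 3])   # sum stands in for a real report builder
status, result = service.poll(job)     # "pending" until the worker finishes
```

<p>Because the pool is bounded, a burst of report requests queues up instead of exhausting connections or memory in the main application.</p>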
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA["BotGraph: Large Scale Spamming Botnet Detection"]]></title>
<link>http://everythingisdata.wordpress.com/2009/11/22/botgraph-large-scale-spamming-botnet-detection/</link>
<pubDate>Sun, 22 Nov 2009 03:16:36 +0000</pubDate>
<dc:creator>Neil Conway</dc:creator>
<guid>http://everythingisdata.wordpress.com/2009/11/22/botgraph-large-scale-spamming-botnet-detection/</guid>
<description><![CDATA[Botnets are used for various nefarious ends; one popular use is sending spam email by creating and t]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, <a href="http://en.wikipedia.org/wiki/CAPTCHA">CAPTCHAs</a> have been used to try to prevent this, but they are increasingly ineffective. Hence, the <a href="http://research.microsoft.com/pubs/79413/botgraph.pdf">BotGraph</a> paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior. They describe the algorithm, its implementation with <a href="http://research.microsoft.com/en-us/projects/dryad/">Dryad</a>, and present experimental results from real-world Hotmail access logs.</p>
<h3>Algorithm</h3>
<p>BotGraph employs three different ideas for detecting automated users:</p>
<ol>
<li>They regard sudden spikes in the number of accounts created by a single IP as suspicious. Hence, they use a simple exponentially-weighted moving average (EWMA) to detect such spikes, and throttle/rate-limit account signups from suspicious IPs. This has the effect of making it more difficult for spammers to obtain webmail accounts.</li>
<li>They argue that the number of bot machines will be much smaller than the number of bot-created webmail accounts; hence, one bot machine will access a large number of accounts. They also argue that a single bot-created webmail account will be accessed from multiple bots on different <a href="http://en.wikipedia.org/wiki/Autonomous_system_(Internet)">autonomous systems</a> (ASs), due to churn in the botnet (although this seems pretty unconvincing to me), and the fact that rate-limiting makes it more difficult to create large numbers of bot accounts. Hence, they look for pairs of user accounts that had logins from an overlapping set of ASs.</li>
<li>Finally, they consider a user&#8217;s email-sending behavior:<br />
<blockquote><p>
Normal users usually send a small number of emails per day on average, with different email sizes. On the other hand, bot-users usually send many emails per day, with identical or similar email sizes
</p></blockquote>
<p>Hence, they regard users who send 3+ emails per day as &#8220;suspicious&#8221;; they also regard as suspicious users whose email-size distributions are dissimilar from most other users.</li>
</ol>
<p>They use feature #1 primarily to rate-throttle new account creations. Feature #3 is used to avoid false positives.</p>
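<p>To make the EWMA idea concrete, here is a minimal Python sketch of spike detection over per-IP daily signup counts. The smoothing factor <code>alpha</code>, the multiplier <code>factor</code>, and the rule "flag a day whose count exceeds <code>factor</code> times the running average" are my own assumptions for illustration; the paper&#8217;s exact parameters and rule may differ.</p>

```python
def ewma_spikes(daily_signups, alpha=0.3, factor=3.0):
    """Return indices of days whose count exceeds factor * EWMA of earlier days."""
    spikes = []
    avg = None
    for day, count in enumerate(daily_signups):
        # Compare against the average of previous days only (floor of 1.0
        # avoids flagging tiny counts when the history is near zero).
        if avg is not None and count > factor * max(avg, 1.0):
            spikes.append(day)
        # Fold today's count into the moving average.
        avg = count if avg is None else alpha * count + (1 - alpha) * avg
    return spikes


signups = [2, 3, 2, 4, 3, 40, 3]
print(ewma_spikes(signups))  # [5]
```

<p>In this sketch a flagged day is still folded into the average, so a sustained high rate eventually looks normal; whether to do that is a design choice.</p>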
<p>Feature #2 is the primary focus of the paper. They construct a <i>user-user</i> graph with a vertex for each user account. Each edge has a weight that gives the number of shared login ASs, that is, the number of ASs that were used to log in to both accounts. Within the user-user graph, they look for connected components with an edge weight over a threshold <i>T</i>: they begin by finding components with <i>T=2</i>, and then iteratively increase the threshold until each component has no more than 100 members.</p>
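<p>A small single-machine sketch of this graph analysis in Python follows. It assumes login records are <code>(account, AS number)</code> pairs and that an edge counts when its weight is at least <i>T</i>; the function names are hypothetical, and the paper&#8217;s real implementation is distributed.</p>

```python
from collections import defaultdict
from itertools import combinations


def edge_weights(logins):
    """Map each account pair (a, b), a < b, to the number of ASs both logged in from."""
    accounts_by_as = defaultdict(set)
    for account, asn in logins:
        accounts_by_as[asn].add(account)
    weights = defaultdict(int)
    for accounts in accounts_by_as.values():
        for a, b in combinations(sorted(accounts), 2):
            weights[(a, b)] += 1
    return dict(weights)


def components(weights, threshold):
    """Connected components using only edges with weight >= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def refine(weights, threshold=2, max_size=100):
    """Raise the threshold inside any component that is still larger than max_size."""
    result = []
    for group in components(weights, threshold):
        if len(group) <= max_size:
            result.append(group)
        else:
            inner = {e: w for e, w in weights.items() if e[0] in group and e[1] in group}
            result.extend(refine(inner, threshold + 1, max_size))
    return result


logins = [("u1", 10), ("u2", 10), ("u1", 20), ("u2", 20), ("u3", 20), ("u4", 30)]
print(refine(edge_weights(logins)))  # [{'u1', 'u2'}]
```

<p>Here <code>u1</code> and <code>u2</code> share two ASs and form a suspicious component, while <code>u3</code> shares only one AS with each of them and is dropped at <i>T=2</i>.</p>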
<h3>Implementation</h3>
<p>They describe two ways to implement the construction of the user-user graph using a data-parallel system like MapReduce or Dryad, using the login log from Hotmail (~220GB for one month of data):</p>
<ol>
<li>Partition the login records by client IP. Emit an intermediate record <i>(i, j, k)</i> for each shared login on the same day from AS <i>k</i> to accounts <i>i</i> and <i>j</i>. In the reduce phase, group on <i>(i, j)</i> and sum. The problem with this approach is that it requires a lot of communication: most edges in the user-user graph have weight 1, and hence can be dropped, but this approach still requires sending them over the network.</li>
<li>Partition the login records by user name. For each partition, compute a &#8220;summary&#8221; of the IP-day keys present for users in that partition (the paper doesn&#8217;t specify the nature of the summary, but presumably it is analogous to a <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>). Each partition sends its summary to every other partition. Using the summaries, each partition can exchange login records with other partitions in a way that allows edge weights to be computed, but doesn&#8217;t require sending weight 1 edges over the network.</li>
</ol>
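<p>The first method can be simulated in a single process with the Python sketch below. The record layout <code>(account, ip, asn, day)</code> and the function names are assumptions for illustration, and the reduce step counts distinct ASs per pair as one reading of "group and sum". Note how the weight-1 pairs are still emitted by the map phase, which is the communication cost the post describes.</p>

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records):
    """Emit (i, j, k) for each pair of accounts i, j seen on the same IP and day in AS k."""
    accounts_by_ip_day = defaultdict(set)
    asn_of_ip = {}
    for account, ip, asn, day in records:
        accounts_by_ip_day[(ip, day)].add(account)
        asn_of_ip[ip] = asn
    for (ip, _day), accounts in accounts_by_ip_day.items():
        for i, j in combinations(sorted(accounts), 2):
            yield (i, j, asn_of_ip[ip])


def reduce_phase(triples):
    """Group on (i, j) and count the distinct ASs shared by the pair."""
    shared = defaultdict(set)
    for i, j, k in triples:
        shared[(i, j)].add(k)
    return {pair: len(ases) for pair, ases in shared.items()}


records = [
    ("u1", "1.1.1.1", 100, "d1"),
    ("u2", "1.1.1.1", 100, "d1"),
    ("u1", "2.2.2.2", 200, "d2"),
    ("u2", "2.2.2.2", 200, "d2"),
    ("u3", "2.2.2.2", 200, "d2"),
]
print(reduce_phase(map_phase(records)))
```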
<p>They argue that the second method can&#8217;t be implemented with Map and Reduce, although I&#8217;m not sure if I believe them: multicasting can be done by writing to HDFS, as can shipping data between logical partitions.</p>
<h3>Discussion</h3>
<p>I think the major problem with their experimental results is that there&#8217;s effectively no adversary: botnet operators presumably weren&#8217;t aware of this technique when the experiments were performed. Hence, they haven&#8217;t adapted their tactics &#8212; which might actually be quite easy to do.</p>
<p>For example, it seems like it would be quite easy to defeat their EWMA-based throttling by simply increasing the number of signups/time gradually. Essentially, the bot machine acts like an HTTP proxy with a gradually-increasing user population. One can imagine such a bot even mimicking the traffic patterns exhibited by a real-world proxy (e.g. increase at 9AM, decrease at 5PM). Certainly using a simple EWMA seems too primitive to defeat a dedicated adversary.</p>
<p>Similarly, it also seems quite easy to avoid sharing a single webmail account among multiple bots: simply assign a single webmail account to a single bot machine, and don&#8217;t reuse webmail accounts if the bot machine becomes inaccessible. The idea, again, is to simulate an HTTP proxy that accesses a large number of webmail accounts. The paper&#8217;s argument that &#8220;churn&#8221; <i>requires</i> reuse of webmail accounts &#8220;to maximize bot-account utilization&#8221; is unconvincing and unsubstantiated. Since this is the entire principle upon which their technique is based, I&#8217;d be quite concerned that a relatively simple adaptation on the part of botnet operators would make this analysis ineffective.</p>
<p>I thought the paper&#8217;s wide-eyed tone toward using MapReduce-style systems for graph algorithms was annoying. <i>Lots</i> of people do large-scale graph algorithms using MapReduce-style systems; in fact, that&#8217;s one of the main things MapReduce was originally designed for (e.g. computing PageRank). The paper is not novel in this respect, and I was surprised that they didn&#8217;t cite one of the <a href="http://scholar.google.com/scholar?q=mapreduce+graph">many prior papers</a> on this subject.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce in Microsoft’s DryadLINQ]]></title>
<link>http://jclouds.wordpress.com/2009/11/20/mapreduce-in-microsofts-dryadlinq/</link>
<pubDate>Fri, 20 Nov 2009 20:13:33 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://jclouds.wordpress.com/2009/11/20/mapreduce-in-microsofts-dryadlinq/</guid>
<description><![CDATA[Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework John Vert in 408A on Tue]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework<br />
John Vert in 408A on Tuesday at 3:00 PM</p>
<p>Come get an overview of the DryadLINQ features and runtime environment, and walk through some real-world examples of DryadLINQ programs based on the familiar declarative syntax of LINQ combined with the fault-tolerant distributed graph scheduling of the Dryad runtime. Hear how DryadLINQ provides a programming model and runtime for data-parallel programs running across large clusters and partitioned data sets.</p>
<p><span style="color:#ff0000;">E</span> &#60;<a title="http://agilecat.wordpress.com/2009/11/21/mapreduce-in-dryadlinq/" href="http://agilecat.wordpress.com/2009/11/21/mapreduce-in-dryadlinq/">http://agilecat.wordpress.com/2009/11/21/mapreduce-in-dryadlinq/</a>&#62;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce in DryadLINQ]]></title>
<link>http://agilecat.wordpress.com/2009/11/21/mapreduce-in-dryadlinq/</link>
<pubDate>Fri, 20 Nov 2009 20:07:24 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/11/21/mapreduce-in-dryadlinq/</guid>
<description><![CDATA[Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework John Vert in 408A on Tue]]></description>
<content:encoded><![CDATA[Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework John Vert in 408A on Tue]]></content:encoded>
</item>
<item>
<title><![CDATA[Curt Monash on Analytics with MapReduce]]></title>
<link>http://decisionstats.wordpress.com/2009/11/16/curt-monash-on-analytics-with-mapreduce/</link>
<pubDate>Mon, 16 Nov 2009 17:56:56 +0000</pubDate>
<dc:creator>Ajay Ohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/11/16/curt-monash-on-analytics-with-mapreduce/</guid>
<description><![CDATA[In AsterData&#8217;s continued webcast series on MapReduce enabled analytics, here is the next in li]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.asterdata.com/wc_091203_masteringmapreduce/"><img class="size-full wp-image-3227  alignleft" title="mon1" src="http://decisionstats.wordpress.com/files/2009/11/mon1.gif" alt="mon1" width="540" height="751" /></a>In AsterData&#8217;s continued webcast series on MapReduce enabled analytics, here is the next in line, Curt Monash on Analytics for Data with MapReduce.</p>
<p><a href="http://www.asterdata.com/wc_091203_masteringmapreduce/">http://www.asterdata.com/wc_091203_masteringmapreduce/</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[An Underground PDC, Apparently!]]></title>
<link>http://agilecat.wordpress.com/2009/11/15/undergound-pdc-%e3%81%a0%e3%81%a3%e3%81%a6%ef%bc%81/</link>
<pubDate>Sun, 15 Nov 2009 00:26:36 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/11/15/undergound-pdc-%e3%81%a0%e3%81%a3%e3%81%a6%ef%bc%81/</guid>
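<description><![CDATA[This looks interesting ... It runs from 19:00 on the 18th to 01:00 on the 19th local time and is held in true underground fashion, so it is presumably a kind of shadow PDC. Back at the old Vbits, there was a]]></description>
<content:encoded><![CDATA[This looks interesting ... It runs from 19:00 on the 18th to 01:00 on the 19th local time and is held in true underground fashion, so it is presumably a kind of shadow PDC. Back at the old Vbits, there was a]]></content:encoded>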
</item>
<item>
<title><![CDATA[Open sourcing Cloud MapReduce]]></title>
<link>http://huanliu.wordpress.com/2009/11/13/open-sourcing-cloud-mapreduce/</link>
<pubDate>Fri, 13 Nov 2009 19:12:22 +0000</pubDate>
<dc:creator>huanliu</dc:creator>
<guid>http://huanliu.wordpress.com/2009/11/13/open-sourcing-cloud-mapreduce/</guid>
<description><![CDATA[After a lengthy review process, I finally received the approval to open source Cloud MapReduce ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>After a lengthy review process, I finally received approval to open source <a href="http://code.google.com/p/cloudmapreduce">Cloud MapReduce</a> &#8211; an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is committed not only to using open source technology but also to continuing to contribute to the community.</p>
<p>MapReduce was first invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year, Google&#8217;s production indexing system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems. For example, roughly 10,000 MapReduce jobs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in September 2007.</p>
<p>MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain competitive advantages, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to this challenge.</p>
<p>There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market. Even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after three years of dedicated work by many engineers, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.</p>
<p>Cloud MapReduce is not just another implementation; it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; therefore, Hadoop has to implement a great deal of functionality to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3/SQS/SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single big server abstraction, which greatly simplifies a MapReduce implementation.</p>
<p>By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.</p>
<ul>
<li><strong>It is faster</strong>. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).</li>
<li><strong>It is more scalable and failure resistant</strong>. It is fully distributed, with no single bottleneck and no single point of failure.</li>
<li><strong>It is dramatically simpler</strong>. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.</li>
</ul>
<p>All these advantages directly translate into lower cost, higher reliability and faster turn-around for enterprises to gain competitive advantages.</p>
<p>On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you count the efforts of hundreds of Amazon engineers, it is natural that we are able to develop a more scalable and higher performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.</p>
<p>Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed only on the Amazon cloud OS, we envision it running on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap there: Azure has no large-scale processing framework at all (Hadoop does not run in Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision that an enterprise would deploy similar cloud services behind the firewall, so that Cloud MapReduce can simply build on top of them. There are already open source projects that fit this vision, such as <a href="http://project-voldemort.com/">Project Voldemort</a> for storage and <a href="http://activemq.apache.org/">ActiveMQ</a> for queuing.</p>
<p>Check out the <a href="http://code.google.com/p/cloudmapreduce">Cloud MapReduce</a> project. We welcome your contributions. Please join the <a href="http://groups.google.com/group/CloudMapReduce">Cloud MapReduce discussions</a> to share your thoughts.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[David Chappell interview by Masahiko Issiki]]></title>
<link>http://jclouds.wordpress.com/2009/11/11/david-chappell-interview-by-masahiko-issiki/</link>
<pubDate>Wed, 11 Nov 2009 01:00:48 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://jclouds.wordpress.com/2009/11/11/david-chappell-interview-by-masahiko-issiki/</guid>
<description><![CDATA[Nov.11 &#8211; In this interview, David Chappell talked about now and then of cloud computing and th]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Nov.11 &#8211; In this interview, David Chappell talked about the now and then of cloud computing and the choice of technologies in Windows Azure application development. He also discussed the economics of cloud computing, the essence of enterprise systems, and the advantages of Windows Azure. Regarding MapReduce, he added that he has asked Microsoft about it several times but has received no response.</p>
<p><font color="#000080">J </font>&#60;<a href="http://www.atmarkit.co.jp/fdotnet/dnfuture/intvwdavidchappell_01/intvwdavidchappell_01_01.html">http://www.atmarkit.co.jp/fdotnet/dnfuture/intvwdavidchappell_01/intvwdavidchappell_01_01.html</a>&#62;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Hadoop: You Too, Twitter!]]></title>
<link>http://agilecat.wordpress.com/2009/11/09/hadoop-%ef%bc%9a-twitter-%e3%82%88%e3%80%81%e3%81%8a%e5%89%8d%e3%82%82%e3%81%8b%ef%bc%81/</link>
<pubDate>Sun, 08 Nov 2009 22:03:18 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/11/09/hadoop-%ef%bc%9a-twitter-%e3%82%88%e3%80%81%e3%81%8a%e5%89%8d%e3%82%82%e3%81%8b%ef%bc%81/</guid>
<description><![CDATA[NoSQL East 2009 &#8211; Summary of Day 2 From &lt;http://journal.uggedal.com/&gt; Somehow, I had a feeling this was coming]]></description>
<content:encoded><![CDATA[NoSQL East 2009 &#8211; Summary of Day 2 From &lt;http://journal.uggedal.com/&gt; Somehow, I had a feeling this was coming]]></content:encoded>
</item>
<item>
<title><![CDATA[Yet Another Hadoop Article!]]></title>
<link>http://agilecat.wordpress.com/2009/11/06/%e3%81%be%e3%81%9f%e3%81%be%e3%81%9f-hadoop-%e3%81%ae%e8%a8%98%e4%ba%8b%e3%81%8c%ef%bc%81/</link>
<pubDate>Thu, 05 Nov 2009 23:49:55 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/11/06/%e3%81%be%e3%81%9f%e3%81%be%e3%81%9f-hadoop-%e3%81%ae%e8%a8%98%e4%ba%8b%e3%81%8c%ef%bc%81/</guid>
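<description><![CDATA[Hadoop&#8217;s hidden potential: the appeal of a platform that works both on-premises and in the cloud. I noticed it this morning: this time ZDNet has published an article on Hadoop World. This one, too, is from the October 2]]></description>
<content:encoded><![CDATA[Hadoop&#8217;s hidden potential: the appeal of a platform that works both on-premises and in the cloud. I noticed it this morning: this time ZDNet has published an article on Hadoop World. This one, too, is from the October 2]]></content:encoded>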
</item>
<item>
<title><![CDATA[Reading fixed length/width input records with Hadoop mapreduce]]></title>
<link>http://bitsofinfo.wordpress.com/2009/11/01/reading-fixed-length-width-input-record-reader-with-hadoop-mapreduce/</link>
<pubDate>Sun, 01 Nov 2009 18:34:42 +0000</pubDate>
<dc:creator>bitsofinfo</dc:creator>
<guid>http://bitsofinfo.wordpress.com/2009/11/01/reading-fixed-length-width-input-record-reader-with-hadoop-mapreduce/</guid>
<description><![CDATA[While working on a project where I needed to quickly import 50-100 million records I ended up using ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>While working on a project where I needed to <strong>quickly</strong> import 50-100 million records I ended up using Hadoop for the job. Unfortunately the input files I was dealing with were fixed width/length records, hence they had no delimiters which separated records, nor did they have any CR/LFs to separate records. Each record was exactly 502 bytes in size. Hadoop provides a <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html" target="_new">TextInputFormat</a> out of the box for reading input files, however it requires that your files contain CR/LFs or some combination thereof. </p>
<p>So&#8230; I went ahead and wrote a couple of classes to support fixed length, fixed width (same thing) records in input files. These classes were inspired by Hadoop&#8217;s TextInputFormat and LineRecordReader. The two classes, <code>FixedLengthInputFormat</code> and <code>FixedLengthRecordReader</code>, are presented below. I have also <a href="https://issues.apache.org/jira/browse/MAPREDUCE-1176" target="_new">created a Hadoop JIRA issue to contribute these classes to the Hadoop project</a>.</p>
<p>This input format overrides <code>computeSplitSize()</code> in order to ensure that InputSplits do not contain any partial records since with fixed records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record. The override of <code>computeSplitSize()</code> delegates to FileInputFormat&#8217;s compute method, and then adjusts the returned split size by doing the following: <code>(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)</code></p>
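<p>As a quick sanity check of that arithmetic, here is a minimal standalone sketch. The class and method names (<code>SplitSizeSketch</code>, <code>alignedSplitSize</code>) are hypothetical and not part of the classes below, and the 64 MB candidate split size is only an assumed example value; the real number depends on your block size and job settings. For positive values, integer division floors just like the <code>Math.floor</code> expression above.</p>
<pre class="brush: java;">
// Illustrative sketch only: SplitSizeSketch and alignedSplitSize are
// hypothetical names, not part of FixedLengthInputFormat or Hadoop.
public class SplitSizeSketch {

	// Round a candidate split size down to a whole number of records,
	// but never below one record. Integer division already floors for
	// positive values, so it stands in for Math.floor here.
	static long alignedSplitSize(long defaultSize, int recordLength) {
		long wholeRecords = defaultSize / recordLength;
		return Math.max(recordLength, wholeRecords * recordLength);
	}

	public static void main(String[] args) {
		// assumed 64 MB candidate split size, 502 byte records as in this post
		long defaultSize = 64L * 1024 * 1024; // 67,108,864 bytes
		long splitSize = alignedSplitSize(defaultSize, 502);

		// 133,682 whole records fit: 133,682 * 502 = 67,108,364 bytes
		System.out.println(splitSize);
	}
}
</pre>
<p>With 502 byte records the 64 MB candidate shrinks by only 500 bytes, so every split still begins and ends on a record boundary.</p>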
<p>FixedLengthInputFormat does NOT support compressed files. Use this input format as follows:</p>
<pre class="brush: java;">
// setup your job configuration etc
...

// be sure to set the length of your fixed length records, so the
// FixedLengthRecordReader can extract the records correctly.
myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 502);

// OR alternatively you can set it this way, the name of the property is
// &#34;mapreduce.input.fixedlengthinputformat.record.length&#34;
myJobConf.setInt(&#34;mapreduce.input.fixedlengthinputformat.record.length&#34;,502);

// create your job
Job job = new Job(myJobConf);
job.setInputFormatClass(FixedLengthInputFormat.class);

// do the rest of your job setup, specifying input locations etc
...

job.submit();
</pre>
<p>Below are the two classes which you are free to use. Hope this helps you out if you have a need to read fixed width/length records out of input files using Hadoop MapReduce! Enjoy.</p>
<p><B>FixedLengthInputFormat.java</b> &#8211; <a href="https://issues.apache.org/jira/browse/MAPREDUCE-1176" target="_new">download</a></p>
<pre class="brush: java;">
package org.bitsofinfo.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * FixedLengthInputFormat is an input format which can be used
 * for input files which contain fixed length records with NO
 * delimiters and NO carriage returns (CR, LF, CRLF) etc. Such
 * files typically only have one gigantic line and each &#34;record&#34;
 * is of a fixed length, and padded with spaces if the record's actual
 * value is shorter than the fixed length.&#60;BR&#62;&#60;BR&#62;
 *
 * Users must configure the record length property before submitting
 * any jobs which use FixedLengthInputFormat.&#60;BR&#62;&#60;BR&#62;
 *
 * myJobConf.setInt(&#34;mapreduce.input.fixedlengthinputformat.record.length&#34;,[myFixedRecordLength]);&#60;BR&#62;&#60;BR&#62;
 *
 * This input format overrides &#60;code&#62;computeSplitSize()&#60;/code&#62; in order to ensure
 * that InputSplits do not contain any partial records since with fixed records
 * there is no way to determine where a record begins if that were to occur.
 * Each InputSplit passed to the FixedLengthRecordReader will start at the beginning
 * of a record, and the last byte in the InputSplit will be the last byte of a record.
 * The override of &#60;code&#62;computeSplitSize()&#60;/code&#62; delegates to FileInputFormat's
 * compute method, and then adjusts the returned split size by doing the following:
 * &#60;code&#62;(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)&#60;/code&#62;
 *
 * &#60;BR&#62;&#60;BR&#62;
 * This InputFormat returns a FixedLengthRecordReader. &#60;BR&#62;&#60;BR&#62;
 *
 * Compressed files currently are not supported.
 *
 * @see	FixedLengthRecordReader
 *
 * @author bitsofinfo.g (AT) gmail.com
 *
 */
public class FixedLengthInputFormat extends FileInputFormat&#60;LongWritable, Text&#62; {

	/**
	 * When using FixedLengthInputFormat you MUST set this
	 * property in your job configuration to specify the fixed
	 * record length.
	 * &#60;BR&#62;&#60;BR&#62;
	 *
	 * i.e. myJobConf.setInt(&#34;mapreduce.input.fixedlengthinputformat.record.length&#34;,[myFixedRecordLength]);
	 */
	public static final String FIXED_RECORD_LENGTH = &#34;mapreduce.input.fixedlengthinputformat.record.length&#34;; 

	// our logger reference
	private static final Log LOG = LogFactory.getLog(FixedLengthInputFormat.class);

	// the default fixed record length (-1), error if this does not change
	private int recordLength = -1;

	/**
	 * Return the int value from the given Configuration found
	 * by the FIXED_RECORD_LENGTH property.
	 *
	 * @param config
	 * @return	int record length value
	 * @throws IOException if the record length found is 0 (non-existent, not set etc)
	 */
	public static int getRecordLength(Configuration config) throws IOException {
		int recordLength = config.getInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 0); 

		// a missing (0) or negative record length is an error
		if (recordLength &#60;= 0) {
			throw new IOException(&#34;FixedLengthInputFormat requires the Configuration property:&#34; + FIXED_RECORD_LENGTH + &#34; to&#34; +
					&#34; be set to something &#62; 0. Currently the value is &#34; + recordLength);
		}

		return recordLength;
	}

	/**
	 * This input format overrides &#60;code&#62;computeSplitSize()&#60;/code&#62; in order to ensure
	 * that InputSplits do not contain any partial records since with fixed records
	 * there is no way to determine where a record begins if that were to occur.
	 * Each InputSplit passed to the FixedLengthRecordReader will start at the beginning
	 * of a record, and the last byte in the InputSplit will be the last byte of a record.
	 * The override of &#60;code&#62;computeSplitSize()&#60;/code&#62; delegates to FileInputFormat's
	 * compute method, and then adjusts the returned split size by doing the following:
	 * &#60;code&#62;(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)&#60;/code&#62;
	 *
	 * @inheritDoc
	 */
	@Override
	protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
		long defaultSize = super.computeSplitSize(blockSize, minSize, maxSize);

		// 1st, if the default size is less than the length of a
		// raw record, lets bump it up to a minimum of at least ONE record length
		if (defaultSize &#60;= recordLength) {
			return recordLength;
		}

		// determine the split size, it should be as close as possible to the
		// default size, but should NOT split within a record... each split
		// should contain a complete set of records with the first record
		// starting at the first byte in the split and the last record ending
		// with the last byte in the split.

		long splitSize = ((long)(Math.floor((double)defaultSize / (double)recordLength))) * recordLength;
		LOG.info(&#34;FixedLengthInputFormat: calculated split size: &#34; + splitSize);

		return splitSize;

	}

	/**
	 * Returns a FixedLengthRecordReader instance
	 *
	 * @inheritDoc
	 */
	@Override
	public RecordReader&#60;LongWritable, Text&#62; createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException, InterruptedException {
		return new FixedLengthRecordReader();
	}

	/**
	 * @inheritDoc
	 */
 	@Override
 	protected boolean isSplitable(JobContext context, Path file) {

 		try {
			if (this.recordLength == -1) {
				this.recordLength = getRecordLength(context.getConfiguration());
			}
			LOG.info(&#34;FixedLengthInputFormat: my fixed record length is: &#34; + recordLength);

 		} catch(Exception e) {
 			LOG.error(&#34;Error in FixedLengthInputFormat.isSplitable() when trying to determine the fixed record length, returning false, input files will NOT be split!&#34;,e);
 			return false;
 		}

 		CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
	 	if (codec != null) {
	 		return false;
	 	}

	 	return true;
	 } 

}
</pre>
<p><B>FixedLengthRecordReader.java</b> &#8211; <a href="https://issues.apache.org/jira/browse/MAPREDUCE-1176" target="_new">download</a></p>
<pre class="brush: java;">
package org.bitsofinfo.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 *
 * FixedLengthRecordReader is returned by FixedLengthInputFormat. This reader
 * uses the record length property set within the FixedLengthInputFormat to
 * read one record at a time from the given InputSplit. This record reader
 * does not support compressed files.&#60;BR&#62;&#60;BR&#62;
 *
 * Each call to nextKeyValue() updates the LongWritable KEY and Text VALUE.&#60;BR&#62;&#60;BR&#62;
 *
 * KEY = byte position in the file the record started at&#60;BR&#62;
 * VALUE = the record itself (Text)
 *
 *
 * @author bitsofinfo.g (AT) gmail.com
 *
 */
public class FixedLengthRecordReader extends RecordReader&#60;LongWritable, Text&#62; {

	// reference to the logger
	private static final Log LOG = LogFactory.getLog(FixedLengthRecordReader.class);

	// the start point of our split
	private long splitStart;

	// the end point in our split
	private long splitEnd; 

	// our current position in the split
	private long currentPosition;

	// the length of a record
	private int recordLength; 

	// reference to the input stream
	private FSDataInputStream fileInputStream;

	// the input byte counter
	private Counter inputByteCounter; 

	// reference to our FileSplit
	private FileSplit fileSplit;

	// our record key (byte position)
	private LongWritable recordKey = null;

	// the record value
	private Text recordValue = null; 

	@Override
	public void close() throws IOException {
		if (fileInputStream != null) {
			fileInputStream.close();
		}
	}

	@Override
	public LongWritable getCurrentKey() throws IOException,
			InterruptedException {
		return recordKey;
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return recordValue;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		if (splitStart == splitEnd) {
			return (float)0;
		} else {
			return Math.min((float)1.0, (currentPosition - splitStart) / (float)(splitEnd - splitStart));
		}
	}

	@Override
	public void initialize(InputSplit inputSplit, TaskAttemptContext context)
			throws IOException, InterruptedException {

		// the file input fileSplit
		this.fileSplit = (FileSplit)inputSplit;

		// the byte position this fileSplit starts at within the file
		splitStart = fileSplit.getStart();

		// the byte position this fileSplit ends at within the file
		splitEnd = splitStart + fileSplit.getLength();

		// log some debug info
		LOG.info(&#34;FixedLengthRecordReader: SPLIT START=&#34;+splitStart + &#34; SPLIT END=&#34; +splitEnd + &#34; SPLIT LENGTH=&#34;+fileSplit.getLength() );

		// the actual file we will be reading from
		Path file = fileSplit.getPath(); 

		// job configuration
		Configuration job = context.getConfiguration(); 

		// check to see if compressed....
		CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
	 	if (codec != null) {
	 		throw new IOException(&#34;FixedLengthRecordReader does not support reading compressed files&#34;);
	 	}

		// for updating the total bytes read in
	 	inputByteCounter = ((MapContext)context).getCounter(&#34;FileInputFormatCounters&#34;, &#34;BYTES_READ&#34;); 

	 	// THE JAR COMPILED AGAINST 0.20.1 does not contain a version of FileInputFormat with these constants (but they exist in trunk)
	 	// uncomment the below, then comment or discard the line above
	 	//inputByteCounter = ((MapContext)context).getCounter(FileInputFormat.COUNTER_GROUP, FileInputFormat.BYTES_READ); 

		// the size of each fixed length record
		this.recordLength = FixedLengthInputFormat.getRecordLength(job);

		// get the filesystem
		final FileSystem fs = file.getFileSystem(job); 

		// open the File
		fileInputStream = fs.open(file,(64 * 1024)); 

		// seek to the splitStart position
		fileInputStream.seek(splitStart);

		// set our current position
	 	this.currentPosition = splitStart;
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		if (recordKey == null) {
		 	recordKey = new LongWritable();
	 	}

		// the Key is always the position the record starts at
	 	recordKey.set(currentPosition);

	 	// the recordValue to place the record text in
	 	if (recordValue == null) {
	 		recordValue = new Text();
	 	} else {
	 		recordValue.clear();
	 	}

	 	// if the currentPosition is less than the split end..
	 	if (currentPosition &#60; splitEnd) {

	 		// setup a buffer to store the record
	 		byte[] buffer = new byte[this.recordLength];
	 		int totalRead = 0; // total bytes read
	 		int totalToRead = recordLength; // total bytes we need to read

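	 		// while we still have record bytes to read
	 		while(totalRead != recordLength) {
	 			// read in what we need
	 			int read = this.fileInputStream.read(buffer, 0, totalToRead);

	 			// read() returns -1 at end of stream; fail here rather than
	 			// loop forever or append a negative length
	 			if (read == -1) {
	 				throw new IOException(&#34;FixedLengthRecordReader: unexpected end of file, read only &#34; + totalRead + &#34; of &#34; + recordLength + &#34; bytes of the current record&#34;);
	 			}

	 			// append the bytes just read to the record value
	 			recordValue.append(buffer,0,read);

	 			// update our markers
	 			totalRead += read;
	 			totalToRead -= read;
	 			//LOG.info(&#34;READ: just read=&#34; + read +&#34; totalRead=&#34; + totalRead + &#34; totalToRead=&#34;+totalToRead);
	 		}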

	 		// update our current position and log the input bytes
	 		currentPosition = currentPosition +recordLength;
	 		inputByteCounter.increment(recordLength);

	 		//LOG.info(&#34;VALUE=&#124;&#34;+fileInputStream.getPos()+&#34;&#124;&#34;+currentPosition+&#34;&#124;&#34;+splitEnd+&#34;&#124;&#34; + recordLength + &#34;&#124;&#34;+recordValue.toString());

	 		// return true
	 		return true;
	 	}

	 	// nothing more to read....
		return false;
	}

}
</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Greed and Other Virtues]]></title>
<link>http://jsaia.wordpress.com/2009/10/29/greed-and-other-virtues/</link>
<pubDate>Thu, 29 Oct 2009 21:28:03 +0000</pubDate>
<dc:creator>Jared</dc:creator>
<guid>http://jsaia.wordpress.com/2009/10/29/greed-and-other-virtues/</guid>
<description><![CDATA[We just recently finished talking about greedy algorithms in my graduate algorithms class.  In the p]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>We just recently finished talking about greedy algorithms in my graduate algorithms class.  In the past few months, I&#8217;ve developed a new appreciation for greed.  Part of this is due to a discussion with <a href="http://www.cs.toronto.edu/~bor/">Alan Borodin</a> at a recent workshop, who made a good case for greediness as a stand-in for simplicity.  Anyone working in algorithms will agree that simple algorithms are better; even more so in the area of distributed computing, where even simple algorithms can be notoriously difficult to implement.  Unfortunately, it&#8217;s hard to improve what you can&#8217;t measure, and so there is a real need for good measures of algorithmic simplicity.  This makes me think of several important techniques in algorithm design and how these might intersect with the notion of simplicity.</p>
<ul>
<li>Local algorithms: In distributed computing, local algorithms are those that run in time independent of the network size &#8211; i.e. each node in the network only receives information from its local neighborhood.  Clearly, these types of algorithms are simpler than algorithms that require information to be exchanged over long distances in the network.  Approximate edge coloring of graphs and some resource allocation problems are amenable to this approach.  See the two great papers: <a href="http://portal.acm.org/citation.cfm?id=167088.167149">What can be computed locally?</a> by Naor and Stockmeyer and <a href="http://portal.acm.org/citation.cfm?id=167088.167149">What cannot be computed locally</a> by Kuhn, Moscibroda and Wattenhofer for more on this area.</li>
<li>Data streams: Algorithms that use little space are likely to be simple.  Data stream algorithms use <strong>very </strong>little space (or more specifically the amount of space used is very small compared to the input size).  While the analysis for many of these algorithms can be quite sophisticated, the algorithms themselves can usually be described in less than a quarter of a page of pseudo-code.  Muthukrishnan&#8217;s book is a great resource in this area (plus it is freely available <a href="http://www.cs.rutgers.edu/~muthu/stream">here</a>).  Recently, there has been a lot of interest in applying the data stream model in a distributed setting.  See for example the <a href="http://mysliceofpizza.blogspot.com/2009/02/more-stream-models.html">Massive, Unordered Data</a> (MUD) model, which tries to capture the approach of systems like <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>.</li>
<li>Greedy algorithms: Many of us are familiar with several algorithms of this type.  Perhaps Kruskal&#8217;s algorithm is the most frequently used greedy algorithm that is always correct.  Of course, there are many, many heuristics that are greedy that are not guaranteed to be correct.  I wonder what the most frequently used greedy approximation algorithms are.  Nothing comes to mind except for the greedy set cover algorithm or greedy approximation algorithms for maximizing submodular functions.</li>
<li>&#8220;Selfish&#8221; algorithms: This is a phrase I just made up to describe distributed algorithms that require each node to behave selfishly.  In particular, these are problems for which 1) the social welfare in any Nash equilibrium is close to the optimal social welfare, and 2) the players can quickly converge to a Nash equilibrium by following a locally optimal strategy.  There is a <strong>lot </strong>of interest in these types of algorithms in the distributed computing community right now, partially, I think, because of the great fact that they are simple to describe and implement.  Selfish algorithms are also simple in the sense that they may arise naturally when large groups of agents get together; in some sense, the algorithmic engineer is (at least partially) removed from the picture.</li>
</ul>
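<p>To make the &#8220;simple to describe&#8221; point concrete, here is a minimal sketch of the greedy set cover algorithm mentioned above, written in Haskell. The function name <code>greedySetCover</code> and the choice of <code>Data.Set</code> are illustrative assumptions, not taken from any particular reference implementation.</p>

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Set as S

-- Greedy set cover: repeatedly pick the subset that covers the most
-- elements that are still uncovered, until nothing is left to cover
-- (or until no subset can make further progress).
greedySetCover :: Ord a => S.Set a -> [S.Set a] -> [S.Set a]
greedySetCover universe subsets = go universe
  where
    go uncovered
      | S.null uncovered = []   -- everything is covered
      | null subsets     = []   -- nothing to choose from
      | S.null gain      = []   -- remaining elements cannot be covered
      | otherwise        = best : go (uncovered `S.difference` best)
      where
        -- The subset with the largest overlap with the uncovered elements.
        best = maximumBy (comparing (S.size . S.intersection uncovered)) subsets
        gain = S.intersection uncovered best
```

<p>For example, with the universe {1,...,5} and the subsets {1,2,3}, {2,4}, {3,4}, {4,5}, the sketch first picks {1,2,3} and then {4,5}. The whole algorithm fits in a handful of lines, while the proof that it is within a logarithmic factor of the optimal cover takes considerably more work.</p>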
<p>Are there other ways of formulating  problems that ensure &#8220;simple&#8221; algorithms?  Perhaps you, dear reader, can help me flesh out this list.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Windows Azure MapReduce Demo]]></title>
<link>http://agilecat.wordpress.com/2009/10/27/windows-azure-mapreduce-demo/</link>
<pubDate>Mon, 26 Oct 2009 22:09:39 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/10/27/windows-azure-mapreduce-demo/</guid>
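<description><![CDATA[MapReduce concepts for real-time systems? Simon Guest has turned the TechED content I introduced earlier into a webcast on his own blog. Each of the figures below is merely]]></description>
<content:encoded><![CDATA[MapReduce concepts for real-time systems? Simon Guest has turned the TechED content I introduced earlier into a webcast on his own blog. Each of the figures below is merely]]></content:encoded>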
</item>
<item>
<title><![CDATA[Swarm: Distributed Computation in the Cloud]]></title>
<link>http://markusklems.wordpress.com/2009/10/11/swarm-distributed-computation-in-the-cloud/</link>
<pubDate>Sun, 11 Oct 2009 23:04:05 +0000</pubDate>
<dc:creator>Markus Klems</dc:creator>
<guid>http://markusklems.wordpress.com/2009/10/11/swarm-distributed-computation-in-the-cloud/</guid>
<description><![CDATA[Ian Clarke, former lead developer of Freenet, is working on a cool project, named Swarm. The key que]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Ian Clarke, former lead developer of Freenet, is working on a cool project, named <a href="http://code.google.com/p/swarm-dpl/" target="_blank">Swarm</a>. The key question that inspired Swarm is: <strong>how to distribute data and computation across multiple computers such that the programmer need not think about it?</strong></p>
<p>Based on <a href="http://en.wikipedia.org/wiki/Scala_(programming_language)" target="_blank">Scala 2.8</a>, Swarm draws upon the programming language feature &#8220;<strong>portable &#38; delimited continuations</strong>&#8221;, i.e. the capability of migrating a piece of a thread to a different computer (&#8220;<strong>move the computation, not the data</strong>&#8221;). The promise: let the programming framework handle the distribution problem, instead of having the programmer care about it (as with MapReduce, databases, &#8230;).</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Hadoop World 2009 Report]]></title>
<link>http://agilecat.wordpress.com/2009/10/08/hadoop-world-2009-%e3%83%ac%e3%83%9d%e3%83%bc%e3%83%88/</link>
<pubDate>Thu, 08 Oct 2009 13:32:25 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/10/08/hadoop-world-2009-%e3%83%ac%e3%83%9d%e3%83%bc%e3%83%88/</guid>
<description><![CDATA[On October 2, I went to Hadoop World 2009, hosted by Cloudera in New York. My overall impression was that the very concept of the enterprise is changing in a big way]]></description>
<content:encoded><![CDATA[On October 2, I went to Hadoop World 2009, hosted by Cloudera in New York. My overall impression was that the very concept of the enterprise is changing in a big way]]></content:encoded>
</item>
<item>
<title><![CDATA[Programming Praxis - MapReduce]]></title>
<link>http://bonsaicode.wordpress.com/2009/10/06/programming-praxis-mapreduce/</link>
<pubDate>Tue, 06 Oct 2009 17:40:41 +0000</pubDate>
<dc:creator>Remco Niemeijer</dc:creator>
<guid>http://bonsaicode.wordpress.com/2009/10/06/programming-praxis-mapreduce/</guid>
<description><![CDATA[In today&#8217;s Programming Praxis exercise, we have to implement the famous MapReduce algorithm. L]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In <a href="http://programmingpraxis.com/2009/10/06/mapreduce/" target="_blank">today&#8217;s</a> Programming Praxis exercise, we have to implement the famous MapReduce algorithm. Let&#8217;s get going, shall we?</p>
<p>First, some imports:</p>
<pre style="color:#000000;background-color:#ffffff;font-size:9pt;font-family:'Courier New';">import Control<span style="color:#ff0000;">.</span>Arrow
import Data<span style="color:#ff0000;">.</span><span style="color:#0000ff;">Char</span>
import Data<span style="color:#ff0000;">.</span>List
import qualified Data<span style="color:#ff0000;">.</span>Map as M
</pre>
<p>Since I wasn&#8217;t here for the Red-Black Tree exercise, I&#8217;ll just use Maps.</p>
<pre style="color:#000000;background-color:#ffffff;font-size:9pt;font-family:'Courier New';">mapReduce <span style="color:#ff0000;">::</span> Ord k <span style="color:#ff0000;">=&#62; (</span>a <span style="color:#ff0000;">-&#62; (</span>k<span style="color:#ff0000;">,</span> v<span style="color:#ff0000;">)) -&#62; (</span>v <span style="color:#ff0000;">-&#62;</span> v <span style="color:#ff0000;">-&#62;</span> v<span style="color:#ff0000;">) -&#62;</span>
                      <span style="color:#ff0000;">(</span>k <span style="color:#ff0000;">-&#62;</span> k <span style="color:#ff0000;">-&#62;</span> <span style="color:#0000ff;">Bool</span><span style="color:#ff0000;">) -&#62; [</span>a<span style="color:#ff0000;">] -&#62; [(</span>k<span style="color:#ff0000;">,</span> v<span style="color:#ff0000;">)]</span>
mapReduce m r lt <span style="color:#ff0000;">=</span> <span style="color:#ec7f15;">sortBy</span> <span style="color:#ff0000;">(</span>\<span style="color:#ff0000;">(</span>a<span style="color:#ff0000;">,</span>_<span style="color:#ff0000;">) (</span>b<span style="color:#ff0000;">,</span>_<span style="color:#ff0000;">) -&#62;</span> if lt a b then LT else GT<span style="color:#ff0000;">) .</span>
                   M<span style="color:#ff0000;">.</span><span style="color:#ec7f15;">assocs</span> <span style="color:#ff0000;">.</span> M<span style="color:#ff0000;">.</span><span style="color:#ec7f15;">map</span> <span style="color:#ff0000;">(</span><span style="color:#ec7f15;">foldl1</span> r<span style="color:#ff0000;">) .</span>
                   M<span style="color:#ff0000;">.</span>fromListWith <span style="color:#ff0000;">(++) .</span> <span style="color:#ec7f15;">map</span> <span style="color:#ff0000;">(</span>second <span style="color:#ec7f15;">return</span> <span style="color:#ff0000;">.</span> m<span style="color:#ff0000;">)</span>
</pre>
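<p>To see what each stage of that pipeline contributes, here is a minimal sketch that runs the same three steps (map, group by key, reduce) by hand on the string "banana". The names <code>mapped</code>, <code>grouped</code> and <code>reduced</code> are illustrative and not part of the solution above.</p>

```haskell
import qualified Data.Map as M

-- Stage 1 (map): pair every input with a key and wrap the value in a
-- singleton list, so that values can later be concatenated per key.
mapped :: [(Char, [Int])]
mapped = map (\c -> (c, [1])) "banana"

-- Stage 2 (group): collect all values that share a key.
grouped :: M.Map Char [Int]
grouped = M.fromListWith (++) mapped

-- Stage 3 (reduce): fold each key's values into one result.
-- M.assocs returns the pairs in ascending key order.
reduced :: [(Char, Int)]
reduced = M.assocs (M.map (foldl1 (+)) grouped)
```

<p>Here <code>reduced</code> evaluates to <code>[('a',3),('b',1),('n',2)]</code>. Note that <code>M.fromListWith</code> applies its function as <code>f new old</code>, so within each group the values end up in reverse input order; that does not matter for addition, but it does for non-commutative reducers.</p>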
<p>With that, the version that works on files is trivial.</p>
<pre style="color:#000000;background-color:#ffffff;font-size:9pt;font-family:'Courier New';">mapReduceInput <span style="color:#ff0000;">::</span> Ord k <span style="color:#ff0000;">=&#62; (</span>a <span style="color:#ff0000;">-&#62; (</span>k<span style="color:#ff0000;">,</span> v<span style="color:#ff0000;">)) -&#62; (</span>v <span style="color:#ff0000;">-&#62;</span> v <span style="color:#ff0000;">-&#62;</span> v<span style="color:#ff0000;">) -&#62;</span>
    <span style="color:#ff0000;">(</span>k <span style="color:#ff0000;">-&#62;</span> k <span style="color:#ff0000;">-&#62;</span> <span style="color:#0000ff;">Bool</span><span style="color:#ff0000;">) -&#62; (</span><span style="color:#0000ff;">String</span> <span style="color:#ff0000;">-&#62; [</span>a<span style="color:#ff0000;">]) -&#62;</span> <span style="color:#0000ff;">FilePath</span> <span style="color:#ff0000;">-&#62;</span> <span style="color:#0000ff;">IO</span> <span style="color:#ff0000;">[(</span>k<span style="color:#ff0000;">,</span> v<span style="color:#ff0000;">)]</span>
mapReduceInput m r lt g <span style="color:#ff0000;">=</span> <span style="color:#ec7f15;">fmap</span> <span style="color:#ff0000;">(</span>mapReduce m r lt <span style="color:#ff0000;">.</span> g<span style="color:#ff0000;">) .</span> <span style="color:#ec7f15;">readFile</span>
</pre>
<p>In order to test our algorithm, let&#8217;s reproduce the tests from Programming Praxis:</p>
<pre style="color:#000000;background-color:#ffffff;font-size:9pt;font-family:'Courier New';">anagrams <span style="color:#ff0000;">=</span> <span style="color:#ec7f15;">map snd</span> <span style="color:#ff0000;">.</span> mapReduce <span style="color:#ff0000;">(</span><span style="color:#ec7f15;">sort</span> <span style="color:#ff0000;">&#38;&#38;&#38;</span> <span style="color:#ec7f15;">id</span><span style="color:#ff0000;">) (</span>\a b <span style="color:#ff0000;">-&#62;</span> a <span style="color:#ff0000;">++</span> <span style="color:#ff0000;">" "</span> <span style="color:#ff0000;">++</span> b<span style="color:#ff0000;">) (&#60;)</span>

getWords <span style="color:#ff0000;">=</span> <span style="color:#ec7f15;">concat</span> <span style="color:#ff0000;">.</span> <span style="color:#ec7f15;">zipWith</span> <span style="color:#ff0000;">(</span>\i <span style="color:#ff0000;">-&#62;</span> <span style="color:#ec7f15;">map</span> <span style="color:#ff0000;">(</span>\w <span style="color:#ff0000;">-&#62; (</span>w<span style="color:#ff0000;">, [</span>i<span style="color:#ff0000;">])) .</span> <span style="color:#ec7f15;">words</span><span style="color:#ff0000;">) [</span><span style="color:#a900a9;">1</span><span style="color:#ff0000;">..] .</span>
           <span style="color:#ec7f15;">map</span> <span style="color:#ff0000;">(</span><span style="color:#ec7f15;">map</span> clean<span style="color:#ff0000;">) .</span> <span style="color:#ec7f15;">lines</span> where
           clean c <span style="color:#ff0000;">=</span> if <span style="color:#ec7f15;">isAlphaNum</span> c <span style="color:#ff0000;">&#124;&#124;</span> <span style="color:#ec7f15;">isSpace</span> c then c else <span style="color:#ff0000;">' '</span>

xref <span style="color:#ff0000;">=</span> mapReduceInput <span style="color:#ec7f15;">id</span> <span style="color:#ff0000;">(</span><span style="color:#ec7f15;">flip union</span><span style="color:#ff0000;">) (&#60;)</span> getWords

main <span style="color:#ff0000;">=</span> do <span style="color:#ec7f15;">print</span> $ mapReduce <span style="color:#ff0000;">(</span>\x <span style="color:#ff0000;">-&#62; (</span>x<span style="color:#ff0000;">,</span> <span style="color:#a900a9;">1</span><span style="color:#ff0000;">)) (+) (&#60;)</span> <span style="color:#ff0000;">"banana"</span>
          <span style="color:#ec7f15;">print</span> $ anagrams <span style="color:#ff0000;">[</span><span style="color:#ff0000;">"time"</span><span style="color:#ff0000;">,</span> <span style="color:#ff0000;">"stop"</span><span style="color:#ff0000;">,</span> <span style="color:#ff0000;">"pots"</span><span style="color:#ff0000;">,</span> <span style="color:#ff0000;">"cars"</span><span style="color:#ff0000;">,</span> <span style="color:#ff0000;">"emit"</span><span style="color:#ff0000;">]</span>
          <span style="color:#ec7f15;">print</span> <span style="color:#ff0000;">=&#60;&#60;</span> xref <span style="color:#ff0000;">"mapreduce.txt"</span>
</pre>
<p>Everything&#8217;s working correctly.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Cloud MapReduce -- MapReduce built on a cloud operating system]]></title>
<link>http://huanliu.wordpress.com/2009/10/05/cloud-mapreduce-mapreduce-built-on-a-cloud-operating-system/</link>
<pubDate>Mon, 05 Oct 2009 07:38:14 +0000</pubDate>
<dc:creator>huanliu</dc:creator>
<guid>http://huanliu.wordpress.com/2009/10/05/cloud-mapreduce-mapreduce-built-on-a-cloud-operating-system/</guid>
<description><![CDATA[We have finally finished a cool project &#8212; Cloud MapReduce &#8212; that we have been working on]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>We have finally finished a cool project, Cloud MapReduce, that we have been working on, on and off, for almost the whole past year. It is a new MapReduce implementation built on top of a cloud operating system. I described <a title="what is cloud operating system" href="http://huanliu.wordpress.com/2009/07/20/what-is-a-cloud-operating-system/">what a cloud operating system is</a> before. We worked hard to understand how a cloud operating system (OS) differs from a traditional OS.  I think we have found the key difference: a cloud OS&#8217;s scalability. Unlike a traditional OS, a cloud OS has to be much more scalable because it must manage a large infrastructure (much bigger than a PC) and it must serve many customers.  By exploiting a cloud OS&#8217;s scalability, Cloud MapReduce achieves three advantages over other MapReduce implementations, such as <a href="http://hadoop.apache.org/">Hadoop</a>:</p>
<p><strong>Faster</strong>: Cloud MapReduce is faster than Hadoop, up to 60 times faster in one case.</p>
<p><strong>More scalable</strong>: No single bottleneck, i.e., no single master node that coordinates everything. It is a fully distributed implementation.</p>
<p><strong>Simpler</strong>: Only 3,000 lines of Java code, which means it is very easy to change to suit your needs. Have you ever thought about changing Hadoop? I get a headache just thinking about the 280K lines of code in Hadoop.</p>
<p>I encourage you to read the <a href="http://huanliu.googlepages.com/cloudos.pdf">Cloud MapReduce</a> technical report to learn more about what we have done.</p>
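<p>This post does not spell out how work is coordinated without a master, so the following is only a sketch of one common way to do it: workers pull tasks from a shared queue instead of having work assigned to them. It is written in Haskell rather than the project's Java, and the in-memory <code>MVar</code> queue and the names <code>Queue</code>, <code>dequeue</code>, <code>worker</code> and <code>runWorkers</code> are assumptions made for illustration, not Cloud MapReduce's actual design.</p>

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import qualified Data.Map as M

-- Hypothetical stand-in for a shared queue service: a list of pending tasks.
type Queue a = MVar [a]

-- Atomically take the next task, if there is one.
dequeue :: Queue a -> IO (Maybe a)
dequeue q = modifyMVar q $ \ts -> case ts of
  []         -> return ([], Nothing)
  (t : rest) -> return (rest, Just t)

-- A worker keeps pulling tasks until the queue is empty. No master
-- decides which worker handles which task.
worker :: Queue String -> MVar (M.Map String Int) -> IO ()
worker q results = do
  next <- dequeue q
  case next of
    Nothing   -> return ()
    Just line -> do
      -- Word count for this task, merged into the shared result.
      let counts = M.fromListWith (+) [(w, 1) | w <- words line]
      modifyMVar_ results (return . M.unionWith (+) counts)
      worker q results

-- Start n workers on the given tasks and wait for all of them to finish.
runWorkers :: Int -> [String] -> IO (M.Map String Int)
runWorkers n tasks = do
  q       <- newMVar tasks
  results <- newMVar M.empty
  dones   <- mapM (const newEmptyMVar) [1 .. n]
  mapM_ (\d -> forkIO (worker q results >> putMVar d ())) dones
  mapM_ takeMVar dones
  readMVar results
```

<p>Calling <code>runWorkers 3 ["a b a", "b c", "a"]</code> yields a map with a = 3, b = 2 and c = 1, whichever worker happens to pick up each line. In a real deployment the queue itself would have to be a scalable service, otherwise it simply becomes the new bottleneck.</p>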
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[On MapReduce and Relational Databases - Part 1]]></title>
<link>http://hypecycles.wordpress.com/2009/10/03/mapreduce-1/</link>
<pubDate>Sat, 03 Oct 2009 21:55:43 +0000</pubDate>
<dc:creator>amrith</dc:creator>
<guid>http://hypecycles.wordpress.com/2009/10/03/mapreduce-1/</guid>
<description><![CDATA[This is the first of a two-part blog post that presents a perspective on the recent trend to integra]]></description>
<content:encoded><![CDATA[This is the first of a two-part blog post that presents a perspective on the recent trend to integra]]></content:encoded>
</item>
<item>
<title><![CDATA[On MapReduce and Relational Databases - Part 2]]></title>
<link>http://hypecycles.wordpress.com/2009/10/03/mapreduce-2/</link>
<pubDate>Sat, 03 Oct 2009 21:08:52 +0000</pubDate>
<dc:creator>amrith</dc:creator>
<guid>http://hypecycles.wordpress.com/2009/10/03/mapreduce-2/</guid>
<description><![CDATA[This is the second of a two-part blog post that presents a perspective on the recent trend to integr]]></description>
<content:encoded><![CDATA[This is the second of a two-part blog post that presents a perspective on the recent trend to integr]]></content:encoded>
</item>
<item>
<title><![CDATA[What Is the Google Sites API?]]></title>
<link>http://agilecat.wordpress.com/2009/09/29/google-sites-api-%e3%81%a8%e3%81%af%ef%bc%9f/</link>
<pubDate>Mon, 28 Sep 2009 22:16:57 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/09/29/google-sites-api-%e3%81%a8%e3%81%af%ef%bc%9f/</guid>
<description><![CDATA[Google Sites API Lets Developers Move Data to, from Wikis. From eWeek, by Clint Boulton, 2009-09-24]]></description>
<content:encoded><![CDATA[Google Sites API Lets Developers Move Data to, from Wikis. From eWeek, by Clint Boulton, 2009-09-24]]></content:encoded>
</item>
<item>
<title><![CDATA[The Anatomy of Hadoop I/O Pipeline _5]]></title>
<link>http://agilecat.wordpress.com/2009/09/21/the-anatomy-of-hadoop-io-pipeline-_5/</link>
<pubDate>Sun, 20 Sep 2009 23:12:00 +0000</pubDate>
<dc:creator>Agile Cat</dc:creator>
<guid>http://agilecat.wordpress.com/2009/09/21/the-anatomy-of-hadoop-io-pipeline-_5/</guid>
<description><![CDATA[From &lt;http://developer.yahoo.net/blogs/hadoop/&gt; Optimizing Output Pipeline Overall, including ]]></description>
<content:encoded><![CDATA[From &lt;http://developer.yahoo.net/blogs/hadoop/&gt; Optimizing Output Pipeline Overall, including ]]></content:encoded>
</item>

</channel>
</rss>
