<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>map-reduce &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/map-reduce/</link>
	<description>Feed of posts on WordPress.com tagged "map-reduce"</description>
	<pubDate>Tue, 01 Dec 2009 18:44:57 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Apache Mahout 0.2 Released - Now classify, cluster and generate recommendations!]]></title>
<link>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</link>
<pubDate>Wed, 18 Nov 2009 13:48:32 +0000</pubDate>
<dc:creator>TechDigger</dc:creator>
<guid>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</guid>
<description><![CDATA[Apache Mahout For the past two years, I have been working with this amazing bunch of people whilst, ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><div class="wp-caption alignright" style="width: 92px"><a href="http://lucene.apache.org/mahout"><img src="http://lucene.apache.org/mahout/images/Mahout-logo-82x100.png" alt="Apache Mahout" width="82" height="100" /></a><p class="wp-caption-text">Apache Mahout</p></div>
<p align="justify">
For the past two years, I have been working with an amazing bunch of people on a project called <a href="http://lucene.apache.org/mahout">Mahout</a>, while being paid by Google through its Summer of Code program. As the name suggests, it is trying to tame the young beast known as <a href="http://hadoop.apache.org">Hadoop</a>. I have received a lot from the community. Being part of the project has given me real exposure to Java, data mining and machine learning, and hands-on experience with distributed systems like <a href="http://hadoop.apache.org">Hadoop</a>, <a href="http://hadoop.apache.org/hbase">HBase</a> and <a href="http://hadoop.apache.org/pig">Pig</a>. The project is still in its infancy, but its ambitions are sky-high. I am happy to announce the second release of the project, and proud to be a part of it. I hope people will adopt it in their projects and that it becomes the de facto standard machine learning library, the way Lucene and Hadoop have become in their respective focus areas.
</p>
<p>If you are already excited and want to take it for a ride, read Grant&#8217;s article on IBM developerWorks <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">here</a>.<br />
The release announcement follows.</p>
<div align="justify" style="font-size:90%;border:1px dashed #337733;padding:10px;">
<p>Apache Mahout 0.2 has been released and is now available for public download at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></p>
<p>Up to date maven artifacts can be found in the Apache repository at<br />
<a href="https://repository.apache.org/content/repositories/releases/org/apache/mahout/">https://repository.apache.org/content/repositories/releases/org/apache/mahout/</a></p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. http://www.apache.org/licenses/LICENSE-2.0</p>
<p>Mahout is a machine learning library meant to scale: Scale in terms of community to support anyone interested in using machine learning. Scale in terms of business by providing the library under a commercially friendly, free software license. Scale in terms of computation to the size of data we manage today.</p>
<p>Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problems like clustering, collaborative filtering and classification<br />
over terabytes of data across thousands of computers.</p>
<p>Implemented with scalability in mind, the latest release brings many performance optimizations so that the library performs well even in a single-node setup.</p>
<p>The complete changelist can be found here:</p>
<p><a href="http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278">http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278</a></p>
<p>New Mahout 0.2 features include</p>
<ul>
<li>Major performance enhancements in Collaborative Filtering, Classification and Clustering</li>
<li>New: Latent Dirichlet Allocation (LDA) implementation for topic modelling</li>
<li>New: Frequent Itemset Mining for mining top-k patterns from a list of transactions</li>
<li>New: Decision Forests implementation for Decision Tree classification (In Memory &#38; Partial Data)</li>
<li>New: HBase storage support for Naive Bayes model building and classification</li>
<li>New: Generation of vectors from Text documents for use with Mahout Algorithms</li>
<li>Performance improvements in various Vector implementations</li>
<li>Tons of bug fixes and code cleanup</li>
</ul>
<p>Getting started: New to Mahout?</p>
<ul>
<li> Download Mahout at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></li>
<li> Check out the Quick start: <a href="http://cwiki.apache.org/MAHOUT/quickstart.html">http://cwiki.apache.org/MAHOUT/quickstart.html</a></li>
<li> Read the Mahout Wiki: <a href="http://cwiki.apache.org/MAHOUT">http://cwiki.apache.org/MAHOUT</a></li>
<li> Join the community by subscribing to mahout-user@lucene.apache.org</li>
<li> Give back: <a href="http://www.apache.org/foundation/getinvolved.html">http://www.apache.org/foundation/getinvolved.html</a></li>
<li> Consider adding yourself to the Powered By wiki page: <a href="http://cwiki.apache.org/MAHOUT/poweredby.html">http://cwiki.apache.org/MAHOUT/poweredby.html</a></li>
</ul>
<p>For more information on Apache Mahout, see <a href="http://lucene.apache.org/mahout">http://lucene.apache.org/mahout</a></p>
</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Map Reduce Using R]]></title>
<link>http://findingdelta.wordpress.com/2009/11/17/map_reduce_using_r/</link>
<pubDate>Tue, 17 Nov 2009 12:00:43 +0000</pubDate>
<dc:creator>mattalcock</dc:creator>
<guid>http://findingdelta.wordpress.com/2009/11/17/map_reduce_using_r/</guid>
<description><![CDATA[From Revolutions There&#8217;s been a lot of buzz recently around the MapReduce algorithm and its fa]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://blog.revolution-computing.com/2009/11/hadoop-ported-to-r.html">From Revolutions</a></p>
<p>There&#8217;s been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It&#8217;s the go-to algorithm for performing any kind of analytical computation on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you&#8217;re an R programmer, you&#8217;ve probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions &#8212; the &#8220;apply&#8221; functions &#8212; designed to evaluate a function over a series of data values (the &#8220;map&#8221; step) and collate and condense the results (the &#8220;reduce&#8221; step).</p>
<p>In fact, you can almost boil it down to a single line of R code:<br />
<code><br />
sapply(map(data), reduce)<br />
</code><br />
where map is a function, which when applied to a data set data, splits the data into a list with each list element collecting values with a common key assignment, and reduce is a function that processes each element of the list to create a single value from all the data mapped to each key value.</p>
<p>It&#8217;s not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The MapReduce package is available from any CRAN mirror.</p>
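<p>As a rough illustration of the one-liner above, here is a minimal single-machine sketch in Python rather than R. It is not the code of the MapReduce package or of Hadoop; the helper names <code>map_by_key</code> and <code>reduce_groups</code> and the sample data are made up for this example. The &#8220;map&#8221; step groups values under a common key, and the &#8220;reduce&#8221; step condenses each group to a single value.</p>

```python
from collections import defaultdict


def map_by_key(records, key_fn):
    """The "map" step: split records into a dict of lists, one list per key."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


def reduce_groups(groups, reduce_fn):
    """The "reduce" step: collapse each key's list into a single value."""
    return {key: reduce_fn(values) for key, values in groups.items()}


# Hypothetical data: count words by their first letter.
words = ["map", "reduce", "mahout", "r", "hadoop", "hop"]
groups = map_by_key(words, key_fn=lambda w: w[0])
counts = reduce_groups(groups, reduce_fn=len)
print(counts)
```

<p>Nothing here is distributed. In a real cluster the groups would be built and reduced on different machines, which is the infrastructure part that Hadoop supplies.</p>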
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce Online! (and some gimmes)]]></title>
<link>http://databeta.wordpress.com/2009/10/18/mapreduce-online/</link>
<pubDate>Mon, 19 Oct 2009 02:22:13 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/10/18/mapreduce-online/</guid>
<description><![CDATA[Hadoop MapReduce is a batch-processing system.  Why?  Because that&#8217;s the way Google described ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.flickr.com/photos/altemark/273968506/"><img class="alignright size-full wp-image-210" title="oscillo" src="http://databeta.wordpress.com/files/2009/10/oscillo1.jpg" alt="oscillo" width="240" height="203" /></a>Hadoop MapReduce is a batch-processing system.  Why?  Because that&#8217;s the way Google described their MapReduce implementation.</p>
<p>But it doesn&#8217;t have to be that way. Introducing <a title="HOP Tech Report" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html">HOP: the Hadoop Online Prototype</a>. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like &#8220;early returns&#8221; from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop.  And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster.  This is a project led by <a title="Tyson Condie homepage" href="http://www.cs.berkeley.edu/~tcondie/">Tyson Condie</a>, in collaboration with folks at Berkeley and Yahoo! Research.</p>
<p><!--more-->Background: Parallel database engines have always been able to stream out results to certain queries while they run.  And a bunch of us did research over the years to leverage that feature to do more things, like <a title="Berkeley CONTROL project" href="http://control.cs.berkeley.edu">online aggregation</a> (for &#8220;early returns&#8221; from inherently batch jobs), and <a title="telegraph project" href="http://telegraph.cs.berkeley.edu/index.html">continuous queries</a> (for processing infinitely streaming data inputs, producing infinitely streaming data outputs).  So why not do the same for MapReduce engines?  Why are they limited to batch processing?</p>
<p>The natural response is that the <a title="google mapreduce paper" href="http://labs.google.com/papers/mapreduce.html">Dean/Ghemawat MapReduce design</a> from Google enables high availability and performance reliability at unheard-of scale, via a simple checkpoint/restart fault-tolerance mechanism that requires batch-writing things to disk.  And I have to say, their approach has an elegant simplicity of mechanism, especially as compared to our work on FLuX from the same year. (FLuX = [<a title="FLuX-Fault tolerance" href="http://db.cs.berkeley.edu/papers/sigmod04-fluxft.pdf">Fault-tolerant</a>] [<a title="FLuX load-balancing" href="http://db.cs.berkeley.edu/papers/icde03-fluxlb.pdf">Load balancing</a>] [<a title="Graefe's exchange paper" href="http://portal.acm.org/citation.cfm?id=98720">eXchange</a>].  We tackled the same challenges without resorting to batch processing, via a process-pairs mechanism.  But in fairness, FLuX&#8217;s ambitions make it much more complex than the D/G approach.)</p>
<p>So in our recent work, we tried to preserve the D/G fault-tolerance model already in Hadoop, while also providing the pipelining of results across tasks and jobs.</p>
<p>HOP is infrastructure.  There&#8217;s lots more to do now that it&#8217;s out there &#8212; from systems work (e.g. adaptive dataflow processing and optimization, efficient streaming aggregates, intelligent scheduling&#8230;) to algorithmics (sampling, robust estimators and confidence intervals, anytime machine learning techniques&#8230;).  We hope to get the HOP stuff into the Hadoop distribution, but in the meantime get in touch with us if you&#8217;re eager to work with the code.</p>
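<p>To make the &#8220;early returns&#8221; idea from online aggregation concrete, here is a toy Python sketch. It is not HOP code and says nothing about how HOP pipelines data between tasks; the function name <code>running_mean</code> and its <code>report_every</code> parameter are invented for illustration. The point is only that an aggregate can report a running estimate while the input is still streaming in, instead of waiting for the whole batch.</p>

```python
def running_mean(stream, report_every):
    """Yield (rows_seen, mean_so_far) snapshots while consuming the stream."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        if count % report_every == 0:
            # Early return: an estimate based on the rows seen so far.
            yield count, total / count
    # Final exact answer once the input ends (skipped if it was just reported).
    if count and count % report_every != 0:
        yield count, total / count


# Hypothetical input: the numbers 1..10 arriving one at a time.
snapshots = list(running_mean(range(1, 11), report_every=4))
print(snapshots)
```

<p>Each snapshot is an estimate that gets refined as more rows arrive, and the last one is the exact batch answer. A real system would also attach a confidence interval to each estimate, as the algorithmics list above hints.</p>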
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Advanced Analytics on Multi-Terabyte Datasets- Conferences]]></title>
<link>http://decisionstats.wordpress.com/2009/10/15/advanced-analytics-on-multi-terabyte-datasets-conferences/</link>
<pubDate>Thu, 15 Oct 2009 04:02:55 +0000</pubDate>
<dc:creator>ajayohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/10/15/advanced-analytics-on-multi-terabyte-datasets-conferences/</guid>
<description><![CDATA[Some news on Data Mining 2009 by Aster Data - SAS and Aster Data to Present &#8220;Advanced Analytic]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Some news on Data Mining 2009 by Aster Data -</p>
<table border="0" cellspacing="0" cellpadding="0" width="347">
<tbody>
<tr>
<td style="color:#b61234;font-size:12px;font-weight:bold;"></td>
</tr>
<tr>
<td height="10"></td>
</tr>
<tr>
<td><img src="http://img.en25.com/eloquaimages/clients/AsterDataSystems/%7Bd03c25dd-602c-4667-aa27-8832b84566fd%7D_dotted-line.gif" alt="" width="343" height="1" /></td>
</tr>
<tr>
<td height="10"></td>
</tr>
<tr>
<td height="10">
<table style="height:72px;" border="0" cellspacing="0" cellpadding="0" width="722">
<tbody>
<tr>
<td width="11" valign="top"><img src="http://img.en25.com/eloquaimages/clients/AsterDataSystems/%7B4afc4ef4-af8b-49a1-af3e-c3875f2c5008%7D_circle-bullet.gif" alt="" width="11" height="13" /></td>
<td width="6"><img src="http://img.en25.com/eloquaimages/clients/AsterDataSystems/%7B3ea58d0c-8d1c-4f7a-a9de-5ec6a3fe9287%7D_spacer.gif" alt="" width="6" height="1" /></td>
<td colspan="2"><span style="font-size:11px;"><strong>SAS and Aster Data to Present &#8220;Advanced Analytics on Multi-Terabyte Datasets&#8221; at M2009 in Las Vegas &#8211; Oct. 26-27</strong><br />
Learn how the tight coupling of SQL and MapReduce provided by Aster Data creates new ‘big data’ analytics opportunities when combined with SAS. Aster Data will exhibit throughout the event. </span></td>
</tr>
<tr>
<td colspan="4" align="right"><a style="text-align:right;color:#b61234;font-size:11px;font-weight:bold;text-decoration:none;" href="http://app.en25.com/e/er.aspx?s=1015&#38;lid=126&#38;elq=e7a02aff561e428cafbc990937112296">More</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>And also a nice  webcast by Curt Monash on the same Big Data topic-</p>
<table style="height:104px;" border="0" cellspacing="0" cellpadding="0" width="693">
<tbody>
<tr>
<td width="6"><img src="http://img.en25.com/eloquaimages/clients/AsterDataSystems/%7B3ea58d0c-8d1c-4f7a-a9de-5ec6a3fe9287%7D_spacer.gif" alt="" width="6" height="1" /></td>
<td colspan="2"><span style="font-size:11px;"><strong>Mastering MapReduce Webinar Series, Session 1<br />
&#8220;Big Data Reality: The Role of MapReduce in Big Data Management and Analysis&#8221;- Oct. 15<br />
</strong></span><span style="font-size:11px;">Industry analyst Curt Monash explains the basics of MapReduce, key use cases, and which industries and applications are heavily using MapReduce. Topics include recommendations for integrating MapReduce in an enterprise business intelligence and data warehousing environment.</span></td>
</tr>
<tr>
<td colspan="4" align="right"><a style="text-align:right;color:#b61234;font-size:11px;font-weight:bold;text-decoration:none;" href="http://app.en25.com/e/er.aspx?s=1015&#38;lid=122&#38;elq=e7a02aff561e428cafbc990937112296">More<img src="http://img.en25.com/eloquaimages/clients/AsterDataSystems/%7Bd7614805-b248-4548-b6c2-02514508cb57%7D_read-arrow-icon.gif" border="0" alt="" width="7" height="6" /></a></td>
</tr>
</tbody>
</table>
<p>Also,</p>
<p>Here is a brief synopsis of the Aster Data (<a href="http://www.facebook.com/pages/Aster-Data-Systems/5601042375">http://www.facebook.com/pages/Aster-Data-Systems/5601042375</a>) sponsored Big Data Summit (<a href="http://www.facebook.com/pages/Big-Data-Summit/143312171156">http://www.facebook.com/pages/Big-Data-Summit/143312171156</a>), which I attended:</p>
<ul>
<li>A Plan for Large Scale Data Analytics: How to Utilize Aster <em>n</em>Cluster and Hadoop in a    Symbiotic<br />
Relationship to Support Processing in Excess of 100 Billion Rows Per Month<br />
– Michael Brown and Will Duckworth<br />
<span class="style1">(EVP, Software Engineering, comScore, Inc. and Director, Software Engineering, comScore, Inc.)</span></li>
</ul>
<p>This talk covered comScore&#8217;s special needs in handling big data, and why MapReduce and Hadoop seem to be the cost-effective solutions for very big data while the RDBMS seems stuck in the middle with mid-sized data. It was also broadly informative on the statistical challenges of the future given the explosion of data.</p>
<ul>
<li>Making Sense of Hadoop &#8211; Its Fit With Data Warehouses &#8211; Colin White<br />
<span class="style1">(President and Founder of BI Research)</span></li>
</ul>
<p>Colin brought a nice perspective on open source Hadoop vis-a-vis the proprietary packages and the traditional DBMS. His view is that no software is perfect for all needs, that every product that sells has its own good points, and that the converging solution could be a heterogeneous mix of the above.</p>
<ul>
<li>MapReduce Inside a Database System &#8211; When and How<em> Case Studies from ShareThis, Specific Media, and Other</em> &#8211; Tasso Argyros<span class="style1"> </span><span class="style1">(Chief Technology Officer and Co-Founder of Aster Data)</span></li>
</ul>
<p>This was a more detailed look at the big product launch (the Hadoop Connector) by Tasso, and an interesting look at time series analysis using nPath rather than SQL. Interesting, given the ongoing convergence of analytics and business intelligence.</p>
<p>Tasso also lived up to his reputation as a charming presenter with an excellent pitch on nPath (as his interview said).</p>
<ul>
<li>Large-Scale Analytics at LinkedIn &#8211; Jonathan Goldman<br />
<span class="style1">(Former Principal Scientist at LinkedIn)</span></li>
</ul>
<p>This was nice given Jonathan&#8217;s perspective (he has a PhD in physics and now consults for LinkedIn while maintaining his interests in education). He covered the special needs of social media websites, designing experiments on the fly with huge real-time datasets, and some interesting visualizations (for example, India and America have the second-biggest cross-country LinkedIn connections, after USA-UK). Apparently LinkedIn (<a href="http://www.facebook.com/group.php?gid=2211231478">http://www.facebook.com/group.php?gid=2211231478</a>) does not sound so good when translated into Chinese. (At dinner I learnt from a fellow Chinese student that China censors Facebook &#8211; sigh!)</p>
<ul>
<li>Networking Mixer: Beer, wine, hot hors d&#8217;oeuvres</li>
</ul>
<p>I got interviewed AFTER I had mixed some beer and wine for myself. It was the first video interview I have given. (You know, I have conducted SOME interviews by email and plan to do some more while in Vegas for Data Mining 2009 with SAS: <a href="http://www.facebook.com/group.php?gid=2227381262">http://www.facebook.com/group.php?gid=2227381262</a>.)</p>
<p>They are still editing that interview <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>&#8212;That was all &#8211; you need to send me a Facebook invite to see the rest of the NY trip or better still just join the Facebook page of Decision Stats at</p>
<p><a href="http://www.facebook.com/pages/DecisionStats/191421035186">http://www.facebook.com/pages/DecisionStats/191421035186</a></p>
<p>After two weeks I hope to have some more coverage of Data Mining 2009 while enjoying my much-needed Fall Break. Life at the University of Tennessee is looking up (<em> since we beat Georgia 45-19 <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em> )</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Big Data Event- Why am I here?]]></title>
<link>http://decisionstats.wordpress.com/2009/10/01/the-big-data-event-why-am-i-here/</link>
<pubDate>Thu, 01 Oct 2009 15:45:23 +0000</pubDate>
<dc:creator>ajayohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/10/01/the-big-data-event-why-am-i-here/</guid>
<description><![CDATA[I am here braving New York&#8217;s cold weather, as I prepare for this evening&#8217;s events. If yo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I am here braving New York&#8217;s cold weather as I prepare for this evening&#8217;s events. If you follow this blog closely (including the poems), it is a welcome change. New York is a nice city: people are friendly if you ask them nicely, and the bus is a great way to watch the city. Best of all, I like the crowds, which I grew used to while living in India.</p>
<p><strong><em><span style="text-decoration:underline;">Why Am I here? </span></em></strong></p>
<p>Because the topics discussed here are so cutting-edge that I cannot find anyone at university willing to teach me Hadoop and Map-Reduce and, at the same time, the statistics that go with them (as in, how do we do k-means clustering on a 1 terabyte dataset?).</p>
<p>I asked the organizers what makes the event special (every event promises special mojo, after all).</p>
<p>This is what they said:</p>
<p><strong>What is the unique value proposition of the event that will help developers and both current and potential customers?</strong></p>
<p>The essence of the event is to explore new innovations in massively-parallel processing data warehousing technology and how it can help companies gain more insight from their data.  Applications include fraud detection, behavioral targeting, social network analysis, better predictions/forecasting, bioinformatics, etc.  We are exploring how MapReduce and Hadoop can be integrated into the enterprise IT system to help evolve data warehousing/BI/data mining</p>
<p>And to put it even more nicely:</p>
<p><span style="font-size:11pt;line-height:22px;color:#1f497d;">“</span><span style="font-size:11pt;line-height:22px;"><em>The industry’s first big data event, Big Data Summit ‘09, being held this evening in New York City, will showcase Hadoop’s fit with MPP data warehouses. Aster Data will be presenting alongside Colin White, President and Founder of BI Research, Mike Brown of comScore Inc., and Jonathan Goldman, who represents LinkedIn.”</em></span></p>
<p><span style="font-size:11pt;line-height:22px;">That&#8217;s good enough for me to drop into the Roosevelt Hotel on East 45th Street at around 6 pm for some reluctant networking (read: beers). Five years ago, while working for GE, I used to run queries using SAS on a 147-million-row database (that was the size of the whole DB) and wait 3 hours for the results to come back. Today that much data fits very snugly on my laptop. How soon will we have terabyte-level personal computing and petabyte-level business computing, and what challenges will that pose to standard statistical assumptions and to the syncing of hardware and software? Big Big Data is an interesting area to watch.</span></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Interview Shawn Kung Sr Director Aster Data]]></title>
<link>http://decisionstats.wordpress.com/2009/10/01/interview-shawn-king-sr-director-aster-data/</link>
<pubDate>Thu, 01 Oct 2009 15:31:19 +0000</pubDate>
<dc:creator>ajayohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/10/01/interview-shawn-king-sr-director-aster-data/</guid>
<description><![CDATA[Here is an interview with Shawn Kung, Senior Director of Product Management at Aster Data. Shawn exp]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><span style="font-family:arial;line-height:normal;border-collapse:collapse;"> </span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;padding-left:30px;"><span style="color:#000000;">Here is an interview with Shawn Kung, Senior Director of Product Management at Aster Data. Shawn explains the differences between the various database technologies and Aster&#8217;s rising appeal due to its unique technological approach, and touches upon various other topics of interest to people in the BI and technology space.</span></p>
<p style="margin-left:.5in;"><img class="alignnone size-full wp-image-2725" title="image001" src="http://decisionstats.wordpress.com/files/2009/10/image001.png" alt="image001" width="102" height="102" /></p>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay - Describe your career journey from a high school student of science till today. Do you think science is a more lucrative career?</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Shawn:</strong> My career journey has spanned over a decade in several Silicon Valley technology companies.  In both high school and my college studies at Princeton, I had a fervent interest in math and quantitative economics.  Silicon Valley drew me to companies like upstart procurement software maker Ariba and database giant Oracle.  I continued my studies by returning to get a Master’s in Management Science at Stanford before going on to lead core storage systems for nearly 5 years at NetApp and subsequently Aster. </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;"> Science (whether it is math, physics, economics, or the hard engineering sciences) provides a solid foundation.  It teaches you to think and test your assumptions – those are valuable skills that can lead to both a financially lucrative and personally inspiring career.</span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay- How would you describe the difference between Map Reduce and Hadoop and Oracle and SAS, DBMS and Teradata and Aster Data products to a class of undergraduate engineers ?</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#1f497d;"><span style="color:#000000;"><strong>Shawn:</strong> Let’s start with the database guys – Oracle and Teradata.  They focus on structured data – data that has a logical schema and is manipulated via a standards-based structured query language (SQL).  Oracle tries to be everything to everyone – it does OLTP (low-latency transactions like credit card or stock trade execution apps) and some data warehousing (typically summary reporting).  Oracle’s data warehouse</span><strong><em><span style="color:#000000;"> </span></em></strong><span style="color:#000000;">is not known for large-scale data warehousing and is more often used for back-office reporting. </span></span></p>
<p style="margin-left:.5in;"><span style="color:#1f497d;"><span style="color:#000000;">Teradata is focused on data warehousing and scales very well, but is extremely expensive – it runs on high-end custom hardware and takes a mainframe approach to data processing.  This approach makes less sense as commodity hardware becomes more compute-rich and better software comes along to support large-scale MPP data warehousing. </span></span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">SAS is very different – it’s not a relational database. It really offers an application platform for data analysis, specifically data mining.  Unlike Oracle and Teradata, which are used by SQL developers and managed by DBAs, SAS is typically run in business units by data analysts – for example a quantitative marketing analyst, a statistician/mathematician, or a savvy engineer with a data mining/math background.  SAS is used to try to find patterns, understand behaviors, and offer predictive analytics that enable businesses to identify trends and make smarter decisions than their competitors.</span></p>
<p style="margin-left:.5in;"><span style="color:#1f497d;"><span style="color:#000000;">Hadoop offers an open-source framework for large-scale data processing.  MapReduce is a component of Hadoop, which also contains multiple other modules including a distributed filesystem (HDFS).  MapReduce offers a programming paradigm for distributed computing (a parallel data flow processing framework). </span></span></p>
<p style="margin-left:.5in;"><span style="color:#1f497d;"><span style="color:#000000;"> Both Hadoop and MapReduce cater to the application developer or programmer.  They do not cater to enterprise data centers or IT.  If you have a finite project in a line of business and want to get it done, Hadoop offers a low-cost way to do this.  For example, if you want to do large-scale data munging like aggregations, transformations, manipulations of unstructured data – Hadoop offers a solution for this without compromising on the performance of your main data warehouse.  Once the data munging is finished, the post-processed data set can be loaded into a database for interactive analysis or analytics.</span><strong><em><span style="color:#000000;"> </span></em></strong><span style="color:#000000;">It is a great combination of big data technologies for certain use-cases.</span></span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">Aster takes a very unique approach.  Our Aster nCluster software offers the best of all worlds – we offer the potential for deep analytics of SAS, the low-cost scalability and parallel processing of Hadoop/MapReduce, and the structured data advantages (schema, SQL, ACID compliance and transactional integrity, indexes, etc) of a relational database like Teradata and Oracle.  Often, we find complementary approaches and therefore view SAS and Hadoop/MapReduce as synergistic to a complete solution.  Data warehouses like Teradata and Oracle tend to be more competitive.</span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay- What exciting products have you launched so far and what makes them unique both from a technical developer perspective and a business owner perspective</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Shawn:</strong> Aster was the first-to-market to offer In-Database MapReduce, which provides the standards and familiarity of SQL and databases with the analytic power of MapReduce.  This is unique in that it allows technical developers and application programmers to write embedded procedural algorithms once, upload them, and let business analysts or IT folks (SQL developers, DBAs, etc) invoke these SQL-MapReduce functions forever. </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">It is highly polymorphic (re-usable), highly fault-tolerant, highly flexible (any language – Java, Python, Ruby, Perl, R statistical language, C# in the .NET world, etc) and natively massively parallel – all of which differentiate these SQL extensions from traditional dumb user-defined functions (UDFs).</span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay- &#8220;I am happy with my databases and I don&#8217;t need too much diversity or experimentation in my systems&#8221;, says a CEO to you.</strong></span></p>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>How do you convince him using quantitative numbers and not marketing adjectives?</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Shawn:</strong> Aster has dozens of production customers including big names like MySpace, LinkedIn, Akamai, Full Tilt Poker, comScore, and several yet-to-be-named retail and financial service accounts.  We have quantified proof points that show orders-of-magnitude improvements in scalability, performance, and analytic insights compared to incumbent or competitor solutions.  Our highly referenceable customers would be happy to discuss their positive experiences with the CEO.</span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">But taking a step back, there’s a fundamental concept that this CEO needs to understand first.  The world is changing – data is proliferating due to the digitization of so many applications and the emergence of unstructured data and new data types.  As the book “Competing on Analytics” argues, the world is shifting to a paradigm where companies that don’t take risks and push the limits on analytics will die like the dinosaurs. </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">IDC is projecting 10x+ growth in data over the next few years to zettabytes of aggregate data driven by digitization (Internet, digital television, RFID, etc).  The data is there and in order to compete effectively and understand your customers more intimately, you need a large-scale analytics solution like the one Aster nCluster offers.  If you hold off on experimentation and innovation, it will be too late by the time you realize you have a problem at hand.</span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay- How important is work life balance for you?</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Shawn:</strong> Very important.  I hang out with my wife most weekends – we do a lot of outdoor activities like hiking and gardening.  In Silicon Valley, it’s all too easy to get caught up in the rush of things.  Taking breaks, especially during the weekend, is important to recharge and re-energize to be as productive as possible. </span></p>
<div style="color:#500050;">
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Ajay- Are you looking for college interns and new hires? What makes Aster exciting for you, so that you are pumped up every day to go to work?</strong></span></p>
</div>
<p style="margin-left:.5in;"><span style="color:#000000;"><strong>Shawn</strong>: We’re always looking for smart, innovative, and entrepreneurial new college grads and interns, especially on the technical side.  So if you are a computer science major or recent grad or graduate student, feel free to contact us for opportunities. </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">What makes Aster exciting is two things – </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">First, the people.  Everyone is very smart and innovative so you learn a tremendous amount, which is personally gratifying and professionally useful long-term. </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">Second, Aster is changing the world! </span></p>
<p style="margin-left:.5in;"><span style="color:#000000;"> Distributed systems computing focused on big data processing and analytics – these are massive game-changers that will fundamentally change the landscape in data warehousing and analytics.  Traditional databases have been an oligopoly for over a generation – they haven’t been challenged and so the 1970s-based technology has stuck around.  The emergence of big data and low-cost commodity hardware has created a unique opportunity to carve out a brand new market…</span></p>
<p style="margin-left:.5in;"><span style="color:#000000;">What gets me pumped every day is that I have the ability to contribute to a pioneer that is quickly becoming Silicon Valley’s next great success story!</span></p>
<p style="margin-left:.5in;text-align:left;"><span style="color:#000000;"><strong><span style="text-decoration:underline;">Biography-</span></strong></span></p>
<p>Over the past decade, Shawn has led product management for some of Silicon Valley&#8217;s most successful and innovative technology companies.  Most recently, he spent nearly 5 years at Network Appliance leading Core Systems storage product management, where he oversaw the development of high availability software and Storage Systems hardware products that grew in annual revenue from $200M to nearly $800M.  Prior to NetApp, Shawn held senior product management and corporate strategy roles at Oracle Corporation and Ariba Inc.</p>
<p>Shawn holds an M.S. in Management Science and Engineering from Stanford University, where he was awarded the Valentine Fellowship (endowed by Don Valentine of Sequoia Capital).  He also received a B.A. with high honors from Princeton University.</p>
<p><strong><span style="text-decoration:underline;">About Aster</span></strong></p>
<p>Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics &#8211; the first DBMS to tightly integrate SQL with <a style="color:#005488;" href="http://www.asterdata.com/blog/index.php/category/mapreduce/" target="_blank">MapReduce</a> &#8211; providing deep insights on data analyzed on clusters of low-cost commodity hardware. The Aster <em>n</em>Cluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis.</p>
<p>Running on low-cost off-the-shelf hardware, and providing &#8216;hands-free&#8217; administration, Aster enables enterprises to meet their data warehousing needs within their budget. Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton and Ron Conway.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Interview Dr Usama Fayyad Founder Open Insights LLC]]></title>
<link>http://decisionstats.wordpress.com/2009/08/11/interview-dr-usama-fayyad-founder-open-insights-llc/</link>
<pubDate>Tue, 11 Aug 2009 22:41:53 +0000</pubDate>
<dc:creator>ajayohri</dc:creator>
<guid>http://decisionstats.wordpress.com/2009/08/11/interview-dr-usama-fayyad-founder-open-insights-llc/</guid>
<description><![CDATA[Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC (www.open-insights.com). Pri]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC <a href="http://open-insights.com">(www.open-insights.com)</a>. Prior to this he was Yahoo!&#8217;s Chief Data Officer, a role in which he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations.</p>
<p><strong><img title="Picture_004_(2)" src="http://www.decisionstats.com/wp-content/uploads/2009/08/Picture_004_2.jpg" alt="Picture_004_(2)" width="400" height="600" /></strong></p>
<p><strong>Ajay-     Describe your career in science. How would you motivate young people today to take up science careers rather than other careers?<br />
Dr Fayyad-</strong> My career started out in science and engineering. My original plan was to be in research and to become a university professor. Indeed, my first few jobs were strictly in basic research. After doing summer internships at places like GM Research Labs and JPL, my first full-time position was at the NASA &#8211; Jet Propulsion Laboratory, California Institute of Technology.</p>
<p>I started in research in Artificial Intelligence for autonomous monitoring and control and in Machine Learning and data mining. The first major success was with Caltech astronomers on using machine learning classification techniques to automatically recognize objects in a large sky survey (POSS-II – the 2nd Palomar Observatory Sky Survey).  The Survey consists of taking high resolution images of the entire northern sky. The images, when digitized, contain over 2 billion sky objects. The main problem is to recognize whether an object is a star or a galaxy. For “faint objects” – which constitute the majority of objects – this was an exceedingly hard problem that people wrestled with for 30 years. I was surprised how well the algorithms could do at solving it.</p>
<p>This was a real example of data sets where the dimensionality is so high that algorithms are better suited to solving the problem than humans – even well-trained astronomers. Our methods had over 94% accuracy on faint objects that no one could reliably classify before at better than 75% accuracy. This additional accuracy made all the difference in enabling all sorts of new science, discoveries and theories about the formation of large-scale structure in the Universe.<br />
The success of this work and its wide recognition in scientific and engineering communities led to the creation of a new group – I founded and managed the Machine Learning Systems group at JPL, which went on to address hard problems in object recognition in scientific data – mostly from remote sensing instruments – like Magellan images of the planet Venus (we recognized and classified over a million small volcanoes on the planet in collaboration with geologists at Brown University) and Earth Observing System data, including Atmospherics and storm data.<br />
At the time, Microsoft was interested in figuring out data mining applications in the corporate world and after a long recruiting cycle they got me to join the newly formed Microsoft Research as a Senior Researcher in late 1995. My work there focused on algorithms, database systems, and basic science issues in the newly formed field of Data Mining and Knowledge Discovery. We had just finished publishing a nice edited collection of chapters in a book that became very popular, and I had agreed to become the founding Editor-in-Chief of a brand new journal called Data Mining and Knowledge Discovery. This journal today is the premier scientific journal in the field. My research work at Microsoft led to several applications – especially in databases. I founded the Data Mining &#38; Exploration group at MSR and later a product group in SQL Server that built and shipped the first integrated data mining product in a large-scale commercial DBMS  &#8211; SQL Server 2000 (Analysis Services). We created extensions to the SQL language (that we called DMX) and tried to make data mining mainstream. I really enjoyed the life of doing basic research as well as having a real product group that built and shipped components in a major DBMS.<br />
That’s when I learned that the really challenging problems in the real world were not in data mining but in getting the data ready and available for analysis – Data Warehousing was a field littered with failures and data stores that were write-only (meaning data never came out!). I used to call these Data Tombs at the time, and I likened them to the pyramids in Ancient Egypt: great engineering feats to build, but really just tombs.</p>
<p>In 2000 I decided to leave the world of Research at Microsoft to do my first venture-backed start-up company – digiMine. The company wanted to solve the problem of managing the data and performing data mining and analysis over data sets, and we targeted a model of hosted data warehouses and mining applications as an ASP – one of the first Software as a Service (SaaS) firms in that arena. This began my transition from the world of research and science to business and technology.  We focused on on-line data and web analytics since the data volumes there were about 10x the size of transactional databases and most companies did not know how to deal with all that data. The business grew fast and so did the company – reaching 120 employees in about 1 year.</p>
<p>After 3 years of running a high-growth start-up and raising some $50 million in venture capital for the company, I was beginning to feel the itch again to do technical work.<br />
In June 2003, we had a chance to spin off the part of the business that was focused on difficult high-end data mining problems. This opportunity was exactly what I needed and we formed DMX Group as a spinoff company that had a solid business from its first day. At DMX Group I got to work on some of the hardest data mining problems in predicting sales of automobiles, churn of wireless users, financial scoring and credit risk analysis, and many related deep business intelligence problems.</p>
<p>Our client list included many of the Fortune 500 companies. One of these clients was Yahoo! After we had worked with Yahoo! as a client for 6 months, they decided to acquire DMX Group and use the people to build a serious data team for Yahoo!  We negotiated a deal that got about half the employees into Yahoo!, and we spun off the rest of DMX Group to continue focusing on consulting work in data mining and BI.  I thus became the industry’s first Chief Data Officer. </p>
<p> The original plan was to spend 2 years or so to help Yahoo! form the right data teams and build the data processing and targeting technology to deliver high value from its inventory of ads.<br />
Yahoo! proved to be a wonderful experience and I learned so much about the Internet. I also learned that even someone like me, who had worked on Internet data from the early days of MSN (in 1996) and who had run a web analytics firm, had still not scratched the surface of the depth of the area. I learned a lot about the Internet from Jerry Yang (Yahoo! co-founder), much about the advertising/media business from Dan Rosensweig (COO) and Terry Semel (then CEO), and lots about technology management and strategic deal-making from Farzad (Zod) Nazem, who was the CTO.</p>
<p>As Executive VP at Yahoo!, I built one of the industry’s largest and best data teams, and we were able to process over 25 terabytes of data per day and power several hundred million dollars of new revenue for Yahoo! resulting from these data systems. A year after joining Yahoo! I was asked to form a new Research Lab to study much of what we did not understand about the Internet. This was yet another return of basic research into my life. I founded Yahoo! Research to invent the new sciences of the Internet, and I wanted them to be focused on only 4 areas (the idea of focus came from my exposure to Caltech and its philosophy of picking a few areas of excellence). The goal was to become the best research lab in the world in these new focused areas. Surprisingly we did it within 2 years. I hired Prabhakar Raghavan to run Research and he did a phenomenal job in building out the Research organization. The four areas we chose were: Search and information navigation, Community Systems, Micro-economics of the Web, and Computational Advertising.  We were able to attract the top talent in the world to lead or work on these emerging areas. Yahoo! Research was a success in basic research but also in influencing product. The chief scientists for all the major areas of company products came from Yahoo! Research and all owned the product development agenda and plans: Raghu Ramakrishnan (CS for Audience), Andrew Tomkins (CS for Search), Andrei Broder (CS for Monetization) and Preston McAfee (CS for Marketplaces/Exchanges). I consider this an unprecedented achievement in the world of Research in general: excellence in basic research and huge impact on company products, all within 3-4 years.<br />
I have recently left Yahoo! and started Open Insights (<a href="http://www.open-insights.com">www.open-insights.com</a>) to focus on data strategy and helping enterprises realize the value of data, develop the right data strategies, and create new business models. Sort of an “outsourced version” of my Chief Data Officer job at Yahoo!<br />
Finally, on my advice to young people: it is not just about science careers; I would call it engineering careers. My advice to any young person, in fact, whether they plan to become a business person, a medical doctor, an artist, a lawyer, or a scientist – basic training in engineering and abstract problem solving will be a huge asset. Some of the best lawyers, doctors, and even CEOs started out with engineering training.<br />
For those young people who want to become scientists, my advice is always to look for real-world applications where the research can be conducted in their context. The reason for that is technical and sociological. From a technical perspective, the reality of an application and the fact that things have to work force a regimen of technical discipline and make sure that the new ideas are tested and challenged. Socially, working on a real application forces interactions with people who care about the problem and provides continuous feedback, which is really crucial in guiding good work (even if scientists deny this, social pressure is a big factor) – it also ensures that your work will be relevant and will evolve in relevant directions. I always tell people who are seeking basic research: “some of the deepest fundamental science problems can often be found lurking in the most mundane of applications”. So embrace applied work but always look for the abstract deep problems – that summarizes my advice.<br />
<strong>Ajay- What are the challenges of running data mining for a very large website?<br />
Dr Fayyad-</strong> There are many challenges. Most algorithms will not work due to scale. Also, most of the problems have an unusually high dimensionality – so simple tricks like sampling won’t work. You need to be very clever about how to sample and how to reduce dimensionality by applying the right variable transformations.</p>
<p>The variety of problems is huge, and the fact that the Internet is evolving and growing rapidly means that the problems are not fixed or stationary. A solution that works well today will likely fail in a few months – so you need to always innovate and always look at new approaches. Also, you need to build automated tools to help detect changes and address them as soon as they arise. </p>
<p>Problems with 1,000, 10,000, or millions of variables are very common in web challenges. Finally, whatever you do needs to work fast or else you will not be able to keep up with the data flux. Imagine falling behind on processing 25 terabytes of data per day. If you fall behind by two days, you will never be able to catch up again! Not within any reasonable budget constraint. So you try never to go down.<br />
<strong>Ajay-      What are the 5 most important things that a data miner should avoid in doing analysis?</strong></p>
<p><strong>Dr Fayyad-</strong> I never thought about this in terms of a top 5, but here are the big ones that come to mind, not necessarily in any order:<br />
a.       <em><strong>The algorithm knows nothing about the data,</strong></em> and the knowledge of the domain is in the heads of the domain experts. As I always say, <strong>an ounce of knowledge is worth a ton of data</strong> – so seek and model what the experts know or your results will look silly<br />
b.      Don’t let an algorithm fish blindly when you have lots of data. Use what you know to reduce the dimensionality quickly. <strong><em>The curse of dimensionality is never to be under-estimated</em></strong><br />
c.       Resist the temptation to cheat: selecting training and test sets can easily fool you into thinking you have something that works. Test it honestly against new data, <strong><em>never “peek” at the test data – what you see will force you to cheat without knowing it.</em></strong><br />
d.      Business rules typically dominate data mining accuracy, so <strong><em>be sure to incorporate the business and legal constraints into your mining.<br />
</em></strong>e.       I have never seen a large database in my life that came from a static distribution that was sampled independently. Real databases grow to be big through lots of systematic changes and biases, and they are collected over years from changing underlying distribution: <strong><em>segmentation is a pre-requisite to any analysis. Most algorithms assume that data is IID (independent and identically distributed)<br />
</em></strong></p>
<p><strong>Ajay-   Do you think software like Hadoop and MapReduce will change the online database permanently? What further developments do you see in this area?</strong></p>
<p><strong><br />
Dr Fayyad-</strong> I think they will change (and have already changed) the landscape dramatically, but they do not address everything. Many problems lend themselves naturally to Map-Reduce and many new approaches are enabled by Map-Reduce. However, there are many problems where M-R does not do much. I see a lot of problems being addressed by a large grid nowadays when they don’t need it. This is often a huge waste of computational resources. We need to learn how to deal with a mix of tools and platforms. I think M-R will be with us for a long time and will be a staple tool – but not a universal one.<br />
<strong>Ajay-    I look forward to the day when I have just a low priced netbook and fast internet connection, and upload a Gigabyte of data and run advanced analytics on the browser. How far or soon do you think it is possible?</strong><br />
<strong>Dr Fayyad-</strong> Well, I think the day is already here. In fact, much of our web search today is conducted exactly in that model. A lot of web analysis, and much of scientific analysis, is done like this today.<br />
<strong>Ajay-    Describe some of the conferences you are currently involved with and the research areas that excite you the most.<br />
Dr Fayyad-</strong> I am still very involved in knowledge discovery and data mining conferences (especially the KDD series), machine learning, some statistics, and some conferences on search and internet.  Most exciting conferences for me are ones that cover a mix of topics but that address real problems. Examples include understanding how social networks evolve and behave, understanding dimensionality reductions (like random projections in very high-D spaces) and generally any work that gives us insight into why a particular technique works better and where the open challenges are.<br />
<strong>Ajay-  What are the next breakthrough areas in data mining? Can we have a Google or Yahoo in the field of business intelligence as well, given its huge market potential and uncertain ROI?</strong><br />
<strong>Dr Fayyad-</strong> We already have some large and healthy businesses in BI and quite a huge industry in consulting. If you are asking particularly about the tools market then I think that market is very limited. The users of analysis tools are always going to be small in number. However, once the BI and Data Mining tools are embedded in vertical applications, then the number of users will be tremendous. That’s where you will see success.<br />
Consider the examples of Google or Yahoo! – and now Microsoft with the Bing search engine.  Search engines today would not be good without machine learning/data mining technology. In fact MLR (Machine Learned Ranking) is at the core of the ranking methodology that decides which search results bubble to the top of the list. The typical web query is 2.6 keywords long and has about a billion matches. What matters are the top 10. The function that determines these is a relevance ranking algorithm that uses machine learning to tune a formula that considers hundreds or thousands of variables about each document. So in many ways, you have a great example of this technology being used by hundreds of millions of people every day – without knowing it!<br />
Success will be in applications where the technology becomes invisible – much like the internal combustion engine in your car or the electric motor in your coffee grinder or power supply fan. I think once people start building verticalized solutions that embed data mining and BI, we will hit success. This already has happened in web search, in direct marketing, in advertising targeting, in credit scoring, in fraud detection, and so on…</p>
<p><strong>Ajay-  What do you do to relax? What publications would you recommend to data mining people, especially younger analysts, for staying up to date?<br />
Dr Fayyad-</strong> My favorite activity is sleep, when I can get it :-).  But more seriously, I enjoy reading books, playing chess, skiing (on water or snow – downhill or x-country), or any activities with my kids.  I swim a lot and that gives me much time to think and sort things out.<br />
For keeping up with the technical advances in data mining, I recommend the KDD conferences, some of the applied analytics conferences, the WWW conferences, and the data mining journals. The ACM SIGKDD publishes a nice newsletter called SIGKDD Explorations. It is free with a very low membership fee and it has a lot of announcements and survey papers on new topics and important areas (<a href="http://www.kdd.org">www.kdd.org</a>).  Also, a good list to keep up with is an email list called KDnuggets, edited by Gregory Piatetsky-Shapiro.<br />
 </p>
<p><strong><span style="text-decoration:underline;">Biography (<a href="http://www.fayyad.com/usama">www.fayyad.com/usama</a> )-</span></strong></p>
<p>Usama Fayyad founded <span>Open Insights <a href="http://open-insights.com">(www.open-insights.com)</a></span> to deliver on the vision of bridging the gap between data and insights and to help companies develop strategies and solutions not only to turn data into working business assets, but to turn the insights available from the growing amounts of data into critical components of an enterprise’s strategy for approaching markets, dealing with competitors, and acquiring and retaining customers.</p>
<p>In his prior role as Chief Data Officer of Yahoo! he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations. He also built up the targeting systems and the data strategy for how to utilize data to enhance revenue and to create new revenue sources for the company.</p>
<p>In addition, he was the founding executive for Yahoo! Research, a scientific research organization that became the top research place in the world working on inventing the new sciences of the Internet.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Generic Parallel Map Function for Java]]></title>
<link>http://ibhana.wordpress.com/2009/08/11/generic-parallel-map-function-for-java/</link>
<pubDate>Tue, 11 Aug 2009 09:57:04 +0000</pubDate>
<dc:creator>ibhana</dc:creator>
<guid>http://ibhana.wordpress.com/2009/08/11/generic-parallel-map-function-for-java/</guid>
<description><![CDATA[After a few days playing around with the functional language Erlang, I was struck by how easy it is ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>After a few days playing around with the functional language <a href="http://erlang.org/">Erlang</a>, I was struck by how easy it is to create parallel processes in the language. One case where this is useful is, given a list of data items and a function to process them, spawning a process to handle each item in parallel (when executing on a multiprocessor machine). This is made easy not only by Erlang&#8217;s excellent built-in primitives for parallel programming but also by its support for higher-order functions &#8211; enabling you to pass in an arbitrary function as an argument to the map function.</p>
<pre class="brush: cpp;">
-module(pfuncs).
-export([map/2]).

map(F, L) -&gt;
	Parent = self(),
	[ receive {Pid, Result} -&gt;
		Result
	  end ||
	  Pid &lt;- [spawn(fun() -&gt; Parent ! {self(), F(X)} end) || X &lt;- L]].
</pre>
<p>In Java, however, you can&#8217;t just pass a reference to a function. The function needs to be wrapped in an object that implements it (I believe this is called a strategy). For example:</p>
<pre class="brush: java;">
public interface Function&lt;I, R&gt;
{
	public R apply(I item) throws Exception;
}
</pre>
<p>This is a generic interface where the input is some type I and the output is R. This interface can now be used to define many different functions generically and in a type-safe manner. For example, suppose you would like to download a list of images in parallel (say, when implementing a browser):</p>
<pre class="brush: java;">
Function&lt;URI, BufferedImage&gt; getImage = new Function&lt;URI, BufferedImage&gt;()
{
	@Override
	public BufferedImage apply(URI uri) throws Exception
	{
		return ImageIO.read(uri.toURL());
	}
};
</pre>
<p>So what does the code that will process the above function as a set of parallel tasks look like?</p>
<pre class="brush: java;">
public void testPMap() throws Exception
{
	Function&lt;URI, BufferedImage&gt; getImage = new Function&lt;URI, BufferedImage&gt;()
	{
		@Override
		public BufferedImage apply(URI uri) throws Exception
		{
			return ImageIO.read(uri.toURL());
		}
	};

	Set&lt;URI&gt; input = new HashSet&lt;URI&gt;();

	// Add image urls
	input.add(URI.create(&quot;http://www.example.com/image1.jpg&quot;));
	input.add(URI.create(&quot;http://www.example.com/image2.jpg&quot;));
	input.add(URI.create(&quot;http://www.example.com/image3.jpg&quot;));
	input.add(URI.create(&quot;http://www.example.com/image4.jpg&quot;));

	List&lt;BufferedImage&gt; output = new ArrayList&lt;BufferedImage&gt;(input.size());

	ExecutorService e = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

	try
	{
		PMap.map(getImage, input, output, e);
	}
	finally
	{
		e.shutdown();
	}

	// do something with output
}
</pre>
<p>And just in case you&#8217;re wondering, the PMap.map function looks like this:</p>
<pre class="brush: java;">
import java.util.Collection;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorCompletionService;

public class PMap
{
	public static &lt;I, R&gt; void map(final Function&lt;I, R&gt; func, Iterable&lt;I&gt; input, Collection&lt;R&gt; output, Executor executor) throws InterruptedException, ExecutionException
	{
		CompletionService&lt;R&gt; ecs = new ExecutorCompletionService&lt;R&gt;(executor);

		int count = 0;
		for (final I i : input)
		{
			Callable&lt;R&gt; callableFunc = new Callable&lt;R&gt;()
			{
				@Override
				public R call() throws Exception
				{
					return func.apply(i);
				}
			};
			ecs.submit(callableFunc);
			count++;
		}

		for (int i = 0; i &lt; count; ++i)
		{
			output.add(ecs.take().get());
		}
	}
	private PMap() {}
}
</pre>
<p>By the way, the above code does not handle failures elegantly. One approach would be to have a type that specifies how a failure in one of the sub-tasks should be handled (for example, fail all tasks, or continue and return null for that individual task).</p>
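<p>A minimal sketch of that idea follows. It is an illustration under stated assumptions, not a tested extension of PMap: the names FailurePolicy and SafePMap are hypothetical, the Function interface is repeated only so the example stands alone, and an ExecutorService is assumed instead of a bare Executor so that every task yields a Future.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Same shape as the Function interface in the post, repeated for self-containment.
interface Function<I, R>
{
	R apply(I item) throws Exception;
}

// Hypothetical policy type: how a failed sub-task should be treated.
enum FailurePolicy
{
	FAIL_ALL,      // cancel the remaining tasks and rethrow the first failure seen
	NULL_ON_ERROR  // record null for the failed item and carry on
}

final class SafePMap
{
	static <I, R> List<R> map(final Function<I, R> func, Iterable<I> input,
			ExecutorService executor, FailurePolicy policy)
			throws InterruptedException, ExecutionException
	{
		// Submit one task per input item and remember its Future.
		List<Future<R>> futures = new ArrayList<Future<R>>();
		for (final I item : input)
		{
			futures.add(executor.submit(new Callable<R>()
			{
				@Override
				public R call() throws Exception
				{
					return func.apply(item);
				}
			}));
		}

		// Futures are read in submission order, so results follow input order.
		List<R> output = new ArrayList<R>(futures.size());
		for (Future<R> future : futures)
		{
			try
			{
				output.add(future.get());
			}
			catch (ExecutionException failure)
			{
				if (policy == FailurePolicy.FAIL_ALL)
				{
					for (Future<R> pending : futures)
					{
						pending.cancel(true); // no effect on tasks that already finished
					}
					throw failure;
				}
				output.add(null);
			}
		}
		return output;
	}

	private SafePMap() {}
}
```

<p>One side effect of collecting Futures in a list is that results come back in input order, whereas a CompletionService hands them back in completion order. Whether null is an acceptable placeholder depends on the caller; a wrapper type carrying either a result or the exception would be a stricter alternative.</p>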
<p>Disclaimer: This is just a programming exercise for me so I cannot guarantee that the above code is production-ready or free of bugs etc.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Using Map/Reduce for Sorting]]></title>
<link>http://karticks.wordpress.com/2009/08/05/using-mapreduce-for-sorting/</link>
<pubDate>Wed, 05 Aug 2009 08:12:34 +0000</pubDate>
<dc:creator>karticks</dc:creator>
<guid>http://karticks.wordpress.com/2009/08/05/using-mapreduce-for-sorting/</guid>
<description><![CDATA[In my previous post &#8211; Demystifying Map/Reduce &#8211; I had talked about what is Map/Reduce an]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In my previous post &#8211; <a href="http://karticks.wordpress.com/2009/07/29/the-mapreduce-design-pattern-demystified/">Demystifying Map/Reduce</a> &#8211; I talked about what Map/Reduce is and a couple of its applications: word counting and <a href="http://en.wikipedia.org/wiki/PageRank">PageRank</a>. In this post I will try to go over a couple of sorting applications of Map/Reduce.</p>
<p>Let us imagine that we have a huge dataset (i.e. 100s of files, each of which is itself quite big) of integers that we have to sort. One can use any number of sorting algorithms from the literature, including external sort (see previous post), to sort these files without assuming anything about the data. Now if the data itself is not widely distributed, i.e. the integers lie within a certain range and this range is quite small compared to the size of the data, then we can use Map/Reduce. Let us see why with the help of an example.</p>
<p>Let us assume that our data set (integers) is constrained to lie between 100 and 200 and we have 5 files each containing 1000 random integers between 100 and 200 (so a total of 5000 integers between 100 and 200). We read each file into a Map (the Map phase, which counts how many times each integer occurs in that file) and then in the Reduce phase, we produce a final Map which contains the count of all the integers. Now if we sort the integers (the keys) of the final Map and output them into a list data structure in the form of &#60;Integer, Count&#62; then we have sorted all the data (see figure below). Aside : In Java, you don&#8217;t even have to build this data structure yourself. If you use a <a href="http://java.sun.com/javase/6/docs/api/index.html?java/util/TreeMap.html">TreeMap</a> in the final Reduce phase, then all the keys (i.e. data) are already sorted as long as the key type (e.g. String, Integer, etc.) implements the <a href="http://java.sun.com/javase/6/docs/api/index.html?java/lang/Comparable.html">Comparable</a> interface (<a href="http://hadoop.apache.org">Hadoop</a> has something similar called <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparable.html">WritableComparable</a> and I am using a TreeMap that takes Strings as keys in <a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/Reducer.java">Reducer.java</a>).</p>
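<p>To make the counting idea concrete, here is a minimal sketch (it is not the code from the project linked above). It assumes each file has already been read into a list of integers; the names CountingSortSketch, countFile, reduce and expand are made up for illustration.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountingSortSketch
{
	// Map phase: count how often each integer occurs in one "file".
	static Map<Integer, Integer> countFile(List<Integer> file)
	{
		Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
		for (Integer n : file)
		{
			Integer c = counts.get(n);
			counts.put(n, c == null ? 1 : c + 1);
		}
		return counts;
	}

	// Reduce phase: add up the per-file counts. The TreeMap keeps the keys sorted.
	static TreeMap<Integer, Integer> reduce(List<Map<Integer, Integer>> perFile)
	{
		TreeMap<Integer, Integer> total = new TreeMap<Integer, Integer>();
		for (Map<Integer, Integer> counts : perFile)
		{
			for (Map.Entry<Integer, Integer> e : counts.entrySet())
			{
				Integer c = total.get(e.getKey());
				total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
			}
		}
		return total;
	}

	// Expand the sorted <Integer, Count> pairs back into a sorted list.
	static List<Integer> expand(TreeMap<Integer, Integer> total)
	{
		List<Integer> sorted = new ArrayList<Integer>();
		for (Map.Entry<Integer, Integer> e : total.entrySet())
		{
			for (int i = 0; i < e.getValue(); i++)
			{
				sorted.add(e.getKey());
			}
		}
		return sorted;
	}

	public static void main(String[] args)
	{
		List<Map<Integer, Integer>> perFile = new ArrayList<Map<Integer, Integer>>();
		perFile.add(countFile(Arrays.asList(150, 101, 150)));
		perFile.add(countFile(Arrays.asList(200, 101, 100)));
		TreeMap<Integer, Integer> total = reduce(perFile);
		System.out.println(total);
		System.out.println(expand(total));
	}
}
```

<p>The expand step is only needed if you want the full sorted list back; the &#60;Integer, Count&#62; pairs in the TreeMap already describe the sorted data.</p>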
<div id="attachment_276" class="wp-caption aligncenter" style="width: 310px"><a href="http://karticks.wordpress.com/files/2009/08/map-reduce-sorting.png"><img src="http://karticks.wordpress.com/files/2009/08/map-reduce-sorting.png?w=300" alt="Sorting using Map/Reduce" title="map-reduce-sorting" width="300" height="214" class="size-medium wp-image-276" /></a><p class="wp-caption-text">Sorting using Map/Reduce</p></div>
<p>What is the complexity of the above sorting algorithm? The Map phase is an order &#8220;<strong>n</strong>&#8221; algorithm (where <strong>n</strong> is the size of the data). The Reduce phase is an order &#8220;<strong>m</strong>&#8221; algorithm where &#8220;<strong>m</strong>&#8221; is the number of unique integers in our data set. The sort phase after the Reduce phase will be an order &#8220;<strong>m log m</strong>&#8221; operation (if we use a sort algorithm like <a href="http://en.wikipedia.org/wiki/Heapsort">heap sort</a>). Now if &#8220;<strong>m</strong>&#8221; is small compared to &#8220;<strong>n</strong>&#8221; (e.g. the size of the data set is 100,000 and the actual number of unique integers is only 100), then the complexity of the Reduce phase and final sort phase is quite small compared to the Map phase. So the total complexity of the Map/Reduce sort is of order &#8220;<strong>n</strong>&#8221; if the number of unique integers is quite small compared to the total number of integers to be sorted. However, if the number of unique integers is comparable to the size of the data, then the complexity of the Reduce phase and the final sort phase is no longer small (compared to the complexity of the Map phase) and hence it is better to use a traditional sort algorithm instead of using Map/Reduce (to avoid the overhead of the additional order &#8220;<strong>n</strong>&#8221; Map phase).</p>
<p>The Map/Reduce project has an <a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/WordCountFromFiles.java">example</a> that reads integers from <a href="http://code.google.com/p/dalalstreet/source/browse/#svn/trunk/MapReduce/res">five files</a> (each containing 5000 integers) and sorts them. The total number of unique integers is 20 and the figure below shows the result in &#8220;Integer (Count)&#8221; format. As one can see, the output is sorted and the counts add up to 25,000 (the size of the data set &#8211; 5000 integers in each of 5 files). It is a small and trivial example but I hope you find it useful to understand the application of Map/Reduce to sorting.</p>
<div id="attachment_277" class="wp-caption aligncenter" style="width: 309px"><a href="http://karticks.wordpress.com/files/2009/08/map-reduce-sort-result.png"><img src="http://karticks.wordpress.com/files/2009/08/map-reduce-sort-result.png?w=299" alt="Sort result using Map/Reduce" title="map-reduce-sort-result" width="299" height="245" class="size-medium wp-image-277" /></a><p class="wp-caption-text">Sort result using Map/Reduce</p></div>
<p>Until next time. Cheers !!</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Map/Reduce design pattern demystified]]></title>
<link>http://karticks.wordpress.com/2009/07/29/the-mapreduce-design-pattern-demystified/</link>
<pubDate>Wed, 29 Jul 2009 20:36:00 +0000</pubDate>
<dc:creator>karticks</dc:creator>
<guid>http://karticks.wordpress.com/2009/07/29/the-mapreduce-design-pattern-demystified/</guid>
<description><![CDATA[Over the last 5-6 years there has been quite a lot of buzz about the Map/Reduce design pattern (yes,]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Over the last 5-6 years there has been quite a lot of buzz about the Map/Reduce design pattern (yes, it is a design pattern), especially after the <a href="http://labs.google.com/papers/mapreduce-osdi04.pdf">famous paper</a> from Google. Now there is an open-source project &#8211; <a href="http://hadoop.apache.org/">Hadoop</a> &#8211; that helps you implement Map/Reduce on a cluster, <a href="http://aws.amazon.com/elasticmapreduce/">Amazon&#8217;s EC2 offers Map/Reduce</a>, <a href="http://www.cloudera.com">Cloudera</a> offers commercial support for Hadoop, and so on.</p>
<p>But what really is Map/Reduce? That is what I will try to answer in this post.</p>
<h4>Basic Concept</h4>
<p>Let us start with an example of external sorting. Let us say you have a large file (say about 100 GB) of integers and you have been given the job of sorting those integers, but you have access to only 1GB of memory. The general solution is to read 1GB of data into memory, do an in-place sort (using any sorting algorithm of your choice), write the sorted data out into a file, and then read the next 1GB of data from the master file, sort it, write it out, and so on. After this process is finished, you will end up with 100 files (1GB each) and the data in each file is sorted. Now you do a merge: read an integer from each of the 100 files, find the minimum, write it to a new file, and replace it with the next integer from the file it came from. Keep doing this until all the data from the 100 files has been read. The new file will contain all the integers in sorted order. The image below tries to provide an overview of the above algorithm (for more details take a look at Knuth&#8217;s fascinating discussion on external sorts in his classic : <a href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming Volume 3, Sorting and Searching</a>).</p>
<div id="attachment_251" class="wp-caption aligncenter" style="width: 310px"><a href="http://karticks.wordpress.com/files/2009/07/external-sorting.png"><img src="http://karticks.wordpress.com/files/2009/07/external-sorting.png?w=300" alt="External Sorting" title="external-sorting" width="300" height="226" class="size-medium wp-image-251" /></a><p class="wp-caption-text">External Sorting</p></div>
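<p>The merge step can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the sorted runs are in-memory lists standing in for the 100 sorted files, a PriorityQueue is used to find the minimum, and the names KWayMergeSketch, Head and merge are made up.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class KWayMergeSketch
{
	// One heap entry: the current head of a run plus the rest of that run.
	static class Head implements Comparable<Head>
	{
		final int value;
		final Iterator<Integer> rest;

		Head(int value, Iterator<Integer> rest)
		{
			this.value = value;
			this.rest = rest;
		}

		public int compareTo(Head other)
		{
			return Integer.compare(value, other.value);
		}
	}

	// Merges already-sorted runs into a single sorted list.
	static List<Integer> merge(List<List<Integer>> sortedRuns)
	{
		PriorityQueue<Head> heap = new PriorityQueue<Head>();
		for (List<Integer> run : sortedRuns)
		{
			Iterator<Integer> it = run.iterator();
			if (it.hasNext())
			{
				heap.add(new Head(it.next(), it));
			}
		}

		List<Integer> merged = new ArrayList<Integer>();
		while (!heap.isEmpty())
		{
			Head smallest = heap.poll();
			merged.add(smallest.value);
			// Refill from the run that the smallest value came from.
			if (smallest.rest.hasNext())
			{
				heap.add(new Head(smallest.rest.next(), smallest.rest));
			}
		}
		return merged;
	}

	public static void main(String[] args)
	{
		List<List<Integer>> runs = new ArrayList<List<Integer>>();
		runs.add(Arrays.asList(1, 4, 9));
		runs.add(Arrays.asList(2, 3, 10));
		runs.add(Arrays.asList(5));
		System.out.println(merge(runs));
	}
}
```

<p>With real files, each iterator would read lazily from one sorted file, so only one integer per file needs to be in memory at a time.</p>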
<p>So what was the purpose of the above example ? The key pattern in the above example is that a huge task is broken down into smaller tasks, and each small task after it has finished produces an intermediate result, and these intermediate results are combined to produce the final result. That is the core of the Map/Reduce design pattern. The processing of small tasks to produce intermediate results is referred to as the Map phase, and the processing of the intermediate results to produce the final result is referred to as the Reduce phase. The key difference from the example above is that the Map/Reduce design pattern handles data as key-value pairs, and the intermediate results are produced as key-value pairs too. Let us go over a couple of examples to understand what I mean by key-value pairs.</p>
<h4>Examples</h4>
<p>Word Counting :<br />
You have been tasked to find out how many times each word occurs in a document. Very simple: you create a hashtable, and then you read each word from the document, and check if the word exists in the hashtable. If the word does not exist in the hashtable, you insert it (using it as the <strong>key</strong>) along with a counter that is initialized to 1 (this is the <strong>value</strong>). If the word exists in the hashtable, you get its counter, increment it by one, and insert the word back into the hashtable with the new counter value. After you have finished reading the document, you iterate over the keys in the hashtable, and for every key (i.e. each word), you look up its value (the number of times it has occurred) and you have the word count for each word in the document. Now, let us say, you have to find out how many times each word occurs in a set of 100 books. Running the above process over all the books one after another will be extremely slow and time consuming. A more efficient solution would be to go through the above process for each book separately (possibly in parallel), producing the word count for each book (the Map phase), and then to process the results (the Reduce phase) from all the 100 hashtables &#8211; one for each book &#8211; to produce the overall word count for all the 100 books (for details look at the video in the Implementation section below).</p>
<p>Page Rank :<br />
Imagine that we have to parse about a million web pages (which is quite a small number considering the size of the World Wide Web) and we have to calculate how many times every URL occurs. For example, an article on Hibernate might contain a link to Java 6, a link to the Hibernate home page, and a link to the MySQL website. For every such link, our job is to find out how many times that link appears in these million web pages (I will call it the URL-Count, similar to word count). The higher the URL-Count of a specific URL, the more popular that URL is (this is a much-simplified version of the idea behind Google&#8217;s <a href="http://en.wikipedia.org/wiki/PageRank">PageRank</a> algorithm, which also weighs each link by the importance of the page it comes from). Once again, we will divide this task into 100 Map phases, where every Map phase will look at 10,000 web pages. Every time a Map phase sees a URL in a web page, it will insert it into its hashtable and increment the counter associated with that URL (just like the above word count example). After all the Map phases are finished, each hashtable contains the URL-Count of all the URLs that occurred in its set of 10,000 web pages. The Reduce phase iterates over each hashtable and combines the results (counters) of URLs that occur in multiple hashtables to produce the final URL-Count of each URL that occurred in our million web pages.</p>
<h4>Implementation</h4>
<p>A simple implementation of the Map-Reduce design pattern consists of a Mapper interface that takes an InputStream as an input and returns a Map as an output. The actual implementation of this interface will know how to read and handle the contents of the stream &#8211; e.g. extracting words, removing punctuation, parsing URLs, etc. The Reducer class just aggregates the results from all the Map phases and produces the final result Map.</p>
<p><strong>The Mapper interface &#8211; </strong></p>
<pre class="brush: java;">
public interface Mapper
{
	/**
	 * Parses the contents of the stream and updates the contents of the &lt;code&gt;Map&lt;/code&gt;
	 * with the relevant information. For example, an implementation to count
	 * words will extract words from the stream (will have to handle punctuation,
	 * line breaks, etc.), or an implementation to mine web-server log files
	 * will have to parse URL patterns, etc. The resulting &lt;code&gt;Map&lt;/code&gt; will contain
	 * the relevant information (words, URLs, etc.) and their counts.
	 *
	 * @param is A &lt;code&gt;InputStream&lt;/code&gt; that contains the content that needs to be parsed
	 * @return A &lt;code&gt;Map&lt;/code&gt; that contains relevant patterns (words, URLs, etc.) and their counts
	 */
	public Map&lt;String, Integer&gt; doMap(InputStream is);
}
</pre>
<p><strong>The Reducer class -</strong></p>
<pre class="brush: java;">
public class Reducer
{
	/**
	 * Executes the Reduce phase of the Map-Reduce pattern by iterating over
	 * all the input &lt;code&gt;Maps&lt;/code&gt; to find common keys and aggregating their results.
	 * Stores and returns the final results in the output &lt;code&gt;Map&lt;/code&gt;.
	 *
	 * @param inputMaps A list of results from Map phase
	 * @return A &lt;code&gt;Map&lt;/code&gt; that contains the final result
	 */
	public Map&lt;String, Integer&gt; doReduce(List&lt;Map&lt;String, Integer&gt;&gt; inputMaps)
	{
		Map&lt;String, Integer&gt; outputMap = new Hashtable&lt;String, Integer&gt;();

		int mapIndex = 0;

		// outer loop - iterate over all maps
		for (Map&lt;String, Integer&gt; map : inputMaps)
		{
			mapIndex++;

			Iterator&lt;String&gt; it = map.keySet().iterator();

			while (it.hasNext())
			{
				String key = it.next();

				// Get the value from the current map
				Integer value = map.get(key);

				// Now iterate over the rest of maps. The mapIndex variable starts
				// at 1 and keeps increasing because once we are done with all the
				// keys in the first map, we don't need to inspect it any more, and
				// the same holds for the second map, third map, and so on.
				for (int j = mapIndex; j &lt; inputMaps.size(); j++)
				{
					Integer v = inputMaps.get(j).get(key);

					// if you find a value for the key, add it to the current value
					// and then remove that key from that map.
					if (v != null)
					{
						value += v;
						inputMaps.get(j).remove(key);
					}
				}

				// finished aggregating all the values for this key, now store it
				// in the output map
				outputMap.put(key, value);
			}
		}

		return outputMap;
	}
}
</pre>
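<p>To show how the two pieces fit together, here is a small self-contained sketch. It is not the WordCountMapper from the project; SimpleWordCountMapper, WordCountSketch, reduce and stream are hypothetical names, the Mapper interface is repeated only so that the example compiles on its own, and the tokenizing rule (lower-case, split on anything that is not a letter a-z) is a simplifying assumption. The reduce method here is a single-pass variant that does not remove keys from the input maps.</p>

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch
{
	// Same shape as the Mapper interface above, repeated so the sketch compiles on its own.
	interface Mapper
	{
		Map<String, Integer> doMap(InputStream is);
	}

	// Hypothetical mapper: lower-cases the text and splits on anything that is not a letter.
	static class SimpleWordCountMapper implements Mapper
	{
		public Map<String, Integer> doMap(InputStream is)
		{
			Map<String, Integer> counts = new HashMap<String, Integer>();
			try
			{
				BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
				String line;
				while ((line = reader.readLine()) != null)
				{
					for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z]+"))
					{
						if (word.isEmpty())
						{
							continue;
						}
						Integer c = counts.get(word);
						counts.put(word, c == null ? 1 : c + 1);
					}
				}
			}
			catch (IOException e)
			{
				throw new RuntimeException(e);
			}
			return counts;
		}
	}

	// Single-pass reduce that leaves the input maps untouched.
	static Map<String, Integer> reduce(List<Map<String, Integer>> inputMaps)
	{
		Map<String, Integer> total = new TreeMap<String, Integer>();
		for (Map<String, Integer> map : inputMaps)
		{
			for (Map.Entry<String, Integer> e : map.entrySet())
			{
				Integer c = total.get(e.getKey());
				total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
			}
		}
		return total;
	}

	// Helper that turns a string into a stream, standing in for a file.
	static InputStream stream(String text)
	{
		return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
	}

	public static void main(String[] args)
	{
		Mapper mapper = new SimpleWordCountMapper();
		List<Map<String, Integer>> results = new ArrayList<Map<String, Integer>>();
		results.add(mapper.doMap(stream("The cat sat. The end!")));
		results.add(mapper.doMap(stream("A cat, the cat.")));
		System.out.println(reduce(results));
	}
}
```

<p>In a real run each call to doMap would happen on its own thread or machine, and only the resulting maps would be handed to the Reduce phase.</p>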
<p>The following video (best viewed in HD mode) shows a simple word counting application using Map-Reduce that walks through the steps of implementing the Mapper interface and passing the results of the Map phase to the Reduce phase and finally validating that the results are correct.</p>
<span id='plh-loop-video-embed-0' class='hidden'>done</span><ins style='text-decoration:none;'>
<div class='video-player' id='x-video-0'>
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="400" height="242" id="video-0" standby="A Map/Reduce Implementation">
  <param name="movie" value="http://v.wordpress.com/wp-content/plugins/video/flvplayer.swf?ver=1.11" />
  <param name="quality" value="best" />
  <param name="seamlesstabbing" value="true" />
  <param name="allowfullscreen" value="true" />
  <param name="allowscriptaccess" value="always" />
  <param name="overstretch" value="true" />
  <param name="flashvars" value="guid=9w4ElCW9&amp;javascriptid=video-0&amp;width=400&amp;height=242&amp;locksize=no" />
  <!--[if !IE]>-->
  <object type="application/x-shockwave-flash" data="http://v.wordpress.com/wp-content/plugins/video/flvplayer.swf?ver=1.11" width="400" height="242" standby="A Map/Reduce Implementation">
    <param name="quality" value="best" />
    <param name="seamlesstabbing" value="true" />
    <param name="allowfullscreen" value="true" />
    <param name="allowscriptaccess" value="always" />
    <param name="overstretch" value="true" />
    <param name="flashvars" value="guid=9w4ElCW9&amp;javascriptid=video-0&amp;width=400&amp;height=242&amp;locksize=no" />
  <!--<![endif]-->
  <img alt="A Map/Reduce Implementation" src="http://cdn.videos.wordpress.com/9w4ElCW9/mapreduce_std.original.jpg" width="400" height="242" /><p><strong>A Map/Reduce Implementation</strong></p><p>This movie requires <a href="http://www.adobe.com/go/getflashplayer">Adobe Flash</a> for playback.</p>
  <!--[if !IE]>-->
  </object>
  <!--<![endif]-->
</object></div></ins>
<p>The above code and all the code discussed in the video can be found in the <a href="http://code.google.com/p/dalalstreet/source/browse/#svn/trunk/MapReduce/src/org/karticks/mapreduce">MapReduce sub-project of the DalalStreet open source project</a> in Google Code. Here are the links to the files in case you want to take a detailed look at the code.</p>
<ul>
<li><a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/Mapper.java" target="_blank">Mapper.java</a></li>
<li><a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/Reducer.java" target="_blank">Reducer.java</a></li>
<li><a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/MapReduceWorker.java" target="_blank">MapReduceWorker.java</a></li>
<li><a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/WordCountMapper.java" target="_blank">WordCountMapper.java</a></li>
<li><a href="http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/WordCounter.java" target="_blank">WordCounter.java</a></li>
</ul>
<h4>Complexity</h4>
<p>So if Map/Reduce is really that simple, then why does Google consider its implementation as its intellectual property (IP), or why is there an open-source project (Hadoop) around it, or even a company (<a href="http://www.cloudera.com/">Cloudera</a>) that is trying to commercialize this pattern ?</p>
<p>The answer lies in the application of Map/Reduce to huge data sets (URL-Count of the entire World Wide Web, log analysis, etc.) over a cluster of machines. When one runs Map/Reduce over a cluster of machines, one has to worry about getting notified when each Map phase or job finishes (either successfully or with an error), transferring the results &#8211; which can be huge &#8211; of the Map phase over to the Reduce phase (over the network), and other problems that are typically associated with a distributed application. <em><strong>Map/Reduce by itself is not complex, but the associated set of supporting services that enable Map/Reduce to be distributed is what makes a Map/Reduce &#8220;framework&#8221; (such as Hadoop) complex (and valuable)</strong></em>. Google takes it a step further by running redundant Map phases (to account for common error conditions like disk failures, network failures, etc.) and its IP lies in how it manages these common failures, results from redundant jobs, etc.</p>
<h4>Conclusion</h4>
<p>Map/Reduce has definitely opened up new possibilities for companies that want to analyze their huge data sets, and if you want to give it a test drive, you might want to check out <a href="http://aws.amazon.com">Amazon&#8217;s EC2 Map/Reduce</a> harness (running Hadoop). You might want to try out the word count example by downloading a few books from <a href="http://www.gutenberg.org/">Project Gutenberg</a>.</p>
<p>Happy crunching !!</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Cloud Goes BOOM: First Strike]]></title>
<link>http://databeta.wordpress.com/2009/07/13/the-cloud-goes-boom/</link>
<pubDate>Mon, 13 Jul 2009 10:18:17 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/07/13/the-cloud-goes-boom/</guid>
<description><![CDATA[For the last year or so, my team at Berkeley &#8212; in collaboration with Yahoo Research &#8212; ha]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.flickr.com/photos/daddy0h/2412424704/"><img class="alignright size-medium wp-image-167" title="lightning" src="http://databeta.wordpress.com/files/2009/07/lightning1.jpg?w=210" alt="lightning" width="210" height="300" /></a></p>
<p>For the last year or so, my team at Berkeley &#8212; in collaboration with Yahoo Research &#8212; has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: <a title="Cloud Computing in Wikipedia" href="http://en.wikipedia.org/wiki/Cloud_computing">The Cloud</a>.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.</p>
<p>To kick this off we built something we call<em> <a title="BOOM Analytics TR" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-113.html">BOOM Analytics</a> <span style="color:#ff0000;">[link updated to new version]</span></em>: a clone of <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop </a>and HDFS built largely in <a title="P2 paper" href="http://db.cs.berkeley.edu/papers/sosp05-p2.pdf">Overlog</a>, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able &#8212; with amazingly little effort &#8212; to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop&#8217;s Java codebase.  Two of the fanciest are:<!--more--></p>
<ul>
<li><strong><a title="High Availability in Wikipedia" href="http://en.wikipedia.org/wiki/High_availability">High availability</a>:</strong> Hadoop has a single point of failure at its HDFS master (&#8220;name&#8221;) node.  BOOM Analytics provides hot-standby name node failover, courtesy of a concise Overlog implementation of <a title="Paxos in Wikipedia" href="http://en.wikipedia.org/wiki/Paxos_algorithm">MultiPaxos</a>.</li>
<li><strong><a title="Scale-out in wikipedia" href="http://en.wikipedia.org/wiki/Scale_out#Scale_horizontally_.28scale_out.29">Scale-Out</a>:</strong> Hadoop has two chokepoints for scalability at its master nodes, which must run on a single box.   BOOM Analytics provides name node scaleout via data partitioning of the namespace.  When the name node gets full, you just buy more machines and repartition.  And this composes with the previous feature: each partition enjoys high availability via MultiPaxos.</li>
</ul>
<p>The whole effort of building BOOM Analytics took 12 months for 4 PhD students &#8212; including the time taken to write an Overlog interpreter from scratch.  The scaleout feature took one grad-student developer a <em>day</em> to implement.  One day.</p>
<p>BOOM Analytics is serious distributed fu.  But it really turned out to be pretty simple with the right programming model.</p>
<p>Why did we do this?  What&#8217;s next?  Well, you can read <a title="BOOM Analytics TR" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-98.html">the tech report</a> for details.  Just a couple notes here.  First, we have no designs on displacing Hadoop.  BOOM Analytics is intended as an example of the power of declarative programming.  If it has any utility for the Hadoop community it would probably be as a rapid prototyping environment for new features. The other point is that we&#8217;re not trying to sell Overlog as the &#8220;right&#8221; language for anything. In fact, a key motivation for building BOOM Analytics was to gain some practical experience in order to design a better language.  We are calling that language <em>Lincoln. </em>I look forward to talking about it more soon.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[BashReduce!]]></title>
<link>http://databeta.wordpress.com/2009/07/08/bashreduce/</link>
<pubDate>Wed, 08 Jul 2009 19:14:07 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/07/08/bashreduce/</guid>
<description><![CDATA[  I just heard through the Berkeley grapevine about the BashReduce effort at Last.fm: MapReduce in 1]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p> </p>
<p><a href="http://www.flickr.com/photos/airgap/1487551740/"><img class="size-medium wp-image-160 alignright" title="1487551740_aa7a0f8e04" src="http://databeta.wordpress.com/files/2009/07/1487551740_aa7a0f8e041.jpg?w=300" alt="stripped down VW" width="240" height="180" /></a></p>
<p>I just heard through the Berkeley grapevine about the <a title="BashReduce blog post" href="http://blog.last.fm/2009/04/06/mapreduce-bash-script">BashReduce</a> effort at Last.fm: MapReduce in 126 lines of bash script! Awesome. I&#8217;m sure it doesn&#8217;t do X, Y and Z. So ask yourself: do you need X? Y? Z? Maybe instead you want V and W. Maybe you should roll your own tool.  </p>
<p>Makes you think.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Map Reduce and Parallel DBs]]></title>
<link>http://ashalam.wordpress.com/2009/06/28/map-reduce-and-parallel-dbs/</link>
<pubDate>Mon, 29 Jun 2009 00:37:53 +0000</pubDate>
<dc:creator>ashalam</dc:creator>
<guid>http://ashalam.wordpress.com/2009/06/28/map-reduce-and-parallel-dbs/</guid>
<description><![CDATA[In regards to the paper (http://database.cs.brown.edu/sigmod09/) critiquing MapReduce framework, som]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Regarding the paper (<a href="http://database.cs.brown.edu/sigmod09/">http://database.cs.brown.edu/sigmod09/</a>) critiquing the MapReduce framework, some observations come to mind:</p>
<p>The purpose of MapReduce is to analyze really large-scale non-indexed datasets all at once. RDBMSs index their data and are therefore efficient at retrieving results for complex queries (e.g. multi-table joins). Not all datasets are indexed. You can write applications (in Java for instance) to have the MapReduce framework do things in a certain way. RDBMSs require data massaging, loading, and indexing before they are useful in this kind of scenario. MapReduce is free, parallel RDBMSs are not. In RDBMSs, SQL queries work over structured data, but not all datasets are structured. The WWW abounds with unstructured data. MapReduce, on the other hand, does not assume structure and can therefore work against unstructured data.</p>
<blockquote><p>MapReduce presents a simple interface for manipulating data: a map and a reduce function written in the language of choice (Java/C/C++/Perl/Python) of a developer. Its real power lies in the Expressivity it brings: it makes the phrasing of really interesting transformations and analytics breathtakingly easy. (<a href="http://www.asterdata.com/blog/index.php/2008/08/25/announcing-in-database-mapreduce/">http://www.asterdata.com/blog/index.php/2008/08/25/announcing-in-database-mapreduce/</a>)</p></blockquote>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[cloudera training &gt; MapReduce and HDFS &gt; map-reduce overview]]></title>
<link>http://erikeldridge.wordpress.com/2009/06/11/cloudera-training-mapreduce-and-hdfs-map-reduce-overview/</link>
<pubDate>Thu, 11 Jun 2009 17:27:44 +0000</pubDate>
<dc:creator>Erik</dc:creator>
<guid>http://erikeldridge.wordpress.com/2009/06/11/cloudera-training-mapreduce-and-hdfs-map-reduce-overview/</guid>
<description><![CDATA[ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf - borrows from functional pr]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>ref: <a href="http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf">http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf</a></p>
<p>- borrows from functional programming: map, reduce<br />
- provides an interface for map/reduce; we must implement the interface<br />
- map<br />
&#8211; the mapper can emit an arbitrary pair, not necessarily the input key/val<br />
&#8211; the mapper runs simultaneously on multiple machines; the first to complete is used<br />
&#8211; each map runs in its own jvm<br />
&#8211; each run in parallel<br />
&#8211; input is usually 64MB &#8211; 128MB chunks (results in more streaming)</p>
<p>- reduce<br />
&#8211; the number of reduces that run corresponds to the number of output files<br />
&#8211; ideally, we want 1 reduce<br />
&#8211; run in parallel</p>
<p>- flow<br />
&#8211; data store of k/v pairs &#62; map &#62; barrier (shuffle phase) &#62; reduce &#62; result<br />
- chained map-reduce jobs are common<br />
- all values are processed independently<br />
- bottleneck: no reduce can run until all maps are finished<br />
- combiner<br />
&#8211; runs immediately after mapper on map node<br />
&#8211; can use reducer function if reducer is commutative and associative</p>
<p>- conclusions<br />
&#8211; mapreduce is a useful abstraction<br />
&#8211; simplifies large scale comp<br />
&#8211; lets the programmer focus on the problem and the library handle the details of distribution</p>
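<p>A small sketch of the combiner note above, using made-up values for a single key on two hypothetical map nodes (the names CombinerSketch, sum and mean are illustrative and not part of any Hadoop API). Summing is commutative and associative, so combining on each node first gives the same answer; taking a mean is not, so a mean of per-node means differs from the true mean.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombinerSketch
{
	// A sum reducer: commutative and associative, so it is safe to reuse as a combiner.
	static int sum(List<Integer> values)
	{
		int total = 0;
		for (int v : values)
		{
			total += v;
		}
		return total;
	}

	// A mean "reducer": not associative, so reusing it as a combiner changes the answer.
	static double mean(List<Double> values)
	{
		double total = 0;
		for (double v : values)
		{
			total += v;
		}
		return total / values.size();
	}

	public static void main(String[] args)
	{
		// Values for one key as emitted on two hypothetical map nodes.
		List<Integer> nodeA = Arrays.asList(1, 2, 3);
		List<Integer> nodeB = Arrays.asList(10);

		List<Integer> all = new ArrayList<Integer>(nodeA);
		all.addAll(nodeB);

		int withoutCombiner = sum(all);
		int withCombiner = sum(Arrays.asList(sum(nodeA), sum(nodeB)));
		System.out.println(withoutCombiner + " " + withCombiner);

		double trueMean = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));
		double meanOfMeans = mean(Arrays.asList(mean(Arrays.asList(1.0, 2.0, 3.0)), mean(Arrays.asList(10.0))));
		System.out.println(trueMean + " " + meanOfMeans);
	}
}
```

<p>A common workaround for the mean case is to have the combiner emit a (sum, count) pair and divide only in the final reduce.</p>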
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Diverging views on Big Data density, and some gimmes]]></title>
<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/</link>
<pubDate>Thu, 14 May 2009 08:33:51 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/</guid>
<description><![CDATA[Was intrigued last week by the confluence of two posts: Owen O&#8217;Malley and Arun Murthy of Yahoo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.flickr.com/photos/19779889@N00/428397739/"><img class="alignright size-medium wp-image-145" title="428397739_e5ac735923_b" src="http://databeta.wordpress.com/files/2009/05/428397739_e5ac735923_b1.jpg?w=300" alt="428397739_e5ac735923_b" width="300" height="201" /></a>Was intrigued last week by the confluence of two posts:</p>
<ul>
<li>Owen O&#8217;Malley and Arun Murthy of Yahoo&#8217;s Hadoop team <a title="hadoop petabyte sort" href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html">posted</a> about sorting a <em>petabyte</em> using Hadoop on <em>3,800 nodes</em>.</li>
<li>Curt Monash <a title="Monash on eBay warehouses" href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/">posted</a> that eBay hosts a <em>6.5 petabyte</em> Greenplum database on <em>96 nodes</em></li>
</ul>
<p>Both impressive.  But wildly different hardware deployments. Why??  It&#8217;s well known that Hadoop is tuned for availability not efficiency.  But does it really need 40x the number of machines as eBay&#8217;s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?</p>
<p><!--more-->Now there&#8217;s some modest percentage of quibbling to do about the numbers per se &#8212; Greenplum uses compression, so that 6.5 petabytes lives on &#8220;only&#8221; 4.5 petabytes of storage.  Also, the storage is not full of base data; Monash quotes 70% compression from Greenplum.  So fine &#8212; there&#8217;s only 1.95 petabytes of raw eBay bits living on those 4.5 petabytes of Greenplum disks.</p>
<p>Still, 40x?  I mean, Java is slower than C, and Hadoop is slower than Postgres, etc.  But 40x?</p>
<p>Presumably a large part of this is the difference between the Hadoop philosophy of using whitebox PCs (4 disks for 2 quad-core CPUs), and Greenplum&#8217;s use of dense servers like the Sun <a title="Sun Thor box" href="http://www.sun.com/servers/x64/x4540/">Thor</a> (SunFire X4540 &#8212; 48 disks for 2 quad-core CPUs.)</p>
<p>Perhaps another part has to do with different fault tolerance approaches. Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of <em>every single</em> Map or Reduce stage to disks, before reading it right back in. (I describe this to my undergrads as the &#8220;regurgitation approach&#8221; to fault tolerance.)  By contrast, classic MPP database approaches (like <a title="Goetz Graefe" href="http://www.hpl.hp.com/personal/Goetz_Graefe/">Graefe</a>&#8217;s famous <a title="Graefe's Exchange paper" href="http://www.informatik.uni-trier.de/~ley/db/conf/sigmod/Graefe90.html">Exchange</a> operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.</p>
<p>Chicken or Egg?  The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which will make you pessimistic&#8230;.  Even so, 40x?</p>
<p>I talked briefly to Owen O&#8217;Malley at Yahoo about cluster sizing, and have talked with Greenplum&#8217;s resident expert, Tim Heath, as well (GP wooed Tim away from Yahoo, actually).   Both gave rational explanations for why they ended up with what they use.  And neither seemed doctrinaire or agitated about this issue one way or the other &#8212; both Hadoop and Greenplum are getting their jobs done, and have happy reference stories. </p>
<p>Still I wonder from a general point of view: how much hardware should be thrown at these problems?  What&#8217;s the sweet spot between optimism and pessimism in the software fault tolerance, given the hardware/operational/energy cost to support it?  So far all I hear are casual opinions &#8212; there&#8217;s science to do here (as both Owen and Tim agreed).</p>
<p>As a side note, Mehul Shah has had some important things to say on this note in the past:</p>
<ul>
<li>His work on <a title="flux fault tolerance" href="http://db.cs.berkeley.edu/papers/sigmod04-fluxft.pdf">FLuX</a> presents an alternative fault tolerance and <a title="Flux Load Balancing" href="http://db.cs.berkeley.edu/papers/icde03-fluxlb.pdf">load balancing</a> approach: fully pipelined dataflow, but with process pairs rather than checkpointing.  It&#8217;s much trickier to build, and not necessarily cheaper in hardware overhead, but an intriguing alternative to regurgitation.</li>
<li>His more recent <a title="sort benchmarks" href="http://sortbenchmark.org/">JouleSort</a> benchmark tries to factor energy cost into the sorting machismo story. Unfortunately, nobody else has risen to the challenge, and it&#8217;s a single-node benchmark for now.</li>
</ul>
<p>So a number of good research gimmes here with big potential impact on practice:</p>
<ol>
<li><em>Predictive Snapshots for Dataflows:</em> It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart.  Can&#8217;t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?</li>
<li><em>TCO metrics for Analytics hardware in modern datacenters: </em>What is the right way to measure cost for these deployments, including energy consumption, rack space, management, etc.?</li>
<li><em>Energy-centric scalable benchmarks: </em>Combine the best of JouleSort, PennySort and the Petabyte-scale work going on in the field, and get people to compete for the right metric on big data.  The scaling will modify JouleSort and PennySort to include purchasing and energy costs of components like network switches within and across racks.</li>
</ol>
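<p>To make the first item concrete, here is a toy version of the decision it asks about. This is only a sketch under strong assumptions (one stage, one possible fault, costs known in advance, all in seconds); the function name and cost model are mine, not taken from any of the systems discussed:</p>

```python
def should_checkpoint(recompute_cost, write_cost, read_cost, p_fault):
    """Decide whether staging a stage's output to disk is worth it.

    Toy expected-cost model (an assumption, not a published one):
      with a checkpoint:    always pay write_cost, and read_cost after a fault
      without a checkpoint: pay recompute_cost after a fault
    """
    expected_with = write_cost + p_fault * read_cost
    expected_without = p_fault * recompute_cost
    return expected_without > expected_with

# Hypothetical stage: 600 s to recompute, 30 s to write, 10 s to read back.
print(should_checkpoint(600, 30, 10, p_fault=0.01))  # rare faults
print(should_checkpoint(600, 30, 10, p_fault=0.20))  # frequent faults
```

<p>Even this crude model flips its answer as the fault probability grows, which is why the choice between pipelining and snapshotting looks predictable rather than doctrinal.</p>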
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce vs. Parallel DBs]]></title>
<link>http://everythingisdata.wordpress.com/2009/05/04/mapreduce-vs-parallel-dbs/</link>
<pubDate>Mon, 04 May 2009 21:09:27 +0000</pubDate>
<dc:creator>Neil Conway</dc:creator>
<guid>http://everythingisdata.wordpress.com/2009/05/04/mapreduce-vs-parallel-dbs/</guid>
<description><![CDATA[&#8220;A Comparison of Approaches to Large-Scale Data Analysis&#8221; in SIGMOD 2009 is a followup t]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>&#8220;<a href="http://database.cs.brown.edu/projects/mapreduce-vs-dbms/">A Comparison of Approaches to Large-Scale Data Analysis</a>&#8221; in SIGMOD 2009 is a followup to Stonebraker and DeWitt&#8217;s controversial blog posts (<a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">1</a>, <a href="http://www.databasecolumn.com/2008/01/mapreduce-continued.html">2</a>) comparing MapReduce with parallel databases for analyzing large volumes of data. Unsurprisingly, the paper makes the argument that parallel DBs significantly outperform Hadoop for a broad class of data analysis queries, although Hadoop loads the data much faster and was easier to configure.</p>
<p>Overall, I thought the paper&#8217;s analysis was pretty fair, but the authors clearly engage in some trickery to accentuate the performance differences between Hadoop and parallel DBs.</p>
<h3>Loading Time</h3>
<p>When reporting results, the authors separate load time from query time. Looking closely, Hadoop&#8217;s load performance is <i>much</i> faster than the load performance of either of the database systems (about 5x faster than Vertica at scale and 10x faster than DBMS-X). The load times are significant in magnitude, as well: on the UserVisits data set with 100 nodes, Hadoop loads the data in 4250 seconds, while Vertica takes ~21,000 seconds. That leaves plenty of time for Hadoop to execute all the other benchmark queries before Vertica has even finished loading the data! For data sets where the user is only interested in a single analysis run on the data (or perhaps wants to compute an aggregate on the raw data and then apply most analysis to the reduced data set), Hadoop has a massive advantage.</p>
<h3>Indexes</h3>
<p>Why the disparity in load times? A major reason is that the DBs transform the input into their native data format and build indexes, whereas Hadoop does not. Because load times are separated from query times, this essentially means that the database systems are given the chance to precompute answers to the queries, while Hadoop is not. Further, the authors chose several queries that benefit significantly from indexes. For example, the &#8220;Selection Task&#8221; retrieves only 36,000 out of 18 million records at each node (by applying a filter on an indexed column). In the Join Task, a filter is satisfied by only 134,000 of the 155 million records in the UserVisits table (again, both databases happen to have an index on the appropriate column).</p>
<p>Is this unfair? Well, databases can use indexes, while Hadoop currently cannot; it might also be non-trivial to integrate indexes into Hadoop&#8217;s model of user-defined map and reduce functions. For these particular queries, it would be easy to work around this: given the huge disparity in loading times, a Hadoop user could precompute the set of rows that match the indexable predicates and store those rows as separate HDFS files, yielding a massive performance improvement.</p>
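<p>A minimal sketch of that precomputation idea, using local CSV files as a stand-in for HDFS. The column layout, the threshold and the file names are hypothetical; the paper's actual predicates are not reproduced here:</p>

```python
import csv
import os
import tempfile

def materialize_matches(src_path, dst_path, predicate):
    """One-time pass: copy only the rows satisfying predicate to dst_path."""
    kept = 0
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if predicate(row):
                writer.writerow(row)
                kept += 1
    return kept

# Hypothetical Rankings rows: (pageURL, pageRank).
rows = [('a.com', 3), ('b.com', 42), ('c.com', 17), ('d.com', 1)]

with tempfile.TemporaryDirectory() as workdir:
    full_path = os.path.join(workdir, 'rankings.csv')
    subset_path = os.path.join(workdir, 'rankings_rank_gt_10.csv')
    with open(full_path, 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)

    # Pay the filtering cost once, at load time.
    kept = materialize_matches(full_path, subset_path, lambda r: int(r[1]) > 10)

    # Later queries scan only the small precomputed file.
    with open(subset_path, newline='') as handle:
        matches = [(url, int(rank)) for url, rank in csv.reader(handle)]

print(kept, matches)
```

<p>The trade-off is the same one an index makes: extra work and storage at load time in exchange for touching far fewer rows per query.</p>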
<h3>Partitioning</h3>
<p>For both database systems, the two relations in the Join Task are partitioned on the join key (<tt>Rankings.pageURL = UserVisits.destURL</tt>). This allows the DBs to compute the join results locally on each node. Because Hadoop has no similar concept built in, it must exchange much more data between nodes to execute the join. As with indexes, it would be possible for the user to implement partitioning in Hadoop by hand (<a href="http://wiki.apache.org/hadoop/Hive/Design#head-507fb621ecdf8a9c2dfcfc5db2dda52a724534fe">Apache Hive</a> provides a simple way to do this). It would be interesting to see DB performance for joins on a non-partitioning key, or Hadoop&#8217;s performance with manual partitioning.</p>
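<p>Why partitioning on the join key makes the join local can be shown in a few lines. This is an in-memory sketch rather than Hadoop or Hive code: the sample rows and the second column of each relation are invented, and each list in <tt>rank_parts</tt> and <tt>visit_parts</tt> stands in for the data held by one node:</p>

```python
from collections import defaultdict

def hash_partition(rows, key_index, n_parts):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key_index]) % n_parts].append(row)
    return parts

def local_join(left, right, left_key, right_key):
    """Plain in-memory hash join of two lists of tuples."""
    lookup = defaultdict(list)
    for lrow in left:
        lookup[lrow[left_key]].append(lrow)
    return [lrow + rrow for rrow in right for lrow in lookup.get(rrow[right_key], [])]

# Hypothetical sample rows: Rankings(pageURL, pageRank), UserVisits(destURL, adRevenue).
rankings = [('a.com', 10), ('b.com', 20), ('c.com', 30)]
user_visits = [('a.com', 1.5), ('c.com', 0.5), ('a.com', 2.0), ('d.com', 9.0)]

n_nodes = 3
rank_parts = hash_partition(rankings, 0, n_nodes)
visit_parts = hash_partition(user_visits, 0, n_nodes)

# Matching keys always land in the same partition number, so each "node"
# can join its own pair of partitions without talking to the others.
joined = []
for node in range(n_nodes):
    joined.extend(local_join(rank_parts[node], visit_parts[node], 0, 0))

print(sorted(joined))
```

<p>The per-node results add up to exactly the global join, with no rows exchanged at query time; the exchange was paid for once, when the data was loaded.</p>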
<h3>Aggregation Architecture</h3>
<p>In the Aggregation Task, Hadoop is configured differently from the DBs. Hadoop uses a reducer per node, whereas the DBs compute per-node partial aggregates that are combined by a single coordinator node. For 100 nodes, that means Hadoop needs to shuffle data between 100&#215;100 nodes, whereas the DBs need to only move 100 partial aggs to a single node. However, the Hadoop approach would likely scale better if there were a vast number of distinct groups (too many for the DB&#8217;s coordinator node to hold in memory). The paper should probably have examined Hadoop performance with a much smaller number of reducers for this query.</p>
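<p>The two aggregation plans can be compared side by side in a small sketch. It is an illustration only: the rows are invented, both plans start from per-node partial sums (in Hadoop that would need a combiner, which I am assuming here), and the function names are hypothetical:</p>

```python
from collections import Counter

def partial_aggregate(rows):
    """Per-node SUM grouped by key."""
    agg = Counter()
    for key, value in rows:
        agg[key] += value
    return agg

def merge_at_coordinator(partials):
    """DB-style plan: every node ships one partial aggregate to one node."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)

def merge_at_reducers(partials, n_reducers):
    """Hadoop-style plan: each group key is routed to one of n reducers."""
    reducers = [Counter() for _ in range(n_reducers)]
    for partial in partials:
        for key, value in partial.items():
            reducers[hash(key) % n_reducers][key] += value
    merged = {}
    for reducer in reducers:
        merged.update(reducer)  # keys are disjoint across reducers
    return merged

# Hypothetical (group key, value) rows held by three nodes.
node_rows = [
    [('1.2.3', 5), ('4.5.6', 1)],
    [('1.2.3', 2), ('7.8.9', 4)],
    [('4.5.6', 3)],
]
partials = [partial_aggregate(rows) for rows in node_rows]
via_coordinator = merge_at_coordinator(partials)
via_reducers = merge_at_reducers(partials, n_reducers=len(node_rows))

print(via_coordinator == via_reducers)
```

<p>Both plans produce the same totals. The difference is where the state lives: the coordinator must hold every group, while each reducer holds only its share of them.</p>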
<h3>Conclusion</h3>
<p>The authors probably should have been more forthright about these differences between the two systems.</p>
<p>If we consider these tricks, the paper&#8217;s conclusion becomes more nuanced. Rather than describing the massive performance advantage that parallel DBs have over Hadoop, the argument is more about ease of use: DBs provide features like indexes, materialized views and partitioning that make it <i>easier</i> to get good performance. Do you really want to implement these features by hand, for each analysis task? Probably not.</p>
<p>That said, I think there&#8217;s value in this paper for the Hadoop community: each of the techniques described above is worth investigating to improve Hadoop performance.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce implementation of GFF parsing for Biopython]]></title>
<link>http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/</link>
<pubDate>Sun, 22 Mar 2009 15:59:46 +0000</pubDate>
<dc:creator>Brad Chapman</dc:creator>
<guid>http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/</guid>
<description><![CDATA[I previously wrote up details about starting a GFF parser for Biopython. In addition to incorporatin]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>
I previously wrote up details about <a href="http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/">starting a GFF parser for Biopython</a>. In addition to incorporating suggestions received on the <a href="http://biopython.org/wiki/Mailing_lists">Biopython mailing list</a>, it has been redesigned for parallel computation using <a href="http://discoproject.org/">Disco</a>. Disco is an implementation of the distributed <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> framework in Erlang and Python. The code is available from the <a href="http://github.com/chapmanb/bcbb/tree/be2f4f1714b67aa8e428b747c74c81cdd0451072/gff">git repository</a>; this post describes the impetus and design behind the MapReduce revision.
</p>
<p>
The scale of biological analyses is growing quickly thanks to new sequencing technologies. Bioinformatics programmers will need to learn techniques to rapidly analyze extremely large data sets. My coding toolbox has expanded to tackle these problems in two ways. The first is exploring programming languages designed for speed and parallelism, like <a href="http://en.wikipedia.org/wiki/Haskell_(programming_language)">Haskell</a>. Additionally, I have been learning general techniques for parallelizing programs. Both require re-thinking code design to take advantage of increasingly available multi-processor and clustered architectures.
</p>
<p>
The MapReduce framework, originally proposed by Google, exemplifies the idea of redesigning code to analyze large data sets in parallel. In short, the programmer writes two functions: map and reduce. The map function handles the raw parsing work; for instance, it parses a line of GFF text and structures the details of interest. The reduce function combines the results from the map parsing, making them available for additional processing outside of the parallel part of the job. Several implementations of MapReduce have become popular. Apache&#8217;s <a href="http://hadoop.apache.org/core/">Hadoop</a> is a mature Java implementation with an underlying distributed file system. Here we utilize Disco, an implementation in Erlang and Python from <a href="http://research.nokia.com/">Nokia Research Center</a>.
</p>
<p>
The MapReduce GFF parser consists of <a href="http://github.com/chapmanb/bcbb/blob/be2f4f1714b67aa8e428b747c74c81cdd0451072/gff/BCBio/GFF/GFFParser.py">two standalone functions</a>. The map function takes a line of GFF text and first determines if we should parse it based on a set of limits. This allows the user to only pull items of interest from the GFF file, saving memory and time:</p>
<pre class="brush: python;">
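import collections

import simplejson

def _gff_line_map(line, params):
    # map GFF strand symbols onto numeric strand values
    strand_map = {'+' : 1, '-' : -1, '?' : None, None: None}
    line = line.strip()
    # skip blank lines as well as comment and directive lines
    if line and line[0] != &quot;#&quot;: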
        parts = line.split('\t')
        should_do = True
        if params.limit_info:
            for limit_name, limit_values in params.limit_info.items():
                cur_id = tuple([parts[i] for i in
                    params.filter_info[limit_name]])
                if cur_id not in limit_values:
                    should_do = False
                    break
</pre>
<p>
If the GFF line is to be parsed, we use it to build a dictionary with all the details. Additionally, the line is classified as a top-level annotation, a standard flat feature with a location, or part of a parent/child nested feature. The result is returned as a one-item list holding the classification and the dictionary. For the Disco parallel implementation, we use <a href="http://www.json.org/">JSON</a> to convert the dictionary into a flattened string:</p>
<pre class="brush: python;">
        if should_do:
            assert len(parts) == 9, line
            gff_parts = [(None if p == '.' else p) for p in parts]
            gff_info = dict()
            # collect all of the base qualifiers for this item
            quals = collections.defaultdict(list)
            if gff_parts[1]:
                quals[&quot;source&quot;].append(gff_parts[1])
            if gff_parts[5]:
                quals[&quot;score&quot;].append(gff_parts[5])
            if gff_parts[7]:
                quals[&quot;phase&quot;].append(gff_parts[7])
            # split each attribute on the first '=' only, so values may contain '='
            for key, val in [a.split('=', 1) for a in gff_parts[8].split(';')]:
                quals[key].extend(val.split(','))
            gff_info['quals'] = dict(quals)
            gff_info['rec_id'] = gff_parts[0]
            # if we are describing a location, then we are a feature
            if gff_parts[3] and gff_parts[4]:
                gff_info['location'] = [int(gff_parts[3]) - 1,
                        int(gff_parts[4])]
                gff_info['type'] = gff_parts[2]
                gff_info['id'] = quals.get('ID', [''])[0]
                gff_info['strand'] = strand_map[gff_parts[6]]
                # Handle flat features
                if not gff_info['id']:
                    final_key = 'feature'
                # features that have parents need to link so we can pick up
                # the relationship
                elif 'Parent' in gff_info['quals']:
                    final_key = 'child'
                # top level features
                else:
                    final_key = 'parent'
            # otherwise, associate these annotations with the full record
            else:
                final_key = 'annotation'
            return [(final_key, (simplejson.dumps(gff_info) if params.jsonify
                else gff_info))]
</pre>
<p>
The advantage of this distinct map function is that it can be run in parallel for any line in the file. To condense the results back into a synchronous world, the reduce function takes the results of the map function and combines them into a dictionary of results:</p>
<pre class="brush: python;">
def _gff_line_reduce(map_results, out, params):
    final_items = dict()
    for gff_type, final_val in map_results:
        send_val = (simplejson.loads(final_val) if params.jsonify else
                final_val)
        try:
            final_items[gff_type].append(send_val)
        except KeyError:
            final_items[gff_type] = [send_val]
    for key, vals in final_items.items():
        out.add(key, (simplejson.dumps(vals) if params.jsonify else vals))
</pre>
<p>Finally, the dictionaries of GFF information are converted into Biopython SeqFeatures and attached to SeqRecord objects; the standard object interface is identical to that used for GenBank feature files.
</p>
<p>
Re-writing the code sped it up by a roughly calculated 10% for single processor work. Splitting up parsing and object creation allowed me to apply some simple speed ups which contributed to this improvement. The hidden advantage of learning new programming frameworks is that it encourages you to think about familiar problems in different ways.
</p>
<p>
This implementation is designed to work both in parallel using Disco, and locally on a standard machine. Practically, this means that Disco is not required unless you care about parallelizing the code. Parallel coding may also not be the right approach for a particular problem. For small files, it is more efficient to run the code locally and avoid the overhead involved with making it parallel.
</p>
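<p>
To illustrate what running locally means, here is a much-reduced sketch of a driver that chains a map function over the lines of a file and hands the results to a reduce function, with no Disco involved. The names <tt>simple_gff_map</tt>, <tt>simple_gff_reduce</tt>, <tt>LocalOut</tt> and <tt>parse_locally</tt> are hypothetical stand-ins; this is neither the parser's real code nor the Disco API:
</p>

```python
import itertools

class LocalOut(object):
    """Stand-in for the 'out' object a reduce function writes to.
    Only an add(key, value) method is mimicked."""
    def __init__(self):
        self.results = {}

    def add(self, key, value):
        self.results[key] = value

def simple_gff_map(line):
    """Classify one GFF line; returns a list of (key, details) pairs."""
    line = line.strip()
    if not line or line.startswith('#'):
        return []
    parts = line.split('\t')
    has_location = parts[3] != '.' and parts[4] != '.'
    key = 'feature' if has_location else 'annotation'
    return [(key, {'rec_id': parts[0], 'type': parts[2]})]

def simple_gff_reduce(map_results, out):
    """Group the mapped items by key and write each group to out."""
    grouped = {}
    for key, value in map_results:
        grouped.setdefault(key, []).append(value)
    for key, values in grouped.items():
        out.add(key, values)

def parse_locally(lines):
    """Serial driver: map every line, then reduce everything at once."""
    map_results = itertools.chain.from_iterable(
        simple_gff_map(line) for line in lines)
    out = LocalOut()
    simple_gff_reduce(map_results, out)
    return out.results

gff_lines = [
    '##gff-version 3',
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=gene1',
    'chr1\tsrc\tremark\t.\t.\t.\t.\t.\tNote=example',
]
results = parse_locally(gff_lines)
print(sorted(results))
```

<p>
Because the map function only ever sees one line, the same pair of functions can be handed to a parallel framework unchanged; only the driver differs.
</p>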
<p>
When you suddenly need to apply your small GFF parsing script to many gigabytes of result data, the code can scale accordingly by incorporating Disco. To look at the practical numbers related to this scaling, I plan to follow up on this post with tests <a href="http://discoproject.org/doc/start/ec2setup.html#ec2setup">using Disco on Amazon&#8217;s Elastic Compute Cloud</a>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Raft of papers #1: MAD Skills]]></title>
<link>http://databeta.wordpress.com/2009/03/20/mad-skills/</link>
<pubDate>Fri, 20 Mar 2009 21:58:56 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/03/20/mad-skills/</guid>
<description><![CDATA[Update: VLDB slides posted [pptx] [pdf] It&#8217;s been a busy month pushing out papers. I&#8217;ll ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.flickr.com/photos/mark78/1463574952/"><img class="alignright size-full wp-image-96" title="1463574952_dd400430e5" src="http://databeta.wordpress.com/files/2009/03/1463574952_dd400430e5.jpg" alt="1463574952_dd400430e5" width="210" height="280" /></a></p>
<p><em><span style="color:#ff0000;">Update: VLDB slides posted [<a title="MAD Skills slides, VLDB 2009" href="http://db.cs.berkeley.edu/jmh/talks/MADSkills-vldb09.pptx">pptx</a>] [<a title="MAD Skills slides, VLDB09" href="http://db.cs.berkeley.edu/jmh/talks/MADSkills-vldb09.pdf">pdf</a>]</span></em></p>
<p>It&#8217;s been a busy month pushing out papers. I&#8217;ll cover some of them here over the next days.</p>
<p>The first one I&#8217;ll mention is <a title="MAD Skills, VLDB 2009" href="http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf">MAD Skills: New Analysis Practices for Big Data </a><span style="color:#ff0000;"><em>(link updated to VLDB version)</em></span>.  The paper does a few controversial things (if you&#8217;re the kind of person who finds data management a source of controversy):</p>
<ul>
<li>It takes on &#8220;data warehousing&#8221; and &#8220;business intelligence&#8221; as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a &#8220;Magnetic, Agile, Deep&#8221; (MAD) approach to data, that shifts the locus of power from what Brian Dolan calls the &#8220;DBA priesthood&#8221; to the statisticians and analysts who actually like to crunch the numbers.  This is a <em>good thing</em>, on many fronts.</li>
<li>It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.</li>
<li>It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They&#8217;re expressed in SQL <em>(OMG!) </em>but could be easily translated to MapReduce if that&#8217;s your bag.</li>
<li>It argues for a catholic (small-c) approach to programming Big Data, including SQL &#38; MapReduce, Java &#38; R, Python &#38; Perl, etc.  If you already have a parallel database, it <em>just shouldn&#8217;t be that hard</em> to support all those things in a single engine.</li>
<li>It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone &#8220;technologies&#8221;, they are great features that should &#8212; no, will &#8212; get added to existing parallel data systems.  (C&#8217;mon, you know it&#8217;s true&#8230; )</li>
</ul>
<p><!--more-->I started to write the paper because it was just too cool what Brian Dolan was doing with Greenplum at Fox Interactive Media (parent company of MySpace.com) &#8212; e.g., writing Support Vector Machines in SQL and running them over dozens of TB of data.  Brian was a great sport about taking his real-world experience and good ideas and putting them down on paper for others to read.  Along the way I learned a lot about the data architecture ideas he&#8217;s been cooking with Mark Dunlap, which are a real thumb in the eye of the warehouse orthodoxy, and make eminent good sense in today&#8217;s world.  Finally, it was nice to get to write about the good things that Jeff Cohen and Caleb Welton have been doing at Greenplum to cut through the hype and shrink the distance between SQL and MapReduce.  I&#8217;m hoping those guys will have time to sit down one of these days and patiently write up how they&#8217;ve done it &#8230; it&#8217;s really very elegant.</p>
<p>And it still warms my heart that it&#8217;s Postgres code underneath all that.  Time to resurrect the xfunc code!</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Cloudera building a business around open source Map Reduce]]></title>
<link>http://brightsparc.wordpress.com/2009/03/17/cloudera-building-a-business-around-open-source-map-reduce/</link>
<pubDate>Tue, 17 Mar 2009 06:27:34 +0000</pubDate>
<dc:creator>brightsparc</dc:creator>
<guid>http://brightsparc.wordpress.com/2009/03/17/cloudera-building-a-business-around-open-source-map-reduce/</guid>
<description><![CDATA[The heavy hitting ex-executives behind start up CloudEra are banking on a business based around Hado]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>The heavy-hitting <a href="http://bits.blogs.nytimes.com/2009/03/16/bottling-the-magic-behind-google-and-facebook/">ex-executives</a> behind startup <a href="http://www.cloudera.com">Cloudera</a> are banking on a business based around <a href="http://hadoop.apache.org">Hadoop</a>, the open source Map Reduce implementation with a distribution capable of running on Amazon&#8217;s EC2.   Google is credited with popularising (<a href="http://renil.wordpress.com/2007/05/03/mapreduce-functional-programming-lisp/">inventing</a>) Map Reduce and has been tuning its own implementation for many years.  It gave insights into the origin and future research direction in a <a href="http://research.google.com/roundtable/MR.html">round table video</a> last year.</p>
<p>Increasingly companies need to make sense of Terabytes or even Petabytes of data.  This information is stored across many machines on many disks, and needs distributed algorithms for sifting through the data in any reasonable time.  This is where Map-Reduce comes in.</p>
<p>Interestingly, Microsoft has taken a <a href="http://oakleafblog.blogspot.com/2009/02/mid-course-correction-for-sql-data.html">step back</a> from this direction by deciding that its <a href="http://msdn.microsoft.com/en-us/sqlserver/dataservices/default.aspx">SDS</a> offering should support standard &#8216;relational&#8217; features, in effect turning the product into a hosted SQL Server cloud.</p>
<p>It has however been active in this research field.  It released its functional <a href="http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/">programming language F#</a> and it runs its ad serving on <a href="http://research.microsoft.com/en-us/projects/dryad">Dryad</a> - a distributed execution software engine.  <a href="http://research.microsoft.com/en-us/projects/dryadlinq">DryadLINQ</a> combines the power of this engine, with the simplicity of LINQ by creating a SQL-like execution plan for distributed processing, very cool! </p>
<p>Large-scale distributed processing software typically runs on many low-grade Linux servers running open source software so that licensing costs are kept low.  However, with the army of MS developers out there, there are companies springing up to provide software to make the most out of idle cycles on Windows boxes around the network.  Manjrasoft, a <a href="http://www.gridcomputingplanet.com/article.php/3785741">recent graduate</a> of Melbourne University&#8217;s GridBus laboratory, has released an alpha of its <a href="http://www.manjrasoft.com/download.html">Aneka</a> software &#8211; a .NET Map Reduce implementation.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Winning in the Cloud]]></title>
<link>http://databeta.wordpress.com/2009/02/04/winning-in-the-cloud/</link>
<pubDate>Wed, 04 Feb 2009 09:52:53 +0000</pubDate>
<dc:creator>jmh</dc:creator>
<guid>http://databeta.wordpress.com/2009/02/04/winning-in-the-cloud/</guid>
<description><![CDATA[  At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database soft]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p> </p>
<p><a href="http://flickr.com/photos/kky/704056791/"><img class=" alignright" src="http://farm2.static.flickr.com/1332/704056791_63f1e492d8_m_d.jpg" alt="The Cloud" width="240" height="180" /></a></p>
<p>At HPTS 2001 I gave a quick seat-of-the-pants talk called <a title="We Lose, powerpoint" href="http://db.cs.berkeley.edu/jmh/talks/talks/hpts2001-we-lose.ppt">We Lose</a>, which argued that database software and research wasn&#8217;t targeting the hacker community, and therefore was dooming itself to irrelevance.  This thing &#8212; which I cooked up in about 10 minutes &#8212; still gets me a bunch of feedback.  (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work.  MapReduce anyone?)</p>
<p> </p>
<p>This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave <a title="CIDR gongshow talk, pdf" href="http://www-db.cs.wisc.edu/cidr/cidr2009/gong/09Hellerstein.pdf">a similar pitch at CIDR this year</a>, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model.  I&#8217;m hoping this time some folks other than us will bite.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Building an Inverted Index with Hadoop and Pig]]></title>
<link>http://squarecog.wordpress.com/2009/01/17/building-an-inverted-index-with-hadoop-and-pig/</link>
<pubDate>Sun, 18 Jan 2009 03:51:03 +0000</pubDate>
<dc:creator>squarecog</dc:creator>
<guid>http://squarecog.wordpress.com/2009/01/17/building-an-inverted-index-with-hadoop-and-pig/</guid>
<description><![CDATA[Pig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hado]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignright size-full wp-image-39" title="boar" src="http://squarecog.wordpress.com/files/2009/01/boar.gif" alt="boar" width="387" height="242" />Pig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project.  Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.</p>
<p>In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of PigLatin.<!--more--></p>
<p>Pig offers SQL-like data processing instructions (select, project, filter, group), while being both more flexible by allowing simple integration of user-defined functions, and more straightforward by allowing users to issue commands procedurally, rather than declaratively, as in SQL.  Now, I am a big fan of declarativity, but experience does show that expressing complex rules in SQL is cumbersome.</p>
<p>There are several other projects with similar goals in the Hadoop universe &#8212; Hbase and Hive, both also Hadoop subprojects, being the more famous ones; both support variants of SQL.</p>
<p>The Pig feature that makes it stand out is the easy native support for nested elements &#8212; meaning, a tuple can have other tuples nested inside it; Pig also supports maps and a few other constructs. <a title="Pig Latin pdf" href="http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf">The SIGMOD 2008 paper</a> presents the language and gives examples of how the system is used at Yahoo.</p>
<p>Without further ado &#8212; a quick example of the kind of processing that would be awkward, if not impossible, to write in regular SQL, and long and tedious to express in Java (even using Hadoop).</p>
<p>Let&#8217;s say we have a (very large) collection of (very long) text files, and we want to index it so that we can quickly find the documents that contain certain words. Generally, this kind of problem is solved by making an <a title="Inverted Index article" href="http://en.wikipedia.org/wiki/Inverted_index">inverted index</a> &#8212; a structure that lists all words in a collection, and for each word, all the documents it occurs in.</p>
<p>Here&#8217;s the entirety of Pig Latin code that achieves this:</p>
<p>Load:<br />
<code><br />
t1 = LOAD 'texts/alls_well.txt' USING TextLoader() AS (string:chararray);<br />
t1 = FOREACH t1 GENERATE 'alls_well.txt' as fname, string;<br />
t2 = LOAD 'texts/cymbeline.txt' USING TextLoader() as (string:chararray);<br />
t2 = FOREACH t2 GENERATE 'cymbeline.txt' as fname, string;<br />
text = UNION t1, t2;<br />
</code><br />
Process:<br />
<code><br />
words = FOREACH text GENERATE fname, FLATTEN( TOKENIZE(string) );<br />
word_groups = GROUP words BY $1;<br />
index = FOREACH word_groups {<br />
files = DISTINCT $1.$0;<br />
cnt = COUNT(files);<br />
GENERATE $0, cnt, files;<br />
};<br />
STORE index INTO '/data/inverted_index';</code></p>
<p>There is kind of an odd thing going on there with loading &#8212; in order to know which file each line comes from, I load the files one by one and inject their names into the tuples as they are read.  Pig does know how to read whole directories natively, but unfortunately it does not provide any information about which file is being read to the Loader function (a programmer can build his own Loading function &#8212; as well as filtering functions, ordering functions, etc.).  There is an interface called &#8220;Slicer&#8221; that a Loader can implement that would give it this kind of access, but that&#8217;s a bit messy too. Anyway, that&#8217;s not the fun part.  The fun part is what happens later.</p>
<p>Let&#8217;s step through it.<br />
<code>words = FOREACH text GENERATE fname, FLATTEN( TOKENIZE(string) ); </code><br />
For each line of text, we have a tuple of two fields &#8212; fname and string. The built-in TOKENIZE function splits the string (we can also provide our own tokenizer/stemmer/what-have-you).  Just calling &#8220;GENERATE fname, TOKENIZE(string)&#8221; would give us all the same rows, with the second field now being a &#8220;DataBag&#8221; that holds one entry per word.  The &#8220;FLATTEN&#8221; command flattens this nested structure &#8212; it generates a new row for every element of this data bag.  So { (&#8216;foo.txt&#8217;, (&#8216;bar&#8217;, &#8216;baz&#8217;, &#8216;bam&#8217;))} becomes { (&#8216;foo.txt&#8217;, &#8216;bar&#8217;), (&#8216;foo.txt&#8217;, &#8216;baz&#8217;), (&#8216;foo.txt&#8217;, &#8216;bam&#8217;) }.</p>
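<p>For readers who think more easily in a general-purpose language, the same step can be mimicked in plain Python. This is only an analogy for what the Pig statement computes, not how Pig executes it; <tt>tokenize</tt> here is a whitespace split, which is an assumption and may not match Pig&#8217;s own tokenizer exactly:</p>

```python
def tokenize(text):
    """Whitespace split, standing in for Pig's TOKENIZE."""
    return text.split()

def flatten_tokens(rows):
    """Emit one (fname, word) pair per word, like FLATTEN(TOKENIZE(...))."""
    return [(fname, word) for fname, line in rows for word in tokenize(line)]

rows = [('foo.txt', 'bar baz bam')]
print(flatten_tokens(rows))
# [('foo.txt', 'bar'), ('foo.txt', 'baz'), ('foo.txt', 'bam')]
```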
<p><code>word_groups = GROUP words BY $1; </code><br />
This is the inversion part.  We group this set of words by the second column &#8212; the word.<br />
So now, if we had { (&#8216;foo.txt&#8217;, &#8216;apple&#8217;), (&#8216;foo.txt&#8217;, &#8216;pear&#8217;), (&#8216;bar.txt&#8217;, &#8216;apple&#8217;) }, we get { (&#8216;apple&#8217;, ( (&#8216;foo.txt&#8217;, &#8216;apple&#8217;), (&#8216;bar.txt&#8217;, &#8216;apple&#8217;))), (&#8216;pear&#8217;, ((&#8216;foo.txt&#8217;, &#8216;pear&#8217;))) }<br />
In other words, we get a set of tuples where the first column is the value of the column we grouped by, and the second column is a &#8216;DataBag&#8217; of the relevant tuples.</p>
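<p>The grouping step has an equally small Python analogue. Again this is a sketch of the result, not of Pig&#8217;s execution; the dictionary plays the role of the grouped relation, keyed by word:</p>

```python
def group_by_word(pairs):
    """Collect the (fname, word) pairs under their word, like GROUP ... BY $1."""
    groups = {}
    for fname, word in pairs:
        groups.setdefault(word, []).append((fname, word))
    return groups

pairs = [('foo.txt', 'apple'), ('foo.txt', 'pear'), ('bar.txt', 'apple')]
groups = group_by_word(pairs)
print(groups['apple'])  # [('foo.txt', 'apple'), ('bar.txt', 'apple')]
print(groups['pear'])   # [('foo.txt', 'pear')]
```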
<p><code>index = FOREACH word_groups {<br />
files = DISTINCT $1.$0;<br />
cnt = COUNT(files);<br />
GENERATE $0, cnt, files;<br />
};<br />
</code></p>
<p>Now we generate the index &#8212; for every word group, we find the distinct filenames, and write out the word followed by the number of files in which it is found, and the filenames themselves.</p>
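<p>As a rough Python model of this nested FOREACH (the helper name and sample groups are invented for illustration), the sketch below projects the file names out of each bag, removes duplicates, and emits the word, the file count, and the file list. Sorting the file names is an assumption added here to keep the output deterministic; it is not something the DISTINCT statement itself asks for.</p>

```python
# Model of:
#   index = FOREACH word_groups {
#       files = DISTINCT $1.$0; cnt = COUNT(files); GENERATE $0, cnt, files; };


def build_index(word_groups):
    """Return a list of (word, file_count, files) posting entries."""
    index = []
    for word, bag in word_groups.items():
        # $1.$0 projects the file name out of every tuple in the bag;
        # the set removes duplicates, sorted() is only for stable output.
        files = sorted({fname for fname, _ in bag})
        index.append((word, len(files), files))
    return index


word_groups = {  # hypothetical grouped input; "apple" occurs twice in foo.txt
    "apple": [("foo.txt", "apple"), ("bar.txt", "apple"), ("foo.txt", "apple")],
    "pear": [("foo.txt", "pear")],
}
index = build_index(word_groups)
print(index)
```

<p>The duplicate (&#8216;foo.txt&#8217;, &#8216;apple&#8217;) tuple shows why the DISTINCT matters: without it, a word that appears several times in one file would inflate the file count.</p>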
<p><code>STORE index INTO '/data/inverted_index';</code><br />
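This is what makes the code actually run.  Pig evaluates lazily: the statements above only describe the data flow, and nothing executes until a STORE (or DUMP) asks for output.  At that point the index is computed (partitioning the data across Hadoop servers and all that jazz is taken care of automatically) and written out.</p>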
<p>Note that the index is written out in lexically sorted order, as are the files inside each posting list.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Great post on databases and map/reduce]]></title>
<link>http://squarecog.wordpress.com/2009/01/14/great-post-on-databases-and-mapreduce/</link>
<pubDate>Wed, 14 Jan 2009 22:21:02 +0000</pubDate>
<dc:creator>squarecog</dc:creator>
<guid>http://squarecog.wordpress.com/2009/01/14/great-post-on-databases-and-mapreduce/</guid>
<description><![CDATA[Anand Rajaraman has a great post on Datawocky with an overview of the various approaches to data ana]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Anand Rajaraman has a <a title="post about databases and map/reduce" href="http://anand.typepad.com/datawocky/2008/09/bridging-the-gap-between-relational-databases-and-mapreduce-three-new-approaches.html">great post on Datawocky</a> with an overview of the various approaches to data analysis using Map/Reduce, and the ways in which this paradigm is bridged with RDBMSes by AsterData and Greenplum, and the Pig project.  Don&#8217;t miss the comments from people directly responsible for these technologies, as well as Facebook&#8217;s Hive.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Hadoop Map-Reduce – Tuning and Debugging ]]></title>
<link>http://infram.wordpress.com/2008/11/28/hadoop-map-reduce-%e2%80%93-tuning-and-debugging/</link>
<pubDate>Fri, 28 Nov 2008 08:35:58 +0000</pubDate>
<dc:creator>mascha</dc:creator>
<guid>http://infram.wordpress.com/2008/11/28/hadoop-map-reduce-%e2%80%93-tuning-and-debugging/</guid>
<description><![CDATA[Some slides about tuning and debugging Hadoop map-reduce. They give good starting points for tuning.]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Some slides about <a title="tuning and debugging hadoop map-reduce" href="http://business.rapleaf.com/pdfs/hadoop_part_3.pdf">tuning and debugging Hadoop map-reduce</a>. They give good starting points for tuning.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[MapReduce meets AI and Multicore]]></title>
<link>http://erppress.wordpress.com/2008/11/24/mapreduce-meets-ai-and-multicore/</link>
<pubDate>Mon, 24 Nov 2008 21:27:07 +0000</pubDate>
<dc:creator>Alexey Kalmykov</dc:creator>
<guid>http://erppress.wordpress.com/2008/11/24/mapreduce-meets-ai-and-multicore/</guid>
<description><![CDATA[http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf http://cwiki.apache.org/M]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf" target="_blank">http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf</a></p>
<p><a href="http://cwiki.apache.org/MAHOUT/" target="_blank">http://cwiki.apache.org/MAHOUT/</a></p>
</div>]]></content:encoded>
</item>

</channel>
</rss>
