<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>web-mining &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/web-mining/</link>
	<description>Feed of posts on WordPress.com tagged "web-mining"</description>
	<pubDate>Thu, 24 Dec 2009 19:21:45 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Coding Collective Intelligence]]></title>
<link>http://chemoton.wordpress.com/2009/11/24/1366/</link>
<pubDate>Mon, 23 Nov 2009 23:04:59 +0000</pubDate>
<dc:creator>Vitorino Ramos</dc:creator>
<guid>http://chemoton.wordpress.com/2009/11/24/1366/</guid>
<description><![CDATA[Figure &#8211; Book cover of Toby Segaran&#8217;s, &#8220;Programming Collective Intelligence ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="text-align:center;"><a href="http://chemoton.wordpress.com/files/2009/11/pci-book.jpg"><img class="aligncenter size-full wp-image-1367" title="PCI Book" src="http://chemoton.wordpress.com/files/2009/11/pci-book.jpg" alt="" width="500" height="655" /></a>Figure &#8211; Book cover of Toby Segaran&#8217;s &#8220;<a href="http://oreilly.com/catalog/9780596529321" target="_blank">Programming Collective Intelligence &#8211; Building Smart Web 2.0 Applications</a>&#8221;, O&#8217;Reilly Media, 368 pp., August 2007.</p>
<p>{<strong>scopus online description</strong>} Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting data-sets from other web sites, collect data from users of your own applications, and analyze and understand the data once you&#8217;ve found it.  <em>Programming Collective Intelligence</em> takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general — all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application.</p>
<p style="text-align:justify;">{<strong>even if I don&#8217;t totally agree, here&#8217;s an &#8220;over-rated&#8221; description &#8211; especially on the scientific side &#8211; by someone &#8220;dwa&#8221; &#8211; link above</strong>} <em>Programming Collective Intelligence</em> is a new book from O&#8217;Reilly, written by Toby Segaran. The author graduated from MIT and currently works at Metaweb Technologies, where he develops ways to put large public data-sets into Freebase, a free online semantic database. You can find more information about him on his blog: http://blog.kiwitobes.com/. Web 2.0 cannot exist without collective intelligence. The &#8220;giants&#8221; use it everywhere: YouTube recommends similar movies, Last.fm knows what you would like to listen to, Flickr knows which photos are your favorites, and so on. This technology powers <em>intelligent search</em>, <em>clustering</em>, <em>building price models</em> and <em>ranking on the web</em>. I cannot imagine a modern service without <em>data analysis</em>, and that is why it is worth starting to read about it. There are many titles about <em>collective intelligence</em>, but recently I have read two: this one and &#8220;<em>Collective Intelligence in Action</em>&#8221;. Both are very pragmatic, but the O&#8217;Reilly one is more focused on the substance of CI. Its code listings are much shorter (the examples are written in <em>Python</em>, so that was easy). In general, comparing these books is like comparing <em>Java</em> with <em>Python</em>. If you would like to build a recommendation engine the &#8220;in Action&#8221;/Java way, you would have to read the whole book, attach extra JARs and design dozens of classes. The rapid <em>Python</em> way requires reading only 15 pages and voila, you have your first recommendations. It is awesome!</p>
<p style="text-align:justify;">So how about the rest of the book? There are still 319 pages! Further chapters cover <em>discovering groups</em>, <em>searching</em>, <em>ranking</em>, <em>optimization</em>, <em>document filtering</em>, <em>decision trees</em>, <em>price models</em> and <em>genetic algorithms</em>. The book explains how to implement <em>Simulated Annealing</em>, <em>k-Nearest Neighbors</em>, a <em>Bayesian Classifier</em> and many more. Take a look at the table of contents (here: http://oreilly.com/catalog/9780596529321/preview.html); it does not list all the algorithms, but you can find more information there. Each chapter has about 20-30 pages. You do not have to read them all; you can choose the most important ones and still know what is going on. Every chapter contains a minimal theoretical introduction, which might not be enough for total beginners. I recommend this book to students who have taken a statistics course (not only in IT or computing science): it will show you how to use your knowledge in practice, with many inspiring examples. For those who do not know <em>Python</em>, do not be afraid: at the beginning you will find a short introduction to the language syntax. All listings are very short and well described by the author, sometimes line by line. The book also contains the necessary information about the basic standard libraries responsible for XML processing and web page downloading. If you would like to start learning about <em>collective intelligence</em>, I would strongly recommend reading &#8220;<em>Programming Collective Intelligence</em>&#8221; first, then &#8220;Collective Intelligence in Action&#8221;. The first shows how easy it is to implement basic algorithms; the second shows you how to use existing open source projects related to <em>machine learning</em>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Web Miners vs Web Masters - An Uneasy Truce]]></title>
<link>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/</link>
<pubDate>Wed, 11 Nov 2009 17:10:03 +0000</pubDate>
<dc:creator>kkrugler</dc:creator>
<guid>http://bixolabs.com/2009/11/11/web-miners-vs-web-masters-an-uneasy-truce/</guid>
<description><![CDATA[The life of a webmaster is hard, and web crawlers make it harder http://www.flickr.com/photos/absolu]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><h2>The life of a webmaster is hard, and web crawlers make it harder</h2>
<table>
<tbody>
<tr>
<td><a href="http://www.flickr.com/photos/absolutely_loverly/2953035408/"><img class="alignnone size-full wp-image-167" title="Angry Face" src="http://bixolabs.wordpress.com/files/2009/11/angry-face.png" alt="Angry Face" width="206" height="206" /></a></p>
<div><a rel="cc:attributionURL" href="http://www.flickr.com/photos/absolutely_loverly/">http://www.flickr.com/photos/absolutely_loverly/</a> / <a rel="license" href="http://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></div>
<p>&#160;</p>
</td>
<td align="top">There&#8217;s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many <a href="http://www.webmasterworld.com/forum39/4119.htm" target="_blank">web crawlers are evil</a>.</td>
</tr>
</tbody>
</table>
<p>But web crawlers serve a very real, important role in the life of a successful site, and it&#8217;s all about <strong>traffic</strong>. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.</p>
<h2>Implicit Contracts</h2>
<p>An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don&#8217;t overload my site, and you bring traffic my way. In return, I&#8217;ll give you free access to lots of valuable content that I host.</p>
<p>And that&#8217;s worked reasonably well, for the past 15 years. Yes, there are crawlers that ignore the <a href="http://en.wikipedia.org/wiki/Robots_exclusion_standard" target="_blank">Robots Exclusion Standard</a>. And there are crawlers that overload the site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages using bogus URLs that it incorrectly derived from content on the site&#8217;s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.</p>
<h2>Web Mining != Search Index</h2>
<p>But now you&#8217;ve got web miners &#8211; automated agents that collect data which often doesn&#8217;t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.</p>
<p>It hasn&#8217;t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.</p>
<p>What does this mean for the web eco-system? Three things, one for each participant:</p>
<ol>
<li>Web miners need to <strong>crawl extra-super-politely</strong>.</li>
<li>Customers need to work with key sites to <strong>pick good crawl times</strong>.</li>
<li>Web sites need to <strong>offer for-fee APIs</strong> for data mining.</li>
</ol>
<p>The first point is the easiest one to solve &#8211; never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.</p>
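<p>As a rough illustration of those three rules, here is a minimal Python 3 sketch. It is not Bixo code: the user-agent string, the 15-second spacing and the <code>PoliteScheduler</code> class are assumptions made up for this example, and only the standard library&#8217;s <code>urllib.robotparser</code> is used. One scheduler per site, called from a single thread, keeps requests sequential.</p>

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration only; tune them per site.
USER_AGENT = "example-miner"
MIN_DELAY_SECONDS = 15.0  # roughly four pages a minute


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteScheduler(object):
    """Tracks when the next request to a single site may be sent."""

    def __init__(self, rules, min_delay=MIN_DELAY_SECONDS, clock=time.time):
        self.rules = rules
        self.min_delay = min_delay
        self.clock = clock
        self.last_fetch = None

    def allowed(self, url):
        """True if robots.txt lets our user agent fetch this URL."""
        return self.rules.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self):
        """How long to sleep before the next request may go out."""
        if self.last_fetch is None:
            return 0.0
        elapsed = self.clock() - self.last_fetch
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self):
        """Call right after each request completes."""
        self.last_fetch = self.clock()
```

<p>A crawl loop would check <code>allowed(url)</code>, sleep for <code>seconds_to_wait()</code>, fetch the page, and then call <code>record_fetch()</code> before moving on to the next URL.</p>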
<p>The second is a bit harder, as it currently requires person-to-person contact with the web site in question. It&#8217;s possible to derive these &#8220;good crawl times&#8221; by varying the request rate and watching how the response performance changes, so there are work-arounds. But eventually I expect to see an extension to robots.txt that lets the site owner provide additional clues to web crawlers about good and bad times for crawling.</p>
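<p>One way to picture that work-around is a simple back-off heuristic: slow down sharply when the site&#8217;s responses get slower, and speed up cautiously when they look healthy. The Python 3 sketch below is only an assumption about how this could be done; the function name, the &#8220;twice the baseline&#8221; threshold and the delay limits are all invented for illustration.</p>

```python
def next_delay(current_delay, response_seconds, baseline_seconds,
               min_delay=15.0, max_delay=600.0):
    """Return the delay to use before the next request to a site.

    baseline_seconds is the response time seen when the site is idle;
    all thresholds here are hypothetical and would need tuning.
    """
    if response_seconds > 2.0 * baseline_seconds:
        # The site looks loaded: halve our request rate.
        proposed = current_delay * 2.0
    else:
        # The site looks healthy: speed up a little.
        proposed = current_delay * 0.9
    # Never go faster than min_delay or slower than max_delay.
    return min(max_delay, max(min_delay, proposed))
```

<p>Logging the chosen delay against the time of day would, over a few days, show when the site has the most headroom for crawling.</p>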
<p>The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate limited, where you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms of use restrictions that limit a caller&#8217;s ability to actually mine the response data &#8211; often the only option is to republish it, with links/attribution back to the originating site.</p>
<p>What would be great is if everybody had a model like Amazon&#8217;s <a href="http://aws.amazon.com/awis/" target="_blank">AWIS</a>, where X requests cost N dollars. You can decide how much or how little to spend. There aren&#8217;t many restrictions on rate/volume or usage. And as a huge added bonus, the data comes back structured, so you don&#8217;t have to waste time hand-crafting some fragile, error-prone HTML scraping code.</p>
<p>And a side-note to companies thinking about the API issue &#8211; if you don&#8217;t provide one, and you block web miners, then you&#8217;ll get crawled anyway, in stealth mode by less scrupulous firms. So then everybody loses, since you&#8217;ll still be giving free access while taking a performance hit, while companies that need the data pay more to these &#8220;stealth crawlers&#8221; and get worse results.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Mining data streams, the web, and the climate]]></title>
<link>http://followthedata.wordpress.com/2009/11/09/mining-data-streams-the-web-and-the-climate/</link>
<pubDate>Mon, 09 Nov 2009 16:32:48 +0000</pubDate>
<dc:creator>Mikael Huss</dc:creator>
<guid>http://followthedata.wordpress.com/2009/11/09/mining-data-streams-the-web-and-the-climate/</guid>
<description><![CDATA[I recently came across MOA (Massive Online Analysis), an environment for what its developers call ma]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I recently came across <a href="http://www.cs.waikato.ac.nz/~abifet/MOA/">MOA</a> (Massive Online Analysis), an environment for what its developers call massive data mining, or <em>data stream mining</em>. This New Zealand-based project is related to <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a>, a Java-based framework for machine learning which I&#8217;ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. Therefore the full data set will typically be many times larger than your computer&#8217;s memory (which already rules out some commonly used algorithms), and each example can only be briefly examined once, after which it is discarded. Therefore the statistical model has to be updated incrementally, and often must be ready to be applied at any point between training examples.</p>
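<p>To make the &#8220;examine once, update incrementally&#8221; idea concrete, here is a tiny Python 3 sketch that keeps a running mean and variance with Welford&#8217;s online algorithm. It has nothing to do with MOA&#8217;s own API; the <code>RunningStats</code> class is a made-up illustration of a model that sees each example once, keeps constant memory, and can be queried at any point in the stream.</p>

```python
class RunningStats(object):
    """Running mean and sample variance, updated one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the model, then discard it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        if self.count > 1:
            return self.m2 / (self.count - 1)
        return 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # the value itself is never stored
```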
<p>I also came across a press release describing version 2.0 of <a href="http://www.knowledgeminer.com/">KnowledgeMiner for Excel</a>, a data mining package apparently used by customers like Pfizer, NASA and Boeing, and which is based on <a href="http://www.gmdh.net/">GMDH (Group Method of Data Handling)</a>, a paradigm I hadn&#8217;t heard about before. I failed to install KnowledgeMiner for Excel on my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, making it easier to use for non-experts in data mining. (Of course I haven&#8217;t tried it, so it&#8217;s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:</p>
<blockquote><p>The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to <a href="http://www.knowledgeminer.com/cc/" target="_blank">this link</a> to see the climate change data displayed graphically in a slideshow through the year 2020:</p>
</blockquote>
<p>There&#8217;s also an interesting new toolkit for web mining from <a href="http://bixolabs.com/">BixoLabs</a>. They&#8217;ve built what they call an elastic web mining platform in Amazon&#8217;s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and from the tutorials on the site, it seems pretty easy to set it up to crawl the web to your heart&#8217;s content.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Elastic Web Mining Talk]]></title>
<link>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/</link>
<pubDate>Tue, 03 Nov 2009 02:32:20 +0000</pubDate>
<dc:creator>kkrugler</dc:creator>
<guid>http://bixolabs.com/2009/11/02/elastic-web-mining-talk/</guid>
<description><![CDATA[Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Here&#8217;s the presentation I gave at the ACM data mining unconference on elastic web mining &#8211; how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon&#8217;s Elastic Compute Cloud (EC2).</p>
<p>But I don&#8217;t see my notes showing up, so here&#8217;s the PDF version with full notes, which make the resulting slides a lot more meaningful.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Bixolabs goes public]]></title>
<link>http://ken-blog.krugler.org/2009/11/02/bixolabs-goes-public/</link>
<pubDate>Tue, 03 Nov 2009 00:47:52 +0000</pubDate>
<dc:creator>kkrugler</dc:creator>
<guid>http://ken-blog.krugler.org/2009/11/02/bixolabs-goes-public/</guid>
<description><![CDATA[I&#8217;ve been working on an elastic web mining platform for a few months now, and it was finally t]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I&#8217;ve been working on an elastic <a title="Web mining" href="http://bixolabs.com/web-mining/" target="_blank">web mining</a> platform for a few months now, and it was finally time to go public with at least the current state of the union.</p>
<p>I gave a talk at the <a title="ACM Data Mining Unconference" href="http://www.sfbayacm.org/?p=894" target="_blank">ACM Data Mining Unconference</a> on Sunday, where I also announced the <a title="Public Terabyte Dataset Project" href="http://bixolabs.com/datasets/public-terabyte-dataset-project/" target="_blank">Public Terabyte Dataset project</a>, so the timing was perfect.</p>
<p>If you want to know what&#8217;s been keeping me busy, and looks to be part of my future, check out <a title="Elastic web mining platform" href="http://bixolabs.com/">http://bixolabs.com</a>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Announcing the Public Terabyte Dataset project]]></title>
<link>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/</link>
<pubDate>Sun, 01 Nov 2009 14:58:43 +0000</pubDate>
<dc:creator>kkrugler</dc:creator>
<guid>http://bixolabs.com/2009/11/01/announcing-the-public-terabyte-dataset-project/</guid>
<description><![CDATA[We&#8217;re very excited to announce the Public Terabyte Dataset project. This is a high quality cra]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>We&#8217;re very excited to announce the <a href="/datasets/public-terabyte-dataset-project/" target="_self">Public Terabyte Dataset project</a>.</p>
<p>This is a high quality crawl of top web sites, using AWS&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/" target="_blank">Elastic Map Reduce</a>, Concurrent&#8217;s <a href="http://www.cascading.org/" target="_blank">Cascading</a> workflow API, and Bixolabs&#8217; elastic <a href="/">web mining platform</a>.</p>
<p>Hosting for the resulting dataset will be provided by Amazon in <a href="https://s3.amazonaws.com/" target="_blank">S3</a>, and freely available to all <a href="http://aws.amazon.com/ec2/" target="_blank">EC2</a> users.</p>
<p>In addition, the code used to create and process the dataset will be available for download from <a href="http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263" target="_blank">http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263</a></p>
<p>Questions and input on the project can be submitted at <a title="Public Terabyte Dataset form" href="http://bixolabs.com/PTD/">http://bixolabs.com/PTD/</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Presenting at 2009 Silicon Valley Data Mining Camp]]></title>
<link>http://bixolabs.com/2009/10/30/presenting-at-2009-silicon-valley-data-mining-camp/</link>
<pubDate>Fri, 30 Oct 2009 17:42:03 +0000</pubDate>
<dc:creator>kkrugler</dc:creator>
<guid>http://bixolabs.com/2009/10/30/presenting-at-2009-silicon-valley-data-mining-camp/</guid>
<description><![CDATA[This coming Sunday is the big Bay Area data mining &#8220;unconference&#8221;, and with more than 20]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>This coming Sunday is the big Bay Area data mining &#8220;<a href="http://en.wikipedia.org/wiki/Unconference" target="_blank">unconference</a>&#8221;, and with more than 200 people already signed up, it&#8217;s going to be a lot of fun.</p>
<p>I&#8217;ll be presenting at some point during the day &#8211; since it&#8217;s an unconference, you don&#8217;t really know who&#8217;s going to be talking about what/when. My topic is &#8220;<a href="http://www.sfbayacm.org/?p=894&#38;cpage=1#comment-37" target="_blank">Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud</a>&#8221;.</p>
<p>If you scan the list of attendees (click the &#8220;RSVPs&#8221; tab near the top of the <a href="http://events.linkedin.com/ACM-Silicon-Valley-Data-Mining-Camp/pub/142420" target="_blank">LinkedIn event page</a>) you&#8217;ll see a lot of high powered executives, consultants and researchers, so I&#8217;m looking forward to really great lobby conversations.</p>
<p>Many thanks to the San Francisco Bay Area Chapter of the ACM for helping out with the <a href="http://www.sfbayacm.org/?p=894" target="_blank">event</a>, which is taking place from noon to 7:30pm at the <a href="http://hackerdojo.pbworks.com/" target="_blank">Hacker Dojo</a> in Mountain View. <a href="http://www.linkedin.com/in/gregmakowski" target="_blank">Greg Makowski</a> is the organizer, so he&#8217;s probably going a little bit crazy right now <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Monitoring brand using discourse analysis]]></title>
<link>http://discourseweb.wordpress.com/2009/08/28/monitoring-brand/</link>
<pubDate>Fri, 28 Aug 2009 21:39:26 +0000</pubDate>
<dc:creator>Andrzej Góralczyk</dc:creator>
<guid>http://discourseweb.wordpress.com/2009/08/28/monitoring-brand/</guid>
<description><![CDATA[&nbsp; Learning public opinion (or sentiment) about Your brand in traditional way is expensive becau]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>&#160;</p>
<p>Learning public opinion (or sentiment) about your brand the traditional way is expensive, because surveys and focus groups take a lot of time and human effort. An alternative will probably gain recognition in the near future, as some technology vendors have launched tools for brand monitoring based on text analytics. An initial review of these attempts appeared yesterday on <a href="http://smartdatacollective.com/Home/21029">SmartData Collective</a>.</p>
<p>Monitoring a brand using discourse analysis differs, to some extent, from the approach based on text analysis. I have a very fresh example &#8211; a tool for monitoring opinion about retail chains (supermarkets). Here are a few words on how it is built and how it works.</p>
<p><strong>Building Monitor</strong>. Analysis of the discourse in a corpus of Internet discussions related to supermarkets gave a collection of subjects that interest the interlocutors, and a collection of expressions of their attitudes.  Using these results, complex queries for semantic search were built for the <strong>learning research</strong>. This is the crucial stage &#8211; we should learn the fine details of the discourse and, at the same time, get its math, as the basis for justifying and calibrating the Monitor. The final task is relatively easy &#8211; to implement the results and build a &#8220;machine&#8221; using available technology.</p>
<p><strong>How Monitor works</strong>. The data for each retail brand is collected using semantic search. Monitor performs all the calculations according to the calibration formulas and provides figures ready for presentation. Please see the <a href="http://discourseweb.wordpress.com/monitor-opinii-spolecznej/opinie-o-hipermarketach/">pictures made for presentation only</a> (not the production version).</p>
<p>First comes the &#8220;profile&#8221; &#8211; how the brand is perceived, i.e. how it is distinguished from the average Internet discourse. A result of this kind is often astonishing, because the picture differs dramatically from the wishes of the Customer (the user of the Monitor), from the official image and from the marketing buzz. Moreover, the interlocutors&#8217; categories (vertical in the charts) also differ.</p>
<p>Then there is a comparison of the monitored brands. The charts show how people value each brand with regard to the same categories.  Two charts with negative opinions (general index only) are presented as an example.</p>
<p>The third important group of results concerns the monitoring itself, i.e. the presentation of changes. It depends on the Customer&#8217;s needs. Some customers want to observe the effects of promotional campaigns, and for that purpose day-to-day monitoring is appropriate. Some want to know the general trends&#8230; etc.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Web Mining with SPSS, Python, and Google Analytics]]></title>
<link>http://brocktibert.wordpress.com/2009/08/10/web-mining-with-spss-python-and-google-analytics/</link>
<pubDate>Mon, 10 Aug 2009 20:14:28 +0000</pubDate>
<dc:creator>datamonkey3</dc:creator>
<guid>http://brocktibert.wordpress.com/2009/08/10/web-mining-with-spss-python-and-google-analytics/</guid>
<description><![CDATA[I have posted a few times on how much I love the ability to access Google Analytics data with Python]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I have posted a few times on how much I love the ability to access Google Analytics data with Python and bring it into SPSS, the tool I use literally every day.  If I wanted to go the totally free route, I could probably learn how to do this analysis in R, but that is a ways off.</p>
<p>The code below queries a Google Analytics account, filters for a specific set of pages, and uses the results in another block of code that navigates to each page and finds all of the links (I even filtered this set of links too!).  I am new to the web mining arena, but just for fun, I want to look at the dynamics of a web site (or in this case, a sub-site) and overlay onto the network actual data on the usage of each page.  It might be interesting, it might not be, but at least I will learn a lot about Social Networks, Python, and SPSS along the way!</p>
<p>Anyway, the end product generates two SPSS files; one is simply the results from the query sent to the Google Analytics Data Export API, the other is a file that contains the parent/child links from my web spider.  Hopefully you learn something new or it sparks an idea of your own.  If you see something in my code and think I could write it better, please let me know.  I am learning as I go here and know that I probably could do something better, but hey, at least it works!</p>
<p>One note:  I intentionally overwrite the entries in the variable &#8220;pages&#8221; because I don&#8217;t want to navigate to every page returned by GA.  I will comment this out at a later time, when the web traffic to the site is lighter.  I actually haven&#8217;t done this yet, but I am assuming it works just fine.</p>
<p>&#8221;&#8217;<br />
Created on Aug 10, 2009</p>
<p>1.  This script connects to Google Analytics and returns a set of pages and some usage data on each<br />
2.  Use the set of pages and navigate to each to parse the links from each<br />
3.  Save the Google Analytics data to an SPSS dataset &#8220;pagedata&#8221;<br />
4.  Save the spidered links to an SPSS dataset &#8220;spider&#8221;</p>
<p>REFERENCES:</p>
<p>http://blog.clintecker.com/post/100021441/python-google-analytics-client-how-to-use-it-and-how</p>
<p>http://github.com/clintecker/python-googleanalytics/tree/master</p>
<p>http://code.google.com/apis/analytics/docs/gdata/gdataReferenceDimensionsMetrics.html</p>
<p>http://blog.wellsoliver.com/2009/06/retrosheet/</p>
<p>http://www.spss.com/fusetalk/forum/categories.cfm?catid=9&#38;entercat=y (Jon Peck)</p>
<p>Begginning Python 2nd Edition &#8211; Hetland</p>
<p>&#8221;&#8217;</p>
<p>import urllib<br />
import re<br />
from googleanalytics import Connection<br />
import datetime<br />
import spss<br />
import time</p>
<p>START = time.time()</p>
<p>##############################################################<br />
#Step 1: query GA to get a list of the UGA pages<br />
##############################################################</p>
<p># setup account and get the external website data<br />
connection = Connection(&#8216;email&#8217;, &#8216;pw&#8217;)<br />
account = connection.get_account(&#8216;acct#&#8217;)</p>
<p>#define the start and end dates (yyyy,mm,dd)<br />
start_date = datetime.date(2009, 7, 6)<br />
end_date = datetime.date(2009, 8, 9)</p>
<p>#define the dimensions, metrics, and filters<br />
dimensions = ['pagePath']<br />
metrics = ['uniquePageviews', 'entrances', 'exits', 'bounces', 'timeOnPage']<br />
filters = [['pagePath' , '=~', '^/undergraduate']]</p>
<p>#get the data to be processed later<br />
data = account.get_data(start_date = start_date, end_date = end_date, dimensions=dimensions, metrics=metrics, filters=filters)</p>
<p>#create a list of pages that will we navigate to and parse the links from<br />
pages = []</p>
<p>sets = [(pr[0][0]) for pr in data.tuple]<br />
for set in sets:<br />
pages.append(set)</p>
<p>#testing the pages<br />
#for page in pages:<br />
#    print page</p>
<p>#this is just a test so I dont run the full spider<br />
pages = ['/undergraduate/']</p>
<p>##############################################################<br />
#Step 2: navigate to each page and parse the links into another tuple<br />
##############################################################</p>
<p>#the tuple that will contain the links<br />
links = []</p>
<p>for page in pages:<br />
url = &#8220;http://www.yoursite.edu&#8221; + page<br />
#print url<br />
try:<br />
content = urllib.urlopen(url).read()<br />
except:<br />
print &#8220;Could not fetch %s&#8221; % url<br />
raise SystemExit</p>
<p>#this pattern pulls the page content but excludes header and footer links since they are on every page<br />
pattern = r&#8217;
<div id="multiNav">(.+?) div id=&#8221;footerfix&#8221;&#8216;</p>
<p>for match in re.finditer(pattern, content, re.S):</p>
<p>strip = r&#8217;a href=&#8221;(/undergraduate/.+?.cfm)&#8221;&#8216;</p>
<p>for m in re.finditer(strip, match.group(0), re.S):<br />
append = [page, m.group(1)]<br />
links.append(append)</p>
<p>#for link in links:<br />
#    print link</p>
<p>##############################################################<br />
#Step 3: Create SPSS datasets from the data returned by Google Analytics API<br />
##############################################################</p>
<p>#Create the dataset to hold the data returned from GA<br />
spss.StartDataStep()<br />
dsObj = spss.Dataset(name=None)<br />
dsname = dsObj.name #save a variable that stores the name of the new dataset for reference later</p>
<p>#append fields to the SPSS file<br />
for field in dimensions:<br />
dsObj.varlist.append(field, 255)<br />
for field in metrics:<br />
dsObj.varlist.append(field, 0)</p>
<p>#put the data from GA into a format that can be fed into SPSS<br />
##(pr[tuple index][element inside tuple index])<br />
#need to make numbers strings so they can be fed to the SPSS api<br />
recs = [(pr[0][0], str(pr[1][0]), str(pr[1][1]), str(pr[1][2]), str(pr[1][3]), str(pr[1][4])) for pr in data.tuple]</p>
<p>#iterate over the new tuple and append cases to the SPSS dataset<br />
for rec in recs:<br />
#print rec<br />
dsObj.cases.append(rec)</p>
<p>#end the datastep so we can save out the SPSS file<br />
spss.EndDataStep()</p>
<p>#activate the new Dataset and save the file<br />
spss.Submit(&#8216;DATASET ACTIVATE %s&#8217; %dsname)<br />
spss.Submit(r&#8221;"&#8221;SAVE OUTFILE = &#8216;C:/pagedata.sav.&#8217;&#8221;"&#8221;)</p>
<p>##############################################################<br />
#Step 4: Create SPSS datasets from the spidered links<br />
##############################################################</p>
<p>#Create the dataset to hold the spidered links<br />
spss.StartDataStep()<br />
dsObj = spss.Dataset(name=None)<br />
dsname = dsObj.name #save a variable that stores the name of the new dataset for reference later</p>
<p>#append fields to the SPSS file<br />
dsObj.varlist.append(&#8216;parent&#8217;, 255)<br />
dsObj.varlist.append(&#8216;child&#8217;, 255)</p>
<p>#add data to the file<br />
for link in links:<br />
    dsObj.cases.append(link)</p>
<p>#end the datastep so we can save out the SPSS file<br />
spss.EndDataStep()</p>
<p>#activate the new Dataset and save the file<br />
spss.Submit('DATASET ACTIVATE %s' % dsname)<br />
spss.Submit(r&quot;&quot;&quot;SAVE OUTFILE = 'C:/spider.sav'.&quot;&quot;&quot;)</p>
<p>#calculate how long the script took to run<br />
total = (time.time() - START)<br />
print &quot;Script took %d seconds&quot; % total</p></div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Objectivity of subjectivity]]></title>
<link>http://discourseweb.wordpress.com/2009/07/22/objectivity-of-subjectivity/</link>
<pubDate>Wed, 22 Jul 2009 21:52:25 +0000</pubDate>
<dc:creator>Andrzej Góralczyk</dc:creator>
<guid>http://discourseweb.wordpress.com/2009/07/22/objectivity-of-subjectivity/</guid>
<description><![CDATA[&nbsp; For some months I study the discourse of attitude, and sometimes look back to the nice talk o]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>&#160;</p>
<p>For some months I have been studying the discourse of attitude, and I sometimes look back to the nice talk by <a href="http://www.lti.cs.cmu.edu/SRS/archives/2008/srs2008.html">Shilpa Arora and Mahesh Joshi</a>. At the beginning of their work they consider a comment subjective if it cannot be objectively verified. Such a definition seems quite reasonable if you rely on some arbitrary judgement about what can be objectively verified and what cannot. However, arbitrary judgement is itself subjective.</p>
<p>There is a long tradition in opinion mining of making arbitrary classifications of words or expressions. For example, many researchers consider opinions subjective and statements about facts objective. Many use an external standard dictionary to classify expressions as positive or negative, WordNet to identify semantic similarity, etc. It&#8217;s no wonder that so many studies have trouble identifying ironic expressions, sarcasm, etc.</p>
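<p>To make the problem concrete, here is a minimal sketch of a dictionary-based polarity classifier. The word lists <code>POSITIVE</code> and <code>NEGATIVE</code> and the function <code>lexicon_polarity</code> are made up for illustration and do not come from any particular study; the point is only that counting dictionary words labels a sarcastic complaint as positive.</p>

```python
# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"awful", "hate", "broken"}

def lexicon_polarity(text):
    """Classify text by counting words found in fixed polarity lexicons."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# A sarcastic complaint: the lexicon only sees "great" and "wonderful".
print(lexicon_polarity("Great, my flight got cancelled again. Wonderful!"))
```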
<p>In my studies on expressing attitudes the discourse is taken <em>as is</em>, without arbitrary classification. Instead, I examine the contexts in which the expressions appear. And the result (for the Polish language) is different. In the rough picture, expressions of attitude seem to fall into two categories: &#8220;at the point&#8221; statements, and private, even intimate statements. There is a sharp distinction between these two kinds of expressions, clearly visible in the matrices of context relatedness of the expressions.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Streamlining web mining]]></title>
<link>http://discourseweb.wordpress.com/2009/07/15/streamlining-web-mining/</link>
<pubDate>Wed, 15 Jul 2009 16:42:40 +0000</pubDate>
<dc:creator>Andrzej Góralczyk</dc:creator>
<guid>http://discourseweb.wordpress.com/2009/07/15/streamlining-web-mining/</guid>
<description><![CDATA[&nbsp; Last Sunday I submitted my comment to the people vs machine debate in Research Magazine. Some]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>&#160;</p>
<p>Last Sunday I submitted my comment to the <a href="http://www.research-live.com/features/tracking-online-word-of-mouth-the-people-vs-machines-debate/4000156.article">people vs machine debate</a> in Research Magazine. Some readers of that comment asked me how I achieve 97% accuracy in measuring sentiment changes in web mining.</p>
<p>Web text analytics is a rather new field of research, and everybody uses their own approach. So my only advice is: don&#8217;t be in too much of a hurry. If you collect millions of records and focus on thousands of specific sentiment-rich expressions, first look at the data. Compute some basic descriptive statistics (yes!), make some charts of the frequency distributions, etc. Try to find a proper way of stratifying the data, using your best proven approaches and tools. Don&#8217;t skip this basic examination. I write this because I see many newcomers to the analytics business who want to cut corners.</p>
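<p>As a rough illustration of this first look at the data, the sketch below counts how often each tracked expression occurs and prints a crude text histogram. The records, the expression list, and the function name <code>expression_frequencies</code> are invented for the example; real data would of course need proper tokenization rather than substring counting.</p>

```python
from collections import Counter

def expression_frequencies(records, expressions):
    """Count occurrences of each tracked expression across all records.

    Uses plain substring counting, so "love" would also match "lovely";
    that is acceptable for a first look but not for a real study.
    """
    counts = Counter()
    for text in records:
        lowered = text.lower()
        for expr in expressions:
            counts[expr] += lowered.count(expr)
    return counts

# Hypothetical records and expressions, for illustration only.
records = [
    "I love this phone, love the screen",
    "The battery is awful",
    "Awful service, but I love the price",
]
expressions = ["love", "awful"]

freq = expression_frequencies(records, expressions)
for expr, n in freq.most_common():
    print(expr, n, "#" * n)  # crude text histogram of the distribution
```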
<p>If you find a good way of stratifying the data you will undoubtedly notice that some expressions occur most frequently in one, two, or three specific contexts or subject domains. Follow this clue, and limit further research to these expressions. This is the first step toward discourse mining (not simply text mining).</p>
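<p>One possible way to operationalize this clue, sketched under my own assumptions: for each expression, measure what share of its occurrences falls into its few most frequent contexts, and keep only the expressions above a threshold. The names <code>context_concentration</code> and <code>select_expressions</code>, the threshold values, and the sample observations are all hypothetical.</p>

```python
from collections import Counter, defaultdict

def context_concentration(observations, top_k=3):
    """Share of each expression's occurrences found in its top_k contexts."""
    by_expr = defaultdict(Counter)
    for expr, context in observations:
        by_expr[expr][context] += 1
    shares = {}
    for expr, ctx_counts in by_expr.items():
        total = sum(ctx_counts.values())
        top = sum(n for _, n in ctx_counts.most_common(top_k))
        shares[expr] = top / total
    return shares

def select_expressions(observations, top_k=3, min_share=0.8):
    """Keep expressions that are concentrated in a few contexts."""
    shares = context_concentration(observations, top_k)
    return sorted(e for e, s in shares.items() if s >= min_share)

# Hypothetical (expression, context) observations.
observations = [
    ("rip-off", "banking"), ("rip-off", "banking"), ("rip-off", "telecom"),
    ("nice", "banking"), ("nice", "telecom"), ("nice", "travel"),
    ("nice", "food"), ("nice", "cars"),
]
print(select_expressions(observations, top_k=1, min_share=0.6))
```

<p>Here &#8220;rip-off&#8221; is kept because two of its three occurrences share one context, while &#8220;nice&#8221; is spread evenly over five contexts and is dropped.</p>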
<p>The next steps are obvious. Look for relations between various characteristics of the contexts, subject domains, and these &#8220;good&#8221; expressions. Use clustering to select the subject domains and texts you need. Make the selection from your corpus of texts.</p>
<p>There are a lot of tools to extract <strong>rich and accurate information</strong> from data selected in this way.</p>
<p>Limiting the scope of study is the first and very basic way to streamline any research process. It is also a basic step used in Industrial Engineering in streamlining any manufacturing or business process.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[More Contacts Through Innovative Web Mining]]></title>
<link>http://kontaktblog.ch/2009/06/04/mehr-kontakte-durch-innovatives-web-mining/</link>
<pubDate>Thu, 04 Jun 2009 08:34:18 +0000</pubDate>
<dc:creator>Antares</dc:creator>
<guid>http://kontaktblog.ch/2009/06/04/mehr-kontakte-durch-innovatives-web-mining/</guid>
<description><![CDATA[Electronic business relationships are becoming increasingly important. What is interesting is not only]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Electronic business relationships are becoming increasingly important. What is interesting is not only that more and more companies generate profits through electronic channels, but also that recording user behavior now makes it possible to lay the foundation for comprehensive analyses.</p>
<p>These possibilities are reflected in the marketing trend toward critically assessing and maximizing the «return on marketing», that is, the shift from looking purely at effectiveness to evaluating efficiency. Exploiting the potential of log files&#8230;<!--more--></p>
<p>Web mining for lead generation: exploiting the potential of log files.</p>
<p>The data mining tool required for this is the innovative analysis method «Predictive (Web) Analytics», which identifies customer wishes, predicts behavior, and makes the resulting knowledge usable for shaping even more profitable customer relationships. Predictive Web Analytics links online data with actual «events» such as orders, purchases, and so on, and draws reliable conclusions about future events. This makes it possible to predict visitor affinity, compute clusters of content areas, profile visitors, and evaluate sequences of activities. Predictive Web Analytics thus delivers far more than plain web statistics, which report the number of users, visits, page impressions, top pages, top referrers, and errors.</p>
<p><strong>Web mining with a predictive analytics modeler</strong></p>
<p>Web mining, which covers all of the analysis levels mentioned, belongs to the new generation of Predictive Analytics Software (PASW). The web server supplies the raw data from web logs and other information sources on the website, from which one can derive the patterns in which pages are visited, where problems arise, what interests users, what they do, and what they are likely to do in the future. Based on these insights, users can be segmented on the one hand; on the other, the offering can be aligned with customer needs and the website adapted visually and in content in real time, for example by integrating purchase incentives or help. Under optimal conditions, even anonymous users can be assigned to a behavioral segment after a few clicks. If users log in, identification is even easier. The company can then address them in a personalized way in the future, based on their web behavior.</p>
<p><strong>From log lines to relevant information</strong></p>
<p>The shift from a technology-focused picture based on clickstreams to a visitor-centered view of the website raises different questions: What did the visitor see? What did the visitor do? Section streams are used to analyze the areas visited within the visit sequence. Each «event» in the event sequence (event stream) supplies precise information for this. One challenge, both technical and content-related, lies in identifying relevant events or activities in the flood of more or less redundant log content, possibly from several sources. The actionable website metrics can be defined on the basis of the business model. The actual analyses are based on events (orders, recommendations, quote requests, and so on). As a rule, the goal is lead qualification across the customer life cycle.</p>
<p><strong>Automated product recommendations</strong></p>
<p>Professional websites guide their users in selecting or purchasing products. This requires results from recommender systems or from market basket analyses, which answer questions such as: Which products or topic areas are purchased together, or associated with one another, more often than average? Which products are recommended by people with similar interests? Both approaches yield personalized (purchase) recommendations for the next best buy, up to websites whose content is completely individualized. Individualized recommendations are based on the users&#8217; personal preferences, on the behavior of similar users (collaborative filtering), or on object properties (content filtering).</p>
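<p>The market basket question (&#8220;which products are purchased together more often than average?&#8221;) is commonly answered with the lift measure. The following is only a minimal sketch with invented baskets; the function name <code>pair_lift</code> is hypothetical and nothing here reflects how PASW computes its results.</p>

```python
from collections import Counter
from itertools import combinations

def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair bought together at least once."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates, fix pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    lifts = {}
    for (a, b), together in pair_counts.items():
        # lift = P(a and b) / (P(a) * P(b)); a value above 1 means the
        # pair occurs together more often than independence would predict
        p_a = item_counts[a] / n
        p_b = item_counts[b] / n
        lifts[(a, b)] = (together / n) / (p_a * p_b)
    return lifts

# Hypothetical baskets, for illustration only.
baskets = [
    ["camera", "memory card"],
    ["camera", "memory card", "tripod"],
    ["tripod"],
    ["memory card", "camera"],
]
lifts = pair_lift(baskets)
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```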
<p>A hybrid approach, content-boosted collaborative filtering, overcomes the drawbacks of the individual approaches. In this combined procedure, sparsely populated user-object rating matrices are filled in using content-based filtering of the existing preference profiles. To do this, the recommender system evaluates a user&#8217;s preference profile and, on that basis, produces estimated ratings for objects not yet seen. Where the user has already rated an object, the actual rating replaces the estimate. This is done for all users. Collaborative filtering then follows to identify neighboring, similarly inclined users. Predictive analytics can be used to divide customers into clusters of interests and needs, that is, customer groups that are homogeneous internally and distinguishable from one another. Next-best-buy recommendations here come, for example, from (segment-specific) market basket analyses. The recommendations generated from streams by the PASW Modeler are likewise available to the user in real time.</p>
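<p>The two-stage procedure can be sketched roughly as follows. This is a simplified illustration, not the PASW implementation: the content-based estimate uses Jaccard similarity between item feature sets and the collaborative step uses cosine similarity between users, both chosen here for brevity. All names (<code>content_estimate</code>, <code>fill_matrix</code>, <code>predict</code>) and the sample data are assumptions, and every user is assumed to have rated at least one item.</p>

```python
import math

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_estimate(user_ratings, item, features):
    """Stage 1: estimate a rating for an unseen item from the user's own
    ratings, weighted by content similarity to the items already rated."""
    num = den = 0.0
    for rated_item, rating in user_ratings.items():
        sim = jaccard(features[item], features[rated_item])
        num += sim * rating
        den += sim
    if den == 0.0:  # no content overlap: fall back to the user's mean rating
        return sum(user_ratings.values()) / len(user_ratings)
    return num / den

def fill_matrix(ratings, features):
    """Dense pseudo-rating matrix: real ratings where present, estimates elsewhere."""
    dense = {}
    for user, user_ratings in ratings.items():
        dense[user] = {
            item: user_ratings[item] if item in user_ratings
            else content_estimate(user_ratings, item, features)
            for item in features
        }
    return dense

def cosine(u, v, items):
    dot = sum(u[i] * v[i] for i in items)
    norm = math.sqrt(sum(u[i] ** 2 for i in items)) * math.sqrt(sum(v[i] ** 2 for i in items))
    return dot / norm if norm else 0.0

def predict(dense, user, item):
    """Stage 2: collaborative filtering on the dense matrix, as a
    similarity-weighted average over all other users."""
    items = list(dense[user])
    num = den = 0.0
    for other, row in dense.items():
        if other == user:
            continue
        sim = cosine(dense[user], row, items)
        num += sim * row[item]
        den += sim
    return num / den if den else dense[user][item]

# Hypothetical catalogue and sparse ratings (1 to 5).
features = {
    "cam_a": {"camera", "compact"},
    "cam_b": {"camera", "dslr"},
    "tripod": {"accessory"},
}
ratings = {
    "ann": {"cam_a": 5, "tripod": 2},
    "bob": {"cam_a": 4, "cam_b": 5},
    "cid": {"cam_b": 1, "tripod": 5},
}
dense = fill_matrix(ratings, features)
print(round(predict(dense, "ann", "cam_b"), 2))
```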
<p><strong>Conclusion</strong></p>
<p>Alongside standard web statistics tools such as Google Analytics, Webtrends, and others, where click rates and page views are what count, web mining is an indispensable method for gaining content-related information, and a reliable decision aid for optimizing electronic business relationships.</p>
<p><strong>Source:</strong></p>
<p>* Dr. Christiane Okonek is Head of Data Intelligence at</p>
<p>rbc Solutions AG, Meilen.</p>
<p>Contact: 044 925 36 36, e-mail: christiane.okonek@rbc.ch</p>
<p>* Daniel Huber is Head of Business Development at</p>
<p>rbc Solutions AG, Meilen. E-mail: daniel.huber@rbc.ch</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Classification and clustering [part one]]]></title>
<link>http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/</link>
<pubDate>Sun, 17 May 2009 14:15:36 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/</guid>
<description><![CDATA[Web content mining is an interesting and wide domain. Almost everyone can modify one or several modu]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Web content mining is an interesting and wide domain. Almost anyone can modify one or several modules of a simple web mining system (like the one we&#8217;re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):<br />
- the link extractor module<br />
- the document clustering/classification module</p>
<p>The most used techniques for link extraction (the crawling algorithm) are: Breadth-First, Best-First, PageRank, Shark-Search, and InfoSpiders. In the simple focused web spider that I&#8217;m building I am using the Best-First technique: best links that match the thesaurus are added to the URL Queue.</p>
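<p>To give an idea of what Best-First means in practice, here is a small sketch, written in Python for brevity rather than the Java used for the crawler itself. The frontier is a priority queue ordered by how many thesaurus keywords appear in a link&#8217;s anchor text. The class name <code>BestFirstFrontier</code>, the scoring rule, and the example URLs are my own assumptions, not the crawler&#8217;s actual code.</p>

```python
import heapq
import itertools

class BestFirstFrontier:
    """URL frontier that always hands out the best-scoring link first."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def score(self, anchor_text):
        """Number of thesaurus keywords that occur in the anchor text."""
        text = anchor_text.lower()
        return sum(1 for k in self.keywords if k in text)

    def add(self, url, anchor_text):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first
        heapq.heappush(self._heap, (-self.score(anchor_text), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

frontier = BestFirstFrontier(["crawler", "web mining"])
frontier.add("http://example.com/a", "Cooking recipes")
frontier.add("http://example.com/b", "A focused crawler for web mining")
frontier.add("http://example.com/c", "Crawler basics")
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```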
<p>The second module that is often improved in a web mining system is the <strong>document</strong> <span style="text-decoration:underline;">clustering/classification</span> module. Let&#8217;s point out the main differences between <a href="http://en.wikipedia.org/wiki/Cluster_analysis"><strong>clustering</strong></a> and <a href="http://en.wikipedia.org/wiki/Document_classification"><strong>classification</strong></a>.<br />
<!--more--><br />
<strong>Cluster analysis</strong> or <strong>clustering</strong> is the assignment of objects into groups (called <span style="text-decoration:underline;">clusters</span>) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure.</p>
<p>Document <strong>classification/categorization</strong> is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: <strong>supervised</strong> document classification where some external mechanism (such as <span style="text-decoration:underline;">human</span> feedback) provides information on the <span style="text-decoration:underline;">correct classification</span> for documents, and <strong>unsupervised</strong> document classification, where the classification must be done entirely without reference to external information.</p>
<p><big><strong>Clustering</strong></big></p>
<ul>
<li>Unsupervised</li>
<li>Input
<ul>
<li>Clustering algorithm</li>
<li>Similarity measure</li>
<li>Number of clusters</li>
</ul>
</li>
<li>No specific information for each document</li>
</ul>
<p><big><strong>Classification</strong></big></p>
<ul>
<li>Supervised</li>
<li>Each document is labeled with a class</li>
<li>Build a classifier that assigns documents to one of the classes</li>
</ul>
<p><strong>Classification</strong>: the task is to learn to assign instances to predefined classes.<br />
<strong>Clustering</strong>: no predefined classification is required. The task is to learn a classification from the data.<br />
Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other: they share certain properties.</p>
<p><span style="text-decoration:underline;">Supervised learning</span>: classification requires supervised learning, i.e., the training data has to specify what we are trying to learn (<strong>the classes</strong>).<br />
<span style="text-decoration:underline;">Unsupervised learning</span>: clustering is an unsupervised task, i.e., the training data doesn’t specify what we are trying to learn (<strong>the clusters</strong>).</p>
<p>One of the most widely used and best-known clustering algorithms is <strong><a href="http://en.wikipedia.org/wiki/K-means_algorithm">K-means</a></strong>.</p>
<div id="attachment_311" class="wp-caption aligncenter" style="width: 173px"><a href="http://teofilachirei.wordpress.com/files/2009/05/k-means.png"><img class="size-full wp-image-311" title="k-means clustering example" src="http://teofilachirei.wordpress.com/files/2009/05/k-means.png" alt="k-means clustering example" width="163" height="130" /></a><p class="wp-caption-text">k-means clustering example</p></div>
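<p>A bare-bones k-means on 2-D points might look like the sketch below (Python 3.8+ for <code>math.dist</code>). To keep the run repeatable it simply takes the first <code>k</code> points as initial centroids, which is a simplification; practical implementations usually choose them at random or with a smarter seeding scheme. The sample points are invented.</p>

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes."""
    centroids = [points[i] for i in range(k)]  # simplistic initialization
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

<p>Note that the number of clusters is an input, exactly as listed above: the algorithm discovers the groups, but not how many there should be.</p>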
<p>Well-known document classification techniques are the <strong><a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes classifier</a></strong> and <strong><a title="K-nearest neighbor" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">kNN (K-nearest neighbor)</a></strong>.</p>
<div id="attachment_315" class="wp-caption aligncenter" style="width: 261px"><a href="http://teofilachirei.wordpress.com/files/2009/05/knn.png"><img class="size-full wp-image-315" title="kNN: k-nearest neighbor" src="http://teofilachirei.wordpress.com/files/2009/05/knn.png" alt="kNN: k-nearest neighbor" width="251" height="217" /></a><p class="wp-caption-text">kNN: k-nearest neighbor</p></div>
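<p>For comparison, here is a minimal kNN document classifier. It represents each document as a bag of words, ranks the labeled documents by cosine similarity to the query, and takes a majority vote among the top <code>k</code>. The training sentences and labels are invented, and whitespace tokenization is a deliberate simplification.</p>

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words vector using naive whitespace tokenization."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, labeled_docs, k=3):
    """Label text by majority vote among its k most similar labeled documents."""
    query = vectorize(text)
    ranked = sorted(labeled_docs, key=lambda d: cosine(query, vectorize(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled documents: this is the supervision that clustering lacks.
training = [
    ("the crawler downloads web pages", "crawling"),
    ("the crawler follows links between pages", "crawling"),
    ("a queue stores links for the crawler", "crawling"),
    ("k-means groups documents into clusters", "clustering"),
    ("clusters contain similar documents", "clustering"),
    ("similar documents share a cluster centroid", "clustering"),
]
print(knn_classify("which clusters do these documents belong to", training))
```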
<p>So, to sum up, let&#8217;s point out the following ideas:</p>
<ul>
<li>classification and clustering both work with documents</li>
<li>with classification we try to classify every document by well-established criteria, and we (humans) often supervise the process</li>
<li>with clustering we let the computer determine (discover) the main classes in a set of documents and partition the documents among these classes</li>
</ul>
<p>I hope this article has given you a good idea of the difference between clustering and classification. You can also use the following resources to find out more:</p>
<ul>
<li><a href="http://homepages.inf.ed.ac.uk/keller/teaching/connectionism/lecture13_4up.pdf">[PDF] Clustering and Classification</a></li>
<li><a href="http://www.google.com/search?q=clustering+vs+classification">Google: clustering vs classification</a></li>
</ul>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Thesaurus]]></title>
<link>http://teofilachirei.wordpress.com/2009/05/11/the-thesaurus/</link>
<pubDate>Mon, 11 May 2009 11:00:05 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/05/11/the-thesaurus/</guid>
<description><![CDATA[It&#8217;s time to make our focused web crawler aware about it&#8217;s topic: the thesaurus. The sim]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>It&#8217;s time to make our <strong>focused web crawler</strong> aware of its <strong>topic</strong>: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we&#8217;re going to implement it. In the future we could improve it with a hierarchical thesaurus: a tree or even a graph. Take a look at <a href="http://www.acm.org/about/class/ccs98-html">The 1998 ACM Computing Classification System</a> for an example of a hierarchical thesaurus.</p>
<p>Now let&#8217;s implement our flat thesaurus:</p>
<pre class="brush: java;">
package ro.teo.ssc.thesaurus;
import java.util.List;

public interface IThesaurus {
	void addKeyword(String keyWord);
	List&lt;String&gt; getKeywords();
}
</pre>
<pre class="brush: java;">
package ro.teo.ssc.thesaurus;

import java.util.ArrayList;
import java.util.List;

public class SimpleThesaurus implements IThesaurus{
	private static class SingletonHolder {
	     private static final SimpleThesaurus
	     		INSTANCE=new SimpleThesaurus();
	}

	public static synchronized SimpleThesaurus getInstance(){
		return SingletonHolder.INSTANCE;
	}

	private List&lt;String&gt; _keywords;

	private SimpleThesaurus(){
		_keywords=new ArrayList&lt;String&gt;();
	}

	@Override
	public void addKeyword(String keyWord) {
		//ignore null, empty and duplicate keywords
		if (keyWord!=null &amp;&amp; keyWord.length()&gt;0
				&amp;&amp; !_keywords.contains(keyWord)){
			_keywords.add(keyWord);
		}
	}

	@Override
	public List&lt;String&gt; getKeywords() {
		//we don't want the list to be modified by
		//mistake, so we return a copy of it
		return new ArrayList&lt;String&gt;(_keywords);
	}

}
</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Link]]></title>
<link>http://teofilachirei.wordpress.com/2009/05/11/the-link/</link>
<pubDate>Mon, 11 May 2009 09:46:11 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/05/11/the-link/</guid>
<description><![CDATA[Let&#8217;s come back to our simple focused crawler. It&#8217;s time to start filtering the links we]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Let&#8217;s come back to our simple focused crawler. It&#8217;s time to start filtering the links we extract and the pages we download.<br />
We need
<ol>
<li>to know more info about the links  &#8211; the <strong>Link</strong> class</li>
<li>to know our topic &#8211; the <strong>Thesaurus</strong> class</li>
<li>filter extracted links</li>
<li>filter downloaded pages</li>
</ol>
<p>The first step is to define the <strong>Link</strong></p>
<pre class="brush: java;">
package ro.teo.ssc.obj;

public class Link {
	private static Link _empty;
	public static Link Empty(){
		if (_empty==null){
			_empty=new Link();
		}
		return _empty;
	}

	/**
	 * The URL to which the link is pointing.
	 */
	private String href;
	/**
	 * The 'title' attribute of the link.
	 */
	private String title;
	/**
	 * Text marked as link.
	 * This is usually the visible text that the user clicks on,
	 * Or the 'alt' attribute for the image marked as link.
	 */
	private String innerText;

	/**
	 * Used for ordering links from a web page.
	 * This has nothing to do with ranking the page
	 * like PageRank or HITS algorithms.
	 */
	private int rank;

	public Link(){
		href=&quot;&quot;;
		title=&quot;&quot;;
		innerText=&quot;&quot;;
		rank=0;
	}

	public Link(String href, String title, String innerHTML) {
		this();
		if (href!=null)
			this.href = href;
		if (title!=null)
			this.title = title;
		if (innerHTML!=null)
			this.innerText = innerHTML;
		rank=0;
	}

	public String getHref() {
		return href;
	}

	public void setHref(String href) {
		this.href = href;
	}

	public String getTitle() {
		return title;
	}

	public void setTitle(String title) {
		this.title = title;
	}

	public String getInnerHTML() {
		return innerText;
	}

	public void setInnerHTML(String innerHTML) {
		this.innerText = innerHTML;
	}

	public int getRank() {
		return rank;
	}

	public void setRank(int rank) {
		this.rank = rank;
	}

	@Override
	public String toString() {
		String s=&quot; href(&quot;;

		if (href.length()&gt;0) {s+=href;}
		s+=&quot;) &quot;;

		s+=&quot; title(&quot;;
		if (title.length()&gt;0) {s+=title;}
		s+=&quot;)&quot;;

		s+=&quot; innerHTML(&quot;;
		if (innerText.length()&gt;0) {s+=innerText;}
		s+=&quot;) &quot;;

		return s;
	}

}
</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Web Data Mining Company Beats CDC to a Swine Flu Alert]]></title>
<link>http://npharder.wordpress.com/2009/05/01/web-data-mining-company-beats-cdc-to-a-swine-flu-alert/</link>
<pubDate>Fri, 01 May 2009 15:52:50 +0000</pubDate>
<dc:creator>Ken Ellis</dc:creator>
<guid>http://npharder.wordpress.com/2009/05/01/web-data-mining-company-beats-cdc-to-a-swine-flu-alert/</guid>
<description><![CDATA[Some reports (McClatchy, Washington Technology, Wired)  indicate that Veratect, a web data mining co]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Some reports (<a href="http://www.mcclatchydc.com/staff/les_blumenthal/story/67283.html">McClatchy</a>, <a href="http://washingtontechnology.com/Articles/2009/04/27/Private-firm-provides-swine-flu-updates-to-health-agencies.aspx">Washington Technology</a>, <a href="http://www.wired.com/wiredscience/2009/04/swinefluchatter/">Wired</a>)  indicate that <a href="http://www.veratect.com">Veratect</a>, a web data mining company, <a href="http://www.mcclatchydc.com/staff/les_blumenthal/story/67283.html">beat the CDC by 18 days</a> with an alert on Swine Flu.  Nice job!</p>
<p>Of course, it&#8217;s not all that simple.  Veratect has a lower cost for false positives than the CDC, so one would expect them to send out earlier, more frequent alerts.  In the IR business it&#8217;s what some would call high recall, low precision.  We also don&#8217;t know how many false positives Veratect has had.  They also had a team of 35 analysts checking the data.  So don&#8217;t be too hard on the CDC.  Nevertheless, it&#8217;s clear that the Web is a great resource for mining information on potential health threats, and that it has a place in early warning systems.  And for such systems, I would imagine that there is value in biasing towards rejecting a null hypothesis.</p>
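<p>For readers outside IR, here is what &#8220;high recall, low precision&#8221; means in numbers. The counts below are entirely made up to illustrate the trade-off; they say nothing about Veratect&#8217;s or the CDC&#8217;s actual track record.</p>

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of alerts that were real.
    Recall: share of real outbreaks that triggered an alert."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical scenario with 10 real outbreaks.
# An eager alerter raises 50 alerts and catches 9 of the 10.
eager = precision_recall(true_positives=9, false_positives=41, false_negatives=1)
# A cautious alerter raises 6 alerts and catches 5 of the 10.
cautious = precision_recall(true_positives=5, false_positives=1, false_negatives=5)

print("eager:    precision %.2f, recall %.2f" % eager)
print("cautious: precision %.2f, recall %.2f" % cautious)
```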
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Simple Serial Focused Web Crawler 10]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/31/a-simple-serial-focused-web-crawler-10/</link>
<pubDate>Tue, 31 Mar 2009 13:06:47 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/31/a-simple-serial-focused-web-crawler-10/</guid>
<description><![CDATA[Let&#8217;s see what we&#8217;ve covered so far: as long as there are addresses in URL Queue, repeat]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Let&#8217;s see what we&#8217;ve covered so far:</p>
<pre style="font-size:9pt;">
as long as there are addresses in URL Queue, repeat:
    - get the first address from the queue
    - download the page from that address and store
       it in a temp location<span style="text-decoration:line-through;">
    - check if that page is relevant to the topic (using
      thesaurus)
    - if the page is irrelevant
        remove it from temporary folder
    - if the page is relevant to the topic</span><font color="red">
        extract <span style="text-decoration:line-through;">some</span> links/URLs from that page</font>
    - if the URL queue is not full
        <span style="text-decoration:line-through;">and those  links are not visited</span>
            insert extracted URLs into the queue<span style="text-decoration:line-through;">
        extract text content from that page
        save the content in a permanent storage location
    - move page from the New URL Queue to Visited URL Queue</span>
</pre>
<p>How about extracting only <b>some</b> links from web pages the crawler downloads?</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Errata - New URLs Queue]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/31/errata-new-urls-queue/</link>
<pubDate>Tue, 31 Mar 2009 12:13:13 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/31/errata-new-urls-queue/</guid>
<description><![CDATA[If you&#8217;ve heard about Agile Programming, Extreme Programming or you&#8217;ve been working on s]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p> If you&#8217;ve heard about <strong>Agile Programming</strong>, <strong>Extreme Programming</strong> or you&#8217;ve been working on some real projects, you know that specifications change over time.</p>
<p>When I first thought about <strong>IURLQueue</strong>, I didn&#8217;t think I would need to know the actual number of items in the queue.<br />
When I actually needed it, I modified the code:</p>
<pre class="brush: java;">
package ro.teo.ssc.urlfrontier;

public interface INewUrlQueue {
	public void enqueue(String address);
	public String dequeue();
	public boolean isEmpty();
	public boolean isFull();

	public int size();
}
</pre>
<p>&#160;</p>
<pre class="brush: java;">
package ro.teo.ssc.urlfrontier;
import java.util.ArrayDeque;

public class NewUrlQueue implements INewUrlQueue {
	private static class SingletonHolder {
	     private static final NewUrlQueue
	     		INSTANCE = new NewUrlQueue();
   }
	public static synchronized NewUrlQueue getInstance(){
		return SingletonHolder.INSTANCE;
	}

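	private static final int MAX_SIZE=2048;
	private ArrayDeque&lt;String&gt; q;

	private NewUrlQueue(){
		q=new ArrayDeque&lt;String&gt;();
	}

	@Override
	public String dequeue() {
		String s=&quot;&quot;;
		if (!q.isEmpty())
			s=q.remove();
		return s;
	}

	@Override
	public void enqueue(String address) {
		//silently drop the address when the queue is full
		if (q.size()&lt;MAX_SIZE)
			q.add(address);
	}

	@Override
	public boolean isEmpty() {
		return q.isEmpty();
	}

	@Override
	public boolean isFull() {
		return (q.size()&gt;=MAX_SIZE);
	}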

	@Override
	public int size(){
		return q.size();
	}
}
</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Simple Serial Focused Web Crawler 4]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-4/</link>
<pubDate>Tue, 17 Mar 2009 17:42:14 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-4/</guid>
<description><![CDATA[How the crawler works This pseudo algorithm shows how the crawler will work: as long as there are ad]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><b>How the crawler works</b></p>
<p>This pseudo algorithm shows how the crawler will work:</p>
<pre style="font-size:9pt;">

as long as there are addresses in URL Queue, repeat:
    - get the first address from the queue
    - download the page from that address and store
       it in a temp location
    - check if that page is relevant to the topic (using
      thesaurus)
    - if the page is irrelevant
        * remove it from temporary folder
    - if the page is relevant to the topic
        * extract some links/URLs from that page
        * if the URL queue is not full and those
           links are not visited
              # insert extracted URLs into the queue
        * extract text content from that page
        * save the content in a permanent storage location
    - move page from the New URL Queue to Visited URL Queue
</pre>
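<p>To make the control flow easier to follow, the pseudo algorithm above can be sketched as a short runnable loop. This is written in Python for compactness, not in the Java used for the actual crawler, and it makes several simplifying assumptions: pages are kept in memory instead of a temporary folder, relevance is a plain keyword test, links are found with a regular expression, and <code>fetch</code> is a hypothetical function passed in by the caller (here a dictionary lookup standing in for the network).</p>

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')
TAG_RE = re.compile(r"<[^>]+>")

def is_relevant(page, keywords):
    """Topic check: does the page mention any thesaurus keyword?"""
    text = page.lower()
    return any(k in text for k in keywords)

def crawl(seeds, fetch, keywords, max_queue=2048):
    queue = deque(seeds)   # New URL Queue
    visited = set()        # Visited URLs
    stored = {}            # stands in for permanent storage
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        page = fetch(url)  # download the page
        if page is not None and is_relevant(page, keywords):
            for link in LINK_RE.findall(page):
                # only enqueue while there is room and the link is new
                if len(queue) < max_queue and link not in visited:
                    queue.append(link)
            stored[url] = TAG_RE.sub(" ", page).strip()  # keep text only
        # irrelevant pages are simply dropped
        visited.add(url)
    return stored, visited

# A tiny in-memory "web" so the sketch runs without network access.
fake_web = {
    "http://a.example/": '<p>web crawler intro</p><a href="http://b.example/">b</a><a href="http://c.example/">c</a>',
    "http://b.example/": '<p>cooking recipes</p><a href="http://d.example/">d</a>',
    "http://c.example/": '<p>focused crawler</p><a href="http://a.example/">back</a>',
}
stored, visited = crawl(["http://a.example/"], fake_web.get, ["crawler"])
print(sorted(stored))
```

<p>The irrelevant page at <code>b.example</code> is visited but not stored, and its outgoing link is never followed, which is the whole point of a focused crawler.</p>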
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Simple Serial Focused Web Crawler 3]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-3-initial-setup/</link>
<pubDate>Tue, 17 Mar 2009 17:35:00 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-3-initial-setup/</guid>
<description><![CDATA[Initial Setup 1) First we should define our topic. 2) Then we should define a thesaurus for our topi]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><strong>Initial Setup</strong></p>
<p>1) First we should define our topic.<br />
2) Then we should define a thesaurus for our topic. The simplest way is a list of keywords.<br />
3) Then we need an initial URL seed.<br />
If we don&#8217;t know where to start from, the simplest way is to get some links from Google, Yahoo or Wikipedia by searching each term and extracting the addresses. These addresses will be the first URLs in the URL Queue.</p>
<p><u>Note</u>: Because this project aims to be a very simple, personal web crawler, most of the settings will be hardcoded.<br />
I will not use configuration/properties files or a database for settings and URLs.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Simple Serial Focused Web Crawler 2: modules]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-2-modules/</link>
<pubDate>Tue, 17 Mar 2009 17:31:59 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-2-modules/</guid>
<description><![CDATA[Modules composing the simple focused web crawler: New URLs Queue a queue of the web addresses that w]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Modules composing the simple focused web crawler:</p>
<ul>
<li><b>New URLs Queue</b>
<ul>
<li> a queue of the web addresses that will be crawled</li>
</ul>
</li>
<li><b>Visited URLs List</b>
<ul>
<li> a list of the URLs that have already been visited</li>
</ul>
</li>
<li><b>Downloader</b>
<ul>
<li> downloads pages from the internet</li>
<li> only text pages (text/plain, text/html, etc.)</li>
</ul>
</li>
<li><b>Topic checker</b>
<ul>
<li> checks if a downloaded page is relevant to a pre-defined topic </li>
</ul>
</li>
<li><b>Link extractor</b>
<ul>
<li> extracts links from downloaded HTML pages</li>
</ul>
</li>
<li><b>Content extractor</b>
<ul>
<li> extracts pure text content from downloaded HTML pages</li>
<li> no tags, just text</li>
</ul>
</li>
</ul>
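<p>As a rough illustration, three of the modules listed above (topic checker, link extractor and content extractor) could be sketched in Java as below. The class name <code>CrawlerModules</code>, the method names, the <code>minHits</code> threshold and the regular expressions are all assumptions for this sketch, not the implementation from this series. Regular expressions stand in for a real HTML parser and will miss unusual markup; relative links are returned as found, without being resolved.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: names and regular expressions are assumptions.
final class CrawlerModules {
    // href value of an anchor tag, in single or double quotes.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    // Whole script and style elements, including their bodies.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    private CrawlerModules() { }

    /** Topic checker: true if at least minHits keywords occur in the text. */
    static boolean isRelevant(String text, List<String> keywords, int minHits) {
        String lower = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits >= minHits;
    }

    /** Link extractor: returns href values in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }

    /** Content extractor: drops scripts, styles and tags, then collapses whitespace. */
    static String extractText(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

<p>The downloader and the two URL collections are left out here because they need network access and state. Each method takes plain strings, so the modules can be tried on a saved page before any crawling is done.</p>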
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Simple Serial Focused Web Crawler 1]]></title>
<link>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-1/</link>
<pubDate>Tue, 17 Mar 2009 15:02:13 +0000</pubDate>
<dc:creator>Teofil Achirei</dc:creator>
<guid>http://teofilachirei.wordpress.com/2009/03/17/a-simple-serial-focused-web-crawler-1/</guid>
<description><![CDATA[Starting with this post I&#8217;ll publish a simple tutorial and java code for building a simple ser]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Starting with this post I&#8217;ll publish a simple tutorial and Java code for building a <strong>simple serial focused web crawler</strong>.</p>
<p><strong>Defining the project</strong></p>
<ul>
<li><u>Web Crawler</u><br />
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.<br />
- <a href="http://en.wikipedia.org/wiki/Web_Crawler">web crawler on Wikipedia</a>
</li>
<li><u>Focused Web Crawler</u><br />
A focused crawler or topical crawler is a web crawler that attempts to download only web pages that are relevant to a pre-defined topic or set of topics.<br />
- <a href="http://en.wikipedia.org/wiki/Focused_crawler">focused crawler on Wikipedia</a>
</li>
<li><u>Serial Crawler</u><br />
A serial crawler downloads pages sequentially, one at a time, rather than in parallel.
</li>
<li><u>Simple</u><br />
I&#8217;ll use the simplest approach possible. The purpose of these posts (&#8220;A Simple Serial Focused Web Crawler&#8221;) is to create a functional focused web crawler in a simple way. The code must be easy to understand, modify and reuse.
</li>
</ul>
<p>In the next post, <strong>&#8220;A Simple Serial Focused Web Crawler 2&#8221;</strong>, I&#8217;ll present the simple architecture of this focused web crawler.</p>
<p><strong>IMPORTANT:</strong><br />
&#8220;While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.&#8221;  [Shkapenyuk, V. and Suel, T. (2002)]<br />
The purpose of these posts (regarding <strong>A Simple Serial Focused Web Crawler</strong>) is to present and implement a very simple yet functional approach to a web spider. The result, a minimal focused web spider, could be used as a starting point for bigger projects or as a home crawler/documentation downloader.</p>
</div>]]></content:encoded>
</item>

</channel>
</rss>
