<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>datamining &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/datamining/</link>
	<description>Feed of posts on WordPress.com tagged "datamining"</description>
	<pubDate>Fri, 27 Nov 2009 13:43:23 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[smarter than the average bear]]></title>
<link>http://armorydatamine.wordpress.com/2009/11/23/smarter-than-the-average-bear/</link>
<pubDate>Mon, 23 Nov 2009 02:49:58 +0000</pubDate>
<dc:creator>zardoz</dc:creator>
<guid>http://armorydatamine.wordpress.com/2009/11/23/smarter-than-the-average-bear/</guid>
<description><![CDATA[Well only just&#8230; I&#8217;ve got some results from my attempt to divide up the feral druid popul]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Well only just&#8230; I&#8217;ve got some results from my attempt to divide up the feral druid population into cats and bears. We started from the fact that there is no &#8220;form&#8221; tag in the armoury XML &#8211; no direct way to count the thing we want to count. The only way to get an insight into this is to find a proxy for each of the forms &#8211; something that is in the data which can be used to separate the sheep from the goats, if you&#8217;ll pardon the mixed metaphor.</p>
<p>Talents seem to be the obvious choice, so long as there is one talent that bears will take and cats not and another talent that is vice versa. Glyphs are the other possibility. Whatever we choose just has to be i) something that players are highly likely to take  and ii) something that is <a href="http://en.wikipedia.org/wiki/Orthogonality" target="_blank">orthogonal</a>; something that definitely points in one direction for bears and another for cats.</p>
<p>But the basic problem is that there are a lot of um&#8230; how to put this politely&#8230; there are a lot of <a href="http://www.youtube.com/watch?v=iZNCnfS5SeY" target="_blank">left-of-centre</a> specs out there. Talents and glyphs are both less orthogonal than I was hoping for &#8211; many specs look a bit bearish and a bit cattish at the same time. And there is a big group that takes none of the talents or glyphs that we want to use.</p>
<p>That&#8217;s why I decided not to make the queries very complex &#8211; adding more talents or glyphs into the selection criteria  just increases the number of toons that fall into the grey area. Also I&#8217;ve counted specs and not toons since the original question was related to the number of druids specced for tanking.</p>
<p>Thanks to the commenters who made suggestions on possible talents and glyphs that might fit these criteria. I&#8217;ve run two queries against the data.</p>
<p>The first query counts feral druids who have Natural Reaction versus those who have Predatory Instincts. A druid with some points in Natural Reaction and none at all in Predatory Instincts might be a bear; t&#8217;other way round for cats. Those with points in neither are marked as &#8220;unknown&#8221;; those with some points in both are the &#8220;could be either&#8221; group.</p>
<p>The second query counts druids who have a Glyph of Maul versus those who have either a Glyph of Shred and/or a Glyph of Rip. Equipping Maul but not Shred or Rip indicates bear; Shred or Rip but no Maul indicates cat. Again we have groups with a mix of these glyphs, and, unfortunately, a huge group with none of them.</p>
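<p>As a rough illustration of the two rules just described, here is a minimal Python sketch. The function names and the dictionary and set inputs are assumptions made for this example; it does not reproduce the actual queries or the layout of the scanned armoury data.</p>
<pre>
```python
def label(bearish, cattish):
    """Turn two yes/no proxy signals into one of the four groups."""
    if bearish and cattish:
        return "could be either"
    if bearish:
        return "bear"
    if cattish:
        return "cat"
    return "unknown"


def classify_by_talents(points):
    """points: hypothetical dict of talent name to points spent in one spec."""
    return label(
        points.get("Natural Reaction", 0) != 0,
        points.get("Predatory Instincts", 0) != 0,
    )


def classify_by_glyphs(glyphs):
    """glyphs: hypothetical set of glyph names equipped in one spec."""
    return label(
        "Glyph of Maul" in glyphs,
        "Glyph of Shred" in glyphs or "Glyph of Rip" in glyphs,
    )


# Three made-up specs, only to show the two rules side by side.
specs = [
    ({"Natural Reaction": 3}, {"Glyph of Maul"}),
    ({"Predatory Instincts": 3}, {"Glyph of Shred", "Glyph of Rip"}),
    ({"Natural Reaction": 1, "Predatory Instincts": 2}, set()),
]
for points, glyphs in specs:
    print(classify_by_talents(points), "/", classify_by_glyphs(glyphs))
```
</pre>
<p>The third made-up spec shows why the two counts can disagree: it lands in the &#8220;could be either&#8221; group on talents but in &#8220;unknown&#8221; on glyphs.</p>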
<p>Anyway this is what we&#8217;ve got:</p>
<p><strong>(Patch 3.2.2 data; sample size 16327 level 80 feral druids with 28970 specs).</strong></p>
<h4>Talent-based spec count:</h4>
<ul>
<li>Bear: 30%</li>
<li>Cat: 33%</li>
<li>Could be either: 5%</li>
<li>Unknown: 31%</li>
</ul>
<h4>Glyph-based spec count:</h4>
<ul>
<li>Bear: 18%</li>
<li>Cat: 9%</li>
<li>Could be either: 15%</li>
<li>Unknown: 58%</li>
</ul>
<p>Frankly I&#8217;m still not sure how valid these numbers are, but I hope they provide a bit of insight. The talent-based count may at least provide a low-water-mark indication of the number of bearish specs in there.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[An introduction to Casual Data, and how it's changing everything.]]></title>
<link>http://laurenserota.com/2009/11/14/an-introduction-to-casual-data-and-how-its-changing-everything/</link>
<pubDate>Sat, 14 Nov 2009 05:41:51 +0000</pubDate>
<dc:creator>serota</dc:creator>
<guid>http://laurenserota.com/2009/11/14/an-introduction-to-casual-data-and-how-its-changing-everything/</guid>
<description><![CDATA[About a month ago, Dan Rockwell and I finished writing an article for interactions magazine about Ca]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>About a month ago, <a href="http://doodleporn.tumblr.com/">Dan Rockwell</a> and I finished writing an article for <a href="http://interactions.acm.org/">interactions magazine</a> about Casual Data, the term we&#8217;ve used to describe rich data propagated or mined via some form of social media. The piece defines Casual Data, talks briefly about why it&#8217;s becoming so prevalent, and then identifies current ways it&#8217;s being used and what that means for the fields of design and research. It will be out in a spring 2010 issue of interactions; in the meantime, here&#8217;s a sneak peek at some of the data nugget goodness:</p>
<p><em><strong><em>The problem with too much data</em></strong><br />
While there are a number of firms analyzing the surface value of casual data, there is a need to dig deeper to understand context and higher-level implications. The more connected we become, the more connected our data becomes, and the more we need a structured approach for making sense of it.</em></p>
<p><em>That companies have loads of customer data available is not news; however, this casual data is not quantitative in nature (demographics, pattern-focused). The emotional meaning behind casual data should not be analyzed statistically, and the methods used to gain this data are as important to understand as the data itself. If customer voice is only harvested through an existing medium (e.g. submitting a query for iPhone-related tweets), the results you get will be brief and will tend to express either intense glee (&#8220;new iPhone copy/paste function, thank GOD&#8221;) or intense distaste (&#8220;Apple sucks!&#8221;), leaving little room for understanding context of use while still providing good touch-points for product improvement. Casual data has the potential to be more dangerous than helpful if not properly understood.</em></p>
<div id="n_o9"><a href="http://serota.wordpress.com/files/2009/11/causaldata.jpg"><img class="size-full wp-image-170 alignnone" title="causaldata" src="http://serota.wordpress.com/files/2009/11/causaldata.jpg" alt="what to do with casual data" width="504" height="403" /></a></div>
<div><em><em><strong>Ok, so what&#8217;s our role?</strong></em><br />
The need to find long-term meaning via any quick casual data-farming medium creates a niche opportunity for research firms to use their proven techniques to analyze and understand this abundance of user input. Professional researchers will be able to understand how casual data is useful, where it is applicable and where there are still unanswered (and often unasked) questions. This will allow research companies to reinforce doing more in-depth research as a result of learnings from this data, rather than allowing clients to use this data (which is often incomplete) as conclusive.</em><br /><em>Even tools that have built-in analysis capabilities cannot play down the importance of involving a comprehensive research process. Design researchers look at data to understand not only design opportunities but also to come up with high-level emotional themes. If 10 people say that they want a certain feature from pampers.com, what does that mean in terms of their needs, and how will they benefit from that feature? Extrapolating concepts, ideas and feedback into themes can help the design team understand trends and potential meta-themes, and consequently how to design new products and services that weren&#8217;t necessarily articulated by their customers. Researchers also have the opportunity to help companies understand how to <em>manage</em> all of this data &#8211; does it need to lend itself to searching by future company stakeholders, or will it be regenerated? Having a plan for where the data goes can increase the value attained from it, and help to track trends over time.</em></div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[EU: the consumer must say yes to cookies]]></title>
<link>http://danskprivacynet.wordpress.com/2009/11/11/1930/</link>
<pubDate>Wed, 11 Nov 2009 14:36:25 +0000</pubDate>
<dc:creator>Frederik Kortbæk</dc:creator>
<guid>http://danskprivacynet.wordpress.com/2009/11/11/1930/</guid>
<description><![CDATA[The draft directive that was adopted at the Council meeting on 26 October and discussed in the blog post of 3 ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="text-align:justify;"><a href="http://danskprivacynet.wordpress.com/files/2009/11/cookies.jpg"><img class="alignnone size-full wp-image-1931" title="cookies" src="http://danskprivacynet.wordpress.com/files/2009/11/cookies.jpg" alt="cookies" width="114" height="170" /></a></p>
<p style="text-align:justify;">The draft directive that was adopted at the Council meeting on 26 October and discussed in the <a href="http://danskprivacynet.wordpress.com/2009/11/03/eu-radet-godkender-notifikationspligt-ved-databrud/" target="_blank">blog post</a> of 3 November spells out citizens&#8217; privacy by establishing that names, email addresses and bank details, along with data on all telephone calls and internet sessions, must be stored securely in order to prevent accidents or to stop these data deliberately falling into the wrong hands. It also states that users must be given clear and comprehensive information about how their data are handled, and that they must consent to their data being stored and to others gaining access to them.</p>
<p style="text-align:justify;">The draft words this as follows: &#8220;Member States shall ensure that the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information, in accordance with Directive 95/46/EC&#8230;&#8221;</p>
<p style="text-align:justify;">This revision of the EU&#8217;s data protection directive may turn out to have a very large negative impact on online advertising, because it requires advertisers to obtain the user&#8217;s consent in advance before they can place so-called cookies on the user&#8217;s equipment with the aim of making advertising more efficient through targeted communication and marketing (personally identifiable consumer data).</p>
<p style="text-align:justify;">A cookie is a text file that is stored on a client for a set period of time on behalf of a server. Servers typically use cookies to store user identification, user behaviour and shopping habits. The cookie is sent back to the server with subsequent requests from the client.</p>
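<p style="text-align:justify;">The round trip described above can be sketched with Python&#8217;s standard <code>http.cookies</code> module. This is only an illustration: the cookie name <code>user_id</code>, its value and the one-hour lifetime are assumptions invented for the example, and no real server or browser is involved.</p>
<pre>
```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header that asks the client to store
# a (hypothetical) user id for one hour.
outgoing = SimpleCookie()
outgoing["user_id"] = "abc123"
outgoing["user_id"]["max-age"] = 3600
outgoing["user_id"]["path"] = "/"
set_cookie_header = outgoing.output()

# Client side: on a later request the browser sends the stored
# name/value pair back in a Cookie header.
cookie_header = "user_id=abc123"

# Server side again: parse the returned header to recognise the client.
incoming = SimpleCookie()
incoming.load(cookie_header)
returned_user_id = incoming["user_id"].value

print(set_cookie_header)
print(returned_user_id)
```
</pre>
<p style="text-align:justify;">Under the draft, it is the first step, storing the value on the user&#8217;s equipment, that would require prior consent.</p>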
<p style="text-align:justify;">The European Commission has repeatedly expressed concern about the collection of consumer data for use in online advertising, and has made clear that it will not hesitate to intervene if the industry does not agree on a set of rules itself; see, for example, <a href="http://danskprivacynet.wordpress.com/2009/04/01/eu-lovgivning-om-online-annoncering-pa-vej/" target="_blank">EU legislation on online advertising on the way</a>. With the proposed draft directive, the Commission no longer seems to be waiting for a satisfactory industry code. Several industry organisations have drawn up codes of conduct for online advertising, but these give the consumer only an opt-out and not an opt-in (consumer consent). The principles of transparency and consumer choice therefore do not appear to be met.</p>
<p style="text-align:justify;">There is much to suggest that the draft directive, as part of the &#8220;telecoms package&#8221;, will be finally adopted by the Council and the European Parliament before the end of the year, with the consequence that service providers such as Google and Microsoft, as well as a long list of advertising networks, will be forced to obtain the user&#8217;s consent before collecting data about the user&#8217;s interaction. Cookies will be permitted without the user&#8217;s direct consent only if they are &#8220;strictly necessary&#8221; to provide a service the user has &#8220;explicitly&#8221; requested, for example storing purchases on online shopping sites.</p>
<p style="text-align:justify;">At the time of writing, however, it is unclear how the individual EU countries, including Denmark, will choose to implement the directive in specific legislation, but it seems impossible to get around the requirement of consumer consent for the creation of cookies. This will undoubtedly make websites much harder to use, and one may ask whether providers will risk ignoring the law in the hope that it cannot be enforced in practice.</p>
<p style="text-align:justify;">The industry is under pressure, but it may well still manage to draw up a code that essentially meets the draft directive, thereby avoiding statutory regulation that could prove more rigid and less agile and yet still not effective enough.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Chapter 2 - Warehousing Concepts]]></title>
<link>http://fairuzelsaid.wordpress.com/2009/10/30/bab-2-konsep-warehousing/</link>
<pubDate>Fri, 30 Oct 2009 06:26:30 +0000</pubDate>
<dc:creator>Fairuz El Said</dc:creator>
<guid>http://fairuzelsaid.wordpress.com/2009/10/30/bab-2-konsep-warehousing/</guid>
<description><![CDATA[Chapter 2 discusses one of the important steps in data mining, namely warehousing. The discussion]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Chapter 2 discusses one of the important steps in data mining, namely warehousing. It covers the concepts, terminology, characteristics, benefits, goals and tasks of data warehousing.</p>
<p>Download PDF:</p>
<p><a href="http://fairuzelsaid.wordpress.com/files/2009/10/data-mining-bab-02.pdf">Chapter 2, Data Mining &#8211; Warehousing Concepts</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[earth to druids... come in please...]]></title>
<link>http://armorydatamine.wordpress.com/2009/10/30/earth-to-druids-come-in-please/</link>
<pubDate>Fri, 30 Oct 2009 04:43:29 +0000</pubDate>
<dc:creator>zardoz</dc:creator>
<guid>http://armorydatamine.wordpress.com/2009/10/30/earth-to-druids-come-in-please/</guid>
<description><![CDATA[There&#8217;s a post over at wow.com that has set me a bit of a challenge. The post is about bear ta]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>There&#8217;s a post over at <a href="http://www.wow.com/2009/10/21/shifting-perspectives-the-disappearance-of-the-bear/">wow.com</a> that has set me a bit of a challenge. The post is about bear tanks, but makes the valid point that we don&#8217;t have any clear data on the popularity of the various druid forms. There&#8217;s a pretty simple reason for that &#8211; the armoury data doesn&#8217;t provide any direct way of getting such a count.</p>
<p>Still, we don&#8217;t let little obstacles like that get in our way. What we need are some data items that can be used as proxies for what we want to count. Unfortunately I&#8217;m far from being a druid expert, so I&#8217;m looking for suggestions on what items to use.</p>
<p>What we basically want is a talent, or a glyph (or maybe a gem) that bears will want to equip and cats not. And then something that&#8217;s vice versa &#8211; something that cats will have and bears not. One talent or glyph, or several&#8230; whatever makes the most sense. All suggestions on this are most welcome.</p>
<p>(Thanks to the commenters who have already made suggestions on other threads; I&#8217;ll be taking those comments on board.)</p>
<p>If I can get suggestions for both talents and glyphs then I can run more than one query and see how well the numbers match up.</p>
<p>I&#8217;d imagine that, with dual specs, players who liked both forms would have a spec for each. In any case, going from specs to forms and getting a count against the total druid population should tell us something interesting.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[HOT OFF THE PRESS:  Google and Microsoft eye eWorld domination]]></title>
<link>http://avenuel.wordpress.com/2009/10/22/hot-off-the-press-google-and-microsoft-eye-eworld-domination/</link>
<pubDate>Thu, 22 Oct 2009 20:11:16 +0000</pubDate>
<dc:creator>avenuel</dc:creator>
<guid>http://avenuel.wordpress.com/2009/10/22/hot-off-the-press-google-and-microsoft-eye-eworld-domination/</guid>
<description><![CDATA[Google and Microsoft eye eWorld domination 22 October 2009 Well, if you didn&#8217;t think these Goo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="text-align:right;"><span style="text-decoration:underline;">Google and Microsoft eye eWorld domination</span><br />
22 October 2009</p>
<p style="text-align:center;"><img class="aligncenter size-full wp-image-281" title="Picture 5" src="http://avenuel.wordpress.com/files/2009/10/picture-5.png" alt="Picture 5" width="304" height="347" /></p>
<p style="text-align:left;">Well, if you didn&#8217;t think Google and Microsoft were trying to know everything about you&#8230; they just bought your microblog.  Twitter has just sold the rights to users&#8217; tweets to the search giants for an undisclosed amount.</p>
<p style="text-align:left;">I&#8217;m not sure how I feel about Microsoft, but Google&#8230;  man, you took photos of my home, you read through my e-mails, my searches, now you&#8217;re in my microblog?  <em>Stalker.</em></p>
<p style="text-align:left;">I mean, just when you thought the world couldn&#8217;t know any more about you&#8230; Huzzah for datamining!!!  I guess this is just a really harsh warning to everyone out there that <strong>nothing is free</strong>.  Get a free blogging account, sell your information.  Get a free email account, sell your information.  Get a free colorectal exam&#8211;wait, maybe that one doesn&#8217;t apply.  Still, I know everyone&#8217;s talking about the &#8220;information age&#8221; we&#8217;re in, but I&#8217;d like some privacy.  Possessions aside, the only things we <em>really own</em> are personal privacy and personal relations, right?</p>
<p style="text-align:left;">Original Story:</p>
<p style="text-align:left;">http://www.cbc.ca/technology/story/2009/10/22/tehc-internet-twitter-google-microsoft.html</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[CIA buys into Web 2.0 monitoring firm]]></title>
<link>http://ubisurv.wordpress.com/2009/10/20/cia-buys-into-web-2-0-monitoring-firm/</link>
<pubDate>Tue, 20 Oct 2009 14:28:38 +0000</pubDate>
<dc:creator>David</dc:creator>
<guid>http://ubisurv.wordpress.com/2009/10/20/cia-buys-into-web-2-0-monitoring-firm/</guid>
<description><![CDATA[Wired online has a report that the US Central Intelligence Agency has bought a significant stake in ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a title="Wired on CIA purchase" href="http://www.wired.com/dangerroom/2009/10/exclusive-us-spies-buy-stake-in-twitter-blog-monitoring-firm/" target="_blank">Wired online has a report</a> that the US Central Intelligence Agency has bought a significant stake in a market research firm called <a title="Visible Technologies" href="http://www.visibletechnologies.com/" target="_blank">Visible Technologies</a> that specializes in monitoring new social media such as blogs, micro-blogs, forums, customer feedback sites and social networking sites (although not closed sites like Facebook &#8211; or at least that&#8217;s what they claim).  This is interesting but it isn&#8217;t surprising &#8211; most of what intelligence agencies do has always been sifting through the masses of openly available information out there &#8211; what is now called open-source intelligence &#8211; but the fact is that people are putting more of themselves out there than ever before, and material that you would never have expected to be of interest to either commercial or state organisations is now there to be mined for useful data.</p>
<p>(Thanks once again to Aaron Martin for this.)</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Micro-targeting and internet tested mail, a presentation by Peter Giangreco]]></title>
<link>http://efeitoobama.wordpress.com/2009/10/19/micro-targeting-and-internet-tested-mail-apresentacao-de-peter-giangreco/</link>
<pubDate>Mon, 19 Oct 2009 22:46:38 +0000</pubDate>
<dc:creator>Editor</dc:creator>
<guid>http://efeitoobama.wordpress.com/2009/10/19/micro-targeting-and-internet-tested-mail-apresentacao-de-peter-giangreco/</guid>
<description><![CDATA[A political science graduate of the University of Michigan and a visiting professor at the universities of]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignright" src="http://ua.pravda.com.ua/files/12/_Picture_file_path_12414.jpg" alt="" width="86" height="115" />A political science graduate of the University of Michigan and a visiting professor at the universities of Chicago, Loyola and Harvard, <strong>Peter Giangreco</strong> is one of the leading direct-mail specialists in the United States and was responsible for the direct marketing and microtargeting strategy of Barack Obama&#8217;s presidential campaign. Giangreco is a partner at <a href="http://www.strategygroup.com">The Strategy Group</a> and has more than 20 years of experience, having worked on the campaigns of Bill Clinton and Al Gore and advised the senator and former Democratic primary candidate John Edwards.</p>
<p style="text-align:left;">Read about Giangreco&#8217;s participation in the <a href="http://oefeitoobama.com/">1st Seminar on Communication and Marketing Strategy</a> of <a href="http://www.gspm.org/">George Washington University</a>:</p>
<ul>
<li><a href="http://efeitoobama.wordpress.com/2009/10/17/o-mapa-do-eleitor-americano-segundo-os-democratas/">The map of the American voter, according to the Democrats</a></li>
<li><a href="http://efeitoobama.wordpress.com/2009/10/17/%e2%80%9cbrasileiros-devem-investir-no-porta-em-porta%e2%80%9d-diz-estrategista/">“Brazilians should invest in door-to-door canvassing”, says strategist</a></li>
</ul>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Progressive Regression]]></title>
<link>http://cadsmith.wordpress.com/2009/10/18/progressive-regression/</link>
<pubDate>Mon, 19 Oct 2009 02:21:45 +0000</pubDate>
<dc:creator>cadsmith</dc:creator>
<guid>http://cadsmith.wordpress.com/2009/10/18/progressive-regression/</guid>
<description><![CDATA[Assume for a minute that forecasts of a cloudy web are true. Enterprises enter a refactoring phase w]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.livescribe.com/cgi-bin/WebObjects/LDApp.woa/wa/MLSOverviewPage?sid=7CQmnzSDFGR7"><img style="border-width:0;" src="http://cadsmith.files.wordpress.com/2009/10/091018blog.jpg?w=242&#038;h=292" border="0" alt="091018blog" width="242" height="292" /></a></p>
<p>Assume for a minute that forecasts of a cloudy web are true. Enterprises enter a refactoring phase which complements phones replacing PCs amid meetups instead of meetings. More extreme cases result in, not just paperless office, but officeless professions, education and government. The schizophrenic screen that has spent half the time on work and the other on web2.0 then solidifies seamlessly. In the future tense, it is likely that there are applications where the cloud is useful at an acceptable threshold of trust.</p>
<p>Data volume being like a tornado that sweeps over engines of search, classification and computation, a data mining booster may help find direction again, and test data analysis may establish course correction. Tools have found their way onto PCs and networks in the QA, agile and open-source cases. Components can be customized to run off of portable media such as flash drive or CD, USB or P2P, browser plugins, and phone apps. The next exercise is to establish stable, sustainable and scalable <em>eQuality</em>. For example, a mashup of templates from TIS and cloud would vector versions of testability metrics, virtual toolsets, testing-as-a-service, cloud under test, and analysis on demand. Depending upon the cultural or commercial context, they may tend more to either social or automated implementations.</p>
<p>This potentially adds new mosaics to dashboards that are now limited to notes, feeds, reviews, wikis, communications, activities, spreadsheets, and media. Cumulative captains of industry can derive further case studies, cluster BOMs and business plans. Test teams may widen their range and discover unmet needs and niches as tech has a tendency to do. Exceptions to the cloud pivot toward continuing efforts at clearing constraints.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[ebony and ivory]]></title>
<link>http://armorydatamine.wordpress.com/2009/10/16/ebony-and-ivory/</link>
<pubDate>Fri, 16 Oct 2009 04:24:51 +0000</pubDate>
<dc:creator>zardoz</dc:creator>
<guid>http://armorydatamine.wordpress.com/2009/10/16/ebony-and-ivory/</guid>
<description><![CDATA[Thanks to reader Armagon who asked for a consolidated report on race distribution. This subject is c]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Thanks to reader Armagon who asked for a consolidated report on race distribution. This subject is covered by a variety of <a href="http://www.warcraftrealms.com/census.php">other sites</a>, but there is some doubt about whether those sites are maintaining a representative sample.</p>
<p>It&#8217;s easy enough to produce a couple of simple tables that give us the information we need. (The data is from the patch 3.2 scan.)</p>
<table style="text-align:center;height:230px;" border="2" width="323">
<tbody>
<tr>
<th>Race</th>
<th>Popularity</th>
</tr>
<tr>
<td>Human</td>
<td>20 %</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>17 %</td>
</tr>
<tr>
<td>Night Elf</td>
<td>16 %</td>
</tr>
<tr>
<td>Undead</td>
<td>10 %</td>
</tr>
<tr>
<td>Draenei</td>
<td>10 %</td>
</tr>
<tr>
<td>Tauren</td>
<td>9 %</td>
</tr>
<tr>
<td>Orc</td>
<td>6 %</td>
</tr>
<tr>
<td>Gnome</td>
<td>5 %</td>
</tr>
<tr>
<td>Dwarf</td>
<td>4 %</td>
</tr>
<tr>
<td>Troll</td>
<td>4 %</td>
</tr>
</tbody>
</table>
<p>That seems to be a reasonable match to the Warcraft Realms data. There will always be some degree of sampling error in this work so everybody&#8217;s numbers have to be treated with a bit of caution.</p>
<p>Tables like that always make me shake my head a bit but&#8230; the fantasy RPG where everybody wants to roleplay the cute kid next door&#8230; Hey, trolls are people too you know!</p>
<p>Anyway, if we want the distribution of race and class then we get this. (Percentages here are based on the total population so the popularity column adds up to 100%.)</p>
<table style="text-align:center;height:230px;" border="2" width="323">
<tbody>
<tr>
<th>Race</th>
<th>Class</th>
<th>Popularity (%)</th>
</tr>
<tr>
<td>Blood Elf</td>
<td>Paladin</td>
<td>6.1</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Death Knight</td>
<td>3.2</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Priest</td>
<td>1.6</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Mage</td>
<td>1.6</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Hunter</td>
<td>1.4</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Warlock</td>
<td>1.4</td>
</tr>
<tr>
<td>Blood Elf</td>
<td>Rogue</td>
<td>1.3</td>
</tr>
<tr>
<td>Draenei</td>
<td>Shaman</td>
<td>3.9</td>
</tr>
<tr>
<td>Draenei</td>
<td>Death Knight</td>
<td>1.6</td>
</tr>
<tr>
<td>Draenei</td>
<td>Paladin</td>
<td>1.2</td>
</tr>
<tr>
<td>Draenei</td>
<td>Priest</td>
<td>0.8</td>
</tr>
<tr>
<td>Draenei</td>
<td>Hunter</td>
<td>0.7</td>
</tr>
<tr>
<td>Draenei</td>
<td>Mage</td>
<td>0.7</td>
</tr>
<tr>
<td>Draenei</td>
<td>Warrior</td>
<td>0.5</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Hunter</td>
<td>1.3</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Paladin</td>
<td>1.2</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Priest</td>
<td>0.7</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Warrior</td>
<td>0.7</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Death Knight</td>
<td>0.4</td>
</tr>
<tr>
<td>Dwarf</td>
<td>Rogue</td>
<td>0.2</td>
</tr>
<tr>
<td>Gnome</td>
<td>Mage</td>
<td>1.7</td>
</tr>
<tr>
<td>Gnome</td>
<td>Warlock</td>
<td>1.4</td>
</tr>
<tr>
<td>Gnome</td>
<td>Rogue</td>
<td>0.9</td>
</tr>
<tr>
<td>Gnome</td>
<td>Death Knight</td>
<td>0.8</td>
</tr>
<tr>
<td>Gnome</td>
<td>Warrior</td>
<td>0.5</td>
</tr>
<tr>
<td>Human</td>
<td>Paladin</td>
<td>5.3</td>
</tr>
<tr>
<td>Human</td>
<td>Death Knight</td>
<td>3.3</td>
</tr>
<tr>
<td>Human</td>
<td>Mage</td>
<td>2.7</td>
</tr>
<tr>
<td>Human</td>
<td>Warlock</td>
<td>2.5</td>
</tr>
<tr>
<td>Human</td>
<td>Warrior</td>
<td>2.4</td>
</tr>
<tr>
<td>Human</td>
<td>Priest</td>
<td>2.1</td>
</tr>
<tr>
<td>Human</td>
<td>Rogue</td>
<td>1.7</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Druid</td>
<td>6.2</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Hunter</td>
<td>3.3</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Death Knight</td>
<td>2</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Rogue</td>
<td>2</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Priest</td>
<td>1.5</td>
</tr>
<tr>
<td>Night Elf</td>
<td>Warrior</td>
<td>1.3</td>
</tr>
<tr>
<td>Orc</td>
<td>Death Knight</td>
<td>1.5</td>
</tr>
<tr>
<td>Orc</td>
<td>Shaman</td>
<td>1.3</td>
</tr>
<tr>
<td>Orc</td>
<td>Warrior</td>
<td>1.3</td>
</tr>
<tr>
<td>Orc</td>
<td>Hunter</td>
<td>1.1</td>
</tr>
<tr>
<td>Orc</td>
<td>Warlock</td>
<td>0.5</td>
</tr>
<tr>
<td>Orc</td>
<td>Rogue</td>
<td>0.3</td>
</tr>
<tr>
<td>Tauren</td>
<td>Druid</td>
<td>4.1</td>
</tr>
<tr>
<td>Tauren</td>
<td>Shaman</td>
<td>1.4</td>
</tr>
<tr>
<td>Tauren</td>
<td>Warrior</td>
<td>1.4</td>
</tr>
<tr>
<td>Tauren</td>
<td>Death Knight</td>
<td>1</td>
</tr>
<tr>
<td>Tauren</td>
<td>Hunter</td>
<td>0.6</td>
</tr>
<tr>
<td>Troll</td>
<td>Shaman</td>
<td>0.9</td>
</tr>
<tr>
<td>Troll</td>
<td>Hunter</td>
<td>0.8</td>
</tr>
<tr>
<td>Troll</td>
<td>Mage</td>
<td>0.5</td>
</tr>
<tr>
<td>Troll</td>
<td>Priest</td>
<td>0.5</td>
</tr>
<tr>
<td>Troll</td>
<td>Rogue</td>
<td>0.4</td>
</tr>
<tr>
<td>Troll</td>
<td>Death Knight</td>
<td>0.3</td>
</tr>
<tr>
<td>Troll</td>
<td>Warrior</td>
<td>0.2</td>
</tr>
<tr>
<td>Undead</td>
<td>Rogue</td>
<td>2</td>
</tr>
<tr>
<td>Undead</td>
<td>Warlock</td>
<td>1.9</td>
</tr>
<tr>
<td>Undead</td>
<td>Priest</td>
<td>1.9</td>
</tr>
<tr>
<td>Undead</td>
<td>Mage</td>
<td>1.7</td>
</tr>
<tr>
<td>Undead</td>
<td>Death Knight</td>
<td>1.3</td>
</tr>
<tr>
<td>Undead</td>
<td>Warrior</td>
<td>0.9</td>
</tr>
</tbody>
</table>
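<p>As a quick cross-check, the race/class rows can be summed per race and compared with the first table. The sketch below is only an illustration: it copies the figures for three races from the tables above, and the variable and function names are made up for the example. Both tables are rounded, so other races may differ by a point.</p>
<pre>
```python
# Race/class popularity (% of total population), copied from the second
# table for three races only.
race_class = {
    "Blood Elf": [6.1, 3.2, 1.6, 1.6, 1.4, 1.4, 1.3],
    "Human": [5.3, 3.3, 2.7, 2.5, 2.4, 2.1, 1.7],
    "Troll": [0.9, 0.8, 0.5, 0.5, 0.4, 0.3, 0.2],
}

# Race popularity (%), copied from the first table.
race_only = {"Blood Elf": 17, "Human": 20, "Troll": 4}


def race_totals(rows):
    """Sum the class rows for each race, rounded to one decimal place."""
    return {race: round(sum(values), 1) for race, values in rows.items()}


totals = race_totals(race_class)
for race, total in totals.items():
    print(race, total, "vs", race_only[race])
```
</pre>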
<p>I&#8217;ll add this to my set of reports over at my Google site, so that it stays updated with each new scan. But it does seem that the other sites that cover population are doing a reasonable job of reporting what is really going on.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[US Congress debates online data protection]]></title>
<link>http://ubisurv.wordpress.com/2009/10/09/us-online-data-protection/</link>
<pubDate>Fri, 09 Oct 2009 16:51:37 +0000</pubDate>
<dc:creator>David</dc:creator>
<guid>http://ubisurv.wordpress.com/2009/10/09/us-online-data-protection/</guid>
<description><![CDATA[The US House of Representatives will finally get to debate whether online advertising which tracks t]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>The US House of Representatives will finally get to debate whether online advertising which tracks the browsing habits of users is a violation of privacy and needs to be controlled. <a title="MSNBC story" href="http://www.msnbc.msn.com/id/32722562/ns/technology_and_science-security/" target="_blank">A bill introduced by Rep. Rick Boucher of Virginia</a> will propose an opt-out regime that gives users information about the uses to which their data will be put, and allows them to refuse to be enrolled. At present many such services work entirely unannounced, placing cookies on users&#8217; hard drives and using other tracking and datamining techniques, without any way for a user to say &#8216;no&#8217;. Of course, we have yet to see the results of the inevitable industry scare stories and hard lobbying on what will be proposed, let alone passed. But the proposal itself is particularly significant because the US has so far always bowed to business interests on online privacy and data protection, and if this bill is passed, it is a sign that what the writer Howard Rheingold long ago called the &#8216;electronic frontier&#8217; might start to acquire a little more law and order in favour of ordinary people.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Report on the FBI’s Investigative Data Warehouse 2009]]></title>
<link>http://axiomamuse.wordpress.com/2009/09/27/report-on-the-fbi%e2%80%99s-investigative-data-warehouse-2009/</link>
<pubDate>Sun, 27 Sep 2009 18:25:58 +0000</pubDate>
<dc:creator>AxXiom</dc:creator>
<guid>http://axiomamuse.wordpress.com/2009/09/27/report-on-the-fbi%e2%80%99s-investigative-data-warehouse-2009/</guid>
<description><![CDATA[Report on the Investigative Data Warehouse April 2009 Table Of Contents Overview of the IDW IDW Syst]]></description>
<content:encoded><![CDATA[Report on the Investigative Data Warehouse April 2009 Table Of Contents Overview of the IDW IDW Syst]]></content:encoded>
</item>
<item>
<title><![CDATA[Playing with PLINQ Performance using the StackOverflow Data Dump]]></title>
<link>http://nickjosevski.wordpress.com/2009/09/25/playing-with-plinq-performance-using-the-stackoverflow-data-dump/</link>
<pubDate>Fri, 25 Sep 2009 06:20:51 +0000</pubDate>
<dc:creator>nickjosevski</dc:creator>
<guid>http://nickjosevski.wordpress.com/2009/09/25/playing-with-plinq-performance-using-the-stackoverflow-data-dump/</guid>
<description><![CDATA[Not having made use of PLINQ in an actual product yet, I decided to have a play with how it works, a]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Not having made use of PLINQ in an actual product yet, I decided to have a play with how it works, and to try to obtain my own small metrics on its performance benefits. PLINQ is part of a <a href="http://en.wikipedia.org/wiki/Parallel_Extensions">larger push</a> from the .NET teams at Microsoft to get concurrent/parallel processing out of the box in your C# and VB.NET code. As for performance analysis, there are already some great posts out there, not just from the <a href="http://blogs.msdn.com/pfxteam/">Parallel Team</a> at MS but also great breakdowns with nice charts such as <a href="http://www.scip.be/index.php?Page=ArticlesNET08&#38;Lang=EN">this</a>.</p>
<p>Right off the bat, I&#8217;d like to stress that adding a <em>.AsParallel()</em> to your code won&#8217;t magically speed it up. Knowing this, I still had unrealistic expectations when I began creating a demo specifically to show performance improvements. Often enough, the level of processing I was performing (even on larger sets of data) did not benefit from being made concurrent across 2 cores. The variation in my results leads me to believe part of the issue is also the ability to obtain enough resources to make effective use of 2+ cores: for example, running out of the 4GB of RAM I have available, or interference from other processes on the machine (Firefox, TweetDeck, virus scanner).</p>
<p>I set out to re-create the &#8220;Baby Names&#8221; demo <a href="http://hanselman.com/blog/">Scott Hanselman</a> previewed at the <a href="http://www.ndc2009.no/en/">2009 NDC Conference</a> in his great presentation &#8220;<em>Whirlwind Tour of .NET 4</em>&#8221;. I first got hold of the 2008 preview code samples for PLINQ that were part of the <a href="http://www.microsoft.com/downloads/details.aspx?FamilyId=348F73FD-593D-4B3C-B055-694C50D2B0F3&#38;displaylang=en">Parallel Extensions CTP</a>.</p>
<p>I then went on to create my own simple PLINQ &#8211; Windows Presentation Foundation (WPF) application from scratch.</p>
<p>I chose WPF to test a small feature I hadn&#8217;t used yet, only because I happened to stumble upon it that day: <a href="http://msdn.microsoft.com/en-us/library/ms742806.aspx">Routed Events</a> (see this StackOverflow <a href="http://stackoverflow.com/questions/254992/how-can-i-best-handle-wpf-radio-buttons">question</a>).</p>
<p>Once I completed my take on a LINQ processing demo based on 2 minutes of video showing the operation of &#8216;Baby Names&#8217;, I discovered (by accident*) the Visual Studio 2010 and .NET Framework 4 Training Kit &#8211; <a href="http://www.microsoft.com/downloads/details.aspx?familyid=752CB725-969B-4732-A383-ED5740F02E93&#38;displaylang=en">May Preview</a>, which contains the demo code for what I was trying to re-create.</p>
<blockquote><p>*The accident by which I discovered the Training Kit: I performed a Google image search on the term &#8216;PLINQ&#8217; to look for ideas for a graphic to add to this post. The 11th image (centre screen) was the baby name graph displayed in the <em>Whirlwind Tour of .NET 4</em> presentation. The post that had the image was from Bruno Terkaly, and it was about the <a href="http://blogs.msdn.com/brunoterkaly/archive/2009/07/21/visual-studio-2010-and-net-framework-4-training-kit-may-preview.aspx">tool kit</a>, great!</p></blockquote>
<div id="attachment_578" class="wp-caption aligncenter" style="width: 450px"><a href="http://nickjosevski.wordpress.com/files/2009/09/vs2010toolkitscreen.jpg"><img src="http://nickjosevski.wordpress.com/files/2009/09/vs2010toolkitscreen.jpg" alt="VS 2010 Training Kit May Preview" title="VS 2010 Training Kit May Preview" width="436" height="205" class="size-full wp-image-578" /></a><p class="wp-caption-text">VS 2010 Training Kit May Preview</p></div>
<p>Nonetheless, my not-as-polished demo application makes use of the StackOverflow <a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/">creative commons data dump</a> (actually the <a href="http://blog.stackoverflow.com/2009/09/creative-commons-data-dump-sep-09/">Sep 09 drop</a>).</p>
<blockquote><p>
<em>Some background</em>: I grabbed the StackOverflow data dump via the <a href="http://www.legaltorrents.com/torrents/714-sep-09">LegalTorrents link</a>, then I followed this <a href="http://www.brentozar.com/archive/2009/06/how-to-import-the-stackoverflow-xml-into-sql-server/">great post</a> from <a href="http://www.brentozar.com/">Brent Ozar</a>, where he supplies code for 5 stored procedures to create a table schema and import the XML data into SQL Server. It was as simple as running them, and then writing 5 exec  statements and waiting the ~1 hour to load the data (resulting for me in a 2.5 gig DB).
</p></blockquote>
<p>The way I structured a <em>lengthy processing task</em> that can benefit from parallel processing is by making use of the <em>Posts</em> data (questions and answers), in particular questions with an accepted answer. Through a simple, repetitive string comparison I attempt to determine how valid the tags on a question are, by scanning the question text for the tags and counting their frequency. I then time the sequential operation against the parallel operation as I pipe varying amounts of data into the function.</p>
<p>First I extract the data into memory from SQL Server (using LINQ to SQL Entities). Just a note on the specifics of the SO Data Dump structure: &#8216;Score&#8217; is a nullable int, so to keep the data set down in volume I select only posts whose score is greater than a selected input (usually 10+, i.e. at least one person liked it), and likewise posts with a reasonable number of views (on average 200+).</p>
<pre class="brush: csharp;">
private IEnumerable&lt;Post&gt; GetPosts(int score, int views)
{
   // Score is nullable in the data dump, so a missing score is treated as 0.
   var posts = from p in db.Posts
           where (p.Score ?? 0) &gt; score
           &amp;&amp; p.ViewCount &gt; views
           select p;

   // ToList() executes the query and pulls the results into memory.
   return posts.ToList();
}
</pre>
<p>The next step was to create a function that would take some time to process, and of course potentially benefit from being run in parallel. Each post and its tags are processed in isolation, so this is clearly prime for separation over multiple cores. Sadly my development laptop only has 2 cores.</p>
<pre class="brush: csharp;">
private bool IsDescriptive(Post p)
{
   //lengthy boring code
   //pseudocode instead:

   var words = extract_all_unique_words_from_the_post();
     //excluding punctuation
     //and other formatting details (markup).

   var tags = extract_tags_from_post();

   return were_the_tags_used_enough_in_post(words, tags);
}
</pre>
<blockquote><p>
Note: A more sophisticated algorithm here could help actually determine (and recommend) more appropriate tags based on word frequencies, but that&#8217;s beyond what I have time to implement for performance testing purposes. It would need to know to avoid common used words such as &#8216;the&#8217;, &#8216;code&#8217;, &#8216;error&#8217;, &#8216;problem&#8217;, &#8216;unsure&#8217;, etc (you get the point). It would then need to go further and know what words actually make sense to describe the technology (language/environment) the stack overflow question is about.
</p></blockquote>
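<p>For readers who want something concrete behind the pseudocode above, here is a minimal Python sketch of the same idea. It is not the code from the demo application: the function names, the <em>min_hits</em> threshold and the assumption that the tag field looks like <em>&lt;c#&gt;&lt;linq&gt;</em> are all illustrative.</p>

```python
import re
from collections import Counter


def extract_words(body_html):
    """Strip markup, then count lowercase word occurrences in the post body."""
    text = re.sub(r"<[^>]+>", " ", body_html)            # drop HTML tags
    tokens = re.findall(r"[a-z0-9#+.\-]+", text.lower())  # keeps 'c#', '.net'
    return Counter(token.rstrip(".") for token in tokens)


def extract_tags(tag_field):
    """Parse a tag field such as '<c#><linq>' (assumed format) into a list."""
    return re.findall(r"<([^>]+)>", tag_field.lower())


def is_descriptive(body_html, tag_field, min_hits=1):
    """True when every tag appears at least min_hits times in the body."""
    words = extract_words(body_html)
    tags = extract_tags(tag_field)
    return bool(tags) and all(words[tag] >= min_hits for tag in tags)
```

<p>Each call only looks at one post, which is what makes the check a natural candidate for running across several cores.</p>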
<p>The parallel operation is applied to a &#8216;Where&#8217; filter over the data, and this is what the timing and performance reporting are based on, making use of <em>System.Diagnostics.Stopwatch</em>.</p>
<pre class="brush: csharp;">

// Note: both queries are lazy; enumerate them (e.g. with ToList()) inside the timed region.
//running sequentially:
posts.Where(p =&gt; IsDescriptive(p));

// vs making use of parallel processing:
posts.AsParallel().Where(p =&gt; IsDescriptive(p));
</pre>
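<p>The same measurement pattern can be sketched in Python. This is only an analogy to the PLINQ comparison, not a port of it: <em>is_descriptive</em> here is a hypothetical stand-in for the real check, and the helper names are made up for illustration.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor


def is_descriptive(post):
    # Hypothetical stand-in for the real per-post check.
    return post["tag"] in post["body"].lower().split()


def run_sequential(posts):
    """Filter the posts one after another on the calling thread."""
    return [p for p in posts if is_descriptive(p)]


def run_parallel(posts, workers=2):
    """Filter the posts with a pool of workers; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(is_descriptive, posts))
    return [p for p, keep in zip(posts, flags) if keep]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

<p>Both variants must return the same posts, so comparing the results is a cheap sanity check before comparing the timings. Note that threads in CPython rarely speed up CPU-bound pure-Python work; swapping in <em>ProcessPoolExecutor</em> is the usual way to spread such work over cores.</p>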
<p>On average, for record counts varying from 100k to 300k, this function making use of <em>.AsParallel()</em> showed a <strong>1.75 times speed up</strong> over the function operating sequentially on a single core, which is what I was hoping to see.</p>
<p>All this was performed on a boot from VHD instance of Windows 7 (a great setup guide by Scott Hanselman <a href="http://www.hanselman.com/blog/LessVirtualMoreMachineWindows7AndTheMagicOfBootToVHD.aspx">here</a>) with Visual Studio 2010 Beta 1 and SQL Server 2008, so I do understand there was some performance hit (both running as a VHD and having SQL on the same machine) but on average for effective PLINQ setup functions there was at <strong>least</strong> a 1.6 times factor speed up.</p>
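<p>One way to put those speed-up figures in context is Amdahl&#8217;s law, S = 1 / ((1 - p) + p / n), where p is the fraction of the work that runs in parallel and n is the number of cores. The sketch below inverts it for p. It is an idealised model that ignores scheduling and memory overhead, so treat the numbers as rough estimates rather than measurements.</p>

```python
def parallel_fraction(speedup, cores):
    """Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to solve for p."""
    return (1 - 1 / speedup) / (1 - 1 / cores)


for observed in (1.75, 1.6):
    p = parallel_fraction(observed, 2)
    print(f"{observed}x on 2 cores -> parallel fraction of about {p:.3f}")
```

<p>Under this model a 1.75 times speed up on 2 cores corresponds to roughly 86% of the work running in parallel, and 1.6 times to 75%.</p>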
<p>It&#8217;s that simple; it doesn&#8217;t do much <em>yet</em>, but there is potential for improved/more-interesting data analysis and performance measuring of it too. I will make time to clean up the demo application and post the solution files in a <em>future post</em>, so stay tuned for that. When I get a chance I&#8217;ll also try to investigate more of the data manipulations people are performing via data mining techniques and attempt to re-create them just for more performance tests. When I do I&#8217;ll be starting <a href="http://sqlserverpedia.com/wiki/Interesting_StackOverflow_Database_Queries">here</a>.</p>
<p>That&#8217;s it, I&#8217;ll have a follow up post with some more details in particular the types of queries I had that did <strong>not</strong> benefit from PLINQ once I get a chance to determine how they were flawed or if they simply just run better on a single thread/core.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Stream of Scientific Revolutions]]></title>
<link>http://cadsmith.wordpress.com/2009/09/20/the-stream-of-scientific-revolutions/</link>
<pubDate>Sun, 20 Sep 2009 22:23:19 +0000</pubDate>
<dc:creator>cadsmith</dc:creator>
<guid>http://cadsmith.wordpress.com/2009/09/20/the-stream-of-scientific-revolutions/</guid>
<description><![CDATA[Due to social and sensor networks, it is estimated that data volume is doubling every 9 to 12 months]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Due to social and sensor networks, it is estimated that data volume is doubling every 9 to 12 months. Analysis is required in realtime to derive knowledge from distributed databases. Awareness is improved by adding sources, e.g. the internet of things. The term <em>data mining,</em> reportedly coined by Robert Hecht-Nielsen a couple of decades ago<em>,</em> denotes automated fact-finding, knowledge discovery, rule inference and prediction activities. The field follows predecessors such as statistics, originally named for state demographics and economics, and machine learning. ACM dedicated a knowledge discovery and data mining group, KDD, in 1989.</p>
<p>In classic science, a hypothesis is often disproved by experiment, whereas in this case, tests yield a data-driven hypothesis. Patterns of interest are useful or novel, though most are not. More recently, the field picked up steam as analysis times for huge databases became excessive, disparate sources needed to be quickly connected and dimensionality, or number of attributes, expanded. These result in ways to assign meaning which leads to knowledge which is communicated as information assuming that errors are avoided or corrected. The result is better visualization and built-in database intelligence.</p>
<p>Government and security have been major proponents, e.g. for profiling. Other applications include biomed, insurance, physics, business intelligence, CRM, information retrieval, OLAP online analytical processing, text mining and analysis, finding experts, sports stats, and digital libraries. Besides software, tools include decision trees and neural networks. Models may be verified by splitting the data and verifying the equivalence of results on both parts.</p>
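<p>The split-and-compare check mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple least-squares line as the model; the function names, the tolerance and the fixed seed are hypothetical choices, not a standard procedure.</p>

```python
import random


def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_half_check(points, tolerance=0.1, seed=0):
    """Fit the model on two random halves and compare the fitted slopes."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    slope_a, _ = fit_line(shuffled[:mid])
    slope_b, _ = fit_line(shuffled[mid:])
    return abs(slope_a - slope_b) <= tolerance, slope_a, slope_b
```

<p>If the two halves disagree by more than the tolerance, the pattern found in the full data set is probably not stable.</p>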
<p>Major tasks have been outlined as:</p>
<ul>
<li>classification, sequence detection, genetic algorithms, nearest neighbor, naive Bayes classifier, logistic regression and discriminant analysis;</li>
<li>affinity analysis, market basket, association analysis, rule learning, rough sets, and sequence detection;</li>
<li>prediction, regression, and time series analysis forecasting;</li>
<li>segmentation, cluster analysis, and Kohonen networks.</li>
</ul>
<p>A couple of the popular standards are CRISP-DM, cross industry standard process for data mining, and PMML, predictive model markup language.</p>
<p>There is plenty of software such as R, SAS SEMMA (sample, explore, modify, model, assess), SPSS, Netbase, Statistica, open-source Labkey, Rattle GNOME GUI, GNU Octave, Weka-3, Apache Hadoop, Datalogic/R, Mozenda scraper. IBM, Oracle and Microsoft have offerings.</p>
<p>Other than usability, system integration and projections from prior knowledge, issues commonly revolve around privacy and performance. Congress has discussed consumer protections, though users are tracked from an increasing number of government social-network sites and cloud security standards are still in development. Data may be missing. Patterns may not be understandable. Noisy data can result in spurious patterns, though source correction is improving. Relationships between fields may be more complex than assumed.</p>
<p><a href="http://www.kdnuggets.com">Kdnuggets</a> is a general resource site.</p>
<p>Also see <a href="http://delicious.com/johnro/datamining">bookmarks</a> for media links.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[JOLT]]></title>
<link>http://runmotherfuckerrun.wordpress.com/2009/09/18/jolt/</link>
<pubDate>Fri, 18 Sep 2009 13:46:41 +0000</pubDate>
<dc:creator>runmotherfuckerrun</dc:creator>
<guid>http://runmotherfuckerrun.wordpress.com/2009/09/18/jolt/</guid>
<description><![CDATA[JOLT stands for the Journal of Law &amp; Technology at Harvard University. In other words: quality stuff, ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>JOLT stands for the <a href="http://jolt.law.harvard.edu/" target="_blank">Journal of Law &#38; Technology at Harvard University</a>. In other words: quality stuff, plenty of PDFs, some of them really interesting.</p>
<p><a href="http://jolt.law.harvard.edu/articles/pdf/v22/22HarvJLTech515.pdf" target="_blank">This link</a> has a PDF on &#8220;DataMining &#38; Antitrust&#8221; that discusses cross-referencing data to build a price pyramid. The <em>abstract</em> gives an example involving Amazon:</p>
<blockquote><p>In 2000, customers of Amazon.com discovered that the online retailer was varying the prices charged for DVDs depending on the identity of the purchaser. Although Amazon discontinued what it described as a “price test” after public outcry, Amazon’s brief foray into first-degree price discrimination stands as a noteworthy example of the possibilities for price discrimination using aggregated data. In its price test, Amazon sought to use information it already had about its customers to predict higher prices that the customers would still be likely to pay.</p></blockquote>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Child-protection filters spy on children]]></title>
<link>http://11k2.wordpress.com/2009/09/09/kinderschutzfilter-bespitzeln-kinder/</link>
<pubDate>Wed, 09 Sep 2009 15:40:13 +0000</pubDate>
<dc:creator>Fritz</dc:creator>
<guid>http://11k2.wordpress.com/2009/09/09/kinderschutzfilter-bespitzeln-kinder/</guid>
<description><![CDATA[The US software company EchoMetrix Inc specializes in child-protection filter software such as Sentry and FamilySa]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="margin-bottom:0;" align="LEFT"><a href="http://11k2.wordpress.com/files/2009/09/090909echomatrix-screenshot.jpg"><img class="alignnone size-full wp-image-10921" title="090909echomatrix-screenshot" src="http://11k2.wordpress.com/files/2009/09/090909echomatrix-screenshot.jpg" alt="090909echomatrix-screenshot" width="460" height="327" /></a></p>
<p style="margin-bottom:0;" align="LEFT">The US software company EchoMetrix Inc specializes in child-protection filter software such as Sentry and FamilySafe. It lets parents decide what their children can and cannot see on the internet. In June the company introduced another<!--more--> service:</p>
<p style="margin-bottom:0;" align="LEFT">Pulse. Here, companies with products aimed at children can find out what children think of those products, because EchoMetrix eavesdrops on the children&#8217;s chats as soon as its filter software is installed on their computers, and rates their reactions to the products in question.</p>
<p style="margin-bottom:0;" align="LEFT">EchoMetrix CEO Jeff Greene maintains that this complies with US data protection law. Besides, after a successful installation, concerned parents can click their way through the vendor&#8217;s website and, in their account, untick the &#8220;sharing of data with trusted business partners&#8221;. That is not stated anywhere, though; you have to find it out yourself.</p>
<p style="margin-bottom:0;" align="LEFT">You see? They do business with innocent children, and spy on their conversations for lucrative datamining. What do you think should be done with people like that?</p>
<p style="margin-bottom:0;" align="LEFT"><a href="http://11k2.wordpress.com/files/2009/09/090909pulse.jpg"><img class="alignnone size-full wp-image-10922" title="090909pulse" src="http://11k2.wordpress.com/files/2009/09/090909pulse.jpg" alt="090909pulse" width="460" height="132" /></a></p>
<p style="margin-bottom:0;" align="LEFT">( <a href="http://www.google.com/hostednews/ap/article/ALeqM5i5CjgMEdrwRm3JxeglUykMAHAYmAD9AGNVM00" target="_blank">google news</a> via <a href="http://www.boingboing.net/2009/09/07/child-safety-softwar.html" target="_blank">boingboing</a>)</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Data Mining: Obama White House Has Secret Plan To Harvest Personal Data From Social Networking Websites]]></title>
<link>http://freedomandlinux.wordpress.com/2009/09/02/data-mining-obama-white-house-has-secret-plan-to-harvest-personal-data-from-social-networking-websites/</link>
<pubDate>Wed, 02 Sep 2009 15:25:18 +0000</pubDate>
<dc:creator>darthchaosofrspw</dc:creator>
<guid>http://freedomandlinux.wordpress.com/2009/09/02/data-mining-obama-white-house-has-secret-plan-to-harvest-personal-data-from-social-networking-websites/</guid>
<description><![CDATA[Comment: Listen, you fake progressives and fake liberals. It wasn&#8217;t okay when Bush did it, SO ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><em><strong>Comment: Listen, you fake progressives and fake liberals. It wasn&#8217;t okay when Bush did it, SO WHY ARE YOU OKAY WITH OBAMA DOING IT?! You are not liberals. You are not progressives. You are neoliberals. You are the same as the fascist neocons who sucked Bush&#8217;s dick. So go ahead and suck Obama&#8217;s dick and choke on it.</strong></em></p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-</p>
<p><span style="font-size:14pt;line-height:1.3em;"><strong>Obama White House Has Secret Plan To Harvest Personal Data From Social Networking Websites</strong></span></p>
<p>Submitted by Ken Boehm on Mon, 08/31/2009 &#8211; 19:07<br />
<a href="http://nlpc.org/stories/2009/08/31/obama-white-house-has-secret-plan-harvest-personal-data-social-networking-website" target="_blank">http://nlpc.org/stories/2009/08/31/obama-white-house-has-secret-plan-harvest-personal-data-social-networking-website</a></p>
<p>NLPC has uncovered a plan by the White House New Media operation to hire a technology vendor to conduct a massive, secret effort to harvest personal information on millions of Americans from social networking websites.</p>
<p>The information to be captured includes comments, tag lines, emails, audio, and video. The targeted sites include Facebook, Twitter, MySpace, YouTube, Flickr and others – any space where the White House “maintains a presence.”</p>
<p>In the course of investigating procurement by the White House New Media office, NLPC discovered a 51-page solicitation of bids that was filed on Friday, August 21, 2009. Filed as Solicitation # WHO-S-09-0003, it is posted at FedBizOpps.</p>
<p>While the solicitation specifies a 12-month contract, it allows for seven one-year extensions. It specifies no dollar cap. Other troubling issues include:</p>
<ul>
<li>extremely broad secrecy terms preventing the vendor from disclosing to the public or the media what information is being captured and archived (page 7, “Restriction Against Disclosure”)</li>
<li>wholesale capturing of comments by non-White House staff on publicly accessible sites</li>
<li>capturing of content of any type (text, graphics, audio, or video)</li>
<li>capturing of comments by both Obama critics and supporters, with no restriction as to how the White House would use the information.</li>
</ul>
<p>This is the third controversy involving the White House internet operations in less than a month. First, Obama’s New Media operation asked supporters to send information about critics of the White House health care effort to a White House email. This provoked a storm of criticism and the White House retreated. Then a large number of people complained of getting email spam from the White House supporting the President’s health care position. Again the White House was forced to back down.</p>
<p>Now the same people at the White House are at it again with an ambitious plan to harvest huge amounts of information from the web and specifically social networking sites.</p>
<p>Given the White House’s recent abuse of its New Media operations, this huge, new secretive program is yet another sign that this Administration is at best indifferent to privacy rights and at worst prepared to violate civil liberties for political purposes.</p>
<p>Perhaps anticipating negative reaction to the invasiveness of the plan, a justification is provided in a Q&#38;A. section of the solicitation. Question #9 reads:</p>
<p>The Presidential Records Act does not require the storage or archiving of non-EOP content, as such is there a specific reason as to why the content provided on EOP related websites in the form of comments is included in these archiving procedures?</p>
<p>Answer: The PRA includes in its definition of presidential records content ―received by PRA components and personnel. Out of an abundance of caution, we are treating comments made by non-PRA personnel on sites on which a PRA component has a presence as presidential records, requiring them to be captured or sampled.</p>
<p>Of course, this interpretation of the Presidential Records Act is so expansive that virtually any communication mentioning the president or the Administration could become subject to collection and archiving under the Act. This is not out of an “abundance of caution,” but out of an over-abundance of power. President Obama should make sure that this plan goes no further.</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-</p>
<p>You can&#8217;t even get to NLPC&#8217;s website <a href="http://www.nlpc.org/" target="_blank">www.nlpc.org</a> (National Legal Policy Center) to read their write up as I would assume it&#8217;s getting hammered.</p>
<p>You can only find it on various blogs at the moment. Below is a link to the contract description/PDF</p>
<p><a href="http://nicedeb.wordpress.com/2009/09/01/report-obama-white-house-plans-to-harvest-personal-data-from-networking-sites/" target="_blank">http://nicedeb.wordpress.com/2009/09/01/report-obama-white-house-plans-to-harvest-personal-data-from-networking-sites/</a></p>
<p>This is the link to NLPC&#8217;s report/article but evidently their servers are down</p>
<p><a href="http://nlpc.org/stories/2009/08/31/obama-white-house-has-secret-plan-harvest-personal-data-social-networking-website" target="_blank">http://nlpc.org/stories/2009/08/31/obama-white-house-has-secret-plan-harvest-personal-data-social-networking-website</a></p>
<p>contract description/PDF<br />
<a href="https://www.fbo.gov/index?s=opportunity&#38;mode=form&#38;id=eec856940efb75b2b1c11e2b1d5660a4&#38;tab=core&#38;_cview=0&#38;cck=1&#38;au=&#38;ck=" target="_blank">https://www.fbo.gov/index?s=opportunity&#38;mode=form&#38;id=eec856940efb75b2b1c11e2b1d5660a4&#38;tab=core&#38;_cview=0&#38;cck=1&#38;au=&#38;ck=</a></p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;</p>
<p><strong>White House Still Trying To Get Information On Citizens</strong><br />
The National Legal and Policy Center (NLPC) has discovered a secret White House project to harvest personal data from social networking websites like Facebook and Twitter.</p>
<p>The White House office of New Media has sent out a request for proposals from technology vendors to develop and run the project. According to the proposal request, the information to be captured includes comments, tag lines, emails, audio, and video.</p>
<p>The targeted sites include Facebook, Twitter, MySpace, YouTube, Flickr and others &#8212; any space where the White House &#8220;maintains a presence.&#8221; The proposal request allows for the project to last up to eight years.</p>
<p>In all fairness, if you read the PDF of the solicitation, it speaks of the project as a way to comply with the Presidential Records Act. But then there are the frightening parts, especially for this administration, which promises to be the most transparent in history. The disturbing parts of the proposal include:</p>
<ul>
<li>Extremely broad secrecy terms preventing the vendor from disclosing to the public or the media what information is being captured and archived (page 7, &#8220;Restriction Against Disclosure&#8221;)</li>
<li>Wholesale capturing of comments by non-White House staff on publicly accessible sites</li>
<li>Capturing of content of any type (text, graphics, audio, or video)</li>
<li>Capturing of comments by both Obama critics and supporters, with no restriction as to how the White House would use the information.</li>
</ul>
<p><a href="http://www.rantburg.com/poparticle.php?D=2009-09-02&#38;ID=278037&#38;loc=interstitialskip" target="_blank">http://www.rantburg.com/poparticle.php?D=2009-09-02&#38;ID=278037&#38;loc=interstitialskip</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Analyzing the laptop market with Python]]></title>
<link>http://butaji.wordpress.com/2009/08/31/%d0%b0%d0%bd%d0%b0%d0%bb%d0%b8%d0%b7-%d1%80%d1%8b%d0%bd%d0%ba%d0%b0-%d0%bd%d0%be%d1%83%d1%82%d0%b1%d1%83%d0%ba%d0%be%d0%b2-%d1%81-%d0%bf%d0%be%d0%bc%d0%be%d1%89%d1%8c%d1%8e-python/</link>
<pubDate>Mon, 31 Aug 2009 05:09:43 +0000</pubDate>
<dc:creator>butaji</dc:creator>
<guid>http://butaji.wordpress.com/2009/08/31/%d0%b0%d0%bd%d0%b0%d0%bb%d0%b8%d0%b7-%d1%80%d1%8b%d0%bd%d0%ba%d0%b0-%d0%bd%d0%be%d1%83%d1%82%d0%b1%d1%83%d0%ba%d0%be%d0%b2-%d1%81-%d0%bf%d0%be%d0%bc%d0%be%d1%89%d1%8c%d1%8e-python/</guid>
<description><![CDATA[Introduction In this article I will describe the current state of the Russian laptop market. All the analy]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><h2>Introduction</h2>
<p>In this article I will describe the current state of the Russian laptop market. We will do all of the analysis with Python code. I think it will be useful both to people shopping for a laptop and to people who want to practice writing Python.</p>
<h2>Getting started</h2>
<p><a href="http://butaji.files.wordpress.com/2009/08/diy034251.jpg"><img style="display:inline;border-width:0;margin:0 10px 0 0;" title="diy-03-425[1]" border="0" alt="diy-03-425[1]" align="left" src="http://butaji.files.wordpress.com/2009/08/diy034251_thumb.jpg?w=244&#038;h=184" width="244" height="184" /></a> For the analysis we need a data set. Unfortunately, I could not find any <a href="http://developer.ebay.com/">web services</a> offered by Russian online laptop stores, so I had to download the price list of one of them (I will not name it) and extract the prices and the main parameters from it (in my opinion these are: CPU frequency, screen diagonal, amount of RAM, hard disk size and video card memory). I then ran some analysis on the following questions:</p>
<ol>
<li>The average price of a laptop </li>
<li>The average hardware parameters of the laptops </li>
<li>The most expensive/cheapest laptop configuration </li>
<li>Which configuration parameter affects the price the most </li>
<li>Predicting the price of a given configuration </li>
<li>A plot of the distribution of configurations and prices </li>
</ol>
<h2>Let's code</h2>
<p>I saved the price list I managed to get hold of in CSV format. To work with it we need to import the csv module:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:06cacbed-5c59-4f43-bd47-0054ff72760b" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>import</b></span>&#160;<span style="color:#0000FF;"><b>csv</b></span><br />
<span style="color:#AA22FF;"><b>import</b></span>&#160;<span style="color:#0000FF;"><b>re</b></span><br />
<span style="color:#AA22FF;"><b>import</b></span>&#160;<span style="color:#0000FF;"><b>random</b></span>
</div>
</div>
<p>We also import the modules for random numbers and regular expressions, which we will need later.</p>
<p>Next, let's write a function that reads the file and returns the laptops:</p>
</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:3782faa3-fd6a-437c-b0eb-12919c112024" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_notebooks</span>():<br />
&#160;&#160;&#160;&#160;reader&#160;<span style="color:#666666;">=</span>&#160;csv<span style="color:#666666;">.</span>reader(<span style="color:#AA22FF;">open</span>(<span style="color:#BB4444;">'data.csv'</span>),&#160;delimiter<span style="color:#666666;">=</span><span style="color:#BB4444;">';'</span>,&#160;quotechar<span style="color:#666666;">=</span><span style="color:#BB4444;">'&#124;'</span>)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">filter</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;x:&#160;x&#160;<span style="color:#666666;">!=</span>&#160;<span style="color:#AA22FF;">None</span>,&#160;<span style="color:#AA22FF;">map</span>(create_notebook,&#160;reader))
</div>
</div>
<p>Nothing complicated here: we read the data file data.csv and filter on the result of the create_notebook function, because not every item in the price list is a laptop. And here is that function:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:5d660e7e-55e0-4ee1-b0bf-13dedc90f294" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">create_notebook</span>(raw):<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>try</b></span>:<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook&#160;<span style="color:#666666;">=</span>&#160;Notebook()<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>vendor&#160;<span style="color:#666666;">=</span>&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'&#160;'</span>)[<span style="color:#666666;">0</span>]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>model&#160;<span style="color:#666666;">=</span>&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'&#160;'</span>)[<span style="color:#666666;">1</span>]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>cpu&#160;<span style="color:#666666;">=</span>&#160;getFloat(<span style="color:#BB4444;">r"(\d+)\,(\d+)\sГ"</span>,&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'/'</span>)[<span style="color:#666666;">0</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>monitor&#160;<span style="color:#666666;">=</span>&#160;getFloat(<span style="color:#BB4444;">r"(\d+)\.(\d+)\""</span>,&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'/'</span>)[<span style="color:#666666;">1</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>ram&#160;<span style="color:#666666;">=</span>&#160;getInt(<span style="color:#BB4444;">r"(\d+)Mb"</span>,&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'/'</span>)[<span style="color:#666666;">2</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>hdd&#160;<span style="color:#666666;">=</span>&#160;getInt(<span style="color:#BB4444;">r"(\d+)Gb"</span>,&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'/'</span>)[<span style="color:#666666;">3</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>video&#160;<span style="color:#666666;">=</span>&#160;getInt(<span style="color:#BB4444;">r"(\d+)Mb"</span>,&#160;raw[<span style="color:#666666;">0</span>]<span style="color:#666666;">.</span>split(<span style="color:#BB4444;">'/'</span>)[<span style="color:#666666;">4</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;notebook<span style="color:#666666;">.</span>price&#160;<span style="color:#666666;">=</span>&#160;getInt(<span style="color:#BB4444;">r"(\d+)\sруб."</span>,&#160;raw[<span style="color:#666666;">1</span>])<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;notebook<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>except</b></span>&#160;<span style="color:#D2413A;"><b>Exception</b></span>,&#160;e:<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">None</span>
</div>
</div>
<p>As you can see, I decided to leave the vendor, the model and the CPU type out of the analysis (it is not quite that simple, of course, but still). One more thing: this function relies on my own helper functions:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:c79507b7-6596-43d1-9aac-a11e7711745a" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">getFloat</span>(regex,&#160;raw):<br />
&#160;&#160;&#160;&#160;m&#160;<span style="color:#666666;">=</span>&#160;re<span style="color:#666666;">.</span>search(regex,&#160;raw)<span style="color:#666666;">.</span>groups()<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">float</span>(m[<span style="color:#666666;">0</span>]&#160;<span style="color:#666666;">+</span>&#160;<span style="color:#BB4444;">'.'</span>&#160;<span style="color:#666666;">+</span>&#160;m[<span style="color:#666666;">1</span>])<br />
<br />
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">getInt</span>(regex,&#160;raw):<br />
&#160;&#160;&#160;&#160;m&#160;<span style="color:#666666;">=</span>&#160;re<span style="color:#666666;">.</span>search(regex,&#160;raw)<span style="color:#666666;">.</span>groups()<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">int</span>(m[<span style="color:#666666;">0</span>])
</div>
</div>
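<p>To make the parsing step easier to follow, here is a minimal self-contained sketch of the same idea in Python 3 syntax. The sample row is made up (the real price list is not shown in this article), and get_float and get_int are illustrative stand-ins for the helpers above, so treat it as an assumption about the row format rather than the actual data.</p>

```python
import re


def get_float(regex, raw):
    """Join two captured digit groups into a float, e.g. ('2', '26') -> 2.26."""
    whole, frac = re.search(regex, raw).groups()
    return float(whole + "." + frac)


def get_int(regex, raw):
    """Return the first captured group as an int."""
    return int(re.search(regex, raw).group(1))


# Hypothetical row: the field layout is inferred from the regular
# expressions above, not copied from the real price list.
name = 'Acme X100 Core 2 Duo 2,26 ГГц/15.4"/2048Mb/250Gb/256Mb'
price = "34990 руб."

parts = name.split("/")
spec = {
    "cpu": get_float(r"(\d+),(\d+)\sГ", parts[0]),
    "monitor": get_float(r'(\d+)\.(\d+)"', parts[1]),
    "ram": get_int(r"(\d+)Mb", parts[2]),
    "hdd": get_int(r"(\d+)Gb", parts[3]),
    "video": get_int(r"(\d+)Mb", parts[4]),
    "price": get_int(r"(\d+)\sруб", price),
}
print(spec)
```

<p>If a row does not match (for example, an accessory instead of a laptop), re.search returns None and the attribute access raises an exception, which is exactly what create_notebook relies on to skip such rows.</p>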
<p>I should note that, in my view, Python code is best written in terms of plain data collections rather than OOP structures, because the language lends itself to that style. Still, to bring some order to our domain (laptops), I introduced a class, as you may have noticed above (notebook = Notebook()):</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:379c08af-4ff7-40fb-a4f7-cdfe6e0f1814" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>class</b></span>&#160;<span style="color:#0000FF;">Notebook</span>:<br />
&#160;&#160;&#160;<span style="color:#AA22FF;"><b>pass</b></span>
</div>
</div>
<p>Great, now we have the structure in memory and it is ready for analysis (<em>2005 different configurations with their prices</em>). Let's begin:</p>
</p>
<p><strong>The average price of a laptop:</strong> </p>
</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:18fcd376-b104-4aa9-bf1e-b97e360befa7" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_avg_price</span>():<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>price&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks())
</div>
</div>
<p>Running the code shows that $1K as the standard price of a computer still holds (the result is in rubles):</p>
</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:4a6517cb-5756-4ae9-9e6d-ba929393e6c9" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;get_avg_price()<br />
<span style="color:#009999;">34574</span>
</div>
</div>
<p><strong>The average hardware parameters of a laptop </strong></p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:dbac7615-63ce-431e-956e-5ca4af3ea668" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_avg_parameters</span>():<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"cpu&#160;{0}"</span><span style="color:#666666;">.</span>format(<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>cpu&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks()))<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"monitor&#160;{0}"</span><span style="color:#666666;">.</span>format(<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>monitor&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks()))<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"ram&#160;{0}"</span><span style="color:#666666;">.</span>format(<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>ram&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks()))<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"hdd&#160;{0}"</span><span style="color:#666666;">.</span>format(<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>hdd&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks()))<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"video&#160;{0}"</span><span style="color:#666666;">.</span>format(<span style="color:#AA22FF;">sum</span>([n<span style="color:#666666;">.</span>video&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<span style="color:#666666;">/</span><span style="color:#AA22FF;">len</span>(get_notebooks()))
</div>
</div>
</p>
<p>Ta-da, the averaged configuration is in our hands:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:25687bbf-6b61-4ed9-b53e-ffd9e97a44e2" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;get_avg_parameters()<br />
cpu&#160;<span style="color:#009999;">2.0460798005</span><br />
monitor&#160;<span style="color:#009999;">14.6333167082</span><br />
ram&#160;<span style="color:#009999;">2448</span><br />
hdd&#160;<span style="color:#009999;">243</span><br />
video&#160;<span style="color:#009999;">289</span>
</div>
</div>
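<p>A side note on the averaging code: it calls get_notebooks() twice per line, so the file is re-read ten times. Below is a small sketch of computing all the means from one in-memory list. It uses Python 3 syntax, a namedtuple instead of the Notebook class, and made-up sample values, all of which are my assumptions for illustration. Also note that in Python 3 the / operator is true division, whereas the whole-number ram, hdd and video values in the output above come from Python 2 integer division.</p>

```python
from collections import namedtuple

# Illustrative record type; the article itself uses an empty Notebook class.
Notebook = namedtuple("Notebook", "cpu monitor ram hdd video price")


def average_parameters(notebooks):
    """Return the mean of every field, computed from one in-memory list."""
    count = len(notebooks)
    return {
        field: sum(getattr(n, field) for n in notebooks) / count
        for field in Notebook._fields
    }


# Made-up configurations, only to show the call.
sample = [
    Notebook(2.0, 15.4, 2048, 250, 256, 30000),
    Notebook(2.5, 17.0, 4096, 500, 512, 50000),
]
averages = average_parameters(sample)
print(averages["cpu"], averages["ram"], averages["price"])
```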
<p><strong>The most expensive and the cheapest laptop configuration:</strong></p>
<p>The two functions are identical apart from calling min or max, so only the max version is shown:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:2affb902-918d-460f-8dc0-1e0c4c159367" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_max_priced_notebook</span>():<br />
&#160;&#160;&#160;&#160;maxprice&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>price&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;get_notebooks()])<br />
&#160;&#160;&#160;&#160;maxconfig&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">filter</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;x:&#160;x<span style="color:#666666;">.</span>price&#160;<span style="color:#666666;">==</span>&#160;maxprice,&#160;get_notebooks())[<span style="color:#666666;">0</span>]<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"cpu&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>cpu)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"monitor&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>monitor)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"ram&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>ram)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"hdd&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>hdd)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"video&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>video)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;<span style="color:#BB4444;">"price&#160;{0}"</span><span style="color:#666666;">.</span>format(maxconfig<span style="color:#666666;">.</span>price)
</div>
</div>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:7bbe3eeb-09bf-425e-bfa5-8fa6213ae5ff" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;get_max_priced_notebook()<br />
cpu&#160;<span style="color:#009999;">2.26</span><br />
monitor&#160;<span style="color:#009999;">18.4</span><br />
ram&#160;<span style="color:#009999;">4096</span><br />
hdd&#160;<span style="color:#009999;">500</span><br />
video&#160;<span style="color:#009999;">1024</span><br />
price&#160;<span style="color:#009999;">181660</span>
</div>
</div>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:8216f0f7-224f-4708-875f-d4ec51cb3cdc" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;get_min_priced_notebook()<br />
cpu&#160;<span style="color:#009999;">1.6</span><br />
monitor&#160;<span style="color:#009999;">8.9</span><br />
ram&#160;<span style="color:#009999;">512</span><br />
hdd&#160;<span style="color:#009999;">8</span><br />
video&#160;<span style="color:#009999;">128</span><br />
price&#160;<span style="color:#009999;">8090</span>
</div>
</div>
<p><strong>Which configuration parameter affects the price most </strong></p>
<p>It would be very interesting to know which configuration parameter we pay the most for. My rough guess was that it is most likely the screen diagonal and the CPU frequency. Let's check.</p>
<p>First we need to modify our set of configuration parameters slightly. Because the units of the different parameters differ by orders of magnitude, we have to bring them to a common scale, i.e. normalize them. Let's get started:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:5208e42c-3412-4aa7-872e-3305180a4a4d" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">normalized_set_of_notebooks</span>():<br />
&#160;&#160;&#160;&#160;notebooks&#160;<span style="color:#666666;">=</span>&#160;get_notebooks()<br />
&#160;&#160;&#160;&#160;cpu&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>cpu&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;notebooks])<br />
&#160;&#160;&#160;&#160;monitor&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>monitor&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;notebooks])<br />
&#160;&#160;&#160;&#160;ram&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>ram&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;notebooks])<br />
&#160;&#160;&#160;&#160;hdd&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>hdd&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;notebooks])<br />
&#160;&#160;&#160;&#160;video&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">max</span>([n<span style="color:#666666;">.</span>video&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;n&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;notebooks])<br />
&#160;&#160;&#160;&#160;rows&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">map</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;n&#160;:&#160;[n<span style="color:#666666;">.</span>cpu<span style="color:#666666;">/</span>cpu,&#160;n<span style="color:#666666;">.</span>monitor<span style="color:#666666;">/</span>monitor,&#160;<span style="color:#AA22FF;">float</span>(n<span style="color:#666666;">.</span>ram)<span style="color:#666666;">/</span>ram,&#160;<span style="color:#AA22FF;">float</span>(n<span style="color:#666666;">.</span>hdd)<span style="color:#666666;">/</span>hdd,&#160;<span style="color:#AA22FF;">float</span>(n<span style="color:#666666;">.</span>video)<span style="color:#666666;">/</span>video,&#160;n<span style="color:#666666;">.</span>price],&#160;notebooks)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;rows
</div>
</div>
<p>In this function I find the maximum value of each parameter and then build the resulting list of laptops, in which each parameter is represented as a coefficient (ranging from 0 to 1) equal to the ratio of the parameter to its maximum value in the set. For example, 2048Mb of memory gives a configuration the coefficient ram = 0.5 (2048/4096).</p>
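<p>As a quick worked example of this normalization, here is a sketch in Python 3 syntax. The first and last rows reuse the cheapest and the most expensive configurations printed earlier; the middle row is made up, and normalize_by_max is just an illustrative name. The price column is left out because the function above does not normalize it either.</p>

```python
def normalize_by_max(rows):
    """Divide every column by its maximum, assuming all values are positive."""
    maxima = [max(column) for column in zip(*rows)]
    return [[value / top for value, top in zip(row, maxima)] for row in rows]


# Columns: cpu, monitor, ram, hdd, video
rows = [
    [1.6, 8.9, 512, 8, 128],         # cheapest configuration from the article
    [2.0, 15.4, 2048, 250, 256],     # made-up mid-range configuration
    [2.26, 18.4, 4096, 500, 1024],   # most expensive configuration
]
normalized = normalize_by_max(rows)
print(normalized[1][2])  # ram coefficient of the middle row: 2048 / 4096
```

<p>The row that holds the maximum of a column gets exactly 1.0 for it, and every other row gets a fraction of that.</p>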
<p>For clarity, we will measure the contribution of each parameter in rubles and store these weights in a list:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:4c93a812-db6d-4bc1-8e28-75bc3fa0e1a9" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#008800;"><i>#cpu,&#160;monitor,&#160;ram,&#160;hdd,&#160;video</i></span><br />
koes&#160;<span style="color:#666666;">=</span>&#160;[<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>]
</div>
</div>
<p>I propose computing these coefficients for each configuration and then taking the mean of all of them, which gives us averaged data on the weight of each element of the configuration.</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:3c412005-9ba3-404e-bfd2-93e8a1b8675e" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">analyze_params</span>():<br />
&#160;&#160;&#160;&#160;koeshistory&#160;<span style="color:#666666;">=</span>&#160;[]<br />
&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#our&#160;laptops</i></span><br />
&#160;&#160;&#160;&#160;notes&#160;<span style="color:#666666;">=</span>&#160;normalized_set_of_notebooks()<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#AA22FF;">len</span>(notes)):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koes&#160;<span style="color:#666666;">=</span>&#160;[<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#fit&#160;the&#160;coefficients&#160;for&#160;this&#160;configuration</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;set_koes(notes[i],&#160;koes)<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#save&#160;the&#160;coefficient&#160;history&#160;(flat&#160;list,&#160;five&#160;values&#160;per&#160;laptop)</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koeshistory<span style="color:#666666;">.</span>extend(koes)<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#show&#160;progress</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>if</b></span>&#160;(i&#160;<span style="color:#666666;">%</span>&#160;<span style="color:#666666;">100</span>&#160;<span style="color:#666666;">==</span>&#160;<span style="color:#666666;">0</span>):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;i<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>print</b></span>&#160;koes<br />
&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#average&#160;the&#160;stored&#160;coefficients&#160;(get_avg_koes&#160;is&#160;defined&#160;further&#160;down)</i></span><br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;get_avg_koes(koeshistory)
</div>
</div>
<p>How do we set the coefficients for each element of the configuration? My approach is as follows: </p>
<ul>
<li><em>we randomly increase or decrease the value of one of the coefficients</em> </li>
<li><em>then we check whether multiplying the parameter vector by the coefficient vector (the coefficients are in rubles, remember) has brought us closer to the price of the configuration</em> </li>
<li><em>if it has, we keep the change and repeat the step; if not, we undo the change</em> </li>
<li><em>we repeat this procedure until we are within the chosen accuracy of the price</em> </li>
</ul>
<p>Here is an implementation of this algorithm:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:733fce97-3c51-4e18-8150-06dac8a3d02e" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">set_koes</span>(note,&#160;koes,&#160;error<span style="color:#666666;">=</span><span style="color:#666666;">500</span>):<br />
&#160;&#160;&#160;&#160;price&#160;<span style="color:#666666;">=</span>&#160;get_price(note,&#160;koes)<br />
&#160;&#160;&#160;&#160;lasterror&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">abs</span>(note[<span style="color:#666666;">5</span>]&#160;<span style="color:#666666;">-</span>&#160;price)<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>while</b></span>&#160;(lasterror&#160;<span style="color:#666666;">&#62;</span>&#160;error):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;k&#160;<span style="color:#666666;">=</span>&#160;random<span style="color:#666666;">.</span>randint(<span style="color:#666666;">0</span>,<span style="color:#666666;">4</span>)<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#change&#160;the&#160;coefficient,&#160;remembering&#160;its&#160;old&#160;value</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;old&#160;<span style="color:#666666;">=</span>&#160;koes[k]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;inc&#160;<span style="color:#666666;">=</span>&#160;(random<span style="color:#666666;">.</span>random()<span style="color:#666666;">*</span><span style="color:#666666;">2</span>&#160;<span style="color:#666666;">-</span>&#160;<span style="color:#666666;">1</span>)&#160;<span style="color:#666666;">*</span>&#160;(error<span style="color:#666666;">*</span>(<span style="color:#666666;">1</span>&#160;<span style="color:#666666;">-</span>&#160;error<span style="color:#666666;">/</span>lasterror))<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koes[k]&#160;<span style="color:#666666;">+=</span>&#160;inc<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#do&#160;not&#160;let&#160;the&#160;coefficient&#160;drop&#160;below&#160;zero</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>if</b></span>&#160;(koes[k]&#160;<span style="color:#666666;">&#60;</span>&#160;<span style="color:#666666;">0</span>):&#160;koes[k]&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#666666;">0</span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#compute&#160;the&#160;price&#160;with&#160;the&#160;current&#160;coefficients</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;price&#160;<span style="color:#666666;">=</span>&#160;get_price(note,&#160;koes)<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#compute&#160;the&#160;current&#160;error</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;curerror&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">abs</span>(note[<span style="color:#666666;">5</span>]&#160;<span style="color:#666666;">-</span>&#160;price)<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#check&#160;whether&#160;we&#160;moved&#160;closer&#160;to&#160;the&#160;price&#160;given&#160;in&#160;the&#160;price&#160;list</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>if</b></span>&#160;(lasterror&#160;<span style="color:#666666;">&#60;</span>&#160;curerror):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#no:&#160;restore&#160;the&#160;old&#160;value&#160;(subtracting&#160;inc&#160;is&#160;wrong&#160;after&#160;clamping&#160;to&#160;zero)</i></span><br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koes[k]&#160;<span style="color:#666666;">=</span>&#160;old<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>else</b></span>:<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;lasterror&#160;<span style="color:#666666;">=</span>&#160;curerror
</div>
</div>
<p>inc is the variable responsible for increasing or decreasing a coefficient. It is computed this way because the step should be larger the further the current error is from the target accuracy, so that we approach the desired result both quickly and precisely.</p>
<p>Multiplying the vectors to get the price looks like this:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:f44012d4-b13f-4640-b570-08f5af8bc629" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_price</span>(note,&#160;koes):<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">sum</span>([note[i]<span style="color:#666666;">*</span>koes[i]&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#666666;">5</span>)])
</div>
</div>
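<p>To see the search loop in isolation, here is a self-contained sketch for a single configuration, in Python 3 syntax. It is a simplified variant, not the exact code above: the step is scaled to the current error instead of using the error*(1 - error/lasterror) formula, an iteration cap and a fixed seed are added so that the run always ends and is repeatable, and the feature vector is made up. The names fit_weights and predict are mine.</p>

```python
import random


def predict(features, weights):
    """Dot product of normalized features and ruble weights."""
    return sum(f * w for f, w in zip(features, weights))


def fit_weights(features, target, tolerance=500.0, max_steps=10000, seed=0):
    """Random hill climbing: keep a random change only if the error shrinks."""
    rng = random.Random(seed)
    weights = [0.0] * len(features)
    best_error = abs(target - predict(features, weights))
    for _ in range(max_steps):
        if best_error <= tolerance:
            break
        k = rng.randrange(len(weights))
        old = weights[k]
        # Assumption of this sketch: the step is proportional to the error.
        step = (rng.random() * 2 - 1) * best_error
        weights[k] = max(0.0, old + step)  # weights never go below zero
        error = abs(target - predict(features, weights))
        if error < best_error:
            best_error = error
        else:
            weights[k] = old  # undo a change that did not help
    return weights, best_error


# Made-up normalized configuration; the target is the average price found earlier.
features = [0.9, 0.8, 0.5, 0.5, 0.25]
weights, error = fit_weights(features, target=34574)
print(weights, error)
```

<p>Keep in mind that a single configuration gives one equation with five unknowns, so many different weight vectors reproduce the same price. That is one reason why the averaged result changes from run to run.</p>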
<p>Time to run the analysis:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:92322cc2-458a-4f09-91fe-e334a9915036" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;analyze_params()<br />
cpu,&#160;monitor,&#160;ram,&#160;hdd,&#160;video<br />
<br />
[<span style="color:#009999;">15455.60675667684</span>,&#160;<span style="color:#009999;">20980.560483811361</span>,&#160;<span style="color:#009999;">12782.535270304281</span>,&#160;<span style="color:#009999;">17819.904629585861</span>,&#160;<span style="color:#009999;">14677.889529808042</span>]
</div>
</div>
<p>We obtained this set by averaging the coefficients computed for each of the configurations:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:c4cf2b5d-58dd-432e-83ab-17cfac61f7cf" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_avg_koes</span>(koeshistory):<br />
&#160;&#160;&#160;&#160;<span style="color:#008800;"><i>#koeshistory&#160;is&#160;a&#160;flat&#160;list&#160;with&#160;five&#160;coefficients&#160;per&#160;laptop</i></span><br />
&#160;&#160;&#160;&#160;koes&#160;<span style="color:#666666;">=</span>&#160;[<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>,&#160;<span style="color:#666666;">0</span>]<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;j&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#AA22FF;">len</span>(koeshistory)):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koes[j&#160;<span style="color:#666666;">%</span>&#160;<span style="color:#666666;">5</span>]&#160;<span style="color:#666666;">+=</span>&#160;koeshistory[j]<br />
&#160;&#160;&#160;&#160;count&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#AA22FF;">len</span>(koeshistory)&#160;<span style="color:#666666;">/</span>&#160;<span style="color:#666666;">5</span><br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#666666;">5</span>):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;koes[i]&#160;<span style="color:#666666;">/=</span>&#160;count<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;koes
</div>
</div>
<p>So we have the set of coefficients we wanted. What do these numbers tell us? They let us rank the parameters, from the largest coefficient to the smallest:</p>
<ol>
<li>Monitor diagonal</li>
<li>Hard disk capacity</li>
<li>CPU frequency</li>
<li>Video card memory</li>
<li>RAM size</li>
</ol>
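<p>As a small illustration, the ranking above can be reproduced by sorting the averaged coefficients. This is only a sketch: the helper name <code>rank_params</code> is made up for this example and is not part of the project, while the numbers are the ones printed by <code>analyze_params()</code>.</p>

```python
# Coefficients as printed by analyze_params() above, in the order
# cpu, monitor, ram, hdd, video.
names = ['cpu', 'monitor', 'ram', 'hdd', 'video']
koes = [15455.60675667684, 20980.560483811361, 12782.535270304281,
        17819.904629585861, 14677.889529808042]


def rank_params(names, koes):
    # Hypothetical helper: pair each coefficient with its parameter name
    # and sort by coefficient, largest first.
    pairs = sorted(zip(koes, names), reverse=True)
    return [name for _, name in pairs]


ranking = rank_params(names, koes)
print(ranking)  # ['monitor', 'hdd', 'cpu', 'video', 'ram']
```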
<p>Note that this is far from an ideal approach, and you may get different results. Still, my assumption that CPU frequency and display diagonal are the most important parameters of a configuration was partially confirmed: the monitor comes first, while the CPU is only third.</p>
<p><strong>Predicting the price of a given configuration</strong></p>
<p>With such a rich data set, it would be great to be able to predict the price of a given configuration. That is what we will do next.</p>
<p>First, let's convert our collection of notebooks into a list:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:2c7be1e5-7b86-447f-b508-cec9d0462e5e" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">get_notebooks_list</span>():<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">map</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;n:&#160;[n<span style="color:#666666;">.</span>cpu,&#160;n<span style="color:#666666;">.</span>monitor,&#160;n<span style="color:#666666;">.</span>ram,&#160;n<span style="color:#666666;">.</span>hdd,&#160;n<span style="color:#666666;">.</span>video,&#160;n<span style="color:#666666;">.</span>price],&#160;get_notebooks())
</div>
</div>
<p>Next we need a function that measures the distance between two vectors. Euclidean distance looks like a good choice:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:05774a88-aa3d-4e3f-aad5-826b75aef396" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">euclidean</span>(v1,&#160;v2):<br />
&#160;&#160;&#160;&#160;d&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#666666;">0.0</span><br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#AA22FF;">len</span>(v1)):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;d<span style="color:#666666;">+=</span>(v1[i]&#160;<span style="color:#666666;">-</span>&#160;v2[i])<span style="color:#666666;">**</span><span style="color:#666666;">2</span>;<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;math<span style="color:#666666;">.</span>sqrt(d)
</div>
</div>
<p>The square root of the sum of squared differences shows quite clearly how much one vector differs from another. How is this function useful to us? It is simple: given a vector with the parameters we are interested in, we run through the whole collection and find its nearest neighbour, whose price we already know. Here is how we do it:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:0d87fb01-bf20-4e9c-866a-bd1007fdadbd" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">getdistances</span>(data,&#160;vec1):<br />
&#160;&#160;&#160;&#160;distancelist<span style="color:#666666;">=</span>[]<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(<span style="color:#AA22FF;">len</span>(data)):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;vec2&#160;<span style="color:#666666;">=</span>&#160;data[i]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;distancelist<span style="color:#666666;">.</span>append((euclidean(vec1,vec2),i))<br />
&#160;&#160;&#160;&#160;distancelist<span style="color:#666666;">.</span>sort()<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;distancelist
</div>
</div>
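<p>To make the nearest-neighbour idea concrete, here is a minimal self-contained sketch. It restates <code>euclidean</code> and <code>getdistances</code> in compact form; the helper <code>nearest_price</code> and the three sample rows are invented purely for illustration and are not taken from the real data set.</p>

```python
import math


def euclidean(v1, v2):
    # Compare only the first len(v1) components, so the price column
    # at the end of a data row is ignored.
    return math.sqrt(sum((v1[i] - v2[i]) ** 2 for i in range(len(v1))))


def getdistances(data, vec1):
    # (distance, row index) pairs, closest row first.
    return sorted((euclidean(vec1, row), i) for i, row in enumerate(data))


def nearest_price(data, vec1):
    # Hypothetical helper: price (column 5) of the single closest notebook.
    _, idx = getdistances(data, vec1)[0]
    return data[idx][5]


# Made-up rows: cpu, monitor, ram, hdd, video, price
sample = [
    [2.0, 15, 2048, 160, 256, 27000.0],
    [2.4, 17, 3072, 250, 512, 32000.0],
    [1.6, 13, 1024, 120, 128, 19000.0],
]

print(nearest_price(sample, [2.0, 15, 2048, 160, 128]))  # 27000.0
```

<p>With these rows the query differs from the first notebook only in video memory, so that notebook is the closest one and its price is returned.</p>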
<p>Next, we can make the task a little more involved and improve the accuracy of the result. To do that we introduce a function based on classification by the <a href="http://www.machinelearning.ru/wiki/index.php?title=%D0%9C%D0%B5%D1%82%D0%BE%D0%B4_k_%D0%B2%D0%B7%D0%B2%D0%B5%D1%88%D0%B5%D0%BD%D0%BD%D1%8B%D1%85_%D0%B1%D0%BB%D0%B8%D0%B6%D0%B0%D0%B9%D1%88%D0%B8%D1%85_%D1%81%D0%BE%D1%81%D0%B5%D0%B4%D0%B5%D0%B9_(%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80)">method of k weighted nearest neighbours</a>:</p>
<blockquote><p><img alt="K" src="http://www.machinelearning.ru/mimetex/?K" /> weighted nearest neighbours is a <a href="http://www.machinelearning.ru/wiki/index.php?title=%D0%9C%D0%B5%D1%82%D1%80%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B9_%D0%BA%D0%BB%D0%B0%D1%81%D1%81%D0%B8%D1%84%D0%B8%D0%BA%D0%B0%D1%82%D0%BE%D1%80">metric classification algorithm</a> based on estimating the similarity of objects. The object being classified is assigned to the class of the objects in the <a href="http://www.machinelearning.ru/wiki/index.php?title=%D0%92%D1%8B%D0%B1%D0%BE%D1%80%D0%BA%D0%B0">training sample</a> that are closest to it.</p>
</blockquote>
<p>We then take the average price over several nearest neighbours, which dampens the influence of a particular vendor's pricing or of an unusual configuration:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:e2bf4f7a-9c7b-4693-887d-4c168f125887" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">knnestimate</span>(data,vec1,k<span style="color:#666666;">=</span><span style="color:#666666;">3</span>):<br />
&#160;&#160;&#160;&#160;dlist&#160;<span style="color:#666666;">=</span>&#160;getdistances(data,&#160;vec1)<br />
&#160;&#160;&#160;&#160;avg&#160;<span style="color:#666666;">=</span>&#160;<span style="color:#666666;">0.0</span><br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>for</b></span>&#160;i&#160;<span style="color:#AA22FF;"><b>in</b></span>&#160;<span style="color:#AA22FF;">range</span>(k):<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;idx&#160;<span style="color:#666666;">=</span>&#160;dlist[i][<span style="color:#666666;">1</span>]<br />
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;avg&#160;<span style="color:#666666;">+=</span>data[idx][<span style="color:#666666;">5</span>]<br />
&#160;&#160;&#160;&#160;avg&#160;<span style="color:#666666;">/=</span>&#160;k<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;avg
</div>
</div>
<p>*The last three algorithms are taken from the book <a href="http://www.books.ru/shop/books/586615?partner=butaji">“Programming Collective Intelligence” by Toby Segaran</a></p>
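<p>Note that <code>knnestimate</code> takes a plain average, so every one of the k neighbours counts equally. A weighted variant could look roughly like the sketch below. It uses inverse-distance weighting, which is one common choice and an assumption on my part; the names <code>inverse_weight</code> and <code>weighted_knn</code> and the sample rows are illustrative only and do not come from the project.</p>

```python
import math


def euclidean(v1, v2):
    # Distance over the first len(v1) components (the price column is skipped).
    return math.sqrt(sum((v1[i] - v2[i]) ** 2 for i in range(len(v1))))


def inverse_weight(dist, const=1.0):
    # Closer neighbours get larger weights; const avoids division by zero
    # when the query coincides with a known configuration.
    return 1.0 / (dist + const)


def weighted_knn(data, vec1, k=3, weightf=inverse_weight):
    # Sort rows by distance to the query and keep the k closest ones.
    dlist = sorted((euclidean(vec1, row), i) for i, row in enumerate(data))
    total = 0.0
    total_weight = 0.0
    for dist, idx in dlist[:k]:
        w = weightf(dist)
        total += w * data[idx][5]   # column 5 holds the price
        total_weight += w
    # Weighted average of the neighbours' prices.
    return total / total_weight


# Made-up rows: cpu, monitor, ram, hdd, video, price
sample = [
    [2.0, 15, 2048, 160, 256, 27000.0],
    [2.4, 17, 3072, 250, 512, 32000.0],
    [1.6, 13, 1024, 120, 128, 19000.0],
]

estimate = weighted_knn(sample, [2.0, 15, 2048, 160, 256])
print(estimate)
```

<p>Because the query here matches the first row exactly, its weight dominates and the estimate stays close to that row's price, whereas a plain average of the three prices would give 26000.</p>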
<p>And here is what we get:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:6206cf21-acde-4b1f-8adb-419e7c7b106f" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
&#62;&#62;&#160;knnestimate(get_notebooks_list(),&#160;[<span style="color:#009999;">2.4</span>,&#160;<span style="color:#009999;">17</span>,&#160;<span style="color:#009999;">3062</span>,&#160;<span style="color:#009999;">250</span>,&#160;<span style="color:#009999;">512</span>])<br />
<span style="color:#009999;">31521.0</span><br />
&#62;&#62;&#160;knnestimate(get_notebooks_list(),&#160;[<span style="color:#009999;">2.0</span>,&#160;<span style="color:#009999;">15</span>,&#160;<span style="color:#009999;">2048</span>,&#160;<span style="color:#009999;">160</span>,&#160;<span style="color:#009999;">256</span>])<br />
<span style="color:#009999;">27259.0</span><br />
&#62;&#62;&#160;knnestimate(get_notebooks_list(),&#160;[<span style="color:#009999;">2.0</span>,&#160;<span style="color:#009999;">15</span>,&#160;<span style="color:#009999;">2048</span>,&#160;<span style="color:#009999;">160</span>,&#160;<span style="color:#009999;">128</span>])<br />
<span style="color:#009999;">20848.0</span>
</div>
</div>
<p>The prices are in line with the market, and that is good enough for now. Note, however, that this implementation completely ignores the relative importance of parameters such as CPU frequency and monitor diagonal. To account for it, we would have to add the weights computed in the previous section to the vector comparison function.</p>
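<p>One possible way to bring those weights into the distance function is sketched below. It is only an illustration under two assumptions: the input vectors are already normalized to a comparable range (otherwise RAM, measured in thousands, would dominate any distance), and the coefficients are rescaled to sum to 1. The names <code>weighted_euclidean</code> and <code>normalize_weights</code> are hypothetical and not part of the project.</p>

```python
import math


def normalize_weights(koes):
    # Scale the coefficients so that they sum to 1.
    total = float(sum(koes))
    return [k / total for k in koes]


def weighted_euclidean(v1, v2, weights):
    # Each squared difference is multiplied by the weight of its parameter.
    # zip stops at the shortest sequence, so a trailing price column is ignored.
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, v1, v2)))


# Averaged coefficients from the previous section: cpu, monitor, ram, hdd, video
koes = [15455.60675667684, 20980.560483811361, 12782.535270304281,
        17819.904629585861, 14677.889529808042]
weights = normalize_weights(koes)

# Made-up normalized configurations that differ in a single parameter.
base = [0.5, 0.5, 0.5, 0.5, 0.5]
other_monitor = [0.5, 0.6, 0.5, 0.5, 0.5]
other_ram = [0.5, 0.5, 0.6, 0.5, 0.5]

d_monitor = weighted_euclidean(base, other_monitor, weights)
d_ram = weighted_euclidean(base, other_ram, weights)
print(d_monitor, d_ram)
```

<p>With these weights, the same change in the monitor column moves a configuration further away than an equal change in the RAM column, which is the effect the ranking suggests.</p>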
<p><strong>Plotting the distribution of configurations and prices</strong></p>
<p>We would like to see the whole picture, i.e. to plot how configurations and prices are distributed across the market. OK, let's do it.</p>
<p>First you need to install the <a href="http://matplotlib.sourceforge.net/users/installing.html">matplotlib</a> library. Then import it into our project:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:2e6a5184-4954-479d-8575-7f9403a572c7" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>from</b></span>&#160;<span style="color:#0000FF;"><b>pylab</b></span>&#160;<span style="color:#AA22FF;"><b>import</b></span>&#160;<span style="color:#666666;">*</span>
</div>
</div>
<p>We also need to build two data sets, one for the x axis and one for the y axis:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:c348a4cd-b058-4fea-8d5f-b1e4f5e3651d" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">power_of_notebooks_config</span>():<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">map</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;x:&#160;x[<span style="color:#666666;">0</span>]<span style="color:#666666;">*</span>x[<span style="color:#666666;">1</span>]<span style="color:#666666;">*</span>x[<span style="color:#666666;">2</span>]<span style="color:#666666;">*</span>x[<span style="color:#666666;">3</span>]<span style="color:#666666;">*</span>x[<span style="color:#666666;">4</span>],&#160;normalized_set_of_notebooks())<br />
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">config_prices</span>():<br />
&#160;&#160;&#160;&#160;<span style="color:#AA22FF;"><b>return</b></span>&#160;<span style="color:#AA22FF;">map</span>(<span style="color:#AA22FF;"><b>lambda</b></span>&#160;x:&#160;x[<span style="color:#666666;">5</span>],&#160;normalized_set_of_notebooks())
</div>
</div>
<p>And a function that draws the distribution plot:</p>
<div style="display:inline;float:none;margin:0;padding:0;" id="scid:2EC9848E-067D-4e79-BAB7-06CA927DB962:a94c35e2-30aa-4d52-a3e6-3afdf59f8423" class="wlWriterEditableSmartContent">
<div style="font-family:consolas,lucida console,courier,monospace;">
<span style="color:#AA22FF;"><b>def</b></span>&#160;<span style="color:#00A000;">draw_market</span>():<br />
&#160;&#160;&#160;&#160;plot(config_prices(),power_of_notebooks_config(),<span style="color:#BB4444;">'bo'</span>,&#160;linewidth<span style="color:#666666;">=</span><span style="color:#666666;">1.0</span>)<br />
&#160;&#160;&#160;&#160;xlabel(<span style="color:#BB4444;">'price&#160;(Rub)'</span>)<br />
&#160;&#160;&#160;&#160;ylabel(<span style="color:#BB4444;">'config_power'</span>)<br />
&#160;&#160;&#160;&#160;title(<span style="color:#BB4444;">'Russian&#160;Notebooks&#160;Market'</span>)<br />
&#160;&#160;&#160;&#160;grid(<span style="color:#AA22FF;">True</span>)<br />
&#160;&#160;&#160;&#160;show()
</div>
</div>
<p>And here is the result:</p>
<p><a href="http://butaji.files.wordpress.com/2009/08/notes.png"><img style="display:inline;border-width:0;" title="notes" border="0" alt="notes" src="http://butaji.files.wordpress.com/2009/08/notes_thumb.png?w=495&#038;h=378" width="495" height="378" /></a> </p>
</p>
<h2>Conclusion</h2>
<p>So, we have managed to carry out a small analysis of the Russian notebook market, and also to play around with Python a little.</p>
<p>The source code of the project is available at:</p>
<p><a href="http://code.google.com/p/runm/source/checkout">http://code.google.com/p/runm/source/checkout</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Data for Research (DfR) Quick Use Case Video]]></title>
<link>http://michaelgallagher.wordpress.com/2009/08/27/data-for-research-dfr-quick-use-case-video/</link>
<pubDate>Thu, 27 Aug 2009 11:29:15 +0000</pubDate>
<dc:creator>Michael Gallagher</dc:creator>
<guid>http://michaelgallagher.wordpress.com/2009/08/27/data-for-research-dfr-quick-use-case-video/</guid>
<description><![CDATA[This is a quick video outlining some ways in which the new Data for Research dataset service from JS]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>This is a quick video outlining some ways in which the new <a href="http://dfr.jstor.org">Data for Research</a> dataset service from <a href="http://www.jstor.org">JSTOR</a> could prove beneficial to researchers.</p>
<span id='plh-loop-video-embed-0' class='hidden'>done</span><script type="text/javascript" src="http://v.wordpress.com/wp-content/plugins/video/swfobject2.js"></script><ins style='text-decoration:none;'>
<div class='video-player' id='x-video-0'>
<p id='video-0'></p></div></ins><script type='text/javascript'>swfobject.embedSWF('http://v.wordpress.com/wp-content/plugins/video/flvplayer.swf?ver=1.10', 'video-0', '400', '300', '9.0.115','http://v.wordpress.com/wp-content/plugins/video/expressInstall2.swf', {guid:'1v004ZN3', javascriptid:'video-0', width:'400', height:'300', locksize:'no'}, {allowfullscreen: 'true', allowscriptaccess:'always', seamlesstabbing:'true', overstretch:'true'}, {'id':'video-0'});</script>

<p>I double dog dare you to search for some terms and look up their academic currency. When did pan-Africanism hit the scene? When did sushi enter the academic vernacular? Want to know all the key terms from any of the 50 disciplines in the corpus? It is interesting stuff, especially considering the size and span of the database (14 billion words, 350 years, 5 million+ articles).</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[E-Commerce &amp; Live Chat Customer Service]]></title>
<link>http://edouardbreine.com/2009/08/11/e-commerce-live-chat-customer-service/</link>
<pubDate>Tue, 11 Aug 2009 09:57:32 +0000</pubDate>
<dc:creator>Edouard</dc:creator>
<guid>http://edouardbreine.com/2009/08/11/e-commerce-live-chat-customer-service/</guid>
<description><![CDATA[Facts: For many companies the E-Commerce business activity is still growing by leaps and bounds Trad]]></description>
<content:encoded><![CDATA[Facts: For many companies the E-Commerce business activity is still growing by leaps and bounds Trad]]></content:encoded>
</item>
<item>
<title><![CDATA[Links (2)]]></title>
<link>http://plagueofmemory.wordpress.com/2009/08/09/links-2-2/</link>
<pubDate>Sun, 09 Aug 2009 18:49:39 +0000</pubDate>
<dc:creator>plagueofmemory</dc:creator>
<guid>http://plagueofmemory.wordpress.com/2009/08/09/links-2-2/</guid>
<description><![CDATA[What&#8217;s Wrong With Fusion Centers &#8211; Executive Summary Info Sheet American Civil Liberties]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a target="new" href="http://www.aclu.org/privacy/gen/32966pub20071205.html">What&#8217;s Wrong With Fusion Centers &#8211; Executive Summary</a><br />
<i>Info Sheet<br />
American Civil Liberties Union<br />
December 5, 2007 [Or May 12th - it's unclear.]<br />
Jay Stanley and Barry Steinhardt</i></p>
<p>&#8220;These new fusion centers, over 40 of which have been established around the country, raise very serious privacy issues at a time when new technology, government powers and zeal in the &#8220;war on terrorism&#8221; are combining to threaten Americans&#8217; privacy at an unprecedented level.&#8221;</p>
<p><a target="new" href="http://www.washingtonpost.com/wp-dyn/content/article/2008/04/01/AR2008040103049_pf.html">Centers Tap Into Personal Databases</a><br />
<i>Article<br />
Washington Post<br />
Wednesday, April 2, 2008<br />
Robert O&#8217;Harrow Jr.</i></p>
<p>&#8220;Intelligence centers run by states across the country have access to personal information about millions of Americans, including unlisted cellphone numbers, insurance claims, driver&#8217;s license photographs and credit reports, according to a document obtained by The Washington Post.&#8221;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Data, the currency of academic jargon, and a corpus of 14 billion words]]></title>
<link>http://michaelgallagher.wordpress.com/2009/07/30/data-the-currency-of-academic-jargon-and-a-corpus-of-14-billion-words/</link>
<pubDate>Thu, 30 Jul 2009 19:54:36 +0000</pubDate>
<dc:creator>Michael Gallagher</dc:creator>
<guid>http://michaelgallagher.wordpress.com/2009/07/30/data-the-currency-of-academic-jargon-and-a-corpus-of-14-billion-words/</guid>
<description><![CDATA[So let me start with the following caveat: I work for the Stor of J. There, I said it and I am proud]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>So let me start with the following caveat:</p>
<p>I work for the <a href="http://www.jstor.org/?cookieSet=1">Stor of J</a>. There, I said it and I am proud of it.</p>
<p>That being said, this Data for Research service mentioned below is pretty amazing, especially if you love</p>
<ul>
<li>academic literature</li>
<li>nerdy librarian things</li>
<li>data/statistics/factoids to impress people at parties</li>
<li>key terms/word counts/academic jargon/currency</li>
</ul>
<p>JSTOR is offering a beta service called <a href="http://dfr.jstor.org/">“Data for Research”</a>. The original intention of the <a href="http://dfr.jstor.org/">Data for Research</a> tool was to make it easier to fulfill requests for data sets and support data mining needs. However, the DfR beta also makes it possible to search and browse across all JSTOR collections, using a type of faceted search interface. The journal content on this beta site is updated 1-2 weeks after each content release on the main site.</p>
<p>With DfR, researchers can</p>
<ul>
<li> conduct full-text and fielded searching of the entire JSTOR archive using a powerful faceted search interface. Using this interface one can quickly and easily define content of interest through an iterative process of searching and results filtering.</li>
<li> view document-level data including word frequencies, citations, and key terms.</li>
<li> request and download datasets associated with the content selected.</li>
</ul>
<p>From DfR, you can also request and download datasets associated with selected content or automate this process with our API. Curious to know when particular terms fell in and out of favor in academic circles? DfR lets you track that information across the more than 14 billion words, 4.8 million+ articles and 350 years' worth of academic research found in JSTOR.</p>
<p>Personally, I love the fact that the term perestroika peaked in academic literature, perhaps not surprisingly, in the early 1990s, while tuberculosis seemed to gain some use as an academic term at the turn of the last century. Librarians should take note as well, since DfR will automatically pull the key terms for each discipline over the entire corpus. If you are struggling with synonymous search terms for an intricate advanced search statement, this is the place to go. Click on any of the 50 disciplines and see all the key terms associated with it.</p>
<p>A special feedback form has been established for this project (<a href="http://dfr.jstor.org/requests/contact/">http://dfr.jstor.org/requests/contact/</a>) and is linked to all of the pages of the DfR site. If you can think of any ways you want to mine the JSTOR data not supported in this beta, let JSTOR know and they will try to incorporate it into the next instance of DfR.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Blogosphere Visualization]]></title>
<link>http://rafazwonull.wordpress.com/2009/07/29/blogospharen-visualisierung/</link>
<pubDate>Wed, 29 Jul 2009 19:31:30 +0000</pubDate>
<dc:creator>Rafa</dc:creator>
<guid>http://rafazwonull.wordpress.com/2009/07/29/blogospharen-visualisierung/</guid>
<description><![CDATA[Impressive. Matthew Hurst has published several blogosphere visualizations on his blog.]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Impressive. Matthew Hurst has published several blogosphere visualizations on his <a href="http://datamining.typepad.com/" target="_blank">blog</a>. My personal favourite is the following:</p>
<p style="text-align:left;">
<div class="wp-caption aligncenter" style="width: 382px"><a href="http://datamining.typepad.com/gallery/blogosphere-sketch.png"><img title="blogosphere_sketch" src="http://datamining.typepad.com/gallery/blogosphere-sketch.png" alt="via http://datamining.typepad.com/" width="372" height="361" /></a><p class="wp-caption-text">via http://datamining.typepad.com/</p></div>
<p style="text-align:left;">Matthew writes about it:</p>
<blockquote>
<p style="text-align:left;">The dark edges show the reciprocal links (where A has cited B and B has cited A), the lighter edges indicate a-reciprocal links. The larger, denser area of the graph is that part of the blogosphere generally characterised by socio-political discussion (the periphery contains some topical groupings). Above and to the left is that area of the blogosphere concerned with technical discussion and gadgetry.</p>
</blockquote>
<p style="text-align:left;">More visualizations can be found in his article <a href="http://datamining.typepad.com/data_mining/2009/07/science-july-24th-2009.html">Science July 24th 2009.</a></p>
<p style="text-align:left;">-r-</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Data-mining EHRs ]]></title>
<link>http://gershater.wordpress.com/2009/09/23/data-mining-ehrs/</link>
<pubDate>Wed, 23 Sep 2009 23:42:27 +0000</pubDate>
<dc:creator>jgershater</dc:creator>
<guid>http://gershater.wordpress.com/2009/09/23/data-mining-ehrs/</guid>
<description><![CDATA[Clinical trials means a pharmaceutical company tests their new product on &#8216;human guinea pigs]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In a clinical trial, a pharmaceutical company tests its new product on &#8216;human guinea pigs&#8217; in a more or less closed and controlled environment.</p>
<p>EHRs would allow much more data to be accumulated in a more realistic setting: thousands or tens of thousands of patients of various ages, races, genders etc. using a medication in the real world.</p>
<p>Any companies out there doing data-mining of EHRs or EMRs? At the HIE level perhaps?</p>
<p><a href="http://govhealthit.com/newsitem.aspx?nid=72125">Here is a story</a> from the Department of Health and Human Services proposing to use EHRs (scrubbed of private patient data) to improve the health of minorities.</p>
</div>]]></content:encoded>
</item>

</channel>
</rss>
