<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>regexp &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/regexp/</link>
	<description>Feed of posts on WordPress.com tagged "regexp"</description>
	<pubDate>Sat, 28 Nov 2009 10:50:06 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Downloading images in batches. The smart solution.]]></title>
<link>http://radjik.wordpress.com/2009/11/13/%d0%ba%d0%b0%d1%87%d0%b0%d0%b5%d0%bc-%d0%ba%d0%b0%d1%80%d1%82%d0%b8%d0%bd%d0%ba%d0%b8-%d0%bf%d0%b0%d1%87%d0%ba%d0%b0%d0%bc%d0%b8-%d1%80%d0%b5%d1%88%d0%b5%d0%bd%d0%b8%d0%b5-%d1%83%d0%bc%d0%bd%d0%be/</link>
<pubDate>Fri, 13 Nov 2009 16:03:00 +0000</pubDate>
<dc:creator>Раджа</dc:creator>
<guid>http://radjik.wordpress.com/2009/11/13/%d0%ba%d0%b0%d1%87%d0%b0%d0%b5%d0%bc-%d0%ba%d0%b0%d1%80%d1%82%d0%b8%d0%bd%d0%ba%d0%b8-%d0%bf%d0%b0%d1%87%d0%ba%d0%b0%d0%bc%d0%b8-%d1%80%d0%b5%d1%88%d0%b5%d0%bd%d0%b8%d0%b5-%d1%83%d0%bc%d0%bd%d0%be/</guid>
<description><![CDATA[The previous post described a brute-force solution to the image downloading problem. Careful study]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://radjik.blogspot.com/2009/09/danbooru.html">The previous post</a> described a brute-force solution to the image downloading problem. A careful reading of the <a href="http://danbooru.donmai.us/help/api">Danbooru API</a> together with <a href="http://juick.com/DarthWantuz">DartWantuz</a> turned up a more elegant solution.<br /><a name='more'></a>My quick-and-dirty version, using existing cookies:<br />
<blockquote>
<div class="highlight">
<pre><span class="c">#! /bin/bash</span>

<span class="nv">uag</span><span class="o">=</span><span class="s2">&#34;Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1 (.NET CLR 3.5.30729)&#34;</span>
<span class="nv">postcount</span><span class="o">=</span><span class="sb">`</span>curl -b <span class="nv">$2</span> <span class="s2">&#34;http://danbooru.donmai.us/post/index.xml?tags=$1&#38;limit=1&#34;</span>&#124;pcregrep -o <span class="s1">&#39;posts\ count=\&#34;[^&#34;]+&#39;</span>&#124;sed -e <span class="s1">&#39;s/posts\ count=//&#39;</span> -e <span class="s1">&#39;s/\&#34;//&#39;</span><span class="sb">`</span>

rm -f get2.danbooru.txt

<span class="nb">let</span> <span class="s2">&#34;pcount=postcount/1000+1&#34;</span>

<span class="k">for</span> <span class="o">((</span><span class="nv">i</span><span class="o">=</span>1; i&#60;<span class="o">=</span><span class="nv">$pcount</span>; i++<span class="o">))</span>
<span class="k">do</span>
<span class="k">  </span>wget <span class="s2">&#34;http://danbooru.donmai.us/post/index.xml?tags=$1&#38;limit=1000&#38;page=$i&#34;</span> --load-cookies<span class="o">=</span><span class="s2">&#34;$2&#34;</span> -U <span class="s2">&#34;$uag&#34;</span> -O - &#124;pcregrep -o -e <span class="s1">&#39;file_url=[^ ]+&#39;</span>&#124;sed -e <span class="s1">&#39;s/file_url=//g&#39;</span> -e <span class="s1">&#39;s/\&#34;//g&#39;</span>  &#62;&#62;get2.danbooru.txt
<span class="k">done</span>;

wget -nc -i get2.danbooru.txt</pre>
</div>
</blockquote>
<p>Usage: <span style="font-family:&#34;">%scriptname% tag cookie_file</span></p>
<p>His version, with login:<br />
<blockquote>
<div class="highlight">
<pre><span class="c">#!/bin/bash</span>
<span class="nv">TAGS</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$*</span> &#124; sed -e <span class="s1">&#39;s/ /%20/g&#39;</span><span class="k">)</span> <span class="c"># those spaces again...</span>
<span class="nv">LOGIN</span><span class="o">=</span><span class="s1">&#39;login&#39;</span>
<span class="nv">PASSWORD</span><span class="o">=</span><span class="s1">&#39;password&#39;</span> <span class="c"># settings, damn it</span>
<span class="nv">AUTH</span><span class="o">=</span><span class="sb">`</span>curl -s -c danbooru.txt -F<span class="s2">&#34;commit=login&#34;</span> -F<span class="s2">&#34;url=&#34;</span> -F<span class="s2">&#34;user[name]=${LOGIN}&#34;</span> -F<span class="s2">&#34;user[password]=${PASSWORD}&#34;</span> <span class="se">\</span>
         http://danbooru.donmai.us/user/authenticate<span class="sb">`</span> <span class="c"># fetch the cookies, delivered straight to danbooru.txt</span>
<span class="nv">XML</span><span class="o">=</span><span class="sb">`</span>curl -b danbooru.txt -s <span class="s2">&#34;http://danbooru.donmai.us/post/index.xml?limit=1000&#38;tags=${TAGS}&#38;page=1&#34;</span><span class="sb">`</span>
<span class="nv">TOTAL</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$XML</span> &#124; egrep -o <span class="s2">&#34;[0-9]+&#34;</span> &#124; head -n 4 &#124; tail -n 1<span class="k">)</span> <span class="c"># get the post count, from which the number of pages follows</span>
<span class="c"># the counting is less convoluted here than for Gelbooru; round up so a partial last page is not skipped</span>
<span class="o">[</span> <span class="nv">$TOTAL</span> -le 1000 <span class="o">]</span> <span class="o">&#38;&#38;</span> <span class="nv">TOTALPAGES</span><span class="o">=</span>1 <span class="o">&#124;&#124;</span> <span class="nv">TOTALPAGES</span><span class="o">=</span><span class="k">$((</span> <span class="o">(</span><span class="nv">$TOTAL</span> <span class="o">+</span> <span class="m">999</span><span class="o">)</span> <span class="o">/</span> <span class="m">1000</span> <span class="k">))</span>

<span class="nb">echo</span> <span class="s2">&#34;TOTAL ${TOTAL} images&#34;</span>
<span class="k">while</span> <span class="o">[</span> <span class="nv">$TOTALPAGES</span> !<span class="o">=</span> 0 <span class="o">]</span>; <span class="k">do</span>
curl -b danbooru.txt -s <span class="s2">&#34;http://danbooru.donmai.us/post/index.xml?limit=1000&#38;tags=${TAGS}&#38;page=${TOTALPAGES}&#34;</span> <span class="se">\</span>
        &#124; egrep -io <span class="s2">&#34;http:\/\/danbooru\.donmai\.us\/data\/[a-z0-9]+\.(jpg&#124;png&#124;gif&#124;jpeg)&#34;</span> &#124; uniq <span class="se">\</span>
        &#124; xargs -d <span class="s1">&#39;\n&#39;</span> wget -P files <span class="c"># grab it by the horns and download, download</span>
<span class="nb">let</span> <span class="s1">&#39;TOTALPAGES--&#39;</span>
<span class="k">done</span>
rm danbooru.txt <span class="c"># delete the cookie file so as to leave no traces</span></pre>
</div>
</blockquote>
<p>Usage: <span style="font-family:&#34;">%scriptname% list of tags</span></p>
<p>If the &#8220;enthusiasm&#8221; does not fade, there will be a universal script for several similar galleries.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Picture Maximums (PicMax)]]></title>
<link>http://neilobremski.wordpress.com/2009/11/04/picture-maximums-picmax/</link>
<pubDate>Wed, 04 Nov 2009 05:11:50 +0000</pubDate>
<dc:creator>Neil Obremski</dc:creator>
<guid>http://neilobremski.wordpress.com/2009/11/04/picture-maximums-picmax/</guid>
<description><![CDATA[Having a bit of inspiration after a productive day at the office, I decided to implement this small ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Having a bit of inspiration after a productive day at the office, I decided to implement this small feature that&#8217;s been on my mind for weeks: restricting the maximum picture dimensions and/or size during the preload process.  There are two big reasons: 1.) FTP push and 2.) because I have a lot of super-high-quality image sources.  The <a href="http://www.graceparkfansite.com/">Grace Park</a> site is the first to take advantage of this, getting a maximum width of 400 (about 25% of the width of some of the sources!).</p>
<p>PicMax is a variable specified in the site text at the site level, e.g. at the very top.  I wanted it to follow this syntax:</p>
<blockquote><p>picmax=<em>WIDTH</em>x<em>HEIGHT</em>,<em>SIZE</em></p></blockquote>
<p>I also wanted each of those components to be optional.  You can specify all, one, some, or none and achieve the effect you&#8217;re looking for.  The aforementioned fansite simply specifies &#8220;400&#8221;, which means &#8220;400 maximum width&#8221;.  However, what if you just want height?  Easy: &#8220;x400&#8221; (400 maximum height).  You see, the &#8216;x&#8217; and &#8216;,&#8217; are delimiters that need to be present in order to specify what comes after them.  So to cap all pictures at ten thousand bytes, you&#8217;d say &#8220;,10000&#8221;.  Maximum width and size but no height: &#8220;400,10000&#8221;.  How do I achieve such magic?</p>
<blockquote><p>preg_match('/^([0-9]+)?x?([0-9]+)?,?([0-9]+)?$/i', $picmax, $m)</p></blockquote>
<p><em>Damn</em> I love regular expressions!  Now, to retain your sanity you can specify a zero in whatever spot you don&#8217;t want filled in and the code will operate the same.  When I first approached the parsing I wondered how I&#8217;d structure the pattern, but it turns out just making everything optional solved it.  Crazy!</p>
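<p>As a rough illustration, here is the same optional-everything pattern in a minimal Python sketch. The function name <code>parse_picmax</code> and the tuple it returns are assumptions made for this example, not part of the original PHP code; a zero and a missing component are both treated as "no limit", as described above.</p>

```python
import re

# Same pattern as in the post; every component is optional.
PICMAX_RE = re.compile(r'^([0-9]+)?x?([0-9]+)?,?([0-9]+)?$', re.IGNORECASE)

def parse_picmax(spec):
    """Return (max_width, max_height, max_bytes); None means "no limit"."""
    m = PICMAX_RE.match(spec.strip())
    if m is None:
        raise ValueError("bad picmax value: %r" % spec)
    # A missing group or an explicit 0 both mean "no limit".
    return tuple(int(g) if g and int(g) != 0 else None for g in m.groups())
```

<p>With this sketch, <code>parse_picmax("400,10000")</code> yields a width cap of 400, no height cap, and a size cap of 10000 bytes.</p>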
<p>When pictures are resized for scale, their maximum file size is capped at their previous file size.  This prevents the unintentional effect of shrinking one and having it grow on disk (the opposite of the desired effect).  Usually resizing dimensions is enough to squish the byte size, but it&#8217;s difficult to know what to set the JPG quality value to, and re-saving any lossy format is an easy way towards bloat.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Java String parse]]></title>
<link>http://sidney3172blog.wordpress.com/2009/10/26/java-string-parse/</link>
<pubDate>Mon, 26 Oct 2009 16:20:29 +0000</pubDate>
<dc:creator>Sergey Gibert</dc:creator>
<guid>http://sidney3172blog.wordpress.com/2009/10/26/java-string-parse/</guid>
<description><![CDATA[To parse large strings I have always used StringTokenizer, but today for the first time I]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>To parse large strings I have always used <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html">StringTokenizer</a>, but today I ran into the <a href="http://java.sun.com/javase/6/docs/api/java/util/Scanner.html">Scanner</a> class for the first time (I admit this is my own backwardness, since the class has been around since J2SE 5.0); still, better late than never. The difference from StringTokenizer is that here the delimiters are specified as regular expressions. This is very convenient, and besides strings it can also parse streams, files and much more. The javadoc linked above has examples that illustrate the capabilities quite vividly; here is a short one:<br />
<code><br />
String input = "&#60;html&#62;&#60;table border='1' cellspacing='0' cellpadding='0' width='300'&#62;";<br />
Scanner s = new Scanner(input);<br />
s.findInLine("width='(\\d+)'&#62;");<br />
MatchResult result = s.match();<br />
for (int i=1; i&#60;=result.groupCount(); i++)<br />
System.out.println(result.group(i));<br />
s.close();<br />
</code></p>
<p>The output will be:<br />
<code><br />
300<br />
</code></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[ROR - Reg Exp learning and testing]]></title>
<link>http://rorviswa.wordpress.com/2009/09/25/ruby-regular-expression/</link>
<pubDate>Fri, 25 Sep 2009 04:31:30 +0000</pubDate>
<dc:creator>Viswanathan</dc:creator>
<guid>http://rorviswa.wordpress.com/2009/09/25/ruby-regular-expression/</guid>
<description><![CDATA[http://www.rubular.com]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://www.rubular.com/">http://www.rubular.com</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Using Codeigniter Pagination Class]]></title>
<link>http://unravelthecode.wordpress.com/2009/09/14/using-codeigniter-pagination-class/</link>
<pubDate>Mon, 14 Sep 2009 21:38:46 +0000</pubDate>
<dc:creator>drewtown12</dc:creator>
<guid>http://unravelthecode.wordpress.com/2009/09/14/using-codeigniter-pagination-class/</guid>
<description><![CDATA[This is my first post about my code so I thought I start with a fairly simple topic, pagination.  Pa]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>This is my first post about my code so I thought I&#8217;d start with a fairly simple topic, pagination.  Pagination with Codeigniter&#8217;s <a href="http://codeigniter.com/user_guide/libraries/pagination.html">pagination library</a> is fairly straightforward but I have picked up a few tricks that I&#8217;d like to share.  This is however a bit more complicated than regular pagination because we will be paginating based on the first letter of an artist name in a database.  This example is based on the artists/catalog function and the source can be found <a href="http://www.unravelthemusic.com/unravelthemusic.com.zip">here</a>.  See example <a href="http://www.unravelthemusic.com/artists/catalog/C/100">here</a></p>
<pre class="brush: php;">

$this-&gt;load-&gt;library('pagination');

$letter = $this-&gt;uri-&gt;segment(3);
$offset = $this-&gt;uri-&gt;segment(4);

$data['letter'] = substr($letter, 0, 1);

if (!ereg('^[A-Za-z0-9]$', $data['letter'])) {
  redirect('artists/');
}

$this-&gt;load-&gt;model('ArtistModel');
$config['per_page'] = '100';
$data['results'] = $this-&gt;ArtistModel-&gt;loadByLetter($data['letter'], $config['per_page'], $offset);

$config['base_url'] = base_url() . '/artists/catalog/' . $data['letter'] . '/';
$config['uri_segment'] = 4;
$config['first_link'] = 'First';
$config['last_link'] = 'Last';
$config['num_links'] = 6;
$config['total_rows'] = $this-&gt;ArtistModel-&gt;getTotalForLetter($data['letter']);

$this-&gt;pagination-&gt;initialize($config);
$data['links'] = $this-&gt;pagination-&gt;create_links();
</pre>
<p>Pretty standard stuff if you follow the manual page except for a few things.  First off, I take &#8216;uri segment 3&#8217; and I substr it to just one character.  Therefore, if someone types in &#8216;abcderf&#8217; I just get &#8216;a&#8217;.</p>
<p>On lines 8-10 you see an ereg function.  It tests that $data['letter'] is a letter or a digit, and if not, redirects to the basic artists page.  This could be done by passing the variable through the form validation class, as I have done with searches, but for 1 character it seems pretty silly to fire up a whole library to do what a simple ereg can accomplish.  (Note that ereg is deprecated as of PHP 5.3; preg_match('/^[A-Za-z0-9]$/', $data['letter']) performs the same test.)</p>
<p>After this point the $config variable is all pretty standard but the important part is the model functions because this is based on finding the first character of a name.</p>
<p>Here is the first model function required:</p>
<pre class="brush: php;">
function loadByLetter($letter, $num, $offset)
 {
 if($letter == '0')
 {
 $this-&gt;db-&gt;where(&quot;artist REGEXP '^[0-9]'&quot;);
 $this-&gt;db-&gt;where('verified', 1);
 $this-&gt;db-&gt;order_by('artist', 'ASC');
 return $this-&gt;db-&gt;get('artists', $num, $offset);

 } else {
 $this-&gt;db-&gt;like('artist', $letter, 'after');
 $this-&gt;db-&gt;where('verified', 1);
 $this-&gt;db-&gt;order_by('artist', 'ASC');
 return $this-&gt;db-&gt;get('artists', $num, $offset);
 }
 }
</pre>
<p>As you can see, we have the variables letter, num and offset, which are all important in selecting data from the database. If the letter is the number zero (0), which is what the right side menu sends for artists beginning with numbers, it will look for all artists starting with a number.  This works because we use $this-&gt;db-&gt;where("artist REGEXP '^[0-9]'"); to select the data from the database.  This regexp condition selects all artists that begin with a digit 0-9. We then keep only those that are verified and order them by name.</p>
<p>If the $letter variable is not zero then we use a like function: $this-&gt;db-&gt;like('artist', $letter, 'after');. This will select all artists that start with $letter and are verified, ordered by artist name.  The 'after' portion of the like statement means that the wildcard (%) will only be placed after the $letter variable.  The 'after' is crucial in making sure we only get artists that start with that letter.</p>
<p>The offset variable is used to tell the select statement how many to skip.</p>
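<p>The two branches of loadByLetter() can be tried outside CodeIgniter with a small Python sketch. It uses the standard-library sqlite3 module rather than MySQL, so <code>GLOB '[0-9]*'</code> stands in for <code>REGEXP '^[0-9]'</code>; the table contents and the function name <code>load_by_letter</code> are made up for illustration.</p>

```python
import sqlite3

# In-memory table with a few made-up rows (artist name, verified flag).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (artist TEXT, verified INTEGER)")
rows = [("Cake", 1), ("Coldplay", 1), ("Cream", 0), ("Abba", 1), ("3 Doors Down", 1)]
conn.executemany("INSERT INTO artists VALUES (?, ?)", rows)

def load_by_letter(letter, num, offset):
    """One page of verified artists whose name starts with `letter`."""
    if letter == "0":
        # '0' is the bucket for names starting with any digit.
        where, arg = "artist GLOB ?", "[0-9]*"
    else:
        # Wildcard only after the letter, like CodeIgniter's 'after'.
        where, arg = "artist LIKE ?", letter + "%"
    sql = ("SELECT artist FROM artists WHERE " + where +
           " AND verified = 1 ORDER BY artist ASC LIMIT ? OFFSET ?")
    return [r[0] for r in conn.execute(sql, (arg, num, offset))]
```

<p>Here <code>num</code> is the page size and <code>offset</code> is the number of rows to skip, mirroring the two extra arguments passed to <code>$this-&gt;db-&gt;get()</code>.</p>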
<p>The other important model function we need is one that will tell us how many artists we have for that particular letter.  The functions before are always using a limit of $num which for us was set to 100.  We then need to use the following function to find the total.</p>
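<pre class="brush: php;">
function getTotalForLetter($letter)
 {
 if($letter == '0')
 {
 // Same filter as loadByLetter(): '0' stands for "starts with a digit"
 $this-&gt;db-&gt;where(&quot;artist REGEXP '^[0-9]'&quot;);
 } else {
 $this-&gt;db-&gt;like('artist', $letter, 'after');
 }
 $this-&gt;db-&gt;where('verified', 1);
 return $this-&gt;db-&gt;count_all_results('artists');
 }
</pre>
<p>This function applies the same filter as before (including the REGEXP case for '0', so the total matches what loadByLetter() returns) but without the limit.</p>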
<p>Then that&#8217;s it.  Pass the $data['links'] variable to your view file and the pagination class takes care of the rest.</p>
<p>&#8211;Drew Town</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Search for any e-mail address with SQL with regular expression]]></title>
<link>http://oscarvalles.wordpress.com/2009/09/10/search-for-any-e-mail-address-with-sql-with-regular-expression/</link>
<pubDate>Thu, 10 Sep 2009 23:16:38 +0000</pubDate>
<dc:creator>oscarvalles</dc:creator>
<guid>http://oscarvalles.wordpress.com/2009/09/10/search-for-any-e-mail-address-with-sql-with-regular-expression/</guid>
<description><![CDATA[Below is a really simple example on how to search for e-mail addresses within records of your databa]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Below is a really simple example on how to search for e-mail addresses within records of your database.</p>
<p><span style="color:#0000ff;"><strong>select </strong></span>* <span style="color:#0000ff;"><strong>from </strong></span>table<br />
<span style="color:#0000ff;"><strong>where </strong></span>field_name  <span style="color:#0000ff;"><strong>regexp </strong></span><span style="color:#00ff00;">'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'</span>;</p>
<p>The regexp operator is used much like the &#8216;LIKE&#8217; keyword.  The regular expression that follows regexp matches most common e-mail address variations (the final {2,4} limits the top-level domain to two to four letters).</p>
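<p>To see what the pattern accepts before running it against a database, it can be tried in a few lines of Python. This is only a sketch: the helper name <code>find_emails</code> is made up, and the <code>re.IGNORECASE</code> flag is added because the character classes list only uppercase letters.</p>

```python
import re

# The pattern from the query above; IGNORECASE lets it match lowercase too.
EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',
                      re.IGNORECASE)

def find_emails(text):
    """Return every substring of `text` that looks like an e-mail address."""
    return EMAIL_RE.findall(text)
```

<p>For example, <code>find_emails("Contact jane.doe@example.com today")</code> returns a one-element list holding the address.</p>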
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Validate Email address]]></title>
<link>http://codecopy.wordpress.com/2009/09/04/code-test/</link>
<pubDate>Fri, 04 Sep 2009 16:17:12 +0000</pubDate>
<dc:creator>sterndorff</dc:creator>
<guid>http://codecopy.wordpress.com/2009/09/04/code-test/</guid>
<description><![CDATA[CodeCopy: using System; using System.Collections.Generic; using System.Linq; using System.Text; usin]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>CodeCopy:</p>
<pre class="brush: csharp;">
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace CodeCopy
{
    public class EmailValidator
    {
        public static bool IsValidEmail(string email)
        {
            if (string.IsNullOrEmpty(email))
                return false;

            string strRegex = @&quot;^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}&quot; +
                  @&quot;\.[0-9]{1,3}\.[0-9]{1,3}\.)&#124;(([a-zA-Z0-9\-]+\&quot; +
                  @&quot;.)+))([a-zA-Z]{2,4}&#124;[0-9]{1,3})(\]?)$&quot;;
            Regex re = new Regex(strRegex);
            // IsMatch already returns a bool, so return it directly
            return re.IsMatch(email.Trim());
        }
    }
}
</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Learning Flex - Lesson 14, Using Formatters and Validators]]></title>
<link>http://mattreyuk.wordpress.com/2009/08/27/learning-flex-lesson-14-using-formatters-and-validators/</link>
<pubDate>Thu, 27 Aug 2009 22:26:00 +0000</pubDate>
<dc:creator>mattreyuk</dc:creator>
<guid>http://mattreyuk.wordpress.com/2009/08/27/learning-flex-lesson-14-using-formatters-and-validators/</guid>
<description><![CDATA[Flex has formatter classes that can be used to format raw data into customized strings. You can comb]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Flex has formatter classes that can be used to format raw data into customized strings. You can combine these with data binding to format multiple fields simultaneously. All formatters are subclasses of the <code>Formatter</code> class and the built-in versions include:</p>
<p><code>CurrencyFormatter</code>, <code>DateFormatter</code>, <code>NumberFormatter</code>, <code>PhoneFormatter</code>, <code>ZipCodeFormatter</code>.</p>
<p>Each formatter has properties related to its type, so a <code>CurrencyFormatter</code> has <code>currencySymbol</code> and <code>precision</code> properties and a <code>PhoneFormatter</code> has an <code>areaCode</code> property (area code to add to a 7 digit number; -1 means do not add and is the default). They all have pretty sensible defaults and more data can be found in the <a href="http://livedocs.adobe.com/flex/3/html/help.html?content=formatters_1.html">Adobe reference docs</a>. To use a defined formatter, just call its <code>format()</code> method, providing the value to be formatted.</p>
<p>All validators are a subclass of the <code>Validator</code> class. The following are provided validators but you may define your own:</p>
<p><code>CreditCardValidator</code>, <code>DateValidator</code>, <code>EmailValidator</code>, <code>NumberValidator</code>, <code>PhoneNumberValidator</code>, <code>SocialSecurityValidator</code>, <code>StringValidator</code>, <code>ZipCodeValidator</code>.</p>
<p>All validators have a <code>source</code> attribute, which is the <code>id</code> of the control being validated and defines where any error message will appear on failure. The <code>property</code> attribute defines where the actual information being validated is stored (i.e. if we&#8217;re validating a <code>TextInput</code> called <code>myText</code>, the <code>source</code> would be <code>myText</code> and the <code>property</code> would be <code>text</code>). If the validation fails, the control is highlighted in red and hovering over it produces an error message.</p>
<p>Like formatters, validators can have properties related to their type, so a <code>ZipCodeValidator</code> has a <code>domain</code> property (which can be &#8220;<code>US Only</code>&#8221;, &#8220;<code>US or Canada</code>&#8221; or &#8220;<code>Canada Only</code>&#8221;) and a <code>CreditCardValidator</code> requires <code>source</code> and <code>property</code> attributes for <code>cardType</code> and <code>cardNumber</code>.</p>
<p>By default, the validator listens for the <code>valueCommit</code> event on the <code>source</code> component, which corresponds to when the user leaves that component. You can change this behavior or call the <code>validate()</code> method to force validation.</p>
<p>Often, just detecting the validation error is not enough; you want to prevent the user from proceeding until they fix the error. In this case, you need to listen for the <code>ValidationResultEvent</code> (calling <code>validate()</code> directly will return an object of this class). The event type property will be either <code>VALID</code> or <code>INVALID</code>.</p>
<p><strong>Regular Expressions</strong></p>
<p>Regular expressions can be used to help design custom validators. These are special strings that define a pattern a client string should match. You can find out more about defining regular expressions <a href="http://www.regular-expressions.info/">here</a>.</p>
<p>The <code>RegExp</code> class lets you define a regular expression from a string by calling its constructor with <code>new</code>; an optional second parameter takes flags (i = case insensitive, x = white space ignored, etc.). Alternatively, you can assign a literal expression directly to a <code>RegExp</code> variable. In this case you don&#8217;t need to double the &#8216;\&#8217; escape character (because you&#8217;re not providing the pattern as a string), but you do need to put a &#8216;/&#8217; at the start and end of the pattern, traditional regex style (any flags go after the second /).</p>
<p>To execute the regex, use the <code>search()</code> method of the <code>String</code> object you are checking, passing the <code>RegExp</code> object you&#8217;ve created. This returns the position in the string where the pattern begins, or -1 if there is no match.</p>
<p>You can build a custom validator by extending <code>mx.validators.Validator</code>. Call the <code>super()</code> method from the constructor and <code>override</code> the protected method <code>doValidation(value:Object):Array</code>. The <code>doValidation()</code> method returns an <code>Array</code> of <code>ValidationResult</code> objects. Start by clearing your <code>results Array</code>, then call the super class version of the method (<code>super.doValidation()</code>). If the input <code>value</code> object is not null, call its <code>search()</code> method, providing your <code>RegExp</code>. If the pattern is not found, call the <code>push()</code> method on the <code>results Array</code> object to add a new <code>ValidationResult</code> object. This should specify the properties <code>isError:Boolean</code> &#8211; <code>true</code> if there&#8217;s an error, <code>subField:String</code> &#8211; the name of the subfield (if any) the error is associated with, <code>errorCode:String</code> &#8211; the error code for the problem (eg &#8220;notNumber&#8221;) and <code>errorMessage:String</code> &#8211; a longer, more descriptive error message (eg &#8220;account ids cannot contain letters, they are 7 digits long&#8221;).</p>
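<p>The flow of <code>doValidation()</code> is easier to follow with a concrete case. The sketch below mirrors it in Python rather than ActionScript, so it is only an analogy: the <code>ValidationResult</code> tuple, the function name <code>do_validation</code> and the 7-digit account id rule are assumptions for this example, and there is no super class call.</p>

```python
import re
from collections import namedtuple

# Python stand-in for Flex's ValidationResult (same four fields).
ValidationResult = namedtuple(
    "ValidationResult", "is_error sub_field error_code error_message")

# Hypothetical rule for this sketch: an account id is exactly 7 digits.
ACCOUNT_ID_RE = re.compile(r"^[0-9]{7}$")

def do_validation(value):
    """Return a list of ValidationResult objects; an empty list means valid."""
    results = []  # start from a cleared results list
    if value is not None and ACCOUNT_ID_RE.search(str(value)) is None:
        # Pattern not found: record one error, as push() would in Flex.
        results.append(ValidationResult(
            True, "", "notNumber",
            "account ids cannot contain letters, they are 7 digits long"))
    return results
```

<p>An empty list plays the role of a passing validation; any entry with <code>is_error</code> set marks a failure.</p>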
<p>You can set the <code>required </code>property on a validator to true to ensure that if the field is left blank, it fails validation. You&#8217;ll need to call the <code>validate()</code> method directly for this at a point you know the user &#8220;should&#8221; have entered the data (eg on a submit button click).</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Regular Expression for Parsing Error Call Stack.]]></title>
<link>http://log5f.wordpress.com/2009/08/22/the-regular-expression-for-parsing-error-call-stack/</link>
<pubDate>Sat, 22 Aug 2009 20:07:47 +0000</pubDate>
<dc:creator>max.rozdobudko</dc:creator>
<guid>http://log5f.wordpress.com/2009/08/22/the-regular-expression-for-parsing-error-call-stack/</guid>
<description><![CDATA[For getting information about the caller issuing the logging request uses converters which based on ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>To get information about the caller issuing a logging request, the converters rely on this useful regular expression:</p>
<pre class="brush: jscript;">
^\tat (?:(.+)::)*(\w+)\/*(.*)\(\)\[(?:(.+)\:(\d+))?\]$
</pre>
<p>This pattern is based on <a href="http://github.com/jonathanbranam/360flex08_presocode/">Jonathan Branam</a>&#8217;s regular expression, but it is modified to correctly parse calls from constructors and calls from classes in the default package.</p>
<p>You can try this pattern and find many other useful patterns on <a href="http://gskinner.com/RegExr/">RegExr</a>.</p>
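<p>As a quick check of what the five capture groups hold, the pattern can be run from Python. This is a sketch only: the helper name <code>parse_stack_line</code> and the sample stack lines in the usage note are invented for illustration.</p>

```python
import re

# The pattern from the post, unchanged.
STACK_LINE_RE = re.compile(
    r'^\tat (?:(.+)::)*(\w+)\/*(.*)\(\)\[(?:(.+)\:(\d+))?\]$')

def parse_stack_line(line):
    """Return (package, class, method, file, line_number) or None."""
    m = STACK_LINE_RE.match(line)
    return m.groups() if m else None
```

<p>For a line such as a tab followed by <code>at com.example::Main/run()[/src/Main.as:42]</code>, the groups are the package, the class, the method, the source file and the line number; for a default-package constructor call with empty brackets, only the class group is filled.</p>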
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Massive Regular Expressions Toolbox]]></title>
<link>http://maheshvnit.wordpress.com/2009/08/18/massive-regular-expressions-toolbox/</link>
<pubDate>Tue, 18 Aug 2009 11:55:17 +0000</pubDate>
<dc:creator>maheshvnit</dc:creator>
<guid>http://maheshvnit.wordpress.com/2009/08/18/massive-regular-expressions-toolbox/</guid>
<description><![CDATA[Regular expressions (”regex’s” for short) are sets of symbols and syntactic elements used to match p]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="line-height:20px!important;text-align:left;font-size:12px;border:0 initial initial;margin:0;padding:0 0 5px;"><img style="background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#f8f8f4;display:inline;float:left;background-position:initial initial;border:1px solid #e6e6e6;margin:0 10px 10px 0;padding:2px;" src="http://www.tripwiremagazine.com/wp-content/uploads/images/stories/Articles/title-images/6885.jpg" alt="" width="200" height="200" /><a style="color:#e8a02c;text-decoration:none;border:0 initial initial;margin:0;padding:0;" href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">Regular expressions</a> (&#8220;regexes&#8221; for short) are sets of symbols and syntactic elements used to match patterns of text, and they are pretty powerful. Regular expressions have been around for a very long time (on a computer industry scale) and were first introduced as part of the powerful UNIX search tool <a style="color:#e8a02c;text-decoration:none;border:0 initial initial;margin:0;padding:0;" href="http://en.wikipedia.org/wiki/Grep" target="_blank">grep</a>.</p>
<p style="line-height:20px!important;text-align:left;font-size:12px;border:0 initial initial;margin:0;padding:0 0 5px;">The regex syntax commonly used today is compliant with <strong>Extended Regular Expressions (EREs)</strong> defined in IEEE POSIX 1003.2 (Section 2.8). EREs are supported by Apache, PHP4+, JavaScript 1.3+, MS Visual Studio, MS Frontpage, most visual editors, vi, Emacs, the GNU family of tools (including grep, awk and sed) as well as many others. Tools that support <strong>Extended Regular Expressions (EREs)</strong> will also support <strong>Basic Regular Expressions</strong> (BREs are essentially a subset of EREs). The BRE syntax is considered obsolete and is only still around to preserve backward compatibility.</p>
<p style="line-height:20px!important;text-align:left;font-size:12px;border:0 initial initial;margin:0;padding:0 0 5px;">I believe mastering at least the most basic elements of regex is essential for any programmer. Further, I know that having direct access to references, examples, ready-to-use patterns etc. is essential to speed up your work.</p>
<p style="line-height:20px!important;text-align:left;font-size:12px;border:0 initial initial;margin:0;padding:0 0 5px;">This is a toolbox for getting started and/or becoming more serious about regex. It provides details on commonly needed regexes that you can just pick up and use right away. Let&#8217;s get started!</p>
<p style="line-height:20px!important;text-align:left;font-size:12px;border:0 initial initial;margin:0;padding:0 0 5px;"><a href="http://www.tripwiremagazine.com/tutorials/tutorials/massive-regular-expressions-toolbox.html" target="_blank">http://www.tripwiremagazine.com/tutorials/tutorials/massive-regular-expressions-toolbox.html</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regex to filter ISBN 10 and ISBN 13]]></title>
<link>http://daills.wordpress.com/2009/07/30/regex-to-filter-isbn-10-and-isbn-13/</link>
<pubDate>Thu, 30 Jul 2009 20:51:26 +0000</pubDate>
<dc:creator>daill</dc:creator>
<guid>http://daills.wordpress.com/2009/07/30/regex-to-filter-isbn-10-and-isbn-13/</guid>
<description><![CDATA[In case of a project of two friends of mine, i have been asked to write a regular expression pattern]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>For a project of two friends of mine, I was asked to write a regular expression pattern that matches ISBN 10 and ISBN 13 in their different notations. The result looks crappy but it works nearly perfectly &#8230; here it is: </p>
<p><code>((?=(?:978&#124;979)[0-9 ]{14})(?:\d{3}[ ]{1}\d{1,5}[ ]{1}\d{2,7}[ ]{1}\d{2,7}[ ]{1}\d{1})&#124;(?=(?:978&#124;979)[0-9-]{14})(?:\d{3}[-]{1}\d{1,5}[-]{1}\d{2,7}[-]{1}\d{2,7}[-]{1}\d{1})&#124;(?=(?:978&#124;979)\d{10})(?:\d{13})&#124;(?=[0-9xX]{10})(?:(?:\d{9}[xX]{1}&#124;\d{10}))&#124;(?=[0-9-xX]{13})(?:\d{1}[-]{1}\d{3,5}[-]{1}\d{3,5}[-]{1}[0-9xX]{1})&#124;(?=[0-9xX ]{13})(?:\d{1}[ ]{1}\d{3,5}[ ]{1}\d{3,5}[ ]{1}[0-9xX]{1}))</code></p>
<p>It matches:</p>
<ul>
<li>1. 1234567891</li>
<li>2. 1234567891011</li>
<li>3. 3-446-19313-1</li>
<li>4. 3 446 19313 1</li>
<li>5. 978-3-86680-192-9</li>
<li>6. 978 3 86680 192 9</li>
</ul>
<p>Notice: I&#8217;m far from being a regex guru, so please don&#8217;t argue about the syntax <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
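<p>A pattern like the one above only checks the shape of an ISBN; it cannot tell whether the check digit is right. As a complement, here is a minimal Python sketch of the standard ISBN-10 and ISBN-13 check-digit rules. The function name <code>is_valid_isbn</code> is an assumption made for illustration and is not part of the original post.</p>

```python
import re


def is_valid_isbn(text):
    """Return True if text is an ISBN-10 or ISBN-13 with a correct check digit."""
    # Drop the separators the pattern above allows (spaces and hyphens).
    chars = re.sub(r"[ -]", "", text).upper()
    if re.fullmatch(r"\d{9}[\dX]", chars):
        # ISBN-10: weights 10..1, 'X' stands for 10, sum must be divisible by 11.
        values = [10 if c == "X" else int(c) for c in chars]
        return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0
    if re.fullmatch(r"97[89]\d{10}", chars):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum must be divisible by 10.
        return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars)) % 10 == 0
    return False
```

<p>With these rules, 978-3-86680-192-9 from the list above passes, while 1234567891 has the right shape but fails the checksum.</p>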
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regular Expression Mask]]></title>
<link>http://tecnosi.wordpress.com/2009/07/24/expressaoregularmascara/</link>
<pubDate>Fri, 24 Jul 2009 12:14:09 +0000</pubDate>
<dc:creator>Alisson Paiva</dc:creator>
<guid>http://tecnosi.wordpress.com/2009/07/24/expressaoregularmascara/</guid>
<description><![CDATA[Now for a practical example of using regular expressions: a page that applies a ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Now for a practical example of using regular expressions: a page that applies a mask while the user types:<br />
<!--more--><br />
The example formats a phone number with the mask (99) 9999-9999, using these regular expressions:</p>
<p><strong>(/\D/g)</strong> // Matches every non-digit, so replacing it with an empty string leaves only digits.</p>
<p><strong>(/^(\d\d)(\d)/g,"($1) $2")</strong> // Puts the first two digits in parentheses.</p>
<p><strong> (/(\d{4})(\d)/,"$1-$2")</strong> // Inserts "-" after the first run of 4 digits.<br />
The HTML that calls the mask function on every key press looks like this:</p>
<pre class="brush: jscript;">
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;MASK&lt;/title&gt;
&lt;script type=&quot;text/javascript&quot;&gt;
var v_obj, v_fun // field and formatter shared by the two functions below

/* Parent mask function: remembers the field and the formatter, then runs the
   formatter right after the key press, when the new character is in the field */
function Mascara(o,f){
      v_obj=o
      v_fun=f
      setTimeout(execmascara,1)
    }

    /* Applies the stored formatter to the stored field */
    function execmascara(){
        v_obj.value=v_fun(v_obj.value)
    }

    /* Formats a phone number as (11) 4184-1241 */
    function Telefone(v){
        v=v.replace(/\D/g,&quot;&quot;) // Removes everything that is not a digit.
        v=v.replace(/^(\d\d)(\d)/g,&quot;($1) $2&quot;) // Puts the first two digits in parentheses.
        v=v.replace(/(\d{4})(\d)/,&quot;$1-$2&quot;) // Inserts &quot;-&quot; after the next 4 digits.
        return v
    }
	&lt;/script&gt;
	&lt;/head&gt;
	&lt;body&gt;
	Phone:
		&lt;input type=&quot;text&quot; maxlength=&quot;14&quot; size=&quot;12&quot; onkeypress=&quot;Mascara(this,Telefone)&quot; /&gt;
	&lt;/body&gt;
&lt;/html&gt;
</pre>
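<p>The same three substitutions can also be tried outside the browser. This is a minimal Python sketch, assuming a hypothetical function name <code>format_phone</code>; it is not part of the original page and only illustrates what each replace does to the typed value.</p>

```python
import re


def format_phone(raw):
    """Apply the (99) 9999-9999 mask to whatever has been typed so far."""
    digits = re.sub(r"\D", "", raw)                       # keep digits only
    masked = re.sub(r"^(\d\d)(\d)", r"(\1) \2", digits)   # wrap the first two digits
    # Only the first run of four digits followed by another digit gets the hyphen.
    return re.sub(r"(\d{4})(\d)", r"\1-\2", masked, count=1)


for typed in ["11", "114", "1141841", "1141841241"]:
    print(repr(typed), "becomes", repr(format_phone(typed)))
```

<p>Because every call starts by stripping non-digits, feeding an already masked value back in gives the same result, which is why the page can reapply the mask on each key press.</p>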
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Parsing proxies in ip:port form from a web page]]></title>
<link>http://infery.wordpress.com/2009/07/18/%d0%bd%d0%b0%d0%bf%d0%b0%d1%80%d1%81%d0%b8%d1%82%d1%8c-%d0%bf%d1%80%d0%be%d0%ba%d1%81%d0%b5%d0%b9-%d0%b2-%d0%b2%d0%b8%d0%b4%d0%b5-ipport-%d1%81%d0%be-%d1%81%d1%82%d1%80%d0%b0%d0%bd%d0%b8%d1%86%d1%8b/</link>
<pubDate>Sat, 18 Jul 2009 13:36:24 +0000</pubDate>
<dc:creator>infery</dc:creator>
<guid>http://infery.wordpress.com/2009/07/18/%d0%bd%d0%b0%d0%bf%d0%b0%d1%80%d1%81%d0%b8%d1%82%d1%8c-%d0%bf%d1%80%d0%be%d0%ba%d1%81%d0%b5%d0%b9-%d0%b2-%d0%b2%d0%b8%d0%b4%d0%b5-ipport-%d1%81%d0%be-%d1%81%d1%82%d1%80%d0%b0%d0%bd%d0%b8%d1%86%d1%8b/</guid>
<description><![CDATA[I needed a list of proxies in ip:port format. I typed a query into Google and went to the first]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I needed a list of proxies in ip:port format. I typed a query into Google and went to the first <a href="http://www.proxy4free.com/page1.html">site</a> that came up. On the whole I liked it, but I did not feel like pulling all the data out by hand&#8230; Well. Automation rules. I picked C#: simple and convenient. Create a Windows Forms project, add a button and two textBoxes with Multiline = true. Click the button so that the handler code is generated (or write it by hand if you like =)).<br />
Oh, yes! I almost forgot. At the very top you need to add<br />
<code><br />
using System.Text.RegularExpressions; // needed for regular expressions<br />
</code></p>
<p>Next comes the button's click handler:</p>
<p><code><br />
private void button1_Click(object sender, EventArgs e)<br />
{<br />
Regex rx = new Regex("[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}"); // this is how we extract the IPs<br />
Regex rPort = new Regex("[&#62;][0-9]{1,4}[&#60;]"); // and this is how we extract the port<br />
MatchCollection me; // collection of IP matches<br />
MatchCollection mPort; // collection of port matches<br />
me = rx.Matches(textBox1.Text); // extract<br />
mPort = rPort.Matches(textBox1.Text); // extract<br />
int ind = 0;<br />
string zz = null;<br />
foreach (Match m in me) // walk through all the IP addresses<br />
{<br />
if (m.Value != zz &#38;&#38; ind &#60; mPort.Count) // compare with the previous IP to avoid repeats, and only while ports remain<br />
{<br />
textBox2.Text += (m.Value + ":" + mPort[ind].Value.Substring(1, mPort[ind].Value.Length-2) + Environment.NewLine); // append ip:port to the list<br />
ind++;<br />
}<br />
zz = m.Value; // remember this IP, it becomes the previous one ))<br />
}<br />
}<br />
</code></p>
<p>Here is the result:<br />
<img class="aligncenter size-full wp-image-32" title="brute" src="http://infery.wordpress.com/files/2009/07/brute.jpg" alt="brute" width="301" height="370" /><br />
Let me say right away: the program works only for this site!<br />
And, damn, no indentation!! How do you keep indentation in this WordPress???</p>
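<p>The pairing logic is easy to try without a form. Below is a rough Python sketch of the same idea, assuming (as the C# handler does) that IPs and ports appear on the page in the same order. The name <code>extract_proxies</code> and the sample HTML are made up for illustration.</p>

```python
import re

IP_RE = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")
PORT_RE = re.compile(r">([0-9]{1,4})<")   # digits sitting directly between two tags


def extract_proxies(html):
    """Pair each distinct consecutive IP with the next port found in the page."""
    ips = []
    for ip in IP_RE.findall(html):
        if not ips or ips[-1] != ip:      # skip an IP repeated right after itself
            ips.append(ip)
    ports = PORT_RE.findall(html)
    return [ip + ":" + port for ip, port in zip(ips, ports)]


sample = (
    "<tr><td>10.0.0.1</td><td>8080</td></tr>"
    "<tr><td>10.0.0.2</td><td>3128</td></tr>"
)
print("\n".join(extract_proxies(sample)))
```

<p>Using <code>zip</code> simply stops at the shorter list, so a page with fewer ports than addresses does not raise an error.</p>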
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regular expression examples]]></title>
<link>http://tecnosi.wordpress.com/2009/07/17/exemplos-de-expressao-regular/</link>
<pubDate>Fri, 17 Jul 2009 13:54:36 +0000</pubDate>
<dc:creator>Alisson Paiva</dc:creator>
<guid>http://tecnosi.wordpress.com/2009/07/17/exemplos-de-expressao-regular/</guid>
<description><![CDATA[CPF: Allows a sequence of three digits, a dot, three more digits, a dot followed by three]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><strong>CPF: </strong>Allows a sequence of three digits, a dot, three more digits, a dot, three more digits, a hyphen and two final digits (the dots and the hyphen are optional)<br />
format 999.999.999-99</p>
<p>&#8220;\d{3}\.?\d{3}\.?\d{3}-?\d{2}&#8221;</p>
<p><strong>Other examples:</strong></p>
<p><!--more--></p>
<p><strong>Date:</strong><br />
&#8220;^(([0-2]\d&#124;[3][0-1])/([0]\d&#124;[1][0-2])/[1-2][0-9]\d{2})$&#8221;</p>
<p><strong>IP:</strong><br />
&#8220;^(([1]?[0-9]{1,2}&#124;2([0-4][0-9]&#124;5[0-5]))\.){3}([1]?[0-9]{1,2}&#124;2([0-4][0-9]&#124;5[0-5]))$&#8221;</p>
<p><strong>Time:</strong><br />
&#8220;/^([0-1][0-9]&#124;[2][0-3]):[0-5][0-9]$/&#8221;</p>
<p><strong><em>SQL:</em></strong><br />
<strong>SELECT:</strong>  &#8220;SELECT\s[\w\*\)\(\,\s]+\sFROM\s[\w]+&#8221;</p>
<p><strong>INSERT:</strong>  &#8220;INSERT\sINTO\s[\d\w]+[\s\w\d\)\(\,]*\sVALUES\s\([\d\w\'\,\)]+&#8221;</p>
<p><strong>UPDATE:</strong>  &#8220;UPDATE\s[\w]+\sSET\s[\w\,\'\=]+&#8221;</p>
<p><strong>DELETE:</strong>  &#8220;DELETE\sFROM\s[\d\w\'\=]+&#8221;</p>
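<p>A quick way to check patterns like these is to run them against a few sample values. The sketch below does that in Python for the CPF, date and time patterns, written with the <code>\d</code> escapes spelled out. The sample values are invented for illustration.</p>

```python
import re

CPF = re.compile(r"\d{3}\.?\d{3}\.?\d{3}-?\d{2}")
DATE = re.compile(r"^(([0-2]\d|[3][0-1])/([0]\d|[1][0-2])/[1-2][0-9]\d{2})$")
TIME = re.compile(r"^([0-1][0-9]|[2][0-3]):[0-5][0-9]$")

checks = [
    (CPF, "123.456.789-09"),
    (CPF, "12345678909"),
    (DATE, "31/12/2009"),
    (DATE, "32/01/2009"),
    (TIME, "23:59"),
    (TIME, "24:00"),
]
for pattern, value in checks:
    verdict = "ok" if pattern.fullmatch(value) else "rejected"
    print(value, verdict)
```

<p>These patterns validate the shape only: the date pattern, for instance, still accepts 30/02/2009, so a real form would need a calendar check on top.</p>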
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regular expressions - what are they?]]></title>
<link>http://tecnosi.wordpress.com/2009/07/16/expressaoregular/</link>
<pubDate>Thu, 16 Jul 2009 17:47:02 +0000</pubDate>
<dc:creator>Alisson Paiva</dc:creator>
<guid>http://tecnosi.wordpress.com/2009/07/16/expressaoregular/</guid>
<description><![CDATA[Data consistency is very important for any system; one of the most widely used ways to]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Data consistency is very important for any system, and one of the most widely used ways to standardize the data a system accepts is the regular expression.</p>
<p>But what are regular expressions?  To sum up, we can say that:</p>
<p><!--more--></p>
<p style="text-align:center;"><em>&#8220;regular expressions are characters that together form a rule for data entry. Data that obeys all the established patterns is validated.&#8221;</em></p>
<p>We can use regular expressions for anything that follows a pattern, such as RG, CPF, CNPJ, phone number, e-mail, date, CEP (postal code), passwords, etc.</p>
<p>Many languages already support regular expressions, for example C, VB, JavaScript, GeneXus, Perl, Python, etc.</p>
<p>Below is a short list of regular expression patterns:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="288" valign="top">j*</td>
<td width="288" valign="top">The letter j zero or more times</td>
</tr>
<tr>
<td width="288" valign="top">j+</td>
<td width="288" valign="top">The letter j one or more times</td>
</tr>
<tr>
<td width="288" valign="top">[c-t]</td>
<td width="288" valign="top">Any letter in the range c to t</td>
</tr>
<tr>
<td width="288" valign="top">\w</td>
<td width="288" valign="top">Any alphanumeric character (or underscore)</td>
</tr>
<tr>
<td width="288" valign="top">\d</td>
<td width="288" valign="top">Any digit.</td>
</tr>
</tbody>
</table>
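<p>To see the table entries in action, here is a small Python sketch that runs each pattern against a few sample strings. The samples and the helper name <code>all_match</code> are chosen for illustration only.</p>

```python
import re

# Each pattern from the table with strings it is expected to accept in full.
samples = {
    r"j*": ["", "j", "jjj"],
    r"j+": ["j", "jjj"],
    r"[c-t]": ["c", "m", "t"],
    r"\w": ["a", "7", "_"],
    r"\d": ["0", "9"],
}


def all_match(pattern, values):
    """True when every value matches the pattern from start to end."""
    return all(re.fullmatch(pattern, value) for value in values)


results = {pattern: all_match(pattern, values) for pattern, values in samples.items()}
print(results)
```

<p>The difference between the first two rows shows up on the empty string: <code>j*</code> accepts it, <code>j+</code> does not.</p>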
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regular Expressions In Siebel with Non English Character Set support]]></title>
<link>http://ersin.wordpress.com/2009/07/09/regular-expressions-in-siebel-with-non-english-character-set-support/</link>
<pubDate>Thu, 09 Jul 2009 09:51:46 +0000</pubDate>
<dc:creator>ersin</dc:creator>
<guid>http://ersin.wordpress.com/2009/07/09/regular-expressions-in-siebel-with-non-english-character-set-support/</guid>
<description><![CDATA[You can find basic information about how regular expressions can be used in Siebel at the links below, h]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>You can find basic information about how regular expressions can be used in Siebel at the links below:</p>
<p><a href="http://siebeldev.blogspot.com/2009/04/regular-expression-validation.html" target="_blank">http://siebeldev.blogspot.com/2009/04/regular-expression-validation.html</a></p>
<p><a href="http://download.oracle.com/docs/cd/B40099_02/books/eScript/eScript_JSReference252.html" target="_blank">http://download.oracle.com/docs/cd/B40099_02/books/eScript/eScript_JSReference252.html</a></p>
<p>To support non-English character sets, the RegExp pattern should be written with Unicode escapes.</p>
<p>More information about Unicode is available at the link below:</p>
<p><a href="http://www.unicode.org/" target="_blank">http://www.unicode.org/</a></p>
<p><strong>Below is a simple pattern for validating Turkish first names:</strong></p>
<p>&#8220;^[\u0041-\u005A]\u002E?[\u0041-\u005A\u0061-\u007A\u0130\u0131\u00D6\u00F6\u00DC\u00FC\u00C7\u00E7\u011E\u011F\u015E\u015F]*$&#8221;</p>
<p><strong>Pattern explanation: </strong></p>
<p>The first character must be an upper-case letter from A to Z ([\u0041-\u005A])</p>
<p>The second character may be &#8220;.&#8221; (\u002E?)</p>
<p>The remaining characters can be any letter of the Turkish alphabet ([\u0041-\u005A\u0061-\u007A\u0130\u0131\u00D6\u00F6\u00DC\u00FC\u00C7\u00E7\u011E\u011F\u015E\u015F]*). The code points are listed one after another because inside square brackets a &#8220;&#124;&#8221; is an ordinary character, not an alternation.</p>
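<p>The pattern can be tried outside Siebel as well. The following is a minimal Python sketch, not Siebel eScript, with the code points listed directly in the character class (inside brackets a &#8220;&#124;&#8221; would be matched literally). The constant name and the sample names are assumptions for illustration.</p>

```python
import re

# Raw strings keep the \uXXXX escapes for the regex engine to interpret,
# so \u002E stays a literal dot instead of becoming the "any character" wildcard.
TURKISH_NAME_RE = re.compile(
    r"^[\u0041-\u005A]\u002E?"
    r"[\u0041-\u005A\u0061-\u007A\u0130\u0131\u00D6\u00F6\u00DC\u00FC"
    r"\u00C7\u00E7\u011E\u011F\u015E\u015F]*$"
)

candidates = ["Ersin", "G\u00fcl\u015fah", "A.Y\u0131lmaz", "ersin", "Ali|Veli", "\u00d6zg\u00fcr"]
for name in candidates:
    print(name, bool(TURKISH_NAME_RE.match(name)))
```

<p>Note that the first-character class covers only A to Z, so a name that begins with &#199;, &#214; or &#350; is rejected by the pattern as written.</p>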
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Quantitative text analysis - a prototype]]></title>
<link>http://sicapisce.wordpress.com/2009/07/09/analisi-quantitativa-del-testo-un-prototipo/</link>
<pubDate>Thu, 09 Jul 2009 06:00:29 +0000</pubDate>
<dc:creator>Samuel Zarbock</dc:creator>
<guid>http://sicapisce.wordpress.com/2009/07/09/analisi-quantitativa-del-testo-un-prototipo/</guid>
<description><![CDATA[In the article &#8220;Quantitative text analysis &#8211; a project&#8221; I outlined the]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In the article <a title="How to profile a text against a community of reference writings" href="http://sicapisce.wordpress.com/2009/04/08/analisi-quantitativa-del-testo/" target="_blank">&#8220;Quantitative text analysis &#8211; a project&#8221;</a> I outlined the idea of a piece of software that automatically categorizes a document against a set of reference documents. The result would be a set of words and phrases that summarize the topics covered in the analyzed document.<br />
Here I start describing a PHP prototype of that project.</p>
<h2>1. The client and the server.</h2>
<p>First of all, the text analysis is based on comparing a document with a gigantic corpus of documents more or less related to it. The user of the system runs (or has someone run) some calculations on the text of their own document and compares the results with what the same calculations produced on the corpus. At this point the differences take on a meaning (we move from quantity to quality) and make it possible to identify attributes that characterize the analyzed document. That's all.</p>
<p>So this would not be a search engine, where the <em>spider</em> goes looking for information about the resource, but the opposite: a system in which the resource describes itself to an entity that returns some deductions.<br />
&#8220;<em>What am I like, compared with written Italian?</em>&#8221;; &#8220;<em>what am I like, compared with the other short stories by my own author?</em>&#8221;; &#8220;<em>what am I like, compared with all the novels published in Sardinia over the last ten years?</em>&#8221;.</p>
<p>First, the documents under analysis must feed the corpus of reference documents. This means that if I have a new book analyzed, the results of the calculations are added to those already stored. This way the corpus would grow as the categorization service spreads: the more texts are categorized, the more texts end up in the corpus.<br />
Moreover, knowing what the statistical analysis consists of, the <em>clients</em> can afford to send the categorization service only semi-processed statistical data: they need to send neither the whole text nor the full result of the counts. The server stores only semi-processed statistical data.</p>
<p>The process is therefore taking shape as follows:</p>
<ul>
<li>the server holds counts for a set of documents that have already been analyzed (and an aggregation of all these counts: the famous corpus)</li>
<li>the client asks the server whether a given document has been counted recently; if the count is up to date, the server compares it with the aggregated data and returns the labels to the client</li>
<li>if the counts are out of date (or the document has not been analyzed yet), the client (re)runs its counts and sends them to the server; the server makes its comparison and returns the labels</li>
<li>in both cases the client's counts are also stored on the server, so as to feed the grand total.</li>
</ul>
<p>The hard part is identifying a document uniquely: it can be done for an online page (a URL is unique enough) or for a book (I am thinking of the ISBN code), but it already becomes a problem with, for example, user-generated documents.</p>
<h2>2. The data structure.</h2>
<p>Both the client and the server must hold, for every document analyzed so far, the computed statistical data and a set of information that allows the document to be identified uniquely. The client will limit itself to the documents it has analyzed itself; the server will collect the information for all documents.<br />
The server, in addition, will aggregate these data by summing the counts into a single table (all of written Italian) or into several temporary tables (all of Ungaretti's written Italian, to give an example&#8230;).</p>
<p>To store, compute and analyze the words it will be convenient to use a database, the perfect tool for counting, managing and manipulating large amounts of data. The DB I will use is MySQL, which among other things lets me run queries with <em>regexp</em> syntax (which may always come in handy, since we are dealing with words) and lets me hand it the responsibility of deciding whether a row should be added or updated (through the ON DUPLICATE KEY UPDATE syntax, discussed below).</p>
<blockquote>
<pre>CREATE DATABASE `adt` /*!40100 DEFAULT CHARACTER SET utf8 */;

DROP TABLE IF EXISTS `adt`.`content`;
CREATE TABLE  `adt`.`content` (
  `id` varchar(50) NOT NULL,
  `author_id` int(10) unsigned NOT NULL,
  `mod_date` datetime NOT NULL,
  `check_date` datetime NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `Dates` (`mod_date`,`check_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `adt`.`x1`;
CREATE TABLE  `adt`.`x1` (
`content_id` varchar(50) NOT NULL,
`word1` varchar(50) NOT NULL,
`count` int(12) unsigned NOT NULL,
PRIMARY KEY USING BTREE (`content_id`,`word1`),
KEY `Sum` (`count`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `adt`.`x2`;
CREATE TABLE  `adt`.`x2` (
`content_id` varchar(50) NOT NULL,
`word1` varchar(50) NOT NULL,
`word2` varchar(50) NOT NULL,
`count` int(12) unsigned NOT NULL,
PRIMARY KEY USING BTREE (`content_id`,`word1`,`word2`),
KEY `Sum` (`count`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `adt`.`x3`;
CREATE TABLE  `adt`.`x3` (
`content_id` varchar(50) NOT NULL,
`word1` varchar(50) NOT NULL,
`word2` varchar(50) NOT NULL,
`word3` varchar(50) NOT NULL,
`count` int(12) unsigned NOT NULL,
PRIMARY KEY USING BTREE (`content_id`,`word1`,`word2`,`word3`),
KEY `Sum` (`count`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;</pre>
</blockquote>
<p>The tables with the &#8220;x&#8221; prefix hold the results of the first counts: every <em>lemma</em> (or form), <em>bigram</em> or <em>trigram</em> appears only once, already accompanied by its number of occurrences; the number after the &#8220;x&#8221; prefix indicates how many words were considered (<em>n-gram</em>).<br />
To keep an INSERT statement from failing when a lemma is already present, it is advisable to use the ON DUPLICATE KEY UPDATE syntax, as in this example:</p>
<pre>INSERT INTO x1 (content_id,word1,count) VALUES ('$id','$word',1) ON DUPLICATE KEY UPDATE count=count+1;</pre>
<p>This way, with MySQL there is no need to check with a SELECT whether a lemma has already been inserted and then issue either an INSERT or an UPDATE accordingly.</p>
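<p>MySQL's ON DUPLICATE KEY UPDATE needs a running server, so the sketch below uses Python's built-in <code>sqlite3</code> module and SQLite's own upsert spelling (ON CONFLICT ... DO UPDATE, available from SQLite 3.24) to show the same insert-or-increment idea. This is a substitute technique, not the MySQL syntax; the table mirrors x1, and the helper name <code>add_word</code> is an assumption.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x1 ("
    " content_id TEXT NOT NULL,"
    " word1 TEXT NOT NULL,"
    " count INTEGER NOT NULL,"
    " PRIMARY KEY (content_id, word1))"
)


def add_word(content_id, word):
    # Insert the word with count 1, or bump the counter when the key already exists.
    conn.execute(
        "INSERT INTO x1 (content_id, word1, count) VALUES (?, ?, 1) "
        "ON CONFLICT (content_id, word1) DO UPDATE SET count = count + 1",
        (content_id, word),
    )


for w in ["nel", "mezzo", "del", "cammin", "del"]:
    add_word("1", w)

rows = dict(conn.execute("SELECT word1, count FROM x1 WHERE content_id = '1'"))
print(rows)
```

<p>As in the MySQL version, the primary key on (content_id, word1) is what turns a second insert of the same word into an update.</p>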
<p>The &#8220;<em>content</em>&#8221; table holds all the data about the original text; I still have to work out how to handle groups of texts (I could imagine, for example, a row for the content &#8220;ALL&#8221;, a mandatory grouping present by default: but I still have to think about it).</p>
<h2>3. Reading a text.</h2>
<p>First of all we need to load the text to be analyzed into memory. For now this is not about building a real parser, but about preparing the structure that will make one possible:</p>
<blockquote>
<pre>&#60;script language="php"&#62;

$dblink = mysql_connect( 'localhost', 'USERNAME', 'PASSWORD' );
mysql_select_db( 'DBNAME', $dblink );
// Remember the configured time limit so it can be restored at the end
if( ini_get( 'max_execution_time' ) )
{
	$time_out = ini_get( 'max_execution_time' );
} else {
	$time_out = 30;
}
ini_set( 'max_execution_time', 300 );

$fp = fopen( "TEXT_TO_PARSE.txt", "r" );
$content_id = 1;

// Collect every non-empty line of the file
$lines = array();
$current = 0;
while( ( $line = fgets( $fp ) ) !== false )
{
	if( trim( $line ) !== '' )
	{
		$current++;
		$lines[$current] = utf8_decode( trim( $line ) );
	}
}
fclose( $fp );

ini_set('max_execution_time', $time_out);
&#60;/script&#62;</pre>
</blockquote>
<p>We open a file for reading (it must be UTF-8), create the <em>$lines</em> array, and fill one cell of the array for each non-empty line of the file. This way the <em>$lines</em> array will contain all the lines that make up the text to be analyzed. Nothing difficult so far.</p>
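<p>For readers who want to experiment without PHP, the same loading step can be sketched in Python. The function name <code>read_nonempty_lines</code> is hypothetical, and an in-memory stream stands in for the text file.</p>

```python
import io


def read_nonempty_lines(stream):
    """Return the stripped, non-empty lines of an open text stream."""
    return [line.strip() for line in stream if line.strip()]


# A small in-memory text with an empty line and a whitespace-only line.
sample = io.StringIO("Nel mezzo del cammin\n\n   \ndi nostra vita\n")
lines = read_nonempty_lines(sample)
print(lines)
```

<p>With a real file the stream would come from something like <code>open("TEXT_TO_PARSE.txt", encoding="utf-8")</code>.</p>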
<h2>4. Saving the words.</h2>
<p>For now we are not analyzing anything yet: first we have to save the words in the appropriate tables (an &#8220;x&#8221; table for single words, an &#8220;x&#8221; table for pairs of words, an &#8220;x&#8221; table for sequences of three words, and so on: hence &#8220;x1&#8221;, &#8220;x2&#8221;, &#8220;x3&#8221;, etc.).<br />
Later, when we start counting occurrences, we will need to know, for each word, which <strong>communities</strong> it belongs to:</p>
<ul>
<li>the <em>language</em> of the text in which it appeared</li>
<li>the <em>literary genre</em> of the text it comes from</li>
<li>the <em>topics</em> addressed by the text it comes from</li>
<li>the <em>network of citations</em> in which the text it comes from is embedded</li>
<li>the <em>text</em> it comes from</li>
</ul>
<p>This will let us run analyses at whatever level of depth interests us most: for example, it will be possible to extract the <strong>frequency lexicon</strong> of all texts on <em>biblical topics</em>, or all the <strong>most frequent phrases</strong> of <em>text X</em>, and so on.<br />
To build a table holding all that detail (&#8220;content&#8221;) it will be enough to refer to the text the words come from: it is the text, after all, that can be <em>characterized by a genre, by some topics, by one or more authors, by a publication date</em>&#8230; The text will act as the crossroads for every kind of relationship we may want to establish between words.</p>
<p>So now, when saving the words to a DB, it will be enough to associate them with the text they come from. For the moment I arbitrarily set the identifier of the text to the number 1.</p>
<h2>5. Single words first&#8230;</h2>
<p>The PHP code that writes all the words of a text to &#8220;x1&#8221; is the following:</p>
<blockquote>
<pre>&#60;script language="php"&#62;
// @TODO: the current check for whether this content_id is already in the DB
// relies on an id that must be unique!!
if( !mysql_fetch_array( mysql_query( "SELECT content_id FROM x1 WHERE content_id = '$content_id' limit 1" ) ) )
// @TODO: identify sentence boundaries to allow analyses at a finer level
// than the whole text. The idea is to build the array directly from the
// full original text, splitting it wherever a full stop appears:
// $fulltext = preg_replace( '/[\.\n]+/', '.', $fulltext );
// $lines = explode( ".", $fulltext );
{
	foreach( $lines as $key =&#62; $value)
	{
		$sql = "INSERT INTO x1 VALUES ";
		// Turn every run of non-word characters into a single space
		$value = trim( preg_replace( '/[^\w]+/', ' ', $value ) );
		if ($value !== '')
		{
			$line_arr = explode( " ", $value );
			foreach( $line_arr as $key2 =&#62; $item )
			{
				$sql .= "($content_id,'$item',1),";
			}
			// Drop the trailing comma and run one query per line
			$sql = substr( $sql, 0, -1 );
			$sql .= "  ON DUPLICATE KEY UPDATE count=count+1;";
			mysql_query( $sql );
		}
	}
	echo "EOF&#60;br&#62;";
} else {
	echo "Content already present.&#60;br&#62;";
}
&#60;/script&#62;</pre>
</blockquote>
<p>I connect to the DB, take each cell of the <em>$lines</em> array (for now I simply assume I have it: later I will see how to pass it along for real), collapse all the different spacing into a single space, build an array containing all the words, and add the information for each word (that is, the text id and the word) to the SQL query that I finally run (to speed up execution I build a single query for a whole line).<br />
I print an &#8220;End of File&#8221; on screen to reassure the user.</p>
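<p>The counting itself can be sketched without a database. The Python snippet below is an illustration under my own assumptions (the names <code>words_of</code> and <code>count_words</code> are hypothetical): it splits each line on runs of non-word characters and keeps a running total per word, which is what the insert-or-increment query achieves in the tables.</p>

```python
import re
from collections import Counter


def words_of(line):
    """Split a line into words, treating every run of non-word characters as a separator."""
    return re.sub(r"\W+", " ", line).split()


def count_words(lines):
    """Return a Counter with the number of occurrences of each word."""
    counts = Counter()
    for line in lines:
        counts.update(words_of(line))
    return counts


counts = count_words(["Nel mezzo, del cammin", "del  cammin!"])
print(counts.most_common())
```

<p>The sketch is case-sensitive, so &#8220;Nel&#8221; and &#8220;nel&#8221; would be counted separately unless the lines are lower-cased first.</p>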
<h2>6. &#8230;then all the others.</h2>
<p>The PHP code that writes all the words of a text to &#8220;x2&#8221;, &#8220;x3&#8221;, &#8220;x4&#8221;, &#8220;x5&#8221; and &#8220;x6&#8221; is the following:</p>
<blockquote>
<pre>// $key is the position of the current word in $line_arr.
// Store every n-gram (n = 2..6) that ends on that word.
for ($n = 2; $n &#60;= 6; $n++) {
 if ($key - $n + 1 &#60; 0) {
  break; // not enough preceding words on this line
 }
 $words = array_slice($line_arr, $key - $n + 1, $n);
 mysql_query("INSERT INTO adt.x$n VALUES ('$content_id', '".implode("', '", $words)."', 1)
    ON DUPLICATE KEY UPDATE count=count+1");
 }</pre>
</blockquote>
<p>OK, I realize this part of the toy is perfectible (to put it <a title="A list of figures of speech, which also includes the euphemism" href="http://sicapisce.wordpress.com/2008/10/15/figure-retoriche/" target="_blank">euphemistically</a>); in any case the result is plenty of nice <em>n-grams</em>.</p>
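<p>For comparison, here is how the n-gram extraction could look as a short Python sketch. It is only an illustration, with hypothetical names <code>ngrams</code> and <code>count_ngrams</code>, and it counts in memory instead of writing to the &#8220;x&#8221; tables.</p>

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(words, max_n=6):
    """Map each n from 2 to max_n to a Counter of the n-grams found in words."""
    return {n: Counter(ngrams(words, n)) for n in range(2, max_n + 1)}


words = "nel mezzo del cammin di nostra vita".split()
counts = count_ngrams(words)
print(counts[2].most_common(3))
```

<p>A line shorter than n simply yields no n-grams, so no special cases are needed for the first words of a line.</p>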
<p>In the next installment of this project I will start checking whether comparing the statistical data of a text with those of its reference corpus really brings out interesting qualitative information.</p>
<p>Thanks to anyone who managed to follow me this far&#8230; <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Useful regExp in JavaScript.]]></title>
<link>http://javierloriente.wordpress.com/2009/06/17/usefull-regexp-in-javascript/</link>
<pubDate>Wed, 17 Jun 2009 15:26:33 +0000</pubDate>
<dc:creator>alikates</dc:creator>
<guid>http://javierloriente.wordpress.com/2009/06/17/usefull-regexp-in-javascript/</guid>
<description><![CDATA[I have found this regular expression to be useful // Regex for parsing URIs. var HTTP_URI_RE = /^(h]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I have found these regular expressions to be useful:<br />
<code><br />
// Regex for parsing URIs.<br />
var HTTP_URI_RE = /^(https?:)\/\/([\w-_.]+(:[0-9]+)?)(.*)$/;<br />
var FILE_URI_RE = /^file:\/\/(.*)$/;<br />
var PATH_RE     = /^(\/?[\w\-_\.\/\%\*]*)(\#([\w\-_\.\/\%\*]*))?$/;<br />
var FILE_RE     = /^(.*\/)*([^.]+\.[\w]+)$/;<br />
var SEARCH_RE   = /^([^&#38;=]+)(=([^&#38;]*))?(&#38;(.*))?/;<br />
</code></p>
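<p>To show what the capture groups deliver, here is a small Python sketch of two of the patterns. It is an adaptation, not the original JavaScript: Python rejects the range-like <code>\w-_</code> inside a character class, so the hyphen is moved to the end of the class. The sample URL is made up.</p>

```python
import re

# Hyphen placed last in the class; "_" is already covered by \w.
HTTP_URI_RE = re.compile(r"^(https?:)//([\w.-]+(:[0-9]+)?)(.*)$")
SEARCH_RE = re.compile(r"^([^&=]+)(=([^&]*))?(&(.*))?")

uri = HTTP_URI_RE.match("http://example.com:8080/docs/index.html?lang=en")
scheme, host_port, port, rest = uri.groups()

pair = SEARCH_RE.match("lang=en&page=2")
key, value, remainder = pair.group(1), pair.group(3), pair.group(5)

print(scheme, host_port, port, rest)
print(key, value, remainder)
```

<p>SEARCH_RE peels off one key=value pair at a time, leaving the rest of the query string in the last group so it can be applied again.</p>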
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[regexp to extract parameter like option=[value]]]></title>
<link>http://r4ccoon.wordpress.com/2009/06/08/regexp-to-extract-parameter-like-optionvalue/</link>
<pubDate>Mon, 08 Jun 2009 05:22:52 +0000</pubDate>
<dc:creator>r4ccoon</dc:creator>
<guid>http://r4ccoon.wordpress.com/2009/06/08/regexp-to-extract-parameter-like-optionvalue/</guid>
<description><![CDATA[preg_match_all("#(\w+)=\[(.*?)\]#s", $param_line, $matches); sample string = {gallery width=[129] ur]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><pre>preg_match_all("#(\w+)=\[(.*?)\]#s", $param_line, $matches);</pre>
<p>sample string =  {gallery width=[129] url=[http://google.com] height=[122]}</p>
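<p>For reference, the same pattern behaves like this in Python, a minimal sketch using the sample string above: <code>(\w+)</code> captures the option name and the lazy <code>(.*?)</code> captures everything up to the next closing bracket.</p>

```python
import re

param_line = "{gallery width=[129] url=[http://google.com] height=[122]}"

# re.S plays the role of the "s" modifier: "." may also match a newline.
params = dict(re.findall(r"(\w+)=\[(.*?)\]", param_line, flags=re.S))
print(params)
```

<p>Because the value group is lazy, each match stops at the first &#8220;]&#8221;, so a value cannot itself contain a closing bracket.</p>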
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Need to learn: regular expressions]]></title>
<link>http://firmit.wordpress.com/2009/06/07/need-to-learn-regulare-expressions/</link>
<pubDate>Sun, 07 Jun 2009 19:38:34 +0000</pubDate>
<dc:creator>firmit</dc:creator>
<guid>http://firmit.wordpress.com/2009/06/07/need-to-learn-regulare-expressions/</guid>
<description><![CDATA[Oh &#8211; the possibilities! sed, grep and a lot of editors support the use of regexp. It&#8217;s a]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Oh &#8211; the possibilities!<br />
sed, grep and a lot of editors support the use of regexp. It&#8217;s about time I learned it!</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Validating an e-mail address with RegExp]]></title>
<link>http://mynetbook.wordpress.com/2009/05/27/regexp-ile-mail-adresi-kontrolu/</link>
<pubDate>Wed, 27 May 2009 12:23:08 +0000</pubDate>
<dc:creator>FreePunch</dc:creator>
<guid>http://mynetbook.wordpress.com/2009/05/27/regexp-ile-mail-adresi-kontrolu/</guid>
<description><![CDATA[Function rgxEmail(email) Dim regEx, Match, Matches If email &lt;&gt; &#8220;&#8221; Then Set regEx =]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Function rgxEmail(email)<br />
Dim regEx, Match, Matches<br />
If email &#60;&#62; "" Then<br />
Set regEx = New RegExp<br />
regEx.Pattern = "^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)&#124;(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}&#124;[0-9]{1,3})(\]?)$"<br />
regEx.IgnoreCase = True</p>
<p>If regEx.Test(email) Then rgxEmail = True Else rgxEmail = False<br />
End If<br />
End Function</p>
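<p>The same pattern can be exercised in Python to see which addresses it lets through. This is a sketch for experimentation, with a hypothetical function name <code>is_email</code> and invented sample addresses.</p>

```python
import re

EMAIL_RE = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@"
    r"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))"
    r"([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$",
    re.IGNORECASE,
)


def is_email(address):
    """Return True when a non-empty address matches the pattern."""
    return bool(address) and EMAIL_RE.match(address) is not None


for sample in ["user.name@example.com", "user@[192.168.0.1]", "user@example", "user@example.museum"]:
    print(sample, is_email(sample))
```

<p>The last sample is rejected because the pattern limits the final label to 2 to 4 letters, so longer top-level domains do not pass.</p>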
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Having performance issues with regex?]]></title>
<link>http://eyalsch.wordpress.com/2009/05/21/regex/</link>
<pubDate>Thu, 21 May 2009 10:59:57 +0000</pubDate>
<dc:creator>Eyal Schneider</dc:creator>
<guid>http://eyalsch.wordpress.com/2009/05/21/regex/</guid>
<description><![CDATA[Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet one really has to be an artist to use it right. Even a regular expression that has been proven correct may still run extremely slowly (even take hours) if it is not written wisely.</p>
<p>Read on to understand the origin of the problem, or jump directly to the last section, which contains 10 useful performance tips for regular expression writers in Java.</p>
<h2 style="text-align:left;"><span style="color:#0000ff;">Can it really be so slow?</span></h2>
<p> Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:</p>
<p><span style="color:#993366;"><strong>(a*b*)*</strong></span></p>
<p>However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!<br />
Of course, a better regex in this case would be:</p>
<p><span style="color:#993366;"><strong>(a&#124;b)*</strong></span></p>
<p>This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.</p>
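<p>To try this yourself, here is a minimal sketch. The class and method names are made up for illustration, and the input is deliberately much shorter than the one above so that the slow case still finishes quickly. Exact timings depend on the JVM version and the hardware, so treat the numbers as indicative only.</p>

```java
import java.util.regex.Pattern;

final class CatastrophicDemo {
    // Returns how long (in nanoseconds) one full-string match attempt takes.
    static long timeMatch(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        p.matcher(input).matches();
        return System.nanoTime() - start;
    }

    // Builds n copies of 'a' followed by one 'x': an input that "almost" matches.
    static String almostMatching(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append('x').toString();
    }

    public static void main(String[] args) {
        // Kept short on purpose: each extra 'a' can roughly double the slow case.
        String input = almostMatching(18);
        System.out.println("(a*b*)* : " + timeMatch("(a*b*)*", input) + " ns");
        System.out.println("(a|b)*  : " + timeMatch("(a|b)*", input) + " ns");
    }
}
```

<p>Increasing the argument of almostMatching by a few characters at a time is an easy way to watch the gap between the two patterns grow.</p>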
<h2><span style="color:#0000ff;">Why does this happen?</span></h2>
<p> Like most regex engines, Java uses the NFA (Nondeterministic Finite Automaton) approach. The engine scans the regex components one by one and advances through the input string accordingly. However, when it hits a “dead end” it goes back to explore alternative ways of matching. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. <strong><span style="color:#993366;">a&#124;b&#124;c&#124;d</span></strong>). This exploration technique is called <em>backtracking</em>.<br />
In the catastrophic example above, the engine actually explores ALL the decompositions of the series of ‘a’s into smaller series before it realizes that there is no match. This example shows how the backtracking algorithm can result in <strong>exponential time</strong> evaluation (in the length of the input string). It also reveals an important property of NFA engines: the worst cases are typically inputs that <strong>almost</strong> match the pattern, because the exploration stops as soon as a match is found.</p>
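<p>To get a feeling for how fast the number of decompositions grows, the sketch below counts the ordered ways of cutting a run of n identical characters into non-empty chunks. This is a simplified model of the search space, not the engine’s actual algorithm, and the class name is hypothetical.</p>

```java
final class SplitCounter {
    // Counts the ordered ways to cut a run of n identical characters
    // into non-empty chunks (the count doubles with each extra character).
    static long countSplits(int n) {
        if (n == 0) {
            return 1; // nothing left to cut: exactly one way
        }
        long ways = 0;
        for (int first = 1; first <= n; first++) {
            // Pick the length of the first chunk, then split the remainder.
            ways += countSplits(n - first);
        }
        return ways;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 20; n++) {
            System.out.println(n + " characters: " + countSplits(n) + " decompositions");
        }
    }
}
```

<p>For a run of n characters the count is 2<sup>n-1</sup>, which is why every additional ‘a’ in an almost-matching input can double the work.</p>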
<p>The other main approach for regex engines is DFA (Deterministic Finite Automaton). In this approach, compiling the regex actually builds an automaton, which is then used to traverse input strings. Inputs are traversed character by character, with no going back. This guarantees <strong>linear time</strong> in the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), a DFA simulates a parallel scan of all possibilities.</p>
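<p>As an illustration of the DFA idea, here is a hand-written automaton for <strong>(a|b)*</strong>. It is only a sketch: java.util.regex does not work this way, and the state names are my own. The point is that every input character is looked at exactly once.</p>

```java
final class TinyDfa {
    private static final int ACCEPT = 0; // only 'a' and 'b' seen so far
    private static final int DEAD = 1;   // some other character was seen

    // Transition function: one step per input character, never going back.
    static int next(int state, char c) {
        if (state == DEAD) {
            return DEAD;
        }
        return (c == 'a' || c == 'b') ? ACCEPT : DEAD;
    }

    // Runs the automaton over the whole input in a single left-to-right pass.
    static boolean matches(String input) {
        int state = ACCEPT;
        for (int i = 0; i < input.length(); i++) {
            state = next(state, input.charAt(i));
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(matches("aabb"));
        System.out.println(matches("aaaaaaaaaaaaaaaaaaaaaaaaaaaaax"));
    }
}
```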
<p>So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:</p>
<ul>
<li>Its compilation is faster, and requires much less memory</li>
<li>It allows some powerful features (See <a href="http://java.sun.com/docs/books/tutorial/essential/regex/" target="_blank">Sun’s tutorial </a>for details on them):
<ul>
<li><em>Capturing groups</em> and <em>back references</em></li>
<li><em>Lookarounds</em></li>
<li>Advanced quantifiers (<em>Possessive</em> and <em>Reluctant</em>)</li>
</ul>
</li>
</ul>
<p>It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these two models have the same computational power, meaning that there can&#8217;t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in what they can express.</p>
<p>Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.</p>
<h2><span style="color:#0000ff;">Tips</span></h2>
<p><strong> </strong>Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.</p>
<p><strong> 1)</strong>  <strong>Pre-compile</strong><br />
Trivial, but worth mentioning. If you use a regular expression more than once, compile it once as a preprocessing step and reuse the resulting Pattern:<br />
// Preprocessing stage: compile once<br />
Pattern p = Pattern.compile(regex, flags);<br />
&#8230;</p>
<p>// Usage stage: create a Matcher for each input<br />
Matcher m = p.matcher(input);</p>
<p><strong> 2)  Reluctant quantifiers vs greedy quantifiers<br />
</strong>The default quantifiers (*, + and ?) are greedy: they start by matching the longest possible sequence, and then give characters back gradually if backtracking is needed. If you know that the match is usually short, use <em>reluctant quantifiers</em> instead. They start with the shortest possible match and extend it only if needed.<br />
For example, suppose that you want to match only strings containing the sequence “hello”. The regex <strong><span style="color:#993366;">.*hello.*</span></strong> will do the job, but if you know that “hello” usually appears near the beginning of the text, then <strong><span style="color:#993366;">.*?hello.*</span></strong> will run faster on average.</p>
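<p>The following sketch shows the two variants side by side. Both accept the same strings, but the extra capturing group (added here only for illustration) reveals how differently they get there: the greedy version runs to the end and backs up, while the reluctant one stops at the first “hello”.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    static final Pattern GREEDY = Pattern.compile(".*hello.*");
    static final Pattern RELUCTANT = Pattern.compile(".*?hello.*");

    // True when the whole text matches the given pattern.
    static boolean contains(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    // Returns what the leading quantifier consumed before the chosen "hello",
    // or null when the text does not match at all.
    static String prefix(String quantified, String text) {
        Matcher m = Pattern.compile("(" + quantified + ")hello.*").matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "hello world, hello again";
        System.out.println(contains(GREEDY, text) + " / " + contains(RELUCTANT, text));
        System.out.println("greedy prefix:    [" + prefix(".*", text) + "]");
        System.out.println("reluctant prefix: [" + prefix(".*?", text) + "]");
    }
}
```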
<p><strong> 3)  Use possessive quantifiers where possible</strong><br />
Unlike reluctant quantifiers, which affect performance but not whether the whole input matches, possessive quantifiers may really change the meaning of the regex.<br />
When using *+ instead of *, the first attempted match is greedy (i.e. longest match, just as with *), but there is no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?<br />
Suppose that you need to match text in quotes. The regex <strong><span style="color:#993366;">\&#34;[^\&#34;]*\&#34;</span></strong> will work fine. However, it will do unnecessary backtracking on negative cases (e.g. &#34;bla bla bla, with no closing quote). Using the regex <strong><span style="color:#993366;">\&#34;[^\&#34;]*+\&#34;</span></strong> instead eliminates the backtracking without changing the meaning of the regex.</p>
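<p><em>Independent grouping</em> (also known as an atomic group, written <strong><span style="color:#993366;">(?&#62;X)</span></strong>) can have the same effect, and it allows even more control (see <a href="http://java.sun.com/docs/books/tutorial/essential/regex/" target="_blank">Sun’s tutorial</a>).</p>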
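<p>Here is a small sketch of the quoted-text case with a greedy, a possessive and an independent-group variant, plus one case (a*+a) where the possessive quantifier does change the result. The class and field names are hypothetical.</p>

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    static final Pattern GREEDY = Pattern.compile("\"[^\"]*\"");
    static final Pattern POSSESSIVE = Pattern.compile("\"[^\"]*+\"");
    static final Pattern INDEPENDENT = Pattern.compile("\"(?>[^\"]*)\"");

    // True when the whole string is a double-quoted text according to p.
    static boolean quoted(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String good = "\"bla bla bla\"";
        String bad = "\"bla bla bla"; // closing quote is missing
        for (Pattern p : new Pattern[] {GREEDY, POSSESSIVE, INDEPENDENT}) {
            System.out.println(p + " -> " + quoted(p, good) + ", " + quoted(p, bad));
        }
        // Possessive quantifiers can change the meaning: a*+ keeps every 'a',
        // so nothing is left for the final 'a' and the match fails.
        System.out.println(Pattern.matches("a*a", "aaa"));
        System.out.println(Pattern.matches("a*+a", "aaa"));
    }
}
```

<p>All three quoted-text patterns agree on both inputs. The difference is only in how much work is done before the negative case is rejected.</p>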
<p><strong>4)  Avoid capturing groups</strong><br />
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with <strong><span style="color:#993366;">(?:</span></strong> instead of with <strong><span style="color:#993366;">(</span></strong>.</p>
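<p>A quick way to check how many groups a pattern really captures is Matcher.groupCount(). The example patterns below are made up for illustration.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCountDemo {
    // Number of capturing groups declared by the regex.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns the text captured by group 1, or null when there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groups("(ab)+(cd)"));   // both groups capture
        System.out.println(groups("(?:ab)+(cd)")); // only (cd) captures
        System.out.println(firstGroup("(?:ab)+(cd)", "ababcd"));
    }
}
```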
<p><strong>5)  Use alternation wisely<br />
</strong>When using alternation (e.g. <strong><span style="color:#993366;">Paul&#124;Jane&#124;Chris</span></strong>), the engine tries the options in the order in which they appear. You can take advantage of this by ordering your options from the most common to the least common. This improves the average time of positive matches.</p>
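<p>The left-to-right order is easy to observe when one option is a prefix of another. In this sketch (the name “Jan” is added just for the demonstration) the first option that succeeds wins, even though a longer one would also fit.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrder {
    // Returns the first substring matched by the regex, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("Jan|Jane", "Jane")); // "Jan" is tried first
        System.out.println(firstMatch("Jane|Jan", "Jane")); // "Jane" is tried first
    }
}
```

<p>So reordering alternatives for speed is safe only when it does not change which option ends up matching.</p>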
<p><strong> 6)  Avoid multiple interpretations<br />
</strong>Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex <strong><span style="color:#993366;">(a*b*)*</span></strong> given at the beginning of this article allows the string “aabb” to be interpreted in too many ways, where each pair of parentheses below is one iteration of the group:<br />
(aabb)<br />
(a)(a)(b)(b)<br />
(aa)(bb)<br />
(a)(abb)<br />
etc…</p>
<p>The regex <strong><span style="color:#993366;">(a&#124;b)*</span></strong> on the other hand, forces a unique interpretation on a positive input.<br />
This is very important in order to reduce backtracking, in cases of <em>almost</em> a match.</p>
<p><strong> 7)</strong>  <strong>Lookarounds<br />
</strong><em>Lookarounds</em> allow adding restrictions on the sequences to the left or right of the current position. Specifically, with <em>negative lookahead</em> you can search for strings that <strong>do not</strong> contain some sequence (a cumbersome thing to do without this feature!). How can this help improve performance?</p>
<p>Suppose you want to capture the URL from a link tag. Consider the following regex:<br />
<strong><span style="color:#993366;">&#60;a .* href=(\S*).*/&#62;</span></strong><br />
For legal tags it will find a match and capture only the href attribute as required (<strong><span style="color:#993366;">\S</span></strong> stands for a non-whitespace character). On some illegal tags, however, excessive backtracking will occur. For example: “&#60;a href= href=href=…. href=something”.<br />
The following regex prevents this by replacing the “.*” expression with something more specific that does not match “href”:<br />
<strong><span style="color:#993366;">&#60;a ((?!href).)* href=(\S*)((?!href).)*/&#62;</span></strong></p>
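<p>The sketch below runs both regexes from this tip against a legal and an illegal tag. The sample tags and the class name are my own. Note that in the second regex the URL ends up in group 2, because the lookahead construct is itself wrapped in a capturing group.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    static final Pattern NAIVE =
            Pattern.compile("<a .* href=(\\S*).*/>");
    static final Pattern GUARDED =
            Pattern.compile("<a ((?!href).)* href=(\\S*)((?!href).)*/>");

    // URL captured by the ".*" version (group 1), or null when the tag is rejected.
    static String naiveUrl(String tag) {
        Matcher m = NAIVE.matcher(tag);
        return m.matches() ? m.group(1) : null;
    }

    // URL captured by the lookahead version (group 2), or null when the tag is rejected.
    static String guardedUrl(String tag) {
        Matcher m = GUARDED.matcher(tag);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String legal = "<a class=x href=http://example.com />";
        String illegal = "<a href= href=href= href=something";
        System.out.println(naiveUrl(legal) + " / " + guardedUrl(legal));
        System.out.println(naiveUrl(illegal) + " / " + guardedUrl(illegal));
    }
}
```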
<p><strong> 8)  Specify lengths<br />
</strong>Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows it to fail the match immediately in some cases. To help this mechanism, specify explicit repetition counts whenever possible (e.g. <strong><span style="color:#993366;">[01]{6}</span></strong>, which matches all binary strings of length 6).</p>
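<p>For completeness, this is what the fixed-length pattern looks like in use. The sketch only demonstrates the pattern’s behavior; it makes no claim about what the optimizer does internally.</p>

```java
import java.util.regex.Pattern;

final class FixedLengthDemo {
    static final Pattern SIX_BITS = Pattern.compile("[01]{6}");

    // True only for strings made of exactly six binary digits.
    static boolean isSixBits(String s) {
        return SIX_BITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSixBits("010101"));
        System.out.println(isSixBits("0101010")); // one digit too many
        System.out.println(isSixBits("01012x"));  // right length, wrong characters
    }
}
```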
<p><strong> 9)  Extract mandatory strings<br />
</strong>Sometimes, strings which are mandatory are hidden inside groups or alternations:<br />
<span style="color:#993366;"><strong>(hello&#124;hell&#124;heel)</strong></span><br />
This expression could be simplified to:<br />
<span style="color:#993366;"><strong>he(llo&#124;ll&#124;el)</strong></span><br />
By doing so, the regex optimizer has more information to work with.</p>
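<p>When refactoring a regex like this, it is worth checking that both versions still accept exactly the same strings. A minimal sketch of such a check, with a hand-picked list of sample inputs:</p>

```java
import java.util.regex.Pattern;

final class FactoringCheck {
    static final Pattern ORIGINAL = Pattern.compile("(hello|hell|heel)");
    static final Pattern FACTORED = Pattern.compile("he(llo|ll|el)");

    // True when both patterns give the same verdict for the whole string.
    static boolean agree(String s) {
        return ORIGINAL.matcher(s).matches() == FACTORED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "hell", "heel", "he", "help", "hellos", ""};
        for (String s : samples) {
            System.out.println("[" + s + "] agree: " + agree(s));
        }
    }
}
```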
<p><strong> 10)  Benchmark your regex<br />
</strong>When the regex is used in a performance-critical area of your application, it is wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test inputs of different lengths, and also inputs that <em>almost match</em> your pattern.</p>
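<p>A very rough micro-benchmark could look like the sketch below. The warm-up and run counts are arbitrary assumptions and the names are hypothetical. JVM micro-benchmarks are easy to get wrong, so for serious measurements a dedicated harness such as JMH is a better choice.</p>

```java
import java.util.regex.Pattern;

final class RegexBench {
    // Average nanoseconds per matches() call, measured after a warm-up phase.
    static long averageNanos(Pattern p, String input, int warmup, int runs) {
        boolean sink = false;
        for (int i = 0; i < warmup; i++) {
            sink ^= p.matcher(input).matches();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink ^= p.matcher(input).matches();
        }
        long elapsed = System.nanoTime() - start;
        if (sink) {
            System.out.print(""); // use the result so the loops are not dropped
        }
        return elapsed / runs;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a|b)*");
        // Different lengths, including inputs that almost match (trailing 'x').
        String[] inputs = {"ab", "abababababababab", "ababababababababx"};
        for (String input : inputs) {
            System.out.println(input.length() + " chars: "
                    + averageNanos(p, input, 1000, 10000) + " ns/match");
        }
    }
}
```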
<h3><span style="color:#0000ff;">Links:</span></h3>
<address><strong> </strong><a href="http://java.sun.com/docs/books/tutorial/essential/regex/index.html">http://java.sun.com/docs/books/tutorial/essential/regex/index.html</a></address>
<address><a href="http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1">http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1</a></address>
<address><a href="http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html" target="_blank">http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html</a></address>
<address><a href="http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/" target="_blank">http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/</a></address>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Flex regular expression cheatsheet]]></title>
<link>http://guavus.wordpress.com/2009/05/21/flex-regular-expression-cheatsheet/</link>
<pubDate>Thu, 21 May 2009 04:00:40 +0000</pubDate>
<dc:creator>vx</dc:creator>
<guid>http://guavus.wordpress.com/2009/05/21/flex-regular-expression-cheatsheet/</guid>
<description><![CDATA[To see the large version, right-click the image and save it.]]></description>
<content:encoded><![CDATA[To see the large version, right-click the image and save it.]]></content:encoded>
</item>

</channel>
</rss>
