SZC’s Agglomeration

{an agglomeration of laconic notes in relation to anything that sparks my interest}

regular expressions (a.k.a. regEx) tutorial #1

Posted by Stefan Camilleri on {2008.August.10}

/** Over the past few years, many developers have commented on their lack of understanding of how regular expressions (regEx) are written; yet all admit how helpful it would be if they knew how to create them */

I am therefore going to jot down a quick ‘teach yourself in 5 minutes’ for basic regEx. I will not go into too much detail, since most advanced features are not used on a day-to-day basis anyway, yet even the simplest regEx can be really helpful in coding, even for simple search and replace operations in your IDE. I assure you that once you get the hang of them, you will fall in love (as I have) {

The Basics

regEx is basically a pattern search. Most of you will already know this, but I’m stating it just in case. When you enter the word you are looking for in a find menu, that word is, in actuality, a pattern.

For this tutorial, I’ll use a mnemonic which I like -> ‘Californians Like Girls in Small Bikinis’ (well not only Californians maybe ;) )

I will be using JavaScript notation for the regEx here, yet you can use the same patterns in most other engines by omitting the / from the start and end of the expression.
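To make the notation concrete, here is a minimal sketch of the same pattern written with and without the slashes. It assumes a JavaScript runtime such as Node or a browser console; the variable names are only illustrative.

```javascript
// The same pattern written two ways in JavaScript.
const literal = /al/;                // literal notation, with the surrounding slashes
const fromString = new RegExp("al"); // the bare pattern as a string, no slashes

const sentence = "Californians Like Girls in Small Bikinis";
console.log(literal.source);            // al
console.log(fromString.test(sentence)); // true
```

The string form is roughly what you would hand to an engine that does not use the slash delimiters.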

I’m going to try not to overload my tutorial with tech explanations like lexicons and tokens, or how we could break this down using BNF, since I’m pretty sure that anyone familiar with them can figure them out for themselves.

The structure I’m using is:    /Regular Expression/   ->   Matches highlighted in my mnemonic
Followed by a brief explanation.

Simple literal regEx match

  1. /al/ -> Californians Like Girls in Small Bikinis
  2. /ni/ -> Californians Like Girls in Small Bikinis

In this case, this is a simple literal match, returning the instances of the searched token. The expression matches the first instance first (for /al/, the ‘al’ in ‘Californians’); the later instance (the ‘al’ in ‘Small’) is returned on subsequent matches.
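The first-match and subsequent-match behaviour can be checked with a short sketch. It assumes JavaScript’s g (global) flag, which the examples above do not show, to step through every instance.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// Without the g flag, match() stops at the first instance.
const first = sentence.match(/al/);
console.log(first[0], first.index); // al 1

// With the g flag, matchAll() walks through every instance in turn.
const positions = [...sentence.matchAll(/al/g)].map((m) => m.index);
console.log(positions); // [ 1, 29 ]
```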

Simple wildcard regEx match

  1. /.al/ -> Californians Like Girls in Small Bikinis
  2. /ni./ -> Californians Like Girls in Small Bikinis

Here we introduce the . wildcard, which means ‘any character’. So the first term matches ‘any character followed by a and l’, and the second matches ‘n and i followed by any character’.
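A quick way to see what the wildcard picks up is to list every match. This is only a sketch; the g flag is my addition so that all matches are returned rather than just the first.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// Any one character followed by "al".
const dotAl = sentence.match(/.al/g);
console.log(dotAl); // [ 'Cal', 'mal' ]

// "ni" followed by any one character.
const niDot = sentence.match(/ni./g);
console.log(niDot); // [ 'nia', 'nis' ]
```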

Unbound repetition of patterns

  1. /.*al/ -> Californians Like Girls in Small Bikinis
  2. /ni.*/ -> Californians Like Girls in Small Bikinis
  3. /.*Cal/ -> Californians Like Girls in Small Bikinis

The * symbol is what we call the ‘Kleene Closure’ (also known as the Kleene star). It means ‘0 or more instances of’ whatever precedes it, so what we are searching for here is ‘0 or more instances of any character, followed by al’.

Note that in the first example the match runs all the way to the second al, not the first: the * is greedy, so .* grabs as much of the line as it can while still leaving an al to match. Also note the third example: since we state ‘0 or more’, ‘Cal’ is still matched, with .* matching nothing.
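The three patterns above can be run as follows to see exactly how far each match stretches. A minimal sketch; the variable names are illustrative.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// .* takes as much as it can, so the match ends at the last "al" on the line.
const upToLastAl = sentence.match(/.*al/)[0];
console.log(upToLastAl); // Californians Like Girls in Smal

// Starts at the first "ni" and runs to the end of the line.
const fromFirstNi = sentence.match(/ni.*/)[0];
console.log(fromFirstNi); // nians Like Girls in Small Bikinis

// .* is allowed to match nothing at all, so "Cal" on its own still matches.
const zeroChars = sentence.match(/.*Cal/)[0];
console.log(zeroChars); // Cal
```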

One-or-More repetition

  1. /.+al/ -> Californians Like Girls in Small Bikinis
  2. /ni.+/ -> Californians Like Girls in Small Bikinis
  3. /.+Cal/ -> Californians Like Girls in Small Bikinis

This is exactly like the previous example, with one difference: the + means ‘1 or more’. So in the third example nothing is matched, since there is no character in front of ‘Cal’.
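The difference between * and + shows up most clearly on the third pattern. In this sketch, the string with ‘Sunny ’ in front is a made-up input of mine, added only to show + succeeding once there is something before ‘Cal’.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// * accepts zero characters before "Cal"; + insists on at least one.
const starMatches = /.*Cal/.test(sentence);
const plusMatches = /.+Cal/.test(sentence);
console.log(starMatches, plusMatches); // true false

// Hypothetical input: with text in front of "Cal", + is satisfied.
const withPrefix = ("Sunny " + sentence).match(/.+Cal/)[0];
console.log(withPrefix); // Sunny Cal
```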

Character classes

  1. /[Cm]al/ -> Californians Like Girls in Small Bikinis
  2. /ni[ai]/ -> Californians Like Girls in Small Bikinis

The [ ] denote a character class, which means ‘any one of these characters’. So in the first case, we are looking for a ‘C’ or an ‘m’ followed by ‘al’, which gives us two matches (‘Cal’ and ‘mal’). The second one only returns one match, ‘nia’, because the ‘ni’ in ‘Bikinis’ is followed by an ‘s’.
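Listing the matches confirms the counts. As before, this is a sketch and the g flag is my addition to collect every match.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// "C" or "m", followed by "al".
const cOrM = sentence.match(/[Cm]al/g);
console.log(cOrM); // [ 'Cal', 'mal' ]

// "ni" followed by "a" or "i"; the "nis" in "Bikinis" does not qualify.
const aOrI = sentence.match(/ni[ai]/g);
console.log(aOrI); // [ 'nia' ]
```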

Negated character classes

  1. /[^Cx]al/ -> Californians Like Girls in Small Bikinis
  2. /ni[^ai]/ -> Californians Like Girls in Small Bikinis

This is the negation of the previous example. The ^ at the start of a character class means ‘anything that isn’t one of these characters’.

It is important that the ^ is within the [ ], directly after the opening bracket.
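Run against the mnemonic, the negated classes pick up exactly the instances the plain classes would skip. A minimal sketch, again with the g flag added by me:

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

// Anything except "C" or "x", followed by "al": "Cal" is excluded, "mal" is not.
const notCx = sentence.match(/[^Cx]al/g);
console.log(notCx); // [ 'mal' ]

// "ni" followed by anything except "a" or "i": "nia" is excluded, "nis" is not.
const notAi = sentence.match(/ni[^ai]/g);
console.log(notAi); // [ 'nis' ]
```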

Start of sentence

  1. /^Californians/ -> Californians Like Girls in Small Bikinis
  2. /^Bikinis/ -> Californians Like Girls in Small Bikinis

The ^ at the start of a regEx means ‘start of string’, so we are looking for the ‘start of a string’ followed by a C, an a, etc… The second example finds nothing, since the string does not start with ‘Bikinis’.
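A short sketch of the two examples, plus the unanchored pattern for contrast (the contrast line is my addition):

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

const startsWithCalifornians = /^Californians/.test(sentence);
const startsWithBikinis = /^Bikinis/.test(sentence);
console.log(startsWithCalifornians, startsWithBikinis); // true false

// Without the ^, "Bikinis" is found wherever it appears.
const containsBikinis = /Bikinis/.test(sentence);
console.log(containsBikinis); // true
```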

End of sentence

  1. /Californians$/ -> Californians Like Girls in Small Bikinis
  2. /Bikinis$/ -> Californians Like Girls in Small Bikinis

The $ in this case matches the ‘end of a string’, so only the second example matches.
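The same check for the end of the string, as a minimal sketch. search() is used here only to show where the anchored match begins.

```javascript
const sentence = "Californians Like Girls in Small Bikinis";

const endsWithCalifornians = /Californians$/.test(sentence);
const endsWithBikinis = /Bikinis$/.test(sentence);
console.log(endsWithCalifornians, endsWithBikinis); // false true

// search() returns the index where the match starts, or -1 if there is none.
const bikinisIndex = sentence.search(/Bikinis$/);
console.log(bikinisIndex); // 33
```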

}

That will be all for this tutorial.  There will be other tutorials to follow this, which I will add to this post as links once I create them.

Any feedback would be appreciated.

Tutorials ToC:

  1. The Basics (this tutorial)
  2. Shortcuts
  3. Search and Replace (matching groups)
  4. Saving memory, and shortening your regEx
  5. Advanced: Positive and Negative lookahead and lookbehind.
  6. A quick’n dirty way to test regEx, as well as examples in C#, Java, VB and JavaScript

p.s. for anyone curious as to what that mnemonic is for, it is used to remember the 5 layers of the epidermis, starting from the outermost layer, i.e. corneum, lucidum, granulosum, spinosum and basale.

8 Responses to “regular expressions (a.k.a. regEx) tutorial #1”

  1. daSilva said

    Smashing! Looking forward to see the rest…

  2. Caxaria said

    Really nice!

    Now go on, I’ve refreshed the page 10 times and still the other parts of the tutorial are missing! :D

  3. @Caxaria: anticipation my friend ;) I’ll work on part two tonight.

  4. MyNetFaves : Public Faves Tagged Regex…

    Marked your site as regex at MyNetFaves!…

  5. @MyNetFaves

    Thank you… I hope to deliver the remaining tutorials shortly, I’ve just been busy moving from the Midlands to London :)

  6. Joe said

    how wud you parse the following into pairs with regex?

    Color Blue Flavor Vanilla Topping “Chocolate Sprinkles” Size Small

    where the result would be
    Color=Blue
    Flavor=Vanilla
    Topping=Chocolate Sprinkles
    Size=Small

    ([\"])[^\"]*\\1|[^ ]+([^ ]*) does not seem to work perfectly

  7. @Joe

    This will work specifically for the string you propose:

    Color ([^ ]+) Flavor ([^ ]+) Topping “(.+)” Size ([^ ]+)

    A more dynamic solution is this:

    (?:[^ ]+) +((?:(?:”.+”)|(?:[^ ]))+) ?

    Repeated for as many tuples as you need. It will always pick up the second value, catering also for inverted commas.

    i.e. Color Blue would pick up Blue and Color “Light Blue” would pick up “Light Blue”

    Hope it helps :)

  8. Off Ramz said

    Thanx man. One of the few tutorials about regex that I have really understood.
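For the key/value question in comments 6 and 7 above, here is a minimal sketch of how the pairs could be pulled out in JavaScript. It assumes straight double quotes in the input, and it uses a simplified pattern of my own (a key, some spaces, then either a quoted value or a single word) rather than the exact expressions from the reply; pairPattern and pairs are illustrative names.

```javascript
// Assumption: the input uses straight double quotes around multi-word values.
const input = 'Color Blue Flavor Vanilla Topping "Chocolate Sprinkles" Size Small';

// A key, one or more spaces, then either a quoted value or a single word.
const pairPattern = /([^ ]+) +("[^"]*"|[^ ]+)/g;

const pairs = [];
for (const m of input.matchAll(pairPattern)) {
  const value = m[2].replace(/^"|"$/g, ""); // drop the surrounding quotes, if any
  pairs.push(`${m[1]}=${value}`);
}
console.log(pairs.join("\n"));
```

Each match consumes one key and its value, so the loop yields Color=Blue, Flavor=Vanilla, Topping=Chocolate Sprinkles and Size=Small in turn.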
