SZC’s Agglomeration

{an agglomeration of laconic notes in relation to anything that sparks my interest}


regular expressions (a.k.a. regEx) tutorial #1

Posted by Stefan Camilleri on {2008.August.10}

/** Over the past few years, many developers have commented on their lack of understanding of how regular expressions (regEx) are written; yet all admit how helpful it would be if they knew how to create them */

I am therefore going to jot down a quick ‘teach yourself in 5 minutes’ guide to basic regEx.  I will not go into too much detail, since most advanced features are not used on a day-to-day basis anyway, yet even the simplest regEx can be really helpful in coding, even for simple search and replace operations in your IDE.  I assure you that once you get the hang of them, you will fall in love (as I have) {

The Basics

regEx is basically a pattern search.  Most would already know this, but I’m stating it just in case.  When you enter the word you are looking for in a find menu, that word is, in actuality, a pattern.

For this tutorial, I’ll use a mnemonic which I like -> ‘Californians Like Girls in Small Bikinis’ (well not only Californians maybe ;) )

I will be using JavaScript notation for the regEx here, yet you can use these in most other engines by omitting the / from the start and end of the expression.
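As a small sketch of what ‘omitting the /’ looks like in practice: JavaScript itself accepts the same pattern either as a /…/ literal or as a plain string handed to the RegExp constructor.  The variable names below are made up for illustration.

```javascript
// The same pattern written two ways.
const literalForm = /al/;                  // JavaScript literal notation, with slashes
const constructorForm = new RegExp("al");  // pattern as a plain string, no slashes

const sampleText = "Californians Like Girls in Small Bikinis";

// Both forms hold the same pattern source and behave identically.
console.log(literalForm.source === constructorForm.source); // true
console.log(literalForm.test(sampleText));                  // true
console.log(constructorForm.test(sampleText));              // true
```

The string form is roughly what you would type into an IDE’s search box or pass to another engine.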

I’m going to try not to overload my tutorial with tech explanations like lexicons and tokens, or how we could break this down using BNF, since I’m pretty sure that anyone familiar with them can figure them out for themselves.

The structure I’m using is:    /Regular Expression/   ->   Matches highlighted in my mnemonic
Followed by a brief explanation.

Simple literal regEx match

  1. /al/ -> Californians Like Girls in Small Bikinis
  2. /ni/ -> Californians Like Girls in Small Bikinis

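In this case, this is a simple literal match, returning the instances of the searched token.  The expression will match the first instance first (the ‘al’ and the ‘ni’ in ‘Californians’); the later instances (the ‘al’ in ‘Small’ and the ‘ni’ in ‘Bikinis’) are returned on subsequent matches.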
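A minimal sketch of first versus subsequent matches in JavaScript.  The g (global) flag is not used in the notation above; it is added here as an assumption, purely to ask the engine for every instance rather than just the first.

```javascript
const literalText = "Californians Like Girls in Small Bikinis";

// Without the g flag, only the first instance is reported.
const firstAl = literalText.match(/al/);
console.log(firstAl.index); // 1  (the 'al' in 'Californians')

// With the g flag, matchAll walks through every instance in turn.
const allAl = [...literalText.matchAll(/al/g)].map((m) => m.index);
const allNi = [...literalText.matchAll(/ni/g)].map((m) => m.index);

console.log(allAl); // [ 1, 29 ]  ('Californians' and 'Small')
console.log(allNi); // [ 7, 37 ]  ('Californians' and 'Bikinis')
```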

Simple wildcard regEx match

  1. /.al/ -> Californians Like Girls in Small Bikinis
  2. /ni./ -> Californians Like Girls in Small Bikinis

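Here we introduce the . wildcard.  This means ‘any character’ (in JavaScript, any character except a line break).  So the first term matches ‘any character followed by a and l’, giving ‘Cal’ and ‘mal’; the second matches ‘ni followed by any character’, giving ‘nia’ and ‘nis’.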
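The wildcard can be tried out with a sketch like the following (the g flag is again my addition, so that all matches are listed).

```javascript
const wildcardText = "Californians Like Girls in Small Bikinis";

// '.' stands in for exactly one character, whatever it is.
const dotAl = wildcardText.match(/.al/g);
const niDot = wildcardText.match(/ni./g);

console.log(dotAl); // [ 'Cal', 'mal' ]
console.log(niDot); // [ 'nia', 'nis' ]
```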

Unbound repetition of patterns

  1. /.*al/ -> Californians Like Girls in Small Bikinis
  2. /ni.*/ -> Californians Like Girls in Small Bikinis
  3. /.*Cal/ -> Californians Like Girls in Small Bikinis

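The * symbol is what we call the ‘Kleene Closure’ (or Kleene star).  It means ‘0 or more instances of’ whatever precedes it, so what we are searching for here is ‘0 or more instances of any character, followed by al’.

Note that in the first example, the match runs all the way to the second al, not the first: the * is greedy, so .* takes as much of the line as it can while still leaving an al to match.  Likewise, the second example matches from the first ‘ni’ to the end of the line.  Also note the third example: since we state ‘0 or more’, ‘Cal’ is still matched, even though nothing precedes it.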
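To see how far each * pattern stretches, here is a short sketch that prints the matched text for the three examples.

```javascript
const starSentence = "Californians Like Girls in Small Bikinis";

// match() without the g flag returns the first match; [0] is the matched text.
const starAl = starSentence.match(/.*al/)[0];
const starNi = starSentence.match(/ni.*/)[0];
const starCal = starSentence.match(/.*Cal/)[0];

console.log(starAl);  // Californians Like Girls in Smal
console.log(starNi);  // nians Like Girls in Small Bikinis
console.log(starCal); // Cal   (.* matched zero characters)
```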

One-or-More repetition

  1. /.+al/ -> Californians Like Girls in Small Bikinis
  2. /ni.+/ -> Californians Like Girls in Small Bikinis
  3. /.+Cal/ -> Californians Like Girls in Small Bikinis

This is exactly like the previous example, with one difference: the + means ‘1 or more’.  The first two examples match the same text as before, but in the third example nothing is matched, since ‘Cal’ sits at the very start of the line and there is no character before it for .+ to consume.
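The difference between * and + only shows up in the third example, as this sketch illustrates.

```javascript
const plusSentence = "Californians Like Girls in Small Bikinis";

const plusAl = plusSentence.match(/.+al/)[0];
const plusNi = plusSentence.match(/ni.+/)[0];

// match() returns null when the pattern matches nothing.
const plusCal = plusSentence.match(/.+Cal/);

console.log(plusAl);  // Californians Like Girls in Smal
console.log(plusNi);  // nians Like Girls in Small Bikinis
console.log(plusCal); // null  (.+ needs at least one character before 'Cal')
```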

Character classes

  1. /[Cm]al/ -> Californians Like Girls in Small Bikinis
  2. /ni[ai]/ -> Californians Like Girls in Small Bikinis

The [ ] denote a character class.  This means ‘any one of these characters’.  So in the first case, we are looking for a ‘C’ or an ‘m’ followed by ‘al’, which gives us two matches (‘Cal’ and ‘mal’).  The second one only returns one match, ‘nia’, since the ‘ni’ in ‘Bikinis’ is followed by an ‘s’, which is not in the class.
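A quick sketch of both character-class examples, with the g flag added (my assumption) to collect every match.

```javascript
const classText = "Californians Like Girls in Small Bikinis";

// [Cm] accepts exactly one character, which must be 'C' or 'm'.
const classAl = classText.match(/[Cm]al/g);

// [ai] accepts 'a' or 'i' after 'ni'; the 's' in 'Bikinis' is rejected.
const classNi = classText.match(/ni[ai]/g);

console.log(classAl); // [ 'Cal', 'mal' ]
console.log(classNi); // [ 'nia' ]
```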

Negated character classes

  1. /[^Cx]al/ -> Californians Like Girls in Small Bikinis
  2. /ni[^ai]/ -> Californians Like Girls in Small Bikinis

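This is the negation of the previous example.  The ^ at the start of a character class means ‘any character that isn’t one of these characters’.  So the first example matches only ‘mal’ (the ‘al’ in ‘Californians’ is preceded by a ‘C’), and the second matches only ‘nis’.

It is important that the ^ is the first character within the [  ]; anywhere else in the class it is treated as a literal ^.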
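A sketch of the negated classes follows.  One detail worth seeing in action is that a negated class still has to consume a character: it means ‘some character that is not one of these’, not ‘nothing at all’.

```javascript
const negatedText = "Californians Like Girls in Small Bikinis";

// [^Cx] accepts any single character except 'C' and 'x'.
const negatedAl = negatedText.match(/[^Cx]al/g);
const negatedNi = negatedText.match(/ni[^ai]/g);

console.log(negatedAl); // [ 'mal' ]
console.log(negatedNi); // [ 'nis' ]

// With no character at all in front of 'al', the negated class cannot match.
const bareAl = /[^Cx]al/.test("al");
console.log(bareAl); // false
```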

Start of sentence

  1. /^Californians/ -> Californians Like Girls in Small Bikinis
  2. /^Bikinis/ -> Californians Like Girls in Small Bikinis

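The ^ at the start of a regEx means ‘start of string’, so we are looking for the ‘start of a string followed by a C, an a, etc…’.  The first example matches; the second does not, since ‘Bikinis’ is not at the start of the string.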
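A minimal sketch using test(), which simply reports whether the pattern matches anywhere it is allowed to.

```javascript
const startText = "Californians Like Girls in Small Bikinis";

// ^ pins the match to the very start of the string.
const startsWithCalifornians = /^Californians/.test(startText);
const startsWithBikinis = /^Bikinis/.test(startText);

console.log(startsWithCalifornians); // true
console.log(startsWithBikinis);      // false ('Bikinis' is present, but not at the start)
```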

End of sentence

  1. /Californians$/ -> Californians Like Girls in Small Bikinis
  2. /Bikinis$/ -> Californians Like Girls in Small Bikinis

The $ in this case matches the ‘end of a string’, so only the second example matches: the string ends in ‘Bikinis’, not ‘Californians’.
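The same kind of sketch for $, plus one combined pattern (my own example, not one from the list above) that uses both anchors together with .* from earlier.

```javascript
const endText = "Californians Like Girls in Small Bikinis";

// $ pins the match to the very end of the string.
const endsWithCalifornians = /Californians$/.test(endText);
const endsWithBikinis = /Bikinis$/.test(endText);

console.log(endsWithCalifornians); // false
console.log(endsWithBikinis);      // true

// Both anchors at once: the whole string must start and end as stated.
const wholeLine = /^Californians.*Bikinis$/.test(endText);
console.log(wholeLine); // true
```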

}

That will be all for this tutorial.  There will be other tutorials to follow this, which I will add to this post as links once I create them.

Any feedback would be appreciated.

Tutorials ToC:

  1. The Basics (this tutorial)
  2. Shortcuts
  3. Search and Replace (matching groups)
  4. Saving memory, and shortening your regEx
  5. Advanced: Positive and Negative lookahead and lookbehind.
  6. A quick’n dirty way to test regEx, as well as examples in C#, Java, VB and JavaScript

p.s. for anyone curious as to what that mnemonic is for, it is basically used to remember the 5 layers of the epidermis, starting from the outermost layer, i.e. corneum, lucidum, granulosum, spinosum and basale.

Posted in dev, regex, tutorial