Thursday, December 26, 2013

Regular Expression

Regular Expressions and Patterns

Regular expressions are very powerful tools for performing pattern matches.
So how are regular expressions implemented in JavaScript? There are two ways:
1.      Using literal syntax
var RegularExpression = /pattern/

2.      Using the RegExp() constructor, which takes the pattern as a string. This is useful when the pattern is not known ahead of time and must be constructed dynamically.
var RegularExpression  =  new RegExp("pattern");

A pattern passed to RegExp() must be enclosed in quotes, and any backslash in it must itself be escaped so that the special character keeps its meaning after the string is parsed (i.e. "\d" must be written as "\\d").
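As a rough illustration of the escaping rule, the sketch below builds the same digit pattern both ways. The variable names and the sample string "room 42" are made up for this example.

```javascript
// Literal syntax: the backslash is written once.
const literalForm = /\d+/;

// Constructor syntax: the pattern is a string, so the backslash is doubled.
const constructorForm = new RegExp("\\d+");

// With a single backslash the string collapses to "d+", which looks for the letter "d".
const unescapedForm = new RegExp("\d+");

console.log(literalForm.test("room 42"));     // true
console.log(constructorForm.test("room 42")); // true
console.log(unescapedForm.test("room 42"));   // false
console.log(unescapedForm.source);            // d+
```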
Example (check input for 5 digit number)
Let’s deconstruct the regular expression used, which checks that a string contains a valid 5-digit number, and ONLY a 5-digit number:

var re5digit=/^\d{5}$/;

  • ^ indicates the beginning of the string. Using a ^ metacharacter requires that the match start at the beginning.
  • \d indicates a digit character and the {5} following it means that there must be 5 consecutive digit characters.
  • $ indicates the end of the string. Using a $ metacharacter requires that the match end at the end of the string.
Translated to English, this pattern states: "Starting at the beginning of the string there must be nothing other than 5 digits. There must also be nothing following those 5 digits." 
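To see each part of the pattern doing its job, here is a small sketch that runs it against a few sample inputs. The inputs are hypothetical and chosen only to exercise ^, {5} and $.

```javascript
const re5digit = /^\d{5}$/;

const exactlyFive = re5digit.test("90210");   // five digits and nothing else
const tooShort = re5digit.test("9021");       // only four digits
const tooLong = re5digit.test("902101");      // $ cannot match after the fifth digit
const leadingText = re5digit.test("a90210");  // ^ requires the digits to start the string

console.log(exactlyFive); // true
console.log(tooShort);    // false
console.log(tooLong);     // false
console.log(leadingText); // false
```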

Pattern flags (switches)

i
Ignore the case of characters.
/The/i matches "the" and "The" and "tHe"
g
Global search for all occurrences of a pattern.
/ain/g matches both "ain"s in "No pain no gain", instead of just the first.
gi
Global search, ignore case.
/it/gi matches all "it"s in "It is our IT department"
m
Multiline mode. Causes ^ to match beginning of line or beginning of string. Causes $ to match end of line or end of string. JavaScript 1.5+ only.
/hip$/m matches "hip" as well as "hip\nhop"
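The effect of each flag can be checked with String.match() and RegExp.test(). This is only a sketch using the sample strings from the descriptions above; the variable names are invented for the example.

```javascript
// Without "g", only the first occurrence is reported.
const firstAin = "No pain no gain".match(/ain/);
// With "g", every occurrence is returned.
const allAin = "No pain no gain".match(/ain/g);
// "g" and "i" combined: all occurrences, regardless of case.
const allIt = "It is our IT department".match(/it/gi);
// Without "m", $ only matches at the very end of the string.
const singleLine = /hip$/.test("hip\nhop");
// With "m", $ also matches just before a line break.
const multiLine = /hip$/m.test("hip\nhop");

console.log(firstAin.index); // 4
console.log(allAin);         // [ 'ain', 'ain' ]
console.log(allIt);          // [ 'It', 'IT' ]
console.log(singleLine);     // false
console.log(multiLine);      // true
```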

Position Matching

^
Only matches the beginning of a string.
/^The/ matches "The" in "The night" but not in "In The Night"
$
Only matches the end of a string.
/and$/ matches "and" in "Land" but not in "landing"
\b
Matches any word boundary (test characters must exist at the beginning or end of a word within the string).
/ly\b/ matches "ly" in "This is really cool."
\B
Matches any non-word boundary.
/\Bor/ matches "or" in "normal" but not in "origami."
(?=pattern)
A positive lookahead. Requires that the given pattern follows at this point in the input. The pattern is not included as part of the actual match.
/\d+(?= chapters)/ matches digits only when they are followed by " chapters", such as "2" in "We read 2 chapters", though not the "2" in "I have 2 kids."
(?!pattern)
A negative lookahead. Requires that the given pattern does not follow at this point in the input. The pattern is not included as part of the actual match.
/JavaScript(?! Kit)/ matches any occurrence of the word "JavaScript" except when it's inside the phrase "JavaScript Kit"
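A short sketch of the position-matching tokens in action follows. The sentence about kids and cats is a hypothetical input, used to show that a lookahead checks what follows without consuming it.

```javascript
// \b: "ly" must sit at the end of a word.
const wordEnd = "This is really cool.".match(/ly\b/);

// \B: "or" must NOT start at a word boundary.
const insideWord = /\Bor/.test("normal");
const atWordStart = /\Bor/.test("origami");

// Positive lookahead: digits only count when " cats" follows them.
const beforeCats = "I have 2 kids and 3 cats".match(/\d+(?= cats)/);

// Negative lookahead: "JavaScript" only counts when " Kit" does not follow.
const followedByKit = /JavaScript(?! Kit)/.test("JavaScript Kit");
const followedByOther = /JavaScript(?! Kit)/.test("JavaScript rocks");

console.log(wordEnd[0], wordEnd.index); // ly 12
console.log(insideWord);                // true
console.log(atWordStart);               // false
console.log(beforeCats[0]);             // 3 (the lookahead text is not part of the match)
console.log(followedByKit);             // false
console.log(followedByOther);           // true
```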


Literals

All alphabetical and numerical characters match themselves literally. So /2 days/ will match "2 days" inside a string.
\0
Matches NUL character.
\n
Matches a new line character.
\f
Matches a form feed character.
\r
Matches carriage return character.
\t
Matches a tab character.
\v
Matches a vertical tab character.
[\b]
Matches a backspace.
\xxx
Matches the ASCII character expressed by the octal number xxx.
"\50" matches the left parenthesis character "("
\xdd
Matches the ASCII character expressed by the hex number dd.
"\x28" matches the left parenthesis character "("
\uxxxx
Matches the character expressed by the four-digit hex Unicode value xxxx.
"\u00A3" matches "£".
The backslash (\) is also used when you wish to match a special character literally. For example, if you wish to match the symbol "$" literally instead of have it signal the end of the string, backslash it: /\$/
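The escapes above can be tried out directly. The following is a minimal sketch; the sample strings are invented for the example.

```javascript
const hexParen = /\x28/.test("(304)");         // \x28 is "("
const unicodePound = /\u00A3/.test("£5");      // \u00A3 is the pound sign
const tabChar = /\t/.test("a\tb");             // \t is a tab
const literalDollar = /\$/.test("Price: $5");  // escaped: a real dollar sign
const anchorDollar = /$5/.test("Price: $5");   // unescaped: "end of string, then 5", which can never match

console.log(hexParen);      // true
console.log(unicodePound);  // true
console.log(tabChar);       // true
console.log(literalDollar); // true
console.log(anchorDollar);  // false
```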

Character Classes

[xyz]
Match any one character enclosed in the character set. You may use a hyphen to denote a range. For example, /[a-z]/ matches any lowercase letter in the alphabet, and /[0-9]/ any single digit.
/[AN]BC/ matches "ABC" and "NBC" but not "BBC" since the leading "B" is not in the set.
[^xyz]
Match any one character not enclosed in the character set. The caret indicates that none of the characters in the set should be matched.
NOTE: the caret used within a character class is not to be confused with the caret that denotes the beginning of a string. Negation is only performed within the square brackets.
/[^AN]BC/ matches "BBC" but not "ABC" or "NBC".
.
(Dot). Match any character except newline or another Unicode line terminator.
/b.t/ matches "bat", "bit", "bet" and so on.
\w
Match any single alphanumeric character including the underscore. Equivalent to [a-zA-Z0-9_].
/\w/ matches "2" in "200%", and /\w+/ matches "200"
\W
Match any single non-word character. Equivalent to [^a-zA-Z0-9_].
/\W/ matches "%" in "200%"
\d
Match any single digit. Equivalent to [0-9].
\D
Match any single non-digit. Equivalent to [^0-9].
/\D/ matches "N" in "No 342222"
\s
Match any single space character. Equivalent to [ \t\r\n\v\f].
\S
Match any single non-space character. Equivalent to [^ \t\r\n\v\f].
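Here is a quick sketch of how the character classes behave, using the sample strings from this section. One thing worth noticing is that a class matches exactly one character unless a quantifier follows it.

```javascript
const inSet = /[AN]BC/.test("NBC");        // "N" is in the set
const notInSet = /[AN]BC/.test("BBC");     // "B" is not
const negatedSet = /[^AN]BC/.test("BBC");  // "B" is anything but "A" or "N"

const dotMatches = "bat bit bet".match(/b.t/g);
const oneWordChar = "200%".match(/\w/)[0];   // a single character
const wordRun = "200%".match(/\w+/)[0];      // one or more characters
const nonWord = "200%".match(/\W/)[0];
const digits = "No 342222".match(/\d+/)[0];
const firstNonDigit = "No 342222".match(/\D/)[0];

console.log(inSet, notInSet, negatedSet); // true false true
console.log(dotMatches);                  // [ 'bat', 'bit', 'bet' ]
console.log(oneWordChar, wordRun);        // 2 200
console.log(nonWord);                     // %
console.log(digits);                      // 342222
console.log(firstNonDigit);               // N
```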


Repetition

{x}
Match exactly x occurrences of a regular expression.
/\d{5}/ matches 5 digits.
{x,}
Match x or more occurrences of a regular expression.
/\s{2,}/ matches at least 2 whitespace characters.
{x,y}
Matches x to y number of occurrences of a regular expression.
/\d{2,4}/ matches at least 2 but no more than 4 digits.
?
Match zero or one occurrences. Equivalent to {0,1}.
/a\s?b/ matches "ab" or "a b".
"?" can also be used following one of the quantifiers *, +, ?, or {} to make the latter match non-greedy, that is, the minimum number of times rather than the default maximum. For example, using the string "He counted 12345", the expression /\d+/ matches "12345", while /\d+?/ matches just "1", the minimum match.
/\d{2,4}?/ matches "12" in the string "12345" instead of "1234" due to "?" at the end of the quantifier.
*
Match zero or more occurrences. Equivalent to {0,}.
/we*/ matches "w" in "why" and "wee" in "between", but nothing in "bad"
+
Match one or more occurrences. Equivalent to {1,}.
/fe+d/ matches both "fed" and "feed"
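Greedy versus non-greedy matching is easiest to see side by side. The sketch below reuses the sample strings from this section; the variable names are made up for illustration.

```javascript
const sentence = "He counted 12345";

const greedyPlus = sentence.match(/\d+/)[0];        // as many digits as possible
const lazyPlus = sentence.match(/\d+?/)[0];         // as few digits as possible
const greedyRange = "12345".match(/\d{2,4}/)[0];    // up to the maximum of 4
const lazyRange = "12345".match(/\d{2,4}?/)[0];     // stops at the minimum of 2

const starInWhy = "why".match(/we*/)[0];            // zero "e"s is still a match
const starInBetween = "between".match(/we*/)[0];
const starInBad = "bad".match(/we*/);               // no "w" at all

console.log(greedyPlus, lazyPlus);     // 12345 1
console.log(greedyRange, lazyRange);   // 1234 12
console.log(starInWhy, starInBetween); // w wee
console.log(starInBad);                // null
```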

Alternation & Grouping

( )
Grouping characters together to create a clause. May be nested.
/(abc)+(def)/ matches one or more occurrences of "abc" followed by one occurrence of "def".
( )
Apart from grouping characters (see above), parentheses also serve to capture the desired subpattern within a pattern. The values of the subpatterns can then be retrieved using RegExp.$1, RegExp.$2 etc. after the pattern itself is matched or compared. For example, the following matches "2 chapters" in "We read 2 chapters in 3 days", and furthermore isolates the value "2":
var mystring="We read 2 chapters in 3 days"
var needle=/(\d+) chapters/

mystring.match(needle) //matches "2 chapters"
alert(RegExp.$1) //alerts captured subpattern, or "2"
The subpattern can also be back referenced later within the main pattern. See "Back References" below.
The following finds the text "John Doe" and swaps their positions, so it becomes "Doe John":
"John Doe".replace(/(John) (Doe)/, "$2 $1")
(?:x)
Matches x but does not capture it. In other words, no numbered references are created for the items within the parentheses.
/(?:.d){2}/ matches but doesn't capture "cdad".
x(?=y)
Positive lookahead: Matches x only if it's followed by y. Note that y is not included as part of the match, acting only as a required condition.
/George(?= Bush)/ matches "George" in "George Bush" but not "George Michael" or "George Orwell".
/Java(?=Script|Hut)/ matches "Java" in "JavaScript" or "JavaHut" but not "JavaLand".
x(?!y)
Negative lookahead: Matches x only if it's NOT followed by y. Note that y is not included as part of the match, acting only as a required condition.
/^\d+(?! years)/ matches "5" in "5 days" or "5 oranges", but not "5 years".

|
Alternation combines clauses into one regular expression and then matches any of the individual clauses. Similar to an "OR" statement.
/forever|young/ matches "forever" or "young"
/(ab)|(cd)|(ef)/ matches and remembers "ab" or "cd" or "ef".
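Captured subpatterns are also available on the array that match() returns, which avoids relying on the RegExp.$1 style properties (those are legacy features in current engines). A minimal sketch, with variable names invented for the example:

```javascript
// Capturing group: index 0 is the whole match, index 1 the first group.
const captured = "We read 2 chapters in 3 days".match(/(\d+) chapters/);

// Non-capturing group: the group is applied, but no numbered entry is created.
const nonCapturing = "cdad".match(/(?:.d){2}/);

// Alternation with capturing groups: only the clause that matched is filled in.
const alternation = "ef".match(/(ab)|(cd)|(ef)/);

// Groups referenced from the replacement text.
const swapped = "John Doe".replace(/(John) (Doe)/, "$2 $1");

console.log(captured[0], "/", captured[1]);   // 2 chapters / 2
console.log(nonCapturing.length);             // 1 (just the whole match "cdad")
console.log(alternation[1], alternation[3]);  // undefined ef
console.log(swapped);                         // Doe John
```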

Back references

( )\n
"\n" (where n is a number from 1 to 9), when placed after a parenthesized subpattern, allows you to back reference that subpattern within the pattern, so the text matched by the subpattern is remembered and used as part of the matching. A subpattern is created by surrounding it with parentheses within the pattern. Think of "\n" as a dynamic variable that is replaced with the value of the subpattern it references. For example:
/(hubba)\1/
is equivalent to the pattern /hubbahubba/, as "\1" is replaced with the value of the first subpattern within the pattern, or (hubba), to form the final pattern.
Let's say you want to match any word that occurs twice in a row, such as "hubba hubba." The expression to use would be:
/(\w+)\s\1/
"\1" is replaced with the value of the first subpattern's match to essentially mean "match any word, followed by a space, followed by the same word again".
If there were more than one set of parentheses in the pattern string you would use \2 or \3 to match the desired subpattern based on the order of the left parenthesis for that subpattern. In the example:
/(a (b (c)))/
"\1" references (a (b (c))), "\2" references (b (c)), and "\3" references (c).

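Back references can be tried out as follows. This is only a sketch: the doubled-word pattern is one plausible way to write "a word, a space, the same word again", and "hello world" is a hypothetical counter-example.

```javascript
// \1 repeats whatever the first group matched.
const fixedRepeat = /(hubba)\1/.test("hubbahubba");

// A word, one whitespace character, then the same word again.
const doubledWord = /(\w+)\s\1/;
const sameWordTwice = doubledWord.test("hubba hubba");
const differentWords = doubledWord.test("hello world");

// Groups are numbered by the position of their opening parenthesis.
const nested = "a b c".match(/(a (b (c)))/);

console.log(fixedRepeat);               // true
console.log(sameWordTwice);             // true
console.log(differentWords);            // false
console.log(nested[1]);                 // a b c
console.log(nested[2]);                 // b c
console.log(nested[3]);                 // c
```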
Regular Expression methods

String.match(regular expression)
Executes a search for a match within a string based on a regular expression. It returns an array of information or null if no match is found.
Note: Also updates the $1…$9 properties in the RegExp object.

var oldstring="Peter has 8 dollars and Jane has 15"
newstring=oldstring.match(/\d+/g)
//returns the array ["8","15"]

RegExp.exec(string)
Similar to String.match() above in that it returns an array of information or null if no match is found. Unlike String.match() however, it is called on the regular expression, and the parameter entered should be a string, not a regular expression pattern.
var match = /s(amp)le/i.exec("Sample text")
//returns ["Sample","amp"]
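The difference between the two methods shows up mostly with the "g" flag: match() hands back all matched substrings at once, while exec() returns one match per call (with its position) and can be called in a loop. A minimal sketch, with variable names invented for the example:

```javascript
const text = "Peter has 8 dollars and Jane has 15";

// match() with "g": every matched substring, no positions.
const allNumbers = text.match(/\d+/g);

// exec() without "g": the whole match, then the captured groups.
const firstMatch = /s(amp)le/i.exec("Sample text");

// exec() with "g": each call continues from where the last match ended.
const numberPattern = /\d+/g;
const positions = [];
let found;
while ((found = numberPattern.exec(text)) !== null) {
  positions.push(found.index);
}

console.log(allNumbers);                   // [ '8', '15' ]
console.log(firstMatch[0], firstMatch[1]); // Sample amp
console.log(positions);                    // [ 10, 33 ]
```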

String.replace(regular expression, replacement text)
Searches for a match of the regular expression and replaces the matched portion with the replacement text. In the "replacement text" parameter, you can use the keywords $1 to $99 to insert values from subpatterns defined within the main pattern.
The following finds the text "John Doe" and swaps their positions, so it becomes "Doe John":
var newname="John Doe".replace(/(John) (Doe)/, "$2 $1")
The following characters carry special meaning inside "replacement text":
  • $1 to $99: References the submatched substrings inside parenthesized expressions within the regular expression. With it you can capture the result of a match and use it within the replacement text.
  • $&: References the entire substring that matched the regular expression
  • $`: References the text that precedes the matched substring
  • $': References the text that follows the matched substring
  • $$: A literal dollar sign
The "replacement text" parameter can also be substituted with a callback function instead. 

var oldstring="(304)434-5454"
newstring=oldstring.replace(/[\(\)-]/g, "")
//returns "3044345454" (removes "(", ")", and "-")
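The special replacement sequences and the callback form can be sketched like this. All sample strings and variable names here are hypothetical.

```javascript
// "$$" is a literal dollar sign, "$&" is the matched text itself.
const withDollar = "price: 5".replace(/\d+/, "$$$&");

// "$`" is the text before the match, "$'" the text after it.
const aroundMatch = "John Doe".replace(/ /, "[$`|$']");

// A callback receives each match and returns its replacement.
const doubled = "3 apples and 4 pears".replace(/\d+/g, function (match) {
  return String(Number(match) * 2);
});

console.log(withDollar);  // price: $5
console.log(aroundMatch); // John[John|Doe]Doe
console.log(doubled);     // 6 apples and 8 pears
```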

String.split (string literal or regular expression)
Breaks up a string into an array of substrings based on a regular expression or fixed string.

var oldstring="1,2, 3,  4,   5"
newstring=oldstring.split(/\s*,\s*/)
//returns the array ["1","2","3","4","5"]

String.search(regular expression)
Tests for a match in a string. It returns the index of the match, or -1 if not found. Does NOT support global searches (i.e. the "g" flag is not supported).
"Amy and George".search(/george/i)
//returns 8

RegExp.test(string)
Tests if the given string matches the Regexp, and returns true if matching, false if not.
var pattern=/george/i
pattern.test("Amy and George")
//returns true
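For comparison, here are the three methods in one small sketch. The split pattern (a comma with optional whitespace on either side) and the name "bob" are assumptions made for this example.

```javascript
// split(): break on a comma plus any surrounding whitespace.
const parts = "1,2, 3,  4,   5".split(/\s*,\s*/);

// search(): index of the first match, or -1 when there is none.
const foundAt = "Amy and George".search(/george/i);
const notFound = "Amy and George".search(/bob/i);

// test(): a plain true/false answer.
const hasGeorge = /george/i.test("Amy and George");

console.log(parts);     // [ '1', '2', '3', '4', '5' ]
console.log(foundAt);   // 8
console.log(notFound);  // -1
console.log(hasGeorge); // true
```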

var string1="Peter has 8 dollars and Jane has 15"
parsestring1=string1.match(/\d+/g)
//returns the array ["8","15"]
var string2="(304)434-5454"
parsestring2=string2.replace(/[\(\)-]/g, "");
//Returns "3044345454" (removes "(", ")", and "-")
var string3="1,2, 3,  4,   5"
parsestring3=string3.split(/\s*,\s*/);
//Returns the array ["1","2","3","4","5"]
Delving deeper, you can actually use the replace() method to modify, and not simply replace, a substring. This is accomplished by using the $1…$9 properties of the RegExp object. These properties are populated with the contents of the portions of the searched string that matched the portions of the search pattern contained within parentheses. The following example illustrates how to use the replace method to swap the order of first and last names and insert a comma and a space in between them:
  var objRegExp = /(\w+)\s(\w+)/;
  var strFullName = "Jane Doe";
  var strReverseName = strFullName.replace(objRegExp, "$2, $1");
  alert(strReverseName) //alerts "Doe, Jane"
The output of this code will be “Doe, Jane”. How this works is that the pattern in the first parentheses matches “Jane” and this string is placed in the RegExp.$1 property. The \s (space) character match is not saved to the RegExp object because it is not in parentheses. The pattern in the second set of parentheses matches “Doe” and is saved to the RegExp.$2 property. The String replace() method takes the Regular Expression object as its first argument and the replacement text as the second argument. The $2 and $1 in the replacement text are substitution variables that will substitute the contents of RegExp.$2 and RegExp.$1 in the result string.
You can also use replace() method to strip unwanted characters from a string before testing the string for validity or before saving the string to a database. It can be used to add formatting characters for the display of a string as well.
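Both uses can be sketched with a phone number: strip the punctuation before validating, then put formatting back for display. The ten-digit rule and the display format are assumptions chosen for this example.

```javascript
const rawPhone = "(304)434-5454";

// Strip the formatting characters before testing or storing the value.
const digitsOnly = rawPhone.replace(/[()-]/g, "");

// Validate the cleaned-up value.
const isTenDigits = /^\d{10}$/.test(digitsOnly);

// Add formatting characters back for display, using the captured groups.
const displayPhone = digitsOnly.replace(/^(\d{3})(\d{3})(\d{4})$/, "($1)$2-$3");

console.log(digitsOnly);   // 3044345454
console.log(isTenDigits);  // true
console.log(displayPhone); // (304)434-5454
```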

Here is a simple example that uses test() to see if a regular expression matches against a certain string:
var pattern=/php/i
pattern.test("PHP is your friend"); //returns true

Sample Usage

Now that you’ve been introduced to regular expressions and patterns, let’s look at a few examples of common validation and formatting functions.

Valid Number

A valid number value should contain only an optional minus sign, followed by digits, followed by an optional dot (.) to signal decimals, and if it's present, additional digits. A regular expression to do that would look like this (note "-?" for a single optional minus sign, since "-*" would also accept "--5"):
var anum=/(^-?\d+$)|(^-?\d+\.\d+$)/
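Here is one way the rule described above might be exercised. The pattern below is a compact variation written for this sketch (a single optional minus sign, digits, and an optional decimal part), not necessarily the exact expression shown above; the sample inputs are hypothetical.

```javascript
// Assumed variation: optional "-", digits, then optionally "." and more digits.
const validNumber = /^-?\d+(\.\d+)?$/;

const results = {
  "42": validNumber.test("42"),
  "-3.14": validNumber.test("-3.14"),
  "3.": validNumber.test("3."),    // a dot must be followed by digits
  ".5": validNumber.test(".5"),    // digits are required before the dot
  "--5": validNumber.test("--5"),  // only one minus sign is allowed
  "abc": validNumber.test("abc")
};

console.log(results["42"], results["-3.14"]);                      // true true
console.log(results["3."], results[".5"], results["--5"], results["abc"]); // false false false false
```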

Valid Date Format

A valid short date should consist of a 1- or 2-digit month, date separator, 1- or 2-digit day, date separator, and a 4-digit year (e.g. 02/02/2000). It would be nice to allow the user to use any valid date separator character that your backend database supports, such as slashes, dashes and periods. You want to be sure the user enters the same date separator character for all occurrences. The following function returns true or false depending on whether the user input matches this date format:
function checkdateformat(userinput){
var dateformat = /^\d{1,2}(\-|\/|\.)\d{1,2}\1\d{4}$/
return dateformat.test(userinput) //returns true or false depending on userinput
}

This example uses back referencing to ensure that the second date separator matches the first one.
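A short sketch of the date check being called with a few inputs follows. The function name isShortDateFormat and the sample dates are made up for this example; the pattern is the one from the text. Note that it only checks the shape of the input, so an impossible date such as 99/99/2000 still passes.

```javascript
function isShortDateFormat(userInput) {
  // \1 forces the second separator to equal the first one.
  const dateFormat = /^\d{1,2}(\-|\/|\.)\d{1,2}\1\d{4}$/;
  return dateFormat.test(userInput);
}

console.log(isShortDateFormat("02/02/2000")); // true
console.log(isShortDateFormat("2.2.2000"));   // true
console.log(isShortDateFormat("02-02-2000")); // true
console.log(isShortDateFormat("02/02-2000")); // false: mixed separators
console.log(isShortDateFormat("02/02/00"));   // false: the year needs four digits
console.log(isShortDateFormat("99/99/2000")); // true: format only, not a calendar check
```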

Replace HTML tags (brackets) with entities instead

User input often must be parsed for security or to ensure it doesn't mess up the formatting of the page. The most common task is to remove any HTML tags (brackets) entered by the user, and replace them with their entity equivalents instead. The following function does just that, replacing "<" and ">" with "&lt;" and "&gt;", respectively:
function htmltoentity(userinput){
var formatted=userinput.replace(/(<)|(>)/g,
function(thematch){if (thematch=="<") return "&lt;"; else return "&gt;"})
return formatted
}
The first parameter of replace() searches for a match for either "<" or ">". The second parameter demonstrates something new and interesting: you can use a function instead of plain replacement text. When a function is used, its parameter (in this case, "thematch") receives the matched substring, and the function returns what you wish it to be replaced with. Since we're looking to replace both "<" and ">", this function lets us return a different replacement string for each.
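The same callback idea can also be written with a lookup table instead of an if/else. This is just a sketch of that variation; the names entityFor and bracketsToEntities are hypothetical, and like the function in the text it only handles the two bracket characters.

```javascript
// Map each bracket to its entity.
const entityFor = { "<": "&lt;", ">": "&gt;" };

function bracketsToEntities(userInput) {
  // The callback receives each matched character and looks up its replacement.
  return userInput.replace(/[<>]/g, function (theMatch) {
    return entityFor[theMatch];
  });
}

console.log(bracketsToEntities("<b>bold</b>")); // &lt;b&gt;bold&lt;/b&gt;
console.log(bracketsToEntities("no tags"));     // no tags
```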