Ruby LanguageRegular Expressions and Regex Based Operations

Groups, named and otherwise.

Ruby extends the standard group syntax (...) with a named group, (?<name>...). This allows for extraction by name instead of having to count how many groups you have.

name_reg = /h(i|ello), my name is (?<name>.*)/i #i means case insensitive

name_input = "Hi, my name is Zaphod Beeblebrox"

match_data = name_reg.match(name_input) #returns either a MatchData object or nil
match_data = name_input.match(name_reg) #works either way

if match_data.nil? #Always check for nil! Common error.
  puts "No match"
else
  match[0] #=> "Hi, my name is Zaphod Beeblebrox"
  match[1] #=> "i" #the first group, (i|ello)
  match[2] #=> "Zaphod Beeblebrox"
  #Because it was a named group, we can get it by name
  match[:name]  #=> "Zaphod Beeblebrox"
  match["name"] #=> "Zaphod Beeblebrox"
  puts "Hello #{match[:name]}!"
end

The index of the match is counted based on the order of the left parentheses (with the entire regex being the first group at index 0)

reg = /(((a)b)c)(d)/
match = reg.match 'abcd'
match[0] #=> "abcd"
match[1] #=> "abc"
match[2] #=> "ab"
match[3] #=> "a"
match[4] #=> "d"

=~ operator

if /hay/ =~ 'haystack'
  puts "There is hay in the word haystack"
end

Note: The order is significant. Though 'haystack' =~ /hay/ is in most cases an equivalent, side effects might differ:

  • Strings captured from named capture groups are assigned to local variables only when Regexp#=~ is called (regexp =~ str);
  • Since the right operand might be is an arbitrary object, for regexp =~ str there will be called either Regexp#=~ or String#=~.

Note that this does not return a true/false value, it instead returns either the index of the match if found, or nil if not found. Because all integers in ruby are truthy (including 0) and nil is falsy, this works. If you want a boolean value, use #=== as shown in another example.

Quantifiers

Quantifiers allows to specify count of repeated strings.

  • Zero or one:

    /a?/
    
  • Zero or many:

    /a*/
    
  • One or many:

    /a+/
    
  • Exact number:

    /a{2,4}/ # Two, three or four
    /a{2,}/  # Two or more
    /a{,4}/  # Less than four (including zero)
    

By default, quantifiers are greedy, which means they take as many characters as they can while still making a match. Normally this is not noticeable:

/(?<site>.*) Stack Exchange/ =~ 'Motor Vehicle Maintenance & Repair Stack Exchange'

The named capture group site will be set to ''Motor Vehicle Maintenance & Repair' as expected. But if 'Stack Exchange' is an optional part of the string (because it could be 'Stack Overflow' instead), the naive solution will not work as expected:

/(?<site>.*)( Stack Exchange)?/

This version will still match, but the named capture will include 'Stack Exchange' since * greedily eats those characters. The solution is to add another question mark to make the * lazy:

/(?<site>.*?)( Stack Exchange)?/

Appending ? to any quantifier will make it lazy.

Character classes

Describes ranges of symbols

You can enumerate symbols explicitly

/[abc]/ # 'a' or 'b' or 'c'

Or use ranges

/[a-z]/ # from 'a' to 'z'

It is possible to combine ranges and single symbols

/[a-cz]/ # 'a' or 'b' or 'c' or 'z'

Leading dash (-) is treated as charachter

/[-a-c]/ # '-' or 'a' or 'b' or 'c'

Classes can be negative when preceding symbols with ^

/[^a-c]/ # Not 'a', 'b' or 'c'

There are some shortcuts for widespread classes and special charachters, plus line endings

^  # Start of line
$  # End of line
\A # Start of string
\Z # End of string, excluding any new line at the end of string
\z # End of string
.  # Any single character
\s # Any whitespace character
\S # Any non-whitespace character
\d # Any digit
\D # Any non-digit
\w # Any word character (letter, number, underscore)
\W # Any non-word character
\b # Any word boundary

\n will be understood simply as new line

To escape any reserved charachter, such as / or [] and others use backslash (left slash)

\\ # => \
\[\] # => []

Regular Expressions in Case Statements

You can test if a string matches several regular expressions using a switch statement.

Example

case "Ruby is #1!"
when /\APython/
    puts "Boooo."
when /\ARuby/
    puts "You are right."
else 
    puts "Sorry, I didn't understand that."
end

This works because case statements are checked for equality using the === operator, not the == operator. When a regex is on the left hand side of a comparison using ===, it will test a string to see if it matches.

Defining a Regexp

A Regexp can be created in three different ways in Ruby.

  • using slashes: / /

  • using %r{}

  • using Regex.new

    #The following forms are equivalent
    regexp_slash = /hello/
    regexp_bracket = %r{hello}
    regexp_new = Regexp.new('hello')
    
    string_to_match = "hello world!"
    
    #All of these will return a truthy value
    string_to_match =~ regexp_slash    # => 0
    string_to_match =~ regexp_bracket  # => 0
    string_to_match =~ regexp_new      # => 0
    

match? - Boolean Result

Returns true or false, which indicates whether the regexp is matched or not without updating $~ and other related variables. If the second parameter is present, it specifies the position in the string to begin the search.

/R.../.match?("Ruby")    #=> true
/R.../.match?("Ruby", 1) #=> false
/P.../.match?("Ruby")    #=> false

Ruby 2.4+

Common quick usage

Regular expressions are often used in methods as parameters to check if other strings are present or to search and/or replace strings.

You'll often see the following:

string = "My not so long string"
string[/so/] # gives so
string[/present/] # gives nil
string[/present/].nil? # gives true

So you can simply use this as a check if a string contains a substring

puts "found" if string[/so/]

More advanced but still short and quick: search for a specific group by using the second parameter, 2 is the second in this example because numbering starts at 1 and not 0, a group is what is enclosed in parentheses.

string[/(n.t).+(l.ng)/, 2] # gives long

Also often used: search and replace with sub or gsub, \1 gives the first found group, \2 the second:

string.gsub(/(n.t).+(l.ng)/, '\1 very \2') # My not very long string

The last result is remembered and can be used on the following lines

$2 # gives long