Feature Proposal: add match operator to query language

Motivation

While there's a like operator ~ there's no regex match. For instance, this is needed to search for proper words inside a string (see examples below).

Description and Documentation

Until now, we did not implement this obvious operator as we assumed that it would make it harder to translate the query language to SQL ... once we decide to do. However, today's database engines do have support for posix regex. So I'd consider this a thin argument not to have this very useful operator.

Examples

%IF{
  "'foo bar baz' =~ '\bbar\b'"
  then="yes"
  else="no"
}% -> yes

%IF{
  "'foo barbaz' =~ '\bbar\b'"
  then="yes"
  else="no"
}% -> no

%IF{" 
   Tag =~ '\bbased-on-novel\b'" 
   then="This movie is based on a novel"
}%

Impact

%WHATDOESITAFFECT%
edit

Implementation

package Foswiki::Query::OP_match;

use strict;

use Foswiki::Query::BinaryOP ();
our @ISA = ('Foswiki::Query::BinaryOP');

sub new {
    my $class = shift;
    return $class->SUPER::new( name => '=~', prec => 500 );
}

sub evaluate {
    my $this = shift;
    my $node = shift;

    return $this->evalTest(
        $node,
        \@_,
        sub {
            my $regex;
            eval { $regex = qr/$_[1]/ };
            if ($@) {
              throw Foswiki::Infix::Error( 'illegal regex', $_[1] );
            }
            defined( $_[0] ) && defined( $_[1] )
            && $_[0] =~ m/$regex/s ? 1 : 0;
        }
    );
}


1;

All unit tests in Fn_IF then pass.

-- Contributors: MichaelDaum - 26 Mar 2010

Discussion

I've been thinking about this problem since out IRC discussion, and while the above solution goes some way towards making sense, it doesn't quite get there. I originally coded the escapes the way I did to "defend" certain escapes, such as \n, which I thought might be required later.

After some experimentation, I'm now convinced they are. if you are to write a string, for example, that matches a field value that contains a newline, you have to expand \n to chr(10) in the parser. It's not enough to pass it on to the operator to deal with, because the operator doesn;t know the string value came from a constant. A consistent scheme for handling all these escapes in constant strings is required. So here's my counter-patch:
Index: UnitTestContrib/test/unit/QueryTests.pm
===================================================================
--- UnitTestContrib/test/unit/QueryTests.pm   (revision 6928)
+++ UnitTestContrib/test/unit/QueryTests.pm   (working copy)
@@ -74,6 +74,9 @@
     $meta->putKeyed( 'FIELD',
         { name => "string", title => "String", value => "String" } );
     $meta->putKeyed( 'FIELD',
+        { name => "StringWithChars", title => "StringWithChars",
+          value => "n\nn t\tt s\\s q'q o#o h#h X~X \\b \\a \\e \\f \\r \\cX" } );
+    $meta->putKeyed( 'FIELD',
         { name => "boolean", title => "Boolean", value => "1" } );
     $meta->putKeyed( 'FIELD',
         { name => "macro", value => "%RED%" } );
@@ -235,6 +238,13 @@
     $this->check( "number=98 OR string='Spring'",  0 );
 }
 
+sub test_constant_strings {
+    my $this = shift;
+    my $in = 'n\nn t\tt s\\\\s q\\\'q o\\043o h\\x23h X\\x{7e}X \\b \\a \\e \\f \\r \\cX';
+
+    $this->check( "'$in'=StringWithChars", 1 );
+}
+
 sub conjoin {
     my ( $this, $last, $A, $B, $a, $b, $c, $r ) = @_;
 
Index: core/lib/Foswiki/Infix/Parser.pm
===================================================================
--- core/lib/Foswiki/Infix/Parser.pm   (revision 6928)
+++ core/lib/Foswiki/Infix/Parser.pm   (working copy)
@@ -227,11 +227,13 @@
                 push( @opers, $op );
             }
             elsif ( $$input =~ s/^\s*(['"])(|.*?[^\\])\1// ) {
-                print STDERR "Tok: qs '$1'\n" if MONITOR_PARSER;
+                my $q = $1;
                 my $val = $2;
+                print STDERR "Tok: qs '$q'\n" if MONITOR_PARSER;
 
-                # Handle escaped characters in the string
-                $val =~ s/(?<!\\)\\(.)/$1/g;
+                # Handle escaped characters in the string. This is where
+                # expansions such as \n are handled
+                $val =~ s/(?<!\\)\\(0[0-7]{2}|x[a-fA-F0-9]{2}|x{[a-fA-F0-9]+}|n|t|\\|$q)/eval('"\\'.$1.'"')/ge;
                 push( @opands,
                     $this->{client_class}
                       ->newLeaf( $val, $Foswiki::Infix::Node::STRING ) );

-- CrawfordCurrie - 26 Mar 2010

Hm, not sure we should allow these special chars everywhere. Before an \n in all other (non-regex) operators was a \ and a n (actually an n only, but that's part of the found bug). With your above patch, all operators now understand a certain subset of special chars.

-- MichaelDaum - 26 Mar 2010

Btw, here's an interesting link comparing different flavors of regular expressions and their database support.

-- MichaelDaum - 26 Mar 2010

Absolutely we should allow them everywhere. Otherwise we're back in the twiki hell of things working in some places but not in others. For consistency I'd actually prefer they worked in attr strings as well, but that's maybe a step too far.

I seem to recall raising this issue of differing regex support; IIRC you assured me that "everyone supports POSIX".

-- CrawfordCurrie - 26 Mar 2010

Just to clarify.

This is a proposal to add regex operator to query. We already have plain regex search type, but it looks everywhere in topic text.

This will enable targeted regex where we only match regex to a specific form field or parent or whatever AND can combine this with AND and OR and additional conditions.

Great enhancement of query. Very visible enhancement to application builders.

The rest of the discussions about the escaping is an existing bug/problem in query which just becomes more important with regexes.

So people should ignore the escape discussion and only judge the proposal on the addition of =~ operator.

I recommend not raising concern. There are two top developers on it, the feature is very important, and further enhance our feature set compared to the old project. So bear over with the late arrival. I let this pass as an exception because the code is already available and both Michael and Crawford implement it so it will be stable for feature freeze, and because it is a killer feature.

-- KennethLavrsen - 26 Mar 2010

excellent.

I presume there will be docco showcasing this new operator in a query SEARCH too. and a unit test - to support the non-core QueryAlgorithm implementors.

-- SvenDowideit - 27 Mar 2010

By that you mean a unit test for the operator in the context of a SEARCH? I'd regard that as essential, in the same way as a unit test for IF is essential. At the moment we have unit tests for the query syntax, but none that apply to "real" topics.

Note that one thing this test must check is the compatibility of the regular expressions. I believe it's important that the RE match in query searches observes the same RE syntax as type="regex". There are many flavours of regular expression (see http://www.regular-expressions.info/refflavors.html) and there is no lowest common denominator. Notably Javascript has elected to support Perl syntax regexes, and the PCRE lib means they are supported by many C programs as well (this is what NativeSearchContrib uses).
Topic revision: r21 - 05 Jul 2015, GeorgeClark
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy