Feature Proposal: add match operator to query language
Motivation
While there's a like operator
~ there's no regex match. For instance, this is needed to search for proper words inside a string (see examples below).
Description and Documentation
Until now, we did not implement this obvious operator as we assumed that it would make it harder to translate the query language to SQL ... once we decide to do.
However, today's database engines do have support for posix regex. So I'd consider this a thin argument not to have this very useful operator.
Examples
%IF{
"'foo bar baz' =~ '\bbar\b'"
then="yes"
else="no"
}% -> yes
%IF{
"'foo barbaz' =~ '\bbar\b'"
then="yes"
else="no"
}% -> no
%IF{"
Tag =~ '\bbased-on-novel\b'"
then="This movie is based on a novel"
}%
Impact
Implementation
package Foswiki::Query::OP_match;
use strict;
use Foswiki::Query::BinaryOP ();
our @ISA = ('Foswiki::Query::BinaryOP');
sub new {
my $class = shift;
return $class->SUPER::new( name => '=~', prec => 500 );
}
sub evaluate {
my $this = shift;
my $node = shift;
return $this->evalTest(
$node,
\@_,
sub {
my $regex;
eval { $regex = qr/$_[1]/ };
if ($@) {
throw Foswiki::Infix::Error( 'illegal regex', $_[1] );
}
defined( $_[0] ) && defined( $_[1] )
&& $_[0] =~ m/$regex/s ? 1 : 0;
}
);
}
1;
All unit tests in Fn_IF then pass.
--
Contributors: MichaelDaum - 26 Mar 2010
Discussion
I've been thinking about this problem since out IRC discussion, and while the above solution goes some way towards making sense, it doesn't quite get there. I originally coded the escapes the way I did to "defend" certain escapes, such as \n, which I thought might be required later.
After some experimentation, I'm now convinced they are. if you are to write a string, for example, that matches a field value that contains a newline, you
have to expand \n to chr(10) in the parser. It's not enough to pass it on to the operator to deal with, because the operator doesn;t know the string value came from a constant. A consistent scheme for handling all these escapes in constant strings is required. So here's my counter-patch:
Index: UnitTestContrib/test/unit/QueryTests.pm
===================================================================
--- UnitTestContrib/test/unit/QueryTests.pm (revision 6928)
+++ UnitTestContrib/test/unit/QueryTests.pm (working copy)
@@ -74,6 +74,9 @@
$meta->putKeyed( 'FIELD',
{ name => "string", title => "String", value => "String" } );
$meta->putKeyed( 'FIELD',
+ { name => "StringWithChars", title => "StringWithChars",
+ value => "n\nn t\tt s\\s q'q o#o h#h X~X \\b \\a \\e \\f \\r \\cX" } );
+ $meta->putKeyed( 'FIELD',
{ name => "boolean", title => "Boolean", value => "1" } );
$meta->putKeyed( 'FIELD',
{ name => "macro", value => "%RED%" } );
@@ -235,6 +238,13 @@
$this->check( "number=98 OR string='Spring'", 0 );
}
+sub test_constant_strings {
+ my $this = shift;
+ my $in = 'n\nn t\tt s\\\\s q\\\'q o\\043o h\\x23h X\\x{7e}X \\b \\a \\e \\f \\r \\cX';
+
+ $this->check( "'$in'=StringWithChars", 1 );
+}
+
sub conjoin {
my ( $this, $last, $A, $B, $a, $b, $c, $r ) = @_;
Index: core/lib/Foswiki/Infix/Parser.pm
===================================================================
--- core/lib/Foswiki/Infix/Parser.pm (revision 6928)
+++ core/lib/Foswiki/Infix/Parser.pm (working copy)
@@ -227,11 +227,13 @@
push( @opers, $op );
}
elsif ( $$input =~ s/^\s*(['"])(|.*?[^\\])\1// ) {
- print STDERR "Tok: qs '$1'\n" if MONITOR_PARSER;
+ my $q = $1;
my $val = $2;
+ print STDERR "Tok: qs '$q'\n" if MONITOR_PARSER;
- # Handle escaped characters in the string
- $val =~ s/(?<!\\)\\(.)/$1/g;
+ # Handle escaped characters in the string. This is where
+ # expansions such as \n are handled
+ $val =~ s/(?<!\\)\\(0[0-7]{2}|x[a-fA-F0-9]{2}|x{[a-fA-F0-9]+}|n|t|\\|$q)/eval('"\\'.$1.'"')/ge;
push( @opands,
$this->{client_class}
->newLeaf( $val, $Foswiki::Infix::Node::STRING ) );
--
CrawfordCurrie - 26 Mar 2010
Hm, not sure we should allow these special chars everywhere. Before an
\n in all other (non-regex) operators was a
\ and a
n (actually an
n only, but that's part of the found bug). With your above patch,
all operators now understand a certain subset of special chars.
--
MichaelDaum - 26 Mar 2010
Btw, here's an interesting link
comparing different flavors of regular expressions and their database support.
--
MichaelDaum - 26 Mar 2010
Absolutely we should allow them everywhere. Otherwise we're back in the twiki hell of things working in some places but not in others. For consistency I'd actually prefer they worked in attr strings as well, but that's maybe a step too far.
I seem to recall raising this issue of differing regex support; IIRC you assured me that "everyone supports POSIX".
--
CrawfordCurrie - 26 Mar 2010
Just to clarify.
This is a proposal to add regex operator to query. We already have plain regex search type, but it looks everywhere in topic text.
This will enable targeted regex where we only match regex to a specific form field or parent or whatever AND can combine this with AND and OR and additional conditions.
Great enhancement of query. Very visible enhancement to application builders.
The rest of the discussions about the escaping is an existing bug/problem in query which just becomes more important with regexes.
So people should ignore the escape discussion and only judge the proposal on the addition of =~ operator.
I recommend not raising concern. There are two top developers on it, the feature is very important, and further enhance our feature set compared to the old project. So bear over with the late arrival. I let this pass
as an exception because the code is already available and both Michael and Crawford implement it so it will be stable for feature freeze, and because it is a killer feature.
--
KennethLavrsen - 26 Mar 2010
excellent.
I presume there will be docco showcasing this new operator in a query SEARCH too. and a unit test - to support the non-core
QueryAlgorithm? implementors.
--
SvenDowideit - 27 Mar 2010
By that you mean a unit test for the operator in the context of a SEARCH? I'd regard that as essential, in the same way as a unit test for IF is essential. At the moment we have unit tests for the query syntax, but none that apply to "real" topics.
Note that one thing this test
must check is the compatibility of the regular expressions. I believe it's important that the RE match in query searches observes the same RE syntax as
type="regex". There are many flavours of regular expression (see
http://www.regular-expressions.info/refflavors.html) and there is no lowest common denominator. Notably Javascript has elected to support Perl syntax regexes, and the PCRE lib means they are supported by many C programs as well (this is what
NativeSearchContrib uses).
Note that the GNU grep RE syntax used in %SEARCH{type="regex" uses GNU ERE, which is a superset of POSIX ERE. This syntax is also highly compatible with Perl syntax (
NativeSearchContrib automatically maps \<\> to \b), but the character classes are different.
The attached spreadsheet has been derived from the site linked above. I focused on the columns I perceive as most important to us:
- Java, PCRE, POSIX ERE and XPath - because they are likely to be used by back-end DBs
- Perl, ECMA because we use them extensively
- GNU ERE because that's the mode we use grep in for plain-text searches
I went through the different support examining what can and can't be done, and colour-coded the sheet. My conclusion is we have major problems in 5 important areas:
- Handling of the start/end of words
- Backreferences
- Handling newlines
- Unicode
- Non-greedy evaluation
We can always:
- limit ourselves to POSIX ERE as the lowest common denominator in these areas, which means no support for any of them.
- Run grep in -P mode, which eliminates the GNU ERE column (though not POSIX ERE which may be supported in back-end DBs)
--
CrawfordCurrie - 27 Mar 2010
Switching to "grep -P" will impact some existing users (e.g. those running ubuntu 8 and older). I am
not raising a concern - but I am highlighting the need to address this issue in the release/upgrade notes if we switch to "grep -P".
--
MichaelTempest - 29 Mar 2010
Well, I
am raising a concern - I don't believe this feature should go live until we have a strategy for handling differing regexp syntaxes.
--
CrawfordCurrie - 29 Mar 2010
As far as I see there are two options:
- use perl's regexes as is and ignore any potential incompatibilities (there are many)
- use CPAN:POSIX::Rregex and hope to be safe with it
--
MichaelDaum - 29 Mar 2010
In the current implementation of SEARCH with type="regex" we have limitations in the RE. I have noticed that I cannot use \s for example inside the $pattern.
This is a problem and feature proposal in itself.
I think you two are trying to do too many things on this feature proposal.
Please go back to the original proposal.
It is to add the operator =~ to query search. That is all we should be deciding on this proposal.
Now, what prevents us from adding (keeping) =~ using the same regex limitations/implementation as we already know from SEARCH type="regex"?
If we try to do more than that we will most likely be rushing in changes that will break things for many people, as already mentioned by Michael.
If we add the =~ in 1.1.0 with the same limitations we know today, nothing prevents extending the regex later in 2.0
--
KennethLavrsen - 30 Mar 2010
Cleaned up proposal removing / hiding discussion of found bug.
--
MichaelDaum - 06 Apr 2010
Discussion on regex format moved to
NormaliseRegexSyntax.
--
CrawfordCurrie - 06 Apr 2010
Assuming that we now address the generic RE issue in
NormaliseRegexSyntax, do you still have concern about this exact proposal, Crawford?
--
KennethLavrsen - 08 Apr 2010
I think I can raise the concern about the proposal. I'm still bothered by the lack of unit tests, but that's an implementation issue.
--
CrawfordCurrie - 08 Apr 2010
Good.
Accepted by concensus then.
And since the code was already checked in before the feature freeze deadline - it made it for 1.1.
--
KennethLavrsen - 13 Apr 2010