Feature Proposal: Give DataForms, field types, registerMETA ability to specify type of data they hold
1. Field entry validation
If a data form app is designed to use a number in a field, it should be able to rely on it - so that users don't need to put a check in every use of a field.
This means we need to work out a way to tell someone that changes a form definition that what they're 'requesting' is impossible, or requires 'some list' of topics to be modified... - see Tasks.Item11026
for a little idea that might lead in a useful direction.
2. Default sort style for SEARCH
Allow a form and field specific default for AddSortingBehaviourToSearch
3. Index generation for fast query caches
have some great benefits over the BruteForce algorithm:
- Unlike grep, which must parse entire topics, they can efficiently scan an atomised form of the topics, which reduces the volume of data that must be processed.
- If the data type is known, indexes can be used to simply lookup values - scanning often isn't required (or possible).
- If the data type is known, external databases can pre-sort results so that Foswiki doesn't have to. When there are many thousands of results in a web of tens of thousands of topics, this means Foswiki only has to fetch & process only the topics that are going to be displayed.
But there is a problem. Foswiki currently doesn't care about the type of data it holds anywhere; it's "magical" about sorting results. It "knows" to sort numerically when the resultset's sort key only contains numbers. But if we want to avoid slow double-handling of data, preventing Foswiki from touching any topics it doesn't end up displaying, then an external database must do the sorting for us. To do that it needs to know the type of the data so that it can set indexes efficiently.
Giving hints about the data type also means that QuerySearch
might be able match a value of '007' as being the same as '7'; and an external DB could properly ascending sort -5.06 after -6.
This proposal aims to build a specification for specifying types on data held in the TopicObjectModel
Description and Documentation
Foswiki has always been agnostic about data types. Our DataForms
users haven't had to think about data types before. Here are some "high-level" types, that might be further specialised by using + separated modifiers (much like
is still a 'select' type):
- number+real (base2 float)
- number+decimal (base10 float)
- string+varchar (for a long string)
- string+char (implied when the size ==255)
- case-less (there are times when we should really store 'thursday' irrespective of what someone's js sends us
- fwaddress, Eg.
- list (to benefit QuerySearch 's IN operator)
- list+<type>, Eg.
But in reality, PaulHarvey
suspects that only
, for some usages of QuerySearch
's new IN operator) will be immediately required.
- Sorting on number fields requires a
- Sorting on date fields requires a
datetime datatype, but this can be automatically defaulted in the
- Using QuerySearch 's new IN operator where the RHS is a formfield requires the
Sadly, we need a way to specify
data type on a numeric fields, if users want to be able to sort on those fields correctly.
But we can provide a way for the
field type to automatically set a default datatype, so for example the
field type would automatically default to a
So, in practice, it should be possible for users to only care about datatypes when they're trying to sort on numbers.
Field types and their default data types
| !DataForm field type
|| Default data type
It's inappropriate for MongoDBPlugin
because there's a small limit to the amount of data that can be indexed in a field. So in the case of
simply wouldn't set an index at all. But SolrPlugin
would. And DBIStoreContrib
, might try to use a varchar index type appropriate for the SQL db it's connected to.
makes use of this API, for example it registers a
many => 1
) and we would love to do a query like this, which would take advantage of indexes:
or perhaps the equivalent:
"'Some/Web.Topic' IN links.value"
-- Contributors: PaulHarvey
- 10 Aug 2011
For compatibility, it still has to behave correctly when data is not
typed. There are three scenarios I can think of:
- data is typed and the searching algorithm can sort on that data (will return a sorted result set)
- data is typed but the search algortihm is unable to sort on that data for some reason (e.g. the data type is not supported by the search engine)
- data is not typed
So somehow the Foswiki default post-search sort has to be told that it doesn't need to sort (case 1), may need to sort (case 2) or must always sort on a field (case 3). For each field.
Secondly, a search such as
presents a problem because
might exist in several different forms, attached to different topics. If the data type is specified the same in all those forms, then no worries, but if not, how do you handle it?
Note that the data typing applies equally to search terms
. For example,
could be a string match, or a date match if the
- 10 Aug 2011
is missing a post-sort-search feature; that would be cool. I'm hoping if you're only fetching the datum on which you need to sort on, from say a resultset of 35,000 topics, that's going to be much quicker than the minutes of saturated network i/o it was taking for Foswiki to load/parse/process ACLs on each member of said res suultset (the problem we had prior to ACL filtering).
Regarding the second problem, I guess search engines could group/collect/table/namespace the fields by the form they're coming from (in reality, I'm surprised we've got as far as we have without making MongoDB collection-per-dataform, rather than collection/database-per-web). I haven't thoroughly thought that through: I've only been thinking of the minimum changes required from a plugin author/dataform user perspective.
As for the problem of query values not resembling stored/normalised/indexed values, such as with date... I guess we need new
we could make the second form illegal (only constant values allowed in the type-casting functions).
- 10 Aug 2011
lets see what i can put together quickly.
- mentioning adding type ~~ string+html - perhaps this is a way to implement AddTextareaPlusRichToTWikiForms - '=Add a textarea+rich type to the TWikiForms definition that would cover the textarea with a wysiwyg html editor and store the contents of the field as html.=
caseless - so taht we store no-case, and search indexes can optimise for that too. (related to SearchOrderOnFormfieldNoCase - do we have case-less sorting in the *SortProposal*
- adding some more important Motivations
- 14 Aug 2011
use-case for the
- 15 Aug 2011
See also: ConvertDatesToISO
- 30 Nov 2011
Idea: add an optional
attribute to META:FORMFIELD. If the META:FORM cannot be loaded, then use the
; otherwise use the type from the META:FORM. Update this field whenever the topic is saved and a form definition is loaded. At the moment if the form cannot be found all fields are given "text" type; this is a simple way to improve on that.
- 30 Nov 2011
Great minds think alike - in fact I've been experimenting with this on our wiki. The idea being to nominate some fields as type 'fwaddress' which can be safely followed with OP_ref, but the idea is the same.
- 01 Dec 2011
please, continue these thoughts - but don't forget to consider the flow-on of reducing the significance of the DataForm
for example, if the typing info is in each topic, but the type changes in the definition topic - and thus is expected to change when rendering (without having updated the data topic)
given that iirc, editing a topic will update (as in remove) fields that are no longer in the dataform, and update them to the latest schema, knowing what the type was
sounds handy - but even though you know the type, you've not addressed the change in how its rendered, or how to manage the required change to queries.
basically - its complicated, and the 2 opposing solutions would be:
- manage changes to schema/view/query all at once - so if the dataform definition changes, all topics that use it are changed to suit
- consider the dataform definition as only an initial seed definition, but this requires a linkage to versioned query and view definitions, so that if a schema changes incompatibly, old topics can be viewed and queried the old way.
even so - storing the current data's type (and values limitations) will be useful.
- 01 Dec 2011
OK, but I think there's a simpler (3) treat the type in the topic as only
affecting the view, all other accesses to the topic go back to the form definition (such as determining value constraints).
(3) is quite like the problem I fixed recently of the cache topic being out of step with the RCS history. I found it useful there to consider the possible scenarios.
- Type in field is inconsistent with form definition.
- No type in field in topic
- Typed field exists in topic, but there is no corresponding field in form definition
- Field does not exist in topic but exists in form definition
Then we have some access scenarios:
- Topic view
- Topic edit
- Topic load (as in: I want to manipulate this topic in code)
- Query (search) over form data
- 01 Dec 2011
Any redundancy is bad, besides for caching purposes. Dealing with cached type info and trying to keep it up to date seems to be a major hassle. We suffered from similar in the past as Crawford reminded us (rcs vs topicinfo).
Sven's having an important point and I've got the feeling that dealing with this type information by conditionally ignoring it, is going to be very tricky up to the point it becomes unpredictable for the user
react differently on DataForm changes.
Is it worth duplicating type info on each record? I am not convinced yet.
- 01 Dec 2011
Well, the case I've got is that I want to draw a directed graph of topic links. This means only following formfields whose values really are valid topic names, and not, eg. some GATC sequence or a date.
Presently I've got a solution that works quite well, and without any knowledge of the dataforms it expects to encounter
But, perhaps the store can transparently add
from the DataForm definition, to avoid polluting the actual
Just as I desperately wish we could have
, instead of
form.name='MyForm' OR form.name='Web.MyForm' OR form.name='Web/MyForm'
- 15 Dec 2011
Regarding date handling, I have created the proposal CreateISODateFormat
. Does the "strictly typed" date field
- 22 Feb 2012
Change to "accepted" 24/2/12 - you might want to consider what release.
- 24 Feb 2012
Note that multiple indexes are possible, for example: sort=SomeTextField will potentially be able to use an index (and sort quickly) on a back-end store that supports it. Whereas something simple like sort=uc(SomeTextField) would prevent the use of the index. To solve this either the index is defined as uc(SomeTextField) or if a raw and uc() searches are required then define both as indexes.
As noted sorting and indexing issues are close bedfellows. i.e. a query for 'LEVENS' = uc(SomeTextField
? ) is likely to lose the benefit of the index.
On my work with VDBI
, it has become clear that I may need to store fields two or more times: exactly as-is and indexable version(s).
Some extra indexes could be specified by the core or extensions
is installed, it will be necessary to load the database from the text equivalent and I would like to store any field into a searchable form, but:
- What do I do if the DataForm says "date" but the embedded META:FIELD says its text?
- What do I do if the DataForm says "date" but the embedded META:FIELD isn't given?
- What do I do if the DataForm says "date" but the embedded META:FIELD is given but has a blank value?
- What do I do if the DataForm says "date" but the embedded META:FIELD is given but has an invalid value?
Some of this may be a case of the admin running an initial load with error reports, but ideally some sensible defaults would be (respectively to above)
- Take the type as date. With an SQL back-end, every field must be of the same type to be of any use.
- Store as a NULL? (undefined in QUERY speak)
- Store and hence query/sort with a date of '0000-00-00' or '9999-99-99'? Or also treat as NULL?
- Store and hence query/sort with a date of '0000-00-00' or '9999-99-99'?
Note that the original field can also be stored as-is, such that '12-March-2008' could still be returned from the store, this would minimise the impact to how an application would appear, but search and sort would be much faster.
Some of these need to be options on the DataForm
Alternate indexes can be constructed, there was a request for a field like A-001-X567, but the numbers may not have leading zeroes, i.e A-1-X8 might also occur. All that requires is a mask to create the index with appropriate number of leading zeroes, but retrieve the original from the store. However, from where we are, I think that this is a step too far and this can be added without needing to prepare the way now.
It seems to me that we have 4 stages of data structure:
- Unstructured: i.e Natural language; paragraphs, sentences & words
- Loosely structured: Headings, Tables etc
- Formal structure: DataForms
- Now under control of Foswiki via the UI
- But can be damaged by direct manipulation of the text files
- But over time datatypes can changes
- Strictly structured:
- Now under DB control, types controlled across the board not just one field at a time
- Direct DB manipulation can possibly damage structure but less likely, it's not a valid API in any sense
- Versioning side must handle change types over time, but the change is necessarily across all fields not just the next update of one topic
Do some degree or other we are moving towards a more strictly structured foswiki. To gain performace thru indexes it's inevitable. My VDBI
goal is to minimise this, but I do not believe it can be eliminated.
Note that the store will work without indexes, and as a pluggable replacement that's important. It's just that some performance benefits will not materialize without some indexing nous — i.e it's becoming rather geeky. Mitigating poor search performance can achieved by capturing query statistics for the FW admin/dba to analyse. Making it less geeky is somewhat harder.
The strictly structured concept suggests the possibility of a +strict modifier (or equivalent). OTOH, it will necessarily be a given in many cases.
Please also note that I have a SuperDataForm idea to deal with other issues so there is likely to be overlap. I'll write up a proposal real soon now — honest!
Note that when the DataForm is modified this could trigger some mass updates, but that will just need to be handled via the UI (are you sure? And then the updates occur etc)
A related issue is that the changed DataForm
could be utter rubbish initially. Within VDBI
I am planning to note which version of the DataForm was the last valid one. Only when a valid version is specified would any changes be accepted. Actually, because DataForms are currently updated in a suck-it and see style as encouraged by the wiki (loosely structured) and we now need a stricter nature for the DataForm, this is a conflict. We arguably need a UI to 'Approve' the new DataForm in addition to merely saving it. There, are alternatives: allow a temporary new DataForm and create a few test entries and approve the change when A-OK. There maybe better ideas, but I suspect that they are all non-trivial.
- 27 Feb 2012
All valid points, I think both Sven & Crawford have encountered the same design issues.
For what it's worth, a MongoDBPlugin
-powered query in Foswiki is still much
faster than a vanilla one even with all indexes turned off. Especially when it is pre-filtering the results according to ACLs for the user under which the query takes place; by delegating as much work as we can to the DB, we prevent too much double-handling of data by Foswiki in perl, and that already speeds things up enormously.
In my own experience, getting indexes right is great for overall query performance, but matters mostly (or at least: impact is more noticeable) for nested SEARCHes where we are firing dozens or hundreds of queries in a single page view.
In my mind, I think it's best to treat missing/invalid values as nulls in the index, rather than using zeroes/1970.
a query for 'LEVENS' = uc(SomeTextField? ) is likely to lose the benefit of the index.
- but could be hoisted to a query that can make use of a case-insensitive index. With MongoDB I've got many queries that look like
on webs with on the order of 50-60,000 topics, and lack of indexes isn't really an issue here (unless used in a nested SEARCH).
This multiple-index strategy is what Solr does best; in fact, I wish I had the time to write a query algo using Solr - although it's not expressive enough to run much of the QuerySearch
language, many simpler expressions should be possible.
- 27 Feb 2012
I have found that hybrid search algos - running simple hoisted searches, such as full-text searches using Sphinx - to narrow down the set of potential topics, then using brute-force to perform a final search, successfully accelerates 90% of searches (which are usually looking for a narrow set of topics). Don't get too
hung up on complete coverage in the index - it's a good goal to have, but expect a leprechaun to be waiting for you with a pot of gold when you achieve it.
- 28 Feb 2012
Crawford: I've been in SQL land quite a bit lately especially wrt indexes, as a result my thoughts above do give the impression that I'm trying to create perfect indexing. I also had the CreateISODateFormat
proposal on my mind and it's specific needs.
Well I'm not trying to create perfect indexing.
will by its nature index every field, but only as a string. Actually that's simplified. Things like date and number could be indexed by storing a canonical text version that sorts (and hence indexes) appropriately. However, I'll probably add a table for all date fields for efficiency, specifically allowing the SQL engine to do a lot of the work. I will need something similar for numbers. Everything else can be a string.
Starting from where we are, this will be a significant speed-up (as you confirm) which for the general Wiki and App usage will be (I hope) more than enough. If someone needs more performance then there's nothing stopping some further tweaking, but let's wait for these needs to appear and then worry about them.
I'm very much in favour of the work here, I dare so we all have some misgivings, but we'll only sort those out as we work on things. My VDBI
work is certainly generating a lot of questions and ideas in my mind.
Finally, I do not believe it is possible to have a completely store-agnostic solution to this. Actually it is, but only if you sacrifice making the best of the stores abilities — which is a mostly about performance.
- 28 Feb 2012
I really need to share what we've learnt with our Mongo adventure: if you want efficient queries over 10,000s of topics, it is crucial
to avoid chattiness, to delegate as much of the query/filtering to the DB as possible. I can't stress that enough: this is drastically more important than fiddling with indexes - in fact I can have all indexes disabled
and Foswiki still
struggles to formulate a query convoluted enough that Mongo can't answer it (under production load!) inside of a second or three (we have ~230,000 topics, largest web ~65,000). (Actually, that's a lie - some
Whenever Foswiki must load
thousands (or even hundreds) of topics into memory, that totally dwarfs any DB query times, indexed or not. Example:
- We had a web of ~35,000 topics
- query looked something like
"form.name='Web/Topic.DataForm' AND Status='Foo'"
explain() indicated that this was a reasonably fast query (using indexes), even when loaded with real traffic on prod: circa 5-10ms
- Nonetheless, with most of the topics ACLs set so that only SpecialGroup could view them, non-admin users experienced page view times of up to two minutes to get the first page of 25 hits!
- Profiling indicated Foswiki was sucking in almost all 35,000 topics (~200MiB!) on each view - to check ACLs, and hide hidden stuff from the result
- Sven then added the ACL filtering code, so that ACL filtering was delegated to mongo. WikiGuest, SpecialGroup, non-special users alike were all able experience ~1.6s page views
This is also why in MongoDBPlugin
That's not to say that good indexes aren't important - if it weren't for indexes, I wouldn't be able to have massively inefficient nested SEARCHes performing hundreds of queries in a given page view - I just want to stress that getting Foswiki to scale absolutely demands minimization of database chattiness.
- 29 Feb 2012
I have been thinking about DataForms quite a bit and it suggests to me deeper problems that need to be addressed. I have sets of DataForms
as part of work flows. In one work flow state I capture a date. In another state that date is now a label (so it cannot be changed during form edit). As a label it will not be seen as a date, so how can any database know to treat this field as a date. This could be fixed with date+label I guess.
As I've already stated, I believe that AllowTypedData
needs to introduce strict types (as distinct from existing loose types) which should also properly consider rendering if not the whole MVC business.
There's another complication relating to my date+label thoughts, what about number+select? That is to say the the type is at heart a number and should be rendered as such, but it's also a select type for editing where a discrete range of numbers are offered.
- 10 May 2012
I suggested something along that idea with the column at the top, mapping traditional foswiki types to the 'new' data-type spec. In your example, the new-style way would be
, but I see I didn't tackle the problem of giving an appropriate view
Hrm. I really will try to get my datum-handler thingy into a workable state, which tries to be an MV(P)C architecture to replace Foswiki::Form - initial inspiration was the disconnect between Foswiki's (X)HTML-centric renderForDisplay/Edit vs other views that we wanted (RDF, CSV, jqGrid, etc).
- 10 May 2012 - 23:58