Feature Proposal: Add optional controls on valid topic names

Motivation

With support for Unicode comes the possibility of mischief in topic names. We still have the name filter, which removes (filters out) characters that can cause harm, but with Unicode there are many more possibilities.

Description and Documentation

Add a couple of optional controls which a site administrator could enable to better restrict the naming of topics:
DisallowCombiningCharacters
Configuration option to either reject or filter out Unicode combining marks. See Wikipedia:Combining_character
RestrictTopicName
Possibly a regex for additional restrictions, such as Perl character classes, AlphaNumeric, etc.
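
As a rough illustration of how the two controls might behave, here is a minimal Python sketch (Foswiki itself is Perl; this only shows the logic). The function name apply_name_controls, its parameters, and the choice to normalise to NFC before checking are all assumptions made for the example, not part of the proposal.

```python
import re
import unicodedata


def is_mark(ch):
    # Unicode general categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def apply_name_controls(name, combining="reject", restrict=None):
    """Return the accepted (possibly filtered) name, or None if rejected.

    combining: "reject", "filter" or "allow" (hypothetical option values).
    restrict:  optional regex that the whole name must match.
    """
    # Assumption: compose to NFC first, so that an accent with a precomposed
    # form (e + U+0301 -> U+00E9) is not treated as a stray combining mark.
    name = unicodedata.normalize("NFC", name)
    if combining == "reject":
        if any(is_mark(ch) for ch in name):
            return None
    elif combining == "filter":
        name = "".join(ch for ch in name if not is_mark(ch))
    if restrict is not None and re.fullmatch(restrict, name) is None:
        return None
    return name
```

Note that rejecting every mark would also reject legitimate text in scripts that rely on combining marks with no precomposed form, so a real control would probably need to be more selective than this.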

I'm not really certain what we need here, but it does feel that some further controls are reasonable. Some examples of "interesting" topic names that are now allowed include:

There are also comments here & there that "Filter-out" is really not the right approach. If true, then what should a Unicode aware control look like?

Examples

Impact

%WHATDOESITAFFECT%

Implementation

-- Contributors: GeorgeClark - 11 Jan 2016

Discussion

Related: some recommendations already exist in RFC 7564. Quoting from section 4.2:

Most application technologies need strings that can be used to refer to, include, or communicate protocol strings like usernames, filenames, data feed identifiers, and chatroom names. We group such strings into a class called "IdentifierClass" ...

-- JozefMojzis - 19 Jan 2016

CPAN:Unicode::Precis has been developed to support RFC:7564; of course, it needs to be checked out carefully.

-- JulianLevens - 19 Jan 2016

Before jumping to any conclusions, let's just review what the filters are actually used for. There are a couple of filters that are used in a very nasty way:
  • {NameFilter} is used to enforce the "validity" of a topic name (and by extension, the wikiname of a user). A "valid" topic name is defined as one that maps to a valid filename (legal filename characters, no path components) and (perhaps) contains no characters that might be used to mount an XSS attack. This last point is extremely dubious; the defence against XSS should not (and I think does not) depend on {NameFilter}. {NameFilter} is also used to actively modify strings in a non-recoverable way, i.e. once it is applied, the original name is lost.
  • {AttachmentNameFilter} is a derivative of {NameFilter} and is used to test characters in attachment names for filesystem validity. It is only used in validateAttachmentName, but is also used to actively modify strings in a non-recoverable way.
Both of these filters are "filter out" i.e. they filter out characters that match the regex. The {NameFilter} was originally created to protect the store from illegal filenames. As such it is specific to the store implementation (different stores may support different character sets for names).
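
To make the "non-recoverable" point concrete, here is a small Python sketch. The regex is a made-up stand-in, not the actual {NameFilter} default, and the function names are hypothetical. It contrasts using the same pattern to filter out characters with using it only to validate.

```python
import re

# Hypothetical filter, NOT the Foswiki default: whitespace plus a few
# characters that are awkward in filenames or markup.
NAME_FILTER = re.compile(r"[\s*?<>|\\/:;]")


def filter_out(name):
    # "Filter out": silently delete every matching character.
    # Distinct inputs can collapse to the same output, and the
    # original name cannot be reconstructed afterwards.
    return NAME_FILTER.sub("", name)


def is_valid(name):
    # Validation only: report the problem and leave the name untouched.
    return NAME_FILTER.search(name) is None
```

With this pattern, "My Topic" and "My:Topic" both become "MyTopic" after filter_out, which is exactly the information loss described above; is_valid would instead reject both and leave the decision to the caller.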

My view is that we need to look at this in two different ways. Firstly, we have the definition of a wikiword, which in English and many other languages uses CamelCase; we know that this doesn't work with some other scripts. As such we need to give sites the flexibility to redefine the construction of a wikiword. This can be done by moving $Foswiki::regex{wikiWordRegex} and {webNameBaseRegex} into Foswiki.spec.

I favour the view that it's the store's responsibility to be "character set agnostic" i.e. the store is asked to handle webs/topic names that meet the constraints imposed by Foswiki.spec, but otherwise are not constrained. Thus if Foswiki.spec allows it, the store must support a topic called "▁▂▃▅▆▇". The store must translate these strings into something that is safe to use with the filesystem (or database, or whatever is providing the actual storage mechanism). This has a knock-on effect on the Sandbox.
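
One way a store could do such a translation is a reversible encoding of the name, so that nothing is lost and any string is acceptable. The Python sketch below uses percent-encoding purely as an illustration; the function names are hypothetical and this is not how any existing Foswiki store works.

```python
from urllib.parse import quote, unquote


def to_storage_name(topic):
    # Percent-encode everything except ASCII letters, digits and "_", "-", "~".
    # safe="" makes sure "/" is encoded as well, and "." is encoded by hand
    # so that "." and ".." can never appear as path components.
    return quote(topic, safe="").replace(".", "%2E")


def from_storage_name(stored):
    # Exact inverse of to_storage_name: the original name is recovered.
    return unquote(stored)
```

Because the mapping is reversible, the store never needs to reject or alter a name; whether a name is acceptable stays a policy decision made elsewhere. The obvious cost is that on-disk names are no longer human-readable for non-ASCII topics, and filename length limits are reached sooner.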

One fly in the ointment here is that attachment and topic URLs can still be built ad-hoc in topics by assembling pub/ URLs (and it's this that has stopped me fixing this before now). There are a couple of approaches here:
  1. Require construction of URLs by the PUBURL parameter mechanism (i.e. disable URL construction using simple strings in topics)
  2. post-process (in completePageHandler) to map "bad" pub URLs to their "good" alternative. This is a hack, but at least maintains a semblance of compatibility
  3. something else I didn't think of

-- Main.CrawfordCurrie - 19 Jan 2016 - 14:10

I agree that the store should be completely character-set agnostic. The store should be intrinsically safe without any filtering at the front end, so if a site chooses to eliminate the name filter, we should remain functional. There are current bugs (for example, Search crashes when it encounters a backslash in a filename), so this needs careful testing.

Regarding removing the NameFilter controls, I agree that they are not ideal, but they do prevent introduction of certain characters that can be abused.
  • We have a lot of code that "evals" strings here and there, and characters that could potentially terminate a string or statement ', " and ;, might introduce some risk. The only thing I can state for certain is that 1) I don't know, and 2) we cannot know. And I assert cannot, because there are private extensions inside companies that might be vulnerable and that we cannot access.
  • I can't state for sure that we always encode topic names before display. So < and > might allow introduction of scripts. And again not just in our code, but in private extensions as well.
  • And finally the ASCII control characters (LF, CR, etc.) and NUL (a risk to C language programs). I just don't see a use for them. We also use them in multiple places as "placeholders" that are supposedly guaranteed not to occur in topic text.
So my conclusion here is that a blanket removal of the *NameFilter controls is just too uncertain to do by default.

I could see us taking several steps.
  1. Ensure that Store is indeed character set safe. That needs to be done regardless of any decisions here.
  2. Deal with the issues around attachment naming, possibly separating the stored name from the upload/download & display names. Again needed, regardless of this proposal.
  3. Make the NameFilters "optional" ... We can test carefully, but as I said above, I don't believe it will be safe to just remove them.
  4. Add a {TopicNamePolicy} and {AttachmentNamePolicy} of some sort that permits added controls. I don't know what they are yet, but maybe something using Unicode::Precis could be worked out. Policies might possibly deal with:
    • Impact of UCS Transformation Format characters, which might allow normally "filtered" ASCII characters to slip through protections.
    • Other possible Unicode attacks using Punycode, character spoofing, etc.
    • General practices in the customer's environment. These, we should NOT assume or assert, but we should at least accommodate some level of control.
  5. And I think it may be time (but not on this proposal) to discuss the direction of CamelCase linking, and full functionality in unicameral languages.
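
As a sketch of what a name policy in step 4 could look like, the Python below reports violations rather than silently rewriting the name. The specific checks (control characters, the quote, semicolon and angle-bracket characters mentioned above, and NFC normalisation) are only examples chosen for illustration, and policy_violations is a hypothetical name.

```python
import unicodedata

# Characters called out in this discussion as risky in eval'd strings or HTML.
RISKY_ASCII = set("'\";<>")


def policy_violations(name):
    """Return a list of human-readable problems; an empty list means accepted."""
    problems = []
    # General category Cc covers the ASCII control characters, including NUL.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        problems.append("control character")
    if any(ch in RISKY_ASCII for ch in name):
        problems.append("quote, semicolon or angle bracket")
    # Assumption: names are expected to be stored in NFC form.
    if unicodedata.normalize("NFC", name) != name:
        problems.append("not NFC-normalised")
    return problems
```

A site-specific policy could then be expressed as a configurable set of such checks, which keeps the decision about what is acceptable in the site's hands.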

-- GeorgeClark - 19 Jan 2016

I understand that this is unrelated to this proposal, but if we separate the store's "filename" handling into separate methods (based on Crawford's "My view is that we need to look at this in two different ways"), maybe we could go some way towards addressing the issues outlined in http://foswiki.org/Tasks/Item13930. Two different CamelCase names do not always mean different files, e.g. SomeTopicName vs SomeTopicname on OS X.
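
The collision can be illustrated with a short Python sketch that compares names by a folded key. Using NFC plus casefold() is only an approximation chosen for the example; real filesystems apply their own normalisation and case-folding rules, and the function names are hypothetical.

```python
import unicodedata


def storage_key(topic):
    # Approximation of how a case-insensitive filesystem might compare names.
    return unicodedata.normalize("NFC", topic).casefold()


def find_collisions(topics):
    """Return pairs of distinct topic names that share a storage key."""
    seen = {}
    collisions = []
    for topic in topics:
        key = storage_key(topic)
        if key in seen and seen[key] != topic:
            collisions.append((seen[key], topic))
        else:
            seen.setdefault(key, topic)
    return collisions
```

For example, find_collisions(["SomeTopicName", "SomeTopicname", "OtherTopic"]) reports the first two names as a colliding pair.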

-- JozefMojzis - 19 Jan 2016

OK, a few points here.
  1. I wasn't proposing getting rid of {NameFilter}
  2. The store already uses methods to test for topic existence. There may be cases where two topic names are compared outside the control of the store that Jozef's CamelCase example triggers. I don't know, and I don't know how to find them.
  3. George is right, step 1 is to ensure the stores are character-set agnostic. And that's the hardest step, because it requires us to address the "fly in the ointment" I mentioned above. Ideas on this one, anyone?

-- Main.CrawfordCurrie - 20 Jan 2016 - 07:56
 
Topic revision: r9 - 20 Jan 2016, CrawfordCurrie