WIP: Extend meaning of "Hidden Files" (PR version 3) #179
Closed
VinceR
wants to merge 3 commits from issue/270/tdelibs-V3
into master
Loading…
Reference in new issue
There is no content yet.
Delete Branch 'issue/270/tdelibs-V3'
Deleting a branch is permanent. It CANNOT be undone. Continue?
TDEStringMatcher
, a general purpose iterative string matching class. Additional information about the class can be found intdecore/README.tdestringmatcher
. Files involved:Introduce concept of a hidden file matcher instance of the
TDEStringMatcher
class that can be used to extend the definition "hidden" files beyond traditional "dotfiles".Create a global default instance of a hidden file matcher for use in applications that don't need their own. Files involved:
KFileItem
class to utilize a hidden file matcher when evaluating whether or not a filesystem object is hidden. Files involved:This commit partially addresses issue TDE/tdebase#270. Additional changes to
tdelibs
,tdebase/libkonq
, andtdebase/konqueror
will be required to resolve that issue.Considerations for reviewers:
konqueror
. To the best of my knowledge, the code successfully addresses issue # 270 and manifests no bugs or noticeable performance penalty.TSMTRACE
and should be ignored when reviewing - they will be removed when final testing is complete.TDEStringMatcher
signals and associatedKFileItem
slots. The changes are implemented by#define TSMSIGNALS
. After further testing, the associated preprocessor conditional directives will be removed.@MicheleC,
This is in regard to this conversation in which the merits of using associated private classes was dicussed.
As you will see from the
TDEStringMatcher
code, I did follow your advice and putTQString patternString
into a private class instead of declaring it asprivate:
property in the main class. But on second look, this looks contrived and perhaps pointless given how small the private class is. What do you think about this?As you will see from the
KFileItem
code, I have added 2 protected class variables (one boolean and one an object pointer. I would definitely agree that both of these should be in an associated private class. The problem is that theKFileItemPrivate
class seems to be special purpose and instantiated only conditionally in various methods. Should create another private class, useKFileItemPrivate
but allocate it early in theKFileItem
constructor, or just leave things as they are?@VinceR
thanks for the new PR. FYI, I expect to have a close look at this next weekend. The week ahead is busy (again) and the time available for TDE will be devoted to finalize R14.0.13, whose code base will be frozen this Friday.
This PR replaces PR #178
First part of the review. Still do do 3 files.
TSMTRACE << "TDEGlobal::hiddenFileMatcher(): Global HFM initialization STARTED" << endl;
_hiddenFileMatcher = new TDEStringMatcher();
TDEGlobal::config()->setGroup( "General" );
TQString settings = TDEGlobal::config()->readEntry( "globalHiddenFileSpec", "/oW/.*" );
Should this be
/ow/.*
(lower casew
)? Or arew
andW
different?This was a typographical error, it was supposed to be
/ow/.*
. There may be a case to be made for a relaxed case-insensitive evaluation of option characters, where 'w' and 'W' are interchangeable.yes, we could use 'w' or 'W' interchangably, as long as we document that accordingly. So for the other options.
Actually this should be
/ow/p.*/
(note thep
), right?}
// Initialize hidden file matching apparatus
setHiddenFileMatcher( TDEGlobal::hiddenFileMatcher() );
This is fine. Nevertheless there are some constructors that do not use
init()
butassign()
. We need to add the same call there too.}
bool KFileItem::isHidden() const
void KFileItem::resetHiddenFileMatcher()
See comments in
setHiddenFileMatcher()
. Possibly this method is no longer needed.if ( hiddenFileMatcher == m_pHiddenFileMatcher )
return;
if ( hiddenFileMatcher == 0 || hiddenFileMatcher == nullptr ) {
no need to double check. Just
if (hiddenFileMatcher == nullptr)
is enough.I thought that '0' and
nullptr
were different, with the latter pointing to a specific address interpreted to be "unset" versus literal address location 0.nullptr
is technically different than0
but checking fornullptr
is sufficient to cover both cases.if ( hiddenFileMatcher == m_pHiddenFileMatcher )
return;
if ( hiddenFileMatcher == 0 || hiddenFileMatcher == nullptr ) {
kdWarning() << "KFileItem::setHiddenFileMatcher: refusing to process null pointer passed by caller" << endl;
We need to be able to set nullptr as a potential matcher. The reason for this is to correctly handle the case in which the global file matcher is destroyed. The current code would try to call
resetHiddenFileMatcher()
which would set the matcher back to the global file matcher being destroyed. This is a design mistake.If we call
setHiddenFileMatcher()
with a nullptr, we disconnect the existing matcher and setm_pHiddenFileMatcher
to nullptr. Of course later we need to check that the matcher is not null before dereferencing it.By the way, we probably don't even need
resetHiddenFileMatcher()
any more if we definesetHiddenFileMatcher
as follow:void setHiddenFileMatcher( TDEStringMatcher *hiddenFileMatcher = nullptr);
. In this case,destroyed()
could simply be connected to setHiddenFileMatcher()`. I leave a bit of a doubt on this and it should be tested first, to make sure the signal-slot logic does not complain due to the difference in signature of the signal and slot.Under what circumstances would the global matcher be destroyed? I thought by its very nature it lasts the life of the application and could therefore be counted on by all applications. If it cannot, then we should re-think how the global matcher is implemented.
The only code outside of the
KFileItem
itself that should be callingsetHiddenFileMatcher()
is that in theKDirLister
that allocated theKFileItem
in the first place. It will never callsetHiddenFileMatcher()
with anullptr
, but with a pointer to its own hidden file matcher.That said, if we do as you suggest, then I see the followinng options as reasonable responses to
setHiddenFileMatcher(nullptr)
:m_bHiddenByMatcher
tofalse
m_bHiddenByMatcher
based on hard coded dotfile matching.m_pHiddenFileMatcher = TDEGlobal::hiddenFileMatcher()
. Even if the global matcher had somehow been previously destroyed, make it come to life again. Any reason why that wouldn't work?I think this could work, but we do need to settle how
nullptr
should be interpreted.The global file matcher should be destroyed only when an application using the tdecore lib is closed, so yes in practice this should not be a real problem. Nevertheless it is a design mistake and the fact that the application is closing should not be an excuse to justify the mistake. Since there is a simple solution (handling nullptr with a few extra
if()
), it should be fairly simple to implement the fix.I wasn't pointing out a bug, just suggesting how to use
nullptr
andsetHiddenFileMatcher
to disconnect a matcher correctly.Yes, agree with you.
No because the object may not exists and if you try to use it you may access invalid memory and crash your application. If a new global matcher is created again, it won't have the same address anyway. So we should simply set
m_pHiddenFileMatcher
tonullptr
after disconnecting the existing matcher, if any.Sorry, I had misunderstood your main point. Yes, it is always good to be paranoid with when using pointers:)
No worries, always good to ask in case of doubt 👍
I have implemented all suggestions and everthing seems to still work correctly!
Sounds good :-)
}
#ifdef TSMSIGNALS
if ( m_pHiddenFileMatcher != 0 && m_pHiddenFileMatcher != nullptr ) {
No double check required.
#ifdef TSMSIGNALS
if ( m_pHiddenFileMatcher != 0 && m_pHiddenFileMatcher != nullptr ) {
TSMTRACE << " Attempting to disconnect slots from hidden file matcher signals ... " << endl;
if ( disconnect( m_pHiddenFileMatcher, 0, 0, 0 ) )
This is wrong, it basically disconnects all the signals emitted by the
m_pHiddenFileMatcher
, which is likely shared by manyKFileItem
s.We need to disconnect only the connection between
m_pHiddenFileMatcher
andthis
file item. Therefore a correct call would be:I will give this a try. I know that I got into major trouble trying to be too specific with disconnecting the
patternsChanged
signal, so I got pretty broad.Being specific is good. In that case you need to disconnect both
patternsChanged
anddestroyed
.What I suggested above is a mid-way form which disconnects all signal-slot connections between the two objects.
I changed the statement to
disconnect(m_pHiddenFileMatcher, 0, this, 0)
and everything works fine. Historic note:I had initially specified
disconnect(m_pHiddenFileMatcher, 0, 0, 0)
in a desperate effort to fix segfaults I was encountering withdisconnect
. The main problem turned out to be something else.While changing and testing other parts of the code in response other comments, I started encountering
disconnect
segfaults again. Changing the code to what you suggested cleared these up.The problem with
disconnect(m_pHiddenFileMatcher, 0, 0, 0)
was that you were disconnecting all the connections of the matcher. The same matcher is connected to all the KFileItem in the same KDirLister, so usingdisconnect
as you initially did, was disconnecting the matcher from everything, not just the current KFileItem.#ifdef TSMSIGNALS
TSMTRACE << " Attempting to reconnect slots to hidden file matcher signals ... " << endl;
if ( connect( m_pHiddenFileMatcher, TQT_SIGNAL( destroyed() ), this, TQT_SLOT( resetHiddenFileMatcher() ) ) )
Just a reminder to leave the
connect
/disconnect
calls in the code once you remove the TSM traces.Just a reminder to leave the connect/disconnect calls in the code once you remove the TSM traces.
Did you mean removing the
#ifdef TSMSIGNALS
? I hope to remove both#define
s.I mean that when the TSM* stuff if removed, the body if the
if()
is empty. Therefore theif
needs to be removed, but theconnect
in theif
condition is required.TSMTRACE << "KFileItem::reEvaluateHidden() called for " << m_url.fileName() <<endl ;
if ( !m_url.isEmpty() )
return m_url.fileName()[0] == '.';
m_bHiddenByMatcher = m_pHiddenFileMatcher->matchAny( m_url.fileName() );
As per comments in
setHiddenFileMatcher()
, we need to check thatm_pHiddenFileMatcher
is not null before using it. If it is null, by default the file is not hidden./**
* Sets object that encapsulates criteria for determining whether or not
* a filesystem entity is hidden based on characteristics of its name.
* Object is stored in @property m_pHiddenFileMatcher.
Object is stored in @property m_pHiddenFileMatcher.
should be removed. APIs are supposed to hide away the details of the inner implementation :-)void resetHiddenFileMatcher();
/**
* Checks whether or not the current filesystem object is "hidden" by
This comment can be moved at the beginning of the implementation of
void reEvaluateHidden();
in the cpp file. Here in the .h file it would more appropriate something likeRe-evaluate whether the item is hidden or not
. Again, the details of the implementation should be hidden to the outside world.bool m_bMimeTypeKnown:1;
// Auto: check leading dot.
// Auto: always check if hidden.
Suggest
Auto: check if the item is hidden.
/**
* Object that encapsulates criteria for determining whether or not
* this filesystem entity is hidden based on characteristics of its
* name. This property is set by method setHiddenFileMatcher().
This property is set by method setHiddenFileMatcher().
can be removed.* this filesystem entity is hidden based on characteristics of its
* name. This property is set by method setHiddenFileMatcher().
*/
TDEStringMatcher *m_pHiddenFileMatcher = nullptr;
Using
= nullptr;
is not technically wrong since c++11 is allowed. But since all other members of KFileItem are initialized in the constructor, it would be cleaner to movenullptr
together with them in the same place.Acknowledged. An historic note:
Before I implemented signals/slots, it was sufficient to initialize
m_pHiddenFileMatcher
simply by callingsetHiddenFileMatcher( TDEGlobal::hiddenFileMatcher() )
. No other initialization was required.After implementing signals/slots, I started getting segfaults during
KFileItem
initialization because I wasdisconnect
-ing from the "previous" hidden file matcher ... which did not exist at this point, som_pHiddenFileMatcher
was pointing to some random location in memory. Pre-initializingm_pHiddenFileMatcher
tonullptr
cleared this up. I guess that I cannot assume thatnew
zeroes out memory.m_pHiddenFileMatcher
(llike any variable) needs to first be initialized. If not and trying to access the variable before the first assignment, then you would accessing invalid memory.new
is not the problem here.new
simply allocate the memory and invoke the related constructor.TDEStringMatcher::TDEStringMatcher()
was not initializingm_pHiddenFileMatcher
, hence its content was random (I also missed to catch that in the review btw).Second and final part of the review.
'w' - Match patterns are to be interpreted as "wildcards"
'r' - Match patterns are to be interpreted as "regexes" (TQRegExp default)
'c' - Matching will be case-sensitive (TQRegExp default)
'c' - Matching will be case-INsensitive
'i'
for case insensitive.Files with a version number suffix will be matched via regex
and dotfiles will be matched via wildcard
Current and potential use of the TDEStringMatcher class include:
These could probably be removed.
TDEStringMatcher::~TDEStringMatcher()
{
patternList.setAutoDelete( true );
Need to move
patternList.setAutoDelete( true );
to the constructor, otherwise when the patternlist is cleared ingeneratePatternList()
, there will be memory leaks.TQStringList specList = TQStringList::split( patternStringDivider, newPatternString.mid(1), true );
TQRegExp rxWork;
TQPtrList<TQRegExp> rxPatternList;
There is no need for
rxPatternList
, we can usepatternList
after having clear it.The original reason for
rxPatternList
was to fully validatepatternString
before actually changingpatternList
.Implementing these suggestions from other conversation items will make
generatePatternList()
completely tolerant of input. Given that, you are correct and we can operate directly onpatternList
. The UI should be able make sure thatpatternString
doesn't have empty or invalid patterns.This is actually a good idea and I thought about that too. And in the end I came to the same conclusion that the UI should make sure patterns are correct.
But on second thought, I think you original idea is a better solution. Reason being:
So all in all, I think your original implementation was a better choice.
TQChar specificationType = specification[0].lower();
switch ( specificationType ) {
case 'o' : {
TQString optionString = specification.mid(1).lower();
Instead of using an
optionString
which requires creating a new string and copy the specification into it, we can usespecification
and start from ` in the for loop below.TSMTRACE << " Processing match pattern: '" << pattern << "'" << endl;
if ( pattern.isEmpty() ) {
TSMTRACE << " Empty patterns are not allowed" << endl;
rxPatternList.clear();
If a specific pattern is empty, we should ignore it and continue with the potential next pattern in the list.
rxWork.setPattern( pattern );
if (! rxWork.isValid() ) {
TSMTRACE << " Invalid pattern" << endl;
rxPatternList.clear();
If rxWork is not valid, we should ignore it and continue with the potential next pattern in the list.
}
// patternList.clear(); // no need to do this?
patternList.setAutoDelete( true );
patternList.setAutoDelete( true );
not required here.}
}
if ( patternList.isEmpty() ) {
if there is no pattern in the list, IMO we should return
true
. The idea is that if there is no filtering string, the file is not filtered. We shall check with @SlavekB to see what he thinks.Also we can move the
if (patternList.isEmpty())
before thefor
loop (slightly faster).I completely agree -- whatever was I thinking? :)
Ok, marking this resolved as I guess you have made an update on your side.
{
//-Debug: TSMTRACE << "Attempting to match string '" << stringToMatch << "' against ALL stored patterns" << endl;
for ( const TQRegExp *rxPattern : patternList ) {
if ( !
The if condition is wrong. It needs an extra couple of braces.
}
}
if ( patternList.isEmpty() ) {
As above, if there is no pattern in the list, IMO we should return
true
.We can move
if ( patternList.isEmpty() )
before thefor
loop, slightly more efficient if the list is empty.TDEStringMatcher();
~TDEStringMatcher();
/**
If we refer to
README.tdestringmatcher
, we should include such file in the installed files, since this comment is part of the public API.Alternatively we can modify the comment like
I don't think forcing this file to be installed is a good idea (although my distro Gentoo proactively installs all of those readme-type files anyway).
I do think some type of reference is a good idea, how about something like this:
Refer to file README.tdestringmatcher in tdelibs/tdecore source code directory for more information on how the input string should be formatted.
I have checked what we are doing with other README (for example README.kiosk in the same folder) and we are not installing it (at least in Debian). So I think a reference as you suggested is good enough, since it is similar to other existing cases.
*/
bool generatePatternList( TQString newPatternString );
/**
The whole comment can simply be `Return the pattern string'. APIs hides away the internal implementation.
Since I am abandoning storing
patternString
(nowm_patternString
) in a private class, I have put it back into theprotected:
area in the header.In my opinion, the properties and methods in the
protected:
area are every much a part of the API as those in thepublic:
area since applications may be subclassingTDEStringMatcher
. As such, I would think that any explicit relationships between methods and properties should be documented. For instance:If am thinking about this incorrectly, let me know. Disclaimer: I have never seen the TDE API documentation generated by doxygen. I am not ever sure where it is :\
The whole concept of API is that a user of a library uses the public part of a class (that's the API) to do what he needs without having to worry about how the internal implementation of the library is done. The internal implementation can even change without the user noticing it, as long as the functionality stated in the API remains the same.
Developers who wants to extend the library with extra functionality will need to know the code and in that case they will see the whole picture, internal implementation included. That is the development documentation.
It is normally installed with the respective dev pacakges. But you can also look it up here (the important part at least):
https://www.trinitydesktop.org/apidocs.php
Probably still good to mention to refer to README.stringmatcher in the comment, since we have agreed to keep it anyway.
*/
TQString getPatternString();
/**
Suggest to use individual comment for
matchAny
andmatchAll
, since Doxygen will extract the comments for individual methods.protected:
TQPtrList<TQRegExp> patternList;
Suggest we rename this to
m_patternList
. It's convention in TDE to prefix members of a class withm_
.I agree with you and as we discussed in #170, we don't need to use a private class for this case.
Just leave things as they are. KFileItem has a lot of other private members, so the two new variables fits well where you placed them.
OVerall the PR is good. Most of the comments are related to either minor improvements or some mistakes that can be easily fixed.
There is one point we should discuss, together with @SlavekB, which is how we specify the patterns. Your solution has some pros but the pattern specification is quite convoluted (although ingenious). I had a different more traditional idea in mind, so I will list pros and cons of both ideas from what I can see and then we can discuss all three together.
Solution in this PR
Pros:
Cons:
addPattern
,modifyPattern
orremovePattern
method available. Of course they could be added.Traditional solution
Uses only a list of TQRegEx objects, without the pattern string. Patterns are appended/removed from the list as needed, by calling appropriate methods.
Pros:
Cons:
@VinceR feel free to add pros/cons to both solution as you see them from your point of view. We can then ask @SlavekB what he thinks about both to have a third opinion.
Other than that, as I said at the top, this PR is mostly ready. Well done.
Before I deal with the individual conversation items, I think it is worthwhile to discuss interface design and why I decided to approach things the way I did. I see three separate but related components that are required for a complete solution.
TDEStringMatcher
, includes:TQPtrList<TQRegExp> patternList
generatePatternList()
matchAny()
andmatchAll()
getPatternString()
andgeneratePatternList()
.TDEStringMatcherUI
(to be discussed / reviewed later).I thought that it was important to integrate standardized conversion functions (component # 2) into the core string matching engine. This makes it very easy for applications that need to store and retrieve match settings. They do not need to know anything about
patternString
syntax to initialize aTDEStringMatcher
.The UI does need to be concerned with details about match patterns and their corresponding options. With the current implementation, it needs to generate a dialog either with direct processing of match pattern list items, or by decoding
patternString
. On the other hand, a user that uses the UI will never need to understandpatternString
string syntax.The current
patternString
syntax is designed to permit the future addition of other match options. It also permits (but does not require) a more compact (less redundant) specification of pattern options. I would also argue that it is quite easy to understand once explained. It is certainly simpler than understanding regex syntax.Now in response to your pros & cons:
I consider this to be critical for the original purpose: hidden file matching. It's all going to start and end with config file settings.
Of course we could move this type of code back to applications & views, letting each one do things its own way. But I personally prefer some uniformity so when we look at different configurations, we only need to know one set of rules.
Users don't need to read anything if they use the UI. They will if they want to edit configuration files directly, but we don't encourage that anyway.
Not so much work.
TDEStringMatcherUI
can present a simplified view of "patterns" & "options" in a way that does not require knowledge ofTQRegExp
properties.For the relatively simple case of hidden file matching, I think setting all patterns at once is sufficient. But you are correct, those could be added for more complex cases. We could also add a
setPatterns()
for replacing the entirepatternList
with the content of anotherTQPtrList<TQRegExp>
.If these additional pattern list modification functions are added, it will become necessary to make
getPatternString()
actually decode thepatternList
into a correspondingpatternString
With current implementation, that's not an issue. The string is simply of the copy of the one used in the last successful call to
generatePatternList()
. Or the suggested additional pattern list modification functions are added, the string becomes a cached copy of the most recent call togetPatternString()
.Let me think a bit longer on this and the pattern string thing. Will write back in a day or two.
After further consideration, I think your approach is fine, although we should make a couple of refinements.
make the options non-persistent across patterns. This means every pattern will have its own options, like "/ow/p.*/ow/pa?c.txt" for example.
The reasons to do so is simply to simplify pattern changes. A user may want to add/remove additional patterns to the existing ones, so by making each pattern independent from the others, those operations become much simpler. Otherwise each time would be necessary to parse the whole string, keep memory of the options and adapt the string as required.
add two methods
addPattern
andremovePattern
to append/remove an indivual option+pattern to the existing pattern lists.Btw, the
setPatterns()
function that you proposed would be the same asgeneratePatternList()
, so we could as well renamegeneratePatternList()
tosetPatternString()
which would also match with the correspondinggetPatternString()
getter method.What do you think?
There is nothing that prohibits a pattern string from being like that, so let's consider the ramifications of of making this mandatory:
patternString
always have an even number of parts (option specs followed by pattern specs)?I think part of your concern may come from thinking that the UI will have to screw around directly with an arcane
patternString
. I don't think that needs to be the case. The UI can deal directly with a copy ofpatternList
(or an array derived from that) and communicate the results back to the hidden file matcher.I would be reluctant for a UI to have direct access to protected member
patternList
. If it is a copy that is being processed, then certainly those methods could be implemented within the UI.While I have been implementing suggestions in this PR, I also made some other changes to the code to support a UI:
I have been using the
generate*
naming because it connotes a process of transforming a one type of entity (e.g, string) to another (e.g. list ofTQRegExp
). To me,set*
andget*
connote copying of variables of the same type. I guess I could use names likepatternString2List
andpatternList2String
instead ofgenerate*
.yes, each "piece" of the pattern would have options + pattern string.
Nope, it can be handle as of now, specifying only the options to be active , while the inactive ones are not specified. The only difference is that we will need to repeat the options for each "piece" of the string if needed.
If we specify the option in the "piece", it will be set, otherwise there is no need to specify it. So all existing patterns would keep working.
The UI shouldn't have to do anything too complicated. Just add or remove "pieces" to the pattern string, or replace the string with a new one at once in the worst case.
UI won't have access to the internals of the implementation. You may have a string like (simplified, not literal)
pattern1:pattern2:pattern3
and the UI may sayremovePattern(pattern2)
andaddPattern(pattern4)
, resulting in the stringpattern1:pattern3:pattern4
. By having each piece to be independent from previous/following pieces of the pattern, such operations are much easier to implement and less error-prone.Highly suggest to use get/set names when two functions work in reverse for the same functionality. Getters and setters are pretty much standard practice when it comes to get or set values. They don't simply copy the values, they can do validation of the fields, reject setting wrong values and add encapsulation to the inner variables.
EDIT: having said that, it is just a suggestion, so feel free to keep current names if you wish.
Continuing the conversation on pattern string format:
I am striving to put all of the logic for encoding / decoding pattern strings exclusively into
TDEStringMatcher
implementation.TQRegExp
objects or of structs containing just the information necessary to create its dialog.That said, I am not against your idea of formating the pattern string as
option spec, pattern spec, [option spec, pattern spec, …]
. Doing so would eliminate the need to explicitly prefix each specification witho
orp
. We can debate the merits of whether or not each option spec should be fully populated with known option characters, or if not, whether "missing" option characters are inferred from a previous option spec versus a standard default.In my original implementation, applications could explicitly set the character to be used for splitting up the pattern string. In the current implementation, the 1st character of the pattern string determines the splitter.
I am now considering making the splitter a fixed character that satisfies the following characteristics:
less
.The control character Vertical Tab (
0x0B
) looks like a good candidate:^K
inless
,nano
,vi
, andemacs
.kwrite
andkate
.VT
ingeany
.What do you think of the idea generally (standard separator character) and of my proposed choice of Vertical Tab as that character?
Hi @VinceR,
apologies for the late reply, somehow I missed your message and only spotted it today.
Sounds good.
I am not against this, but keep in mind that if the UI has access to the underlying internal structure (for example TQRegExp objects) then we probably need extra code to make sure the pattern string and the internal structure are always in sync (i.e. if the UI changes the internal objects, then we need to re-derive the pattern string). In light of that, I think it would be simpler for the UI to interact with the pattern string only, so that we don't have to add extra syncing code.
Nice, I had not thought about it. That makes things even simpler. For example a string could be something like
|wp.*|wpa?c.txt
, where theo
is gone. Thep
is still used to mark the beginning of the pattern string. Regarding the various options, we can assume that those not specified will use a default value. For wildcard-regex we can default to wild card, so a simple use case would be for example|p.*
, which would filter all dot files.Uhm, many users may not even know what a vertical tab is, nor how to type it 😰 I think we should look at a character which has all the characteristics that you listed above but is common and visible. How about
|
? It's the pipe character in unix, so it is extremely unlikely that a user will use that as part of a file name.Btw I liked your idea of being able to specify the splitting character. Why don't we make something like this:
s
, the second character will specify the splitting char to uses
, then the splitting char will default to|
and the first character will be interpreted as a normal option.This way we have a simple default but still offer the opportunity to customize the splitting character if needed. For example
wp.*|wpa?c.txt
--> default split char is|
s/wp.*/wpa?c.txt
--> split char is/
p.*
--> minimal pattern string, default to wildcard (filter dot files in this example)What do you think about this?
The more I think about this, the more I favor passing implementation-agnostic structures to the UI. Someday we may want to change the
TQRegExp
implementation to something better (e.g. a port of QRegularExpression or our own wrapper around PCRE2).Already done, in revised code to be uploaded.
Users will never need to know either -- and that is by design. In fact, we want to make it very difficult to accidently type this internal-use-only character into a regex. If somebody wants to specify a vertical tab in a search pattern, they can use
\v
.Regarding rest of your suggestion:
|
is not a good default splitter since that's the regex alternation character. I had originally thought of using the horizontal tab character but TDE converts that to\t
when storing a multi-pattern string.I think I have implemented almost all of the other suggestions made in conversation items although I need to double-check. I need to fine-tune some of the new stuff I added in support of a UI, then I will be ready to upload a revised version of the code.
Although I know that you have wanted to focus on core
TDEStringMatcher
functionality, I think think everthing I have done may make more sense if evaluated in a more complete context:KonqListView
working withKDirLister
and a decent UI for changing patterns. I am overdue for re-doing the UI to accommodate the additional flexibility that comes with setting per-pattern options. My old UI works for my testing but only because I am familiar withpatternString
rules.Exactly the point :-)
Very good point, didn't think about it.
|
is not a good choice at all.The only problem I see with vertical tab is that if a user edit a config file manually in an editor, it will show up as a broken string (going to next line), so the user may collapse the string in a single row, effectively losing the vertical tab. Anyway I see you point on users not needing to edit manually in most cases, so we may as well use vertical tab, unless we find a better candidate (@SlavekB any alternative suggestion?)
Feel free to upload all the code if you wish, but the order of review will be the same :-) If possible use separate commits for eahc part (TDEStringMatcher, KDirLister, KonqListView, UI, ...) so it will be easier to review for me.
Hi @MicheleC,
Circumstances this past month forced my attention away from this PR but I am now back at it. Sorry for the long gap.
Vince
No worries @VinceR, we are all here in our spare time. No need to apologize for that :-)
MicheleC,
After such a long absence, I finally have some updates. One of the problems with this amount of time passing is that it has given me ample opportunity to brainstorm and implement "improvements". Because of this, the code has changed enough that you will probably want to go through another review cycle - sorry about that.
As before, please ignore the copious TSMTRACE statements that I will remove before this gets pushed to the main branch.
All of the changes in this commit have been tested along with other changes to
tdelibs/tdeio
,tdebase/libkonq
, andtdebase/konqueror
that are needed to create a fully functional testing environment. This includes a rewritten UI for setting match patterns and associated options.Vince
Hi @VinceR ,
thanks for the updated code. I will do a new review cycle next week (this weekend is Chinese New Year and I will be traveling) and feedback as usual.
Hi @VinceR
apologies for the delay, got busy with other stuff. I will start the review tomorrow or Wednesday at most.
I wanted to let you know that since the changes in this PR are quite at the core of tdelibs, the PR won't make it in the final R14.1.0 which is due out at the end of April. We are too close to that date to make such kind of change. But there will be plenty of time to add it to R14.2.0 (we are also putting back some other stuff for the same reason).
First part of the review done. I will continue tomorrow with the rest.
* Desired outcome of matching
TRUE: match succeeds if a string matches the match pattern.
FALSE: match fails if a string matches the match pattern.
Suggest rewording:
FALSE: match succeeds if a string does not match the match pattern.
'i' - Letter case variants are equivalent (e.g. case-insensitive)
'e' - All letter & number character variants are equivalent
'=' - Match succeeds if pattern matches [default]
'!' - Match fails if pattern matches (inverted match)
Suggested rewording:
'!' - Match succeeds if pattern does not match (inverted match)
'=' - Match succeeds if pattern matches [default]
'!' - Match fails if pattern matches (inverted match)
Options set in option string remain in effect until subsequently overridden.
As commented in the past, it is better to not carry over options across different patterns. First it makes the code easier, but most important it makes patterns independent from each other.
If options are carried over, swapping pieces around could alter the meaning of patterns.
Also what happens if we start with "rc" as defaults and the first option is just "i"? It would result in an 'optionString' of 'rci' which is invalid.
With the understanding that further discussion and resolution of this particular concer will be taking place in a future discussion:
Not invalid, 'i' wins!
Indeed 'i' wins, my bad for not getting that :-(
Let's park this point of the discussion for the time being. Later we can involved Slavek and see what he thinks about it too, but first let's settle the remianing point.
In an effort to settle this point,
README.tdestringmatcher
has been amended in the newly uploaded code. There are no more claims to support pattern option carryover although that is what happens in the code.Current README description is clearer, thanks for updating that.
The following is an example of a string representing a match specification list
intended to apply to file names
w.*ee*cr~$\\.[0-9]+
Once again, vertical tab is hard to understand and type. Yes, users may not need to know/type it, but we should try to make things easy for those who want to.
IMO, the example string is not easily undestandable, even with gitea showing the unicode values of VT rather than the character itself. Requires some thinking :-)
With the understanding that further discussion and resolution of this particular concern will be taking place in a future discussion:
I would like to direct your attention to
~/.qt/qtrc
, a perfect example of a configuration file that can be edited with a text editor but is not designed for that purpose. It takes a bit of research to understand what all those comma-separated values in the font specification actually mean and I never could find the decoder ring for the palette stuff:)I must admit that gitea's representation of VT is an eyesore. Text editors do a better job for of representing the character for purposes of copy/paste.
We probably need to involve Slavek on this as well. There are definitely config files that have complex stuff in them, I have seen all sort of things over the years, but usually things are "linear". What I mean by that is that things are complex but arranged in lines or categories. When things are broken up on multiple lines with no visible indication, users who are not aware of a VT will be confused, because they would probably expect a "normal" config file.
Let me try to explain things with an example. If the file is something like this (the line break being VT in reality):
a general user may simply think that things are on different lines and align the config for clarity as follow:
which would break the whole functionality.
HT
looks better thatVT
and using\t
in the string also avoid the rish of a user wrongly deleting or missing the tab character.TSMTRACE << "TDEGlobal::hiddenFileMatcher(): Global HFM initialization STARTED" << endl;
_hiddenFileMatcher = new TDEStringMatcher();
TDEGlobal::config()->setGroup( "General" );
char settingsDefault[5] = { 'w', SEP, '.', '*', 0 }; // wildcard match of dotfiles
Just noting how the older version
"/oW/.*"
was way more readable than{ 'w', SEP, '.', '*', 0 }
#include <tqregexp.h>
#include <kdebug.h>
typedef TQValueVector<TQRegExp> RegexList;
A TQValueList may be a better choice. TQValueVector has a capacity, so it would allocate extra memory even if there is a single element. TQValueList (being a list) does not do that.
TQValueVector
was chosen to directly correspond totypedef TQValueVector<MatchSpec> MatchSpecList
defined intdecore/tdestringmatcher.cpp
.The reason
TQValueVector
was chosen for additional public interface toTDEStringmatcher
was because it fits very nicely with theTQListBox
in the UI, where each specification will be referenced with an integer.Ok, the reasoning being the choice is good.
newMatchSpecs.append( optionString );
newMatchSpecs.append( matchSpec.pattern );
newRegexList.append( rxWork );
optionString = "";
I am confused (not withstanding that I think carry over options across matching spec is not a good idea - see previous comments).
optionString = "";
prevents options to be carried over to the next matchSpec, so it contradicts description in README file.In regard to function
setMatchSpecs( MatchSpecList newMatchSpecList )
optionString
is a work variable that is reinitialized for each match specification. The function takes as input a list (array) of match patterns along with all of their associated options. For each of these, the function derives a character string that encodes those options. The function is not making any attempt to generate the "most economical" matchSpecString (e.g. by leveraging propagated option characters) - in that respect, I think it is generating what you would prefer.Given that, on further review of the code, I would agree that the following initialization is pointless:
TQString optionString = "rc" ; // start with defaults
I think our discussion of propagated option characters will be focused on the tolerance of
setMatchSpecs( TQString newMatchSpecString )
for those.On second thought, I think I did that to seed a default state for
setMatchSpecs( TQString newMatchSpecString )
Anyway, we'll figure it out in the follow-on conversation discussing this.
Yes, that is my understand of the code too, ans that is why I was confused since the readme and previous discussion were pointing to the options to be propagated.
Let's discuss further on the next iteration.
""
};
if ( newMatchSpecString == p->m_matchSpecString )
This should be at the very beginning of the method, before anything else
for ( TQString &specification : newMatchSpecs ) {
if ( specification.find( TQChar(SEP) ) >= 0 ) {
This can't happen, because of this line above:
TQStringList newMatchSpecs = TQStringList::split( SEP, newMatchSpecString, true );
You are correct, I guess I was being overly paranoid. I will flag for removal.
break;
default:
continue; // should not arise
Although it should never happen, it would be better to flag an error and return false like in all other places.
No it should NEVER happen because this function itself had set a legal value during the previous analysis of the associated option string. Won't handling this impossible condition be similar to what I had done when looking for SEP in a substring produced by
split( SEP... )
? I will go ahead and follow the recommendation but this raises another question I've been meaning to ask you.In this and other functions, I look for various error conditions and when one is encountered, I issue a message, free some allocated memory, and return. I know that
goto
has been frowned upon by programming purists over the years but it seems that there might be a good case for using it for these early error exits. Consider the following example and let me know if you would flag it as bad practice:I agree, it should never happen for that exact reason. So we don't need a default case at all. But if we put it in, then it may make more sense to flag an error because it means we have a bug in the code. I am ok either way.
Forgot to comment on the use of
goto
: I could direct you to this famous article but simply put, there are better ways to write the same code without usinggoto
, avoiding subtle issues that could arise when strange execution path are followed.// alphanumeric character in rxWork.pattern to its "least" equivalent.
break;
default:
continue; // should not arise
Flag error, see before comment
newMatchSpecList.clear();
newRegexList.clear();
return false;
continue;
continue
will never be reached, it can be removedprocessingPattern = false; // next spec should be an option string
continue;
}
rather than using
continue
at line 270, it would be easier to read and understand if there was anelse
at this point and the part below be inside its body.if-else
makes immediately obvious the flow.Agreed. I will also move the shorter option processing code to the
if
and the pattern processing code to theelse
{
PatternType patternType;
ANCHandling ancHandling;
bool wantMatch; // "matching" vs. "not matching"
Suggest renaming
wantMatch
to something like 'expectMatch'* Container used in a TDEStringMatcher object
* representing multiple match specifications.
*/
typedef TQValueVector<MatchSpec> MatchSpecList;
A
TQValueList
may be a better choice.TQValueVector
has a capacity, so it would allocate extra memory even if there is a single element.TQValueList
(being a list) does not do that.Repeating my response to related earlier conversation:
The reason
TQValueVector
was chosen for additional public interface toTDEStringmatcher
was because it fits very nicely with theTQListBox
in the UI, where each specification will be referenced with an integer.Understood and I am fine with the explanation.
/**
@return list of currently defined match specifications.
*/
MatchSpecList getMatchSpecs();
we can make this function
const
or provide two versions (const/non const) if editing the return value is required./**
@return string encoding list of currently defined match specifications.
*/
TQString getMatchSpecString();
Same comment about making this const
Utility function for converting a wildcard pattern string
to a regular expression pattern string.
*/
TQString wildcardToRegex( const TQString& wildcardPattern );
Not sure if this is used in other part of the code not yet uploaded, but in case it is not we could make this function protected
I will be discussing this function further in a future conversation.
Although it potentially could be used in
TDEStringmatcher
, it is not tied to that class. I thought it would be nice to expose the function publicly in case others might want to use it for whatever reason.See other comment about fixing the wildcard to regex directly in TQt3 rather than duplicating the code. Thehn this comment thread will no longer apply since this function would not be in the TSM class anymore.
/**
Utility function for escaping all regex-specific characters.
*/
TQString escapeRegexChars( const TQString& basicString );
As above one, could it be a protected method? Or isit used in tdebase or somewhere else (don't see any usage in this PR)
True, it is not currently being used anywhere in the code. If we decide to use the
TQRegExp
for matchingPatternType::SUBSTRING
(instead ofTQString
), it will become useful.Although it potentially could be used in
TDEStringmatcher
, it is not tied to that class. Again, I thought it would be nice to expose it publicly in case others might want to use it for whatever reason.Ok, let's discuss again in a further iteration.
private:
class TDEStringMatcherPrivate;
TDEStringMatcherPrivate *p;
It is a TQt3/TDE convention to use
d
for the pointer to the provate class, notp
. Don't know whyd
but it's everywhere, so good to follow that.};
// Use vertical tab as m_patternString separator
inline constexpr char SEP { 0x0B };
Suggest a more meaningful name, like for example "TSM_PATTERN_SEP", so when used in other files it is easier to guess the context and function of it.
Or maybe make it a static member of TDEStringMatcher, so that it can later be used as TDEStringMatcher::SEPARATOR?
Second part of review. Still need to review the last two files, which I will hopefully do tomorrow.
TSMTRACE << "Attempting to match string '" << stringToMatch << "' against stored patterns" << endl;
if ( p->m_matchSpecList.isEmpty() ) {
//-Debug: TSMTRACE << "Match failed on empty pattern list!" << endl;
return false; //FIXME: or should that be true per MicheleC's comment?
I think
false
is better. If the string to match is empty, it can't match patterns, so returning could be misleading. If I commented like that before, I was wrong or thinking something else.TQString matchThis = stringToMatch;
if ( p->m_matchSpecList[index].ancHandling == ANCHandling::EQUIVALENCE )
{
if ( equivalentString.isNull() ) {
isEmpty()
is a better check thanisNull()
// FIXME TBD: This is where we will be converting each alphanumeric
// character in stringToMatch to its "least" equivalent and storing
// the result in equivalentString. Until then, we'll just do:
equivalentString = stringToMatch;
We can use
matchThis = equivalentString;
instead ofequivalentString = stringToMatch;
(with the equivalence conversion when FIXED) and so there is no need for theequivalentString
variable at allI'm drawing a blank on what you are recommending.
equivalentString = stringToMatch;
is a merely placeholder for a future code whereinequivalentString
is constructed for the first and only time for use during this and future matches requiringANCHandling::EQUIVALENCE
. Doing what you recommend now will cause matching against an empty string instead of an unconverted string.BTW, I have implemented and successfully tested
ANCHandling::EQUIVALENCE
. Once we complete this review cycle, I recommend we tackle the other parts of this PR (e.g. theKDirLister
changes) before reviewing my implementation ofANCHandling::EQUIVALENCE
which I suspect will engender a lot of discussion :)Ah, I now understand what you intended to do. You want to calculate the equivalent string but only if that is required. And you are doing that in the same
for
loop that does the matching too.I admit it was a bit hard to grasp the logic at first.
I was about to proposed splitting the for loop into two for better clarity but after thinking about it further, I think your code is better.
//-Debug: TSMTRACE << "Attempting to match string '" << stringToMatch << "' against stored patterns" << endl;
if ( p->m_matchSpecList.isEmpty() ) {
//-Debug: TSMTRACE << "Match failed on empty pattern list!" << endl;
return false; //FIXME: or should that be true per MicheleC's comment?
I think "false" is better. If the string to match is empty, it can't match patterns, so returning could be misleading. If I commented like that before, I was wrong or thinking something else.
TQString matchThis = stringToMatch;
if ( p->m_matchSpecList[index].ancHandling == ANCHandling::EQUIVALENCE )
{
if ( equivalentString.isNull() ) {
isEmpty()
is a better check thanisNull()
// FIXME TBD: This is where we will be converting each alphanumeric
// character in stringToMatch to its "least" equivalent and storing
// the result in equivalentString. Until then, we'll just do:
equivalentString = stringToMatch;
We can use
matchThis = equivalentString;
instead ofequivalentString = stringToMatch;
(with the equivalence conversion when FIXED) and so there is no need for theequivalentString
variable at all}
if (
( p->m_regexList[index].search( matchThis ) < 0 ) // was there no match?
Suggest to rewrite in a clearer way.
Also I think there was a bug, because it was doing inverted comparsion.
If no match was found and we wanted a positive match, it was not going to return false. So the first
less than
comparison should be agreater or equal than
comparison (correct me if I misunderstood the logic here).Save the result of the first line in a bool variable, then use it in the if() expression. Simpler to read and understand.
I sheepishly admit that there is a LOT more wrong with the
matchAll
function than this. It looks like I didn't propagate the changes I had made to thematchAny
function. I will make corrections, then we can revisit this and the next conversation item concerning this function.Ok
return false;
}
if ( p->m_regexList[index].search( matchThis ) < 0 ) {
This
if
is not required IMO. There is thewantMatch
field to choose whether a string should or should not match, and this is tested in the previousif
. Thisif
logic is the equivalent of requiring any string to match, which contractict the poinf of havingwantMatch
.I guess this was added because of the comparison error explained in the
if
aboce.//================================================================================================
/*
The following code is a modified copy of that found in tqt3/src/tools/qregexp.cpp.
I have not checked the code of this function, but just wondering what is the difference that requires a custom function for this.
Also why can't we just match using the "wildcard" functionality of TQRegExp?
Wildcards are currently handled internally by
TQRegExp
by converting them to regular expressions via the internal TQt functionwc2rx
that is not exposed in a public API (seetqt3/src/tools/qregexp.cpp
~ line 810). I do not think this tight binding of wildcard processing and a particular regular expression engine is a very good idea for the following reasons:exactMatch()
instead ofsearch()
function that would normally be used for regex's.wc2rx
function has what I believe to be a bug in that it does not treat '!' as the wildcard character class negation character.QRegularExpression
leaves wildcard-to-regex conversions up to the caller: https://doc.qt.io/qt-5/qregularexpression.html#wildcardToRegularExpressionfnmatch(3)
fromglibc
since that would allow handling of a richer wildcard expression language.Uhm... I reserve further judgement after looking into the code, but usually wildcard in linux/bash are
*
,?
and[]
. The!
is not considered a wildcard negation. Perhaps different type of applications do, but I believe most users are not expecting!
to negate a wildcard character.Personally, I've never used more than
.
and*
wildcards but according toman 7 glob
:An expression
[!...]
matches a single character, namely any character that is not matched by the expression obtained by removing the first!
from it. (Thus,[!]a-]
matches any single character except]
,a
, and-
.)Now that regular expressions have bracket expressions where the negation is indicated by a
^
, POSIX has declared the effect of a wildcard pattern[^...]
to be undefined.Read further and then take a look at the extensions in
man fnmatch
. It looks like "wildcard" (irregular) expressions can get just as complicated as normal regular expressions. Maybe these would best be supported independently by calls to that function.Thanks for the info, I wasn't aware of that (likewise I use pretty much just
.,
?and
*`.In such case I think it would make more sense to fix the related function in TQt3 to add support for
[!...]
rather than duplicating and modifying the code, which would result in non consistent behavior between the two versions. What do you think?Consider the following differences between the proposed public
TDEStringMatcher::wildcardToRegex()
method and the internalTQRegExp
wc2rx()
helper function:!
, not^
as character class negation character.^
and ends with$
, resulting in correct wildcard matching with theTQRegExp::search()
function. Currently awc2rx
-created regex requiresTQRegExp::exactMatch()
for correct matching.TQRegExp
orTDEStringMatcher
.Certainly all of this can be done with an update to
TQRegExp
but there is a potential that implementing the first 2 items might break existing applications that depend on the current incorrect idiosyncratic behavior. If you don't think this is a deal breaker, I will open an issue to request these changes.The new code is using
fnmatch(3)
on GLIBC systems which leverages its own matching code that does not rely on prior conversion of a wildcard expression to regex. Even so, it might be nice to still provide a conversion utility for applications that want to go down that path.@return whether or not @param stringToMatch matches any of
the current match specifications.
*/
bool matchAny( const TQString& stringToMatch );
This could be made a
const
method I guess.@return whether or not @param stringToMatch matches all of
the current match specifications.
*/
bool matchAll( const TQString& stringToMatch );
This could be made a
const
method I guess.Hi, MichelleC,
Thanks for your review so far. I think I will go back and resolve the earlier conversations on the assumption that if I missed something from those, you will be reminding me of it during the current review cycle.
I wish I knew more about git. I am currently working on 2 independent branches, the current one you are reviewing and a more comprehensive one that I use for RnD and testing. As it is now, I need to make changes, small and large, to both branches. There must be a better way using git magic. Or maybe not :)
I am perfectly fine with not attaining the 14.1 target. I do hope the interval between future feature releases will be shorter than the one between 14.0 and 14.1.
Do you have a sense of how many people there are that actually use the current 14.1 development branch code for their desktop? I hope it's a lot because there is no better test of committed changes than actual daily use outside of the laboratory.
Vince
Hi Vince,
I shall complete the review of the last two files later today (weekend was busy...) and provide an overall take of where we are so far.
For git, I usually work in the following way. It may not be the best, but it fits my workflow.
R14.1.0 suffered from lack of planning for the first years. I would say till 2018 there was no real plan. Then we came together with a plan but we had to clean up some mess from the previous 4 years and we ended up where we are. R14.1.0 will be out at the end of April if all goes as per schedule.
For future release I would like to see a R14.x.0 release every 3 or 4 years at most and we will do planning from the very beginning. So it should be a shorter interval.
To be honest I don't have any idea about that. There are for sure a number of people using it, myself included. But I think the majority are on R14.0.x.
Last part of the review.
return;
TSMTRACE << " Attempting to disconnect slots from hidden file matcher (" << m_pHiddenFileMatcher << ") signals ... " << endl;
if ( (m_pHiddenFileMatcher != nullptr) && disconnect( m_pHiddenFileMatcher, 0, this, 0 ) )
Suggeest minor adjustment: move the
disconnect
in theif
body rather than doing it in theif
condition. It does not change the functionality but readability greatly benefits from it.if ( hiddenFileMatcher == nullptr ) {
kdWarning() << "KFileItem::setHiddenFileMatcher: called with null pointer, nothing will be hidden any more" << endl;
return;
There is a logic bug here, because
reEvaluateHidden();
is not called and if we setnullptr
as the new file mather object, the old hidden status would remain till the pattern changes.We need to call
reEvaluateHidden();
before returning, but I would suggest a different coding style:This way
reEvaluateHidden();
is always called regardles of whether we are setting a new file matcher or removing an existing one.Hi @VinceR,
I finished this round of review.
Good progress, many things are now in place and the design is more robust compared to the very first version from so long ago.
Most of my comments are minor adjustments and should not take long to rectify.
There are only 2 points which needs some discussion:
the choice of separator character. As already expressed, the vertical tab is not much of a user friendly choice IMO, although quite unique indeed. I think the very first suggestion from you (using
/
) is a better choice. I know/
can be part of a pattern, but I think we could simply escape that as '/' is needed. Users should be quite familiar with that since it is a common thing to do. And it avoids "broken lines"whether to make options carry forward or not. IMO, having independent pieces is easier from a programming and logic view and less prone to mistakes when moving things around.
Having said that, see one of my comments because unlessI have misunderstood, it seems the code is not carrying options over but the readme file say so.
We are definitely going in the right direction. Although we will miss the R14.1.0 window, we can merge this into R14.2.0-dev when we are ready.
Hi MicheleC,
Addressing the separator character:
I too am not thrilled about using
<VT>
. My standard go-to separator for dividing text strings is the horizontal tab (<HT>
) and I had some initial concerns about using that for the divider that may not be founded. But even using<HT>
won't make editing config files that easy.Consider an example where a user wishes to define 3 regular expressions in the matcher:
1\st
,2\nd
and3\rd
. Ideally, TDE would store this specification literally in the configuration file as:1\st 2\nd 3\rd
(those are horizontal tabs separating the the 3 expressions). But instead, it stores it as1\\st\t2\\nd\t3\\rd
. I don't agree with what TDE does here: why does\
need to be escaped? and why does a perfectly good and visually pleasing horizontal tab need to be converted to\t
? But that's the way it works.I don't think it is reasonable to ask users, working in a UI, to remember to escape
/
characters in their regular expressions. They don't have to do that anywhere else. We could post-process user input and convert unescaped instances of/
to\/
but ... just to support direct editing of configuration files?Let me propose that we use
<HT>
instead of<VT>
. I was originally concerned that this would present a conflict when users specify\t
in their regular expressions but given TDE's overzealous escaping of\
, that won't be an issue.is definitely preferrable compared to (see also other comment I made earlier today). Alternatively, why not using
,
and make it a CSV list of options and patterns?I have a few more comments to answer, I will do that probably on Thursday.
Done with the remaining comments :-)
Hi MicheleC,
It's been a while, so I thought I should provide a status. Despite my silence, I have been pretty busy with this PR. Once I've made some final design decisions, I will respond to the remaining conversation items and push updated code. I will also include the code that implements what I have dubbed "Alphanumeric Equivalence", which takes case-insensitivity to another level.
One of the things I have done is to develop a class that provides a simplified interface to PCRE2 that can be used as a regex engine in
TDEStringMatcher
instead of (or in addition to)TQRegExp
. I still need to do some more development and testing, but if this turns out to be a successful endeavor, I will push updated code later for review.Vince
Ok, thanks for the update @VinceR. Looking forward for the new code and the next review cycle :-)
In an effort to settle this point, the new code is using
<HT>
as the separator. We don't want to use,
as the separator because that would require UI users who wanted to write a literal,
in a pattern to write\,
instead. This is not a problem for<HT>
since users already know to specify\t
in patterns that require it.Hi @MicheleC,
First, congratulations to you, @SlavekB, and the rest on getting 14.1 released. I am sure that was a big job and hopefully users will find the update both pleasing and bug-free.
As you can see, I have uploaded new code for this PR and also made responses to conversation items that still merited a specific response from me.
@VinceR
Hi @VinceR
It's a team effort, so everybody who contributed deserves credit for it, including you :-)
I will review the code sometimes this week or next and feedback as usual.
See comments
// #define TSM_PCRE2
// #define TEQUIVCHARS_HASH_LOOKUP
#ifndef TEQUIVCHARS_HASH_LOOKUP // Using binary search on sorted array
We would need to add cmake code to detect the presence of the hopscotch-map library and then define TEQUIVCHARS_HASH_LOOKUP from cmake. The package is available at least in debian, so it is indeed a good idea to use it if it is faster.
First, a comment about the current code: I included everything that contains an idea that that I pursued as a possible implementation. I did that mainly to show you what ideas I tried. All of those possibilities have been tested and work. I intend to choose one of them to use for implementing implementing TSM's
ANCHandling:Equivalence
Specifically:
Hash table lookup vs. binary search on pre-sorted table
The 2 methods were close in lookup speed but there was a noticeable delay while the hash table was being constructed. For a table this small, I doubt using a hash makes any sense. Using a 3rd party hash presents additional maintenance difficulty, even though it might out-perform
std:unordered_map
. For that reason, I want to use binary search.The need for the replaceCharsMB() function.
This was the original code, developed before I gained a better understanding of how to properly access the 16-bit characters in a
TQString
. I have learned a lot since then, not only in developing this class but also in developing a class wrapper for PCRE2. Although the function works and performs reasonably well, there are a lot of problems with using the types and functions defined inwchar.h
to process strings of 16-bit characters on linux systems. The final code will not have this function.Not completely true; see code comment just before the definition of
replaceChars
function intdecore/tequivchars.cpp
for the complete picture.I think it is a very sensible explanation and choice. A 3rd party library for such a small table would be extra maintenance work for not much benefit.
class TEquivChars_Private
{
public:
#ifndef TEQUIVCHARS_HASH_LOOKUP // Using binary search on sorted array
This comment covers the whole code between lines 10 and 30 plus the tequivchars-mapping.h file.
We should move the line 20-30 into the tequivchars-mapping.h file, so that the equivalence table is a well formed array. By doing so, we do not need to hard code the precise size of the map as in line 11-12, but we can derive them using sizeof statements, making the code robust to addition/deletion from the table.
tequivchars-mapping.h needs to be guarded against multiple inclusions with an #ifndef-#define-#endif block.
would the code still work if we use
TQChar
instead ofw_char_t
? Nothing against the latter, but given that TDE is heavily coded using TQt3 classes, it would be more natural to use TQChar if the functionality is not affected.I understand what you are asking but hard-coding table dimensions apparently is required somewhere. Following your suggestion, I defined the table in
tequivchars-mapping.h
asEquivalentsTable[2993][2]
. Doing that allows me to compute table rows usingsizeof()
. But I was completely unsuccessful when I triedEquivalentsTable[][2]
, getting error messages such as "flexible array member ... is not a end of class ..." and "initializer for flexible array member".I will do this but this file is not a true header file but a code snippet, that is included exactly once in the only code that needs it.
I think that would make the code more cumbersome and possibly less efficient . I do use TQChar when stepping through the input string and when producing the output string.
That said, I now think that using
w_char_t
is not a great idea. That's a 32-bit value on linux systems and we are dealing with 16 bit characters. I am going to change it tounsigned short
Something like this should work and give the correct number of rows and columns
Yes, correct. My original comment was meant as a follow up to make the table a well formed array.
You are using TQChar in part of the string code. Why use a different type (unsigned short, w_char_t, ...) that will result in more (possibly implicit) conversions between types each time? For the record, TQChar uses an nsigned short internally, but using TQChar explicitly will make the code more robust. TQString/TQChar internal implementation could be changed without affecting this PR.
};
#ifndef TEQUIVCHARS_HASH_LOOKUP // Using binary search on sorted array
inline wchar_t TEquivChars_Private::lookup( wchar_t key )
The
lookup()
code is fine, but perhaps we could consider using std::binary_search()?Unless I am overlooking something, that returns only a boolean value. I need a row number.
You are correct, I didn't check the std::binary_search signature before. Very surprising indeed, as often you want to find the place in the array where an element is.
bsearch()
from stdlib does that, so you may want to take a look. Thelookup()
code is perfectly fine, just longer./*
There are 2 implementations of table lookup within this fuction: one that
uses hash table lookup and another that uses binary search on a presorted
table. We default using the binary search implementation for these reasons:
I miss the point 😕
If we already tested and figured out that using the hopscotch-map library is slower and uses more CPU in most cases and therefore we default to the binary search, why making the code more complicated by including the code for the hopscotch-map library at all? Surely it's a nice exercise to test it and compare the performances, but once there is a clear winner, it is preferrable to keep the code as simple and readable as possible 😃
TQString TEquivChars::replaceChars( TQString inputQstring, bool isRegex )
{
int inStrLen = inputQstring.length();
TQString outString = TQString::fromLatin1( "" );
Given we know the length of the input string and we want to replace characters with equivalent ones, we should reserve space in the
outString
from the beginning to make the code more efficient. See documentation forTQString::reserve()
bool startedCharClass = false; // Previous character was starting '[' of character class
bool inCharacterClass = false; // [___]
bool inPosixBracketExpr = false; // [:___:]
#ifdef TSM_PCRE2
We would need to add cmake code to detect whether PCRE2 regex are supported and define TSM_PCRE2 accordingly
I will do that once I introduce the class wrapping PCRE2. Since I went to some effort to develop and test the additional code to support PCRE2's enhanced regex, I would like to keep the #ifdef's in place.
Sounds good, as long as we do it before merging the PR 🙂
bool inDirective = false; // (*___)
bool inGroupName = false; // (?<___>
#endif // TSM_PCRE2
wchar_t currChar = 0;
As per other comment, is it possible to use TQChar instead of w_char_t?
wchar_t prevChar = 0;
wchar_t nextChar = 0;
for ( int i = 0 ; i < inStrLen ; i++ , outString += TQChar(currChar) )
If we pre-reserve the space in the outString, this should be efficient already. But you can also consider direct indexing-n-modifying into the string, as per
TQString::reserver()
example. Or you can also consider using a TQTextOStream on the string.} // isRegex
CONVERT:
This label is not needed. At best, it should be made into a comment.
// No need to "handle" exception, *currchar was not changed
}
#else // TEQUIVCHARS_HASH_LOOKUP => Use binary search
currChar = p->lookup(currChar);
Consider using std::binary_search() as mentioned in a previous comment.
the one that replaced it.
*/
TQString TEquivChars::replaceCharsMB( TQString inputQstring, bool isRegex )
Coming from a point of view of keeping the code as simple as possible, the question is whether
replaceCharsMB
is needed or not. Is there a specific use case thatreplaceChars
cannot handled and thereforereplaceCharsMB
is needed? If the answer is yes, we should consider using onlyreplaceCharsMB
. If the answer is no, we should consider using onlyreplaceChars
.expression and the alphabetical characters inside Posix bracket [::]
expressions are left as-is
*/
TQString replaceChars( TQString inputString, bool isRegex = false );
Suggest we make
inputString
a const referenceconst TQString &inputString)
Alternative implementation of replaceChars function that uses some
"multibyte string" / "wide character" functions defined in wchar.h.
*/
TQString replaceCharsMB( TQString inputString, bool isRegex = false );
Suggest we make
inputString
a const referenceconst TQString &inputString)
@VinceR
I finally reviewed your latest commit and provided feedback about it.
Step by step, we are moving closer to the point where the PR will be ready for merging.
To summarize where we are and to make sure I did not forget anything, the next steps would be:
After that, we should move at reviewing the rest of the code that will build on this PR. In that respect, does TDE/tdebase#270 need any modification or is it good to go as is? And do we need more code on top of this PR and the one in tdebase? It has been a long time, so I lost a bit of clarity at where we stand in regards to the overall initial idea.
Ok, I think it is a good compromise.
<HT>
is definitely more readable than<VT>
and your observation on\t
makes sense. I didn't see any new code for it though, so maybe something you still haven't published.I will take a look at the feedback and respond in the next week.
Take a look at
tdestringmatcher.{h,cpp}
There were a number of changes in these, some of which implementedANCHandling::EQUIVALENCE
Yes, I need to push the remaining changes to:
These will be a lot more straightforward to review, I promise :)
I still need to push the latest changes for that PR
Well ... I've been unit testing a new class that wraps PCRE2. I would eventually like to use it instead of
TQRegExp
forTDEStringMatcher
because it supports a richer regex language, it's well supported, it's faster, and does not have theTQRegExp
bugs. But in the interest of avoiding more scope creep in this PR, I should probably introduce that in a new PR.tdestringmatcher.h
line 57:inline constexpr char PatterStringDivider { '\t' };
andreferenced in a few other modules (namespace TSM).
But I do see an error: it should be spelled
PatternStringDivider
The screenshot below is taken from the current code on this PR on gitea. I don't see the PatterStringDivider.
And as well I still see this comment where the equivalence conversion should be:
Please double checked you pushed all the changes :-)
See previous comment about not seeing the changes :-)
Thaanks for refreshing my memory
Yes, a follow up PR is probably a good idea
Well I'm not sure how this happened, but apparently I committed and pushed only the new files and not the updated existing files.
In order to not confuse things too much, I will review your feedback on the new files, make changes accordingly to them, and then push everything again as merged commit.
OK, I have done this and am ready to push a replacement commit. Since I don't want to destroy context, I will await your responses to my responses to your conversation items before I do the push.
I am done with replying to the various points.
Maybe it could be wiser to push the code in a new PR and close off this one? That way we preserve the comments here and we start a cleaner discussion for the review of the new code? What do you think?
Discussion continues on PR #209.
@VinceR should we mark this PR as closed since we continue on the new version?
WIP: Address issue 270 - Extend meaning of "Hidden Files" (PR version 3)to WIP: Extend meaning of "Hidden Files" (PR version 3) 11 months agoYes, I will mark it closed now.
Reviewers