Deduplicate code in TQObject::activate_signal() #123

According to a benchmark the while() loop with direct use of iterators
were ~10% slower than the current one which handled case when there is
only one connection to a signal separately.

But use of a range-based for loop sped up even this case for ~1.5%.

Closes: #121

Seems it causes crashes... I'll mark it as WIP

LGTM

Seems it causes crashes... I'll mark it as WIP

Ah missed this comment.

I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers.

Before the patch it surfaced only if there were 2+ connections to a slot and was less likely if the object was deleted in the last slot. With the patch it's much more likely to happen even if there is only one connection.

I don't suppose we can reliably track all the instances where actual deletion happen, so the best we can hope for is to mitigate the issue somehow. (One of variants to mitigate it is "Keep it as-is and don't touch").

I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers.

Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO.

Perhpas for the time being we can "keep it as it is" as you suggested because it is not good to crash tqt3, but this PR could be useful to test/debug wrong code somewhere else.

Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO.

Generally it is, but there are at least two cases like that in tqt itself and if it will be done in a safely manner it could be ok. One produces a semi-persistent crash and another I caught only when I added some debug prints. It will be almost impossible to track all those through out the whole code base and unless we do it it will be just random crashes and disappointment for users. Plus fixing each one of them might have a performance impact (though relatively small)...

I actually have a couple ideas how to workaround this...

Ok, I'm happy enough how it looks now... and no more crashes...

This PR also include a couple of additional patches:

added checks for sol from #124
added a lock when deleting ourself from another object's sender list; I believe there are more missing locks, but this one is the only one I'm pretty confident about.

I really like the solution proposed for the crash, very ingenious. Yet it raises a point for thoughts.

Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their sender would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB.

I think we are still missing something and until we clear ourselves on that, we should be cautious with this change. What do you think?

Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their sender would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB.

Then they are screwed, and it's definitely their fault... If you call sender() you are responsible for making sure it wasn't deleted before... luckily, this situation would have caused a crash even with the current code.

To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem...

If you call sender() you are responsible for making sure it wasn't deleted before...

Indeed. And there is even one more complication: sometimes the sender and the receiver are the same object. So if the first receiver delete the sender or itself, then the second receiver will call a slot on a non existing receiver object. Even worse!

To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem...

That is another good point/problem.

The question is now what to do with this PR. Options:

do nothing, leave code as is, i.e. keep the duplicated 60 lines
merge PR as is. I don't expect the thin wrapper to cause any performance degradation. Anyhow I would suggest we test the modified code for a few days on our own system before officially merge, to make sure there are no other side effects
try to understand the problem more (i.e. postpone merging) and see if we can find other ways.

I am open to all three options. @Fat-Zer @SlavekB what is your preference?

I did not look at it in detail, but the change seems to make this block of code clearer. At the same time, Alex measured some improvement in speed. This seems like a win-win result.

I assume that we can undergo a change in testing for a few days and further deeper examination of the problem than it will be approved and merged.

Just built tqt3 with this PR and PR #109, installed it (no reboot yet) and run

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtqt-mt.so.3.5.0 konqueror

Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-)

backtrace

#6  TQObject::activate_signal (this=0x558a2bd1af00, clist=<optimized out>, o=o@entry=0x7fff6599fdb0) at kernel/qobject.cpp:2814
#7  0x00007fe8dcecec60 in TQObject::activate_signal (this=this@entry=0x558a2bd1af00, signal=<optimized out>) at kernel/qobject.cpp:2777
#8  0x00007fe8dbbac8fc in TDEIO::SlaveInterface::finished (this=this@entry=0x558a2bd1af00) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slaveinterface.moc:244
#9  0x00007fe8dbbae768 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00, _cmd=104, rawdata=...) at ./tdeio/tdeio/slaveinterface.cpp:243
#10 0x00007fe8dbbabe61 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00) at ./tdeio/tdeio/slaveinterface.cpp:173
#11 0x00007fe8dbba9fc1 in TDEIO::Slave::gotInput (this=0x558a2bd1af00) at ./tdeio/tdeio/slave.cpp:300
#12 0x00007fe8dbbabd40 in TDEIO::Slave::tqt_invoke (this=0x558a2bd1af00, _id=4, _o=0x7fff659a0090) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slave.moc:124
#13 0x00007fe8dcece964 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, clist=clist@entry=0x558a2b96d640, o=o@entry=0x7fff659a0090) at kernel/qobject.cpp:2854
#14 0x00007fe8dceced76 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, signal=<optimized out>, param=<optimized out>) at kernel/qobject.cpp:2952

Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-)

damnit... looks like it's exactly what I was dreaded for: it changes connections in the slot...

I have another idea to workaround this too: move the TQShared part upon the TQConnectionList itself and monitor it changes.

It will break ABI (though in a relatively minor way). Only probably affected classes would be those who use TQConnectionList in a non-opaque manner. As I far as I can grep it's several places in tdebindings, kspy from tdevelop and somewhy kview from tdegraphics. Though as far as they don't construct destroy the objects themself it shouldn't really introduce any failures to those (but I can't guarantee that)...

It will also change TQSignalVec, but it seems nobody uses it anyway (I was thinking of moving it into implementation)...

Ok, this time it seems to be fine... at least the system boots fine and there are no obvious crashes or bugs with library compiled from the PR...

I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean. And if we are doing all this effort to spare 60 lines of codes, I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances.

I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean.

I haven't measured it, but it shouldn't affect general performance. The deep copying happens only with quite specific circumstances: connection modifications from within a slot. So it shouldn't be that often; as the signal lists are generally small (1-2 connections), the copying shouldn't take that long. Any longer lists should have caused crashes with the current implementation.

It's hard to put all that into a normal performance test due to the process being stochastic. I can imaging only collecting some tallies to see how often it actually happens on a real system, if you wish (but frankly I don't really want to do it)...

And if we are doing all this effort to spare 60 lines of codes

Not only that, the fact that we couldn't simply replace this code is an indicator that the underlying code is broken.

I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances.

Yeah, I see what you mean... To be honest, I myself already wondering if I'm just a hostage of some sunk cost fallacy here, due to how much time and mental efforts I've already spent on this one... But I still believe it's still yet justified...

Will review the code this week and let's discuss from there. It is true the original code was buggy anyway, so perhaps all the effort made is for a good reason :-)

@Fat-Zer @SlavekB I propose we leave this PR for after R14.1.2. We are too close to freezeing the release and if we get something wrong, we could introduce bugs without enough time to test properly (see PR #127, issue #142). What do you think?

Yes, that's right, now we have already come to the stage where we need to calm down and focus on stabilization before release. There would be not enough time to test other major changes. In any case, it is great how many fixes and optimization managed to do in this development window. Thank you all participated for this effort!

I'm ok with that...