Deduplicate code in TQObject::activate_signal() #123

Open
Fat-Zer wants to merge 5 commits from feat/dedup-tqobject into master
Collaborator

According to a benchmark the while() loop with direct use of iterators
were ~10% slower than the current one which handled case when there is
only one connection to a signal separately.

But use of a range-based for loop sped up even this case for ~1.5%.

Closes: #121

According to a benchmark the while() loop with direct use of iterators were ~10% slower than the current one which handled case when there is only one connection to a signal separately. But use of a range-based for loop sped up even this case for ~1.5%. Closes: https://mirror.git.trinitydesktop.org/gitea/TDE/tqt3/issues/121
Fat-Zer added 1 commit 2 months ago
5a662b671d
Deduplicate code in TQObject::activate_signal()
Poster
Collaborator

Seems it causes crashes... I'll mark it as WIP

Seems it causes crashes... I'll mark it as WIP
Fat-Zer changed title from Deduplicate code in TQObject::activate_signal() to WIP: Deduplicate code in TQObject::activate_signal() 2 months ago
MicheleC approved these changes 2 months ago
Dismissed
MicheleC left a comment
Owner

LGTM

LGTM
Owner

Seems it causes crashes... I'll mark it as WIP

Ah missed this comment.

> Seems it causes crashes... I'll mark it as WIP Ah missed this comment.
MicheleC dismissed MicheleC’s review 2 months ago
Reason:

Due to crash reported by Alex

Poster
Collaborator

I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers.

Before the patch it surfaced only if there were 2+ connections to a slot and was less likely if the object was deleted in the last slot. With the patch it's much more likely to happen even if there is only one connection.

I don't suppose we can reliably track all the instances where actual deletion happen, so the best we can hope for is to mitigate the issue somehow. (One of variants to mitigate it is "Keep it as-is and don't touch").

I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers. Before the patch it surfaced only if there were 2+ connections to a slot and was less likely if the object was deleted in the last slot. With the patch it's much more likely to happen even if there is only one connection. I don't suppose we can reliably track all the instances where actual deletion happen, so the best we can hope for is to mitigate the issue somehow. (One of variants to mitigate it is "Keep it as-is and don't touch").
Owner

I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers.

Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO.

Perhpas for the time being we can "keep it as it is" as you suggested because it is not good to crash tqt3, but this PR could be useful to test/debug wrong code somewhere else.

> I found the cause of the crashes: it's probabilisticly happens when a TQObject gets deleted by one of its signal handlers. Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO. Perhpas for the time being we can "keep it as it is" as you suggested because it is not good to crash tqt3, but this PR could be useful to test/debug wrong code somewhere else.
Poster
Collaborator

Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO.

Generally it is, but there are at least two cases like that in tqt itself and if it will be done in a safely manner it could be ok. One produces a semi-persistent crash and another I caught only when I added some debug prints. It will be almost impossible to track all those through out the whole code base and unless we do it it will be just random crashes and disappointment for users. Plus fixing each one of them might have a performance impact (though relatively small)...

I actually have a couple ideas how to workaround this...

> Isn't is wrong in first place though? A slot should not delete the sender of the signal, IMO. Generally it is, but there are at least two cases like that in tqt itself and if it will be done in a safely manner it could be ok. One produces a semi-persistent crash and another I caught only when I added some debug prints. It will be almost impossible to track all those through out the whole code base and unless we do it it will be just random crashes and disappointment for users. Plus fixing each one of them might have a performance impact (though relatively small)... I actually have a couple ideas how to workaround this...
Fat-Zer added 3 commits 2 months ago
ebc3e67bf4
Workaround crashes caused by some slots deleting the sender
14e882386b
TQObject::activate_signal: uniformly apply if(sol) checks
74f7b9309e
Add a lock to removing ourself from another's object's sender list
Fat-Zer changed title from WIP: Deduplicate code in TQObject::activate_signal() to Deduplicate code in TQObject::activate_signal() 2 months ago
Poster
Collaborator

Ok, I'm happy enough how it looks now... and no more crashes...

This PR also include a couple of additional patches:

  • added checks for sol from #124
  • added a lock when deleting ourself from another object's sender list; I believe there are more missing locks, but this one is the only one I'm pretty confident about.
Ok, I'm happy enough how it looks now... and no more crashes... This PR also include a couple of additional patches: - added checks for `sol` from #124 - added a lock when deleting ourself from another object's sender list; I believe there are more missing locks, but this one is the only one I'm pretty confident about.
Owner

I really like the solution proposed for the crash, very ingenious. Yet it raises a point for thoughts.

Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their sender would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB.

I think we are still missing something and until we clear ourselves on that, we should be cautious with this change. What do you think?

I really like the solution proposed for the crash, very ingenious. Yet it raises a point for thoughts. Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their `sender` would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB. I think we are still missing something and until we clear ourselves on that, we should be cautious with this change. What do you think?
Poster
Collaborator

Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their sender would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB.

Then they are screwed, and it's definitely their fault... If you call sender() you are responsible for making sure it wasn't deleted before... luckily, this situation would have caused a crash even with the current code.

To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem...

> Say we have a signal connected to three receivers. The first receiver when invoked, deletes the sender object. Then when invoking the other two receivers, their sender would be pointing to an object already deleted, which is indeed wrong and could cause SEGV or UB. Then they are screwed, and it's definitely their fault... If you call `sender()` you are responsible for making sure it wasn't deleted before... luckily, this situation would have caused a crash even with the current code. To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem...
Owner

If you call sender() you are responsible for making sure it wasn't deleted before...

Indeed. And there is even one more complication: sometimes the sender and the receiver are the same object. So if the first receiver delete the sender or itself, then the second receiver will call a slot on a non existing receiver object. Even worse!

To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem...

That is another good point/problem.

The question is now what to do with this PR. Options:

  1. do nothing, leave code as is, i.e. keep the duplicated 60 lines
  2. merge PR as is. I don't expect the thin wrapper to cause any performance degradation. Anyhow I would suggest we test the modified code for a few days on our own system before officially merge, to make sure there are no other side effects
  3. try to understand the problem more (i.e. postpone merging) and see if we can find other ways.

I am open to all three options. @Fat-Zer @SlavekB what is your preference?

> If you call sender() you are responsible for making sure it wasn't deleted before... Indeed. And there is even one more complication: sometimes the sender and the receiver are the same object. So if the first receiver delete the sender or itself, then the second receiver will call a slot on a non existing receiver object. Even worse! > To be frankly I'm more concerned what if a slot decides to change sender's connections e.g. disconnect from the call it just received... this might actually be a problem... That is another good point/problem. The question is now what to do with this PR. Options: 1. do nothing, leave code as is, i.e. keep the duplicated 60 lines 2. merge PR as is. I don't expect the thin wrapper to cause any performance degradation. Anyhow I would suggest we test the modified code for a few days on our own system before officially merge, to make sure there are no other side effects 3. try to understand the problem more (i.e. postpone merging) and see if we can find other ways. I am open to all three options. @Fat-Zer @SlavekB what is your preference?
Owner

I did not look at it in detail, but the change seems to make this block of code clearer. At the same time, Alex measured some improvement in speed. This seems like a win-win result.

I assume that we can undergo a change in testing for a few days and further deeper examination of the problem than it will be approved and merged.

I did not look at it in detail, but the change seems to make this block of code clearer. At the same time, Alex measured some improvement in speed. This seems like a win-win result. I assume that we can undergo a change in testing for a few days and further deeper examination of the problem than it will be approved and merged.
Owner

Just built tqt3 with this PR and PR #109, installed it (no reboot yet) and run

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtqt-mt.so.3.5.0 konqueror

Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-)

backtrace
#6  TQObject::activate_signal (this=0x558a2bd1af00, clist=<optimized out>, o=o@entry=0x7fff6599fdb0) at kernel/qobject.cpp:2814
#7  0x00007fe8dcecec60 in TQObject::activate_signal (this=this@entry=0x558a2bd1af00, signal=<optimized out>) at kernel/qobject.cpp:2777
#8  0x00007fe8dbbac8fc in TDEIO::SlaveInterface::finished (this=this@entry=0x558a2bd1af00) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slaveinterface.moc:244
#9  0x00007fe8dbbae768 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00, _cmd=104, rawdata=...) at ./tdeio/tdeio/slaveinterface.cpp:243
#10 0x00007fe8dbbabe61 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00) at ./tdeio/tdeio/slaveinterface.cpp:173
#11 0x00007fe8dbba9fc1 in TDEIO::Slave::gotInput (this=0x558a2bd1af00) at ./tdeio/tdeio/slave.cpp:300
#12 0x00007fe8dbbabd40 in TDEIO::Slave::tqt_invoke (this=0x558a2bd1af00, _id=4, _o=0x7fff659a0090) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slave.moc:124
#13 0x00007fe8dcece964 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, clist=clist@entry=0x558a2b96d640, o=o@entry=0x7fff659a0090) at kernel/qobject.cpp:2854
#14 0x00007fe8dceced76 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, signal=<optimized out>, param=<optimized out>) at kernel/qobject.cpp:2952
Just built tqt3 with this PR and PR #109, installed it (no reboot yet) and run ``` LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtqt-mt.so.3.5.0 konqueror ``` Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-) <details> <summary>backtrace</summary> ``` #6 TQObject::activate_signal (this=0x558a2bd1af00, clist=<optimized out>, o=o@entry=0x7fff6599fdb0) at kernel/qobject.cpp:2814 #7 0x00007fe8dcecec60 in TQObject::activate_signal (this=this@entry=0x558a2bd1af00, signal=<optimized out>) at kernel/qobject.cpp:2777 #8 0x00007fe8dbbac8fc in TDEIO::SlaveInterface::finished (this=this@entry=0x558a2bd1af00) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slaveinterface.moc:244 #9 0x00007fe8dbbae768 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00, _cmd=104, rawdata=...) at ./tdeio/tdeio/slaveinterface.cpp:243 #10 0x00007fe8dbbabe61 in TDEIO::SlaveInterface::dispatch (this=0x558a2bd1af00) at ./tdeio/tdeio/slaveinterface.cpp:173 #11 0x00007fe8dbba9fc1 in TDEIO::Slave::gotInput (this=0x558a2bd1af00) at ./tdeio/tdeio/slave.cpp:300 #12 0x00007fe8dbbabd40 in TDEIO::Slave::tqt_invoke (this=0x558a2bd1af00, _id=4, _o=0x7fff659a0090) at ./obj-x86_64-linux-gnu/tdeio/tdeio/slave.moc:124 #13 0x00007fe8dcece964 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, clist=clist@entry=0x558a2b96d640, o=o@entry=0x7fff659a0090) at kernel/qobject.cpp:2854 #14 0x00007fe8dceced76 in TQObject::activate_signal (this=this@entry=0x558a2bcd0830, signal=<optimized out>, param=<optimized out>) at kernel/qobject.cpp:2952 ``` </details>
Fat-Zer changed title from Deduplicate code in TQObject::activate_signal() to WIP: Deduplicate code in TQObject::activate_signal() 2 months ago
Poster
Collaborator

Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-)

damnit... looks like it's exactly what I was dreaded for: it changes connections in the slot...

I have another idea to workaround this too: move the TQShared part upon the TQConnectionList itself and monitor it changes.

It will break ABI (though in a relatively minor way). Only probably affected classes would be those who use TQConnectionList in a non-opaque manner. As I far as I can grep it's several places in tdebindings, kspy from tdevelop and somewhy kview from tdegraphics. Though as far as they don't construct destroy the objects themself it shouldn't really introduce any failures to those (but I can't guarantee that)...

It will also change TQSignalVec, but it seems nobody uses it anyway (I was thinking of moving it into implementation)...

> Result is SEGV with the following bt, which is exactly on the part of code modified by this PR. So I reckon we need more work on this :-) damnit... looks like it's exactly what I was dreaded for: it changes connections in the slot... I have another idea to workaround this too: move the TQShared part upon the TQConnectionList itself and monitor it changes. It will break ABI (though in a relatively minor way). Only probably affected classes would be those who use `TQConnectionList` in a non-opaque manner. As I far as I can grep it's several places in `tdebindings`, `kspy` from `tdevelop` and somewhy `kview` from `tdegraphics`. Though as far as they don't construct destroy the objects themself it shouldn't really introduce any failures to those (but I can't guarantee that)... It will also change `TQSignalVec`, but it seems nobody uses it anyway (I was thinking of moving it into implementation)...
Fat-Zer added 5 commits 2 months ago
3eb9061b63
Deduplicate code in TQObject::activate_signal()
ba2deea50a
TQObject::activate_signal: uniformly apply if(sol) checks
13ecc38dcd
Add a lock to removing ourself from another's object's sender list
24eb408f6a
qobject.cpp: fix compilation with -DTQT_NO_USERDATA
6250349b0b
Make TQConnectionList shared
Fat-Zer changed title from WIP: Deduplicate code in TQObject::activate_signal() to Deduplicate code in TQObject::activate_signal() 2 months ago
Poster
Collaborator

Ok, this time it seems to be fine... at least the system boots fine and there are no obvious crashes or bugs with library compiled from the PR...

Ok, this time it seems to be fine... at least the system boots fine and there are no obvious crashes or bugs with library compiled from the PR...
Owner

I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean. And if we are doing all this effort to spare 60 lines of codes, I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances.

I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean. And if we are doing all this effort to spare 60 lines of codes, I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances.
Poster
Collaborator

I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean.

I haven't measured it, but it shouldn't affect general performance. The deep copying happens only with quite specific circumstances: connection modifications from within a slot. So it shouldn't be that often; as the signal lists are generally small (1-2 connections), the copying shouldn't take that long. Any longer lists should have caused crashes with the current implementation.

It's hard to put all that into a normal performance test due to the process being stochastic. I can imaging only collecting some tallies to see how often it actually happens on a real system, if you wish (but frankly I don't really want to do it)...

And if we are doing all this effort to spare 60 lines of codes

Not only that, the fact that we couldn't simply replace this code is an indicator that the underlying code is broken.

I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances.

Yeah, I see what you mean... To be honest, I myself already wondering if I'm just a hostage of some sunk cost fallacy here, due to how much time and mental efforts I've already spent on this one... But I still believe it's still yet justified...

> I have not yet done any test nor reviewed the code in detail, but any effect on execution time? Deep copying connection lists on connections does give me a bit of goose bumps, if you get what I mean. I haven't measured it, but it shouldn't affect general performance. The deep copying happens only with quite specific circumstances: connection modifications from within a slot. So it shouldn't be that often; as the signal lists are generally small (1-2 connections), the copying shouldn't take that long. Any longer lists should have caused crashes with the current implementation. It's hard to put all that into a normal performance test due to the process being stochastic. I can imaging only collecting some tallies to see how often it actually happens on a real system, if you wish (but frankly I don't really want to do it)... > And if we are doing all this effort to spare 60 lines of codes Not only that, the fact that we couldn't simply replace this code is an indicator that the underlying code is broken. > I start to wonder whether we'd be better leaving things as they are. Don't mean to sound negative, just trying to be objective on benefits vs. risks vs performances. Yeah, I see what you mean... To be honest, I myself already wondering if I'm just a hostage of some sunk cost fallacy here, due to how much time and mental efforts I've already spent on this one... But I still believe it's still yet justified...
Owner

Will review the code this week and let's discuss from there. It is true the original code was buggy anyway, so perhaps all the effort made is for a good reason :-)

Will review the code this week and let's discuss from there. It is true the original code was buggy anyway, so perhaps all the effort made is for a good reason :-)
Owner

@Fat-Zer @SlavekB I propose we leave this PR for after R14.1.2. We are too close to freezeing the release and if we get something wrong, we could introduce bugs without enough time to test properly (see PR #127, issue #142). What do you think?

@Fat-Zer @SlavekB I propose we leave this PR for after R14.1.2. We are too close to freezeing the release and if we get something wrong, we could introduce bugs without enough time to test properly (see PR #127, issue #142). What do you think?
Owner

Yes, that's right, now we have already come to the stage where we need to calm down and focus on stabilization before release. There would be not enough time to test other major changes. In any case, it is great how many fixes and optimization managed to do in this development window. Thank you all participated for this effort!

Yes, that's right, now we have already come to the stage where we need to calm down and focus on stabilization before release. There would be not enough time to test other major changes. In any case, it is great how many fixes and optimization managed to do in this development window. Thank you all participated for this effort!
Poster
Collaborator

I'm ok with that...

I'm ok with that...
MicheleC added this to the R14.1.x milestone 1 month ago
MicheleC modified the milestone from R14.1.x to R14.1.3 release 3 weeks ago
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
Sign in to join this conversation.
No reviewers
No Milestone
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

No dependencies set.

Reference: TDE/tqt3#123
Loading…
There is no content yet.