[Just a small thing that bugging me] A code duplication in TQObject::activate_signal() #121

This if causes exact duplication of ~60 lines of code, because it claims it saves one iterator.

Can we just assume that's solely because in early 2000-s trolltech was fining people's salaries for using iterators and just de-duplicate it? Or should we benchmark this?

Or am I missing something? Were iterators really expensive back then? are they now?

I reckon they decided to differentiate because most of the times one signal is connected to exactly one slot, so they were trying to save as much time as possible. Perhaps it was a remnant from Qt1/2 when computers were not as powerfull.
I think modern computers are fast enough and it would be good to simplify the code IMO

I've actually already thrown together a quick benchmark yesterday just in case:

// Build: tqmoc bench.cpp -o bench.moc && g++ -std=c++20 -O2 bench.cpp $(pkg-config --cflags --libs tqt) -o ./bench
// Run: LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench

#include <iostream>
#include <vector>
#include <format>
#include <chrono>

#include <time.h>

#include <tqobject.h>

template<clockid_t clk_id>
class gettime_clock{
public:
    typedef std::chrono::nanoseconds   duration;
    typedef duration::rep              rep;
    typedef duration::period           period;
    typedef std::chrono::time_point<gettime_clock<clk_id>>    time_point;

    static time_point now( ) {
        struct timespec ts;
        clock_gettime(clk_id, &ts);
        return time_point(duration((rep)(ts.tv_sec)*1000'000'000l+ts.tv_nsec));
    }
};

class Sender : public TQObject {
    TQ_OBJECT
public:
    void doSend() {
        emit send();
    }
signals:
    void send();   
};

class Reciever : public TQObject {
    TQ_OBJECT
public slots:
    void recieve() {}
};


int main( int argc, char **argv ) {
    Sender s;
    Reciever r;
    TQObject::connect(&s, TQ_SIGNAL(send()), &r, TQ_SLOT(recieve()));

    using cpu_clock=gettime_clock<CLOCK_PROCESS_CPUTIME_ID>;

    cpu_clock::duration min(cpu_clock::duration::max()), max(0), overall(0);
    int iters = 10;

    for(int i=0;i<iters; i++){
        auto start_time = cpu_clock::now();

        for(int j=0; j<100'000'000; j++) {
            s.doSend();
        }

        auto end_time = cpu_clock::now();
        auto elapsed = end_time - start_time;
        min = std::min(min, elapsed);
        max = std::max(max, elapsed);
        overall += elapsed;
    }

    using fmsec=std::chrono::duration<float>;

    std::cout << std::format("{:.3f} {:.3f} {:.3f}", fmsec(min).count(), (fmsec(overall)/iters).count(), fmsec(max).count())
              << std::endl;
}

#include"bench.moc"

For five round of run, Results without the patch:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.185 6.189 6.199
6.203 6.208 6.213
6.187 6.189 6.192
6.199 6.203 6.208
6.203 6.207 6.214

And with the patch:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.853 6.863 6.872
6.853 6.856 6.858
6.867 6.870 6.875
6.852 6.856 6.878
6.867 6.870 6.876

If I didn't mess up my measurements, it seems I owe an apology to somebody in trolltech: this actually saves ~10% on a signal handling costs.

I still think it'd be nice to de-duplicate, but now I'd me more cautious about it.

I've actually already thrown together a quick benchmark yesterday just in case: <details> ```cpp // Build: tqmoc bench.cpp -o bench.moc && g++ -std=c++20 -O2 bench.cpp $(pkg-config --cflags --libs tqt) -o ./bench // Run: LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench #include <iostream> #include <vector> #include <format> #include <chrono> #include <time.h> #include <tqobject.h> template<clockid_t clk_id> class gettime_clock{ public: typedef std::chrono::nanoseconds duration; typedef duration::rep rep; typedef duration::period period; typedef std::chrono::time_point<gettime_clock<clk_id>> time_point; static time_point now( ) { struct timespec ts; clock_gettime(clk_id, &ts); return time_point(duration((rep)(ts.tv_sec)*1000'000'000l+ts.tv_nsec)); } }; class Sender : public TQObject { TQ_OBJECT public: void doSend() { emit send(); } signals: void send(); }; class Reciever : public TQObject { TQ_OBJECT public slots: void recieve() {} }; int main( int argc, char **argv ) { Sender s; Reciever r; TQObject::connect(&s, TQ_SIGNAL(send()), &r, TQ_SLOT(recieve())); using cpu_clock=gettime_clock<CLOCK_PROCESS_CPUTIME_ID>; cpu_clock::duration min(cpu_clock::duration::max()), max(0), overall(0); int iters = 10; for(int i=0;i<iters; i++){ auto start_time = cpu_clock::now(); for(int j=0; j<100'000'000; j++) { s.doSend(); } auto end_time = cpu_clock::now(); auto elapsed = end_time - start_time; min = std::min(min, elapsed); max = std::max(max, elapsed); overall += elapsed; } using fmsec=std::chrono::duration<float>; std::cout << std::format("{:.3f} {:.3f} {:.3f}", fmsec(min).count(), (fmsec(overall)/iters).count(), fmsec(max).count()) << std::endl; } #include"bench.moc" ``` </details> For five round of run, Results without the patch: ``` $ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done 6.185 6.189 6.199 6.203 6.208 6.213 6.187 6.189 6.192 6.199 6.203 6.208 6.203 6.207 6.214 ``` And with [the patch](https://mirror.git.trinitydesktop.org/gitea/TDE/tqt3/commit/75efd1af6b8d1e3acaa338db848336090c35893b): ``` $ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done 6.853 6.863 6.872 6.853 6.856 6.858 6.867 6.870 6.875 6.852 6.856 6.878 6.867 6.870 6.876 ``` If I didn't mess up my measurements, it seems I owe an apology to somebody in trolltech: this actually saves ~10% on a signal handling costs. I still think it'd be nice to de-duplicate, but now I'd me more cautious about it.

I haven't looked at the original code, but maybe we could use a macro to achieve the de-duplication? That way we can reuse code and avoid runtime costs?

Actually, range-based for does some magic:

@@ -2788,11 +2788,8 @@ void TQObject::activate_signal( TQConnectionList *clist, TQUObject *o )
     TQObject* oldSender = 0;
-    TQConnection *c;
     TQConnection *cd = 0;
-    TQConnectionListIt it(*clist);
-    while ( (c=it.current()) ) {
-       ++it;
+    for(TQConnection *c: *clist) {
        if ( c == cd )

And code runs even (marginally) faster than before:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.089 6.092 6.094
6.083 6.086 6.101
6.092 6.094 6.096
6.082 6.083 6.084
6.082 6.084 6.085

Thank c++ gods for c++11.

Thank c++ gods for c++11.

I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com

Anyway, even with the original tqt3 iterator, 700ms over 1 billion signal executions (without any real code in the receiver) would not have been a huge delay.

Looks like we can simplify the code with a bit of modern c++ :-)

I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com

The range-based for uses (almost) the same iterators under the hood. The actual difference is:

it technically uses TQPtrList<TQConnection>::Iterator instead of TQConnectionListIt, but the last one is solely a thin wrapper that should be optimized out.
it uses a different condition: ci != clist.end() instead of ci.current().

I haven't tested it, but I suspect that because of this the compiler should be able to produce a more branch-prediction-friendly code.