[Just a small thing that bugging me] A code duplication in TQObject::activate_signal() #121

Open
opened 2 months ago by Fat-Zer · 6 comments
Collaborator

This if causes exact duplication of ~60 lines of code, because it claims it saves one iterator.

Can we just assume that's solely because in early 2000-s trolltech was fining people's salaries for using iterators and just de-duplicate it? Or should we benchmark this?

Or am I missing something? Were iterators really expensive back then? are they now?

[This `if`][1] causes exact duplication of ~60 lines of code, because it claims it saves one iterator. Can we just assume that's solely because in early 2000-s trolltech was fining people's salaries for using iterators and just de-duplicate it? Or should we benchmark this? Or am I missing something? Were iterators really expensive back then? are they now? [1]: https://mirror.git.trinitydesktop.org/gitea/TDE/tqt3/src/commit/bd8bd100a41d205592a3c4d41c52617bfdeb602d/src/kernel/qobject.cpp#L2771
Owner

I reckon they decided to differentiate because most of the times one signal is connected to exactly one slot, so they were trying to save as much time as possible. Perhaps it was a remnant from Qt1/2 when computers were not as powerfull.
I think modern computers are fast enough and it would be good to simplify the code IMO

I reckon they decided to differentiate because most of the times one signal is connected to exactly one slot, so they were trying to save as much time as possible. Perhaps it was a remnant from Qt1/2 when computers were not as powerfull. I think modern computers are fast enough and it would be good to simplify the code IMO
Poster
Collaborator

I've actually already thrown together a quick benchmark yesterday just in case:

// Build: tqmoc bench.cpp -o bench.moc && g++ -std=c++20 -O2 bench.cpp $(pkg-config --cflags --libs tqt) -o ./bench
// Run: LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench

#include <iostream>
#include <vector>
#include <format>
#include <chrono>

#include <time.h>

#include <tqobject.h>

template<clockid_t clk_id>
class gettime_clock{
public:
    typedef std::chrono::nanoseconds   duration;
    typedef duration::rep              rep;
    typedef duration::period           period;
    typedef std::chrono::time_point<gettime_clock<clk_id>>    time_point;

    static time_point now( ) {
        struct timespec ts;
        clock_gettime(clk_id, &ts);
        return time_point(duration((rep)(ts.tv_sec)*1000'000'000l+ts.tv_nsec));
    }
};

class Sender : public TQObject {
    TQ_OBJECT
public:
    void doSend() {
        emit send();
    }
signals:
    void send();   
};

class Reciever : public TQObject {
    TQ_OBJECT
public slots:
    void recieve() {}
};


int main( int argc, char **argv ) {
    Sender s;
    Reciever r;
    TQObject::connect(&s, TQ_SIGNAL(send()), &r, TQ_SLOT(recieve()));

    using cpu_clock=gettime_clock<CLOCK_PROCESS_CPUTIME_ID>;

    cpu_clock::duration min(cpu_clock::duration::max()), max(0), overall(0);
    int iters = 10;

    for(int i=0;i<iters; i++){
        auto start_time = cpu_clock::now();

        for(int j=0; j<100'000'000; j++) {
            s.doSend();
        }

        auto end_time = cpu_clock::now();
        auto elapsed = end_time - start_time;
        min = std::min(min, elapsed);
        max = std::max(max, elapsed);
        overall += elapsed;
    }

    using fmsec=std::chrono::duration<float>;

    std::cout << std::format("{:.3f} {:.3f} {:.3f}", fmsec(min).count(), (fmsec(overall)/iters).count(), fmsec(max).count())
              << std::endl;
}

#include"bench.moc"

For five round of run, Results without the patch:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.185 6.189 6.199
6.203 6.208 6.213
6.187 6.189 6.192
6.199 6.203 6.208
6.203 6.207 6.214 
    

And with the patch:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.853 6.863 6.872
6.853 6.856 6.858
6.867 6.870 6.875
6.852 6.856 6.878
6.867 6.870 6.876

If I didn't mess up my measurements, it seems I owe an apology to somebody in trolltech: this actually saves ~10% on a signal handling costs.

I still think it'd be nice to de-duplicate, but now I'd me more cautious about it.

I've actually already thrown together a quick benchmark yesterday just in case: <details> ```cpp // Build: tqmoc bench.cpp -o bench.moc && g++ -std=c++20 -O2 bench.cpp $(pkg-config --cflags --libs tqt) -o ./bench // Run: LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench #include <iostream> #include <vector> #include <format> #include <chrono> #include <time.h> #include <tqobject.h> template<clockid_t clk_id> class gettime_clock{ public: typedef std::chrono::nanoseconds duration; typedef duration::rep rep; typedef duration::period period; typedef std::chrono::time_point<gettime_clock<clk_id>> time_point; static time_point now( ) { struct timespec ts; clock_gettime(clk_id, &ts); return time_point(duration((rep)(ts.tv_sec)*1000'000'000l+ts.tv_nsec)); } }; class Sender : public TQObject { TQ_OBJECT public: void doSend() { emit send(); } signals: void send(); }; class Reciever : public TQObject { TQ_OBJECT public slots: void recieve() {} }; int main( int argc, char **argv ) { Sender s; Reciever r; TQObject::connect(&s, TQ_SIGNAL(send()), &r, TQ_SLOT(recieve())); using cpu_clock=gettime_clock<CLOCK_PROCESS_CPUTIME_ID>; cpu_clock::duration min(cpu_clock::duration::max()), max(0), overall(0); int iters = 10; for(int i=0;i<iters; i++){ auto start_time = cpu_clock::now(); for(int j=0; j<100'000'000; j++) { s.doSend(); } auto end_time = cpu_clock::now(); auto elapsed = end_time - start_time; min = std::min(min, elapsed); max = std::max(max, elapsed); overall += elapsed; } using fmsec=std::chrono::duration<float>; std::cout << std::format("{:.3f} {:.3f} {:.3f}", fmsec(min).count(), (fmsec(overall)/iters).count(), fmsec(max).count()) << std::endl; } #include"bench.moc" ``` </details> For five round of run, Results without the patch: ``` $ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done 6.185 6.189 6.199 6.203 6.208 6.213 6.187 6.189 6.192 6.199 6.203 6.208 6.203 6.207 6.214 ``` And with [the patch](https://mirror.git.trinitydesktop.org/gitea/TDE/tqt3/commit/75efd1af6b8d1e3acaa338db848336090c35893b): ``` $ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done 6.853 6.863 6.872 6.853 6.856 6.858 6.867 6.870 6.875 6.852 6.856 6.878 6.867 6.870 6.876 ``` If I didn't mess up my measurements, it seems I owe an apology to somebody in trolltech: this actually saves ~10% on a signal handling costs. I still think it'd be nice to de-duplicate, but now I'd me more cautious about it.
Collaborator

I haven't looked at the original code, but maybe we could use a macro to achieve the de-duplication? That way we can reuse code and avoid runtime costs?

I haven't looked at the original code, but maybe we could use a macro to achieve the de-duplication? That way we can reuse code and avoid runtime costs?
Poster
Collaborator

Actually, range-based for does some magic:

@@ -2788,11 +2788,8 @@ void TQObject::activate_signal( TQConnectionList *clist, TQUObject *o )
     TQObject* oldSender = 0;
-    TQConnection *c;
     TQConnection *cd = 0;
-    TQConnectionListIt it(*clist);
-    while ( (c=it.current()) ) {
-       ++it;
+    for(TQConnection *c: *clist) {
        if ( c == cd )

And code runs even (marginally) faster than before:

$ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done
6.089 6.092 6.094
6.083 6.086 6.101
6.092 6.094 6.096
6.082 6.083 6.084
6.082 6.084 6.085

Thank c++ gods for c++11.

Actually, range-based `for` does some magic: ```diff @@ -2788,11 +2788,8 @@ void TQObject::activate_signal( TQConnectionList *clist, TQUObject *o ) TQObject* oldSender = 0; - TQConnection *c; TQConnection *cd = 0; - TQConnectionListIt it(*clist); - while ( (c=it.current()) ) { - ++it; + for(TQConnection *c: *clist) { if ( c == cd ) ``` And code runs even (marginally) faster than before: ``` $ for i in $(seq 1 5); do LD_PRELOAD=/tmp/tqt3/lib/libtqt-mt.so ./bench; done 6.089 6.092 6.094 6.083 6.086 6.101 6.092 6.094 6.096 6.082 6.083 6.084 6.082 6.084 6.085 ``` Thank c++ gods for c++11.
Owner

Thank c++ gods for c++11.

I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com

Anyway, even with the original tqt3 iterator, 700ms over 1 billion signal executions (without any real code in the receiver) would not have been a huge delay.

Looks like we can simplify the code with a bit of modern c++ :-)

> Thank c++ gods for c++11. I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com Anyway, even with the original tqt3 iterator, 700ms over 1 billion signal executions (without any real code in the receiver) would not have been a huge delay. Looks like we can simplify the code with a bit of modern c++ :-)
Poster
Collaborator

I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com

The range-based for uses (almost) the same iterators under the hood. The actual difference is:

  • it technically uses TQPtrList<TQConnection>::Iterator instead of TQConnectionListIt, but the last one is solely a thin wrapper that should be optimized out.
  • it uses a different condition: ci != clist.end() instead of ci.current().

I haven't tested it, but I suspect that because of this the compiler should be able to produce a more branch-prediction-friendly code.

> I think it is more likely that the original tqt3 iterator was subpar compared to modern iterator implementation. Or the extra time coud com The range-based `for` uses (almost) the same iterators under the hood. The actual difference is: - it technically uses `TQPtrList<TQConnection>::Iterator` instead of `TQConnectionListIt`, but the last one is solely a thin wrapper that should be optimized out. - it uses a different condition: `ci != clist.end()` instead of `ci.current()`. I haven't tested it, but I suspect that because of this the compiler should be able to produce a more branch-prediction-friendly code.
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

No dependencies set.

Reference: TDE/tqt3#121
Loading…
There is no content yet.