Optimisation causes negative value to turn "for" loop infinite #93

Manually merged
MicheleC merged 1 commits from bug/prevent-menu-svg-icon-loop into master 4 years ago
Collaborator

The declaration of "n_segs" in calculateArc is now declared as "volatile". Something in the GCC optimisation can cause the method to set a negative value in the variable causing the "for" loop to become infinite and hang the menu panel.

Signed-off-by: aneejit1 aneejit1@gmail.com

The declaration of "n_segs" in calculateArc is now declared as "volatile". Something in the GCC optimisation can cause the method to set a negative value in the variable causing the "for" loop to become infinite and hang the menu panel. Signed-off-by: aneejit1 <aneejit1@gmail.com>
Poster
Collaborator

There is a long-standing potential loop in ksvgiconpainter.cpp. I've seen this running 14.0.6 on CentOS 7, so it's not just my custom setup that has seen this problem.

Certain SVG icons cannot be displayed on the menu panel. When the pointer is run over the main panel and a popup list should appear to the right, kicker just goes into a loop. The only way out is to kill it, restart it and try to avoid touching that option on the menu again.

Since I've been starting TDE from an xterm, I got the following message repeated: "TQCArray::at: Absolute index [n] out of range". In order to get gdb to provide some useful information, I rebuilt tdelibs with debugging information on at which point the issue stopped. The same problem occurs in KDE 3.5.10, so I switched to that and managed to locate the problem.

It in method "calculateArc", the value of n_segs is being calculated to a negative number which causes the for loop there to go crazy. This appears not to be a mathematical issue but rather how GCC is optimising the code (I suspect that turning debugging information on also turned optimisation off which was why I could not do this with tdelibs). Having experimented with turning optimisation for the entire method off, testing for n_segs being negative and not running the loop, and various other options, it appears that simply defining n_segs as "volatile" resolves the issue.

I've commented the new declaration just in case someone wonders "what's that 'volatile' doing there?" and removes it!

There is a long-standing potential loop in ksvgiconpainter.cpp. I've seen this running 14.0.6 on CentOS 7, so it's not just my custom setup that has seen this problem. Certain SVG icons cannot be displayed on the menu panel. When the pointer is run over the main panel and a popup list should appear to the right, kicker just goes into a loop. The only way out is to kill it, restart it and try to avoid touching that option on the menu again. Since I've been starting TDE from an xterm, I got the following message repeated: "TQCArray::at: Absolute index [n] out of range". In order to get gdb to provide some useful information, I rebuilt tdelibs with debugging information on at which point the issue stopped. The same problem occurs in KDE 3.5.10, so I switched to that and managed to locate the problem. It in method "calculateArc", the value of n_segs is being calculated to a negative number which causes the for loop there to go crazy. This appears not to be a mathematical issue but rather how GCC is optimising the code (I suspect that turning debugging information on also turned optimisation off which was why I could not do this with tdelibs). Having experimented with turning optimisation for the entire method off, testing for n_segs being negative and not running the loop, and various other options, it appears that simply defining n_segs as "volatile" resolves the issue. I've commented the new declaration just in case someone wonders "what's that 'volatile' doing there?" and removes it!
Owner

Hi @aneejit1, I am sure you have done your studies on this problem, but I can't see how this could be the correct solution.

Variable values are guarantee to be correctly assigned and GCC optimization can not make a variable value wrong (unless of GCC bugs of course). If n_segs becomes negative there must be a reason (probably a bug somewhere) and we should try to find out why.

I don't consider using volatile the correct solution, it may just be hiding the real problem and work for some reasons, but without knowing why.

Hi @aneejit1, I am sure you have done your studies on this problem, but I can't see how this could be the correct solution.<br/> Variable values are guarantee to be correctly assigned and GCC optimization can not make a variable value wrong (unless of GCC bugs of course). If n_segs becomes negative there must be a reason (probably a bug somewhere) and we should try to find out why.<br/> I don't consider using volatile the correct solution, it may just be hiding the real problem and work for some reasons, but without knowing why.
Owner

I discussed with Slavek on IRC and we both agree on the above comment. I am marking the PR as not ok. @aneejit1 please try doing more investigation to find the real reason why n_segs becomes negative 👍

I discussed with Slavek on IRC and we both agree on the above comment. I am marking the PR as not ok. @aneejit1 please try doing more investigation to find the real reason why n_segs becomes negative :+1:
MicheleC added the PR/not-ok label 4 years ago
Poster
Collaborator

OK, I'll dig out my assembly language book and see if I can work out what it's doing wrong. This may take a while, I've always avoided the floating point instruction set.

OK, I'll dig out my assembly language book and see if I can work out what it's doing wrong. This may take a while, I've always avoided the floating point instruction set.
Poster
Collaborator

I'm starting to think setting "CMAKE_BUILD_TYPE" to "RelWithDebInfo" and walking away from this would have been a better option!

I like IBM's mainframe compilers. You set the "LIST" option on and you get a nicely formatted listing showing the program statements along with the assembly code that the compiler has generated. Not so with GCC. Set the "-S" option and what comes out is a pile of assembly language with no indication as to where it comes from or what it relates to aside from a few obvious library calls.

So, I've compiled ksvgiconpainter.cpp with '-S' set and the 'volatile' removed to produce a baseline compile and done some comparisons on the code generated in various conditions. I've also tried a few other things. Here's what I've found so far:

  • putting 'printf("%d\n",n_segs)' just before the 'for' loop does nothing; it generates the exact same code as the baseline, so the statement is ignored.
  • putting 'printf("%d\n",n_segs)' inside the 'for' loop causes the looping code to continuously print -2147483648.
  • adding '&& n_segs > 0' to the condition in the 'for' statement does nothing; the same code as the baseline is generated so the condition is ignored.
  • adding 'if(n_segs > 0)' before the 'for' statement generates different code, but the 'for' loop is still becomes infinite.
  • adding 'if(n_segs < 0) break;' into the start of the 'for' loop has no effect.
  • adding 'if(n_segs < 0) break;' into the 'for' loop at the end has no effect.
  • if I convert n_segs to a 'double' and remove the '(int) (int)' from the assignment just before the 'for', the loop does as it should and the menu panel does not hang.
  • if I convert n_segs to a 'double' and leave the '(int) (int)' in, the infinite loop happens and the menu panel hangs.
  • if I leave n_segs as an 'int' and remove the '(int) (int)', it hangs again.
  • with the original source, compiling with '-O2' or '-O3' gives object code that falls into the infinite loop whereas '-O0' and '-O1' do not.
  • if I compile with '-O3' using clang++ 10, there's no hang.
  • with '-O0', most of the values of th_arc and n_segs have values and only two sets of values from the failing menu are set to -NaN and -2147483648 respectively; the loop is avoided, so contrary to the title, it's not that the value of n_segs is negative that is the cause of the issue but the fact that the condition in the 'for' loop is being ignored.
  • printing the value of th_arc, n_segs, i and the evaluation of the expression 'i < n_segs' inside the loop with '-O3' gives -NaN, -2147483648, incrementing values of i, and '1' which indicates that the evaluation of the condition in the 'for' statement would produce a valid result if it were not being ignored.
  • defining i as 'volatile' also prevents the infinite loop from happening.
  • commenting out the content of the 'for' loop has no effect on the infinite loop.
  • commenting out everything before the loop as well and setting n_segs to -2147483648 results in no loop.
  • explicitly setting n_segs to -2147483648 after it has been calculated avoids the loop.
  • initialising n_segs to zero at the start of the method still gives the infinite loop (well, it was worth a try!).
  • doing some arbitrary maths on n_segs after it is set does not prevent the infinite loop.

This looks to me like the optimisation produced by '-O2' and '-O3' in GCC is unable or unwilling to corrently handle test conditions involving n_segs. It seems to be triggered by the assignment of n_segs, the only ways of avoiding this being to either declare n_segs as 'volatile' thus forcing the variable's value to be reserved in and written to storage, or as 'double'. The negativity of the value of n_segs, contrary to what I first thought, appears irrelevent.

I would say that looking at how 'i' and 'n_segs' are used within the loop, it would probably make more sense if they were defined as 'double' rather than 'int' as they are both only used as part of calculations involving floating arithmetic and it would prevent some unnecessary implicit cast conversions being required.

Suggestions?

I'm starting to think setting "CMAKE_BUILD_TYPE" to "RelWithDebInfo" and walking away from this would have been a better option! I like IBM's mainframe compilers. You set the "LIST" option on and you get a nicely formatted listing showing the program statements along with the assembly code that the compiler has generated. Not so with GCC. Set the "-S" option and what comes out is a pile of assembly language with no indication as to where it comes from or what it relates to aside from a few obvious library calls. So, I've compiled ksvgiconpainter.cpp with '-S' set and the 'volatile' removed to produce a baseline compile and done some comparisons on the code generated in various conditions. I've also tried a few other things. Here's what I've found so far: - putting 'printf("%d\n",n_segs)' just before the 'for' loop does nothing; it generates the exact same code as the baseline, so the statement is ignored. - putting 'printf("%d\n",n_segs)' inside the 'for' loop causes the looping code to continuously print -2147483648. - adding '&& n_segs > 0' to the condition in the 'for' statement does nothing; the same code as the baseline is generated so the condition is ignored. - adding 'if(n_segs > 0)' before the 'for' statement generates different code, but the 'for' loop is still becomes infinite. - adding 'if(n_segs < 0) break;' into the start of the 'for' loop has no effect. - adding 'if(n_segs < 0) break;' into the 'for' loop at the end has no effect. - if I convert n_segs to a 'double' and remove the '(int) (int)' from the assignment just before the 'for', the loop does as it should and the menu panel does not hang. - if I convert n_segs to a 'double' and leave the '(int) (int)' in, the infinite loop happens and the menu panel hangs. - if I leave n_segs as an 'int' and remove the '(int) (int)', it hangs again. - with the original source, compiling with '-O2' or '-O3' gives object code that falls into the infinite loop whereas '-O0' and '-O1' do not. - if I compile with '-O3' using clang++ 10, there's no hang. - with '-O0', most of the values of th_arc and n_segs have values and only two sets of values from the failing menu are set to -NaN and -2147483648 respectively; the loop is avoided, so contrary to the title, it's not that the value of n_segs is negative that is the cause of the issue but the fact that the condition in the 'for' loop is being ignored. - printing the value of th_arc, n_segs, i and the evaluation of the expression 'i < n_segs' inside the loop with '-O3' gives -NaN, -2147483648, incrementing values of i, and '1' which indicates that the evaluation of the condition in the 'for' statement would produce a valid result if it were not being ignored. - defining i as 'volatile' also prevents the infinite loop from happening. - commenting out the content of the 'for' loop has no effect on the infinite loop. - commenting out everything before the loop as well and setting n_segs to -2147483648 results in no loop. - explicitly setting n_segs to -2147483648 after it has been calculated avoids the loop. - initialising n_segs to zero at the start of the method still gives the infinite loop (well, it was worth a try!). - doing some arbitrary maths on n_segs after it is set does not prevent the infinite loop. This looks to me like the optimisation produced by '-O2' and '-O3' in GCC is unable or unwilling to corrently handle test conditions involving n_segs. It seems to be triggered by the assignment of n_segs, the only ways of avoiding this being to either declare n_segs as 'volatile' thus forcing the variable's value to be reserved in and written to storage, or as 'double'. The negativity of the value of n_segs, contrary to what I first thought, appears irrelevent. I would say that looking at how 'i' and 'n_segs' are used within the loop, it would probably make more sense if they were defined as 'double' rather than 'int' as they are both only used as part of calculations involving floating arithmetic and it would prevent some unnecessary implicit cast conversions being required. Suggestions?
Owner

Great analysis @aneejit1. I will try to have a look at this as well on Monday, I got interested now. Is there a way to reproduce this systematically? Any specific svg icon I should set on the menu?

One quick thing I notices is that -2147483648 is 0x80000000, which does not seem a totally random number to me. Either n_segs is overflowing above the max positive value or somewhere this value gets set in the code.

Again, I personally would be extremely surprised if the problem is just down to compiler optimization.

Great analysis @aneejit1. I will try to have a look at this as well on Monday, I got interested now. Is there a way to reproduce this systematically? Any specific svg icon I should set on the menu?<br/> One quick thing I notices is that -2147483648 is 0x80000000, which does not seem a totally random number to me. Either n_segs is overflowing above the max positive value or somewhere this value gets set in the code.<br/> Again, I personally would be extremely surprised if the problem is just down to compiler optimization.
Poster
Collaborator

OK, I think I understand why the negative value is there and it seems to be a perfectly valid situation to have, albeit one which is not handled in a simple to decypher manner without additional documentation. I believe it is one of the conditions referred to in the SVG specifications as an "Out-of-Range Parameter", specifically, the condition where the start and end points are the same:

https://www.w3.org/TR/SVG11/implnote.html#ArcImplementationNotes

I've compiled with a "printf" to show the values of (x0,y0) and (x1,y1) along with the calculated values of th_arc and n_segs. Where the two sets of co-ordinated are identical, a -NaN value is calculated for th_arc resulting in the value of -2147483648 being stored in n_segs. The generated code should, by virtue of the 'i < n_segs' test in the 'for' statement, avoid the loop entirely under these circumstances.

I would suggest that the reason for the start and end points being the same is down to the size of the icon; the arc that needs to be drawn has been scaled down to the point where there is insufficient resolution in the output co-ordinate system for the arc to be discernable.

OK, I think I understand why the negative value is there and it seems to be a perfectly valid situation to have, albeit one which is not handled in a simple to decypher manner without additional documentation. I believe it is one of the conditions referred to in the SVG specifications as an "Out-of-Range Parameter", specifically, the condition where the start and end points are the same: https://www.w3.org/TR/SVG11/implnote.html#ArcImplementationNotes I've compiled with a "printf" to show the values of (x0,y0) and (x1,y1) along with the calculated values of th_arc and n_segs. Where the two sets of co-ordinated are identical, a -NaN value is calculated for th_arc resulting in the value of -2147483648 being stored in n_segs. The generated code should, by virtue of the 'i < n_segs' test in the 'for' statement, avoid the loop entirely under these circumstances. I would suggest that the reason for the start and end points being the same is down to the size of the icon; the arc that needs to be drawn has been scaled down to the point where there is insufficient resolution in the output co-ordinate system for the arc to be discernable.
Poster
Collaborator

The icon I was having problems with is 'org.gnome.Music.svg' which is loaded from 'org.gnome.Music.desktop'. It comes from the gnome-music package, version 3.36.2.

I've done a mod to print the icon file names, run up the menu editor and expanded the tree; the only other one that appears to have the same problem is 'org.gnome.Epiphany.svg' which comes from epiphany 3.36.1. The one that I had on the CentOS 7 system was on the 'Wine' menu, but I don't know which specific icon file on there is causing the problem. It seems to be a rare situation but one that's very inconvenient if you happen to hit it!

As for optimisation issues, they can happen. Having turned the "LIST" option on IBM's COBOL compiler on with the optimisation enabled has had me scratching my head at times as to why it has done a certain thing, but what it has done seems to work OK. It's kind of a complicated area which is why I'm willing to point the finger in that direction when the evidence points that way.

I did pick up on the significance of the value of n_segs. I'm currently running with both i and n_segs defined as 'double' and the printfs I have in place are showing n_segs as being NaN. It looks like casting to 'int' is what gives it that specific value as an int obviously has no equivalent of NaN.

The icon I was having problems with is 'org.gnome.Music.svg' which is loaded from 'org.gnome.Music.desktop'. It comes from the gnome-music package, version 3.36.2. I've done a mod to print the icon file names, run up the menu editor and expanded the tree; the only other one that appears to have the same problem is 'org.gnome.Epiphany.svg' which comes from epiphany 3.36.1. The one that I had on the CentOS 7 system was on the 'Wine' menu, but I don't know which specific icon file on there is causing the problem. It seems to be a rare situation but one that's very inconvenient if you happen to hit it! As for optimisation issues, they can happen. Having turned the "LIST" option on IBM's COBOL compiler on with the optimisation enabled has had me scratching my head at times as to why it has done a certain thing, but what it has done seems to work OK. It's kind of a complicated area which is why I'm willing to point the finger in that direction when the evidence points that way. I did pick up on the significance of the value of n_segs. I'm currently running with both i and n_segs defined as 'double' and the printfs I have in place are showing n_segs as being NaN. It looks like casting to 'int' is what gives it that specific value as an int obviously has no equivalent of NaN.
Owner

I did pick up on the significance of the value of n_segs. I'm currently running with both i and n_segs defined as 'double' and the printfs I have in place are showing n_segs as being NaN. It looks like casting to 'int' is what gives it that specific value as an int obviously has no equivalent of NaN.

Nice finding, that makes more sense to me and was actually expecting something on that line 😄

In any case running with i and n_segs as double is also not correct in my opinion since a for loop on a NaN makes not much sense (and probably will exit immediately). Let me have a look, I have a quick fix in my mind but I would like to reproduce the problem and test first if possible.

> I did pick up on the significance of the value of n_segs. I'm currently running with both i and n_segs defined as 'double' and the printfs I have in place are showing n_segs as being NaN. It looks like casting to 'int' is what gives it that specific value as an int obviously has no equivalent of NaN. Nice finding, that makes more sense to me and was actually expecting something on that line :smile: <br/> In any case running with i and n_segs as double is also not correct in my opinion since a for loop on a NaN makes not much sense (and probably will exit immediately). Let me have a look, I have a quick fix in my mind but I would like to reproduce the problem and test first if possible.
Owner

Turned out the problem was a division by zero when the original point and the translated point were the same. This was causing NaN and an infinite looping in the while() {} inside the "KSVGIconPainter::drawPath()" function.

@aneejit1 the problem was reproduced here using the same icon and the fix seems to work. Could you test on your system too before I merge the PR?

Turned out the problem was a division by zero when the original point and the translated point were the same. This was causing NaN and an infinite looping in the while() {} inside the "KSVGIconPainter::drawPath()" function. @aneejit1 the problem was reproduced here using the same icon and the fix seems to work. Could you test on your system too before I merge the PR?
MicheleC added PR/rfc and removed PR/not-ok labels 4 years ago
MicheleC added this to the R14.0.9 release milestone 4 years ago
Poster
Collaborator

That's working fine.

That's working fine.
Poster
Collaborator

It just occurred to me to query why converting i/n_segs to double works. Turns out that if n_segs is NaN, 'i < n_segs', 'i <= n_segs', 'i > n_segs' and 'i >= n_segs' all evaluate to 'false'. The only comparisons that would give a true result would be 'i != n_segs', or 'i == n_segs' if i was NaN.

Well, that's the something new learnt today!

It just occurred to me to query why converting i/n_segs to double works. Turns out that if n_segs is NaN, 'i < n_segs', 'i <= n_segs', 'i > n_segs' and 'i >= n_segs' all evaluate to 'false'. The only comparisons that would give a true result would be 'i != n_segs', or 'i == n_segs' if i was NaN. Well, that's the something new learnt today!
Owner

Thanks for testing, I will proceed with the merge.

It just occurred to me to query why converting i/n_segs to double works. Turns out that if n_segs is NaN, ‘i < n_segs’, ‘i <= n_segs’, ‘i > n_segs’ and ‘i >= n_segs’ all evaluate to ‘false’. The only comparisons that would give a true result would be ‘i != n_segs’, or ‘i == n_segs’ if i was NaN.

Regarding this, that's a strange one. I did try (as first solution) to make sure the for() look was only executed if n_segs was positive (using int, not float) and I still ran into problems with an infinite loop on the KSVGIconPainter::drawPath()” function, which prompted me to do further investigation. But perhaps with NaN some other strange things were happening behind the scenes, I guess....

Anyhow, we found the problem and fix it, that is what is important 😄

Thanks for testing, I will proceed with the merge. >It just occurred to me to query why converting i/n_segs to double works. Turns out that if n_segs is NaN, ‘i < n_segs’, ‘i <= n_segs’, ‘i > n_segs’ and ‘i >= n_segs’ all evaluate to ‘false’. The only comparisons that would give a true result would be ‘i != n_segs’, or ‘i == n_segs’ if i was NaN. Regarding this, that's a strange one. I did try (as first solution) to make sure the for() look was only executed if n_segs was positive (using int, not float) and I still ran into problems with an infinite loop on the KSVGIconPainter::drawPath()” function, which prompted me to do further investigation. But perhaps with NaN some other strange things were happening behind the scenes, I guess.... Anyhow, we found the problem and fix it, that is what is important :smile:
MicheleC closed this pull request 4 years ago
MicheleC removed the PR/rfc label 4 years ago
MicheleC deleted branch bug/prevent-menu-svg-icon-loop 4 years ago
Owner

Congratulations on your good research and successful detection and resolution of the problem!

Congratulations on your good research and successful detection and resolution of the problem!
The pull request has been manually merged as ffe8e495d7.
Sign in to join this conversation.
No reviewers
No Milestone
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

No dependencies set.

Reference: TDE/tdelibs#93
Loading…
There is no content yet.