Posts by Steven Meyer

1)

Message boards : Number crunching : Bug Report - Random Reboots

( Message 5360 )
Posted 2899 days ago by Steven Meyer
Are you running the BOINC screensaver on that system, Steven? If so, maybe it's the D@H screensaver interacting with the CUDA code.

-- David


No screen saver at all, because that would cut into the CUDA throughput; the CUDA code by itself pegs the GPU at 100% usage.

In any case, I decided to just run the S@H optimized apps, for CPU and GPU, on the Q6600. The other computer is running the S@H optimized CPU app and D@H. Since it has no CUDA-capable GPU, it is running with no CUDA ... and no troubles.

2)

( Message 5338 )
Posted 2912 days ago by Steven Meyer
Garrrrg... the quoting is somewhat messed up. I have cleaned it up below (I hope) to show who said what in reply to which ...

Topolm said:
Yes, I'm aware that you need it for seti on cuda but to rule out that the problem is the nvidia driver you should disable the nvidia driver, reboot and run docking with the windows provided driver. As seti have fixed downtimes you can do that in that timeframe.

Steven said:
I know already that both the nVidia video driver and the D@H code are involved, because the problem began after I upgraded the nVidia video driver, and then it went away after I stopped running D@H. The S@H CUDA code may also be involved.

Topolm said:
Which cuda seti app are you using (the stock or from lunatics) and does it work if you process both projects with cpu and not with gpu/cpu? Your hard crash indicates that it's a driver/hardware problem.

Steven said:
I am a lunatic.
Will try turning off the GPU and turning on D@H.

Steven said:
It takes all of the components together to create the problem, because by removing just one (D@H) I have a stable system with the other two (nVidia and CUDA).
Now that we know what the components are, can anything be done to make them all co-exist?

Topolm said:
Yes, that's a lot of work. You must find out which driver/seti cuda app combination fits best, so that you can process your wu's for seti & docking without such a hard reset of your system. Maybe someone reading this can post his/her driver release; I can't, as I'm not on Windows.

Steven said:
Or do I just have to give up on running more than the one project?
FYI, here is some of the info from the nVidia control panel.

Topolm said:
Thanks, but I would start with the NVidia driver:
1.) remove it
2.) run seti & docking with their stock apps
3.) run seti with optimized cpu app & docking together
If all of the above works, then you must find a driver which works together with your seti cuda app and try out that combination; for that task you need patience. I recommend that you start with an older cuda capable driver.

Steven said:
As it took some doing to get a version of the nVidia drivers installed that would work with the optimized CUDA app, I'm not willing to go down that road again yet.

Will start with checking to see if SETI and DOCKING will co-exist without CUDA because that is just a setting in options to turn off the use of the GPU, so I will not have to download and install all sorts of drivers and CUDA apps to find an older pair that will work together.

Unfortunately, since the GPU can process about 3 WU per hour while the CPU does less than 1 per hour on each of its 4 processors, disabling the GPU will be a huge reduction in throughput.

So I would rather find another way.
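
The throughput comparison above can be put into rough numbers. This is a minimal Python sketch, assuming the per-core CPU rate is capped at 1 WU per hour (the post only says "less than 1") and that the CUDA app leaves all 4 cores free for CPU work. Both are assumptions, and the constant and function names are made up for illustration.

```python
# Rates taken from the post; the per-core figure is an upper bound (assumption).
GPU_WU_PER_HOUR = 3.0
CPU_WU_PER_CORE_HOUR = 1.0
CORES = 4  # Q6600 quad core


def throughput(use_gpu, cores=CORES):
    """Estimated work units per hour for the whole machine."""
    cpu_rate = cores * CPU_WU_PER_CORE_HOUR
    gpu_rate = GPU_WU_PER_HOUR if use_gpu else 0.0
    return cpu_rate + gpu_rate


def lost_fraction():
    """Share of total throughput given up by disabling the GPU."""
    with_gpu = throughput(use_gpu=True)
    without_gpu = throughput(use_gpu=False)
    return (with_gpu - without_gpu) / with_gpu


if __name__ == "__main__":
    print(f"with GPU:    {throughput(True):.1f} WU/hour")
    print(f"without GPU: {throughput(False):.1f} WU/hour")
    print(f"lost:        {lost_fraction():.0%}")
```

Because the real CPU rate is below 1 WU per hour per core, the actual share lost would be somewhat higher than this estimate suggests.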
3)

( Message 5335 )
Posted 2912 days ago by Steven Meyer
Yes, I'm aware that you need it for seti on cuda but to rule out that the problem is the nvidia driver you should disable the nvidia driver, reboot and run docking with the windows provided driver. As seti have fixed downtimes you can do that in that timeframe.


I know already that both the nVidia video driver and the D@H code are involved, because the problem began after I upgraded the nVidia video driver, and then it went away after I stopped running D@H. The S@H CUDA code may also be involved.

It takes all of the components together to create the problem, because by removing just one (D@H) I have a stable system with the other two (nVidia and CUDA).

Now that we know what the components are, can anything be done to make them all co-exist?

Or do I just have to give up on running more than the one project?

FYI, here is some of the info from the nVidia control panel.

CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Operating System: Microsoft Windows XP Professional Service Pack 3, Build 2600
Motherboard Vendor: NVIDIA
Motherboard Version: 2.0
Motherboard Model: NVK84CRB
DirectX: 9.0c (5.3.2600.5512)
Nforce Driver Package: 6.03
Graphics Driver: 190.38 (6.14.11.9038)
Ethernet Driver: 67.72 (1.00.02.06772)
IDE Driver: 9.99.0.8
nTune: 6.03.12
GPU: GeForce 8800 GT
4)

( Message 5332 )
Posted 2913 days ago by Steven Meyer
Does it happen too if you set in the preferences: Leave applications in memory while suspended?

I already have that option set.

Should I unset it?

Could you also try to run your system only with the windows provided drivers?

The problem there is that the S@H CUDA code requires the nVidia video drivers.
5)

( Message 5330 )
Posted 2913 days ago by Steven Meyer
Docking is the only other BOINC project you have tried... You have made the assumption that SETI is fine and Docking is the problem - if that is what you want to believe, well I doubt we will change your mind.


I could have just abandoned D@H, uninstalled all of its code, and happily gone on my way, with zero random reboots.

Instead, I am still writing to this thread.

This should suggest to you that I am interested in getting to the bottom of this and that I am not interested in just attaching blame to the easiest target.
6)

( Message 5329 )
Posted 2913 days ago by Steven Meyer
Reverting the video driver is easy, but it cannot really tell me what happened or why. If this method is successful, it will simply tell me that the video driver is involved, because the problem will go away. Also, the CUDA code from S@H will not work with the older video driver, so that part of the situation will be changed as well as the video driver. I will be testing not just 1 change, but 2.
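
The point about testing one change at a time can be made concrete. This is a rough Python sketch, with hypothetical factor names, that lists which factors each proposed test changes relative to the setup that reboots:

```python
# Factor names are hypothetical labels for the three components in this thread.
FACTORS = ("new_nvidia_driver", "seti_cuda_app", "docking_at_home")


def changed_factors(before, after):
    """Return the factors whose on/off state differs between two setups."""
    return [name for name in FACTORS if before[name] != after[name]]


# The setup that shows the random reboots: all three components active.
failing = dict(new_nvidia_driver=True, seti_cuda_app=True, docking_at_home=True)

# Rolling back the driver also disables the CUDA app, which needs the new driver.
driver_rollback = dict(new_nvidia_driver=False, seti_cuda_app=False, docking_at_home=True)

# Suspending D@H or switching off GPU use in the options each touch one factor.
suspend_docking = dict(new_nvidia_driver=True, seti_cuda_app=True, docking_at_home=False)
disable_gpu = dict(new_nvidia_driver=True, seti_cuda_app=False, docking_at_home=True)

if __name__ == "__main__":
    for label, setup in [("driver rollback", driver_rollback),
                         ("suspend D@H", suspend_docking),
                         ("disable GPU", disable_gpu)]:
        print(label, "changes", changed_factors(failing, setup))
```

A test that changes a single factor gives an unambiguous answer; the driver rollback changes two, so a stable system afterwards would not say which of them mattered.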

I need to work on something that will let us see the problem happen, and why.

I was thinking more along the lines of some settings that will make BOINC put some debugging info into the log so that when the error occurs, I will know what happened and why.

No one has yet answered my question, which I have asked 4 times, in 4 different ways:

Posted 8 Aug 2009 16:14:16 UTC
I wonder if there are any switches that I can set that will cause the D@H apps to write more debugging info to the log so that the cause of the random reboots can be determined.

Posted 9 Aug 2009 17:03:13 UTC
What about the debugging?

Posted 16 Aug 2009 15:04:36 UTC
Is there anything I can do to make BOINC do more debugging in order to track down this problem?

Posted 17 Aug 2009 8:31:08 UTC
I wonder if anything can be done to help locate the problem other than trying a bunch of other projects?

If there really is no way to debug BOINC, then please just say so instead of ignoring the questions.


By the way, as of this posting D@H is still not running and still there have been zero random reboots after almost 10 days.

Zero response about debugging and zero reboots without D@H running does not make me want to turn D@H on again.
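
On the debugging question: the BOINC client can read extra logging switches from a cc_config.xml file in its data directory. Below is a minimal Python sketch that generates such a file. The three flag names are believed to be valid BOINC log flags, but whether this particular client version honours each of them is an assumption to verify, and the helper name build_cc_config is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed flag names; check them against the documentation for the
# installed client version before relying on them.
DEBUG_FLAGS = ("task_debug", "cpu_sched_debug", "coproc_debug")


def build_cc_config(flags=DEBUG_FLAGS):
    """Return the text of a cc_config.xml that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is an element whose text "1" means enabled.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the file contents; saving them into the data directory is left manual.
    print(build_cc_config())
```

Since the machine resets without warning, the last few log lines before a reboot may never reach the disk, so extra logging may narrow the cause down rather than pinpoint it.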
7)

( Message 5323 )
Posted 2914 days ago by Steven Meyer
Docking is the only other BOINC project you have tried. My suggestion was to try others and see if they run okay or not. That could point you in the direction of the problem - all you know now is that SETI and Docking won't cohabit on your system.

I have Docking and SETI running together on this machine, no issues here. There are MANY people who are doing the same thing without problems. You have made the assumption that SETI is fine and Docking is the problem - if that is what you want to believe, well I doubt we will change your mind.


I am not trying to point fingers, so chill. :D

I wonder if anything can be done to help locate the problem other than trying a bunch of other projects?
8)

( Message 5317 )
Posted 2915 days ago by Steven Meyer
It has been a week since my last post on this subject, and during that week D@H has not been running while S@H has been. Also, I have not had any spontaneous reboots during that week.

By not doing so, you make the assumption that Docking is at fault and that SETI is fine. The thousands of other Docking users that do not have your problem must just be lucky? Rather than increasing the possibilities, the course of action I suggest narrows your problem. Turning Docking off achieves nothing.

What I have achieved is that now we know that it is very likely that D@H is involved. If the S@H optimized code is the cause, then at the very least D@H is the catalyst. One or the other of these two projects may be stepping on the other in a way that causes these reboots.

It could also be that S@H is not involved at all.

Something in my system is stepping on D@H, or vice-versa.

  • Something that "the thousands of other Docking users" do not have on their systems.
  • Or else they have not yet bothered to report it because they have not been bothered enough by it yet.
  • Or they do not leave their systems on 24x7 and so they do not see it happen.
  • Or they get so many spontaneous reboots that they do not yet suspect D@H because the reboots did not start right after D@H was installed.
  • Or they could be lucky.
  • Or . . .


Is there anything I can do to make BOINC do more debugging in order to track down this problem?

9)

( Message 5304 )
Posted 2922 days ago by Steven Meyer
Try another co-project, POEM or Rosetta for example. Try to narrow the problem down, i.e. see if the problem is occurring with other co-running projects as well. A lot of the "optimised" stuff around is not 100% reliable across all hardware systems.


I may try that after a while. I want to reduce the possibilities rather than increase them.

The optimized code seems to be working fine.

There have been zero reboots since I suspended the D@H tasks.

So I have two time periods during which there were no reboots, and in both of those times, there were also no D@H tasks running.

However, this second time period has been only about 1 day so I'll be letting it go on for a few more days to make sure.

What about the debugging?
10)

( Message 5301 )
Posted 2923 days ago by Steven Meyer
To clarify, "Random Reboot" is not the infamous "Blue Screen Of Death", which at least reports the probable cause of the problem. When one of these reboots happens, the screen simply goes black, and then the POST appears.

