Invalid results reported thread [Use Here]
Message boards : Number crunching : Invalid results reported thread [Use Here]
Hi all:)
ID: 1534
Another one detected: http://docking.utep.edu/result.php?resultid=47298 |
ID: 1553
The Celeron and the PIII have the same answer; the Pentium D has a different one, so it is deemed invalid. For this one we know what the problem is, but we do not have a solution yet. Frantically working on it though :-)
Hi all:) ____________ D@H the greatest project in the world... a while from now! |
ID: 1567
Same as the previous one. This time the two Celerons beat the Pentium D.
Another one detected: http://docking.utep.edu/result.php?resultid=47298 ____________ D@H the greatest project in the world... a while from now! |
ID: 1568
Ah, I got it :) I'd almost forgotten that point... shame on me... lol
ID: 1569
Please use this thread if invalid results are detected.
ID: 2102
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=75455 http://docking.utep.edu/result.php?resultid=75269 http://docking.utep.edu/result.php?resultid=74480 http://docking.utep.edu/result.php?resultid=74241 http://docking.utep.edu/result.php?resultid=74137 http://docking.utep.edu/result.php?resultid=74094 There's more for this computer, but I'm going to look at the other machines. |
ID: 2109
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=75927 http://docking.utep.edu/result.php?resultid=75920 http://docking.utep.edu/result.php?resultid=75911 |
ID: 2110
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=76753 http://docking.utep.edu/result.php?resultid=75768 http://docking.utep.edu/result.php?resultid=73550 |
ID: 2111
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=76832 http://docking.utep.edu/result.php?resultid=76267 http://docking.utep.edu/result.php?resultid=76252 http://docking.utep.edu/result.php?resultid=75805 http://docking.utep.edu/result.php?resultid=75803 http://docking.utep.edu/result.php?resultid=75790 |
ID: 2112
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=76271 http://docking.utep.edu/result.php?resultid=75788 http://docking.utep.edu/result.php?resultid=74809 |
ID: 2113
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=76256 http://docking.utep.edu/result.php?resultid=75779 http://docking.utep.edu/result.php?resultid=75765 http://docking.utep.edu/result.php?resultid=75764 |
ID: 2114
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=75771 |
ID: 2115
Please use this thread if invalid results are detected. http://docking.utep.edu/result.php?resultid=74031 |
ID: 2116
http://docking.utep.edu/result.php?resultid=76265
ID: 2117
This one is due to the app version difference between 5.03 and 5.04.
http://docking.utep.edu/result.php?resultid=76265 ____________ D@H the greatest project in the world... a while from now! |
ID: 2119
This one is due to the app version difference between 5.03 and 5.04.
Please use this thread if invalid results are detected. ____________ D@H the greatest project in the world... a while from now! |
ID: 2120
I haven't checked all of these results, but it seems that most of the invalid statuses can be attributed to the app version difference on Windows. Please check this first before reporting any invalid results. The app version can be found at the bottom of a result page.
ID: 2121
Here is the first result of computer ID 745 ("AuthenticAMD x86 Family 6 Model 4 Stepping 2 1333MHz"), validated against two Pentium III computers, and the result is invalid! All three computers finished the results with charmm 5.04.
ID: 2142
> Just an extra note to the Docking team,
ID: 2226
I found one, 2 results validated with 5.03, mine marked invalid with 5.04.
ID: 2227
Thanks Conan,
> Just an extra note to the Docking team, ____________ D@H the greatest project in the world... a while from now! |
ID: 2231
Yep, but if no one picks up the missing third results and finishes them, even with 5.04, then usually the two other guys won't get credit for this workunit... Btw, this one will be the other way round. I delivered result #1 on the 7th of January with version 5.03, and then #2 and #3 were sent on the 15th and the 19th of January. I can't think of an easy solution, but there has to be one before this project goes productive. I don't think people will like the idea of losing credits on every update of the science application. At the moment I certainly don't mind; after all, it's Alpha! Regards Alex My results during the HR tests |
ID: 2232
David Anderson suggested some workarounds in this post. The second one seems like a fairly easy-to-implement workaround. He also has it on his to-do list to build a solution into boinc for such cases.
____________ D@H the greatest project in the world... a while from now! |
ID: 2234
The second one seems like a fairly easy to implement workaround. Ok, provided this is acceptable to the science of the project. But wouldn't it have the same effect to decrease the quorum to 1 instead of 3 or 2 during the transition? With the additional benefit of doing more work with the same computational power? Maybe I'm missing something here. Well, you could also switch off workunit creation prior to the introduction of a new version and then wait for the workunits to run dry... Regards Alex |
ID: 2235
Some good tips there...
The second one seems like a fairly easy to implement workaround. ____________ D@H the greatest project in the world... a while from now! |
ID: 2240
David Anderson suggested some workarounds in this post . The second one seems like a fairly easy to implement workaround. He also has it on his to-do list to built in a solution in boinc for such cases. If WUs can be parsed by the type of CPU (or arbitrary class), why can't WUs be parsed by the client version? |
ID: 2243
I think that's what David A. means with his first workaround, because boinc currently doesn't have the capability to parse workunits based on app version; instead, this could be done in the validator. Eventually, code changes will be necessary on the server side (scheduling should take the app version into account) and the client side (multiple app versions should be kept on disk for a while) for the real solution.
If WUs can be parsed by the type of CPU (or arbitrary class), why can't WUs be parsed by the client version? ____________ D@H the greatest project in the world... a while from now! |
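The validator-side workaround discussed above boils down to a simple rule. Here is an illustrative sketch of it in C; the struct and field names are invented for illustration and are not D@H's real validator code:

```c
/* Sketch of the validator-side workaround: only compare results that
   were produced by the same app version, so a version bump can never
   mark old work invalid. Types/fields are illustrative, not D@H's. */
struct result {
    int app_version;   /* e.g. 503 for 5.03, 504 for 5.04 */
    double energy;     /* some score produced by the science app */
};

/* Results from different app versions are never checked against
   each other. */
int comparable(const struct result *a, const struct result *b)
{
    return a->app_version == b->app_version;
}
```

The cost of this rule is what Alex describes below: results crunched with the old version may never reach quorum against new-version results.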
ID: 2245
Some good tips there... I hope you don't inadvertently follow in Rosetta@Home's footsteps and create a graphics compatibility disaster. |
ID: 2268
Just tried the latest and greatest Linux boinc client (5.8.7). Four Docking@home workunits in a row terminated abnormally (code 0x1). Might be some sort of incompatibility between what is being done to the boinc client, and the Charmm 5.02 application. [Did not experience Docking@home problems when running with previous boinc clients (5.8.6 or earlier).]
ID: 2338
Interesting... that seems to be the same error as we have seen (and see) on machines that have set their stack limit too low. Just to make sure: is your stack limit set to unlimited? (Check with 'ulimit -s'.) The strange thing is that it usually happens after 10 to 30 secs or so, and yours seem to have crunched quite a bit longer. I don't quite understand how the boinc client can influence the app this way, but maybe this is a new case.
Just tried the latest and greatest Linux boinc client (5.8.7). Four Docking@home workunits in a row terminated abnormally (code 0x1). Might be some sort of incompatibility between what is being done to the boinc client, and the Charmm 5.02 application. [Did not experience Docking@home problems when running with previous boinc clients (5.8.6 or earlier).] ____________ D@H the greatest project in the world... a while from now! |
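For anyone checking their own box: the condition Andre asks about can also be queried from inside a process. This is just an illustrative sketch of what `ulimit -s` reports, not D@H code:

```c
/* Sketch: report whether this process's soft stack limit is
   unlimited, which is the condition the charmm app needs.
   Equivalent in spirit to running `ulimit -s` in the shell. */
#include <sys/resource.h>

/* Returns 1 if the soft stack limit is unlimited, 0 otherwise. */
int stack_is_unlimited(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return 0;   /* treat a failed query as "not unlimited" */
    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}
```

Note that resource limits are inherited by child processes, which is why the limit the boinc client runs under matters for the science app it launches.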
ID: 2339
Interesting... that seems to be the same error as we have seen (and see) on machines that have set their stack limit too low. Just to make sure: is your stack limit set to unlimited? (Check with 'ulimit -s'.) The strange thing is that it usually happens after 10 to 30 secs or so, and yours seem to have crunched quite a bit longer. I don't quite understand how the boinc client can influence the app this way, but maybe this is a new case. I'll switch a dedicated D@H Windows box to 5.8.8 as soon as it runs dry and see how that goes. |
ID: 2340
OK. Just tried 5.8.8 (latest as of now) and all my results errored out with error 1. The error looks like the stack limit problem, but must be something else as my SuSE box has stack limit set to unlimited. Thus, I can reproduce the problem and am working with David A. to see if we can find a resolution.
ID: 2341
OK. Just tried 5.8.8 (latest as of now) and all my results errored out with error 1. The error looks like the stack limit problem, but must be something else as my SuSE box has stack limit set to unlimited. Thus, I can reproduce the problem and am working with David A. to see if we can find a resolution. Ran one Windows box dry and started 5.8.8. I only caught one WU for one CPU, but it has gone 10% so far, no problem yet. Switched another Windows box to 5.8.8 that had three WUs on it. Two are running and have moved up another 10% or so with no apparent problem. I'll go switch a Linux box to see. |
ID: 2342
It might be Linux-only. I have just run two workunits on client 5.4.9 and these also error out... This doesn't make sense though: neither the workunit nor the app version has changed in weeks. Will research further.
OK. Just tried 5.8.8 (latest as of now) and all my results errored out with error 1. The error looks like the stack limit problem, but must be something else as my SuSE box has stack limit set to unlimited. Thus, I can reproduce the problem and am working with David A. to see if we can find a resolution. ____________ D@H the greatest project in the world... a while from now! |
ID: 2343
I'll go switch a Linux box to see. Let us know what you find out. I'm still not convinced whether it is us or the new boinc client. More data might help us find the cause. AK ____________ D@H the greatest project in the world... a while from now! |
ID: 2344
Just to make sure: is your stack limit set to unlimited? (check with 'ulimit -s'). Yes, it is. The strange thing is that it usually happens after 10 to 30 secs or so and yours seem to have crunched quite a bit longer. The result I gave as a reference started computing with the 5.8.6 client, then failed after a couple of minutes of running under 5.8.7. That's why it has accumulated more time. The other three results that failed under 5.8.7 all started ok, ran for about six minutes, then exited with 0x1. Here is a more recent failing result, this time on a single-processor Ubuntu 6.10 system (with stack set to unlimited), using the boinc 5.8.8 client: http://docking.utep.edu/result.php?resultid=87703 |
ID: 2345
I'll go switch a Linux box to see. Picked a "dry" Linux box and converted it to 5.8.8..........but it is not getting WUs. I'll convert a Linux box with WUs already being processed. |
ID: 2346
I'll go switch a Linux box to see. I'm sure you noticed, but installing 5.8.8 overwrites the run_client and run_manager, so the ulimit -s unlimited could be missing......mine was. |
ID: 2347
I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch.
ID: 2348
I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch. This would definitely explain the crashing behavior on my SuSE linux box that never did this before. Going from unlimited stack to 500MB would crash the simulation after 5 to 10 minutes... EDIT: Here is a link to a patch being checked in that changes rlimit for RLIMIT_STACK. This was from January 30th, but it looks like they have been changing it earlier in the month as well. I'm not sure when they started messing with it. Thanks for the pointers. I've asked Dave A. about which patches were in which boinc client versions. EDIT: I just emailed the info to Andre so he won't waste time looking at the problem without knowing about this. Thanks! Andre ____________ D@H the greatest project in the world... a while from now! |
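For readers unfamiliar with rlimits, here is a sketch of the kind of change being described: a client process lowering its stack limit to a fixed 500 MB via setrlimit. The numbers and function name are illustrative; this is not the actual BOINC patch. Because children inherit resource limits, any science app the client launches afterwards runs under the lowered limit, regardless of what a wrapper script set:

```c
/* Sketch (not the actual BOINC patch): cap the soft stack limit at
   500 MB. Child processes started after this call inherit the
   lowered limit, which is how an app that needs an unlimited stack
   can end up crashing. */
#include <sys/resource.h>

int cap_stack_500mb(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    rl.rlim_cur = (rlim_t)500 * 1024 * 1024;    /* new soft limit */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;              /* clamp to hard limit */
    return setrlimit(RLIMIT_STACK, &rl);
}
```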
ID: 2349
I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch. Could it be that the change in the patch also overrides the ulimit directive we put in the run files? |
ID: 2350
Yes, unfortunately it does, because the limit is lowered by the boinc client which is started from the script. This has already been corrected in the boinc client code though, I just don't know in which version the fix of the patch will end up. Andre ____________ D@H the greatest project in the world... a while from now! |
ID: 2353
Is there anything you want me to try with a couple of Linux boxes, before I go back to the recommended 5.4.11? |
ID: 2369
Saw on the boinc-dev list (and Rom told me) that they will probably release 5.8.9 to correct the problem. Let's wait until that release to re-test.
____________ D@H the greatest project in the world... a while from now! |
ID: 2373
The new client has been officially released. Message from Rom:
ID: 2377
Saw on the boinc-dev list (and Rom told me) that they will probably release 5.8.9 to correct the problem. Let's wait until that release to re-test. The latest Linux boinc client fixes the problem that client releases 5.8.7 and 5.8.8 had on Linux. |
ID: 2451
Please use this thread if invalid results are detected.
ID: 2702
Hi Michael,
Please use this thread if invalid results are detected. ____________ D@H the greatest project in the world... a while from now! |
ID: 2710
Zero credit for this result, which looks OK to me, at least as far as I can see.
ID: 2719
Hi Conan,
Zero credit for this result, which looks OK to me, at least as far as I can see. ____________ D@H the greatest project in the world... a while from now! |
ID: 2720
http://docking.utep.edu/result.php?resultid=128455
ID: 2857
Looks like my one invalid is probably the problem referenced above with restarting a work unit...
ID: 2858
Is this the first one where that happened? How many others went okay? Let us know if it happens again.
http://docking.utep.edu/result.php?resultid=128455 ____________ D@H the greatest project in the world... a while from now! |
ID: 2859
Is this the first one where that happened? How many others went okay? Let us know if it happens again. So far it is the only one that has done this. I have completed a lot with no problem. I will let you know if I have any more issues. (Impressively quick reply, I must say.) One thing I did notice was that the time to completion had stopped counting down and was increasing. When I aborted it, it was at over 6 hours to completion and climbing. The CPU time was still rising. Before I aborted it, I had paused it and then resumed it. The CPU time and the amount completed reset (I can't remember whether the time to completion did or not). Anyway, after I resumed it the behaviour was the same: stuck at the same % complete, with both the time to completion and the CPU time rising. I do not believe I did anything on my PC to interrupt it. ____________ |
ID: 2862
It might help with cases where Charmm exits with code 0, but in this case, there was an error: incorrect function; that one is still kind of a mystery to us and we are investigating it. It seems to be a fortran related error as the cpdn guys have seen this before as well.
Looks like my one invalid is probably the problem referenced above with restarting a work unit... ____________ D@H the greatest project in the world... a while from now! |
ID: 2864
We have seen such cases in the lab before, but very sporadically. Trilce and Roger found out that charmm is actually stuck in a loop at that point due to a certain condition of the protein, ligand and random seed. This might be the first time we see it happen on the grid. We'll investigate using the input file your host was sent.
One thing I did notice was that the time to completion had stopped counting down and was increasing. When I aborted it, it was at over 6 hours to completion and climbing. The CPU time was still rising. Before I aborted it, I had paused it then resumed it. The CPU time reset, amount completed (can't remember if the time to completion did or not.) Anyway after I resumed it the response was the same---stuck at the % complete, and both the time to completion and CPU were rising. I do not believe I did anything on my pc to interrupt it. ____________ D@H the greatest project in the world... a while from now! |
ID: 2865
We have seen such cases in the lab before, but very sporadically. Trilce and Roger found out that charmm is actually stuck in a loop at that point due to a certain condition of the protein, ligand and random seed. This might be the first time we see it happen on the grid. We'll investigate using the input file your host was sent. Please let me know, I am curious. This was actually the first WU in a long time (if ever) that I had go bad. ____________ |
ID: 2866
It might help with cases where Charmm exits with code 0, but in this case, there was an error: incorrect function; that one is still kind of a mystery to us and we are investigating it. It seems to be a fortran related error as the cpdn guys have seen this before as well. Anything you want me to do Andre? It was the only WU like this. And everything else, including CPDN/CPDN Beta are running happily on that machine. |
ID: 2869
Anything you want me to do Andre? It was the only WU like this. And everything else, including CPDN/CPDN Beta are running happily on that machine. Not much that you can do at the moment; please report if you get others like this! Thanks Andre PS We haven't been able to reproduce this incorrect function error in the lab, so it will be a tough one to solve. If you see something on other project forums about a possible cause (or solution), please let us know. ____________ D@H the greatest project in the world... a while from now! |
ID: 2873
http://docking.utep.edu/result.php?resultid=128497
ID: 2886
http://docking.utep.edu/result.php?resultid=135992
ID: 2888
http://docking.utep.edu/result.php?resultid=135992 Is Charmm capable of being run on Win9x, in the first place? > Andre ____________ I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions. |
ID: 2889
http://docking.utep.edu/result.php?resultid=135992 Win9x is not supported. |
ID: 2890
We are looking into this, but it really shouldn't happen twice in a row, so to speak... can you find out if there is another cause related to your PC?
http://docking.utep.edu/result.php?resultid=128497 ____________ D@H the greatest project in the world... a while from now! |
ID: 2904
Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though.
http://docking.utep.edu/result.php?resultid=135992 ____________ D@H the greatest project in the world... a while from now! |
ID: 2905
Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though. Doesn't it work to delete Win9x from the platform list on your server? ____________ I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions. |
ID: 2909
Doesn't it work to delete Win9x from the platform list on your server? That platform doesn't exist on boinc :-( AK ____________ D@H the greatest project in the world... a while from now! |
ID: 2913
We are looking into this, but it really shouldn't happen twice in a row, so to speak... can you find out if there is another cause related to your PC? I suspect that my younger son did something that caused it to freeze. Both times he had been on the computer playing games on-line during the time the WUs got messed up. (He's 3 and sometimes hits other buttons by accident.) It hadn't happened before, so that is why I thought something else had happened. I will keep trying to figure out what it is that he is doing. Thanks! ____________ |
ID: 2914
Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though. Maybe it is wise to put up an extra point on the homepage. Something like "Hardware requirements and limitations" or so. ;-) Because the news are somewhat hidden after some days and not everyone will look into the FAQ. ____________ Bribe me with Lasagna!! :-) |
ID: 2916
Good suggestion. It's on our to-do list. AK ____________ D@H the greatest project in the world... a while from now! |
ID: 2918
We are looking into this, but it really shouldn't happen twice in a row, so to speak... can you find out if there is another cause related to your PC? I had another one http://docking.utep.edu/result.php?resultid=137115 that 'froze' when my son was at the computer. As far as I can tell he didn't do anything other than play games (nothing else seemed messed up). It just may be that something with Nickjr.com does something to these WUs, although I haven't had this problem crunching for other projects. Maybe I will just crunch other projects when he is on the computer. ____________ |
ID: 2963
Any chance that you can monitor your memory usage during those times? I wonder if it is a memory issue that causes the freeze.
____________ D@H the greatest project in the world... a while from now! |
ID: 2967
I will try to. I currently have 1MB of RAM.
ID: 2968
> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated.
ID: 3045
> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated. I've seen reports that the 5.05 Docking Client sometimes doesn't stop when it's told to suspend. I haven't encountered this personally, at least not that I remember noticing. I'm not sure if it's related to BOINC client version and/or OS type/version. I've noticed that the D@H client has 4 threads in it. My guess would be that BOINC tried to suspend it for some reason on host 1290 and it kept running. Somehow, in the interaction between the BOINC client and the D@H project client, that CPU time wasn't counted. In other words, host 1290 took a lot longer to run the WU but didn't report the time correctly. Hmm, host 1290 shows that it restarted that result from a checkpoint. I'll bet the initial run continued when it was told to stop, completed the WU, and the restart from checkpoint woke up, found the result completed, and reported it finished. That second run probably took 1.17 seconds. The project staff probably needs to look into clients running while suspended and also to check if the CPU time is being accumulated properly when BOINC starts a new copy of the D@H client and it continues a WU. BTW, your host 130 is running BOINC core client 5.8.11 which doesn't return all the info needed for HR. Also Result 150783 was granted 0.00 credit against 2 others (one of which was also one of my other machines), the only difference was the process time, mine took over twice as long. That's strange. When I looked at the 3 results, the 2 machines that validated ran the WU with D@H client 5.04 and the third machine somehow ran it with D@H client 5.05. The project staff will need to figure out how this happened, too . BTW, Your host 1569 is also running BOINC core client 5.8.11. 
To be properly matched on HR for D@H, you need a later BOINC client which includes the processor Family, Model, and Stepping code. Look at that WU and look at each of the machines CPU type. Host 130 - BOINC client 5.8.11 - D@H application 5.04 CPU type AuthenticAMD Dual Core AMD Opteron(tm) Processor 285 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy] Host 1569 - BOINC client 5.8.11 - D@H application 5.05 CPU type AuthenticAMD Dual Core AMD Opteron(tm) Processor 2 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy] Host 1680 - BOINC client 5.8.17 - D@H application 5.04 CPU type AuthenticAMD AMD Opteron(tm) Processor 146 [Family 15 Model 39 Stepping 1][fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm] If I understand it correctly, the HR algorithm uses the manufacturer (e.g. AuthenticAMD) and the Family/Model/Step (e.g [Family 15 Model 39 Stepping 1])to match machines so they will validate against each other for HR (Homogeneous Redundancy). I hope this helps and I'm not trying to gripe at you for having the older BOINC client. I need to check all of mine too and make sure they're the latest stable BOINC client. Sometimes newer stable clients are released rather quietly. IIRC, 5.8.15 and later will return enough info for D@H on Windows and MAC, but Linux needs a later version. As of right now, the latest stable versions are Windows: 5.8.16 Mac OS X: 5.8.17 with GUI or Command Line Linux: 5.8.17 with GUI NOTE: On one of my Linux machines (2.4.x kernel, text only), I had to stay with BOINC client version 5.8.15 because it didn't have some library that later versions needed. 
I wish they would release a non-gui version for Linux but they've stopped doing that and are linking the BOINC Client on a machine with later versions of libraries. Happy Crunching, -- David ____________ The views expressed are my own. Facts are subject to memory error :-) Have you read a good science fiction novel lately? |
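David's explanation of HR matching can be summarized with a small sketch. The struct and rule below are a simplified illustration of the idea (vendor plus CPU Family/Model/Stepping must all match), not D@H's actual classifier code:

```c
/* Illustrative sketch of the HR (Homogeneous Redundancy) rule
   described above: two hosts fall into the same class only if the
   CPU vendor and the Family/Model/Stepping all match. Simplified;
   not D@H's real classifier. */
#include <string.h>

struct host {
    const char *vendor;          /* e.g. "AuthenticAMD" */
    int family, model, stepping; /* e.g. 15, 39, 1 */
};

int same_hr_class(const struct host *a, const struct host *b)
{
    return strcmp(a->vendor, b->vendor) == 0 &&
           a->family   == b->family &&
           a->model    == b->model &&
           a->stepping == b->stepping;
}
```

This also shows why older BOINC clients that omit Family/Model/Stepping are a problem: without those fields, the server has to guess the class from coarser information.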
ID: 3046
> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated. Thanks David, a lot of what you said can and does make sense. The reason my Linux machines are still using that Boinc version is that I have had no confirmation that the newer versions show the required information for HR on Linux yet. In fact we were told not to upgrade till it was fixed by Andre no less. Works fine for Windows but not for Linux. So I have had no reason to upgrade at this stage. ____________ |
ID: 3048
@Conan
ID: 3050
David, most of what you explained is correct. There is one comment that I would like to make and that is that, yes, the Family/Model/Stepping information will make it much easier to classify a certain host into a HR class, but we still use the older filtering rules for the older boinc clients too. This means that sometimes a host that runs an older boinc client is classified into a HR class that it doesn't really belong to and the result might be invalid because of that. These are the ones we have to find out about so that we can adjust the HR rules and make them more reliable.
@Conan ____________ D@H the greatest project in the world... a while from now! |
ID: 3070
I've seen reports that the 5.05 Docking Client sometimes doesn't stop when it's told to suspend. I haven't encountered this personally, at least not that I remember noticing. I'm not sure if it's related to BOINC client version and/or OS type/version. I've noticed that the D@H client has 4 threads in it. My guess would be that BOINC tried to suspend it for some reason on host 1290 and it kept running. Somehow, in the interaction between the BOINC client and the D@H project client, that CPU time wasn't counted. In other words, host 1290 took a lot longer to run the WU but didn't report the time correctly. Just noticed this thread (I'm the owner of host 1290). Currently running 32-bit Linux Charmm 5.07 -- it __still__ does not stop when I suspend computation. [Before the change to have the Docking application itself set 'ulimit -s unlimited', this same system *was* counting CPU time correctly for Docking workunits (and would stop on suspend). After the change, the system is now reporting times around 1 second for *every* Docking workunit.] Don't know about "restarts from checkpoint". The system runs 24/7 - the only time crunching gets explicitly interrupted is if I'm upgrading the boinc client. When I looked at the 3 results, the 2 machines that validated ran the WU with D@H client 5.04 and the third machine somehow ran it with D@H client 5.05.Don't know if the boinc developers have fixed things, but once upon a time the reported client version was what did the *formatting* of the "task completed" RPC message sent to the server - not the client version that actually did the *crunching* of the workunit. . |
||
ID: 3224 | Rating: 0 | rate: / | ||
Mikus, we cannot reproduce the case where the result does not suspend at all. Because of the atomic checkpointing we now use, it is possible that the app keeps running for up to 10 seconds while it finishes a checkpoint, but it should stop after that. Please keep monitoring the charmm processes to see if this happens. Thanks Andre ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3226 | Rating: 0 | rate: / | ||
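For context, "atomic checkpointing" is commonly implemented as write-to-a-temp-file-then-rename, so the checkpoint on disk is always a complete file: either the old one or the new one, never a partial write. A minimal sketch of the pattern (my illustration, not charmm's actual code):

```python
import os
import tempfile

def write_checkpoint_atomically(path, data):
    """Replace the checkpoint at `path` so that readers only ever see
    a complete file: either the old checkpoint or the new one."""
    d = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # on one filesystem (rename is only atomic in that case).
    fd, tmp = tempfile.mkstemp(dir=d, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # data durably on disk before the swap
        os.replace(tmp, path)      # atomic on POSIX: old or new, never partial
    except BaseException:
        os.unlink(tmp)
        raise
```

A suspend request that arrives in the middle of this sequence has to wait until the sequence finishes, which is why the app can legitimately keep running for a few seconds after "suspend".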
Mikus, we can not reproduce the case where the result does not suspend at all. Because of the atomic checkpointing we now use, it is possible that the app keeps on running for up to 10 secs because it is in the middle of the checkpoint. It should stop at some point though. Please keep on monitoring the charmm processes to see if this happens. My AMD multi-core system has installed 64-bit Ubuntu 7.04, the 32-bit boinc 5.9.5 client/manager, and the 32-bit Linux charmm 5.07 application (as well as numerous other projects). [I believe that 64-bit Ubuntu is "different" in that its /lib is 64-bit, and 32-bit applications are accommodated by /lib32, whereas 64-bit SuSE (for example) provides /lib as 32-bit and /lib64 as 64-bit.] The principal tools I use to (visually) view processing are gkrellm and top. The current Docking application is easy to spot on gkrellm - it uses so much system time that a broad orange stripe (<40% of the CPU) is drawn by gkrellm for whichever CPU is running it. [Boinc_application execution is "niced" to idle, thus shows up green (~100%) in gkrellm.] When I 'Suspend' Docking in boincmgr, the orange stripe is reduced to half-height -- the reason is that the boinc client believes that the Docking task is no longer running, so it dispatches a task from a different project -- and the two tasks (Docking and the other) end up sharing that CPU. This is confirmed by top -- it shows the Docking task continuing to run (at about 50% of the CPU), and the other project's task *also* running at 50% of the CPU. I'm letting the system continue - I expect the task mix to get back to normal once the Docking task completes. But it has been more than 30 minutes now since I issued the 'Suspend' of Docking -- yet the Docking task continues its activity. . |
||
ID: 3229 | Rating: 0 | rate: / | ||
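The monitoring Andre asked for can be automated instead of watching top by hand. A rough sketch that samples `ps` twice and reports whether any charmm process is still accumulating CPU time; the function names are mine, and it assumes a Linux `ps` that supports `-eo comm,time`:

```python
import subprocess
import time

def parse_ps_time(t):
    """Convert a ps TIME field ('MM:SS', 'HH:MM:SS' or 'DD-HH:MM:SS')
    into seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    h, m, s = ([0, 0, 0] + [int(x) for x in t.split(":")])[-3:]
    return days * 86400 + h * 3600 + m * 60 + s

def cpu_seconds(name="charmm"):
    """Total CPU seconds of all processes whose command matches `name`."""
    out = subprocess.run(["ps", "-eo", "comm,time"],
                         capture_output=True, text=True).stdout
    return sum(parse_ps_time(parts[-1])
               for parts in (line.split() for line in out.splitlines()[1:])
               if parts and name in parts[0])

def still_crunching(name="charmm", interval=15):
    """True if the matching processes accumulated CPU time over the
    interval, i.e. a 'suspended' task is in fact still running."""
    before = cpu_seconds(name)
    time.sleep(interval)
    return cpu_seconds(name) > before
```

Calling `still_crunching()` a minute or two after suspending the project should return False on a system where suspend works, and True on a system showing the behavior mikus describes.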
It's interesting - I've noticed on gkrellm that the orange stripe for Docking sometimes goes to full height (<40% CPU). [This is with 'Suspend' still in effect for Docking - and boincmgr *does* show the status of Docking tasks as 'suspended'.]
|
||
ID: 3230 | Rating: 0 | rate: / | ||
Is there anybody else that notices this behavior on Ubuntu (or any other distro for that matter)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior.
It's interesting - I've noticed on gkrellm that the orange stripe for Docking sometimes goes to full height (<40% CPU). [This is with 'Suspend' still in effect for Docking - and boincmgr *does* show the status of Docking tasks as 'suspended'.] ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3231 | Rating: 0 | rate: / | ||
Is there anybody else that notices this behavior on Ubuntu (or any other distro for that fact)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior. Four Intel duals and two AMD single-cores running Ubuntu 7.04 and 32-bit 5.8.16 or 5.8.17, all suspend D@H correctly. |
||
ID: 3235 | Rating: 0 | rate: / | ||
Is there anybody else that notices this behavior on Ubuntu (or any other distro for that fact)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior. It doesn't have to be Ubuntu 7.04. My system first started not suspending Docking workunits when Charmm 5.05 was released. At the time I was running 64-bit Ubuntu 6.10. http://docking.utep.edu/forum_thread.php?id=220&nowrap=true#2997 -------- p.s. Looked at the system when I got up this morning. It is again running one more boinc application task than there are assigned CPUs - and one of those tasks is a Docking task (boincmgr shows the status of that Docking task as "Waiting to run", yet its progress percentage is incrementing; its CPU time shown remains at less than 1 second). Don't know how come the boinc client allowed an "extra" task to be active in the middle of the night. . |
||
ID: 3238 | Rating: 0 | rate: / | ||
Is there anybody else that notices this behavior on Ubuntu (or any other distro for that fact)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior. Did you change anything on this computer between 9 April and 11 April? It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April. You were running 5.8.17 on both of those dates. Between 11 April and 27 April, you switched to 5.9.4, but it still didn't appear to be working correctly. Between 27 April and 30 April, you switched to 5.9.5, but it wasn't working anyway. You're still on 5.9.5 and it still isn't working correctly (judging by your results). I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit. |
||
ID: 3239 | Rating: 0 | rate: / | ||
I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit. Mikus's results are actually being crunched normally and return a result similar to the other replicas in the workunit. If we were not using fixed credit he would not receive much for it, since the CPU time the boinc client reports is so low. I think this is a combination of a problem in the newer boinc clients and how our app interacts with them. Andre ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3240 | Rating: 0 | rate: / | ||
I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit. @Andre, Please look at WU ID 52204. I do not see how 0.83 seconds is anywhere close to 9,503.05 seconds. Something is not right. |
||
ID: 3242 | Rating: 0 | rate: / | ||
I agree, but the problem is on the client side. The boinc client reports the CPU time to the server when the result is uploaded. Although the reported CPU time is very low, the result that mikus returns is actually correct and validates against the other result in the workunit. He probably used a lot more time to compute this result than his boinc client tells us. He is using alpha client 5.9.5, which probably contains a bug. Thanks! Andre ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3244 | Rating: 0 | rate: / | ||
Look at the results from 11 April on... 5.8.17 was still in use and it was returning values of less than one second. It can't be from 5.9.4 or 5.9.5; something else was in play before 5.9.4. I have 5.9.5 on several machines, including Ubuntu 6.10 (then 7.04), without those kinds of CPU time results. OK, I'm done. |
||
ID: 3245 | Rating: 0 | rate: / | ||
Did you change anything on this computer between 9 April and 11 April? Don't remember what I did yesterday, let alone a month ago. But if you look at my results page, the result reported on Apr 11 used charmm 5.04, whereas the result reported on Apr 12 used charmm 5.05. So the answer to your question is: __I__ did not change anything on this computer between 9 April and 11 April. However, the __boinc environment__ "automatically" downloaded a new Docking application version to my system when I connected 11 April. This is confirmed by my post 12 April UTC (11 April local time) in which I described my experiences with the newly downloaded Linux charmm 5.05 -- including the fact that now a 'Suspend' of the charmm task did not work. Interesting that you want answers from me. I'm a user. Why expect *me* to know how come 5.05 behaves differently in the boinc environment than 5.04 ? I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit. That too was a "feature" introduced by the update to 5.05 (this "feature" has so far been carried forward to all subsequent Docking application releases). Note that the boinc client I was using on 9 April was 5.8.17, and the boinc client I was using on 12 April was 5.8.17 -- it was the new charmm 5.05, upon being "tracked" by the __same__ boinc client, that now showed up as taking less than one second of execution. [In actuality, Docking workunits need more than two hours of crunching each on my system. I have no control over what gets to be reported.] My own problem with Docking applications on Linux from 5.05 on is that they use up to 40% of one CPU for "system services". Versions prior to 5.05 did not do this. I suspect that this is "unproductive" overhead, which eats up CPU cycles better spent on crunching the actual applications. . |
||
ID: 3247 | Rating: 0 | rate: / | ||
By looking at the time a replica was sent and the time it was reported you can come to the conclusion that it ran for more than a few seconds.
|
||
ID: 3249 | Rating: 0 | rate: / | ||
Did you change anything on this computer between 9 April and 11 April? I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time. It arouses my curiosity when I notice things like that. The fact is that at least your Linux Opterons ran 5.05, even if that is when the short time showed up. My Linux Opterons (single-core) crashed on every 5.05 task, but then started running 5.06 with normal times (pretty much the same as 5.04). There may be more Linux Opterons running, but I haven't found any yet. |
||
ID: 3252 | Rating: 0 | rate: / | ||
I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time.
My impression was similar. Linux 5.05 did work as long as there was only a single Docking task running. But my boinc environment happened to choose to run three 5.05 tasks simultaneously. Shortly thereafter the system crashed (needed re-boot). That is the __only__ time this Opteron system has ever crashed on me. Because of this crash, what I did was to set Docking 'No new tasks'. And waited for a new version of the application to be released. (Turned out that by the time I got around to resuming Docking, Linux 5.07 was the current version. That's what I've been running -- it has not crashed on me, but it still invokes that !#%$! unproductive "system execution", which I believe increases the elapsed time to finish each Docking workunit.) . |
||
ID: 3254 | Rating: 0 | rate: / | ||
|
||
ID: 3256 | Rating: 0 | rate: / | ||
it has not crashed on me, but it still invokes that !#%$! unproductive "system execution", which I believe increases the elapsed time to finish each Docking workunit.) We're still looking into what is causing this behavior. But you're right; a lot of cycles seem to be eaten up by these gettimeofday calls. Thanks Andre ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3259 | Rating: 0 | rate: / | ||
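To put numbers on the gettimeofday suspicion, one can measure both the per-call cost of a timestamp call and the share of a process's CPU time that goes to the kernel. A hedged sketch, illustrative only; on newer kernels gettimeofday is often a cheap vDSO call, so results vary widely by system:

```python
import os
import time

def syscalls_per_second(n=200_000):
    """Rate of back-to-back timestamp calls; time.time() wraps
    gettimeofday/clock_gettime on Linux."""
    t0 = time.perf_counter()
    for _ in range(n):
        time.time()
    return n / (time.perf_counter() - t0)

def system_time_share():
    """Fraction of this process's CPU time spent in the kernel so far,
    the quantity gkrellm draws as the orange stripe."""
    t = os.times()
    total = t.user + t.system
    return t.system / total if total else 0.0
```

An app that calls a timestamp function in its innermost loop can spend a large slice of one CPU on that alone, which would match the "up to 40% system time" mikus observes.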
Do your other BOINC projects run with a "normal" CPU time? Yes - every other project's applications, without exception. So did Docking applications 5.04 and earlier. . |
||
ID: 3260 | Rating: 0 | rate: / | ||
Did you change anything on this computer between 9 April and 11 April? I noticed this evening that my linux box is reporting really short crunch times and the work units are validating ok. Yesterday early I stopped the computer because the chip set cooling fan and another fan had failed and I needed to replace them. I did it and started up the computer. Now the drives (3) in the system are mapped differently and I have not figured out why yet. They are all sata drives and connected to the same ports as before. Anyway, I have to mount most of the partitions manually and start boinc in a terminal window manually. I just use "./boinc" in the boinc directory to start it. That is the only difference that I can figure out. When I know more I will post more. Gene |
||
ID: 3269 | Rating: 0 | rate: / | ||
Turns out that not only does my system currently credit only about one second of execution to each Docking task (which actually takes around three hours to crunch), but it also is crediting only that one second of execution (for the roughly three hours of wall-clock time) to my __system's__ totals.
|
||
ID: 3293 | Rating: 0 | rate: / | ||
Thanks Mikus, that's probably a good idea. We are on top of this and will hopefully find out soon what is going on, but as said before, this problem cannot be reproduced on the Linux machines in our lab (incl. the opterons), and that makes it a bit hard to find a good solution...
I have now manually edited my system's <cpu_efficiency> value to allow normal downloading. And have marked Docking as 'no new work' until its current Linux application executable is upgraded. ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3300 | Rating: 0 | rate: / | ||
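For readers wondering why a bogus ~1-second CPU time would block downloads at all: the boinc client keeps a <cpu_efficiency> estimate, roughly a running average of CPU time over wall time, and scales its wall-time estimates for new work by it. The exact client formulas vary by version; this is only a sketch of the mechanism, with made-up weights:

```python
def updated_efficiency(old_eff, cpu_time, wall_time, weight=0.1):
    """Exponential moving average of per-result CPU efficiency
    (weight is illustrative, not the client's actual constant)."""
    return (1 - weight) * old_eff + weight * (cpu_time / wall_time)

def estimated_wall_time(estimated_cpu_time, efficiency):
    """Client's guess at the wall time a new task will need."""
    return estimated_cpu_time / max(efficiency, 1e-6)

# A result that really took ~3 hours but reported ~1 second of CPU
# time drags the average toward zero; wall-time estimates for new
# work then balloon, and the client stops asking for work.
eff = 0.9
for _ in range(20):
    eff = updated_efficiency(eff, cpu_time=1.0, wall_time=3 * 3600)
```

This is why manually editing <cpu_efficiency> back up, as mikus did, restores normal work fetch until the next bad result drags it down again.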
Hi!
|
||
ID: 3331 | Rating: 0 | rate: / | ||
Did you change anything on this computer between 9 April and 11 April? So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disks are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked whether, if I stop Docking from running as a daemon and run it manually from the Boinc directory, it will give those 0.xx second run times. I will do that if someone thinks it is worth the effort. Gene |
||
ID: 3340 | Rating: 0 | rate: / | ||
So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disk are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked to see if I stopped Docking from running as a deamon and run it manually from the Boinc directory if it will give those 0.xx second run times. I will do that if someone thinks it is worth the effort. Gene, please do the experiment to see if that could be a cause. Regarding the memory check, you should be able to disable it in the bios, so you never have to press escape again. That's what I do. Andre ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3342 | Rating: 0 | rate: / | ||
I managed to kill this WU. I'm not exactly sure what in the sequence of events did it...
|
||
ID: 3347 | Rating: 0 | rate: / | ||
Here are the docking entries in stdoutdae.txt from before the crash. I don't see anything of note, except it appears the WU crashed at the same time as I suspended the network.
|
||
ID: 3349 | Rating: 0 | rate: / | ||
@KSMarksPsych
|
||
ID: 3350 | Rating: 0 | rate: / | ||
So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disk are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked to see if I stopped Docking from running as a deamon and run it manually from the Boinc directory if it will give those 0.xx second run times. I will do that if someone thinks it is worth the effort. Andre, I just started boinc as a user process in a terminal window like I was doing before when I was having trouble booting my machine correctly. I then opened boincmgr and looked at what was going on in the tasks tab. The "Progress" was being updated but the "CPU time" was not being updated. So I think that the run time will not be reported to the server when the work unit completes. I will let this work unit and several more run to completion just to be sure. When I started boinc as a user process I used the same command line arguments that I use when I start it as a daemon process. Gene |
||
ID: 3352 | Rating: 0 | rate: / | ||
Let me get my arms around this: so you are seeing that cpu time is not updated when boinc is started manually from a terminal window, but when started by the system, it does show cpu time correctly? If that's the case we should maybe post this on the boinc_projects mailing list and see if anybody there understands what's going on.
Andre, ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3354 | Rating: 0 | rate: / | ||
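One place this kind of zero-CPU-time symptom can originate is in how the supervising process accounts for its children's CPU time when it reaps them. A hedged sketch of that accounting path (hypothetical function name; this is not the boinc client's actual code):

```python
import resource
import subprocess

def run_and_measure(cmd):
    """Run cmd to completion and return the CPU seconds it consumed,
    as seen through the getrusage(RUSAGE_CHILDREN) totals that the
    kernel updates when a child is waited on."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)   # blocks until the child is reaped
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ((after.ru_utime + after.ru_stime)
            - (before.ru_utime + before.ru_stime))
```

If the science app's time never flows into a total like this (for example, because the process that actually crunched is not the one being waited on), the supervisor ends up reporting near-zero CPU time even though hours of real crunching happened, which would match Gene's 0.00/0.01-second results.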
I have some invalid results from my Mac (Intel) :
|
||
ID: 3355 | Rating: -1 | rate: / | ||
Let me get my arms around this: so you are seeing that cpu time is not updated when boinc is started manually from a terminal window, but when started by the system, it does show cpu time correctly? If that's the case we should maybe post this on the boinc_projects mailing list and see if anybody there understands what's going on. Andre, You are correct. I just checked. I have let Docking run ~24 hours since I started it manually in the terminal window. It completed 11 work units in that time. Three of them showed 0.01 seconds computation time and the rest showed 0.00 seconds computation time. The one in progress is about 21% completed and shows 0.00 seconds computation time. If there is anything else I can do to help let me know. Gene |
||
ID: 3356 | Rating: -1 | rate: / | ||
It looks like there is still a problem with the checkpointing method we use. Michela has found another issue that she is looking into. You'll notice that the invalid result was restarted several times, while the valid ones ran from start to finish without restarts.
I have some invalid results from my Mac (Intel) : ____________ D@H the greatest project in the world... a while from now! |
||
ID: 3357 | Rating: -1 | rate: / | ||