Invalid results reported thread [Use Here]


Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1534 - Posted 21 Nov 2006 4:59:21 UTC
Last modified: 31 Mar 2007 2:10:37 UTC

Hi all:)

Though it seems relatively late, it just occurred to me that having a common place where crunchers can report 'invalid' results would help the team find solutions to these problems by recognising common characteristics.

To start, I'll post one: http://docking.utep.edu/result.php?resultid=48107

Please post each invalid result with its URL.

Thanks,
suguruhirahara
____________

I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.

Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1553 - Posted 21 Nov 2006 15:13:16 UTC

Another one detected: http://docking.utep.edu/result.php?resultid=47298

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 1567 - Posted 21 Nov 2006 16:37:10 UTC - in response to Message ID 1534 .

The Celeron and the PIII returned the same answer and the Pentium D returned a different one, so the Pentium D result is deemed invalid. For this one we know what the problem is, but do not have a solution yet. Frantically working on it though :-)
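
For readers who want to see the majority rule in code, here is a minimal sketch. It is not the project's actual validator: it assumes each host's result can be reduced to one comparable value, the numbers are made up, and `validate_quorum` is a hypothetical helper name.

```python
from collections import Counter


def validate_quorum(results, min_quorum=2):
    """Mark each host's result valid if it matches the majority answer.

    `results` maps a host label to the value it returned. Hypothetical
    helper for illustration only; not the Docking@Home validator.
    """
    counts = Counter(results.values())
    answer, votes = counts.most_common(1)[0]
    if votes < min_quorum:
        # No answer reached the quorum, so nothing can be validated yet.
        return {host: None for host in results}
    return {host: value == answer for host, value in results.items()}


# Made-up values: two hosts agree, one differs.
verdict = validate_quorum({"Celeron": 1.25, "PIII": 1.25, "Pentium D": 1.5})
```

With these inputs the two agreeing hosts are marked valid and the odd one out is marked invalid, which matches the outcome described for this workunit.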

Andre

Hi all:)

Though it seems relatively late, I thought just now that having a common place where Linux crunchers can report 'invalid' results would help the team find solutions by recognising common characteristics.

At first I'll post one: http://docking.utep.edu/result.php?resultid=48107

Please post invalid results with its URL.

Thanks,
suguruhirahara


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 1568 - Posted 21 Nov 2006 16:38:34 UTC - in response to Message ID 1553 .

Same as the previous one. This time the two Celerons beat the Pentium D.

AK

Another one detected: http://docking.utep.edu/result.php?resultid=47298


____________
D@H the greatest project in the world... a while from now!
Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1569 - Posted 21 Nov 2006 16:40:03 UTC - in response to Message ID 1567 .

Ah, I got it:) I'd almost forgotten that point... shame on me... lol
Anyway, I'll continue tracking each of my results.

suguruhirahara

Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 2102 - Posted 15 Jan 2007 11:34:55 UTC

Please use this thread if invalid results are detected.
____________

I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2109 - Posted 15 Jan 2007 21:32:37 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=75455
http://docking.utep.edu/result.php?resultid=75269
http://docking.utep.edu/result.php?resultid=74480
http://docking.utep.edu/result.php?resultid=74241
http://docking.utep.edu/result.php?resultid=74137
http://docking.utep.edu/result.php?resultid=74094

There's more for this computer, but I'm going to look at the other machines.
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2110 - Posted 15 Jan 2007 21:48:17 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=75927
http://docking.utep.edu/result.php?resultid=75920
http://docking.utep.edu/result.php?resultid=75911
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2111 - Posted 15 Jan 2007 21:50:23 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=76753
http://docking.utep.edu/result.php?resultid=75768
http://docking.utep.edu/result.php?resultid=73550
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2112 - Posted 15 Jan 2007 21:53:44 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=76832
http://docking.utep.edu/result.php?resultid=76267
http://docking.utep.edu/result.php?resultid=76252
http://docking.utep.edu/result.php?resultid=75805
http://docking.utep.edu/result.php?resultid=75803
http://docking.utep.edu/result.php?resultid=75790
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2113 - Posted 15 Jan 2007 21:55:24 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=76271
http://docking.utep.edu/result.php?resultid=75788
http://docking.utep.edu/result.php?resultid=74809
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2114 - Posted 15 Jan 2007 22:00:14 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=76256
http://docking.utep.edu/result.php?resultid=75779
http://docking.utep.edu/result.php?resultid=75765
http://docking.utep.edu/result.php?resultid=75764
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2115 - Posted 15 Jan 2007 22:01:30 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=75771
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2116 - Posted 15 Jan 2007 22:02:51 UTC - in response to Message ID 2102 .

Please use this thread if invalid results are detected.


http://docking.utep.edu/result.php?resultid=74031
Tom Philippart
Volunteer tester
Avatar

Joined: Dec 22 06
Posts: 17
ID: 340
Credit: 44,929
RAC: 0
Message 2117 - Posted 15 Jan 2007 22:10:13 UTC

http://docking.utep.edu/result.php?resultid=76265
____________

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2119 - Posted 15 Jan 2007 23:06:23 UTC - in response to Message ID 2117 .

This one is due to the app version difference between 5.03 and 5.04.

http://docking.utep.edu/result.php?resultid=76265


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2120 - Posted 15 Jan 2007 23:07:50 UTC - in response to Message ID 2116 .

This one is due to the app version difference between 5.03 and 5.04.

Please use this thread if invalid results are detected.

http://docking.utep.edu/result.php?resultid=74031


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2121 - Posted 15 Jan 2007 23:25:20 UTC

I haven't checked all of these results, but it seems that most of the invalid statuses can be attributed to the app version difference on Windows. Please check this first before reporting any invalid results. The app version can be found at the bottom of a result page.

Thanks
Andre
____________
D@H the greatest project in the world... a while from now!

(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2142 - Posted 17 Jan 2007 0:25:46 UTC
Last modified: 17 Jan 2007 0:26:58 UTC

Here is the first result of the computer ID 745 ("AuthenticAMD x86 Family 6 Model 4 Stepping 2 1333MHz") validated against two Pentium III computers and the result is invalid! All three computers have finished the results with charmm 5.04.

#1 Result ID 72853 ;
GenuineIntel x86 Family 6 Model 8 Stepping 6 1002MHz;
Microsoft Windows 2000 Standard Server Edition, Service Pack 4, (05.00.2195.00);
VALID

#2 Result ID 72854 ;
AuthenticAMD x86 Family 6 Model 4 Stepping 2 1333MHz;
Microsoft Windows NT Workstation Edition, Service Pack 6a, (04.00.1381.00);
INVALID

#3 Result ID 72853 ;
GenuineIntel x86 Family 6 Model 7 Stepping 3 547MHz;
Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00);
VALID

Earlier results show that host ID 745 has validated OK against some AMD processors in the past.

Regards

Alex

My results during the HR tests

Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 2226 - Posted 19 Jan 2007 13:07:46 UTC

Just an extra note to the Docking team:
I am still getting a number of zero-credit work units. The first one below is separate from the others, as all hosts ran 5.04. The remaining ones were caused by an older 5.03 WU being reissued as 5.04 after the host it was originally sent to did not reply.

For the first one, all 3 results were successful and all were 5.04, but I got zero credit and the other two got credit.

http://docking.utep.edu/result.php?result=78186

Result 70620 was sent on 5/1 to host 315 as 5.03, which gave no reply; the other two hosts, sent the same WU at the same time, completed it successfully as 5.03. The WU was resent as result 77680 on 15/1 to my host 634 as 5.04, and I then got no credit.

http://docking.utep.edu/result.php?result=77680

Result 71222 was sent to host 1214 on 6/1 as 5.03 and got no reply; it was resent to me on 16/1 as 5.04, so I got no credit when the other two 5.03 WUs validated.
Result 69450 and result 69240 were sent to host 86 on 3/1 as 5.03, along with two others that completed and validated; they were resent to me as 5.04 WUs on 13/1, again with no credit.

http://docking.utep.edu/result.php?result=78290
http://docking.utep.edu/result.php?result=75817
http://docking.utep.edu/result.php?result=75812

So it appears that we will still get these zero-credit jobs while 5.03 work is pending: when the no-reply WUs are resent, they go out as the latest version, 5.04, and so fail to get credit against the older 5.03 results still pending.
This won't stop until all the work sent up to the 10th (when the new app version arrived) has finished or timed out, which should be about next week, I think.

____________

Profile Kinguni
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 6
ID: 39
Credit: 1,272
RAC: 0
Message 2227 - Posted 19 Jan 2007 14:09:19 UTC - in response to Message ID 2226 .
Last modified: 19 Jan 2007 14:09:45 UTC

I found one: two results validated with 5.03, and mine was marked invalid with 5.04.
http://docking.utep.edu/workunit.php?wuid=20157
____________
Join Team Starfire
BOINC Chat

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2231 - Posted 19 Jan 2007 20:04:51 UTC - in response to Message ID 2226 .

Thanks Conan,

This seems like one of those results that should not happen again. Your result (RMSD 0.649232) was indeed a little bit different from the other two (RMSD 0.651186). As you can see they are very close, but not the same. We'll look into this further and let you know.
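
To make "very close, but not the same" concrete, here is a small sketch using the two RMSD values quoted above. The tolerance is a made-up number chosen only for illustration; the thread does not say how the validator compares results or whether a fuzzy comparison would be scientifically acceptable.

```python
import math

rmsd_conan = 0.649232   # value quoted for Conan's result
rmsd_others = 0.651186  # value quoted for the other two results

# An exact comparison treats the two results as different.
exact_match = rmsd_conan == rmsd_others
difference = abs(rmsd_conan - rmsd_others)

# Hypothetical tolerance, not a project setting.
TOLERANCE = 0.01
fuzzy_match = math.isclose(rmsd_conan, rmsd_others, rel_tol=0.0, abs_tol=TOLERANCE)
```

The difference is roughly 0.002, so an exact comparison rejects the result while a comparison with the assumed tolerance would accept it.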

Thanks
Andre

PS There's not much we can do right now about the 5.03/5.04 issue except wait until all the 5.03 results finish or time out, as you mention.


____________
D@H the greatest project in the world... a while from now!
(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2232 - Posted 19 Jan 2007 23:16:02 UTC - in response to Message ID 2226 .
Last modified: 20 Jan 2007 0:09:24 UTC


So it appears that we will still get these Zero credit jobs with the 5.03 pending work when the no reply WU's are resent out but as we now have a new application version they are sent as the latest version 5.04 and so fail to get credit against the older 5.03 wu's still pending.


Yep, but if no one picks up the missing third results and finishes them, even with 5.04, then usually the two other guys won't get credit for this workunit...

Btw, this one will be the other way round. I delivered result #1 on the 7th of January with version 5.03, and then #2 and #3 were sent on the 15th and the 19th of January.

I can't think of an easy solution, but there has to be one before this project goes into production. I don't think people will like the idea of losing credits on every update of the science application. At the moment I certainly don't mind; after all, it's Alpha!

Regards

Alex

My results during the HR tests
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2234 - Posted 20 Jan 2007 1:14:31 UTC - in response to Message ID 2232 .

David Anderson suggested some workarounds in this post . The second one seems like a fairly easy workaround to implement. He also has it on his to-do list to build a solution into BOINC for such cases.

Thanks
Andre



____________
D@H the greatest project in the world... a while from now!
(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2235 - Posted 20 Jan 2007 1:40:47 UTC - in response to Message ID 2234 .
Last modified: 20 Jan 2007 1:41:06 UTC

The second one seems like a fairly easy to implement workaround.


Ok, provided this is acceptable to the science of the project.

But wouldn't it have the same effect to decrease the quorum to 1, instead of 3 or 2, during the transition? With the additional benefit of doing more work with the same computational power? Maybe I'm missing something here.

Well, you could also switch off workunit creation prior to introduction of a new version and then wait for the workunits to run dry...

Regards

Alex
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2240 - Posted 20 Jan 2007 4:11:07 UTC - in response to Message ID 2235 .

Some good tips there...

Anyway, the result changes will only happen sporadically I am sure; the screensaver app release will hopefully not change any result :-)

Thanks Alex!
Andre


____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2243 - Posted 20 Jan 2007 4:17:47 UTC - in response to Message ID 2234 .


If WUs can be parsed by the type of CPU (or arbitrary class), why can't WUs be parsed by the client version?
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2245 - Posted 20 Jan 2007 4:28:40 UTC - in response to Message ID 2243 .

I think that's what David A. means with his first workaround, because boinc currently doesn't have the capability to parse workunits based on app version; instead this could be done in the validator. Eventually code changes will be necessary on the server-side (scheduling should take app version into account) and client-side (multiple app versions should be kept on disk for a while) for the real solution.
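
One way to read "this could be done in the validator" is sketched below. This is an assumption about the idea, not David A.'s actual workaround and not BOINC code: results are compared only against others produced by the same app version, and a result with no same-version partners stays undecided instead of being marked invalid. The function name, tuple layout, IDs and values are all hypothetical.

```python
from collections import Counter, defaultdict


def validate_by_app_version(results, min_quorum=2):
    """Compare results only against others from the same app version.

    `results` is a list of (result_id, app_version, value) tuples.
    Returns {result_id: True | False | None}, where None means
    "not enough agreeing same-version results yet".
    """
    groups = defaultdict(list)
    for result_id, app_version, value in results:
        groups[app_version].append((result_id, value))

    verdicts = {}
    for members in groups.values():
        counts = Counter(value for _, value in members)
        answer, votes = counts.most_common(1)[0]
        for result_id, value in members:
            if votes < min_quorum:
                # Leave undecided rather than marking it invalid.
                verdicts[result_id] = None
            else:
                verdicts[result_id] = value == answer
    return verdicts


# Made-up workunit: two 5.03 results agree, a reissued 5.04 result differs.
verdicts = validate_by_app_version([
    (1, "5.03", 0.651),
    (2, "5.03", 0.651),
    (3, "5.04", 0.649),
])
```

In this sketch the reissued 5.04 result is left pending rather than failing against the 5.03 pair, which is the zero-credit case discussed earlier in the thread.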

Cheers
Andre

If WUs can be parsed by the type of CPU (or arbitrary class), why can't WUs be parsed by the client version?


____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2268 - Posted 20 Jan 2007 16:18:41 UTC - in response to Message ID 2240 .


I hope you don't inadvertently follow in Rosetta@Home's footsteps and create a graphics compatibility disaster.
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 2338 - Posted 31 Jan 2007 5:19:04 UTC

Just tried the latest and greatest Linux boinc client (5.8.7). Four Docking@home workunits in a row terminated abnormally (code 0x1). Might be some sort of incompatibility between what is being done to the boinc client, and the Charmm 5.02 application. [Did not experience Docking@home problems when running with previous boinc clients (5.8.6 or earlier).]

Of the failing workunits, this one uploaded a big dump (or something):
http://docking.utep.edu/result.php?resultid=87097

--------
p.s. The result sent to the server identifies <core_client_version> as 5.8.6. That's the version I used to __transmit__ the report back to the server, but NOT the version (5.8.7) in use at the time the workunit actually terminated.
.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2339 - Posted 31 Jan 2007 15:46:01 UTC - in response to Message ID 2338 .

Interesting... that seems to be the same error we have seen (and still see) on machines that have their stack limit set too low. Just to make sure: is your stack limit set to unlimited? (Check with 'ulimit -s'.) The strange thing is that this usually happens after 10 to 30 seconds or so, and yours seems to have crunched quite a bit longer. I don't quite understand how the BOINC client can influence the app this way, but maybe this is a new case.
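
The same check can be made from a script. A minimal sketch using Python's standard `resource` module (Unix only); `describe_stack_limit` is a hypothetical helper name, and it reports the limit of the process it runs in, which is not necessarily the limit the science app sees.

```python
import resource


def describe_stack_limit():
    """Return the soft stack limit as a string, like `ulimit -s` prints."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    # `ulimit -s` reports kilobytes; getrlimit reports bytes.
    return str(soft // 1024)


stack_limit = describe_stack_limit()
```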

Have other people with the latest boinc clients seen this?

Thanks
Andre


____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2340 - Posted 31 Jan 2007 16:50:14 UTC - in response to Message ID 2339 .


I'll switch a dedicated D@H Windows box to 5.8.8 as soon as it runs dry and see how that goes.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2341 - Posted 31 Jan 2007 17:32:52 UTC

OK. Just tried 5.8.8 (latest as of now) and all my results errored out with error 1. The error looks like the stack limit problem, but must be something else as my SuSE box has stack limit set to unlimited. Thus, I can reproduce the problem and am working with David A. to see if we can find a resolution.

Currently running the older 5.4.x client on the same box to see what it will do.

Andre
____________
D@H the greatest project in the world... a while from now!

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2342 - Posted 31 Jan 2007 17:44:07 UTC - in response to Message ID 2341 .


Ran one Windows box dry and started 5.8.8. I only caught one WU for one CPU, but it has gone 10% so far, no problem yet.

Switched another Windows box to 5.8.8 that had three WUs on it. Two are running and have moved up another 10% or so with no apparent problem.

I'll go switch a Linux box to see.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2343 - Posted 31 Jan 2007 17:52:32 UTC - in response to Message ID 2342 .

It might be Linux only. I have just run two workunits on client 5.4.9 and these error out as well... This doesn't make sense though: neither the workunit nor the app version has changed in weeks. Will research further.

Andre


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2344 - Posted 31 Jan 2007 18:49:46 UTC - in response to Message ID 2342 .

I'll go switch a Linux box to see.


Let us know what you find out. I'm still not convinced it is either us or the new BOINC client. More data might help us find the cause.

AK
____________
D@H the greatest project in the world... a while from now!
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 2345 - Posted 31 Jan 2007 19:01:51 UTC - in response to Message ID 2339 .

Just to make sure: is your stack limit set to unlimited? (check with 'ulimit -s').

yes, it is.
The strange things is that that usually happens after 10 to 30 secs or so and yours seem to have crunched quite a bit longer.

The result I gave as a reference started computing with the 5.8.6 client, then failed after a couple of minutes of running under 5.8.7. That's why it has accumulated more time.

The other three results that failed under 5.8.7 all started ok, ran for about six minutes, then exited with 0x1.

Here is a more recent failing result, this time on a single-processor Ubuntu 6.10 system (with stack set to unlimited), using the boinc 5.8.8 client: http://docking.utep.edu/result.php?resultid=87703
.
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2346 - Posted 31 Jan 2007 19:39:01 UTC - in response to Message ID 2344 .

I'll go switch a Linux box to see.


Let us know what you find out. I'm still not convinced it is us nor the new boinc client. More data might help us find the cause.

AK


Picked a "dry" Linux box and converted it to 5.8.8... but it's not getting WUs.

I'll convert a Linux box with WUs already being processed.
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2347 - Posted 31 Jan 2007 19:56:39 UTC - in response to Message ID 2344 .

I'll go switch a Linux box to see.


Let us know what you find out. I'm still not convinced it is us nor the new boinc client. More data might help us find the cause.

AK


I'm sure you noticed, but installing 5.8.8 overwrites the run_client and run_manager, so the ulimit -s unlimited could be missing......mine was.
Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2348 - Posted 31 Jan 2007 20:23:40 UTC
Last modified: 31 Jan 2007 20:46:56 UTC

I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch.

EDIT: Here is a link to a patch being checked in that changes rlimit for RLIMIT_STACK. This was from January 30th, but it looks like they have been changing it earlier in the month as well. I'm not sure when they started messing with it.

EDIT: I just emailed the info to Andre so he won't waste time looking at the problem without knowing about this.

____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2349 - Posted 31 Jan 2007 20:55:51 UTC - in response to Message ID 2348 .

I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch.


This would definitely explain the crashing behavior on my SuSE linux box that never did this before. Going from unlimited stack to 500MB would crash the simulation after 5 to 10 minutes...

EDIT: Here is a link url of a patch being checked in that changes rlimit for RLIMIT_STACK. This was from January 30th, but it looks like they have been changing it earlier in the month as well. I'm not sure when they started messing with it.


Thanks for the pointers. I've asked Dave A. about which patches were in which boinc client versions.

EDIT: I just emailed the info to Andre so he won't waste time looking at the problem without knowing about this.


Thanks!

Andre
____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2350 - Posted 31 Jan 2007 21:20:45 UTC - in response to Message ID 2349 .

I looked at the BOINC developers CVS patch check-in mailing list archive yesterday and noticed that some code had been checked in recently where the boinc core client tries to set the stack limit to at least 500 MB. That was before I saw the posts about the problems with new boinc client versions so I'll have to go back and find the exact patch.


This would definitely explain the crashing behavior on my SuSE linux box that never did this before. Going from unlimited stack to 500MB would crash the simulation after 5 to 10 minutes...

EDIT: Here is a link url of a patch being checked in that changes rlimit for RLIMIT_STACK. This was from January 30th, but it looks like they have been changing it earlier in the month as well. I'm not sure when they started messing with it.


Thanks for the pointers. I've asked Dave A. about which patches were in which boinc client versions.

EDIT: I just emailed the info to Andre so he won't waste time looking at the problem without knowing about this.


Thanks!

Andre


Could it be that the change in the patch also overrides the ulimit directive we put in the run files?
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2353 - Posted 1 Feb 2007 0:15:28 UTC - in response to Message ID 2350 .


Could it be that the change in the patch also overrides the ulimit directive we put in the run files?


Yes, unfortunately it does, because the limit is lowered by the boinc client which is started from the script. This has already been corrected in the boinc client code though, I just don't know in which version the fix of the patch will end up.
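For anyone wondering why the ulimit line in the run script stops mattering, here is a minimal sketch in Python (it is not BOINC code). It assumes a Unix-like system, and `lower_only` and `child_stack_limit` are hypothetical names made up for illustration. The step that lowers the limit stands in for the client, and the child process stands in for the science application, which simply inherits whatever limit was set last.

```python
import resource
import subprocess
import sys


def lower_only(current, cap):
    """Return the smaller of current and cap, treating RLIM_INFINITY as 'no limit'."""
    if current == resource.RLIM_INFINITY:
        return cap
    return min(current, cap)


def child_stack_limit(cap_bytes):
    """Start a child after lowering the soft stack limit; return the limit the child sees."""

    def lower_limit():
        # Runs in the forked child just before exec, like a launcher that
        # changes the limit before starting the real application.
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (lower_only(soft, cap_bytes), hard))

    # The child only reports the soft stack limit it inherited.
    code = "import resource; print(resource.getrlimit(resource.RLIMIT_STACK)[0])"
    result = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=lower_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    seen = child_stack_limit(4 * 1024 * 1024)
    print(f"child sees a soft stack limit of {seen} bytes")
```

Whatever the launching shell set with ulimit, the child only ever sees the value that was set last before it started.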

Andre
____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2369 - Posted 1 Feb 2007 9:56:14 UTC - in response to Message ID 2353 .


Could it be that the change in the patch also overrides the ulimit directive we put in the run files?


Yes, unfortunately it does, because the limit is lowered by the boinc client which is started from the script. This has already been corrected in the boinc client code though, I just don't know in which version the fix of the patch will end up.

Andre


Is there anything you want me to try with a couple of Linux boxes, before I go back to the recommended 5.4.11?
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2373 - Posted 1 Feb 2007 20:31:39 UTC - in response to Message ID 2369 .

Saw on the boinc-dev list (and Rom told me) that they will probably release 5.8.9 to correct the problem. Let's wait until that release to re-test.

Thanks
Andre


Could it be that the change in the patch also overrides the ulimit directive we put in the run files?


Yes, unfortunately it does, because the limit is lowered by the boinc client which is started from the script. This has already been corrected in the boinc client code though, I just don't know in which version the fix of the patch will end up.

Andre


Is there anything you want me to try with a couple of Linux boxes, before I go back to the recommended 5.4.11?


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2377 - Posted 1 Feb 2007 20:46:41 UTC

The new client has been officially released. Message from Rom:

------------------------
BOINC 5.8 is now ready for public release.

Some of the highlights for this release include:
* Simple GUI (Compliments of World Community Grid)
* Basic CPU identification
* CPU throttling
* New CPU scheduler
* New work fetch policy
* Improved memory management

We have one outstanding issue with the Linux client at this time. As
soon as we have resolved it we'll release it as well.

We would like to thank everyone for their hard work in making this a
solid release.
--------------------------

The issue with the linux client is the stack limit fix...

Andre
____________
D@H the greatest project in the world... a while from now!

mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 2451 - Posted 10 Feb 2007 19:24:16 UTC - in response to Message ID 2373 .

Saw on the boinc-dev list (and Rom told me) that they will probably release 5.8.9 to correct the problem. Let's wait until that release to re-test.

The latest Linux boinc client fixes the problem that client releases 5.8.7 and 5.8.8 had in Linux.
.
Michael Stoeter
Volunteer tester

Joined: Oct 2 06
Posts: 4
ID: 166
Credit: 854,550
RAC: 0
Message 2702 - Posted 9 Mar 2007 21:13:35 UTC

Please use this thread if invalid results are detected.

http://docking.utep.edu/result.php?resultid=108179
http://docking.utep.edu/result.php?resultid=108169
http://docking.utep.edu/result.php?resultid=108118
http://docking.utep.edu/result.php?resultid=107803
http://docking.utep.edu/result.php?resultid=107676
http://docking.utep.edu/result.php?resultid=107673
http://docking.utep.edu/result.php?resultid=107546
http://docking.utep.edu/result.php?resultid=107539
http://docking.utep.edu/result.php?resultid=107527
http://docking.utep.edu/result.php?resultid=107522
http://docking.utep.edu/result.php?resultid=93278
http://docking.utep.edu/result.php?resultid=93271

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2710 - Posted 11 Mar 2007 16:50:06 UTC - in response to Message ID 2702 .

Hi Michael,

I've checked out the first invalid result (and suspect the others are invalid because of the same reasons). If you check out the other results in the workunit, you notice that the others crunched their results without interruption (you see only one 'starting charmm run...' in stderr), but you (or your computer) interrupted it once (there are two 'starting charmm run...' in stderr). The most likely result of this is that charmm was interrupted in a checkpoint section and thus restarted with the wrong initial seeds.
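If you want to run the same check on your own results, a rough sketch in Python could look like the following. It only counts how often the 'starting charmm run...' line shows up in the stderr text copied from a result page; the function names are made up for this example, and the exact marker text is assumed from the description above.

```python
def count_charmm_starts(stderr_text, marker="starting charmm run"):
    """Count the stderr lines that contain the start marker (case-insensitive)."""
    return sum(1 for line in stderr_text.splitlines() if marker in line.lower())


def was_interrupted(stderr_text):
    """More than one start marker means the run was restarted at least once."""
    return count_charmm_starts(stderr_text) > 1


if __name__ == "__main__":
    sample = "starting charmm run...\nsome output\nstarting charmm run...\n"
    print(count_charmm_starts(sample), was_interrupted(sample))
```

A result with a single marker ran straight through; two or more markers mean it was restarted from a checkpoint at least once.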

We think we have found a solution for this issue by wrapping all the code that should not be interrupted into a critical section that will not be interrupted by the boinc client. If testing goes okay, this will be deployed for all platforms in the coming week.

A big thanks to you and all the other alpha volunteers for helping us find this problem!

Thanks
Andre

Please use this thread if invalid results are detected.

http://docking.utep.edu/result.php?resultid=108179
http://docking.utep.edu/result.php?resultid=108169
http://docking.utep.edu/result.php?resultid=108118
http://docking.utep.edu/result.php?resultid=107803
http://docking.utep.edu/result.php?resultid=107676
http://docking.utep.edu/result.php?resultid=107673
http://docking.utep.edu/result.php?resultid=107546
http://docking.utep.edu/result.php?resultid=107539
http://docking.utep.edu/result.php?resultid=107527
http://docking.utep.edu/result.php?resultid=107522
http://docking.utep.edu/result.php?resultid=93278
http://docking.utep.edu/result.php?resultid=93271


____________
D@H the greatest project in the world... a while from now!
Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 2719 - Posted 17 Mar 2007 13:18:22 UTC

Zero credit for this result which looks ok to me at least as much as I can see.
All 3 computers are AMD, all WU's are 5.04 and all use Linux. I got Zero and the other 2 got 49.50.

http://docking.utep.edu/result.php?resultid=120050

Thanks.
____________

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2720 - Posted 17 Mar 2007 18:09:37 UTC - in response to Message ID 2719 .

Hi Conan,

This happened for the same reason as in the previous post. You will notice that your computation was interrupted by the boinc client and the others were not; the interruption probably hit the middle of a checkpoint, which caused a divergence. We are currently testing an implementation of an atomic checkpointing method which we will release next week if all the tests go well.
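To illustrate why a bad restart makes a result differ from the others, here is a toy sketch in Python. It has nothing to do with the real charmm code: a list of random numbers stands in for the simulation, and `uninterrupted` and `resumed` are hypothetical names. A run that restores the saved generator state continues exactly where it stopped, while a run that falls back to the initial seed replays old numbers and ends up with a different answer.

```python
import random


def run(steps, rng):
    """Stand-in for the simulation: draw one random number per step."""
    return [rng.random() for _ in range(steps)]


def uninterrupted(seed, steps):
    """Reference run with no interruption."""
    return run(steps, random.Random(seed))


def resumed(seed, steps, stop_at, restore_state):
    """Run stop_at steps, 'restart', then finish the remaining steps."""
    rng = random.Random(seed)
    first_part = run(stop_at, rng)
    saved_state = rng.getstate()  # what a good checkpoint would contain

    # A restarted process begins from the initial seed again.
    rng_after_restart = random.Random(seed)
    if restore_state:
        rng_after_restart.setstate(saved_state)
    return first_part + run(steps - stop_at, rng_after_restart)


if __name__ == "__main__":
    reference = uninterrupted(42, 10)
    print(resumed(42, 10, 4, restore_state=True) == reference)
    print(resumed(42, 10, 4, restore_state=False) == reference)
```

Only the run that restores the saved state matches the reference, which is the kind of mismatch the validator then flags as invalid.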

Thanks
Andre

Zero credit for this result which looks ok to me at least as much as I can see.
All 3 computers are AMD, all WU's are 5.04 and all use Linux. I got Zero and the other 2 got 49.50.

http://docking.utep.edu/result.php?resultid=120050

Thanks.


____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2857 - Posted 31 Mar 2007 1:53:38 UTC

http://docking.utep.edu/result.php?resultid=128455

This one got stuck at 5.97% for three hours. I finally aborted it.
____________

Profile KSMarksPsych
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 26
ID: 87
Credit: 8,222
RAC: 0
Message 2858 - Posted 31 Mar 2007 2:31:01 UTC

Looks like my one invalid is probably the problem referenced above with restarting a work unit...

http://docking.utep.edu/result.php?resultid=131429


It's the only one I have like this. Would it help to bump up the switch time on that machine to like 4 hours (since units are taking about 3.5 hours to complete)?


____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2859 - Posted 31 Mar 2007 2:31:40 UTC - in response to Message ID 2857 .

Is this the first one where that happened? How many others went okay? Let us know if it happens again.

AK

http://docking.utep.edu/result.php?resultid=128455

This one got stuck at 5.97% for three hours. I finally aborted it.


____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2862 - Posted 31 Mar 2007 3:12:20 UTC - in response to Message ID 2859 .
Last modified: 31 Mar 2007 3:36:10 UTC

Is this the first one where that happened? How many others went okay? Let us know if it happens again.

AK

http://docking.utep.edu/result.php?resultid=128455

This one got stuck at 5.97% for three hours. I finally aborted it.



So far it is the only one that has done this. I have completed a lot with no problem. I will let you know if I have any more issues. (Impressively quick reply I must say.)

One thing I did notice was that the time to completion had stopped counting down and was increasing. When I aborted it, it was at over 6 hours to completion and climbing. The CPU time was still rising. Before I aborted it, I had paused it and then resumed it. The CPU time reset, as did the amount completed (I can't remember if the time to completion did or not). Anyway, after I resumed it the response was the same: stuck at the % complete, with both the time to completion and the CPU time rising. I do not believe I did anything on my pc to interrupt it.
____________
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2864 - Posted 31 Mar 2007 4:32:32 UTC - in response to Message ID 2858 .

It might help with cases where Charmm exits with code 0, but in this case, there was an error: incorrect function; that one is still kind of a mystery to us and we are investigating it. It seems to be a fortran related error as the cpdn guys have seen this before as well.

If everything goes as planned we will release a charmm with atomic checkpointing for all platforms next week, which will help solve many of the invalid cases; or at least we think so...
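For readers curious what atomic checkpointing can mean in practice, one common approach is to write the new checkpoint to a temporary file and then rename it over the old one in a single step. The Python sketch below shows that general technique only; it is an assumption for illustration and not necessarily how the charmm build does it. The function names and the JSON format are made up for the example.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state to path so a reader sees either the old or the new file, never half of one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before the rename
        os.replace(tmp_path, path)  # atomic replacement of the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Load the most recent complete checkpoint."""
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "restart.json")
        write_checkpoint(target, {"step": 10, "seed": 123})
        print(read_checkpoint(target))
```

If the process is killed before the rename, the previous checkpoint is still intact, so a restart never picks up a half-written file.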

Andre

Looks like my one invalid is probably the problem referenced above with restarting a work unit...

http://docking.utep.edu/result.php?resultid=131429


It's the only one I have like this. Would it help to bump up the switch time on that machine to like 4 hours (since units are taking about 3.5 hours to complete)?



____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2865 - Posted 31 Mar 2007 4:35:12 UTC - in response to Message ID 2862 .

We have seen such cases in the lab before, but very sporadically. Trilce and Roger found out that charmm is actually stuck in a loop at that point due to a certain condition of the protein, ligand and random seed. This might be the first time we see it happen on the grid. We'll investigate using the input file your host was sent.

Thanks
Andre

One thing I did notice was that the time to completion had stopped counting down and was increasing. When I aborted it, it was at over 6 hours to completion and climbing. The CPU time was still rising. Before I aborted it, I had paused it and then resumed it. The CPU time reset, as did the amount completed (I can't remember if the time to completion did or not). Anyway, after I resumed it the response was the same: stuck at the % complete, with both the time to completion and the CPU time rising. I do not believe I did anything on my pc to interrupt it.


____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2866 - Posted 31 Mar 2007 5:27:18 UTC - in response to Message ID 2865 .

We have seen such cases in the lab before, but very sporadically. Trilce and Roger found out that charmm is actually stuck in a loop at that point due to a certain condition of the protein, ligand and random seed. This might be the first time we see it happen on the grid. We'll investigate using the input file your host was sent.

Thanks
Andre

One thing I did notice was that the time to completion had stopped counting down and was increasing. When I aborted it, it was at over 6 hours to completion and climbing. The CPU time was still rising. Before I aborted it, I had paused it and then resumed it. The CPU time reset, as did the amount completed (I can't remember if the time to completion did or not). Anyway, after I resumed it the response was the same: stuck at the % complete, with both the time to completion and the CPU time rising. I do not believe I did anything on my pc to interrupt it.


Please let me know, I am curious. This was actually the first WU in a long time (if ever) that I had go bad.
____________
Profile KSMarksPsych
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 26
ID: 87
Credit: 8,222
RAC: 0
Message 2869 - Posted 31 Mar 2007 12:10:59 UTC - in response to Message ID 2864 .

It might help with cases where Charmm exits with code 0, but in this case, there was an error: incorrect function; that one is still kind of a mystery to us and we are investigating it. It seems to be a fortran related error as the cpdn guys have seen this before as well.

If everything goes as planned we will release a charmm with atomic checkpointing for all platforms next week, which will help solve many of the invalid cases; or at least we think so...

Andre

Looks like my one invalid is probably the problem referenced above with restarting a work unit...

http://docking.utep.edu/result.php?resultid=131429


It's the only one I have like this. Would it help to bump up the switch time on that machine to like 4 hours (since units are taking about 3.5 hours to complete)?





Anything you want me to do Andre? It was the only WU like this. And everything else, including CPDN/CPDN Beta are running happily on that machine.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2873 - Posted 31 Mar 2007 14:39:54 UTC - in response to Message ID 2869 .

Anything you want me to do Andre? It was the only WU like this. And everything else, including CPDN/CPDN Beta are running happily on that machine.


Not much that you can do at the moment; please report if you get others like this!

Thanks
Andre

PS We haven't been able to reproduce this incorrect function error in the lab, so it will be a tough one to solve. If you see something on other project forums about a possible cause (or solution), please let us know.
____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2886 - Posted 1 Apr 2007 0:37:45 UTC

http://docking.utep.edu/result.php?resultid=128497

I had another one that froze. This time it got to 90.1% done. The running time had been over 5 hours. I am beginning to wonder if my pc is doing something weird, or whether that WU and my previous one that froze were related.
____________

Profile Frank Boerner
Volunteer tester

Joined: Sep 13 06
Posts: 18
ID: 101
Credit: 744,548
RAC: 0
Message 2888 - Posted 1 Apr 2007 6:45:22 UTC

http://docking.utep.edu/result.php?resultid=135992
I have downloaded 13 Workunits on an AMD Duron 800 with Windows 98.

All 13 WUs failed for the same reason.

Boinc Client is 5.8.16

Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 2889 - Posted 1 Apr 2007 7:02:36 UTC - in response to Message ID 2888 .

http://docking.utep.edu/result.php?resultid=135992
I have downloaded 13 Workunits on an AMD Duron 800 with Windows 98.

All 13 WUs failed for the same reason.

Boinc Client is 5.8.16

Is Charmm capable of being run on Win9x, in the first place? > Andre

____________

I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.
Nightbird
Volunteer tester

Joined: Oct 2 06
Posts: 35
ID: 129
Credit: 11,804
RAC: 0
Message 2890 - Posted 1 Apr 2007 10:18:26 UTC - in response to Message ID 2888 .

http://docking.utep.edu/result.php?resultid=135992
I have downloaded 13 Workunits on an AMD Duron 800 with Windows 98.

All 13 WUs failed for the same reason.

Boinc Client is 5.8.16

Win9x is not supported.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2904 - Posted 2 Apr 2007 4:14:03 UTC - in response to Message ID 2886 .

We are looking into this, but it really shouldn't happen twice in a row, so to say... can you find out if there is another cause to do with your pc?

Andre

http://docking.utep.edu/result.php?resultid=128497

I had another one that froze. This time it got to 90.1% done. The running time had been over 5 hours. I am beginning to wonder if my pc is doing something weird, or whether that WU and my previous one that froze were related.


____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2905 - Posted 2 Apr 2007 4:16:23 UTC - in response to Message ID 2888 .

Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though.

Andre

http://docking.utep.edu/result.php?resultid=135992
I have downloaded 13 Workunits on an AMD Duron 800 with Windows 98.

All 13 WUs failed for the same reason.

Boinc Client is 5.8.16


____________
D@H the greatest project in the world... a while from now!
Profile suguruhirahara
Forum moderator
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 2909 - Posted 2 Apr 2007 8:32:37 UTC - in response to Message ID 2905 .

Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though.

Andre

http://docking.utep.edu/result.php?resultid=135992
I have downloaded 13 Workunits on an AMD Duron 800 with Windows 98.

All 13 WUs failed for the same reason.

Boinc Client is 5.8.16


Doesn't it work to delete Win9x from the platform list on your server?

____________

I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2913 - Posted 2 Apr 2007 14:06:42 UTC - in response to Message ID 2909 .

Doesn't it work to delete Win9x from the platform list on your server?


That platform doesn't exist on boinc :-(

AK
____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2914 - Posted 2 Apr 2007 14:13:51 UTC - in response to Message ID 2904 .

We are looking into this, but it really shouldn't happen twice in a row, so to say... can you find out if there is another cause to do with your pc?

Andre

http://docking.utep.edu/result.php?resultid=128497

I had another one that froze. This time it got to 90.1% done. The running time had been over 5 hours. I am beginning to wonder if my pc is doing something weird, or whether that WU and my previous one that froze were related.



I suspect then my younger son did something that caused it to freeze. Both times he had been on the computer playing games on-line during the time the WUs got messed up. (He's 3 and sometimes hits other buttons by accident) It hadn't happened before so that is why I thought something else had happened. I will keep trying to figure out what it is that he is doing.

Thanks!
____________
Profile Cori
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 161
ID: 90
Credit: 5,817
RAC: 0
Message 2916 - Posted 2 Apr 2007 18:49:04 UTC - in response to Message ID 2905 .

Charmm is not supported on Win98. I don't know a way to tell BOINC that it shouldn't send any WUs to Win98 machines (does anybody?). Will put a message on the news and the FAQ though.

Andre

Maybe it is wise to put up an extra point on the homepage. Something like "Hardware requirements and limitations" or so. ;-)
Because the news is somewhat hidden after some days and not everyone will look into the FAQ.
____________
Bribe me with Lasagna!! :-)
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2918 - Posted 2 Apr 2007 22:24:23 UTC - in response to Message ID 2916 .


Maybe it is wise to put up an extra point on the homepage. Something like "Hardware requirements and limitations" or so. ;-)
Because the news is somewhat hidden after some days and not everyone will look into the FAQ.


Good suggestion. It's on our to-do list.

AK
____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2963 - Posted 8 Apr 2007 5:36:02 UTC - in response to Message ID 2914 .

We are looking into this, but it really shouldn't happen twice in a row, so to say... can you find out if there is another cause to do with your pc?

Andre

http://docking.utep.edu/result.php?resultid=128497

I had another one that froze. This time it got to 90.1% done. The running time had been over 5 hours. I am beginning to wonder if my pc is doing something weird, or whether that WU and my previous one that froze were related.



I suspect then my younger son did something that caused it to freeze. Both times he had been on the computer playing games on-line during the time the WUs got messed up. (He's 3 and sometimes hits other buttons by accident) It hadn't happened before so that is why I thought something else had happened. I will keep trying to figure out what it is that he is doing.

Thanks!


I had another one http://docking.utep.edu/result.php?resultid=137115 that 'froze' when my son was at the computer. As far as I can tell he didn't do anything other than play games (nothing else seemed messed up). It may just be that something with Nickjr.com does something to these WUs, although I haven't had this problem crunching for other projects. Maybe I will just crunch other projects when he is on the computer.

____________
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2967 - Posted 9 Apr 2007 15:51:36 UTC - in response to Message ID 2963 .

Any chance that you can monitor your memory usage during those times? I wonder if it is a memory issue that causes the freeze.

AK


I had another one http://docking.utep.edu/result.php?resultid=137115 that 'froze' when my son was at the computer. As far as I can tell he didn't do anything other than play games (nothing else seemed messed up). It may just be that something with Nickjr.com does something to these WUs, although I haven't had this problem crunching for other projects. Maybe I will just crunch other projects when he is on the computer.


____________
D@H the greatest project in the world... a while from now!
Profile [B^S] Acmefrog
Volunteer tester
Avatar

Joined: Nov 14 06
Posts: 45
ID: 252
Credit: 1,604,407
RAC: 0
Message 2968 - Posted 9 Apr 2007 21:11:07 UTC

I will try to. I currently have 1MB of RAM right now.
____________

Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3045 - Posted 15 Apr 2007 1:52:23 UTC

> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated.

Also Result 150783 was granted 0.00 credit against 2 others (one of which was also one of my other machines), the only difference was the process time, mine took over twice as long.
Could it be that the different Flop counts in different WUs don't validate against each other?
____________

Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3046 - Posted 15 Apr 2007 5:01:18 UTC - in response to Message ID 3045 .

> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated.


I've seen reports that the 5.05 Docking Client sometimes doesn't stop when it's told to suspend. I haven't encountered this personally, at least not that I remember noticing. I'm not sure if it's related to BOINC client version and/or OS type/version. I've noticed that the D@H client has 4 threads in it. My guess would be that BOINC tried to suspend it for some reason on host 1290 and it kept running. Somehow, in the interaction between the BOINC client and the D@H project client, that CPU time wasn't counted. In other words, host 1290 took a lot longer to run the WU but didn't report the time correctly.

Hmm, host 1290 shows that it restarted that result from a checkpoint. I'll bet the initial run continued when it was told to stop, completed the WU, and the restart from checkpoint woke up, found the result completed, and reported it finished. That second run probably took 1.17 seconds. The project staff probably needs to look into clients running while suspended and also to check if the CPU time is being accumulated properly when BOINC starts a new copy of the D@H client and it continues a WU.

BTW, your host 130 is running BOINC core client 5.8.11 which doesn't return all the info needed for HR.

Also Result 150783 was granted 0.00 credit against 2 others (one of which was also one of my other machines), the only difference was the process time, mine took over twice as long.
Could it be that the different Flop counts in different WUs don't validate against each other?



That's strange. When I looked at the 3 results, the 2 machines that validated ran the WU with D@H client 5.04 and the third machine somehow ran it with D@H client 5.05. The project staff will need to figure out how this happened, too.

BTW, Your host 1569 is also running BOINC core client 5.8.11.

To be properly matched on HR for D@H, you need a later BOINC client which includes the processor Family, Model, and Stepping code.

Look at that WU and look at each of the machines CPU type.

Host 130 - BOINC client 5.8.11 - D@H application 5.04
CPU type AuthenticAMD
Dual Core AMD Opteron(tm) Processor 285 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy]

Host 1569 - BOINC client 5.8.11 - D@H application 5.05
CPU type AuthenticAMD
Dual Core AMD Opteron(tm) Processor 2 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy]

Host 1680 - BOINC client 5.8.17 - D@H application 5.04
CPU type AuthenticAMD
AMD Opteron(tm) Processor 146 [Family 15 Model 39 Stepping 1][fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm]


If I understand it correctly, the HR algorithm uses the manufacturer (e.g. AuthenticAMD) and the Family/Model/Stepping (e.g. [Family 15 Model 39 Stepping 1]) to match machines so they will validate against each other for HR (Homogeneous Redundancy).
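As a rough illustration of that matching idea, here is a small Python sketch. It is my own guess at the logic and not the actual BOINC server code; `hr_key` and `same_hr_class` are made-up names. It pulls the Family/Model/Stepping numbers out of the CPU string a host reports and treats two hosts as matching only when both report those numbers and they are equal.

```python
import re

# Matches the bracketed block that newer clients append, e.g. "[Family 15 Model 39 Stepping 1]".
_FMS = re.compile(r"\[Family (\d+) Model (\d+) Stepping (\d+)\]")


def hr_key(vendor, cpu_string):
    """Return (vendor, family, model, stepping), or None if the client did not report them."""
    match = _FMS.search(cpu_string)
    if match is None:
        return None
    family, model, stepping = (int(value) for value in match.groups())
    return (vendor, family, model, stepping)


def same_hr_class(host_a, host_b):
    """Hosts match only if both report the full key and the keys are equal."""
    key_a = hr_key(*host_a)
    key_b = hr_key(*host_b)
    return key_a is not None and key_a == key_b


if __name__ == "__main__":
    host_1680 = ("AuthenticAMD", "AMD Opteron(tm) Processor 146 [Family 15 Model 39 Stepping 1][fpu vme de]")
    host_130 = ("AuthenticAMD", "Dual Core AMD Opteron(tm) Processor 285 [fpu vme de]")
    print(hr_key(*host_1680))
    print(hr_key(*host_130))
```

With the strings listed above, only host 1680 produces a full key; hosts 130 and 1569 report no Family/Model/Stepping block, so under this sketch they could not be matched at all.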

I hope this helps and I'm not trying to gripe at you for having the older BOINC client. I need to check all of mine too and make sure they're the latest stable BOINC client. Sometimes newer stable clients are released rather quietly. IIRC, 5.8.15 and later will return enough info for D@H on Windows and Mac, but Linux needs a later version. As of right now, the latest stable versions are:

Windows: 5.8.16
Mac OS X: 5.8.17 with GUI or Command Line
Linux: 5.8.17 with GUI
NOTE: On one of my Linux machines (2.4.x kernel, text only), I had to stay with BOINC client version 5.8.15 because it didn't have some library that later versions needed. I wish they would release a non-gui version for Linux but they've stopped doing that and are linking the BOINC Client on a machine with later versions of libraries.

Happy Crunching,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3048 - Posted 15 Apr 2007 5:47:35 UTC - in response to Message ID 3046 .

> I don't understand how this result (result id=151646) was validated against my result (result id= 151644). Mine took 18,775.22 seconds on Host 130 and the other took 1.17 seconds on host 1290. Both got validated.


I've seen reports that the 5.05 Docking Client sometimes doesn't stop when it's told to suspend. I haven't encountered this personally, at least not that I remember noticing. I'm not sure if it's related to BOINC client version and/or OS type/version. I've noticed that the D@H client has 4 threads in it. My guess would be that BOINC tried to suspend it for some reason on host 1290 and it kept running. Somehow, in the interaction between the BOINC client and the D@H project client, that CPU time wasn't counted. In other words, host 1290 took a lot longer to run the WU but didn't report the time correctly.

Hmm, host 1290 shows that it restarted that result from a checkpoint. I'll bet the initial run continued when it was told to stop, completed the WU, and the restart from checkpoint woke up, found the result completed, and reported it finished. That second run probably took 1.17 seconds. The project staff probably needs to look into clients running while suspended and also to check if the CPU time is being accumulated properly when BOINC starts a new copy of the D@H client and it continues a WU.
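
To make the accounting concern concrete, here is a minimal sketch of how CPU time has to be carried across a restart from checkpoint. The `Checkpoint` class and its field names are hypothetical; this is not the D@H or BOINC code, and the 18,000-second figure is invented for illustration (only the 1.17 seconds comes from the result page).

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """Hypothetical checkpoint record: progress plus CPU seconds used so far."""
    fraction_done: float
    cpu_seconds: float

def reported_cpu_time(checkpoint, session_cpu_seconds, carry_over=True):
    """CPU time to report when a task finishes after restarting from a checkpoint."""
    if carry_over:
        # Correct accounting: earlier sessions plus the final one.
        return checkpoint.cpu_seconds + session_cpu_seconds
    # Faulty accounting: only the final session is counted.
    return session_cpu_seconds

saved = Checkpoint(fraction_done=0.99, cpu_seconds=18000.0)  # illustrative value
print(f"{reported_cpu_time(saved, 1.17):.2f}")                    # 18001.17
print(f"{reported_cpu_time(saved, 1.17, carry_over=False):.2f}")  # 1.17
```

If only the last session is counted, a long-running WU shows up with a CPU time of about a second, which is the symptom seen on host 1290.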

BTW, your host 130 is running BOINC core client 5.8.11 which doesn't return all the info needed for HR.

Also Result 150783 was granted 0.00 credit against 2 others (one of which was also one of my other machines), the only difference was the process time, mine took over twice as long.
Could it be that the different Flop counts in different WUs don't validate against each other?



That's strange. When I looked at the 3 results, the 2 machines that validated ran the WU with D@H client 5.04 and the third machine somehow ran it with D@H client 5.05. The project staff will need to figure out how this happened, too.

BTW, Your host 1569 is also running BOINC core client 5.8.11.

To be properly matched on HR for D@H, you need a later BOINC client which includes the processor Family, Model, and Stepping code.

Look at that WU and look at each machine's CPU type.

Host 130 - BOINC client 5.8.11 - D@H application 5.04
CPU type AuthenticAMD
Dual Core AMD Opteron(tm) Processor 285 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy]

Host 1569 - BOINC client 5.8.11 - D@H application 5.05
CPU type AuthenticAMD
Dual Core AMD Opteron(tm) Processor 2 [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy]

Host 1680 - BOINC client 5.8.17 - D@H application 5.04
CPU type AuthenticAMD
AMD Opteron(tm) Processor 146 [Family 15 Model 39 Stepping 1][fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm]


If I understand it correctly, the HR algorithm uses the manufacturer (e.g. AuthenticAMD) and the Family/Model/Stepping (e.g. [Family 15 Model 39 Stepping 1]) to match machines so they will validate against each other for HR (Homogeneous Redundancy).

I hope this helps and I'm not trying to gripe at you for having the older BOINC client. I need to check all of mine too and make sure they're the latest stable BOINC client. Sometimes newer stable clients are released rather quietly. IIRC, 5.8.15 and later will return enough info for D@H on Windows and Mac, but Linux needs a later version. As of right now, the latest stable versions are:

Windows: 5.8.16
Mac OS X: 5.8.17 with GUI or Command Line
Linux: 5.8.17 with GUI
NOTE: On one of my Linux machines (2.4.x kernel, text only), I had to stay with BOINC client version 5.8.15 because it didn't have some library that later versions needed. I wish they would release a non-gui version for Linux but they've stopped doing that and are linking the BOINC Client on a machine with later versions of libraries.

Happy Crunching,

-- David


Thanks David, a lot of what you said can and does make sense.
The reason my Linux machines are still using that Boinc version is that I have had no confirmation that the newer versions show the required information for HR on Linux yet.
In fact we were told not to upgrade till it was fixed by Andre no less. Works fine for Windows but not for Linux.
So I have had no reason to upgrade at this stage.

____________
Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3050 - Posted 15 Apr 2007 13:40:58 UTC

@Conan

I'll defer to Andre on what version to use for Linux but I'm fairly certain that it's the CPU Manufacturer/Family/Model/Stepping info that is needed for HR.

Happy Crunching,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3070 - Posted 16 Apr 2007 17:48:32 UTC - in response to Message ID 3050 .

David, most of what you explained is correct. There is one comment I would like to make: yes, the Family/Model/Stepping information will make it much easier to classify a certain host into an HR class, but we still use the older filtering rules for the older BOINC clients too. This means that sometimes a host that runs an older BOINC client is classified into an HR class that it doesn't really belong to, and the result might be invalid because of that. These are the ones we have to find out about so that we can adjust the HR rules and make them more reliable.

Also, it is possible that versions 5.04 and 5.05 give different results (it has happened before when we upgraded charmm). I didn't expect it, because the changes were so small this time and we did not change the molecular dynamics algorithm, but it is not impossible. If everybody is on 5.05 we should not see it anymore (I wish BOINC had a mechanism where you can specify an app version per workunit).

Thanks for all the explanations; it's a big help!!

Andre


____________
D@H the greatest project in the world... a while from now!
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3224 - Posted 2 May 2007 14:27:19 UTC - in response to Message ID 3046 .

I've seen reports that the 5.05 Docking Client sometimes doesn't stop when it's told to suspend. I haven't encountered this personally, at least not that I remember noticing. I'm not sure if it's related to BOINC client version and/or OS type/version. I've noticed that the D@H client has 4 threads in it. My guess would be that BOINC tried to suspend it for some reason on host 1290 and it kept running. Somehow, in the interaction between the BOINC client and the D@H project client, that CPU time wasn't counted. In other words, host 1290 took a lot longer to run the WU but didn't report the time correctly.

Hmm, host 1290 shows that it restarted that result from a checkpoint. I'll bet the initial run continued when it was told to stop, completed the WU, and the restart from checkpoint woke up, found the result completed, and reported it finished. That second run probably took 1.17 seconds. The project staff probably needs to look into clients running while suspended and also to check if the CPU time is being accumulated properly when BOINC starts a new copy of the D@H client and it continues a WU.

Just noticed this thread (I'm the owner of host 1290).

Currently running 32-bit Linux Charmm 5.07 -- it __still__ does not stop when I suspend computation. [Before the change to have the Docking application itself set 'ulimit -s unlimited', this same system *was* counting CPU time correctly for Docking workunits (and would stop on suspend). After the change, the system is now reporting times around 1 second for *every* Docking workunit.]

Don't know about "restarts from checkpoint". The system runs 24/7 - the only time crunching gets explicitly interrupted is if I'm upgrading the boinc client.

When I looked at the 3 results, the 2 machines that validated ran the WU with D@H client 5.04 and the third machine somehow ran it with D@H client 5.05.
Don't know if the boinc developers have fixed things, but once upon a time the reported client version was what did the *formatting* of the "task completed" RPC message sent to the server - not the client version that actually did the *crunching* of the workunit.
.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3226 - Posted 2 May 2007 15:53:43 UTC - in response to Message ID 3224 .


Currently running 32-bit Linux Charmm 5.07 -- it __still__ does not stop when I suspend computation. [Before the change to have the Docking application itself set 'ulimit -s unlimited', this same system *was* counting CPU time correctly for Docking workunits (and would stop on suspend). After the change, the system is now reporting times around 1 second for *every* Docking workunit.]


Mikus, we can not reproduce the case where the result does not suspend at all. Because of the atomic checkpointing we now use, it is possible that the app keeps on running for up to 10 secs because it is in the middle of the checkpoint. It should stop at some point though. Please keep on monitoring the charmm processes to see if this happens.
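
For anyone who wants to do that monitoring without watching top by hand, here is a minimal Linux-only sketch that samples a process's accumulated CPU time twice and reports whether it grew. The function names and the sample line (including the process name) are made up for illustration; the field positions follow the documented layout of /proc/<pid>/stat. The default interval is set above the 10 seconds mentioned for checkpointing.

```python
import time

def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one line of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime

def is_still_consuming_cpu(pid, interval_seconds=15.0):
    """Sample a process twice and report whether its CPU time grew (Linux only)."""
    path = f"/proc/{pid}/stat"
    with open(path) as handle:
        before = cpu_ticks_from_stat(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = cpu_ticks_from_stat(handle.read())
    return after > before

# Invented sample line, only to show the parsing.
sample = ("4242 (charmm 5.07) S 1 4242 4242 0 -1 4194304 100 0 0 0 "
          "7300 2900 0 0 39 19 4 0 12345 1000000 200 18446744073709551615")
print(cpu_ticks_from_stat(sample))  # 10200
```

Calling `is_still_consuming_cpu(pid)` with the PID of a task that boincmgr shows as suspended should return False if the suspend really took effect.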

Thanks
Andre

____________
D@H the greatest project in the world... a while from now!
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3229 - Posted 3 May 2007 19:16:59 UTC - in response to Message ID 3226 .

Mikus, we can not reproduce the case where the result does not suspend at all. Because of the atomic checkpointing we now use, it is possible that the app keeps on running for up to 10 secs because it is in the middle of the checkpoint. It should stop at some point though. Please keep on monitoring the charmm processes to see if this happens.

My AMD multi-core system has installed 64-bit Ubuntu 7.04, the 32-bit boinc 5.9.5 client/manager, and the 32-bit Linux charmm 5.07 application (as well as numerous other projects). [I believe that 64-bit Ubuntu is "different" in that its /lib is 64-bit, and 32-bit applications are accommodated by /lib32, whereas 64-bit SuSE (for example) provides /lib as 32-bit and /lib64 as 64-bit.] The principal tools I use to (visually) view processing are gkrellm and top.

The current Docking application is easy to spot on gkrellm - it uses so much system time that a broad orange stripe (<40% of the CPU) is drawn by gkrellm for whichever CPU is running it. [Boinc_application execution is "niced" to idle, thus shows up green (~100%) in gkrellm.] When I 'Suspend' Docking in boincmgr, the orange stripe is reduced to half-height -- the reason is that the boinc client believes that the Docking task is no longer running, so it dispatches a task from a different project -- and the two tasks (Docking and the other) end up sharing that CPU. This is confirmed by top -- it shows the Docking task continuing to run (at about 50% of the CPU), and the other project's task *also* running at 50% of the CPU.

I'm letting the system continue - I expect the task mix to get back to normal once the Docking task completes. But it has been more than 30 minutes now since I issued the 'Suspend' of Docking -- yet the Docking task continues its activity.
.
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3230 - Posted 3 May 2007 19:37:01 UTC

It's interesting - I've noticed on gkrellm that the orange stripe for Docking sometimes goes to full height (<40% CPU). [This is with 'Suspend' still in effect for Docking - and boincmgr *does* show the status of Docking tasks as 'suspended'.]

What was happening was that the applications migrated between cores. Top showed that the (still running) Docking task now had one CPU by itself, whereas it was two tasks from other applications that were "sharing" their CPU 50-50. [Every so often the applications seem to migrate - it's been back to "sharing" with Docking again.]
.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3231 - Posted 3 May 2007 19:49:56 UTC - in response to Message ID 3230 .

Is there anybody else that notices this behavior on Ubuntu (or any other distro for that matter)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior.

Thanks
Andre



____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3235 - Posted 4 May 2007 2:42:48 UTC - in response to Message ID 3231 .

Is there anybody else that notices this behavior on Ubuntu (or any other distro for that matter)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior.

Thanks
Andre

It's interesting - I've noticed on gkrellm that the orange stripe for Docking sometimes goes to full height (<40% CPU). [This is with 'Suspend' still in effect for Docking - and boincmgr *does* show the status of Docking tasks as 'suspended'.]

What was happening was that the applications migrated between cores. Top showed that the (still running) Docking task now had one CPU by itself, whereas it was two tasks from other applications that were "sharing" their CPU 50-50. [Every so often the applications seem to migrate - it's been back to "sharing" with Docking again.]
.



Four Intel duals and two AMD single-cores running Ubuntu 7.04 and 32-bit 5.8.16 or 5.8.17, all suspend D@H correctly.
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3238 - Posted 4 May 2007 11:39:17 UTC - in response to Message ID 3231 .

Is there anybody else that notices this behavior on Ubuntu (or any other distro for that matter)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior.

It doesn't have to be Ubuntu 7.04. My system first started not suspending Docking workunits when Charmm 5.05 was released. At the time I was running 64-bit Ubuntu 6.10.
http://docking.utep.edu/forum_thread.php?id=220&nowrap=true#2997

--------
p.s. Looked at the system when I got up this morning. It is again running one more boinc application task than there are assigned CPUs - and one of those tasks is a Docking task (boincmgr shows the status of that Docking task as "Waiting to run", yet its progress percentage is incrementing; its CPU time shown remains at less than 1 second). Don't know how come the boinc client allowed an "extra" task to be active in the middle of the night.
.
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3239 - Posted 4 May 2007 12:47:56 UTC - in response to Message ID 3238 .

Is there anybody else that notices this behavior on Ubuntu (or any other distro for that matter)? I have just tried to reproduce this behavior on our Ubuntu 6.10 machines (don't have 7.04 yet) with the 5.9.5 manager, but do not see this behavior at all: when the docking project is suspended, charmm stops running and the other app takes over as it should. Suspended many times; always the same correct behavior.

It doesn't have to be Ubuntu 7.04. My system first started not suspending Docking workunits when Charmm 5.05 was released. At the time I was running 64-bit Ubuntu 6.10.
http://docking.utep.edu/forum_thread.php?id=220&nowrap=true#2997

--------
p.s. Looked at the system when I got up this morning. It is again running one more boinc application task than there are assigned CPUs - and one of those tasks is a Docking task (boincmgr shows the status of that Docking task as "Waiting to run", yet its progress percentage is incrementing; its CPU time shown remains at less than 1 second). Don't know how come the boinc client allowed an "extra" task to be active in the middle of the night.
.


Did you change anything on this computer between 9 April and 11 April?

It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April.

You were running 5.8.17 on both of those dates.

Between 11 April and 27 April, you switched to 5.9.4, but it still didn't appear to be working correctly.

Between 27 April and 30 April, you switched to 5.9.5, but it wasn't working anyway.

You're still on 5.9.5 and it still isn't working correctly (judging by your results).

I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.


Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3240 - Posted 4 May 2007 13:17:42 UTC - in response to Message ID 3239 .

I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.


Mikus's results are actually being crunched normally and return a similar result to the other replicas in the workunit. If we were not using fixed credit he would not receive much for it, since the time the BOINC client reports is so low; I think this may be a combination of a problem in the newer BOINC clients and how our app interacts with them.

Andre
____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3242 - Posted 4 May 2007 18:55:55 UTC - in response to Message ID 3240 .

I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.


Mikus's results are actually being crunched normally and return a similar result to the other replicas in the workunit. If we were not using fixed credit he would not receive much for it, since the time the BOINC client reports is so low; I think this may be a combination of a problem in the newer BOINC clients and how our app interacts with them.

Andre


@Andre, Please look at WU ID 52204. I do not see how 0.83 seconds is anywhere close to 9,503.05 seconds. Something is not right.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3244 - Posted 4 May 2007 20:09:44 UTC - in response to Message ID 3242 .
Last modified: 4 May 2007 20:09:55 UTC


@Andre, Please look at WU ID 52204. I do not see how 0.83 seconds is anywhere close to 9,503.05 seconds. Something is not right.


I agree, but the problem is on the client side. The boinc client reports the cpu hours number to the server when the result is uploaded to the server. Although the cpu hours is very low, the result that mikus returns is actually correct and validates against the other result in the workunit. He probably used a lot more time to compute this result than his boinc client tells us. He is using alpha client 5.9.5 and it probably contains a bug.
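
A simple way to spot results like this one is to compare the CPU times reported by the replicas of a workunit. The sketch below is only an illustration: the function name, the result labels, and the 1% threshold are assumptions of mine, not anything the validator does. The two times are the ones quoted for WU 52204.

```python
def suspicious_cpu_times(cpu_seconds_by_result, min_fraction=0.01):
    """Return result labels whose CPU time is below min_fraction of the largest one."""
    largest = max(cpu_seconds_by_result.values())
    return [label for label, seconds in cpu_seconds_by_result.items()
            if seconds < min_fraction * largest]

# Labels are placeholders; the times are the two quoted for WU 52204.
wu_52204 = {"replica_a": 0.83, "replica_b": 9503.05}
print(suspicious_cpu_times(wu_52204))  # ['replica_a']
```

A flagged result is not necessarily wrong, as this case shows; it only means the reported time is not believable.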

Thanks!
Andre
____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3245 - Posted 4 May 2007 21:20:19 UTC - in response to Message ID 3244 .


@Andre, Please look at WU ID 52204. I do not see how 0.83 seconds is anywhere close to 9,503.05 seconds. Something is not right.


I agree, but the problem is on the client side. The boinc client reports the cpu hours number to the server when the result is uploaded to the server. Although the cpu hours is very low, the result that mikus returns is actually correct and validates against the other result in the workunit. He probably used a lot more time to compute this result than his boinc client tells us. He is using alpha client 5.9.5 and it probably contains a bug.

Thanks!
Andre


Look at the results from 11 April on. 5.8.17 was still in use and it was returning values of less than one second. It can't be from 5.9.4 or 5.9.5. Something else was in play before 5.9.4.

I have 5.9.5 on several machines, including Ubuntu 6.10 (then 7.04) without those kinds of CPU time results.

OK, I'm done.
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3247 - Posted 5 May 2007 17:59:33 UTC - in response to Message ID 3239 .

Did you change anything on this computer between 9 April and 11 April?

It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April.


Don't remember what I did yesterday, let alone a month ago. But if you look at my results page, the result reported on Apr 11 used charmm 5.04, whereas the result reported on Apr 12 used charmm 5.05. So the answer to your question is: __I__ did not change anything on this computer between 9 April and 11 April. However, the __boinc environment__ "automatically" downloaded a new Docking application version to my system when I connected 11 April. This is confirmed by my post 12 April UTC (11 April local time) in which I described my experiences with the newly downloaded Linux charmm 5.05 -- including the fact that now a 'Suspend' of the charmm task did not work.

Interesting that you want answers from me. I'm a user. Why expect *me* to know how come 5.05 behaves differently in the boinc environment than 5.04 ?


I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.

That too was a "feature" introduced by the update to 5.05 (this "feature" has so far been carried forward to all subsequent Docking application releases). Note that the boinc client I was using on 9 April was 5.8.17, and the boinc client I was using on 12 April was 5.8.17 -- it was the new charmm 5.05, upon being "tracked" by the __same__ boinc client, that now showed up as taking less than one second of execution. [In actuality, Docking workunits need more than two hours of crunching each on my system. I have no control over what gets to be reported.]

My own problem with Docking applications on Linux from 5.05 on is that they use up to 40% of one CPU for "system services". Versions prior to 5.05 did not do this. I suspect that this is "unproductive" overhead, which eats up CPU cycles better spent on crunching the actual applications.
.
Memo
Forum moderator
Project developer
Project tester

Joined: Sep 13 06
Posts: 88
ID: 14
Credit: 1,666,392
RAC: 0
Message 3249 - Posted 5 May 2007 23:15:07 UTC

By looking at the time a replica was sent and the time it was reported you can come to the conclusion that it ran for more than a few seconds.
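
That comparison can be sketched as follows. The timestamps here are invented for illustration and the function names are mine; only the 0.83-second CPU time is a figure from this thread. The window between sent and reported is an upper bound on how long the task could have run.

```python
from datetime import datetime

def turnaround_seconds(sent_utc, reported_utc):
    """Seconds between when a replica was sent and when it was reported (ISO timestamps)."""
    sent = datetime.fromisoformat(sent_utc)
    reported = datetime.fromisoformat(reported_utc)
    return (reported - sent).total_seconds()

def cpu_share_of_turnaround(cpu_seconds, sent_utc, reported_utc):
    """Fraction of the sent-to-reported window that the reported CPU time accounts for."""
    return cpu_seconds / turnaround_seconds(sent_utc, reported_utc)

# Hypothetical timestamps three hours apart; 0.83 s is the CPU time quoted earlier.
window = turnaround_seconds("2007-05-03T10:00:00", "2007-05-03T13:00:00")
share = cpu_share_of_turnaround(0.83, "2007-05-03T10:00:00", "2007-05-03T13:00:00")
print(window)           # 10800.0
print(f"{share:.6f}")   # 0.000077
```

A share this small, for a result that validates against a replica needing hours, points to the time accounting rather than the computation.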

I can say it's only a problem with the applications (BOINC or CHARMM). I will try to look at that next week, but I make no promises as it is finals week here at UTEP and I am loaded with work. For sure I can throw it in the queue :)

Thanks for the observation

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3252 - Posted 6 May 2007 14:46:57 UTC - in response to Message ID 3247 .

Did you change anything on this computer between 9 April and 11 April?

It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April.


Don't remember what I did yesterday, let alone a month ago. But if you look at my results page, the result reported on Apr 11 used charmm 5.04, whereas the result reported on Apr 12 used charmm 5.05. So the answer to your question is: __I__ did not change anything on this computer between 9 April and 11 April. However, the __boinc environment__ "automatically" downloaded a new Docking application version to my system when I connected 11 April. This is confirmed by my post 12 April UTC (11 April local time) in which I described my experiences with the newly downloaded Linux charmm 5.05 -- including the fact that now a 'Suspend' of the charmm task did not work.

Interesting that you want answers from me. I'm a user. Why expect *me* to know how come 5.05 behaves differently in the boinc environment than 5.04 ?


I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.

That too was a "feature" introduced by the update to 5.05 (this "feature" has so far been carried forward to all subsequent Docking application releases). Note that the boinc client I was using on 9 April was 5.8.17, and the boinc client I was using on 12 April was 5.8.17 -- it was the new charmm 5.05, upon being "tracked" by the __same__ boinc client, that now showed up as taking less than one second of execution. [In actuality, Docking workunits need more than two hours of crunching each on my system. I have no control over what gets to be reported.]

My own problem with Docking applications on Linux from 5.05 on is that they use up to 40% of one CPU for "system services". Versions prior to 5.05 did not do this. I suspect that this is "unproductive" overhead, which eats up CPU cycles better spent on crunching the actual applications.
.


I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time.

It arouses my curiosity when I notice things like that. The fact is that at least your Linux Opterons ran 5.05, even if that is when the short time showed up. My Linux Opterons (single-core) crashed every 5.05, but then started running 5.06 with normal times (pretty much same as 5.04).

There may be more Linux Opterons running, but I didn't find any yet.
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3254 - Posted 6 May 2007 17:56:17 UTC - in response to Message ID 3252 .

I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time.

It arouses my curiosity when I notice things like that. The fact is that at least your Linux Opterons ran 5.05, even if that is when the short time showed up. My Linux Opterons (single-core) crashed every 5.05, but then started running 5.06 with normal times (pretty much same as 5.04).

There may be more Linux Opterons running, but I didn't find any yet.
My impression was similar. Linux 5.05 did work as long as there was only a single Docking task running. But my boinc environment happened to choose to run three 5.05 tasks simultaneously. Shortly thereafter the system crashed (needed re-boot). That is the __only__ time this Opteron system has ever crashed on me.

Because of this crash, what I did was to set Docking 'No new tasks'. And waited for a new version of the application to be released. (Turned out that by the time I got around to resuming Docking, Linux 5.07 was the current version. That's what I've been running -- it has not crashed on me, but it still invokes that !#%$! unproductive "system execution", which I believe increases the elapsed time to finish each Docking workunit.)
.
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3256 - Posted 6 May 2007 22:26:51 UTC - in response to Message ID 3254 .


My impression was similar. Linux 5.05 did work as long as there was only a single Docking task running. But my boinc environment happened to choose to run three 5.05 tasks simultaneously. Shortly thereafter the system crashed (needed re-boot). That is the __only__ time this Opteron system has ever crashed on me.

Because of this crash, what I did was to set Docking 'No new tasks'. And waited for a new version of the application to be released. (Turned out that by the time I got around to resuming Docking, Linux 5.07 was the current version. That's what I've been running -- it has not crashed on me, but it still invokes that !#%$! unproductive "system execution", which I believe increases the elapsed time to finish each Docking workunit.)
.

Do your other BOINC projects run with a "normal" CPU time?

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3259 - Posted 7 May 2007 1:22:17 UTC - in response to Message ID 3254 .

it has not crashed on me, but it still invokes that !#%$! unproductive "system execution", which I believe increases the elapsed time to finish each Docking workunit.)
.


We're still looking into what is causing this behavior. But you're right; a lot of cycles seem to be eaten up by these gettimeofday calls.
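
For readers wondering why timer calls would matter: every clock read is a call out of the compute loop, so reading the clock on every iteration can add up. The sketch below only counts how many clock reads a loop makes when it checks the time every `check_every` steps. The function and parameter names are made up for illustration, and nothing here is taken from the charmm or BOINC sources.

```python
import time

def run_with_periodic_clock(n_steps, check_every):
    """Run n_steps of dummy work, reading the clock only every check_every steps."""
    clock_reads = 0
    accumulator = 0
    for step in range(n_steps):
        accumulator += step * step  # stand-in for the real computation
        if step % check_every == 0:
            time.monotonic()        # the (comparatively expensive) clock read
            clock_reads += 1
    return clock_reads

print(run_with_periodic_clock(1000, 1))    # 1000
print(run_with_periodic_clock(1000, 100))  # 10
```

Whether something like this applies to the Docking app depends on where the gettimeofday calls actually come from, which is still being investigated.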

Thanks
Andre
____________
D@H the greatest project in the world... a while from now!
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3260 - Posted 7 May 2007 4:17:32 UTC - in response to Message ID 3256 .
Last modified: 7 May 2007 4:18:30 UTC

Do your other BOINC projects run with a "normal" CPU time?

Yes - every other project's applications, without exception.

So did Docking applications 5.04 and earlier.
.
GeneM
Volunteer tester

Joined: Nov 28 06
Posts: 22
ID: 333
Credit: 7,402,034
RAC: 0
Message 3269 - Posted 10 May 2007 1:38:40 UTC - in response to Message ID 3252 .

Did you change anything on this computer between 9 April and 11 April?

It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April.


Don't remember what I did yesterday, let alone a month ago. But if you look at my results page, the result reported on Apr 11 used charmm 5.04, whereas the result reported on Apr 12 used charmm 5.05. So the answer to your question is: __I__ did not change anything on this computer between 9 April and 11 April. However, the __boinc environment__ "automatically" downloaded a new Docking application version to my system when I connected 11 April. This is confirmed by my post 12 April UTC (11 April local time) in which I described my experiences with the newly downloaded Linux charmm 5.05 -- including the fact that now a 'Suspend' of the charmm task did not work.

Interesting that you want answers from me. I'm a user. Why expect *me* to know how come 5.05 behaves differently in the boinc environment than 5.04 ?


I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.

That too was a "feature" introduced by the update to 5.05 (this "feature" has so far been carried forward to all subsequent Docking application releases). Note that the boinc client I was using on 9 April was 5.8.17, and the boinc client I was using on 12 April was 5.8.17 -- it was the new charmm 5.05, upon being "tracked" by the __same__ boinc client, that now showed up as taking less than one second of execution. [In actuality, Docking workunits need more than two hours of crunching each on my system. I have no control over what gets to be reported.]

My own problem with Docking applications on Linux from 5.05 on is that they use up to 40% of one CPU for "system services". Versions prior to 5.05 did not do this. I suspect that this is "unproductive" overhead, which eats up CPU cycles better spent on crunching the actual applications.
.


I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time.

It arouses my curiosity when I notice things like that. The fact is that at least your Linux Opterons ran 5.05, even if that is when the short time showed up. My Linux Opterons (single-core) crashed every 5.05, but then started running 5.06 with normal times (pretty much same as 5.04).

There may be more Linux Opterons running, but I didn't find any yet.


I noticed this evening that my linux box is reporting really short crunch times and the work units are validating ok. Early yesterday I stopped the computer because the chipset cooling fan and another fan had failed and I needed to replace them. I did it and started up the computer. Now the drives (3) in the system are mapped differently and I have not figured out why yet. They are all sata drives and connected to the same ports as before. Anyway, I have to mount most of the partitions manually and start boinc in a terminal window manually. I just use "./boinc" in the boinc directory to start it. That is the only difference that I can figure out. When I know more I will post more.

Gene
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3293 - Posted 13 May 2007 12:51:57 UTC - in response to Message ID 3247 .

Turns out that not only does my system currently credit only about one second of execution to each Docking task (which actually takes around three hours to crunch), but it also is crediting only that one second of execution (for the around three hours of wall clock time) to my __system's__ totals.

The result is that the <cpu_efficiency> value for my system (the accumulated number of seconds spent executing, divided by the accumulated number of seconds elapsed) keeps going down as I run current Docking application tasks. The <cpu_efficiency> value is being used by the servers to predict how long it will take my system to complete each downloaded workunit. My <cpu_efficiency> value got to be so low that projects were no longer downloading to me (they calculated that my system would be unable to provide enough execution time to complete such downloaded work before its deadline).
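To make the arithmetic above concrete, here is a minimal sketch of the ratio described in this post and of how a very low value could make a workunit look impossible to finish on time. It is a simplified model, not BOINC's actual scheduling code: the function names, the 95% "healthy" figure, the three-hour work estimate and the one-week deadline are all assumptions chosen for illustration.

```python
def cpu_efficiency(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Accumulated CPU seconds divided by accumulated wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


def predicted_wall_seconds(cpu_seconds_needed: float, efficiency: float) -> float:
    """Hypothetical estimate: wall-clock time needed to deliver the CPU time."""
    if efficiency <= 0:
        return float("inf")
    return cpu_seconds_needed / efficiency


# Illustrative numbers: a task that really takes three hours of wall clock.
healthy = cpu_efficiency(cpu_seconds=3 * 3600 * 0.95, elapsed_seconds=3 * 3600)
broken = cpu_efficiency(cpu_seconds=1.0, elapsed_seconds=3 * 3600)

work_needed = 3 * 3600      # CPU seconds a new workunit is assumed to need
deadline = 7 * 24 * 3600    # assumed one-week deadline, in seconds

for label, eff in (("healthy", healthy), ("broken", broken)):
    wall = predicted_wall_seconds(work_needed, eff)
    print(f"{label}: efficiency={eff:.6f}, fits deadline={wall <= deadline}")
```

With only one second credited per three hours elapsed, the predicted wall-clock time in this model grows by a factor of roughly ten thousand, which is the kind of estimate that would stop work from being sent.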

I have now manually edited my system's <cpu_efficiency> value to allow normal downloading. And have marked Docking as 'no new work' until its current Linux application executable is upgraded.
.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3300 - Posted 13 May 2007 19:46:08 UTC - in response to Message ID 3293 .

Thanks Mikus, that's probably a good idea. We are on top of this and will hopefully find out soon what is going on, but as said before, this problem cannot be reproduced on any of the Linux machines in our lab (incl. the Opterons) and that makes it a bit hard to find a good solution...

Thanks
Andre

I have now manually edited my system's <cpu_efficiency> value to allow normal downloading. And have marked Docking as 'no new work' until its current Linux application executable is upgraded.
.


____________
D@H the greatest project in the world... a while from now!
Profile [B^S] BOINC-SG
Volunteer tester
Avatar

Joined: Oct 2 06
Posts: 17
ID: 136
Credit: 52,985
RAC: 0
Message 3331 - Posted 16 May 2007 6:07:16 UTC

Hi!

http://docking.utep.edu/result.php?resultid=191777

Cheers!
____________


My NEW BOINC-Site

Why people joined BOINC Synergy...

GeneM
Volunteer tester

Joined: Nov 28 06
Posts: 22
ID: 333
Credit: 7,402,034
RAC: 0
Message 3340 - Posted 16 May 2007 18:04:45 UTC - in response to Message ID 3269 .
Last modified: 16 May 2007 18:11:46 UTC

Did you change anything on this computer between 9 April and 11 April?

It appeared to be running correctly on 9 April (judging from your results) and it seems it hasn't run correctly since 11 April.


Don't remember what I did yesterday, let alone a month ago. But if you look at my results page, the result reported on Apr 11 used charmm 5.04, whereas the result reported on Apr 12 used charmm 5.05. So the answer to your question is: __I__ did not change anything on this computer between 9 April and 11 April. However, the __boinc environment__ "automatically" downloaded a new Docking application version to my system when I connected 11 April. This is confirmed by my post 12 April UTC (11 April local time) in which I described my experiences with the newly downloaded Linux charmm 5.05 -- including the fact that now a 'Suspend' of the charmm task did not work.

Interesting that you want answers from me. I'm a user. Why expect *me* to know how come 5.05 behaves differently in the boinc environment than 5.04 ?


I don't understand how your results are called valid (with a crunch time of less than one second) and granted credit.

That too was a "feature" introduced by the update to 5.05 (this "feature" has so far been carried forward to all subsequent Docking application releases). Note that the boinc client I was using on 9 April was 5.8.17, and the boinc client I was using on 12 April was 5.8.17 -- it was the new charmm 5.05, upon being "tracked" by the __same__ boinc client, that now showed up as taking less than one second of execution. [In actuality, Docking workunits need more than two hours of crunching each on my system. I have no control over what gets to be reported.]

My own problem with Docking applications on Linux from 5.05 on is that they use up to 40% of one CPU for "system services". Versions prior to 5.05 did not do this. I suspect that this is "unproductive" overhead, which eats up CPU cycles better spent on crunching the actual applications.
.


I didn't mean to ruffle your feathers. I was only trying to figure out what things changed to cause your Opteron to report such a short crunch time.

It arouses my curiosity when I notice things like that. The fact is that at least your Linux Opterons ran 5.05, even if that is when the short time showed up. My Linux Opterons (single-core) crashed every 5.05, but then started running 5.06 with normal times (pretty much same as 5.04).

There may be more Linux Opterons running, but I didn't find any yet.


I noticed this evening that my linux box is reporting really short crunch times and the work units are validating ok. Early yesterday I stopped the computer because the chipset cooling fan and another fan had failed and I needed to replace them. I did it and started up the computer. Now the drives (3) in the system are mapped differently and I have not figured out why yet. They are all sata drives and connected to the same ports as before. Anyway, I have to mount most of the partitions manually and start boinc in a terminal window manually. I just use "./boinc" in the boinc directory to start it. That is the only difference that I can figure out. When I know more I will post more.

Gene



So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disks are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked whether stopping Docking from running as a daemon and running it manually from the Boinc directory will give those 0.xx second run times. I will do that if someone thinks it is worth the effort.

Gene
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3342 - Posted 16 May 2007 19:51:55 UTC - in response to Message ID 3340 .

So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disks are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked whether stopping Docking from running as a daemon and running it manually from the Boinc directory will give those 0.xx second run times. I will do that if someone thinks it is worth the effort.

Gene


Gene, please do the experiment to see if that could be a cause.

Regarding the memory check, you should be able to disable it in the bios, so you never have to press escape again. That's what I do.

Andre
____________
D@H the greatest project in the world... a while from now!
Profile KSMarksPsych
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 26
ID: 87
Credit: 8,222
RAC: 0
Message 3347 - Posted 17 May 2007 3:11:49 UTC

I managed to kill this WU. I'm not exactly sure what in the sequence of events did it...


I just got a new laptop with Vista on it. Since I'm going to wipe it soon, I didn't do much with the installed software. The pop ups to register McAfee were driving me nuts. So I decided to uninstall it and stick Avast on it for the time being. Even though I'm behind a router, I wanted to pull the network cable while I did this. I suspended networking in BOINC. I went to uninstall McAfee. While this was going on, either the entire manager crashed or the communications between the manager and the daemon died.

After rebooting to get rid of the last of McAfee and another reboot to finish the Avast installation, I started up BOINC again. My Predictor WU and my Riesel Sieve WU made it through okay. The Docking one was toast.

I'm not in front of that computer right now, but I'll check the logs when I'm back down there.

Profile KSMarksPsych
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 26
ID: 87
Credit: 8,222
RAC: 0
Message 3349 - Posted 17 May 2007 13:49:34 UTC

Here are the docking entries in stdoutdae.txt from before the crash. I don't see anything of note except it appears the WU crashed at the same time as I suspended the network.


2007-05-16 21:42:37 [Docking@Home] Starting task 1tng_mod0011_12328_94826_1 using charmm version 507
2007-05-16 21:42:39 [Docking@Home] [file_xfer] Started upload of file 1tng_mod0011_12120_243524_1_0
2007-05-16 21:42:39 [Docking@Home] [file_xfer] Started upload of file 1tng_mod0011_12120_243524_1_1
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Finished upload of file 1tng_mod0011_12120_243524_1_0
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Throughput 11903 bytes/sec
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Finished upload of file 1tng_mod0011_12120_243524_1_1
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Throughput 11903 bytes/sec
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Started upload of file 1tng_mod0011_12120_243524_1_2
2007-05-16 21:42:42 [Docking@Home] [file_xfer] Started upload of file 1tng_mod0011_12120_243524_1_3
2007-05-16 21:42:46 [Docking@Home] [file_xfer] Finished upload of file 1tng_mod0011_12120_243524_1_3
2007-05-16 21:42:46 [Docking@Home] [file_xfer] Throughput 2633 bytes/sec
2007-05-16 21:42:48 [Docking@Home] [file_xfer] Finished upload of file 1tng_mod0011_12120_243524_1_2
2007-05-16 21:42:48 [Docking@Home] [file_xfer] Throughput 115673 bytes/sec
2007-05-16 21:57:16 [Docking@Home] Sending scheduler request: To report completed tasks
2007-05-16 21:57:16 [Docking@Home] Reporting 1 tasks
2007-05-16 21:57:21 [Docking@Home] Scheduler RPC succeeded [server version 509]
2007-05-16 21:57:21 [Docking@Home] Deferring communication for 11 sec
2007-05-16 21:57:21 [Docking@Home] Reason: requested by project
2007-05-16 22:18:26 [---] Suspending network activity - user request
2007-05-16 22:22:33 [Docking@Home] Deferring communication for 1 min 0 sec
2007-05-16 22:22:33 [Docking@Home] Reason: Unrecoverable error for result 1tng_mod0011_12328_94826_1 ( - exit code 1073807364 (0x40010004))
2007-05-16 22:22:33 [Docking@Home] Computation for task 1tng_mod0011_12328_94826_1 finished
[05/16/07 22:22:33] TRACE [5968]: ***** Console Event Detected *****

[05/16/07 22:22:33] TRACE [5968]: Event: CTRL-SHUTDOWN Event

2007-05-16 22:22:34 [---] Exit requested by user
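
For anyone scanning a long stdoutdae.txt for this kind of failure, the following is a rough sketch of pulling the task name and exit code out of the "Unrecoverable error" line shown above, and confirming that the decimal and hexadecimal forms agree. The regular expression and the function name are illustrative assumptions based only on the one line quoted in this post; other client versions may format the message differently.

```python
import re

# The line as it appears in the log excerpt above.
LINE = (
    "2007-05-16 22:22:33 [Docking@Home] Reason: Unrecoverable error for result "
    "1tng_mod0011_12328_94826_1 ( - exit code 1073807364 (0x40010004))"
)

# Assumed pattern, derived from the single example line.
PATTERN = re.compile(
    r"Unrecoverable error for result (\S+) "
    r"\( - exit code (\d+) \((0x[0-9a-fA-F]+)\)\)"
)


def parse_unrecoverable(line: str):
    """Return (task_name, exit_code_decimal, exit_code_from_hex) or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3), 16)


parsed = parse_unrecoverable(LINE)
if parsed is not None:
    task, code_dec, code_hex = parsed
    # Both forms should describe the same number.
    print(task, hex(code_dec), code_dec == code_hex)
```

On Windows the value 0x40010004 corresponds to the DBG_TERMINATE_PROCESS status code, which would be consistent with the task being terminated from outside rather than failing in its own computation. That is an inference from the code value, not something the log states.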



Specs just for the record...

Vista Home Premium

2007-05-16 22:26:37 [---] Starting BOINC client version 5.8.16 for windows_intelx86
2007-05-16 22:26:37 [---] log flags: task, file_xfer, sched_ops
2007-05-16 22:26:37 [---] Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
2007-05-16 22:26:37 [---] Data directory: C:\BOINC
2007-05-16 22:26:38 [---] Processor: 2 GenuineIntel Genuine Intel(R) CPU T2300 @ 1.66GHz [x86 Family 6 Model 14 Stepping 8] [fpu tsc pae nx sse sse2 sse3 mmx]
2007-05-16 22:26:38 [---] Memory: 1.99 GB physical, 4.20 GB virtual
2007-05-16 22:26:38 [---] Disk: 139.31 GB total, 121.65 GB free
____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3350 - Posted 17 May 2007 18:33:22 UTC

@KSMarksPsych

I've had problems with Vista too. I'm also using McAfee that came with it.

When I go through ICS (Internet Connection Sharing - actually seems to be standards based and uses DHCP like a router) with my XP machine connected to the modem, I have problems with Vista completing uploads and not realizing it. Then it keeps trying to re-upload the file and gets errors each time. An update a while back fixed that, but the latest patch Tuesday not only brought the problem back but messed up BOINC so much that I lost probably 10 - 15 results and ended up having to wipe BOINC, re-install it, and re-attach to projects. Vista has real problems, especially if you're on dial-up. It also has problems any time it reboots because it doesn't seem to give BOINC enough time to shut down. There's a registry patch for the shutdown timing problem but I'll have to find it and install it. I've had to go back to dropping the Vista machine from my network and letting it dial-in separately.

I'm not sure if these problems are caused by McAfee or Vista, since McAfee is constantly updating and some of those updates include software updates, not just virus/spam sigs. IIRC, there's usually a McAfee software update around the time of patch Tuesday as well.

I really wish I didn't have some legacy apps that would be very hard to move to Linux.

-- David

____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

GeneM
Volunteer tester

Joined: Nov 28 06
Posts: 22
ID: 333
Credit: 7,402,034
RAC: 0
Message 3352 - Posted 19 May 2007 16:39:47 UTC - in response to Message ID 3342 .
Last modified: 19 May 2007 16:41:03 UTC

So here is the story. When I reboot my linux machine, if I hit "esc" to cancel the memory check then Ubuntu gets the disks wrong when booting. If I let the bios do its thing, then the disks are mapped correctly and everything boots fine. This behavior just started because I almost always cancel the memory check. I just don't want to wait for it. The good news is that Docking is now being started by the startup process and is running correctly again. I mean it is reporting the run times correctly again. I have not checked whether stopping Docking from running as a daemon and running it manually from the Boinc directory will give those 0.xx second run times. I will do that if someone thinks it is worth the effort.

Gene


Gene, please do the experiment to see if that could be a cause.

Regarding the memory check, you should be able to disable it in the bios, so you never have to press escape again. That's what I do.

Andre



Andre,

I just started boinc as user process in a terminal window like I was doing before when I was having trouble booting my machine correctly. I then opened boincmgr and looked at what was going on in the tasks tab. The "Progress" was being updated but the "CPU time" was not being updated. So I think that the run time will not be reported to the server when the work unit completes. I will let this work unit and several more run to completion just to be sure. When I started boinc as a user process I used the same command line arguments that I use when I start it as a daemon process.

Gene
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3354 - Posted 20 May 2007 0:47:30 UTC - in response to Message ID 3352 .

Let me get my arms around this: so you are seeing that cpu time is not updated when boinc is started manually from a terminal window, but when started by the system, it does show cpu time correctly? If that's the case we should maybe post this on the boinc_projects mailing list and see if anybody there understands what's going on.

Thanks
Andre

Andre,

I just started boinc as user process in a terminal window like I was doing before when I was having trouble booting my machine correctly. I then opened boincmgr and looked at what was going on in the tasks tab. The "Progress" was being updated but the "CPU time" was not being updated. So I think that the run time will not be reported to the server when the work unit completes. I will let this work unit and several more run to completion just to be sure. When I started boinc as a user process I used the same command line arguments that I use when I start it as a daemon process.

Gene


____________
D@H the greatest project in the world... a while from now!
Dotsch
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 49
ID: 75
Credit: 57,728
RAC: 0
Message 3355 - Posted 20 May 2007 10:12:50 UTC

I have some invalid results from my Mac (Intel) :
http://docking.utep.edu/result.php?resultid=191636
http://docking.utep.edu/result.php?resultid=187665
http://docking.utep.edu/result.php?resultid=182663

GeneM
Volunteer tester

Joined: Nov 28 06
Posts: 22
ID: 333
Credit: 7,402,034
RAC: 0
Message 3356 - Posted 20 May 2007 17:36:59 UTC - in response to Message ID 3354 .

Let me get my arms around this: so you are seeing that cpu time is not updated when boinc is started manually from a terminal window, but when started by the system, it does show cpu time correctly? If that's the case we should maybe post this on the boinc_projects mailing list and see if anybody there understands what's going on.

Thanks
Andre

Andre,

I just started boinc as user process in a terminal window like I was doing before when I was having trouble booting my machine correctly. I then opened boincmgr and looked at what was going on in the tasks tab. The "Progress" was being updated but the "CPU time" was not being updated. So I think that the run time will not be reported to the server when the work unit completes. I will let this work unit and several more run to completion just to be sure. When I started boinc as a user process I used the same command line arguments that I use when I start it as a daemon process.

Gene




Andre,

You are correct. I just checked. I have let Docking run ~24 hours since I started it manually in the terminal window. It completed 11 work units in that time. Three of them showed 0.01 seconds computation time and the rest showed 0.00 seconds computation time. The one in progress is about 21% completed and shows 0.00 seconds computation time. If there is anything else I can do to help let me know.
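
As a rough illustration of how far off these numbers are, the sketch below compares the reported computation times from this run against the average wall-clock time per work unit. The 1% threshold and the helper name are arbitrary assumptions for illustration; they are not part of BOINC or of the project's validator.

```python
# Figures from this post: 11 work units in roughly 24 hours,
# three reporting 0.01 s of computation and eight reporting 0.00 s.
reported_cpu_seconds = [0.01] * 3 + [0.00] * 8
elapsed_seconds = 24 * 3600

# Average wall-clock time per completed work unit.
avg_wall = elapsed_seconds / len(reported_cpu_seconds)


def looks_unaccounted(cpu_s: float, wall_s: float, min_ratio: float = 0.01) -> bool:
    """Flag a result whose CPU time is under min_ratio of its wall-clock time."""
    return cpu_s < wall_s * min_ratio


flagged = [c for c in reported_cpu_seconds if looks_unaccounted(c, avg_wall)]
print(f"average wall time per WU: {avg_wall:.0f} s")
print(f"{len(flagged)} of {len(reported_cpu_seconds)} results look unaccounted")
```

Every one of the 11 results falls below the threshold, since each took about two hours of wall-clock time on average but reported at most a hundredth of a second.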

Gene
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3357 - Posted 20 May 2007 22:31:40 UTC - in response to Message ID 3355 .

It looks like there still is a problem with the checkpointing method we use. Michela has found another issue that she is looking into. You notice that the result that is invalid has been restarted several times, while the valid others were started and then finished without restarts.

Thanks for the reports.

Andre

I have some invalid results from my Mac (Intel) :
http://docking.utep.edu/result.php?resultid=191636
http://docking.utep.edu/result.php?resultid=187665
http://docking.utep.edu/result.php?resultid=182663


____________
D@H the greatest project in the world... a while from now!

