CHARMM ERRORS AND QUIRKS


Profile Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 4210 - Posted 3 Aug 2008 0:52:30 UTC
Last modified: 3 Aug 2008 1:45:30 UTC

I have noticed that the new CHARMM work units often get to 100.00% but still have some processing left to do before they finish. They sit at 100.00% "waiting to run" when BOINC switches to another project.
This is no real problem, as they all appear to complete normally once they start running again.

Also, there is a mix of work units: some don't display any progress until completion, while others show progress right from the start.
Is there a reason for this?

The real reason for this thread is a number of work units that are running well past the hour or so that nearly all the others take.

So far I have had 2 work units error out with this error:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Maximum CPU time exceeded
</message>
<stderr_txt>
Calling BOINC init.
Starting charmm run (initial or from checkpoint)...

Result 15515
Result 15055

I have another couple of work units at the moment that have reached 5.43 hours and 3.28 hours.
Should I abort these work units? I fear that they will error out anyway.

Thanks, and it's good to see the project back; I've been waiting a while.

[EDIT] Of the two I mentioned above that had gone past 1 hour, the 5.43 hour one did error out with the same error, and I aborted the second one as it had reached 4.15 hours and would have failed as well.
Result 15056
Result 15057
____________

Profile Michela
Forum moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: Sep 13 06
Posts: 163
ID: 10
Credit: 97,083
RAC: 0
Message 4211 - Posted 3 Aug 2008 2:18:17 UTC - in response to Message ID 4210 .

Conan, it seems to me that these work-units are not going to finish. Please abort those that have run for more than 3 hours. A work-unit should not take more than 1.5 to 2 hours.

We will try to reproduce the problem with your work-units and keep you posted.

Thanks,

Michela






____________
If you are interested in working on Docking@Home in a great group at UDel, contact me at 'taufer at acm dot org'!

Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4213 - Posted 3 Aug 2008 11:40:48 UTC

I received an e-mail from you about a misconfigured BOINC installation on my machine:

Docking@Home notification:

Dear Saenger
Your machine (host # 6528) described below appears to have a misconfigured BOINC
installation. Could you please have a look at it?

Sincerely,
The Docking@Home team

I can't confirm this, as Docking is the only project with problems. I'm running CPDN, Einstein, yoyo, POEM, malaria, Cosmology, UTC-malaria, WCG, Lattice, orbit, Rosetta, QMC, Milkyway, Leiden, Ralph, Magnetism, Ibercivis and soon again Simap, plus the occasional Pirates and LHC on it, and they all crunch well.

I don't think it's a problem with the configuration.

My puter is described here, and of course in my account as well, but there's just rudimentary information.

The errors I get are the following:
<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Calling BOINC init.
Starting charmm run (initial or from checkpoint)...
ERROR - Charmm exited with code 1.
Calling BOINC finish.
called boinc_finish

</stderr_txt>
]]>
Profile Trilce Estrada
Forum moderator
Project administrator
Project developer
Project tester

Joined: Sep 19 06
Posts: 189
ID: 119
Credit: 1,217,236
RAC: 0
Message 4219 - Posted 4 Aug 2008 14:14:33 UTC - in response to Message ID 4210 .

Hi Conan, yes, I think it is better if you cancel those very long work units (those taking more than 2 or 3 hours). This is very strange behavior that we are trying to understand. I will run those work units outside BOINC to see if I can reproduce the behavior and find the cause. I don't know if I will succeed, because different architectures often give different results, but I'll try and I'll keep you posted.

Profile Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 4221 - Posted 4 Aug 2008 15:23:03 UTC - in response to Message ID 4219 .

Hi Conan, yes, I think it is better if you cancel those very long work units (those taking more than 2 or 3 hours). This is very strange behavior that we are trying to understand. I will run those work units outside BOINC to see if I can reproduce the behavior and find the cause. I don't know if I will succeed, because different architectures often give different results, but I'll try and I'll keep you posted.


G'Day Trilce,
I think I have worked out that the work units I was having trouble with (there were about 5 of them) had problems because they were all resends that had already been sent twice as 6.04 work units and were then sent to me as 6.07 work units.

I have had no more trouble since.

(I sent an e-mail to Michela to say I could not supply the files she requested, but offered her what I believe to be the solution to the problem, and I have now passed this on to you.)

Thanks again.

____________
Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4223 - Posted 5 Aug 2008 16:00:01 UTC

I still get the same errors for every single WU on my puter, as described below (or above, depending on your sorting ;)

Is there anything I can do about it? Every other project runs without any problems.
____________
Greetings from Saenger

For questions about BOINC, look in the BOINC-Wiki

Profile Trilce Estrada
Forum moderator
Project administrator
Project developer
Project tester

Joined: Sep 19 06
Posts: 189
ID: 119
Credit: 1,217,236
RAC: 0
Message 4224 - Posted 5 Aug 2008 17:25:50 UTC - in response to Message ID 4223 .

Hi Saenger, I think it could be a corrupted input file. Can you please reset the project, or detach and reattach it? That way, if there is a corrupted file, you should get a good copy. I'm still investigating what could be causing those errors.


Hi Conan, what you said makes sense; it could be that the client was a little confused by the change of versions. But as I just said to Saenger, we are working to find the causes of all those errors.


Thank you both

Profile Trilce Estrada
Forum moderator
Project administrator
Project developer
Project tester

Joined: Sep 19 06
Posts: 189
ID: 119
Credit: 1,217,236
RAC: 0
Message 4225 - Posted 5 Aug 2008 17:54:28 UTC

OK, I found one cause in my own client: the input files were empty. So, if you go to the BOINC client directory, then /projects/docking.cis.udel.edu/, and some of the input files have a size of 0, you are likely to keep having errors. I just detached and reattached the project, but now I'm having problems downloading the files.

So, I need more time to solve the problem. But if your input files are empty, please suspend the project. I will let you know when the download problems are solved, and then you can either restart or detach and reattach the project.

Thank you for your patience
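
For anyone who would rather not check file sizes by hand, here is a minimal Python sketch of the check described above. It is only an illustration, not a project tool: the function name `find_empty_files` is made up for this example, and you have to pass in the path to projects/docking.cis.udel.edu inside your own BOINC directory, since that location differs between installations.

```python
import sys
from pathlib import Path


def find_empty_files(project_dir):
    """Return the sorted names of regular files in project_dir that are 0 bytes long."""
    root = Path(project_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Argument: path to projects/docking.cis.udel.edu in your BOINC directory.
    for name in find_empty_files(sys.argv[1]):
        print("empty file:", name)
```

If it prints any file names, you are probably affected by the empty-input-file problem described in this post.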

Profile Trilce Estrada
Forum moderator
Project administrator
Project developer
Project tester

Joined: Sep 19 06
Posts: 189
ID: 119
Credit: 1,217,236
RAC: 0
Message 4227 - Posted 5 Aug 2008 19:52:56 UTC

Hi all, I'm sure about one kind of error:


  • The input files (some of them) are there, but empty
  • Right now, if you have this problem, DO NOT detach the project, just SUSPEND it. This is because new users attaching to the project are likely to get lots of download errors
  • We are working on the solution, and we'll let you know when it's ready



Thank you all.

Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4232 - Posted 7 Aug 2008 16:19:27 UTC - in response to Message ID 4224 .

Hi Saenger, I think it could be a corrupted input file. Can you please reset the project, or detach and reattach it? That way, if there is a corrupted file, you should get a good copy. I'm still investigating what could be causing those errors.

I first tried a reset, but it didn't help.
Then I detached and reattached, and now I've done my first 4. I hope my max-per-core of 1 will go up again soon, but patience is a virtue, especially in beta stages ;)
Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4235 - Posted 8 Aug 2008 19:56:26 UTC

I just had 2 with Maximum CPU time exceeded as well:
http://docking.cis.udel.edu/result.php?resultid=19501
http://docking.cis.udel.edu/result.php?resultid=19489

As I'm not at the puter most of the time, I can't babysit the WUs. I just saw with the last one that the completion bar stuck at 0.000% while the time went up, but as I haven't seen the other ones that completed fine, I thought perhaps the bar is just broken.

Is there any way to avoid this without permanent babysitting?

Profile Michela
Forum moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: Sep 13 06
Posts: 163
ID: 10
Credit: 97,083
RAC: 0
Message 4236 - Posted 8 Aug 2008 20:27:05 UTC - in response to Message ID 4235 .

I just had 2 with Maximum CPU time exceeded as well:
http://docking.cis.udel.edu/result.php?resultid=19501
http://docking.cis.udel.edu/result.php?resultid=19489

As I'm not at the puter most of the time, I can't babysit the WUs. I just saw with the last one that the completion bar stuck at 0.000% while the time went up, but as I haven't seen the other ones that completed fine, I thought perhaps the bar is just broken.

Is there any way to avoid this without permanent babysitting?


This is the issue that is keeping us busy right now. We cannot reproduce the problem in standalone mode or on our testing server. On Docking we do not get back any file that could tell us what went on. The bar is not broken, and the fact that it stays at 0.00% probably means that the CHARMM simulation does not start.

It would help us to get, from you and the other volunteers with this problem, the list of files in the slots/# directory associated with the work-unit and the list of files in projects/docking.cis.udel.edu.
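
One way to produce such a list is sketched below in Python. This is just a hedged example, not something the project provides: the function names are invented here, and the two directory paths must be supplied by you, because the location of the BOINC directory and the slot number depend on your installation and on the running work-unit.

```python
import os
import sys


def list_files_with_sizes(directory):
    """Return (relative_path, size_in_bytes) for every file under directory, sorted by path."""
    listing = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, directory)
            listing.append((rel_path, os.path.getsize(full_path)))
    return sorted(listing)


def format_listing(listing):
    """Render the listing as plain text, one 'size  path' line per file."""
    return "\n".join(f"{size:>12}  {path}" for path, size in listing)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Arguments: the slot directory of the stuck work-unit and the project directory.
    for directory in sys.argv[1:]:
        print(directory)
        print(format_listing(list_files_with_sizes(directory)))
```

The slot directory has to be captured while the work-unit is still present, since the slot is cleaned up once the work-unit is gone.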

Thanks,

Michela



Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4237 - Posted 8 Aug 2008 20:40:20 UTC - in response to Message ID 4236 .

It would help us to get, from you and the other volunteers with this problem, the list of files in the slots/# directory associated with the work-unit and the list of files in projects/docking.cis.udel.edu.

The slots are gone with the WU, obviously. If I notice another one in the future, I will save it before aborting the WU.
I'll send you the project/docking folder. I've archived it as a .tar.gz archive; is that fine? If so, I'll send it to the mail address on this page.
Profile Michela
Forum moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: Sep 13 06
Posts: 163
ID: 10
Credit: 97,083
RAC: 0
Message 4238 - Posted 8 Aug 2008 20:54:54 UTC - in response to Message ID 4237 .

It would help us to get, from you and the other volunteers with this problem, the list of files in the slots/# directory associated with the work-unit and the list of files in projects/docking.cis.udel.edu.

The slots are gone with the WU, obviously. If I notice another one in the future, I will save it before aborting the WU.
I'll send you the project/docking folder. I've archived it as a .tar.gz archive; is that fine? If so, I'll send it to the mail address on this page.


Yes, please.

I was looking at the database and ordered those WUs with error-status = -177 (what we get). Well, it seems that all of them were created before 2008-07-28 23:11:18, and the last WU with the problem (so far) had id=6149. Unfortunately, our settings resent WUs with errors up to 3 times; we have just changed this.

These were old WUs, and we have made some changes since we created them that could be the cause of the -177 error.

Let me keep monitoring the situation and see if this is indeed the case.
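
The kind of check described above can be sketched in Python as follows. This is only an illustration of the reasoning, not the query that was actually run: the record layout (`id`, `create_time`, `error_status`) and the function name are assumptions made for the example, and all records are made up except the id 6149 and the timestamp quoted in the post.

```python
from datetime import datetime

ERROR_STATUS = -177  # the error status quoted in the post


def summarize_failures(workunits, error_status=ERROR_STATUS):
    """Summarize the work-units that failed with error_status.

    Returns None if no work-unit has that status; otherwise a dict with the
    number of failures, the latest creation time and the highest id among them.
    """
    failed = [wu for wu in workunits if wu["error_status"] == error_status]
    if not failed:
        return None
    return {
        "count": len(failed),
        "latest_create_time": max(wu["create_time"] for wu in failed),
        "highest_id": max(wu["id"] for wu in failed),
    }


# Hypothetical records in an assumed layout, for illustration only.
example_workunits = [
    {"id": 6100, "create_time": datetime(2008, 7, 27, 10, 0, 0), "error_status": -177},
    {"id": 6149, "create_time": datetime(2008, 7, 28, 23, 11, 18), "error_status": -177},
    {"id": 7000, "create_time": datetime(2008, 8, 2, 9, 0, 0), "error_status": 0},
]

if __name__ == "__main__":
    print(summarize_failures(example_workunits))
```

If every failing work-unit turns out to be older than a known change, as in the made-up data here, that supports the idea that the old work-units themselves are the cause.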

Michela



Profile Saenger
Volunteer tester

Joined: Sep 13 06
Posts: 125
ID: 79
Credit: 411,959
RAC: 0
Message 4239 - Posted 8 Aug 2008 21:05:02 UTC - in response to Message ID 4238 .

Yes, please.

Been there, done that.
4MB on their way to you.
Profile Michela
Forum moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: Sep 13 06
Posts: 163
ID: 10
Credit: 97,083
RAC: 0
Message 4240 - Posted 8 Aug 2008 22:14:01 UTC - in response to Message ID 4239 .

Yes, please.

Been there, done that.
4MB on their way to you.


If my observation is correct and the error -177 is indeed related to old work-units then canceling those old work-units should fix the problem. I just cancelled old work-units up to the work-unit with id 6149. This should not affect credits already awarded. I am monitoring docking and will keep you all posted.

Michela
