Computation error



Message boards : Number crunching : Computation error

Mark Rush

Joined: Feb 15 09
Posts: 4
ID: 7162
Credit: 5,779,850
RAC: 0
Message 6944 - Posted 29 Oct 2012 15:54:06 UTC

After last month's issue with "0 progress" WUs, I am seeing another issue. Out of about 80 WUs, about 70 ended with "Computation error." The other 10 or so completed successfully. The computation errors appear at different intervals. The longest seems to be one WU that ran for about 90 minutes before it stopped with a computation error, and the shortest is one that ran for only 5 minutes.

I have probably another 200 to 300 Docking WUs. Here is another example where I truly wish Docking would develop the capability to delete WUs that are broken even after they have been distributed.

Mark

googloo

Joined: Nov 30 09
Posts: 6
ID: 22204
Credit: 1,182,026
RAC: 0
Message 6945 - Posted 29 Oct 2012 19:27:38 UTC

FYI, managers: I aborted task
1m0b1htf_mod0014crossdockinghiv1_27177_94200_0
because of the 0 % completed error

and I aborted two tasks relevant to this thread because a number of similar tasks had Computation errors. All began with 1hv.

1hvl1htf_mod0014crossdockinghiv1_274_85478_0
1hvj1htf_mod0014crossdockinghiv1_24698_169775_0

adrianxw
Volunteer tester

Joined: Dec 30 06
Posts: 164
ID: 343
Credit: 1,669,741
RAC: 0
Message 6947 - Posted 30 Oct 2012 16:18:11 UTC
Last modified: 30 Oct 2012 16:30:45 UTC

0% progress after lengthy runs (up to 21 hours, for example), no new tasks set.

1m0b1htf_mod0014crossdockinghiv1_42034_236751_0 for example.

Looks like the task runs, ie. the "to completion" drops, but when it gets to the end, it doesn't stop.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw
Volunteer tester

Joined: Dec 30 06
Posts: 164
ID: 343
Credit: 1,669,741
RAC: 0
Message 6951 - Posted 30 Oct 2012 19:23:43 UTC - in response to Message ID 6947 .

Looks like the task runs, ie. the "to completion" drops, but when it gets to the end, it doesn't stop.


Actually, watching it, I'm not sure that is true. It looks like some "normal" wu's are coming also. These led me to suspect the case above, but I can see the percentage done increasing on those ones.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 6954 - Posted 1 Nov 2012 10:44:02 UTC
Last modified: 1 Nov 2012 10:54:34 UTC

ALL work is ending in computation error. Nothing works. What has gone wrong?
No response from the Project Administrators either. Can't they see that the error rate has skyrocketed?

See 34615435
also 34609862

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Calling BOINC init.
Starting charmm run (initial or from checkpoint)...
ERROR - Charmm exited with code 1.
Calling BOINC finish.
called boinc_finish

Plus 34636864
and 34615272

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Exit code 1282 (0x502)
</message>
<stderr_txt>
Calling BOINC init.
Starting charmm run (initial or from checkpoint)...



Exit code 1 work units start to run and fail after a short while.

Exit code 1282 work units fail almost straight away.
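
For anyone comparing their own results against the two failure modes described above, here is a minimal sketch that formats an exit code in decimal and hex and attaches the behaviour reported in this post. The names `OBSERVED_FAILURES` and `describe_exit_code` are made up for illustration, and the notes are a paraphrase of this post, not an official list of BOINC or CHARMM error codes.

```python
# Behaviour reported in this thread for each exit code (paraphrased, not official).
OBSERVED_FAILURES = {
    1: "starts to run, then fails after a short while",
    1282: "fails almost straight away",
}


def describe_exit_code(code: int) -> str:
    """Return the exit code in decimal and hex, plus the behaviour seen in this thread."""
    note = OBSERVED_FAILURES.get(code, "not reported in this thread")
    return f"exit code {code} ({hex(code)}): {note}"


if __name__ == "__main__":
    for code in (1, 1282):
        print(describe_exit_code(code))
```

The hex form matches what the client logs show: 1282 is 0x502 and 1 is 0x1.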

Conan
____________

googloo

Joined: Nov 30 09
Posts: 6
ID: 22204
Credit: 1,182,026
RAC: 0
Message 6956 - Posted 1 Nov 2012 15:42:00 UTC

Since this thread seems to be confounding the zero progress error with the computation error problem, note that I aborted task 1iiq1htf_mod0014crossdockinghiv1_26144_321209_0 after 11+ hours and zero progress.

Boyu Zhang
Forum moderator
Project administrator
Project developer
Project tester

Joined: May 5 10
Posts: 88
ID: 28821
Credit: 2,013,795
RAC: 0
Message 6957 - Posted 1 Nov 2012 15:59:08 UTC - in response to Message ID 6954 .

Dear Conan,

We are working on this problem. The transitioner selected some of the old workunits from the database and generated results to distribute. We have set the status to "not needed" for those results, and we are monitoring the situation.

Please abort the workunits that start with "1hv", "1m0b", or "1iiq". Sorry for the trouble this problem has caused!

Thanks a lot!
Boyu
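
If you have a long task list, the prefix check Boyu describes can be sketched in a few lines. This is only an illustration of the matching rule: the function name `tasks_to_abort` is hypothetical, the first three task names are taken from earlier posts in this thread, and the last name is invented to show a task that should be left alone. The sketch does not abort anything itself; that still has to be done in your BOINC client.

```python
# Prefixes named in Boyu's message above.
BAD_PREFIXES = ("1hv", "1m0b", "1iiq")


def tasks_to_abort(task_names, prefixes=BAD_PREFIXES):
    """Return the task names that begin with any of the given prefixes."""
    return [name for name in task_names if name.startswith(tuple(prefixes))]


names = [
    "1hvl1htf_mod0014crossdockinghiv1_274_85478_0",
    "1m0b1htf_mod0014crossdockinghiv1_27177_94200_0",
    "1iiq1htf_mod0014crossdockinghiv1_26144_321209_0",
    "1d4j1htf_example_0",  # hypothetical name, not from the thread
]

if __name__ == "__main__":
    for name in tasks_to_abort(names):
        print("abort:", name)
```

Note that the "1hv" prefix covers both the "1hvl" and "1hvj" tasks reported earlier.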


TheFiend

Joined: Apr 7 09
Posts: 70
ID: 9482
Credit: 20,705,527
RAC: 0
Message 6961 - Posted 1 Nov 2012 16:44:09 UTC

Thanks Boyu...... hopefully the WU problems are over.....

adrianxw
Volunteer tester

Joined: Dec 30 06
Posts: 164
ID: 343
Credit: 1,669,741
RAC: 0
Message 6965 - Posted 2 Nov 2012 15:24:35 UTC
Last modified: 2 Nov 2012 15:29:57 UTC

Another of these that crunches to "completion", ie. remaining changes to "---", but the elapsed goes on up. This is a "1d4j" unit.

This is probably a different problem. I have had several WUs now that run to completion: the elapsed time goes up and the remaining time goes down, but the percentage done remains at 0.000%, so you don't know until some time later if it is a duffer or just a badly behaved good...

This project has had a good number of "accidents" recently.

<edit>
And another. This, a "1ohr" unit.
</edit>
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Mark Rush

Joined: Feb 15 09
Posts: 4
ID: 7162
Credit: 5,779,850
RAC: 0
Message 6970 - Posted 2 Nov 2012 23:44:21 UTC - in response to Message ID 6957 .

Boyu:

It is at least passingly ironic that immediately after I made the first post, about the "broken" WUs, I noticed that Malaria Control had deleted another WU that had already been sent to me. If Docking had this capability, I would not have wasted another 60 hours on 2 more zero-progress Docking WUs. Those 60 hours would have completed 5 Docking WUs...if they had not been wasted running broken WUs.

Mark
____________

Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 6971 - Posted 3 Nov 2012 0:52:05 UTC
Last modified: 3 Nov 2012 1:26:38 UTC

Thanks Boyu,
I will have to wait for the work units to flush from my computer as I am not near that computer to abort them.

But glad you are monitoring the situation.

Conan

Edit:--@ Mark,
I feel for your loss as I have lost over 432 hours in just 4 work units due to the no progress problem (I think, as I am not at that computer)

I have reported 4 work units over the last week or so that had run times of up to 108 hours, then the Error message "Maximum Elapsed Time Exceeded" kills the work unit.

I am not near that computer so do not know if any progress was made or even if the CPU did anything with these work units.

WU         Run time (s)   Claim      Grant   Type
34246730   383,240.00     2,187.08   0.00    1ohr1htf_
34234226   391,618.80     2,234.90   0.00    1iiq1htf_
34621588   391,345.90     2,236.05   0.00    1iiq1htf_
34627009   391,779.20     2,238.51   0.00    1iiq1htf_

Run times add up to 1,557,983.90 seconds, or 432.77 hours, lost in just these 4 work units.
Very disappointing (would have liked the claimed points though: 8,896.54 total).
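
The totals above can be checked with a short sketch like the following. The run times and claimed credit are copied from the four work units listed in this post; the variable names are just for illustration.

```python
# Run time in seconds and claimed credit per work unit, as listed in this post.
work_units = {
    34246730: (383_240.00, 2_187.08),
    34234226: (391_618.80, 2_234.90),
    34621588: (391_345.90, 2_236.05),
    34627009: (391_779.20, 2_238.51),
}

total_seconds = sum(run_time for run_time, _ in work_units.values())
total_hours = total_seconds / 3600  # 3600 seconds per hour
total_claim = sum(claim for _, claim in work_units.values())

if __name__ == "__main__":
    print(f"total run time: {total_seconds:,.2f} s ({total_hours:.2f} h)")
    print(f"total claimed credit: {total_claim:,.2f}")
```

The longest single run, 391,779.20 seconds, works out to just under 109 hours, which is consistent with the "up to 108 hours" mentioned above.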

You can see how they all hit the maximum allowed time and then they get killed.

Conan
____________

Mark Rush

Joined: Feb 15 09
Posts: 4
ID: 7162
Credit: 5,779,850
RAC: 0
Message 6976 - Posted 3 Nov 2012 23:47:09 UTC - in response to Message ID 6971 .


Conan:

I did not know there was a maximum time. I worried that perhaps the computer I do monitor very frequently might churn away until the sun dies, working on a never-ending Docking WU. I'm interested/happy (?) to learn that there is a maximum... Thanks for the information!

Mark
____________

Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 6977 - Posted 4 Nov 2012 11:39:09 UTC
Last modified: 4 Nov 2012 11:40:00 UTC

Well, I just nudged past that 391,000 seconds to a new high of 399,551.20 seconds on WU 34646201, with another stuck work unit, type 1m0b1htf.

Hope they clear these out soon as I have gotten no credit for over a week but processed hundreds of hours of work for no result.

Conan
____________

Steve Hawker*

Joined: Oct 25 12
Posts: 3
ID: 68997
Credit: 90,427
RAC: 0
Message 7095 - Posted 22 Apr 2013 17:36:00 UTC - in response to Message ID 6957 .

Dear Conan,

We are working on this problem. The transitioner selected some of the old workunits from the database and generated results to distribute. We have set the status to "not needed" for those results, and we are monitoring the situation.

Please abort the workunits that start with "1hv", "1m0b", or "1iiq". Sorry for the trouble this problem has caused!

Thanks a lot!
Boyu


Six months on from this and I am still receiving 1hv WUs. My computer tried to crunch a whole bunch before I noticed.

Would be great if you could delete these from the queue and save us all a lot of bother.

Thanks!
