Incorrect Function 1


Brian Priebe

Joined: Oct 3 10
Posts: 3
ID: 33519
Credit: 4,529,494
RAC: 0
Message 6850 - Posted 1 Oct 2012 22:30:21 UTC

It seems this problem has reared its ugly head again. I am seeing dozens of WUs on a number of PCs all fail with:

Calling BOINC init.
Starting charmm run (initial or from checkpoint)...
ERROR - Charmm exited with code 1.
Calling BOINC finish.
called boinc_finish

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6853 - Posted 5 Oct 2012 3:10:34 UTC - in response to Message ID 6850 .
Last modified: 5 Oct 2012 3:22:46 UTC

same here :-(

I aborted all unstarted ones and set the project to No New Work (NNW).

p.s.: @Project : Exit 0 if you recognize this situation, and handle it on the server side. The program recognizes this situation and aborts the result on purpose, so it is not a crash or an invalid result; don't treat it like one.

Reeferman

Joined: Sep 25 12
Posts: 2
ID: 67677
Credit: 1,489
RAC: 0
Message 6854 - Posted 5 Oct 2012 20:06:56 UTC

I am also getting the same failure messages.

Lois Petrolito

Joined: Apr 1 12
Posts: 2
ID: 53545
Credit: 58,779
RAC: 0
Message 6855 - Posted 5 Oct 2012 22:26:58 UTC

I don't get "incorrect function". What I've been getting lately is "computation error" ?

Reeferman

Joined: Sep 25 12
Posts: 2
ID: 67677
Credit: 1,489
RAC: 0
Message 6856 - Posted 5 Oct 2012 23:50:06 UTC - in response to Message ID 6855 .

I don't get "incorrect function". What I've been getting lately is "computation error" ?



Yeah I get "computation error" on my results also, but when I click on the reported WU results, it lists the "incorrect function 1" error.
Lois Petrolito

Joined: Apr 1 12
Posts: 2
ID: 53545
Credit: 58,779
RAC: 0
Message 6857 - Posted 6 Oct 2012 4:06:47 UTC - in response to Message ID 6856 .

I don't get "incorrect function". What I've been getting lately is "computation error" ?



Yeah I get "computation error" on my results also, but when I click on the reported WU results, it lists the "incorrect function 1" error.


Now that I check that page, I DO get the same error. What's happening?
Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6858 - Posted 6 Oct 2012 8:54:26 UTC - in response to Message ID 6857 .
Last modified: 6 Oct 2012 9:10:34 UTC

...
Now that I check that page, I DO get the same error. What's happening?

If I remember right, somewhere in the middle of a calculation it finds that one value is out of bounds and exits with a status value of 1.

AFAIK, the translation of "1" into "incorrect function" is a BOINC thing and doesn't really reflect the reason for this error.
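For illustration, here is a minimal Python sketch of how that label can arise. It assumes the exit status is looked up as if it were a Windows system error code; on Windows, code 1 is ERROR_INVALID_FUNCTION, whose standard message is "Incorrect function." The table is a small illustrative subset and the function name `describe_exit_status` is made up for this example; it is not BOINC's actual code.

```python
# A few well-known Windows system error codes and their standard messages.
# Illustrative subset only; this is not BOINC's real lookup table.
WIN32_MESSAGES = {
    0: "The operation completed successfully.",
    1: "Incorrect function.",
    2: "The system cannot find the file specified.",
    5: "Access is denied.",
}


def describe_exit_status(code: int) -> str:
    """Label an exit status as if it were a Windows system error code."""
    message = WIN32_MESSAGES.get(code, "Unknown error.")
    return f"{message} (0x{code:x}) - exit code {code} (0x{code:x})"


# The application only said "exit 1"; the wording comes from the lookup.
print(describe_exit_status(1))
```

Under that assumption, the message text says nothing about what went wrong inside the science application; it only restates the number 1.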

When they start the calculation, they cannot tell if it is a candidate for this type of error but I still think they should treat it as a valid result on BOINC-side, as no technical error led to the error. It could very well be handled when it's transferred (or not transferred in this case) from BOINC into the scientific database instead of filling our task lists with error results.

p.s.: Here's an older thread about the same problem; it is a known problem, which is why Brian wrote "again".
der_Day

Joined: Jan 16 10
Posts: 10
ID: 24434
Credit: 1,922,000
RAC: 0
Message 6859 - Posted 7 Oct 2012 11:08:14 UTC

I've also some of these buggy WUs...

Since today I have been getting some WUs with an estimated time of 10 hours (normally 3h)! And it seems that they have no checkpoints; is this normal? example wu
Is it helpful to reset the project?

Andreas38871

Joined: Jan 8 09
Posts: 2
ID: 5693
Credit: 8,459
RAC: 0
Message 6860 - Posted 7 Oct 2012 11:55:43 UTC
Last modified: 7 Oct 2012 11:56:15 UTC

I have the same problem: no checkpoints. After 3 hours and 30 minutes, the remaining time is no longer displayed. That is rather bad if the computer ever has to be shut down.
Andreas

der_Day

Joined: Jan 16 10
Posts: 10
ID: 24434
Credit: 1,922,000
RAC: 0
Message 6861 - Posted 7 Oct 2012 14:59:45 UTC - in response to Message ID 6860 .

I have the same problem: no checkpoints. After 3 hours and 30 minutes, the remaining time is no longer displayed. That is rather bad if the computer ever has to be shut down.
Andreas

I looked in the slot folder (for example d:\Boinc\Project_Data\slots\) of the broken WUs and saw that several files are missing. I cancelled these jobs.
Boyu Zhang
Forum moderator
Project administrator
Project developer
Project tester

Joined: May 5 10
Posts: 88
ID: 28821
Credit: 2,013,795
RAC: 0
Message 6875 - Posted 9 Oct 2012 2:24:00 UTC

Over the past weekend, the disk space on the D@H server filled up and, as a result, the server sent out some incomplete workunits. Please abort workunits with "1iiq1hih" or "1ohr1hih" in the name. The server is now back to normal.

Thanks for letting us know, and please bear with us during the difficulty!

Boyu
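A small Python sketch of picking the affected tasks out of a local queue follows. It assumes the series identifier is the leading part of the task name, as in the names posted in this thread. The queue contents are made up for illustration, apart from the series identifiers, and `tasks_to_abort` is a hypothetical helper, not part of any BOINC tool.

```python
# Series identifiers named in the moderator's post above.
BAD_PREFIXES = ("1iiq1hih", "1ohr1hih")


def tasks_to_abort(task_names, bad_prefixes=BAD_PREFIXES):
    """Return the task names that start with one of the bad series identifiers."""
    prefixes = tuple(bad_prefixes)  # str.startswith accepts a tuple of prefixes
    return [name for name in task_names if name.startswith(prefixes)]


# Hypothetical queue: the numeric suffixes are invented for this example.
queue = [
    "1iiq1hih_mod0014crossdockinghiv1_00001_000001_0",
    "1ohr1hih_mod0014crossdockinghiv1_00002_000002_0",
    "1ohr1htf_mod0014crossdockinghiv1_00003_000003_0",  # different series (htf)
]

for name in tasks_to_abort(queue):
    print("abort:", name)
```

Matching on the full identifier keeps "1ohr1htf" out of the list, since it only shares its first four characters with "1ohr1hih".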

speechless

Joined: Nov 24 11
Posts: 5
ID: 46818
Credit: 1,075,787
RAC: 0
Message 6879 - Posted 9 Oct 2012 13:36:25 UTC - in response to Message ID 6875 .

During the past weekend, the space on D@H server is getting filled up and as a result, the server sent out some incomplete workunits, please abort workunits with name "1iiq1hih" or "1ohr1hih". Currently, the server is back to normal again.

Thanks for letting us know and bear with us during the difficulty!

Boyu



I have the 0 % problem on 3 WU that start with 1m0b1htf, some work just fine, though. What should I do?
Aaron Finney
Volunteer tester

Joined: Mar 23 07
Posts: 74
ID: 367
Credit: 2,409,831
RAC: 0
Message 6880 - Posted 9 Oct 2012 14:38:09 UTC - in response to Message ID 6879 .

During the past weekend, the space on D@H server is getting filled up and as a result, the server sent out some incomplete workunits, please abort workunits with name "1iiq1hih" or "1ohr1hih". Currently, the server is back to normal again.

Thanks for letting us know and bear with us during the difficulty!

Boyu



I have the 0 % problem on 3 WU that start with 1m0b1htf, some work just fine, though. What should I do?



I also have the 0% problem with 3 WU that start with "1m0b1htf", and also 3 that start with "1ohr1htf". They are over 20 hours and counting, noticed it this morning.
Boyu Zhang
Forum moderator
Project administrator
Project developer
Project tester

Joined: May 5 10
Posts: 88
ID: 28821
Credit: 2,013,795
RAC: 0
Message 6881 - Posted 9 Oct 2012 15:11:42 UTC - in response to Message ID 6879 .

Please abort the ones with 0% progress.

Thanks!

During the past weekend, the space on D@H server is getting filled up and as a result, the server sent out some incomplete workunits, please abort workunits with name "1iiq1hih" or "1ohr1hih". Currently, the server is back to normal again.

Thanks for letting us know and bear with us during the difficulty!

Boyu



I have the 0 % problem on 3 WU that start with 1m0b1htf, some work just fine, though. What should I do?

TheFiend

Joined: Apr 7 09
Posts: 70
ID: 9482
Credit: 20,705,527
RAC: 0
Message 6887 - Posted 9 Oct 2012 19:08:29 UTC

Just downloaded some 1ebz1hih_mod0014crossdockinghiv1 and they are coming up with the 0% progress problem... Aborting them

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6891 - Posted 10 Oct 2012 6:58:30 UTC
Last modified: 10 Oct 2012 7:13:46 UTC

Two separate problems I think ...

The tasks that return with exit code 1 do show progress; they seem to run at quite a normal speed (judging by the % display) and then exit before reaching 100%.

Examples :

1hvi1htf_mod0014crossdockinghiv1_26830_396694_0
1hvi1htf_mod0014crossdockinghiv1_26823_446111_0
1hvi1htf_mod0014crossdockinghiv1_26819_164747_0
1hvi1htf_mod0014crossdockinghiv1_26714_324070_0

so the problem started with 1hvi1htf_... (for me)

p.s.: I checked the ones of Brian, the thread starter : 1dif1htf_ is what I found there, so it's not restricted to the series that caused the problem for me.

p.p.s.: Maybe an x64 Windows issue? Brian's fleet runs this OS type and my box does too.
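To check whether failures cluster in particular series, results can be tallied by the leading identifier of the task name. The Python sketch below assumes that identifier is everything before the first underscore. The four failing names are the examples listed above; the "1abc1htf" entries are invented for contrast, and both function names are hypothetical.

```python
from collections import Counter


def series_of(task_name: str) -> str:
    """Return the series identifier: the part before the first underscore."""
    return task_name.split("_", 1)[0]


def failure_rate_by_series(results):
    """results: iterable of (task_name, succeeded) pairs -> {series: failure rate}."""
    total = Counter()
    failed = Counter()
    for name, succeeded in results:
        series = series_of(name)
        total[series] += 1
        if not succeeded:
            failed[series] += 1
    return {series: failed[series] / total[series] for series in total}


results = [
    # The four failed tasks listed in the post above.
    ("1hvi1htf_mod0014crossdockinghiv1_26830_396694_0", False),
    ("1hvi1htf_mod0014crossdockinghiv1_26823_446111_0", False),
    ("1hvi1htf_mod0014crossdockinghiv1_26819_164747_0", False),
    ("1hvi1htf_mod0014crossdockinghiv1_26714_324070_0", False),
    # Invented entries from a made-up series, for contrast only.
    ("1abc1htf_mod0014crossdockinghiv1_00001_000001_0", True),
    ("1abc1htf_mod0014crossdockinghiv1_00002_000002_0", True),
]

for series, rate in failure_rate_by_series(results).items():
    print(f"{series}: {rate:.0%} failed")
```

A series with a high failure rate across several hosts points at the workunits rather than at one machine's OS or hardware.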

Simba123

Joined: Dec 7 11
Posts: 23
ID: 47237
Credit: 2,607,800
RAC: 0
Message 6916 - Posted 14 Oct 2012 5:55:58 UTC

Hello,
I also am seeing a lot of these 'computational error' workunits.

The latest failures all seem to be coming from the 1ohr1htf series of workunits.

TheFiend

Joined: Apr 7 09
Posts: 70
ID: 9482
Credit: 20,705,527
RAC: 0
Message 6917 - Posted 14 Oct 2012 10:08:46 UTC
Last modified: 14 Oct 2012 10:21:40 UTC

Out of the current batch of 1ohr1htf units I've had 7 errors out of 121 crunched so far and those were limited to 1 PC - a 1090T hex core.

EDIT: As I have recently been tweaking core voltages on my 1090T, I have upped the voltage a notch to see if the errors stop occurring on the 1090T.

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6927 - Posted 19 Oct 2012 21:18:09 UTC

After a little timeout ... currently the results seem to run much better, no "exit 1" error; my last five went through flawlessly.

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6928 - Posted 21 Oct 2012 0:09:52 UTC - in response to Message ID 6927 .
Last modified: 21 Oct 2012 0:12:34 UTC

After a little timeout ... currently the results seem to run much better, no "exit 1" error, my last five went through flawless.

That was too early :-(
After 25 valid 1t7k1htf, 1dif1htf_mod0014crossdockinghiv1_23875_162815 failed with -1
There always seem to be certain series that have this flaw, other series are completely unaffected.
Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6929 - Posted 21 Oct 2012 19:29:31 UTC - in response to Message ID 6917 .
Last modified: 21 Oct 2012 19:46:12 UTC

Out of the current batch of 1ohr1htf units I've had 7 errors out of 121 crunched so far and those were limited to 1 PC - a 1090T hex core. ...

Your box started working fine when you ran out of 1ohr1htf and received 1t7k1htf instead. I doubt that it is a voltage issue, it would have done that with the lower voltage too I bet. (You had way more than 7 errors btw. and all only in certain series)

1dif1htf (those that caused problems for me) fail on your box too and on all other hosts I checked.

Mine is a Xeon L5520, standard voltages and frequencies.

@project : Again ... as this "ERROR - Charmm exited with code 1." is a program-controlled exit, you should "exit 0" and set a flag that the result ran into a condition where further processing doesn't make sense for scientific reasons. On the BOINC side it should be successful. Compare it to a prime project: there, the numbers that turn out not to be prime do not exit with an error either.
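A rough Python sketch of that suggestion follows. It is not the project's actual wrapper: `run_science_app`, the status strings, and the stand-in command are all hypothetical, and it assumes that exit code 1 always means the deliberate abort described in this thread, which has not been confirmed by the project.

```python
import json
import subprocess
import sys

# Assumption: the science app uses exit code 1 only for its deliberate abort.
CONTROLLED_ABORT = 1


def run_science_app(cmd):
    """Run the app and return (exit code to report, record for the science side)."""
    completed = subprocess.run(cmd)
    if completed.returncode == 0:
        return 0, {"status": "completed"}
    if completed.returncode == CONTROLLED_ABORT:
        # Deliberate stop: report success, but flag the result so the
        # scientific database can skip it.
        return 0, {"status": "rejected_by_app"}
    # Anything else is a genuine failure and keeps its original exit code.
    return completed.returncode, {"status": "crashed"}


# Stand-in for the real application: a child process that exits with code 1.
fake_app = [sys.executable, "-c", "import sys; sys.exit(1)"]
code, record = run_science_app(fake_app)
print(code, json.dumps(record))
```

With this split, the task list would show a success, while the flag tells the science side that the run was rejected.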
TheFiend

Joined: Apr 7 09
Posts: 70
ID: 9482
Credit: 20,705,527
RAC: 0
Message 6931 - Posted 22 Oct 2012 10:54:09 UTC - in response to Message ID 6929 .

Out of the current batch of 1ohr1htf units I've had 7 errors out of 121 crunched so far and those were limited to 1 PC - a 1090T hex core. ...

Your box started working fine when you ran out of 1ohr1htf and received 1t7k1htf instead. I doubt that it is a voltage issue, it would have done that with the lower voltage too I bet. (You had way more than 7 errors btw. and all only in certain series)

1dif1htf (those that caused problems for me) fail on your box too and on all other hosts I checked.



Came to the conclusion it was dodgy WUs and dropped the voltage again a couple of days ago.


My 1055T has just had a few 1dif units error out so aborting all 1dif on both crunchers.
Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6937 - Posted 25 Oct 2012 17:03:35 UTC

1hvi1htf is another buggy series, I'll abort all I get from this type.

robertmiles

Joined: Apr 16 09
Posts: 96
ID: 9967
Credit: 1,290,747
RAC: 0
Message 6938 - Posted 26 Oct 2012 0:29:00 UTC

A large fraction, but not all, of my 1hvi1htf workunits are now giving this compute error:

Incorrect function. (0x1) - exit code 1 (0x1)

Could you investigate why, and what should be done about this?

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6939 - Posted 26 Oct 2012 1:54:44 UTC
Last modified: 26 Oct 2012 2:35:33 UTC

Add 1hvj1htf and 1hvk1htf to the badlist

Ananas

Joined: Aug 29 09
Posts: 56
ID: 17736
Credit: 2,500,425
RAC: 0
Message 6940 - Posted 26 Oct 2012 19:57:42 UTC

hmmmm ... the non-1hv*** results are getting rare, maybe it's time for another timeout. The last one had been caused by series having tons of "ERROR - Charmm exited with code 1." but no one has fixed it since then and no one seems to care about tons of crashing results at all.

ZapSSD

Joined: Jul 17 12
Posts: 1
ID: 63092
Credit: 206,746
RAC: 0
Message 6941 - Posted 26 Oct 2012 22:14:31 UTC

Well, the problem has lasted almost a month now, and indeed the project leaders don't seem to be bothered at all. I think I will detach from this project permanently.
