When a work unit fails, the computer doesn't keep going...


anthonmg

Joined: Apr 11 09
Posts: 64
ID: 9657
Credit: 17,959,472
RAC: 0
Message 5058 - Posted 2 Jun 2009 17:14:24 UTC

I came in this morning to find all my computers were giving this error:

charmm34_6.15_windows_intelx86 has encountered a problem and needs to close. We are sorry for the inconvenience.

The tasks giving this error had been running much longer than normal (14 hours instead of 3 hours). Even worse, they didn't just accept the error and move on to the next work unit. Thus, all of my processors have been sitting idle for a really long time.

Is the source of these failing work units known, and is there some way, when one fails, for the program to just move on to the next unit? It's going to be a while before the RAC recovers from this (16 processors down for a significant chunk of time and returning only a handful of errored work units that don't receive credit).

anthonmg

Joined: Apr 11 09
Posts: 64
ID: 9657
Credit: 17,959,472
RAC: 0
Message 5070 - Posted 8 Jun 2009 5:39:15 UTC

So, I just got back from a 3-day trip to find that a few hours after I left, 5 jobs failed on one of my computers due to the error above (acmed returned an error). Until I acknowledge the error, it doesn't move on. I only noticed because, looking at my laptop tonight, I found that my RAC had been steadily falling over the last few days while I was away. Going through the various computers attached to this project, I found the error, and they're running again. Is there a reason for this that can be corrected? One upside to a distributed client that runs in the background is that it should run without needing constant monitoring. Over the last few months, any time I went out of town for a few days, some error like this, or the never-ending work units, etc., cropped up and massive time was lost (in this case ~100 work units). Is there anything that can be done to reduce the amount of maintenance that being a part of this project requires?

David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 5079 - Posted 12 Jun 2009 11:15:34 UTC

I've never seen anything like that from docking WUs, even when they fail... It sounds like an operating-system-generated message...

What version of Windows are you using on your crunchers?

What version of the BOINC client software?
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Steven Meyer

Joined: May 26 09
Posts: 23
ID: 12091
Credit: 130,335
RAC: 0
Message 5089 - Posted 25 Jun 2009 15:45:47 UTC - in response to Message ID 5058.
Last modified: 25 Jun 2009 16:02:47 UTC

I came in this morning to find all my computers were giving this error:

charmm34_6.15_windows_intelx86 has encountered a problem and needs to close. We are sorry for the inconvenience.

The tasks giving this error had been running much longer than normal (14 hours instead of 3 hours). Even worse, they didn't just accept the error and move on to the next work unit. Thus, all of my processors have been sitting idle for a really long time.

Is the source of these failing work units known, and is there some way, when one fails, for the program to just move on to the next unit? It's going to be a while before the RAC recovers from this (16 processors down for a significant chunk of time and returning only a handful of errored work units that don't receive credit).


This message,
"such-and-so-program has encountered a problem and needs to close",
is from the Windows operating system, and thus is outside the control of BOINC or the client programs from D@H. The operating system puts the dialog box on the screen to inform you that the program has failed and then waits for you to respond by clicking the "OK" button. (IMO this should really be a "Bummer" button!) In any case, the operating system is really patient and will wait forever for you to respond.
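For anyone curious how a supervising program could keep going when a task gets stuck behind a dialog like this, here is a minimal sketch in Python. It is not how BOINC or the D@H application works; the function names (`suppress_crash_dialogs`, `run_task`, `run_queue`) and the time limit are made up for illustration. It relies on two things: the Win32 `SetErrorMode` call, which asks Windows not to show the crash dialog for the calling process (child processes normally inherit that setting), and a plain timeout that kills a task that has run too long so the next one can start.

```python
import subprocess
import sys

# Win32 error-mode flags (values from the Windows API headers).
SEM_FAILCRITICALERRORS = 0x0001
SEM_NOGPFAULTERRORBOX = 0x0002


def suppress_crash_dialogs():
    """On Windows, ask the OS not to pop up the crash dialog.

    Returns True if the setting was applied, False on other platforms.
    """
    if sys.platform != "win32":
        return False
    import ctypes
    ctypes.windll.kernel32.SetErrorMode(
        SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX
    )
    return True


def run_task(cmd, limit_seconds):
    """Run one task and report 'ok', 'failed', or 'timed_out'."""
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(cmd, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        return "timed_out"
    return "ok" if result.returncode == 0 else "failed"


def run_queue(commands, limit_seconds):
    """Run every task in turn; one bad task never blocks the rest."""
    suppress_crash_dialogs()
    return [run_task(cmd, limit_seconds) for cmd in commands]
```

The time limit is the important part: even if the dialog still appears, a task that overruns its limit (say, 14 hours for a 3-hour job) gets killed and the queue moves on.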

Although the message box is outside the control of D@H, I would think that the program developers at D@H will be interested in why their program raised such an error on so many computers at about the same time since it is likely that there is a bug in their code.

The other possibility is that some other program that is running on all of your computers is the cause of the failure by stepping on something needed by the D@H program.

Note: That other program could be a computer virus or worm. Do be sure to check your computers for infections.

Can you think of something else that was running at the same time on all of those computers?

Note too: That other program could be your virus scanner! It could be, for example, that the virus scanner opens a file with an exclusive lock in order to scan it, which prevents other programs from opening the file until the scanner is done with it. If the D@H code tries to open the file during that time and does not handle the failure, the error can be raised to the operating system and may result in the message that you saw.
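To illustrate what "handling the failure to open the file" could look like, here is a small Python sketch. It is only an assumption about one reasonable approach, not the D@H code: the helper name `open_with_retry`, the number of attempts, and the delays are all hypothetical. The idea is to treat a failed open as possibly temporary, wait a little, and try again before giving up.

```python
import time


def open_with_retry(path, mode="r", attempts=5, delay=0.5, opener=open):
    """Open a file, retrying if it is temporarily locked.

    `opener` defaults to the built-in open(); it is a parameter only so
    the behaviour can be demonstrated without a real locked file.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return opener(path, mode)
        except OSError as err:
            # A file held under an exclusive lock typically shows up
            # here as PermissionError, a subclass of OSError.
            last_error = err
            if attempt < attempts - 1:
                # Wait longer after each failure: delay, 2*delay, 4*delay...
                time.sleep(delay * (2 ** attempt))
    # Still locked after every attempt: report the failure to the
    # caller instead of crashing with an unhandled error.
    raise last_error
```

A caller would wrap this in its own `try`/`except OSError` and mark the work unit as failed cleanly if the file never becomes available.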
anthonmg

Joined: Apr 11 09
Posts: 64
ID: 9657
Credit: 17,959,472
RAC: 0
Message 5105 - Posted 3 Jul 2009 20:09:24 UTC

Thanks for the various posts and thoughts about this. It's Windows XP Professional, SP3, and on the machine that's most afflicted with this error, the BOINC client is version 6.6.36, though it has happened with other versions of the client. Interesting thought about the virus scanner. That's beyond my control, as the sys-admins set all that up and we can't modify it. I'd also be surprised to hear that the virus software locks files while reading them, since that would be really bad for the operating system in general.

The computer is also my work machine, so yes, there are often other programs running, which hopefully a good distributed client would be able to deal with since it's supposed to be in the background. If I'm leaving for a trip I usually close most of the programs, but I do tend to use the computer remotely through Remote Desktop. I do know that Remote Desktop causes problems with GPU clients (that's why I stopped contributing to GPUgrid: logging in remotely would crash the GPU-based work units, and I have to log in remotely frequently).

I did also notice that the work units that were in progress when I got the Acmed error had already failed once or twice on someone else's computer, so they were certainly work units with issues.
