workunit that reached the max # of error


Profile suguruhirahara
Forum moderator
Volunteer tester

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1495 - Posted 19 Nov 2006 5:44:00 UTC

How does the server deal with a workunit that has reached the maximum number of errors?

In my "pending credit" list there are more than 10 workunits that each have 3 compute-error results. I assumed at first that these would be canceled because of too many errors, but each workunit still has 1 unsent result and stays 'alive'.

For example, http://docking.utep.edu/workunit.php?wuid=13518

47266 Client error Compute error 977.96 1.06 ---
47267 Client error Compute error 1,363.61 1.88 ---
47268 Success Done 48,849.17 28.08 pending
47550 Success Done 13,677.59 14.74 pending
48170 Client error Compute error 875.77 1.33 ---
48201 Unsent Unknown New --- --- ---

The last one is an unsent result.

How will workunits like that be treated? Will another copy of each be issued, or will they be canceled?

thanks for reading,
suguruhirahara
____________

I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 1502 - Posted 19 Nov 2006 19:05:09 UTC - in response to Message ID 1495 .

I think the server will wait for that last one to be sent out and returned and then close that workunit.

Andre

____________
D@H the greatest project in the world... a while from now!
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 1503 - Posted 19 Nov 2006 19:06:44 UTC
Last modified: 19 Nov 2006 19:10:32 UTC

Hello Suguruhirahara,

Both my Linux and Windows XP machines have been getting the notorious "Docking@Home|Message from server: (there was work but it was committed to other platforms)" message for a couple of days, so either the server isn't sending that result out, or the problem of the memory buffer on the Docking@Home BOINC server filling up with another type of work unit (probably Mac) is back. We really need more Intel Macs on this project, but I've already suggested Docking@Home to everyone I know who has one.

BTW, I doubt that work unit will validate even if a third result is returned. The two machines that have pending credit are a Pentium III and a Pentium D. I don't think they've solved the problem of the P-III giving different results yet.

Does anyone know how many OS/CPU combinations are being grouped for work units now? Is it just Linux / Intel Mac / Windows?

OT: I hope you don't mind this westerner's ignorance, but is "Suguruhirahara" one name or two ("Suguru Hirahara")? I've noticed some people just calling you Suguru.

-- David

EDIT: Oops, Andre and I responded at the same time, with different answers.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 1507 - Posted 19 Nov 2006 22:46:35 UTC - in response to Message ID 1503 .
Last modified: 19 Nov 2006 23:03:41 UTC

Hi David,

At the moment we (and BOINC) have the following configured for homogeneous redundancy (HR):

Windows/Intel
Windows/AMD
Linux/Intel
Linux/AMD
Mac/Intel
Mac/PPC (not used yet)

We have found that this works for World Community Grid and Predictor@home, but it doesn't seem to work for us: P4s and PIII/PIIs give different results (the PIII/PII results match the AMDs, as Memo found), and even some of the Intel Macs seem to give different results (maybe because they have different Intel architectures?). This means we have to change our HR classes, and thus the BOINC source code, because this is not configurable yet. Maybe we should ask the BOINC devs to make it more configurable.
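
To illustrate the grouping idea, here is a minimal Python sketch. It is not BOINC's code: the host dictionaries, the `family` field, the host names, and the function names are all assumptions made up for this example. It only shows how a coarse OS/vendor class puts a P4 and a PIII together, while a hypothetical finer class would keep them apart.

```python
from collections import defaultdict

# The six OS/CPU-vendor classes listed above.
HR_CLASSES = {
    ("Windows", "Intel"), ("Windows", "AMD"),
    ("Linux", "Intel"), ("Linux", "AMD"),
    ("Mac", "Intel"), ("Mac", "PPC"),
}


def coarse_class(host):
    """Classify a host by OS and CPU vendor only (the current scheme)."""
    key = (host["os"], host["vendor"])
    if key not in HR_CLASSES:
        raise ValueError(f"no HR class for {key}")
    return key


def fine_class(host):
    """Hypothetical refinement: also split Intel hosts by CPU family."""
    os_name, vendor = coarse_class(host)
    if vendor == "Intel":
        return (os_name, vendor, host["family"])
    return (os_name, vendor)


def group_hosts(hosts, classify):
    """Group host names by the class that `classify` assigns to each host."""
    groups = defaultdict(list)
    for host in hosts:
        groups[classify(host)].append(host["name"])
    return dict(groups)


# Made-up hosts for illustration only.
hosts = [
    {"name": "p4-box", "os": "Windows", "vendor": "Intel", "family": "P4"},
    {"name": "p3-box", "os": "Windows", "vendor": "Intel", "family": "PIII"},
    {"name": "sempron-box", "os": "Linux", "vendor": "AMD", "family": "Sempron"},
]

coarse = group_hosts(hosts, coarse_class)
fine = group_hosts(hosts, fine_class)
print(coarse)  # P4 and PIII share one Windows/Intel class
print(fine)    # P4 and PIII end up in separate classes
```

With the coarse scheme, results from the P4 and the PIII would be compared against each other, which is where the mismatches described above would show up.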

Hope this helps.
Andre

[EDIT] I added this to the FAQ as well.



Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 1513 - Posted 20 Nov 2006 7:06:15 UTC - in response to Message ID 1507 .
Last modified: 20 Nov 2006 7:11:32 UTC

Thanks for the info. I can see why you might have trouble on the Intel Macs. From what I can tell, some are Xeons, some are the older Core Solo/Duo (upgraded Pentium-M on 65 nm process), and some are the new Core 2 Duo (not sure if they're the desktop or mobile version). When looking through the Mac results a few weeks ago, I noticed a Mac system with very low benchmarks. IIRC, Measured FP and Integer speed for that Mac were both "10 million ops/sec", while even my Celeron 2.4 gets FP: "645.16 million ops/sec" and Integer: "1214.19 million ops/sec". I figured that had to be either an emulator or a really weird boinc client. The Mac's speed was shown as "10" and not "10.00".

BTW, it's my Windows/Intel (Celeron 2.3 GHz) and my Linux/AMD (Socket A Sempron 2500+) boxes that haven't been able to get work for a couple of days. I'm trying to think of something new to try on the RedHat RHEL3 (Celeron 2.4 GHz) box, since the unlimited stack doesn't seem to have fixed it. Is there any way of making debug info available about what the app was doing when it failed?

-- David
EDIT: noted RHEL3 box is Celeron. BTW, work fetch is suspended on that box until I can set up for trying something different.
Profile suguruhirahara
Forum moderator
Volunteer tester

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1515 - Posted 20 Nov 2006 10:08:21 UTC - in response to Message ID 1502 .
Last modified: 20 Nov 2006 12:36:47 UTC

Hi andre,

I think the server will wait for that last one to be sent out and returned and then close that workunit.

I had understood that a workunit is canceled once the number of error results reaches 3. Does it in fact mean that the workunit is canceled only when the number of errors 'exceeds' 3? I'm afraid there is some kind of trouble with the workunit, since another copy of the result is still unsent. Or is that simply because only a few hosts are crunching results so far?

[Edit2: I also noticed that the unsent result is being passed over and not sent, while later results are distributed... Will such unsent results really be distributed in the future? (I ask even though I'm not familiar with how results are generated and distributed.)]
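
To make the two readings of the error limit concrete, here is a minimal Python sketch. The function name and the limit of 3 are assumptions for illustration only; this is not BOINC's actual code, and which comparison the server really uses is exactly the open question.

```python
def is_errored_out(n_errors: int, max_errors: int, strict: bool) -> bool:
    """Return True if the workunit should be given up on.

    strict=False: cancel once the error count reaches the limit.
    strict=True:  cancel only once the error count exceeds the limit.
    """
    if strict:
        return n_errors > max_errors
    return n_errors >= max_errors


# Outcomes of the six results of workunit 13518, as listed in the first post.
outcomes = ["error", "error", "success", "success", "error", "unsent"]
n_errors = outcomes.count("error")

MAX_ERRORS = 3  # assumed limit; the real value is set by the project

print(is_errored_out(n_errors, MAX_ERRORS, strict=False))  # "reaches" reading
print(is_errored_out(n_errors, MAX_ERRORS, strict=True))   # "exceeds" reading
```

Under the "exceeds" reading, a workunit with exactly 3 errors stays alive and still needs its unsent result, which would match what the workunit page shows.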

thanks for reading,
suguruhirahara

Edit: to Dave
Actually, some people call me Suguru. Of course that's no problem :) Who could imagine a really long given/family name like that? lol
Profile suguruhirahara
Forum moderator
Volunteer tester

Joined: Sep 13 06
Posts: 282
ID: 15
Credit: 56,614
RAC: 0
Message 1516 - Posted 20 Nov 2006 10:43:37 UTC - in response to Message ID 1513 .

I was able to get work for my Linux virtual machines (Linux/Intel). Both Vine Linux, which is based on Red Hat, and SUSE have received tens of results. However, I couldn't get work for Windows.

I'm not sure whether the unlimited stack setting works or not, but I assume it has been discussed in this forum.

