Shared Memory Page Problem


BobCat13
Volunteer tester

Joined: Nov 14 06
Posts: 22
ID: 239
Credit: 285,322
RAC: 0
Message 3128 - Posted 27 Apr 2007 3:53:39 UTC

The Host Distribution column doesn't look correct. <eg>

Over 4 billion hosts in the Other category.

Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3142 - Posted 27 Apr 2007 14:11:42 UTC - in response to Message ID 3128.

Thanks. We'll check it out.

AK

____________
D@H the greatest project in the world... a while from now!
Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3332 - Posted 16 May 2007 10:57:07 UTC

Please advise if I have misunderstood this Shared Memory and Host Distribution page.

In the first column, called 'Shared Memory HR Distribution' I find

Linux/Intel Family 6 =82, Linux/Intel Family 15 =1, Linux/AMD Family 6 =0, Linux/AMD Family 15 =0, Windows/Intel Family 6 =0, Windows/Intel Family 15 =0, Windows/AMD Family 6 =0, Windows/AMD Family 15 =0, Darwin/Intel Family 6 =141, Darwin/PPC =18. With 758 unassigned.

In the second column, called 'Database HR Distribution' I find

Linux/Intel Family 6 =82, Linux/Intel Family 15 =26, Linux/AMD Family 6 =5, Linux/AMD Family 15 =3, Windows/Intel Family 6 =78, Windows/Intel Family 15 =32, Windows/AMD Family 6 =11, Windows/AMD Family 15 =8, Darwin/Intel Family 6 =129, Darwin/PPC =21. With 795 unassigned.

In the third column, called 'Host Distribution Credit >1' I find

Linux/Intel Family 6 =21, Linux/Intel Family 15 =48, Linux/AMD Family 6 =51, Linux/AMD Family 15 =36, Windows/Intel Family 6 =173, Windows/Intel Family 15 =292, Windows/AMD Family 6 =169, Windows/AMD Family 15 =112, Darwin/Intel Family 6 =34, Darwin/PPC =17.
With 4,294,967,139 other Hosts.

Data as of 2030 Australian Eastern Standard Time.

From this I gather that in the Shared Memory pool there are work units waiting to be sent to the Linux/Intel Fam 6, Linux/Intel Fam 15, Darwin/Intel Fam 6 and Darwin/PPC operating system/CPU combinations. There are no currently allocated jobs waiting in Shared Memory for the other OS/CPU mixes.

I also see that in the HR database there are 82 jobs for L/I Fam 6 but only 21 Hosts available to take that work.
26 WUs for L/I Fam 15 and 48 Hosts
5 WUs for L/A Fam 6 and 51 Hosts
3 WUs for L/A Fam 15 and 36 Hosts
78 WUs for W/I Fam 6 and 173 Hosts
32 WUs for W/I Fam 15 and 292 Hosts
11 WUs for W/A Fam 6 and 169 Hosts
8 WUs for W/A Fam 15 and 112 Hosts
129 WUs for D/I Fam 6 and 34 Hosts
21 WUs for D/PPC and 17 Hosts.

So if there are 51 Linux/AMD Family 6 Hosts and only 5 WUs available, how do they get any work?
The same goes for the 292 Windows/Intel Family 15 Hosts with only 32 WUs available.

Am I reading this wrong?

Also, Memo's RAC will skyrocket if and when you harness the 4,294,967,139 'other' Hosts listed under the Host Distribution; that would floor even SETI's output.

____________

Aaron Finney
Volunteer tester

Joined: Mar 23 07
Posts: 74
ID: 367
Credit: 2,409,831
RAC: 0
Message 3333 - Posted 16 May 2007 11:24:45 UTC - in response to Message ID 3332.
Last modified: 16 May 2007 11:25:10 UTC


> Also, Memo's RAC will skyrocket if and when you harness the 4,294,967,139 'other' Hosts listed under the Host Distribution; that would floor even SETI's output.


I can answer that!

I believe that since 1991, all cell phones and toaster ovens have been preinstalled with the Docking@Home software. Unfortunately, the client isn't stable yet, but we'll keep you posted!
David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3335 - Posted 16 May 2007 14:56:08 UTC


@Conan

The shared memory pool is a subset of the database work unit pool which is kept in memory for fast access by multiple programs on the server.

The work units that have an HR (homogeneous redundancy) class listed have already been assigned to at least 1 computer and can therefore only be assigned to computers with the same HR class. Unassigned work units haven't been assigned to any computer yet, so they are available to any computer, but once one is assigned to a computer it gets that computer's HR class associated with it.

It's been a while since I looked into this so I'm skipping some of the details (or never knew them in the first place), but here's what happens when a computer asks for work. All of this takes place using the shared memory pool.

1. The server looks at the HR class of the computer asking for work and checks whether shared memory already holds work units that are assigned to that HR class and that:

A) Aren't already being crunched by enough computers. For example, only 2 computers have been assigned that work unit but the initial distribution specifies 3.
B) Haven't already been crunched by the computer asking for work.
C) Haven't already been crunched by a computer belonging to the same user as the computer asking for work. (Docking has this feature turned off because there are so few users in some HR classes.)
D) Don't have some special resource requirement that the computer asking for work doesn't meet (e.g. more than 512 MB of RAM).

If the server finds a work unit that is already assigned to the correct HR class and meets the above criteria, then that work unit is assigned to the computer asking for work.

2. If no work unit already assigned to the correct HR class meets the criteria to be sent to this computer, then one of the unassigned work units (again, special resource requirements are taken into account) is assigned to this HR class and sent to the computer that is asking for work. It remains in the database/shared memory and stays available for assignment to other computers until it meets its initial distribution. (I'm not sure of the implementation details, but it probably stays in there until the computers return their results.)

3. If the computer didn't get work in step 1 or step 2, then it gets the infamous "There was work but it was committed to other platforms" message.

I know that there's a lot more going on but this is the part that applies to your question. This process is repeated until the computer asking for work gets enough work units assigned to it.
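
The three steps above can be sketched roughly as follows. This is only an illustration of the flow as I described it, not the actual BOINC scheduler code: every name here (`WorkUnit`, `Host`, `eligible`, `pick_work`, `one_result_per_user`) is made up for the example, and the real scheduler does a lot more.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class WorkUnit:
    name: str
    target_copies: int               # the "initial distribution" count
    hr_class: Optional[str] = None   # None means not yet tied to an HR class
    min_ram_mb: int = 0              # hypothetical special resource requirement
    sent_to_hosts: Set[int] = field(default_factory=set)
    sent_to_users: Set[int] = field(default_factory=set)


@dataclass
class Host:
    host_id: int
    user_id: int
    hr_class: str
    ram_mb: int


def eligible(wu: WorkUnit, host: Host, one_result_per_user: bool = False) -> bool:
    """Criteria A-D from step 1."""
    if len(wu.sent_to_hosts) >= wu.target_copies:                 # A
        return False
    if host.host_id in wu.sent_to_hosts:                          # B
        return False
    if one_result_per_user and host.user_id in wu.sent_to_users:  # C (off by default)
        return False
    if host.ram_mb < wu.min_ram_mb:                               # D
        return False
    return True


def _send(wu: WorkUnit, host: Host) -> WorkUnit:
    wu.sent_to_hosts.add(host.host_id)
    wu.sent_to_users.add(host.user_id)
    return wu


def pick_work(shared_memory: List[WorkUnit], host: Host,
              one_result_per_user: bool = False) -> Optional[WorkUnit]:
    # Step 1: prefer a work unit already committed to this host's HR class.
    for wu in shared_memory:
        if wu.hr_class == host.hr_class and eligible(wu, host, one_result_per_user):
            return _send(wu, host)
    # Step 2: otherwise commit an unassigned work unit to this HR class.
    for wu in shared_memory:
        if wu.hr_class is None and eligible(wu, host, one_result_per_user):
            wu.hr_class = host.hr_class
            return _send(wu, host)
    # Step 3: "There was work but it was committed to other platforms".
    return None
```

In this sketch, a host whose HR class has few or no committed work units can still get work in step 2, as long as unassigned work units remain in shared memory.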

I'm not sure how the process of moving work units between the database and shared memory works.

I also know that the computer asking for work sends some information about its work queue, and the scheduler can decide, based on this information and other factors, not to send the computer work because it doesn't think the computer can finish the work and return it in time.

There's some really weird stuff that goes on in the scheduler that takes into account how often the computer contacts the server, the resources available on the computer (e.g. RAM), and the historical reliability of the computer asking for work.

Hope this makes sense :-)

Happy Crunching,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3337 - Posted 16 May 2007 15:08:29 UTC - in response to Message ID 3128.

> The Host Distribution column doesn't look correct. <eg>
>
> Over 4 billion hosts in the Other category.


That sounds suspiciously like a 32-bit signed integer with a negative value that is being displayed as an unsigned 32-bit integer.

For instance, a 32-bit signed integer with the value -1 would show up as 4,294,967,295 if it were displayed as a 32-bit unsigned integer. The value in hex would be 0xffffffff.
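
A quick way to check this kind of thing is to reinterpret the same 4 bytes both ways. The sketch below uses Python's standard `struct` module; the helper names `as_unsigned32` and `as_signed32` are made up for the example. If the figure reported earlier in the thread, 4,294,967,139, really is a wrapped 32-bit value, then the signed value behind it would be -157.

```python
import struct


def as_unsigned32(value: int) -> int:
    """Reinterpret the bits of a signed 32-bit integer as unsigned."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


def as_signed32(value: int) -> int:
    """Reinterpret the bits of an unsigned 32-bit integer as signed."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


print(as_unsigned32(-1))         # 4294967295
print(hex(as_unsigned32(-1)))    # 0xffffffff
print(as_signed32(4294967139))   # -157
```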

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3344 - Posted 16 May 2007 20:00:56 UTC - in response to Message ID 3337.

That's most likely what it is, but nobody has had time to fix it yet :-( It's in our Bugzilla system though!

AK

____________
D@H the greatest project in the world... a while from now!
