There is work but it is committed to other platforms


Profile Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 2368 - Posted 1 Feb 2007 8:13:04 UTC

Docking Team, going on the above message I am getting on my Linux machine, you have run out of work.
At the time of writing the Server Status said 990 results ready to send then increased to 1000 but still none available for my computer.

"Message from server: (there was work but it is committed to other platforms)"
"No work from project"
____________

(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2370 - Posted 1 Feb 2007 16:52:03 UTC - in response to Message ID 2368 .
Last modified: 1 Feb 2007 16:52:35 UTC


"Message from server: (there was work but it is committed to other platforms)"
"No work from project"


I have the same problem here with my AMD K6-III currently. I guess the workunits available are committed to other platforms due to Homogeneous Redundancy. However, it is a bit odd, that your Opteron/Linux is not getting any work, because there are quite a number of compatible AMD K7, AMD K8, Intel PII and PIII. At least they were available... I hope they are not all switched off / detached now...

So the only thing we can do, I suppose, is keep them polling and crunch for other projects in the meantime. My K6-III is doing a little bit of ABC and SIMAP currently, as slow as it is.

Regards

Alex
____________
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2371 - Posted 1 Feb 2007 18:42:32 UTC - in response to Message ID 2368 .

Docking Team, going on the above message I am getting on my Linux machine, you have run out of work.
At the time of writing the Server Status said 990 results ready to send then increased to 1000 but still none available for my computer.

"Message from server: (there was work but it is committed to other platforms)"
"No work from project"


I have a total of six WUs on eight Linux CPUs.............kinda dry.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2372 - Posted 1 Feb 2007 20:30:30 UTC

Not much we can do about this at the moment. I suspect that Alex is right and a lot of machines might not be available anymore. I guess this is a disadvantage of using HR: we need enough machines per class. We will try to come up with some way of showing what is in the shared memory and to which type of HR class workunits are assigned so that it will be easier for you guys to see what is going on.

Thanks
Andre
____________
D@H the greatest project in the world... a while from now!

Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2380 - Posted 2 Feb 2007 5:00:39 UTC


All three of my machines have been seeing that.

1) P4 Northwood Celeron - WinXP
2) Sempron 2500+ - Ubuntu Linux
3) Sempron 3100+ - WinXP

All three are running standard 5.4.11 boinc core clients. Hmmm, I just checked the BOINC website and 5.8.8 is now the recommended client for Windows and Mac OS X, but is still listed as unstable for Linux.

I have them set to try to get a one day cache of work. They keep getting the "There is work but it is committed to other platforms" message but they do the random back-off and eventually get enough work that they haven't run dry (afaik). Work is out there, but it takes a lot of tries to get it and they often only get one WU, after which they reduce the number of seconds of work they are requesting and keep doing random back-off and asking again.

With a smaller cache, they would probably run dry. Also, these machines are doing half rosetta and half docking. They still seem to try to keep almost a full day of docking work in the cache. One machine does about 5% each of SIMAP and Seti, just to keep the projects active.

BTW, I have a 4th machine (RHEL3) that does about 90% Rosetta and 10% Seti, but it's never been able to run a docking WU longer than about 12 minutes, even with the "ulimit -s unlimited" fix and "ulimit -a" showing that the fix took.


____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2381 - Posted 2 Feb 2007 5:18:01 UTC - in response to Message ID 2380 .

This is now hopefully fixed. It just dawned on me that I forgot to apply the patch we got from the World Community Grid people that solves this 'work committed to other platforms' problem. Silly me....

Thanks for drawing my attention to this...
Andre


All three of my machines have been seeing that.

1) P4 Northwood Celeron - WinXP
2) Sempron 2500+ - Ubuntu Linux
3) Sempron 3100+ - WinXP

All three are running standard 5.4.11 boinc core clients. Hmmm, I just checked the BOINC website and 5.8.8 is now the recommended client for Windows and Mac OS X, but is still listed as unstable for Linux.

I have them set to try to get a one day cache of work. They keep getting the "There is work but it is committed to other platforms" message but they do the random back-off and eventually get enough work that they haven't run dry (afaik). Work is out there, but it takes a lot of tries to get it and they often only get one WU, after which they reduce the number of seconds of work they are requesting and keep doing random back-off and asking again.

With a smaller cache, they would probably run dry. Also, these machines are doing half rosetta and half docking. They still seem to try to keep almost a full day of docking work in the cache. One machine does about 5% each of SIMAP and Seti, just to keep the projects active.

BTW, I have a 4th machine (RHEL3) that does about 90% Rosetta and 10% Seti, but it's never been able to run a docking WU longer than about 12 minutes, even with the "ulimit -s unlimited" fix and "ulimit -a" showing that the fix took.



____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2382 - Posted 2 Feb 2007 5:22:11 UTC - in response to Message ID 2380 .
Last modified: 2 Feb 2007 5:22:23 UTC


Hmmm, I just checked the BOINC website and 5.8.8 is now the recommended client for Windows and Mac OS X, but is still listed as unstable for Linux.


See the front page news item on this.

AK
____________
D@H the greatest project in the world... a while from now!
(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2383 - Posted 2 Feb 2007 11:33:29 UTC - in response to Message ID 2381 .

This is now hopefully fixed. It just dawned on me that I forgot to apply the patch we got from the World Community Grid people that solves this 'work committed to other platforms' problem.


Thanks for the fix. My K6-III now got another workunit.

@all: Btw, there are at least four results unsent for K6/Windows ( here and here ) and the last one gave the two other members of the quorum and me 124 credits. This was the first Docking credit at all for my K6. So if you have a K6 online, you might want to grab one of these results... perhaps.

Regards

Alex
Profile Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 2384 - Posted 2 Feb 2007 13:50:32 UTC
Last modified: 2 Feb 2007 14:06:54 UTC

Thanks Andre, unfortunately I don't think the patch is working as I am still getting the error message that there is no work available for my platform (Linux). My Windows machines are not having this problem.

Edit: I have just checked and one of my 2 Windows machines is also getting the same message.
____________

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2385 - Posted 3 Feb 2007 0:08:58 UTC - in response to Message ID 2384 .

We'll keep on checking what can be the cause of this. I've also asked one of the students to build a tool that will show us which HR classes are currently in the shared memory. This will give us a better feel for which classes are running out of work. This might take a while though...

Thanks
Andre

Thanks Andre, unfortunately I don't think the patch is working as I am still getting the error message that there is no work available for my platform (Linux). My Windows machines are not having this problem.

Edit: I have just checked and one of my 2 Windows machines is also getting the same message.


____________
D@H the greatest project in the world... a while from now!
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2389 - Posted 3 Feb 2007 23:18:21 UTC
Last modified: 3 Feb 2007 23:19:38 UTC

Hi,

Something has changed for the better. All 3 of my boxes just requested multiple work units each and got work without any problems. In case anyone is wondering why all 3 requested work at almost the same time, I'm on dialup so I have to make the rounds giving each one a chance to communicate :-)


____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2391 - Posted 4 Feb 2007 3:50:08 UTC - in response to Message ID 2389 .

Great! That's good news.

AK

Hi,

Something has changed for the better. All 3 of my boxes just requested multiple work units each and got work without any problems. In case anyone is wondering why all 3 requested work at almost the same time, I'm on dialup so I have to make the rounds giving each one a chance to communicate :-)



____________
D@H the greatest project in the world... a while from now!
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2393 - Posted 4 Feb 2007 17:29:16 UTC - in response to Message ID 2391 .

Great! That's good news.

AK


Well, it would be good news, but it's back today. It's not as bad though.

Will having more client machines running D@H make this more of a problem or less of a problem?

____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2394 - Posted 4 Feb 2007 19:25:38 UTC - in response to Message ID 2393 .

More client machines will be better as the throughput in the shared memory will increase, thus more new workunits that are unassigned to an HR class will enter the game.

AK

Great! That's good news.

AK


Well, it would be good news, but it's back today. It's not as bad though.

Will having more client machines running D@H make this more of a problem or less of a problem?


____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2395 - Posted 4 Feb 2007 21:24:51 UTC - in response to Message ID 2394 .

More client machines will be better as the throughput in the shared memory will increase, thus more new workunits that are unassigned to an HR class will enter the game.

AK

Great! That's good news.

AK


Well, it would be good news, but it's back today. It's not as bad though.

Will having more client machines running D@H make this more of a problem or less of a problem?



I guess I had it just ass-backwards..........I started starving some machines thinking that would leave the "few" available WUs for others.
(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2396 - Posted 4 Feb 2007 22:57:39 UTC - in response to Message ID 2395 .


I guess I had it just ass-backwards..........I started starving some machines thinking that would leave the "few" available WUs for others.


Hello j2satx,

is your K6 still online? Maybe you want to grab one or two of these unsent results here :

Workunit 24800
Workunit 25108

Of course we would still need a third one, then. But we have already shared the following quorum, which gave some nice credits:

Workunit 22173

Think about it...

This is the first project where I start to sell results... should I be worried? Could be the next, more serious stage of BOINC addiction.

Regards

Alex

My results during the HR tests
____________
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2397 - Posted 4 Feb 2007 23:01:42 UTC - in response to Message ID 2396 .


I guess I had it just ass-backwards..........I started starving some machines thinking that would leave the "few" available WUs for others.


Hello j2satx,

is your K6 still online? Maybe you want to grab one or two of these unsent results here :

Workunit 24800
Workunit 25108

Of course we would still need a third one, then. But we have already shared the following quorum, which gave some nice credits:

Workunit 22173

Think about it...

This is the first project where I start to sell results... should I be worried? Could be the next, more serious stage of BOINC addiction.

Regards

Alex

My results during the HR tests


Alex, just for you I'll put it online for a couple of WUs........can't get tooo excited about 30+ hour WUs.

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2398 - Posted 4 Feb 2007 23:15:54 UTC - in response to Message ID 2396 .
Last modified: 4 Feb 2007 23:16:59 UTC


I guess I had it just ass-backwards..........I started starving some machines thinking that would leave the "few" available WUs for others.


Hello j2satx,

is your K6 still online? Maybe you want to grab one or two of these unsent results here :

Workunit 24800
Workunit 25108

Of course we would still need a third one, then. But we have already shared the following quorum, which gave some nice credits:

Workunit 22173

Think about it...

This is the first project where I start to sell results... should I be worried? Could be the next, more serious stage of BOINC addiction.

Regards

Alex

My results during the HR tests


@Alex, OK, converted to 5.8.8 and caught a WU. We'll see how long it takes..........BoincManager shows about 38 hours.

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2399 - Posted 5 Feb 2007 3:18:40 UTC

Guys, this is REAL team work :-)

And I guess Alex has become our first workunit trader...

Keep up the good work.

Andre (speaking for the whole team!)
____________
D@H the greatest project in the world... a while from now!

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2400 - Posted 5 Feb 2007 14:33:17 UTC - in response to Message ID 2399 .

Guys, this is REAL team work :-)

And I guess Alex has become our first workunit trader...

Keep up the good work.

Andre (speaking for the whole team!)


@Andre, If you could give us a matrix of the HR classes and number of machines in each class, maybe some other machines could be resurrected to fill out the classes.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2401 - Posted 5 Feb 2007 15:04:47 UTC - in response to Message ID 2400 .

@Andre, If you could give us a matrix of the HR classes and number of machines in each class, maybe some other machines could be resurrected to fill out the classes.


We'll see what we can come up with today/tomorrow.
AK
____________
D@H the greatest project in the world... a while from now!
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2404 - Posted 5 Feb 2007 22:40:02 UTC


It's been going downhill since Saturday night. I'm not able to get ANY new D@H work units on any of my machines. One is completely out of work and is running its secondary project now.

Is anyone able to get work today (Monday) ?

-- David

j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2406 - Posted 5 Feb 2007 22:55:09 UTC - in response to Message ID 2404 .


It's been going downhill since Saturday night. I'm not able to get ANY new D@H work units on any of my machines. One is completely out of work and is running its secondary project now.

Is anyone able to get work today (Monday) ?

-- David


I don't see where I caught any on my Linux boxes.........I'd already set my Windows boxes to no new work.
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2407 - Posted 5 Feb 2007 23:16:16 UTC - in response to Message ID 2406 .


It's been going downhill since Saturday night. I'm not able to get ANY new D@H work units on any of my machines. One is completely out of work and is running its secondary project now.

Is anyone able to get work today (Monday) ?

-- David


I don't see where I caught any on my Linux boxes.........I'd already set my Windows boxes to no new work.


Something has changed. As soon as I posted that message, one of my machines (the one I posted from) was able to finally get 2 work units. It's a Northwood P4 based Celeron. My 2 AMD machines (1 Linux, 1 WinXP) are still out of luck though.

-- David
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2408 - Posted 5 Feb 2007 23:30:40 UTC - in response to Message ID 2407 .


It's been going downhill since Saturday night. I'm not able to get ANY new D@H work units on any of my machines. One is completely out of work and is running its secondary project now.

Is anyone able to get work today (Monday) ?

-- David


I don't see where I caught any on my Linux boxes.........I'd already set my Windows boxes to no new work.


Something has changed. As soon as I posted that message, one of my machines (the one I posted from) was able to finally get 2 work units. It's a Northwood P4 based Celeron. My 2 AMD machines (1 Linux, 1 WinXP) are still out of luck though.

-- David


Two other users also caught each WU that you did.
(retired account)
Volunteer tester

Joined: Nov 22 06
Posts: 62
ID: 331
Credit: 158,686
RAC: 0
Message 2409 - Posted 6 Feb 2007 0:30:56 UTC - in response to Message ID 2398 .


@Alex, OK, converted to 5.8.8 and caught a WU. We'll see how long it takes..........BoincManager shows about 38 hours.


Thanks. Hope all goes well.

All my PCs are self-built (except my notebook) and so I usually keep them for a long time, even when they start to become obsolete. This includes my K6-III which has served me well since 1999. And so it is fun to crunch a few workunits with these oldtimers from time to time.

I kept my 286 till 2002 (running Minix) and my last 486 (running WinNT) was disposed of last year. :-)

Btw, I have found that ABC, RieselSieve with the sieve app. (not llr) and SIMAP with the simap app. (not hmmer) (the last two only with manually installed non-standard applications) are going well with the K6. All those applications seem to use only or mostly integer calculation.

Regards

Alex
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2411 - Posted 6 Feb 2007 0:58:00 UTC - in response to Message ID 2401 .

Memo is going to build a tool that will show us the contents of the shared memory, HR classes and all. This will give us a better chance to explain what is happening with the workunits.

Also, I have the feeling the number of active hosts has dropped substantially :-( This is of course not a good thing for keeping things flowing smoothly.

Andre

PS I have upgraded the server to boinc 5.9.2 a week ago; this might have something to do with it as well I guess. Don't know what though.

@Andre, If you could give us a matrix of the HR classes and number of machines in each class, maybe some other machines could be resurrected to fill out the classes.


We'll see what we can come up with today/tomorrow.
AK


____________
D@H the greatest project in the world... a while from now!
Profile clownius
Volunteer tester

Joined: Nov 14 06
Posts: 61
ID: 280
Credit: 2,677
RAC: 0
Message 2412 - Posted 6 Feb 2007 5:24:13 UTC

I think you will find most of B@A is over at ABC taking part in AA5 but many should return around the 15th when AA5 is over. Also, like many a Linux user, I won't bring part of my farm back until the credits are fixed. I don't like my C2D earning fewer credits an hour than a Windows P2 or P3. I throw it at projects where the crunching power is appreciated.
____________

Profile John B. Kalla
Volunteer tester

Joined: Oct 18 06
Posts: 54
ID: 188
Credit: 104,643
RAC: 0
Message 2414 - Posted 6 Feb 2007 6:08:55 UTC - in response to Message ID 2404 .
Last modified: 6 Feb 2007 6:10:46 UTC


...Is anyone able to get work today (Monday) ?

-- David


I've been out for a day or two. Working on Einstein@Home til Docking comes back up. I guess I've still got over 3,000 pending, so no big deal!
____________
John

MacPro
2 x 2.66GHz Dual-Core Xeon | 2GB RAM | ATI x1900 | BOINC 5.9.5
Profile Conan
Volunteer tester

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 2415 - Posted 6 Feb 2007 7:19:39 UTC

Only getting the occasional WU through on both Windows and Linux, one machine has less than a dozen and the other 3 less than 6 (only 4 left on my Opteron 275, one WU per core).
If money was not a problem I would go out and buy some more computers of different types just for the project, unfortunately it is a problem.
I will have one more built by next weekend though, it will be another Opteron running Linux, sorry can't help with the other HR groups and I will probably be retiring my Intel P4 2.53 GHz Windows machine (one in one out to balance the budget and electricity bill).
____________

Profile John B. Kalla
Volunteer tester

Joined: Oct 18 06
Posts: 54
ID: 188
Credit: 104,643
RAC: 0
Message 2418 - Posted 7 Feb 2007 2:59:23 UTC

WooHoo! Got a couple more workunits today.

Does running out of WU mean that Docking has all the help it needs with Alpha testing? Or is there a snag/problem? (perhaps with BOINC 5.8.8 app?)

Sorry. I'm largely ignorant of the science behind BOINC and protein folding.
____________
John

MacPro
2 x 2.66GHz Dual-Core Xeon | 2GB RAM | ATI x1900 | BOINC 5.9.5

Rene
Volunteer tester

Joined: Oct 2 06
Posts: 121
ID: 160
Credit: 109,415
RAC: 0
Message 2453 - Posted 11 Feb 2007 14:14:32 UTC - in response to Message ID 2418 .

Does running out of WU mean that Docking has all the help it needs with Alpha testing?


Don't think so... just keep that Mac attached... ;-)

____________
Profile David Ball
Forum moderator
Volunteer tester

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 2459 - Posted 12 Feb 2007 6:34:19 UTC - in response to Message ID 2418 .

WooHoo! Got a couple more workunits today.

Does running out of WU mean that Docking has all the help it needs with Alpha testing? Or is there a snag/problem? (perhaps with BOINC 5.8.8 app?)


There's multiple snags/problems :-)

BOINC 5.8.8 (now 5.8.11) did cause a problem because it changed the system information strings returned by the BOINC client and I believe that messed up Homogeneous redundancy (referred to as HR). Basically different processors return slightly different results due to changes in the floating point processing and rounding. HR attempts to group the processors that have the same characteristics so that their results can be compared to verify that they got the right result.
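To make the rounding point a bit more concrete, here is a small Python sketch. It only shows that floating-point sums depend on evaluation order, which is one way two machines can legitimately disagree in the last bits; it is not taken from the Docking application. The `hr_class` helper and its labels are made up for illustration and are not BOINC's actual HR classes.

```python
# Floating-point addition is not associative, so the order in which a CPU,
# compiler, or math library evaluates a sum can change the low-order bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)


def results_agree(a: float, b: float) -> bool:
    """Strict comparison, as a validator demanding identical results would do."""
    return a == b


def hr_class(os_name: str, cpu_vendor: str) -> str:
    """Hypothetical HR class key (assumption): hosts with the same key are
    expected to produce identical results and may share a quorum."""
    return f"{os_name.lower()}/{cpu_vendor.lower()}"


if __name__ == "__main__":
    print(results_agree(left_to_right, right_to_left))
    print(hr_class("Linux", "AMD"))
```

Two hosts whose results differ like this would fail a strict comparison even though neither is wrong, which is why HR only compares hosts that fall in the same class.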

There's also a problem with the shared memory segment on the server side and they're writing a tool to gather more info to try to find out what is going on. IIRC, Andre or Memo said that more users would actually help alleviate the problem.

If I understand correctly (definitely not guaranteed), The shared memory segment is a pool of available work units. A work unit is assigned to 3 client computers to run (3 is the current quorum). Here's what happens in the shared memory segment.

1) Work units go into the shared memory segment on the server.

2) A client requests work. Their HR is matched against the HR of work units in the shared memory.

2A) If a HR match is found AND the machines which have already been assigned to work on the work unit don't belong to the user which owns this client machine, then the work unit is assigned to this client.

2B) If the conditions in 2A aren't met, it looks for a work unit which hasn't been assigned to ANY HR group and if it finds one, it marks the work unit to only be assigned to this client's HR group and assigns the work unit to this client.

2C) If 2A and 2B aren't met, then you get the "work was committed to other platforms" message and your client will try again in a few minutes. It will try a few times with one minute between tries and then will go to waiting a gradually increasing random time between tries. At some point, more work is added to the pool and the client succeeds in getting work.

3) I'm even more vague on what happens after the work unit has been assigned to enough clients to meet the minimum quorum. I suspect it leaves the shared memory work pool unless one of the machines it is assigned to gets an error or doesn't return the result within the time limit. That's just a guess, though.
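Steps 2A to 2C above can be sketched roughly like this in Python. This is only my reading of the steps as described here, not the real scheduler code; the names `WorkUnit`, `request_work` and the class labels are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class WorkUnit:
    wu_id: int
    hr_class: Optional[str] = None                 # None: not yet committed to an HR class
    users: Set[str] = field(default_factory=set)   # users already holding a copy


def _assign(pool: List[WorkUnit], index: int, user: str) -> WorkUnit:
    """Remove one unsent copy from the pool and record who received it."""
    wu = pool.pop(index)
    wu.users.add(user)
    return wu


def request_work(pool: List[WorkUnit], user: str, host_class: str) -> Optional[WorkUnit]:
    """pool holds one entry per unsent copy of a work unit.

    Returns the work unit sent to the host, or None for the
    'committed to other platforms' case (step 2C).
    """
    # Step 2A: a copy already committed to this host's class, not yet held by this user.
    for i, wu in enumerate(pool):
        if wu.hr_class == host_class and user not in wu.users:
            return _assign(pool, i, user)
    # Step 2B: an uncommitted work unit; commit it to this host's class.
    for i, wu in enumerate(pool):
        if wu.hr_class is None:
            wu.hr_class = host_class
            return _assign(pool, i, user)
    # Step 2C: nothing suitable; the client has to back off and ask again.
    return None
```

Because all copies of a work unit share one `WorkUnit` object here, committing the first copy to a class automatically restricts the remaining copies to that class, which is the behaviour that can leave other classes without work.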

I think that some HR groups are filling up the shared memory area because they don't have another machine in the same HR group to issue the work unit to. This is probably caused in part by some HR groups having a couple of very fast machines and some really slow machines (like mine). To feed the fast machines, a bunch of work has to be assigned to their HR group, but it has to wait for one of the slower machines to get around to asking for more work for the quorum to be met and a space to be freed up in the shared memory pool.

There might also be limits on how much of the pool one HR group can reserve.

Please remember that much of this is guessing from what I've heard in the various forums and developer lists. I fully expect that Andre will come by in the morning and tell you how much of it I got wrong :-)

Anyway, the problems are being worked on and the project needs as wide a variety of clients on it as possible, so please keep crunching.

Hopefully, this is at least close to how the process works and will help explain things :-) I'd have to actually get the source code from the CVS repository and spend hours or days looking at it to be sure.



____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 2460 - Posted 12 Feb 2007 16:15:50 UTC - in response to Message ID 2459 .

WooHoo! Got a couple more workunits today.

2A) If a HR match is found AND the machines which have already been assigned to work on the work unit don't belong to the user which owns this client machine, then the work unit is assigned to this client.



Why should it matter if work goes to a client machine owned by the same user which has the WU?

I can see why you wouldn't want it to go to the same computer, but why not a computer in the same HR class, no matter who owns it.
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2461 - Posted 12 Feb 2007 16:32:22 UTC - in response to Message ID 2460 .


Why should it matter if work goes to a client machine owned by the same user which has the WU?

I can see why you wouldn't want it to go to the same computer, but why not a computer in the same HR class, no matter who owns it.


Because in principle (and this is coming from the seti days a long time ago) a user can cheat and inject the same result files on all his/her computers and then trigger boinc to send these back. I don't think this is that easy to do with the newest boinc client software, but the mechanism still exists.

Andre
____________
D@H the greatest project in the world... a while from now!
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 2463 - Posted 12 Feb 2007 17:53:55 UTC - in response to Message ID 2459 .

Hi David,


There's multiple snags/problems :-)
BOINC 5.8.8 (now 5.8.11) did cause a problem because it changed the system information strings returned by the BOINC client and I believe that messed up Homogeneous redundancy (referred to as HR). Basically different processors return slightly different results due to changes in the floating point processing and rounding. HR attempts to group the processors that have the same characteristics so that their results can be compared to verify that they got the right result.


Correct. The new boinc client is (hopefully) going to make this HR filtering easier, but since we have a mix of old and new clients, for the moment it is harder. If you can, everybody please upgrade to the latest version of the boinc client. In a while we will probably set this as the required minimum version of the client.


There's also a problem with the shared memory segment on the server side and they're writing a tool to gather more info to try to find out what is going on. IIRC, Andre or Memo said that more users would actually help alleviate the problem.


There's not really a problem with the shared memory segment, but at the moment BOINC doesn't supply the tools to see what HR classes are in there, so we've decided to write a tool ourselves.
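The kind of report such a tool could give might look like the Python sketch below. It assumes the shared memory contents have already been read into a plain list of HR class labels (with `None` for slots not yet committed to a class); how to actually read the segment is not shown, and `hr_class_summary` and the labels are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def hr_class_summary(slots: Iterable[Optional[str]]) -> Dict[str, int]:
    """Count shared-memory slots per HR class.

    `slots` is an assumed, already-decoded view of the segment: one label
    per unsent replica, None when the replica is not committed to a class.
    """
    counts = Counter("unassigned" if label is None else label for label in slots)
    return dict(counts)


if __name__ == "__main__":
    example = ["linux/amd", None, "linux/amd", "windows/k6"]
    for label, n in sorted(hr_class_summary(example).items()):
        print(f"{label:15s} {n}")
```

A class with many slots but few attached hosts would stand out immediately in a listing like this.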


If I understand correctly (definitely not guaranteed), The shared memory segment is a pool of available work units. A work unit is assigned to 3 client computers to run (3 is the current quorum). Here's what happens in the shared memory segment.


The shared memory is a pool of available results (replicas). A workunit generates 3 replicas that are stored in the shared memory by the feeder program.


1) Work units go into the shared memory segment on the server.


Replicas go into the shared memory segment.


2) A client requests work. Their HR is matched against the HR of work units in the shared memory.


Correct. (except that the workunit is a replica)


2A) If a HR match is found AND the machines which have already been assigned to work on the work unit don't belong to the user which owns this client machine, then the work unit is assigned to this client.


Correct. (except that the workunit is a replica)


2B) If the conditions in 2A aren't met, it looks for a work unit which hasn't been assigned to ANY HR group and if it finds one, it marks the work unit to only be assigned to this client's HR group and assigns the work unit to this client.


Correct. (except that the workunit is a replica)


2C) If 2A and 2B aren't met, then you get the "work was committed to other platforms" message and your client will try again in a few minutes. It will try a few times with one minute between tries and then will go to waiting a gradually increasing random time between tries. At some point, more work is added to the pool and the client succeeds in getting work.


Correct.
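Steps 2A to 2C can be sketched roughly as follows. This is only an illustration of the logic described in this thread, not the actual scheduler code: `Workunit`, `Replica`, `request_work` and the HR class strings are hypothetical names, and the shared memory is modelled as a plain Python list.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Workunit:
    id: int
    hr_class: Optional[str] = None  # None: not committed to any HR class yet
    users: Set[int] = field(default_factory=set)  # users already holding a replica


@dataclass
class Replica:
    workunit: Workunit


def _assign(pool: List[Optional[Replica]], slot: int, user_id: int) -> Replica:
    """Hand the replica in `slot` to `user_id` and free the slot."""
    replica = pool[slot]
    assert replica is not None
    replica.workunit.users.add(user_id)
    pool[slot] = None
    return replica


def request_work(pool: List[Optional[Replica]], user_id: int,
                 host_hr_class: str) -> Optional[Replica]:
    # 2A: a replica already committed to this host's HR class, from a
    # workunit that this user does not already hold a replica of.
    for i, r in enumerate(pool):
        if r and r.workunit.hr_class == host_hr_class \
                and user_id not in r.workunit.users:
            return _assign(pool, i, user_id)
    # 2B: a replica of a workunit with no HR class yet; commit the
    # workunit to this host's class, then assign.
    for i, r in enumerate(pool):
        if r and r.workunit.hr_class is None:
            r.workunit.hr_class = host_hr_class
            return _assign(pool, i, user_id)
    # 2C: nothing suitable, the client has to retry later.
    return None
```

Returning `None` stands in for the "committed to other platforms" reply. Note how a pool that is full of replicas committed to other classes makes every request from an unmatched host end up in 2C, even though the server status shows results ready to send.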


3) I'm even more vague on what happens after the work unit has been assigned to enough clients to meet the minimum quorum. I suspect it leaves the shared memory work pool unless one of the machines it is assigned to gets an error or doesn't return the result within the time limit. That's just a guess, though.


Every replica that has been assigned to a host is removed from the SM. If replicas time out or return an error, the system generates a new replica for that workunit and sticks it into the SM for distribution to a new host.
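As a small sketch of that replacement rule (again with made-up names, not the real server code): `issued` is assumed to track how many replicas each workunit has had so far, and `pending` is the queue the feeder would later move into the SM.

```python
from collections import deque
from typing import Deque, Dict, Tuple


def handle_outcome(workunit_id: int, outcome: str,
                   issued: Dict[int, int],
                   pending: Deque[Tuple[int, int]]) -> bool:
    """Queue a replacement replica when a result errored or timed out.

    Returns True if a new replica was queued, False otherwise.
    """
    if outcome not in ("error", "timeout"):
        return False
    replica_no = issued.get(workunit_id, 0)  # next unused replica number
    issued[workunit_id] = replica_no + 1
    pending.append((workunit_id, replica_no))
    return True
```

The outcome strings `"error"` and `"timeout"` are placeholders chosen for this example.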


I think that some HR groups are filling up the shared memory area because they don't have another machine in the same HR group to issue the work unit to. This is probably caused in part by some HR groups having a couple of very fast machines and some really slow machines (like mine). To feed the fast machines, a bunch of work has to be assigned to their HR group, but it has to wait for one of the slower machines to get around to asking for more work for the quorum to be met and a space to be freed up in the shared memory pool.


Correct. The fast/slow machine issue is a problem (we will probably give them their own HR class later) and the fact that we don't have many machines attached for certain classes (e.g. K6's or Macs) is a problem.


There might also be limits on how much of the pool one HR group can reserve.


No, we don't have limits set up. The Leiden Classical guys have actually implemented limits and it seems to work well. We might go that way later on.
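One way such a limit could look, purely as an assumption for illustration (this is not how Leiden Classical does it, since their implementation isn't described here): cap the share of pool slots that any single HR class may hold before another uncommitted replica is committed to it. The function name `can_commit` and the `max_share` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def can_commit(pool_classes: List[Optional[str]], hr_class: str,
               max_share: float) -> bool:
    """Return True if `hr_class` may claim one more slot in the pool.

    `pool_classes` holds the HR class of each slot (None if the slot is
    empty or uncommitted). `max_share` is the fraction of the whole pool
    that one class is allowed to occupy.
    """
    cap = int(max_share * len(pool_classes))
    used = Counter(c for c in pool_classes if c is not None)
    return used[hr_class] < cap
```

With a cap like this, a class with a couple of fast hosts and few partners could not fill the whole pool on its own, at the cost of those fast hosts sometimes waiting for work.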


Please remember that much of this is guessing from what I've heard in the various forums and developer lists. I fully expect that Andre will come by in the morning and tell you how much of it I got wrong :-)

Anyway, the problems are being worked on and the project needs as wide a variety of clients on it as possible, so please keep crunching.


That is 100% correct :-) Very good 'guesswork' David!


Hopefully, this is at least close to how the process works and will help explain things :-) I'd have to actually get the source code from the CVS repository and spend hours or days looking at it to be sure.


Thanks
Andre

____________
D@H the greatest project in the world... a while from now!

