Issues with new 5.05 Linux application


mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 2997 - Posted 12 Apr 2007 4:22:01 UTC

Running 32-bit boinc client 5.8.17 on a 64-bit Ubuntu 6.10 system. The new (32-bit) charmm application was downloaded by my system today. Two things I've noticed:

(1) Gkrellm shows the CPUs running the 5.05 Charmm tasks are spending 20%-40% of their time in non-nice execution (that is, in system mode). Top confirms this. [All other projects spend at least 98% of their time in nice execution (that is, not in system mode).]

(2) To verify that this unexpected system mode execution was being done for the Docking application tasks, I 'Suspended' (via boincmgr) the Docking project. The 5.05 Charmm tasks __ignored__ the boincmgr suspend, and kept right on executing. [All other projects obey the boincmgr 'Suspend'.]
.
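For reference, the user/system split (and whether a task really stops on suspend) can be checked from a terminal along these lines. This is only a rough sketch; the process name "charmm" is a guess, so match whatever name actually shows up in top:

# Show state and accumulated CPU time for the running science tasks (process name is an assumption)
ps -C charmm -o pid,stat,time,comm
# Per-process split from /proc: field 14 is utime, field 15 is stime (in clock ticks)
for p in $(pgrep charmm); do
    awk '{printf "pid %s  utime=%s  stime=%s\n", $1, $14, $15}' "/proc/$p/stat"
done
# After 'Suspend' in boincmgr these pids should stop accumulating CPU time;
# if the numbers keep climbing, the tasks really are ignoring the suspend.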

Rene
Volunteer tester
Avatar

Joined: Oct 2 06
Posts: 121
ID: 160
Credit: 109,415
RAC: 0
Message 2998 - Posted 12 Apr 2007 5:06:37 UTC - in response to Message ID 2997 .
Last modified: 12 Apr 2007 5:07:59 UTC

Running 32-bit boinc client 5.8.17 on a 64-bit Ubuntu 6.10 system. The new (32-bit) charmm application was downloaded by my system today. Two things I've noticed:

(1) Gkrellm shows the CPUs running the 5.05 Charmm tasks are spending 20%-40% of their time in non-nice execution (that is, in system mode). Top confirms this. [All other projects spend at least 98% of their time in nice execution (that is, not in system mode).]

(2) To verify that this unexpected system mode execution was being done for the Docking application tasks, I 'Suspended' (via boincmgr) the Docking project. The 5.05 Charmm tasks __ignored__ the boincmgr suspend, and kept right on executing. [All other projects obey the boincmgr 'Suspend'.]
.


It also looks like your PC has done its calculation in less than a second (0.96).

;-)

Did those units run in one go, or were they running in a queue with other projects switching and asking for CPU time?


____________
mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3001 - Posted 12 Apr 2007 6:50:29 UTC - in response to Message ID 2998 .
Last modified: 12 Apr 2007 6:53:14 UTC


It also looks like your PC has done its calculation in less than a second (0.96).

;-)

Did those units run in one go, or were they running in a queue with other projects switching and asking for CPU time?


I do not have an explanation. It was mentioned that now a shell is being called, which in turn calls the application executable. Perhaps the 0.96 value is the time spent executing the shell (in other words, perhaps the time spent executing the application itself was not being measured).

Those units were downloaded and added to the ready queue at my computer. They were dispatched from the ready queue by the "normal" boinc client scheduling algorithm -- that is, they ran "normally" under the boinc client (being scheduled and dispatched alongside other tasks from other projects).

[I connect to the network only from time to time. The units which I tried to 'Suspend' have not yet been uploaded to the server. So far, one of them has finished - it too has an elapsed time value of around one second.] So the short time reported appears to be yet another issue, in addition to the 'Suspend' issue described earlier.
.
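If the wrapper really is what gets measured, the effect is easy to reproduce with a throwaway test (purely illustrative; this is not the project's wrapper): a parent shell's own CPU time does not include a still-running child's time, so anything that reads only the wrapper's time sees almost nothing.

# Start a shell that forks a CPU-bound child, then compare the two processes' CPU times
sh -c 'dd if=/dev/zero of=/dev/null bs=1M count=50000; true' &
PARENT=$!
sleep 10
ps -o pid,ppid,cputime,comm -p "$PARENT"       # the wrapper shell: roughly 00:00:00
ps -o pid,ppid,cputime,comm --ppid "$PARENT"   # the dd child: seconds of CPU time and climbing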

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3006 - Posted 12 Apr 2007 10:37:53 UTC - in response to Message ID 3001 .

There might be something funny going on with your setup; the sub-second execution time combined with a valid result, in particular, is an oddity we haven't seen before. Is anybody else noticing this behavior on their Linux boxes? Memo will do some testing today or tomorrow to see if he can reproduce this.

Thanks!
Andre


It also looks like your PC has done its calculation in less than a second (0.96).

;-)

Did those units run in one go, or were they running in a queue with other projects switching and asking for CPU time?


I do not have an explanation. It was mentioned that now a shell is being called, which in turn calls the application executable. Perhaps the 0.96 value is the time spent executing the shell (in other words, perhaps the time spent executing the application itself was not being measured).

Those units were downloaded and added to the ready queue at my computer. They were dispatched from the ready queue by the "normal" boinc client scheduling algorithm -- that is, they ran "normally" under the boinc client (being scheduled and dispatched alongside other tasks from other projects).

[I connect to the network only from time to time. The units which I tried to 'Suspend' have not yet been uploaded to the server. So far, one of them has finished - it too has an elapsed time value of around one second.] So the short time reported appears to be yet another issue, in addition to the 'Suspend' issue described earlier.
.



____________
D@H the greatest project in the world... a while from now!
j2satx
Volunteer tester

Joined: Dec 22 06
Posts: 183
ID: 339
Credit: 16,191,581
RAC: 0
Message 3009 - Posted 12 Apr 2007 11:55:48 UTC
Last modified: 12 Apr 2007 12:00:19 UTC

My 64-bit computers are erroring on the 5.05s.

Ubuntu 7.04 64-bit
Augustine's 64-bit 5.8.17 client

Computer Project Date ID Message
6100M901 Docking@Home 4/12/2007 6:52:55 AM 222370 Task 1tng_mod0001_45013_219488_2 exited with zero status but no 'finished' file
6100M901 Docking@Home 4/12/2007 6:52:54 AM 222369 Restarting task 1tng_mod0001_45013_219488_2 using charmm version 505
6100M901 Docking@Home 4/12/2007 6:52:54 AM 222368 If this happens repeatedly you may need to reset the project.
6100M901 Docking@Home 4/12/2007 6:52:54 AM 222367 Task 1tng_mod0001_45013_219488_2 exited with zero status but no 'finished' file
6100M901 Docking@Home 4/12/2007 6:52:53 AM 222366 Restarting task 1tng_mod0001_45013_219488_2 using charmm version 505
6100M901 Docking@Home 4/12/2007 6:52:53 AM 222365 If this happens repeatedly you may need to reset the project.
6100M901 Docking@Home 4/12/2007 6:52:53 AM 222364 Task 1tng_mod0001_45013_219488_2 exited with zero status but no 'finished' file
6100M901 Docking@Home 4/12/2007 6:52:52 AM 222363 Restarting task 1tng_mod0001_45013_219488_2 using charmm version 505
6100M901 Docking@Home 4/12/2007 6:52:52 AM 222362 If this happens repeatedly you may need to reset the project.


Computer Project Date ID Message
6100M902 Docking@Home 4/12/2007 6:53:18 AM 250891 Task 1tng_mod0001_45705_348832_0 exited with zero status but no 'finished' file
6100M902 Docking@Home 4/12/2007 6:53:17 AM 250890 Restarting task 1tng_mod0001_45705_348832_0 using charmm version 505
6100M902 Docking@Home 4/12/2007 6:53:17 AM 250889 If this happens repeatedly you may need to reset the project.
6100M902 Docking@Home 4/12/2007 6:53:17 AM 250888 Task 1tng_mod0001_45705_348832_0 exited with zero status but no 'finished' file
6100M902 Docking@Home 4/12/2007 6:53:16 AM 250887 Restarting task 1tng_mod0001_45705_348832_0 using charmm version 505
6100M902 Docking@Home 4/12/2007 6:53:16 AM 250886 If this happens repeatedly you may need to reset the project.
6100M902 Docking@Home 4/12/2007 6:53:16 AM 250885 Task 1tng_mod0001_45705_348832_0 exited with zero status but no 'finished' file
6100M902 Docking@Home 4/12/2007 6:53:15 AM 250884 Restarting task 1tng_mod0001_45705_348832_0 using charmm version 505
6100M902 Docking@Home 4/12/2007 6:53:15 AM 250883 If this happens repeatedly you may need to reset the project.
6100M902 Docking@Home 4/12/2007 6:53:15 AM 250882 Task 1tng_mod0001_45705_348832_0 exited with zero status but no 'finished' file
6100M902 Docking@Home 4/12/2007 6:53:14 AM 250881 Restarting task 1tng_mod0001_45705_348832_0 using charmm version 505
6100M902 Docking@Home 4/12/2007 6:53:14 AM 250880 If this happens repeatedly you may need to reset the project.

I've suspended them from D@H until we get a fix.

mikus
Volunteer tester

Joined: Oct 28 06
Posts: 18
ID: 193
Credit: 2,915,329
RAC: 0
Message 3010 - Posted 12 Apr 2007 12:25:30 UTC
Last modified: 12 Apr 2007 12:26:48 UTC

Follow on:

I found that when the last previously-downloaded Charmm 5.05 task ran, the boinc client decided to pre-empt it (at "normal" project rotation time) in order to run a malaria task in its place. Instead, the Charmm 5.05 task kept on running, with the result that the CPU was being shared 50% (according to 'top') between malaria and docking. [For the time being I will wait for a new Linux application version from Docking.]

----

I wanted to try running the Charmm 5.05 application by itself (*without* the shell script) -- but the Docking server refused to send me work -- it claimed "platform 'anonymous' not found". Andre - to fully integrate Docking into the BOINC environment, you ought to upgrade your server to recognize the 'anonymous' platform as well as the 'i686' platform. ['Anonymous' is how the boinc client describes the platform when a non-standard application executable is specified.]
.
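For reference, specifying a non-standard application executable is normally done with an app_info.xml placed in the project directory. A rough sketch of that mechanism is below; the directory name, app name, and binary file name are guesses, not taken from this project, so the real values would have to come from client_state.xml:

# Sketch only - names below are assumptions, substitute the project's real names
cat > ~/BOINC/projects/docking.utep.edu_dockingathome/app_info.xml <<'EOF'
<app_info>
    <app>
        <name>charmm</name>
    </app>
    <file_info>
        <name>charmm_5.05_i686-pc-linux-gnu</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>charmm</app_name>
        <version_num>505</version_num>
        <file_ref>
            <file_name>charmm_5.05_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
EOF

Even with that file in place, the scheduler request identifies the host platform as 'anonymous', which is exactly what the Docking server rejected above.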

BobCat13
Volunteer tester

Joined: Nov 14 06
Posts: 22
ID: 239
Credit: 285,322
RAC: 0
Message 3017 - Posted 12 Apr 2007 22:32:04 UTC

In output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, covering Conformation/Rotation 1/1 through 80/80, but the 5.05 application produces only one line, for Conformation/Rotation 80/80. Is this correct?

Also, does the 5.05 application require a certain level of libraries to run efficiently? I am running Ubuntu 6.06 and since the change to 5.05, my runtime has increased by ~29% on all units. I ran a couple of other projects to check, but they still run in the same amount of time as before so it appears to only be D@H that has changed.
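One quick way to see which shared libraries the downloaded binary actually picks up on a given distribution (the path and file name below are guesses; look in the BOINC projects directory for the real name):

# Binary name and path are assumptions - substitute whatever the project downloaded
ldd ~/BOINC/projects/docking.utep.edu_dockingathome/charmm_5.05_i686-pc-linux-gnu

Comparing that output on, say, Ubuntu 6.06 and 6.10 would show whether different libc/libm versions are being pulled in.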

Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3023 - Posted 13 Apr 2007 4:24:37 UTC - in response to Message ID 3017 .

In output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, covering Conformation/Rotation 1/1 through 80/80, but the 5.05 application produces only one line, for Conformation/Rotation 80/80. Is this correct?

I was just noticing how much smaller that file was myself and wondering if it was broken. They said they were cutting out some of the debug information in 5.05. Maybe that's it. Andre or Memo would probably have to answer that for us and I think Andre said something about being away for a few days in another thread. I don't know how often Memo is in the forums but, from the amount of work involved, I suspect they've had him chained to his computers getting the new version out.
Also, does the 5.05 application require a certain level of libraries to run efficiently? I am running Ubuntu 6.06 and since the change to 5.05, my runtime has increased by ~29% on all units. I ran a couple of other projects to check, but they still run in the same amount of time as before so it appears to only be D@H that has changed.


My Linux machine is Ubuntu 6.10 and it went from about 14800 seconds per WU to about 16000 seconds per WU. It's a Socket A Sempron and IIRC it only has 128KB of L2 cache. I wonder if the new app is more cache sensitive.

One of my WinXP machines has a Socket 754 Sempron with 256KB L2 cache and the runtime dropped by about 100 seconds per WU on that machine.

My WinXP P4 2.8 with 512KB cache increased by a few hundred seconds but that could have been related to system load. I've kept it busy with some database stuff today.

My Vista PD 925 with 2x2MB cache decreased by about 300 seconds per WU. It's being burned in and is only running D@H so it's very consistent.

It's interesting that the 2 machines which increased are older machines with IDE disks and the 2 which decreased are SATA. Maybe it's the disk subsystem and not the cache size. Didn't someone, in another thread, say something about overly aggressive checkpointing keeping the disk busy?

Has anyone else seen a pattern related to either cache size or IDE vs SATA harddisks?

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Profile UBT-Timby

Joined: Mar 25 07
Posts: 2
ID: 368
Credit: 3,870
RAC: 0
Message 3030 - Posted 13 Apr 2007 20:21:44 UTC

Hi all, I'm quite new to this project, but I have the same end result as j2satx.
My setup is Ubuntu 6.10 64-bit with the 5.8.11 client and an AMD Athlon 64 5200+ CPU.

Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3036 - Posted 14 Apr 2007 13:43:21 UTC

Yes, I have also noticed that the work units on Linux now take much longer.
On my AMD Opteron 285, 5.04 WUs took 2 hours 40 minutes; 5.05 is now taking up to 5 hours 58 minutes, more than double.
My Opteron 275 is now taking over 6 hours (up to 6 hours 15 minutes), where it used to take under 3 hours.
My Windows machines seem to take just as long on 5.05 as they did on 5.04.
Why has Linux started to take so long?
One 5.05 WU took a long time on 11/4/07, then the others were OK at the normal run time up until 13/4/07 and 14/4/07; now both computers are taking over twice as long to run all 5.05 WUs.
Hosts 130 and 1569.

Also, with the long run times, the allocated credit amounts are not increasing to match the longer crunching times, with a corresponding drop in credit per hour (now down to about 8 per hour).
____________

fubared
Volunteer tester

Joined: Nov 14 06
Posts: 11
ID: 293
Credit: 57,379
RAC: 0
Message 3047 - Posted 15 Apr 2007 5:44:34 UTC

I started a thread about the Linux x64 issues. The script has a typo in it: open the charmm script file and you will notice it refers to the i686 file on lines 22 and 25. The reason it throws an error message is that it can't find the charmm i686 binary.
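Something along these lines in the wrapper would avoid hard-coding one architecture. This is only an illustration, not the project's actual script, and the binary names are guesses:

#!/bin/sh
# Illustrative dispatch on machine type - binary names below are assumptions
case "$(uname -m)" in
    x86_64) BIN=./charmm_5.05_x86_64-pc-linux-gnu ;;
    *)      BIN=./charmm_5.05_i686-pc-linux-gnu ;;
esac
exec "$BIN" "$@"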

Rene
Volunteer tester
Avatar

Joined: Oct 2 06
Posts: 121
ID: 160
Credit: 109,415
RAC: 0
Message 3051 - Posted 15 Apr 2007 13:56:48 UTC - in response to Message ID 3023 .
Last modified: 15 Apr 2007 14:21:25 UTC

Has anyone else seen a pattern related to either cache size or IDE vs SATA harddisks?

-- David


I don't know if there's a pattern, but what I can see is that my Pentium D 805 (1 MB per core), running Vista, seems to be doing the WUs slightly faster.
My Athlon 2600+ Barton (512 KB) has an increased running time... and that's my Linux host, running Kubuntu 6.10.
Both are equipped with IDE disks.

It's kind of like comparing apples and peaches in my case, but you might have a point here.
Although the increase on my Linux host... approx 33%... could be cache related, I also think that it could be Linux related.

Maybe others can confirm your assumption (or not)... ;-)

Edit: There's also a decrease in system load on my Linux host. The app uses between 75 and 90% of the CPU... this used to be 95-100%.

____________
Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3054 - Posted 15 Apr 2007 14:22:37 UTC - in response to Message ID 3023 .

In output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, covering Conformation/Rotation 1/1 through 80/80, but the 5.05 application produces only one line, for Conformation/Rotation 80/80. Is this correct?

I was just noticing how much smaller that file was myself and wondering if it was broken. They said they were cutting out some of the debug information in 5.05. Maybe that's it. Andre or Memo would probably have to answer that for us and I think Andre said something about being away for a few days in another thread. I don't know how often Memo is in the forums but, from the amount of work involved, I suspect they've had him chained to his computers getting the new version out.
Also, does the 5.05 application require a certain level of libraries to run efficiently? I am running Ubuntu 6.06 and since the change to 5.05, my runtime has increased by ~29% on all units. I ran a couple of other projects to check, but they still run in the same amount of time as before so it appears to only be D@H that has changed.


My Linux machine is Ubuntu 6.10 and it went from about 14800 seconds per WU to about 16000 seconds per WU. It's a Socket A Sempron and IIRC it only has 128KB of L2 cache. I wonder if the new app is more cache sensitive.

One of my WinXP machines has a Socket 754 Sempron with 256KB L2 cache and the runtime dropped by about 100 seconds per WU on that machine.

My WinXP P4 2.8 with 512KB cache increased by a few hundred seconds but that could have been related to system load. I've kept it busy with some database stuff today.

My Vista PD 925 with 2x2MB cache decreased by about 300 seconds per WU. It's being burned in and is only running D@H so it's very consistent.

It's interesting that the 2 machines which increased are older machines with IDE disks and the 2 which decreased are SATA. Maybe it's the disk subsystem and not the cache size. Didn't someone, in another thread, say something about overly aggressive checkpointing keeping the disk busy?

Has anyone else seen a pattern related to either cache size or IDE vs SATA harddisks?

-- David


>> David, I run SATA II discs on 4 of my 5 computers and do not see any difference between them and the one that has IDE; there is constant disk reading and writing on all of them.
The difference that I and many others are seeing is between Windows/Linux/Darwin.
With the new 5.05 Charmm, my Windows machines are actually a few minutes faster than before.
My Linux machines are more than 125% slower on 5.05 than on 5.04.
The Darwin users seem to have times blowing out by over 600%.
So the new version favours Windows and kills the others.
I will consider removing my Linux machines if the credit/hour does not improve (dropping from 17-20 per hour down to 7-8 per hour), as the results don't justify the time spent, especially if they then don't validate.
____________
Rene
Volunteer tester
Avatar

Joined: Oct 2 06
Posts: 121
ID: 160
Credit: 109,415
RAC: 0
Message 3055 - Posted 15 Apr 2007 16:31:35 UTC - in response to Message ID 3054 .
Last modified: 15 Apr 2007 16:35:20 UTC

I will consider removing my Linux machines if the credit/hour does not improve (dropping from 17-20 per hour down to 7-8 per hour), as the results don't justify the time spent, especially if they then don't validate.


I did not know it was already time to abandon ship... AFAIK engines are only running slower in some parts of the engine room... we aren't sinking yet.

;-)

Just give it some time, Conan... I'm sure a fix will be made.


Until then, just grab hold of this one...

____________
Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3059 - Posted 16 Apr 2007 0:16:36 UTC

Hello Conan,

I posted a bunch of information here which you might want to read. It goes into the whole HR/Disk type and cache size/file access timestamping/Multiple Core thing in detail as well as possible interactions between them.

I wouldn't abandon D@H yet. I think the big problem is that, for various reasons (IIRC including a conference within the last few days), the actual Docking programming staff is off or out of town this weekend. I think they'll be back either Monday or Tuesday and will probably do something within a few hours of reading all the data accumulating here in the message boards.

I may be a forum moderator but I'm just a volunteer so I don't have access to the software or servers. In fact, I'm in northern Alabama and D@H is in Texas.

I wish I could be more help to you Conan but until Andre gets back in town there's nothing I can do except try to collect data and respond to people here in the forum. I know you're probably thinking that it wasn't a good idea to release a new D@H app just before going out of town, but I would think that they tested this a lot before releasing it. Unfortunately, I suspect that what they interpreted as an increased number of floating point operations per WU was actually this problem but they didn't have a configuration where it stood out as much as it does on some of the user configurations.

BTW, I think most (or all) of their Windows testing is done using VMware on Linux hosts. Also, didn't Andre say a while back that his personal machine is a Mac. It might be a PowerPC Mac, or I could be remembering wrong.

For now, there are 2 things I can suggest trying if there's not something else on your Linux machines that it would mess up.

1. If the linux partition is ext3, switch it to ext2 to turn off journaling.

2. Mount it with the noatime attribute to turn off file access timestamping.
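For anyone who wants to try those, the commands are roughly as follows (a sketch only; device names and mount points are placeholders, and /etc/fstab should be backed up before changing it):

# 1. An ext3 filesystem can usually be mounted as ext2 (no journal) - placeholder device and mount point
sudo mount -t ext2 /dev/sda1 /mnt/data
# 2. Turn off access-time updates on an already-mounted filesystem
sudo mount -o remount,noatime /
# To make noatime permanent, add it to the options column of that filesystem's /etc/fstab line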

I'm not sure if any of that would apply to Mac OS X machines. Aren't they based on some variant of FreeBSD which is Unix like? I've never used one.

I'm sure Andre will address this situation ASAP when he gets back.

Conan, you've been a very valuable contributor and I know I would hate to see you leave. Since your machines seem to be getting hit particularly hard by the changes, maybe you could suspend Docking or give it a smaller share until Andre returns and can address your issues. I'm hoping he'll be back tomorrow morning.

Regards,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?

Profile Conan
Volunteer tester
Avatar

Joined: Sep 13 06
Posts: 219
ID: 100
Credit: 4,256,493
RAC: 0
Message 3061 - Posted 16 Apr 2007 1:57:27 UTC - in response to Message ID 3059 .

Hello Conan,

I posted a bunch of information here which you might want to read. It goes into the whole HR/Disk type and cache size/file access timestamping/Multiple Core thing in detail as well as possible interactions between them.

I wouldn't abandon D@H yet. I think the big problem is that, for various reasons (IIRC including a conference within the last few days), the actual Docking programming staff is off or out of town this weekend. I think they'll be back either Monday or Tuesday and will probably do something within a few hours of reading all the data accumulating here in the message boards.

I may be a forum moderator but I'm just a volunteer so I don't have access to the software or servers. In fact, I'm in northern Alabama and D@H is in Texas.

I wish I could be more help to you Conan but until Andre gets back in town there's nothing I can do except try to collect data and respond to people here in the forum. I know you're probably thinking that it wasn't a good idea to release a new D@H app just before going out of town, but I would think that they tested this a lot before releasing it. Unfortunately, I suspect that what they interpreted as an increased number of floating point operations per WU was actually this problem but they didn't have a configuration where it stood out as much as it does on some of the user configurations.

BTW, I think most (or all) of their Windows testing is done using VMware on Linux hosts. Also, didn't Andre say a while back that his personal machine is a Mac. It might be a PowerPC Mac, or I could be remembering wrong.

For now, there are 2 things I can suggest trying if there's not something else on your Linux machines that it would mess up.

1. If the linux partition is ext3, switch it to ext2 to turn off journaling.

2. Mount it with the noatime attribute to turn off file access timestamping.

I'm not sure if any of that would apply to Mac OS X machines. Aren't they based on some variant of FreeBSD which is Unix like? I've never used one.

I'm sure Andre will address this situation ASAP when he gets back.

Conan, you've been a very valuable contributor and I know I would hate to see you leave. Since your machines seem to be getting hit particularly hard by the changes, maybe you could suspend Docking or give it a smaller share until Andre returns and can address your issues. I'm hoping he'll be back tomorrow morning.

Regards,

-- David


> Thanks Rene and David.
I didn't say I was leaving Docking, just stopping my Linux machines; the Windows ones are still chugging away. This happens to be my favorite project, so I won't be leaving anytime soon. Besides, Andre was thinking of giving bonus points for the longer you are a member, so I can't pass that up.

I am also curious as to why the BOINC Manager keeps increasing the time to completion on these same Linux WUs; it has now increased to 9 hours, which is getting close to the Darwin users' 11 hours.

No, all is fine with me; no hissy fit yet, more curious than anything. I will stay, just trying to get the best production out of my computers that I can.
____________
Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3062 - Posted 16 Apr 2007 4:57:11 UTC

Conan Wrote:

> Thanks Rene and David.
I didn't say I was leaving Docking, just stopping my Linux machines; the Windows ones are still chugging away. This happens to be my favorite project, so I won't be leaving anytime soon. Besides, Andre was thinking of giving bonus points for the longer you are a member, so I can't pass that up.

I am also curious as to why the BOINC Manager keeps increasing the time to completion on these same Linux WUs; it has now increased to 9 hours, which is getting close to the Darwin users' 11 hours.

No, all is fine with me; no hissy fit yet, more curious than anything. I will stay, just trying to get the best production out of my computers that I can.


Hi,

Glad to hear you're staying!!!

As for the time to completion, I've never looked at the algorithm. I think the manager just gets it from the client, but I'd have to research how the client calculates it. I believe that the application occasionally tells it the percent complete, and each time the BOINC client is asked for the data, it returns the current CPU time used, the most recent percent complete that the application has set, and a calculated time to complete the workunit.

I think the latest clients also do some kind of FLOP counting. The time to completion is apparently calculated using accumulated CPU time used, percent complete, and possibly some combination of FLOP data / benchmark data / duration_correction_factor and who knows what else.

There must be some other flags in there that tell the BOINC client how to calculate it, because I've seen different behavior on different projects and work units. Sometimes, when one work unit finishes, later work units have their estimated runtimes adjusted, and sometimes they don't.

When the later work units do have their estimated runtimes adjusted, it seems to work like this: if the runtime of the just-completed WU is bigger than the estimated runtime of the later work units, all of the following work units get that actual runtime as their new estimated runtime; if it is less, the following WUs have their estimates adjusted downwards only a little, so it takes a lot of WUs completing with shorter runtimes before the estimates drop down near the times the WUs are actually finishing in now. Having one completed WU immediately increase the estimated runtime of follow-on WUs is probably a safeguard against fetching too much work for the machine to complete on schedule. Having one completed WU only decrease the estimated runtime of follow-on WUs by a small fraction of the difference is probably a safeguard against a few goofy WUs lowering the estimate too far and, again, causing too much work to be fetched. BOINC seems to be designed to be very conservative about this.
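As a toy model of that asymmetry (this is not the real BOINC code, and the tenth-of-the-gap step is just a made-up number to show the shape of the behavior):

# Toy model only: one long result raises the estimate at once, short results pull it down slowly
estimate=10000   # current per-WU runtime estimate, in seconds
update_estimate() {
    actual=$1
    if [ "$actual" -gt "$estimate" ]; then
        estimate=$actual                                      # jump straight up
    else
        estimate=$(( estimate - (estimate - actual) / 10 ))   # creep down by a tenth of the gap
    fi
    echo "WU ran ${actual}s -> new estimate ${estimate}s"
}
update_estimate 16000   # one long WU: estimate jumps to 16000
update_estimate 9000    # shorter WUs: estimate drops to 15300, then 14670, ...
update_estimate 9000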

I was reading the MalariaControl message boards and apparently they have a very short WU (normally about 20 minutes on a P4 2.4) where they use a wrapper to run a legacy external program. Since there's no way to know how far it has gotten, the wrapper just leaves it at zero percent complete until it finishes. Apparently there was an error in the legacy program or in the data it was fed. There were people over in their message board complaining that it had run for 14 hours and was still at zero percent, asking if they should kill it. AFAIK, MalariaControl pulled that series of WUs until they can figure out a solution.

I'd probably have to pull the source for the BOINC client and read through it to try to determine what the actual algorithm is, unless it's in the unofficial Wiki somewhere.

I hope this makes sense. I'm very tired and headed for bed soon. Between this crazy weather and my sinus allergies I'm running a low grade fever and feel yucky so the real question is whether I can get to sleep and how long I'll be able to sleep...

Happy Crunching,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Rene
Volunteer tester
Avatar

Joined: Oct 2 06
Posts: 121
ID: 160
Credit: 109,415
RAC: 0
Message 3063 - Posted 16 Apr 2007 5:55:05 UTC - in response to Message ID 3061 .

I didn't say I was leaving Docking, just stopping my Linux machines; the Windows ones are still chugging away.


I know... I was just teasing, hoping that you would keep some of your Linux boxes attached also.

;-)

____________
Profile Frank Boerner
Volunteer tester

Joined: Sep 13 06
Posts: 18
ID: 101
Credit: 744,548
RAC: 0
Message 3074 - Posted 16 Apr 2007 18:01:26 UTC

On my Linux machine the working time is always 0, 1, or 2 seconds for a WU, but the real time says about 2 or 3 hours.
http://docking.utep.edu/result.php?resultid=156040 for example

Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3083 - Posted 16 Apr 2007 22:54:30 UTC - in response to Message ID 3074 .

This is something I have not been able to reproduce in the lab and we've done many, many tests today. It might have to do with the particular boinc client you are using or it might be something else; don't know yet, but we'll keep on searching...

AK

On my Linux machine the working time is always 0, 1, or 2 seconds for a WU, but the real time says about 2 or 3 hours.
http://docking.utep.edu/result.php?resultid=156040 for example


____________
D@H the greatest project in the world... a while from now!
Profile David Ball
Forum moderator
Volunteer tester
Avatar

Joined: Sep 18 06
Posts: 274
ID: 115
Credit: 1,634,401
RAC: 0
Message 3086 - Posted 17 Apr 2007 1:03:23 UTC - in response to Message ID 3083 .

This is something I have not been able to reproduce in the lab and we've done many, many tests today. It might have to do with the particular boinc client you are using or it might be something else; don't know yet, but we'll keep on searching...

AK

On my Linux machine the working time is always 0, 1, or 2 seconds for a WU, but the real time says about 2 or 3 hours.
http://docking.utep.edu/result.php?resultid=156040 for example



I remember having a problem like this on Rosetta (IIRC) a long time ago. I don't know if it was a problem with the BOINC client or the application wrapper. Sometimes, when you were running multiple projects and the BOINC client switched between WU/projects, the old WU/project kept running after the new WU/project started running and they would each get half of the idle time on the CPU.

Consider the following scenario:

1. The client IS NOT SET to leave the applications in memory/swap when they're not running.

2. The BOINC client tries to switch, but the D@H application keeps running and the WU finishes.

3. The BOINC client decides to switch back to the D@H WU and starts a new copy of the application to continue from the last checkpoint. The new copy finds the WU already finished and reports the result immediately.

In this scenario, could you end up with only the 0 - 2 second CPU execution time of the last copy of the application client started (the one that found the WU already finished) being returned as the total execution time?

Consider this scenario:

1. The client IS SET to leave the applications in memory/swap when they're not running.

2. The BOINC client tries to switch, but the D@H application keeps running and the WU finishes.

If the D@H application and wrapper terminate when the WU finishes, then you're back to step 3 of the first scenario, because the BOINC client has to start a new copy of the D@H application.

otherwise:

3. The BOINC client decides to switch back to the D@H WU and tells the D@H application to continue. It would depend on the code in the BOINC client for tracking the accumulated CPU time for the client application (not sure how much the D@H application participates in this) but it might get confused and just report the last couple of seconds as the total runtime.


HTH,

-- David
____________
The views expressed are my own.
Facts are subject to memory error :-)
Have you read a good science fiction novel lately?
Profile Andre Kerstens
Forum moderator
Project tester
Volunteer tester
Avatar

Joined: Sep 11 06
Posts: 749
ID: 1
Credit: 15,199
RAC: 0
Message 3090 - Posted 18 Apr 2007 20:09:09 UTC - in response to Message ID 3061 .

This happens to be my favorite project, so I won't be leaving anytime soon. Besides, Andre was thinking of giving bonus points for the longer you are a member, so I can't pass that up.


The plan was to implement frequent cruncher credits this week, but unfortunately this will have to wait until we've fixed the problem with the charmm input file.

Thanks
Andre
____________
D@H the greatest project in the world... a while from now!
