Issues with new 5.05 Linux application
Message boards : Number crunching : Issues with new 5.05 Linux application
Running 32-bit boinc client 5.8.17 on a 64-bit Ubuntu 6.10 system. The new (32-bit) charmm application was downloaded by my system today. Two things I've noticed:
ID: 2997
> Running 32-bit boinc client 5.8.17 on a 64-bit Ubuntu 6.10 system. The new (32-bit) charmm application was downloaded by my system today. Two things I've noticed:

It also looks like your PC has done its calculation in less than a second (0.96). ;-) Did those units run in one instance, or were they running in a queue with other projects switching and asking for CPU time?
____________
ID: 2998
I do not have an explanation. It was mentioned that now a shell is being called, which in turn calls the application executable. Perhaps the 0.96 value is the time spent executing the shell (in other words, perhaps the time spent executing the application itself was not being measured).

Those units were downloaded and added to the ready queue at my computer. They were dispatched from the ready queue by the "normal" boinc client scheduling algorithm -- that is, they ran "normally" under the boinc client (being scheduled and dispatched alongside other tasks from other projects).

[I connect to the network only from time to time. The units which I tried to 'Suspend' have not yet been uploaded to the server. So far, one of them has finished - it too has an elapsed time value of around one second.]

So the short time reported appears to be yet another issue, in addition to the 'Suspend' issue described earlier.
ID: 3001
There might be something funny going on with your setup, because the subsecond execution time plus a valid result in particular is a bit of an oddity that we haven't seen before. Anybody else noticing this behavior on their Linux boxes? Memo will do some testing today or tomorrow to see if he can reproduce this.
____________
D@H the greatest project in the world... a while from now!
ID: 3006
My 64-bit computers are erroring on the 5.05s.
ID: 3009
Follow on:
ID: 3010
On output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, ranging from Conformation/Rotation 1/1 to 80/80, but the 5.05 application produces one line, for Conformation/Rotation 80/80. Is this correct?
ID: 3017
> On output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, ranging from Conformation/Rotation 1/1 to 80/80, but the 5.05 application produces one line, for Conformation/Rotation 80/80. Is this correct?

I was just noticing how much smaller that file was myself and wondering if it was broken. They said they were cutting out some of the debug information in 5.05. Maybe that's it. Andre or Memo would probably have to answer that for us, and I think Andre said something about being away for a few days in another thread. I don't know how often Memo is in the forums but, from the amount of work involved, I suspect they've had him chained to his computers getting the new version out.

> Also, does the 5.05 application require a certain level of libraries to run efficiently? I am running Ubuntu 6.06 and since the change to 5.05, my runtime has increased by ~29% on all units. I ran a couple of other projects to check, but they still run in the same amount of time as before, so it appears to only be D@H that has changed.

My Linux machine is Ubuntu 6.10 and it went from about 14800 seconds per WU to about 16000 seconds per WU. It's a Socket A Sempron and IIRC it only has 128KB of L2 cache. I wonder if the new app is more cache sensitive. One of my WinXP machines has a Socket 754 Sempron with 256KB L2 cache, and the runtime dropped by about 100 seconds per WU on that machine. My WinXP P4 2.8 with 512KB cache increased by a few hundred seconds, but that could have been related to system load; I've kept it busy with some database stuff today. My Vista PD 925 with 2x2MB cache decreased by about 300 seconds per WU. It's being burned in and is only running D@H, so it's very consistent.

It's interesting that the two machines which increased are older machines with IDE disks and the two which decreased are SATA. Maybe it's the disk subsystem and not the cache size. Didn't someone, in another thread, say something about overly aggressive checkpointing keeping the disk busy? Has anyone else seen a pattern related to either cache size or IDE vs SATA harddisks?

-- David
____________
The views expressed are my own. Facts are subject to memory error :-) Have you read a good science fiction novel lately?
ID: 3023
Hi all, I'm quite new to this project, but I have the same end result as j2satx.
ID: 3030
Yes, I have also noticed that the work units on Linux now take much longer.
ID: 3036
I started a thread about the Linux x64 issues. The script has a typo in it. Open the charmm script file and you will notice it mentions the i686 file on lines 22 and 25. The reason it throws an error message is that it can't find the charmm i686 binary.
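A fix along those lines could be sketched like this. Note that the exact script filename and the `i686`/`x86_64` binary names are assumptions based on this thread, not the project's confirmed file names, so check the script yourself before patching.

```python
# Hedged sketch: rewrite the wrapper script so it calls the 64-bit binary
# instead of the missing i686 one. The token being replaced ("i686") and
# its replacement ("x86_64") are assumptions based on this thread.

def patch_wrapper_text(text: str) -> str:
    """Replace references to the 32-bit binary with the 64-bit one."""
    return text.replace("i686", "x86_64")

def patch_wrapper_file(path: str) -> None:
    """Apply the replacement in place to the wrapper script on disk."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(patch_wrapper_text(text))
```

Running `patch_wrapper_file("charmm")` (assuming that is the script's name) would update both offending lines in one pass.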
ID: 3047
> Has anyone else seen a pattern related to either cache size or IDE vs SATA harddisks?

I don't know if there's a pattern, but what I can see is that my Pentium D 805 (1 MB per core) running Vista seems to be doing the WUs slightly faster. My Athlon 2600+ Barton (512 KB) has an increased running time... and that's my Linux host, running Kubuntu 6.10. Both are equipped with IDE disks. It's kinda like comparing apples and peaches in my case, but you might have a point here. Although the increase on my Linux host... approx 33%... could be cache related, I also think that it could be Linux related. Maybe others can confirm your assumption (or not)... ;-)

Edit: There's also a decrease of system load on my Linux host. The app uses between 75 and 90% of the CPU... this used to be 95-100%.
____________
ID: 3051
> On output files 1tng_mod001_xxxxx_xxxxxx_x_2, the 5.04 application produced 6400 lines of data, ranging from Conformation/Rotation 1/1 to 80/80, but the 5.05 application produces one line, for Conformation/Rotation 80/80. Is this correct?

David, I run SATA II discs on 4 of my 5 computers and do not see any difference between them and the one that has IDE; there is constant disk reading and writing on all. The difference that I and many others are seeing is between Windows/Linux/Darwin. With the new 5.05 charmm my Windows machines are actually a bit faster than before by some minutes. My Linux machines are more than 125% slower on 5.05 than on 5.04. The Darwin users seem to have times blowing out by over 600%. So the new version favours Windows and kills the others.

I will consider removing my Linux machines if the credit/hour does not improve (dropping from 17-20 per hour down to 7-8 per hour), as the results don't justify the time spent, especially if they then don't validate.
____________
ID: 3054
> I will consider removing my Linux machines if the credit/hour does not improve (dropping from 17-20 per hour down to 7-8 per hour), as the results don't justify the time spent, especially if they then don't validate.

I did not know it was already time to abandon ship... AFAIK the engines are only running slower in some parts of the engine room... we aren't sinking yet. ;-) Just give it some time, Conan... I'm sure a fix will be made. Until then, just grab hold of this one...
____________
ID: 3055
Hello Conan,
ID: 3059
> Hello Conan,

Thanks Rene and David. I didn't say I was leaving Docking, just stopping my Linux machines; the Windows ones are still chugging away. This happens to be my favorite project so I won't be leaving anytime soon; besides, Andre was thinking of giving bonus points the longer you are a member, so I can't pass that up.

I am also curious as to why the BOINC Manager keeps increasing the time to completion on these same Linux WUs, as they have now increased to 9 hours, which is getting close to the Darwin users' 11 hours. No, all is fine with me, no hissy fit yet, more curious than anything. I will stay, just trying to get the best production out of my computers that I can.
____________
ID: 3061
Conan Wrote:
> Thanks Rene and David.

Hi,

Glad to hear you're staying!!!

As for the time to completion, I've never looked at the algorithm. I think the manager just gets it from the client, but I'd have to research how the client calculates it. I believe that the application occasionally tells it the percent complete, and each time the boinc client is asked for the data, it returns the current CPU time used, the most recent percent complete that the application client has set, and a calculated time to complete the workunit. I think the latest clients also do some kind of FLOP counting. The time to completion is apparently calculated using accumulated CPU time used, percent complete, and possibly some combination of FLOP data / benchmark data / duration_correction_factor and who knows what else.

There must be some other flags in there that tell the BOINC client how to calculate it, because I've seen different behavior on different projects and work units. Sometimes, when one work unit finishes, later work units have their estimated runtimes adjusted, and sometimes they don't. If they are adjusted, and the runtime of the just-completed WU is bigger than the estimated runtime of the later work units, then all of the following work units get the actual runtime of that unit as their new estimated runtime. If it's less, the following WUs get their estimated runtimes adjusted downwards by a little bit, but it takes a lot of WUs completing with shorter runtimes to get the estimated runtimes to drop down near the runtimes the WUs are actually completing in now.

Having one completed WU be able to immediately increase the estimated runtime of follow-on WUs is probably a safeguard against fetching too much work for the machine to complete on schedule. Having one completed WU only be able to decrease the estimated runtime of follow-on WUs by a small fraction of the difference is probably a safeguard against a few goofy WUs lowering the estimated runtime too far and, again, fetching too much work for the machine to complete on schedule. BOINC seems to be designed to be very conservative on this.

I was reading the MalariaControl message boards, and apparently they have a very short WU (normally about 20 minutes on a P4 2.4) where they use a wrapper to run a legacy external program. Since there's no way to know how far it has gotten, the wrapper just leaves it at zero % complete until it finishes. Apparently there was an error in the legacy program or the data it was fed. People were complaining on their message board that it had run for 14 hours and was still at zero %, asking if they should kill it. AFAIK, MalariaControl pulled that series of WUs until they can figure out a solution.

I'd probably have to pull the source for the BOINC client and read through it to determine what the actual algorithm is, unless it's in the unofficial Wiki somewhere. I hope this makes sense. I'm very tired and headed for bed soon. Between this crazy weather and my sinus allergies I'm running a low-grade fever and feel yucky, so the real question is whether I can get to sleep and how long I'll be able to sleep...

Happy Crunching,
-- David
____________
The views expressed are my own. Facts are subject to memory error :-) Have you read a good science fiction novel lately?
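The asymmetric adjustment described above (jump up immediately, drift down slowly) can be sketched as a small update rule. This is a reconstruction from the behavior described in this thread, not the actual BOINC client source, and the 0.1 damping factor is purely an assumption for illustration.

```python
def update_duration_correction_factor(dcf: float, estimated: float, actual: float) -> float:
    """Sketch of an asymmetric duration-correction update, reconstructed
    from the behavior described above (not the actual BOINC source).

    dcf       -- current correction factor applied to raw estimates
    estimated -- raw (uncorrected) runtime estimate for the finished WU
    actual    -- runtime the WU actually took
    """
    ratio = actual / estimated
    if ratio > dcf:
        # A longer-than-expected WU raises the factor immediately,
        # guarding against fetching more work than can finish on time.
        return ratio
    # Shorter-than-expected WUs only pull the factor down by a small
    # fraction, so a few goofy short WUs can't crater the estimates.
    # The 0.1 damping factor is an assumption, not BOINC's real constant.
    return dcf + 0.1 * (ratio - dcf)
```

With this rule, one WU that runs twice as long as estimated doubles the factor at once, while a WU that runs at half the estimate only nudges the factor down by a twentieth.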
ID: 3062
> I didn't say I was leaving Docking, just stopping my Linux machines, the Windows ones are still chugging away.

I know... I was just teasing, hoping that you would keep some of your Linux boxes attached also. ;-)
____________
ID: 3063
On my Linux machine the working time is always 0, 1, or 2 seconds for 1 WU, but the real time says about 2 or 3 hours.
ID: 3074
> On my Linux machine the working time is always 0, 1, or 2 seconds for 1 WU, but the real time says about 2 or 3 hours.

This is something I have not been able to reproduce in the lab, and we've done many, many tests today. It might have to do with the particular boinc client you are using or it might be something else; we don't know yet, but we'll keep on searching...
____________
D@H the greatest project in the world... a while from now!
ID: 3083
> This is something I have not been able to reproduce in the lab and we've done many, many tests today. It might have to do with the particular boinc client you are using or it might be something else; don't know yet, but we'll keep on searching...

I remember having a problem like this on Rosetta (IIRC) a long time ago. I don't know if it was a problem with the BOINC client or the application wrapper. Sometimes, when you were running multiple projects and the BOINC client switched between WUs/projects, the old WU/project kept running after the new WU/project started, and they would each get half of the idle time on the CPU.

Consider the following scenario:
1. The client IS NOT SET to leave the applications in memory/swap when they're not running.
2. The BOINC client tries to switch, but the D@H application and WU keep running until the WU finishes.
3. The BOINC client decides to switch back to the D@H WU and starts a new copy of the application to continue from the last checkpoint. The new copy finds the WU finished and reports the result immediately.

In this scenario, could you end up with only the 0-2 second CPU execution time of the last copy of the application client started (the one that found the WU already finished) being returned as the total execution time?

Now consider this scenario:
1. The client IS SET to leave the applications in memory/swap when they're not running.
2. The BOINC client tries to switch, but the D@H application and WU keep running until the WU finishes. If the D@H application and wrapper terminate when the WU finishes, then you're back to step 3 of the first scenario, because the BOINC client has to start a new copy of the D@H application. Otherwise:
3. The BOINC client decides to switch back to the D@H WU and tells the D@H application to continue.

It would depend on the code in the BOINC client for tracking the accumulated CPU time for the client application (I'm not sure how much the D@H application participates in this), but it might get confused and just report the last couple of seconds as the total runtime.

HTH,
-- David
____________
The views expressed are my own. Facts are subject to memory error :-) Have you read a good science fiction novel lately?
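The failure mode in those scenarios comes down to how CPU time is accumulated across restarts. A minimal sketch of correct versus buggy accounting (the function names are hypothetical; this only illustrates the hypothesis, not actual BOINC or D@H code):

```python
def total_cpu_time(checkpointed: float, current_process: float) -> float:
    """Correct accounting: CPU time recorded in the last checkpoint
    plus CPU time used by the currently running process."""
    return checkpointed + current_process

def total_cpu_time_buggy(checkpointed: float, current_process: float) -> float:
    """Buggy accounting consistent with the symptom in this thread: a
    restarted process finds the WU already finished and reports only its
    own ~1 s of CPU time, discarding everything accumulated before the
    restart."""
    return current_process
```

Under the buggy variant, a WU that really took four hours before the final restart would still be reported as taking under a second, matching the 0.96 s results seen earlier in the thread.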
ID: 3086
> This happens to be my favorite project so I won't be leaving anytime soon, besides Andre was thinking of giving bonus points for the longer you are a member so I can't pass that up.

The plan was to implement frequent cruncher credits this week, but unfortunately this will have to wait until we've fixed the problem with the charmm input file.

Thanks,
Andre
____________
D@H the greatest project in the world... a while from now!
ID: 3090