Checkpointing
Author | Message | |
---|---|---|
I wanted to give the checkpointing behavior a good test so after 20 minutes of runtime with both cores running charmm, I stopped the boinc client completely and restarted it. I have my system set to keep suspended WUs in memory so that's the only way to get it to actually start from a checkpoint.
|
||
ID: 4097 | Rating: 0 | rate: / | ||
One of the WU mentioned above has finished. It matched a result from a p4 on XP and got credit.
|
||
ID: 4098 | Rating: 0 | rate: / | ||
One task is running on my host, but it seems that checkpointing isn't done;
|
||
ID: 4099 | Rating: 0 | rate: / | ||
I had several that ran smoothly start to finish, but no indication they ever had to pause and get out of the way for another project. Sorry, I never got a chance to "catch them in the act" and force a pause.
|
||
ID: 4100 | Rating: 0 | rate: / | ||
One of the WU mentioned above has finished. It matched a result from a p4 on XP and got credit.
A true checkpoint is not being stored. Instead, there is a file named percentdone.txt which holds some restart data. The percentdone.txt doesn't have the CPU time stored in it, so the CPU time will revert to 0 if the WU has to restart after a reboot or a stop of the core client. Here are the contents of one of the files:
* PERCENT DONE STREAM FILE
*
SET FDONE = 15.2312
SET RUNNAME = 1ABE
SET SEED = 1.699129E+06
SET RANDSEED = 1.651957E+06
SET MAXIC = 320
SET MAXROT = 20
SET NUMIC = 46
SET RNUM = 20
SET MAXENER = -8.42386
SET MAXRMSD = 0.168544
SET REJECTS = 0
SET STEPSDONE = 920
Somewhere in there the current CPU time needs to be recorded. |
||
ID: 4101 | Rating: 0 | rate: / | ||
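The restart file quoted above is a list of `SET NAME = VALUE` lines, so recording the CPU time could be as simple as adding one more entry. Here is a minimal Python sketch of that idea. It is not the project's code: the `CPUTIME` key, its value, and the helper names are hypothetical, and the sample uses only a few of the quoted lines.

```python
def parse_stream_file(text):
    """Parse 'SET NAME = VALUE' lines into a dict of strings."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Title lines in the quoted file start with '*'; skip them and blanks.
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "SET" and "=" in parts[1]:
            name, value = parts[1].split("=", 1)
            values[name.strip()] = value.strip()
    return values


def format_stream_file(values, title="PERCENT DONE STREAM FILE"):
    """Write the dict back out in the same 'SET NAME = VALUE' layout."""
    lines = ["* " + title, "*"]
    lines += ["SET {} = {}".format(name, value) for name, value in values.items()]
    return "\n".join(lines) + "\n"


SAMPLE = """* PERCENT DONE STREAM FILE
*
SET FDONE = 15.2312
SET RUNNAME = 1ABE
SET STEPSDONE = 920
"""

state = parse_stream_file(SAMPLE)
# CPUTIME is a hypothetical key (and value) added for illustration only.
state["CPUTIME"] = "1200.0"
restored = parse_stream_file(format_stream_file(state))
```

After a restart, the value read back from `restored["CPUTIME"]` would be the starting point for the CPU time instead of 0.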
Hi,
|
||
ID: 4102 | Rating: 0 | rate: / | ||
Hi David,
One of the WU mentioned above has finished. It matched a result from a p4 on XP and got credit. ____________ D@H the greatest project in the world... a while from now! |
||
ID: 4103 | Rating: 0 | rate: / | ||
Hi, Thanks Dr. Taufer. We have added boinc_checkpoint_completed to the charmm script and now the CPU time should start from the previous time before the task was suspended/stopped. Now in the stderr.txt file you should see the message: "Starting charmm run (initial or from checkpoint)". We will distribute 300 WUs to test this. Thanks for your feedback and help ! Arun |
||
ID: 4106 | Rating: 0 | rate: / | ||
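To illustrate what "start from the previous time" means, here is a small Python sketch of a work loop that saves its progress together with the accumulated CPU time and picks both up again after a stop. It only mimics the idea in plain Python and does not call the BOINC API; in the real application the checkpoint is reported with boinc_checkpoint_completed, as described above. The file name, the state fields, and the one-second-per-step stand-in for CPU time are all assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"steps_done": 0, "cpu_time": 0.0}


def run(path, total_steps, stop_after=None, seconds_per_step=1.0):
    """Run from the last checkpoint; stop_after mimics a client shutdown."""
    state = load_checkpoint(path)
    executed = 0
    while state["steps_done"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            break
        state["steps_done"] += 1
        state["cpu_time"] += seconds_per_step  # stand-in for measured CPU time
        executed += 1
        save_checkpoint(path, state)
    return state


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
first = run(ckpt, total_steps=10, stop_after=4)   # interrupted run
second = run(ckpt, total_steps=10)                # resumes at step 4
```

The second call continues from step 4 with the CPU time already accumulated, rather than starting both from zero.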
25-6-2008 21:32:30|Docking@Home|[checkpoint_debug] result 1tng_mod0011sc_621_255286_1 checkpointed
Looks OK from here. I did a stop and go with the BOINC manager and the WU picked up its previous checkpoint and running time. ;-) |
||
ID: 4107 | Rating: 0 | rate: / | ||
Thanks Dr. Taufer. We have added boinc_checkpoint_completed to the charmm script and now the CPU time should start from the previous time before the task was suspended/stopped. Now in the stderr.txt file you should see the message: "Starting charmm run (initial or from checkpoint)". We will distribute 300 WUs to test this.
Checkpointing is working here. Upon stopping the daemon, the time of the last checkpoint is written into init_data.xml and the client reverts to that time when resuming the task. I notice it does checkpoint quite often, about every 22 seconds on this machine, even though I have the preferences set to 300 seconds. Not a problem, but it does make for a lot of messages with checkpoint debugging enabled. |
||
ID: 4108 | Rating: 0 | rate: / | ||
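For comparison, here is a rough Python sketch of what honoring a minimum checkpoint interval would look like: the application reaches a possible checkpoint after every unit of work but only writes one when enough time has passed since the last write. This is an illustration, not the project's behavior. The BOINC API provides boinc_time_to_checkpoint() for this kind of check, but the class and function names below are hypothetical and the timing is simulated.

```python
class CheckpointThrottle:
    """Allow a checkpoint only if min_interval seconds have passed."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last = None  # time of the last checkpoint actually written

    def due(self, now):
        if self.last is None or now - self.last >= self.min_interval:
            self.last = now
            return True
        return False


def count_checkpoints(unit_seconds, total_seconds, min_interval):
    """Count checkpoints written when a unit of work ends every unit_seconds."""
    throttle = CheckpointThrottle(min_interval)
    written = 0
    t = unit_seconds
    while t <= total_seconds:
        if throttle.due(t):
            written += 1
        t += unit_seconds
    return written


# One simulated hour with a possible checkpoint every 22 seconds.
every_unit = count_checkpoints(22, 3600, 0)
throttled = count_checkpoints(22, 3600, 300)
```

With a 300 second preference the throttled run writes far fewer checkpoints over the same hour, at the cost of losing up to one interval of work on a restart.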
Thanks for letting us know checkpointing is working fine. The model we are running right now is a simple model, and since we are checkpointing at the end of each conformation, the time between checkpoints is short. We are developing newer models which will have a longer interval (~6 minutes) between checkpoints. For the current model the time is around 70-80 seconds on an old P4 machine and 16-20 seconds on a dual-core machine for the 1abe and 1tng complexes. Thanks for your feedback and help! Arun |
||
ID: 4112 | Rating: 0 | rate: / | ||
I realize at this point that you are concentrating on solving checkpointing issues. However, I've noticed two validated test workunits of mine have problems in relation to credits claimed vs. granted. Right now, it seems that the lowest credit claimed results in both computers granted that credit, regardless of when they were returned. On the following two workunits, my wingman's results brought the granted value down quite sharply. The crunched times are very different, also:
|
||
ID: 4114 | Rating: 0 | rate: / | ||
Hi,
|
||
ID: 4115 | Rating: 0 | rate: / | ||
Ah, yes. Fixed credits would certainly solve it. |
||
ID: 4117 | Rating: 0 | rate: / | ||
I ran the same test as before (stopped and re-started the BOINC client) and everything worked great this time. The cpu time and WU progress continued from the last checkpoint and the WUs validated. Looks good.
|
||
ID: 4121 | Rating: 0 | rate: / | ||
I had at least two wu's in the last batch that successfully paused to allow other projects in BOINC to crunch and then resumed to finish. Both had an initial estimated work time of about 7hrs20min, and actually completed in about 2 hours. Pause was at about 1hr40min.
|
||
ID: 4139 | Rating: 0 | rate: / | ||
I had at least two wu's in the last batch that successfully paused to allow other projects in BOINC to crunch and then resumed to finish. Both had an initial estimated work time of about 7hrs20min, and actually completed in about 2 hours. Pause was at about 1hr40min. We have distributed 300 WUs with revised FLOPS estimate. Please give your feedback for these workunits. Thanks Arun |
||
ID: 4140 | Rating: 0 | rate: / | ||
I had at least two wu's in the last batch that successfully paused to allow other projects in BOINC to crunch and then resumed to finish. Both had an initial estimated work time of about 7hrs20min, and actually completed in about 2 hours. Pause was at about 1hr40min. What are we supposed to be looking for with the revised FLOPS estimate, the time it takes to run the Wu compared to the Time it estimates it will run it ... ??? |
||
ID: 4141 | Rating: 0 | rate: / | ||
Yes, and also whether the number of tasks your client is getting, based on the new FLOPS count for each complex, is appropriate for your client settings and host CPU speed. Related discussion. Thanks for your feedback. Arun |
||
ID: 4142 | Rating: 0 | rate: / | ||
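As a rough illustration of why the FLOPS estimate affects how many tasks a host receives, here is a simplified Python sketch: the estimated runtime is the workunit's FLOPS estimate divided by the host's speed, and the number of tasks requested is however many are needed to fill the work buffer. This is a simplification, not the actual BOINC scheduling logic, which takes more into account. All names and numbers below are made up for the example.

```python
import math


def estimated_runtime(fpops_est, host_flops):
    """Estimated seconds for one task: operations / operations per second."""
    return fpops_est / host_flops


def tasks_to_fill_buffer(buffer_seconds, n_cpus, runtime_seconds):
    """Tasks needed to keep n_cpus busy for buffer_seconds."""
    return math.ceil(buffer_seconds * n_cpus / runtime_seconds)


# Hypothetical host: 2 cores at 2 GFLOPS each, one day of work buffered.
runtime = estimated_runtime(12_000_000_000_000, 2_000_000_000)
tasks = tasks_to_fill_buffer(86400, 2, runtime)

# The same buffer with an estimate that is 3.5 times too high.
tasks_overestimated = tasks_to_fill_buffer(86400, 2, runtime * 3.5)
```

An estimate that is too high makes the client fetch fewer tasks than it could actually finish, and an estimate that is too low does the opposite.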
Okay, I received quite a few of the WUs (about 220, musta been putting in calls just at the right time) across my farm but will run them as fast as I can ... :) ... I should have them run out by the morning.
|
||
ID: 4143 | Rating: 0 | rate: / | ||
Okay, I received quite a few of the WUs (about 220, musta been putting in calls just at the right time) across my farm but will run them as fast as I can ... :) ... I should have them run out by the morning.
You sure did get a lot of the WUs! Looks like the FLOPS estimate is closer to the actual running time. With checkpointing it may take a different amount of time on other hosts with similar resources. Thanks for your feedback. Arun |
||
ID: 4144 | Rating: 0 | rate: / | ||
Finished 2 more on another Windows box in 1:49:16 that were estimated to run between 2:07:55 and 2:20:42, so the estimates seem a little high, but personally I would rather have them on the high side a little so you don't overload with them when you first start to get some ... :)
|
||
ID: 4145 | Rating: 0 | rate: / | ||
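To put a number on "a little high", the times quoted above can be converted to seconds and compared. This is a quick Python sketch using only the figures from the post; the function names are just for illustration.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overestimate_percent(estimate, actual):
    """How far the estimate is above the actual runtime, in percent."""
    est, act = to_seconds(estimate), to_seconds(actual)
    return 100.0 * (est - act) / act


low = overestimate_percent("2:07:55", "1:49:16")
high = overestimate_percent("2:20:42", "1:49:16")
print(f"{low:.1f}% to {high:.1f}%")  # prints "17.1% to 28.8%"
```

So on that host the estimates were roughly 17% to 29% above the actual runtime.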
Have all but 8 of the 220 turned in already, those 8 are on my slowest Quad but will be done soon too ... :) |
||
ID: 4146 | Rating: 0 | rate: / | ||
Running much closer to the estimates this time, and pausing and resuming as expected.
|
||
ID: 4147 | Rating: 0 | rate: / | ||