Posts by Arun

Message boards : Number crunching : Checkpointing

( Message 4144 )
Posted 3327 days ago by

Okay, I received quite a few of the WUs (about 220, musta been putting in calls just at the right time) across my Parm, but will run them as fast as I can ... :) ... I should have them run out by the morning.

PS: Finished 2 WUs on 1 box in 1:30:12 that were estimated to run 1:59:34 ...

You sure did get a lot of the WUs! It looks like the FLOPS estimate is closer to the actual running time. With checkpointing, the running time may differ on other hosts with similar resources.
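One simple way to report this kind of feedback is the ratio of actual to estimated runtime. A value below 1 means the estimate was pessimistic. The minimal Python sketch below applies it to the times quoted above. The helper names are illustrative and are not part of BOINC.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def runtime_ratio(actual: str, estimated: str) -> float:
    """Actual runtime divided by the client's estimate (1.0 = perfect estimate)."""
    return to_seconds(actual) / to_seconds(estimated)


# Figures from the quoted post: finished in 1:30:12, estimated at 1:59:34.
ratio = runtime_ratio("1:30:12", "1:59:34")
print(f"actual/estimated = {ratio:.2f}")  # prints "actual/estimated = 0.75"
```

A ratio that stays close to 1 across many WUs is the sign that the revised FLOPS estimate is about right.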

Thanks for your feedback.

Arun

Message boards : Number crunching : Checkpointing

( Message 4142 )
Posted 3327 days ago by

Arun

What are we supposed to be looking for with the revised FLOPS estimate: the time it takes to run the WU compared to the time it was estimated to take ... ???

Yes, and also whether the number of tasks your client receives, based on the new FLOPS count for each complex, is appropriate for your client settings and host CPU speed.
Related discussion.

Thanks for your feedback.

Arun

Message boards : Number crunching : Checkpointing

( Message 4140 )
Posted 3327 days ago by

Arun

I had at least two WUs in the last batch that successfully paused to allow other projects in BOINC to crunch and then resumed to finish. Both had an initial estimated work time of about 7 hrs 20 min and actually completed in about 2 hours. The pause was at about 1 hr 40 min.

See WU 3394 and WU 4034 for details.

We have distributed 300 WUs with a revised FLOPS estimate. Please give us your feedback on these workunits.

Thanks
Arun

Message boards : Number crunching : Cobblestones

( Message 4135 )
Posted 3330 days ago by

Arun

Hmmm, that sounds interesting. Yes, please let the project team know if you find something on that. In the meantime, Arun could try commenting out the BOINC calls in the CHARMM code and see whether the time calls are still being made. If not, that points in the direction you are thinking.

Andre

Do you know for sure if the problem still exists? Unfortunately, I don't remember the details, but while Docking@Home was shut down there was a fix mentioned on the BOINC developers mailing list (might have been the forums) that sounded to me like it might have been causing a similar problem. It's been a long time, so I don't recall whether it was in the BOINC client or in the application framework that was distributed. Since it didn't affect most applications, it must have been in a support function or something. What it sounded like was a polling loop with no delay in it, calling the OS time-of-day function to check elapsed time. It might have had something to do with a heartbeat function. I'll see if I can find it.
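The kind of polling loop David describes can be illustrated with a small Python sketch. It is only an analogy: `time.monotonic()` stands in for whatever time call (such as `times()` on Linux) the real code makes, and both function names are hypothetical. The first loop spins on the clock with no delay; the second sleeps between reads, so it makes only a handful of clock calls for the same wait.

```python
import time


def busy_poll(duration: float) -> int:
    """Wait `duration` seconds by spinning on the clock; return the number of clock reads."""
    calls = 0
    start = time.monotonic()
    while True:
        calls += 1
        if time.monotonic() - start >= duration:
            return calls


def sleepy_poll(duration: float, interval: float = 0.01) -> int:
    """Wait `duration` seconds, sleeping `interval` seconds between clock reads."""
    calls = 0
    start = time.monotonic()
    while True:
        calls += 1
        if time.monotonic() - start >= duration:
            return calls
        time.sleep(interval)


print("busy :", busy_poll(0.1))    # typically hundreds of thousands of reads
print("sleep:", sleepy_poll(0.1))  # roughly ten reads
```

A loop like the first one would show up in a profiler or in strace as a very large number of time calls, even though it does no useful work.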

Andre and David,
Thanks for the informative discussion. I used the gprof profiling tool and found that the times() function accounts for 7.02% of the run time: 5.12 seconds out of a total of 72.98 seconds for this CHARMM execution. times() was the third most time-consuming function, after the enbfs8 and ephifs Fortran routines. The strace output also showed that times() is called many times. Any suggestions?

David, any information you can find will be useful.

cheers
Arun

Message boards : Number crunching : Deadlines are very short.

( Message 4132 )
Posted 3331 days ago by

Arun

Rene Oskam wrote:

Just picked up 2 of them. The first is running in high-priority mode at the moment, but I think that's because of the "Task duration correction factor" of my host. It thinks the WU will take about 16 hours, but 3 hours would be more accurate.

Time will cure that issue...

;-)

Unless the <rsc_fpops_est> for each ligand type is set closer to the actual amount, time will not cure that issue. The 1tng WUs are set too high and the 1abe WUs are set too low. The following numbers would bring the DCF closer to 1 instead of having it jump around when running a mix of 1tng and 1abe.

1abe is issued with a setting of 10,000,000,000,000.000000 (I added the commas) when a much closer estimate is 18,000,000,000,000.000000

1tng is issued with a setting of 52,558,670,000,000.000000 but actually is closer to 20,600,000,000,000.000000

A single 1abe task will cause the DCF to jump to around 1.8 or higher, and since 1tng is already estimated at 2.5 times its real runtime, a DCF of 1.8 will make its estimate almost 5 times the correct runtime.
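The DCF jump can be reproduced with a rough simulation. This is a simplified model, not the BOINC client's actual code: it assumes the DCF rises immediately to the observed ratio when a task runs longer than predicted and otherwise decays slowly (here by 10% of the gap). It also assumes the host's FLOPS benchmark is accurate, so the raw ratio is just true work divided by `<rsc_fpops_est>`. The work figures are the ones quoted above.

```python
FPOPS_EST = {"1abe": 10_000_000_000_000, "1tng": 52_558_670_000_000}   # as issued
FPOPS_TRUE = {"1abe": 18_000_000_000_000, "1tng": 20_600_000_000_000}  # closer estimates


def raw_ratio(kind: str) -> float:
    """Actual runtime / uncorrected estimate (assumes an accurate host benchmark)."""
    return FPOPS_TRUE[kind] / FPOPS_EST[kind]


def update_dcf(dcf: float, ratio: float) -> float:
    """Simplified rule (assumption): jump up at once, decay slowly."""
    if ratio > dcf:
        return ratio
    return 0.9 * dcf + 0.1 * ratio


dcf = 1.0
for kind in ["1tng", "1tng", "1abe"]:
    dcf = update_dcf(dcf, raw_ratio(kind))
    print(f"after {kind}: DCF = {dcf:.2f}")

# How far off is the next 1tng estimate once the DCF has jumped?
overestimate = dcf / raw_ratio("1tng")
print(f"1tng estimate is {overestimate:.1f}x its real runtime")  # 4.6x
```

Under these assumptions the result, about 4.6 times, matches the "almost 5 times" observed.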

I use a 16-hour buffer, and BOINC will only download 2 1tng tasks because the high fpops_est causes it to estimate each task at 8:27 when it only takes about 1:52. On the other hand, if it grabs only 1abe tasks, it gets too many because the low fpops_est causes an estimate of 0:43 when it really takes approximately 1:36.
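The effect on work fetch can be worked out with the figures above. The sketch assumes a simplified fetch rule: the client keeps requesting tasks until the queued estimated work covers the buffer. The real scheduler is more involved, so treat the counts as illustrative.

```python
import math


def minutes(hm: str) -> int:
    """Convert an 'H:MM' string to minutes."""
    h, m = (int(x) for x in hm.split(":"))
    return h * 60 + m


def tasks_to_fill(buffer_min: int, est_min: int) -> int:
    """Simplified rule (assumption): fetch until estimated work >= buffer."""
    return math.ceil(buffer_min / est_min)


BUFFER = 16 * 60  # 16-hour work buffer, in minutes
for name, est, real in [("1tng", "8:27", "1:52"), ("1abe", "0:43", "1:36")]:
    n = tasks_to_fill(BUFFER, minutes(est))
    hours = n * minutes(real) / 60
    print(f"{name}: {n} tasks, about {hours:.1f} h of real work")
# 1tng: 2 tasks, about 3.7 h of real work
# 1abe: 23 tasks, about 36.8 h of real work
```

The same 16-hour buffer ends up holding under 4 hours of 1tng work but nearly 37 hours of 1abe work.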

Thanks for your feedback. Your estimates are very good. From the next batch onward, we will distribute 1tng and 1abe for 320 conformations and 20 rotations with these FLOPS counts (+10%):

1tng - 22,000,000,000,000
1abe - 20,000,000,000,000
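A quick check of why these values should steady the DCF: if both complexes run at about the same fraction of their estimate, one correction factor fits both. The sketch below treats the figures from the quoted post as the true work per WU, which is itself only an estimate.

```python
OLD_EST = {"1abe": 10.0e12, "1tng": 52.55867e12}  # rsc_fpops_est as issued
NEW_EST = {"1abe": 20.0e12, "1tng": 22.0e12}      # values announced above
MEASURED = {"1abe": 18.0e12, "1tng": 20.6e12}     # closer estimates from the post


def ratios(estimates: dict) -> dict:
    """Actual/estimated work per complex; similar ratios let one DCF fit both."""
    return {k: MEASURED[k] / estimates[k] for k in MEASURED}


def spread(estimates: dict) -> float:
    """How far apart the two ratios are (1.0 = identical)."""
    r = ratios(estimates)
    return max(r.values()) / min(r.values())


for label, est in (("old", OLD_EST), ("new", NEW_EST)):
    r = ratios(est)
    print(label, {k: round(v, 2) for k, v in r.items()}, f"spread {spread(est):.1f}x")
# old {'1abe': 1.8, '1tng': 0.39} spread 4.6x
# new {'1abe': 0.9, '1tng': 0.94} spread 1.0x
```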

We are testing with these FLOPS counts.

Arun

Message boards : Number crunching : No work available for x64 Windows machines?

( Message 4131 )
Posted 3331 days ago by

Arun

Hello everyone, I'm always getting this annoying error message when trying to do some testing... *sob sob* ;-)

26.06.2008 22:46:05|Docking@Home|Message from server: No work sent
26.06.2008 22:46:05|Docking@Home|Message from server: Charmm with screensaver is not available for your type of computer.

I know I have gotten work on that host before... It's my lappy. *scratches head*

Cori, as Bob said, are you using the 6.2.x client?

Message boards : Number crunching : Deadlines are very short.

( Message 4124 )
Posted 3331 days ago by

Arun

As the title says, the deadlines are very short for this run.

We have created more workunits and have increased the deadline to 5 days.

Thanks for your feedback and help.

Arun

Message boards : Number crunching : Cobblestones

( Message 4113 )
Posted 3332 days ago by

Arun

Good point. No, I've not been able to figure out why these system calls to get the time of day on Linux are made a gazillion times per run. I do think this might be the cause of the massive difference in runtime between Linux and Windows. The runtime difference issue is already on the project's to-do list, so I'll make sure that whatever notes I have on this are passed on to the next person trying to crack this issue.

Cheers
Andre

The difference in execution time between Windows and Linux needs to be resolved before we move to fixed credit based on FLOPS. I will be working on this issue tomorrow. Andre, can you pass me your notes on this issue?
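For context on what fixed credit based on FLOPS could look like: BOINC defines a cobblestone as 1/200 of a day of CPU time on a reference machine doing 1,000 MFLOPS (Whetstone). Assuming credit were computed directly from the FLOPS estimate with that definition (the project's actual formula is not stated in this thread), the 1tng and 1abe counts announced in the "Deadlines are very short." thread would translate as follows.

```python
SECONDS_PER_DAY = 86_400
REFERENCE_FLOPS = 1e9            # 1,000 MFLOPS reference machine
CREDITS_PER_REFERENCE_DAY = 200  # one cobblestone = 1/200 reference-day


def credit_from_fpops(fpops: float) -> float:
    """Cobblestones for `fpops` floating-point operations (standard definition)."""
    reference_days = fpops / REFERENCE_FLOPS / SECONDS_PER_DAY
    return reference_days * CREDITS_PER_REFERENCE_DAY


for kind, fpops in [("1tng", 22e12), ("1abe", 20e12)]:
    print(f"{kind}: {credit_from_fpops(fpops):.1f} credits")
# 1tng: 50.9 credits
# 1abe: 46.3 credits
```

Because this credit depends only on the FLOPS count, a WU would earn the same amount on Linux and Windows. That is why the runtime gap has to be understood first.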

cheers,
Arun

Message boards : Number crunching : Checkpointing

( Message 4112 )
Posted 3332 days ago by

Arun

Checkpointing is working here. Upon stopping the daemon, the time of the last checkpoint is written into init_data.xml, and the client reverts to that time when resuming the task. I notice it checkpoints quite often, about every 22 seconds on this machine, even though I have the preferences set to 300 seconds. Not a problem, but it does make for a lot of messages with checkpoint debugging enabled.

Thanks for letting us know checkpointing is working fine. The model we are running right now is a simple one, and since we checkpoint at the end of each conformation, the interval between checkpoints is short. We are developing newer models that will have longer intervals (~6 minutes) between checkpoints.

For the current model, the interval is around 70-80 seconds on older P4 machines and 16-20 seconds on dual-core machines for the 1abe and 1tng complexes.
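Checkpointing at the end of each conformation can be sketched as below, in Python for illustration only. The real application is CHARMM with BOINC calls, and this is not its code. The function name, the JSON state file, and the `next_conformation` key are all hypothetical. On restart, work resumes after the last completed conformation, so at most one conformation is recomputed. This is also why the checkpoint interval tracks the per-conformation time rather than the 300-second preference.

```python
import json
import os
import tempfile


def run_docking(n_conformations, state_path, process):
    """Process conformations in order, checkpointing after each one.

    Returns the index work resumed from (0 on a fresh start).
    """
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["next_conformation"]
    for i in range(start, n_conformations):
        process(i)
        # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_conformation": i + 1}, f)
        os.replace(tmp, state_path)
    return start


class Interrupted(Exception):
    pass


done = []


def stop_at_3(i):
    """Simulate the task being stopped while working on conformation 3."""
    if i == 3:
        raise Interrupted
    done.append(i)


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_docking(5, path, stop_at_3)
except Interrupted:
    pass
resumed_from = run_docking(5, path, done.append)
print(done, resumed_from)  # [0, 1, 2, 3, 4] 3
```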

Thanks for your feedback and help !

Arun


Message boards : Getting started : Invitation Code?

( Message 4110 )
Posted 3332 days ago by

Arun

We are working on opening Docking@Home to the public. Hopefully it will happen soon. Please check your PMs.
