Work Units That Never Want To End
Message boards : Number crunching : Work Units That Never Want To End
Author | Message | |
---|---|---|
Over the past month or so I have had a couple of work units that showed no progress. I either aborted them when I found them or restarted BOINC, and then all was OK.
|
||
ID: 4946 | Rating: 0 | rate: / | ||
I am having a similar problem with every work unit that starts with "1hpv". They are all running as if they are working, but no progress is being shown. It normally takes my computer around 2 hours to complete a work unit, and these work units were running for 3 hours and still showing 0% progress.
|
||
ID: 4949 | Rating: 0 | rate: / | ||
I also can confirm that '1hpv' work units show the time going up but no progress.
|
||
ID: 4950 | Rating: 0 | rate: / | ||
I can also confirm the same issue on both Windows and Linux, with 1hpv units running for 8 hours or more with zero progress and either no sign of completing, or a computation error after 8 or so hours.
|
||
ID: 4952 | Rating: 0 | rate: / | ||
I can confirm the same problem with 1hpv units. All sticking on zero progress
|
||
ID: 4953 | Rating: 0 | rate: / | ||
I am also seeing the 1hpv_mod.. WUs "hanging" as described above on one of my machines (Arch Linux on an Intel Q6600). All WUs sit at 0% while using 100% CPU for more than 27,732 seconds, then error out. (Normal runtime on this machine is approx. 10,000 seconds.)
|
||
ID: 4954 | Rating: 0 | rate: / | ||
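The symptom reported in the posts above (elapsed time far beyond the usual runtime while progress stays at 0%) can be written down as a simple check. This is only a minimal sketch, not part of BOINC or Docking@Home: the function name `looks_stalled`, the `factor` parameter and its default of 2.0 are assumptions chosen for illustration.

```python
def looks_stalled(elapsed_s, fraction_done, typical_runtime_s, factor=2.0):
    """Heuristic: flag a task that shows zero progress after running
    much longer than a typical work unit on the same machine.

    elapsed_s         -- seconds the task has been running
    fraction_done     -- reported progress, 0.0 to 1.0
    typical_runtime_s -- usual runtime of a work unit on this host
    factor            -- hypothetical threshold multiplier (assumption)
    """
    if typical_runtime_s <= 0:
        raise ValueError("typical_runtime_s must be positive")
    return fraction_done <= 0.0 and elapsed_s > factor * typical_runtime_s


# Figures from the post above: 27,732 s at 0% versus ~10,000 s normal runtime.
print(looks_stalled(27732, 0.0, 10000))
```

A task that reports some progress is never flagged by this check, so it says nothing about work units that pause partway through.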
This WU, also a 1hpv one, is "running", but the time to completion is not changing and the progress bar is at 0.000%. Suspended pending advice.
|
||
ID: 4955 | Rating: 0 | rate: / | ||
Exactly the same problem here, Linux and Windows. Cancelling the units clears the problems but new work is full of 1hpv, presumably because everyone is cancelling and so they are being resent. Have set the project to no new work and am crunching the other units in cache as normal. Please sort this out, and cancel the units at server level - how many people have not even noticed? |
||
ID: 4956 | Rating: 0 | rate: / | ||
1hpv WUs are doing the same thing here. Is there an admin reading this thread? |
||
ID: 4957 | Rating: 0 | rate: / | ||
Getting some tasks like a 1hbv... that went to 52.975% and quit showing progress, and one 1hpv... that shows 0% all the time with no change to time to complete, but CPU time is clocking up. Had another one I aborted prior to these two, thinking it was just a bad WU.
|
||
ID: 4958 | Rating: 0 | rate: / | ||
>>> Is there an admin reading this thread?
|
||
ID: 4959 | Rating: 0 | rate: / | ||
[quote]Exactly the same problem here, Linux and Windows. Cancelling the units clears the problems but new work is full of 1hpv, presumably because everyone is cancelling and so they are being resent.[/quote]
Actually, they are set to 0/1/1, so they are canceled on the first error. That said, I just noticed a bunch of red units in BoincView. All hpv's. All ran 12 1/2 hrs before exceeding the CPU limit, claiming 150 credits and getting none. So, the good news is ... they only fail once. The bad news is they run for 12 hrs instead of 2, and you get no credit. |
||
ID: 4960 | Rating: 0 | rate: / | ||
[quote]Exactly the same problem here, Linux and Windows. Cancelling the units clears the problems but new work is full of 1hpv, presumably because everyone is cancelling and so they are being resent.[/quote]
Sorry, my mistake. I must have just had bad luck on the downloads then. |
||
ID: 4961 | Rating: 0 | rate: / | ||
I'm seeing the same problem with 1hpv wu's running on Vista. I have tried them all and wound up aborting them all and accepting no new work until the problem is corrected.
|
||
ID: 4962 | Rating: 0 | rate: / | ||
>>> Is there an admin reading this thread?
It's been 28 hours since this problem was first posted. I doubt people sleep that long even in Delaware :-) |
||
ID: 4963 | Rating: 0 | rate: / | ||
looks like all the 1hpv wu are bad..... i'm also having same problems....
|
||
ID: 4964 | Rating: 0 | rate: / | ||
Can't get any decent units & no response from admins, so no new tasks set. Downloading Sztaki instead. |
||
ID: 4968 | Rating: 0 | rate: / | ||
Dear All,
|
||
ID: 4969 | Rating: 0 | rate: / | ||
We suspended the generation of jobs. We are looking at the problem and we expect to restart generation of more robust jobs in the next 5 hours.
|
||
ID: 4970 | Rating: 0 | rate: / | ||
You need to cancel all 1hpv tasks on the server.
|
||
ID: 4973 | Rating: 0 | rate: / | ||
Aborted all 1hpv - received new work 1hsg.
|
||
ID: 4974 | Rating: 0 | rate: / | ||
[quote]You need to cancel all 1hpv tasks on the server.[/quote]
Agreed, a server-side abort would be the thing to do. Please. |
||
ID: 4976 | Rating: 0 | rate: / | ||
[quote]You need to cancel all 1hpv tasks on the server.[/quote]
1hsg were tested and they don't have the problem that 1hpv had. All 1hpv were canceled on the server. 1hbv (note the b instead of the p) are still around, but they are running fine. |
||
ID: 4977 | Rating: 0 | rate: / | ||
|
||
ID: 4978 | Rating: 0 | rate: / | ||
[quote]1hsg were tested and they don't have the problem of 1hpv. All 1hpv were canceled from the server, 1hbv (note the b instead of the p) are still around but they are running fine[/quote]
|
||
ID: 4979 | Rating: 0 | rate: / | ||
My current 1hsg WU, this one, has been crunching about an hour and says it is 30.700% done.
|
||
ID: 4980 | Rating: 0 | rate: / | ||
It finished without issue. A later wu also 1hsg sat at 1.000% for around 5 minutes, then continued. I say "Ni".
|
||
ID: 4981 | Rating: 0 | rate: / | ||
I've now had 2 of 1hsg units reach 43% and 45% and quit showing any progress for hours. I've also had 1 of the 1hbv die as well.
|
||
ID: 4982 | Rating: 0 | rate: / | ||
[quote]I've now had 2 of 1hsg units reach 43% and 45% and quit showing any progress for hours. I've also had 1 of the 1hbv die as well.[/quote]
I just received an email from a friend who also does D@H. He told me his 1hsg units also stopped showing progress for several hours, then kicked back in to normal speed and finished. So maybe I'm just not being patient enough? I had aborted the 1hbv one and maybe that was premature.
In the last 4 hours and 22 minutes one of my current 1hsgs hasn't moved at all and one has shown a 0.055 increase in % of work done. Maybe these WUs have something special about them that makes them process slowly in the middle? Anyone else still having trouble with them or is it just me? |
||
ID: 4983 | Rating: 0 | rate: / | ||
After several hours of very little noticeable progress, the 1hsg units I mentioned before have slowly started to show progress again. So I guess they are indeed running, just very slowly, with a long period of not showing progress. I hope they don't error out at the end. I can normally finish a WU in about 8 hours, but these are going to take around 14-16 hours if the current estimates are correct.
|
||
ID: 4984 | Rating: 0 | rate: / | ||
[quote]After several hours of very little noticeable progress the 1hsg units I mentioned before have slowly started to show progress again. So I guess they are indeed running, just very slowly. Just a long period of not showing progress. I hope they don't error out at the end. I can normally finish a WU in about 8 hours but these are going to be around 14-16 hours before they are done if the current estimates are correct.[/quote]
The docking method we are currently using takes more time. We tried to reduce the overall time by reducing the number of docking attempts per job. Also, some jobs may start with ligand conformations that do not really make sense (from the science point of view), so the job may seem stuck when it is actually searching for good ligand conformations.
We are monitoring the situation and at the same time looking for possible changes in the docking method that can reduce docking delays. We will post more about the results we are collecting for 1hsg (and their accuracy) in the next 24 hours.
The calibration and execution of docking simulations certainly becomes very challenging as the docking methods become more accurate. Hopefully, using our simulations, we will be able to help scientists identify criteria for deciding when sophisticated docking methods (like the one we are currently using) are really needed, and when they are not needed because simpler methods (like the one we used in the past) are still accurate (i.e., provide scientists with results that are meaningful).
Thanks for the patience and commitment.
Michela
____________
If you are interested in working on Docking@Home in a great group at UDel, contact me at 'taufer at acm dot org'! |
||
ID: 4985 | Rating: 0 | rate: / | ||
1hsg are longer than previous workunits (same number of conformations and rotations, but they take longer). Most of them take from 3.4 to 5.8 hours.
|
||
ID: 4986 | Rating: 0 | rate: / | ||
1hsg are longer than previous workunits (same number of conformations and rotations, but they take longer). Most of them take from 3.4 to 5.8 hours.
|
||
ID: 4987 | Rating: 0 | rate: / | ||
Yes my 1hsgs did complete once they got over the hump of showing no progress for a while. I've finished both those I was concerned about. Just needed to be patient and wait a little longer. It takes my old Pentium 4 3GHz a good deal more time to complete than you mention so maybe it feels the increase a bit more than the more current CPUs.
|
||
ID: 4989 | Rating: 0 | rate: / | ||
[quote]Yes my 1hsgs did complete once they got over the hump of showing no progress for a while. I've finished both those I was concerned about. Just needed to be patient and wait a little longer. It takes my old Pentium 4 3GHz a good deal more time to complete than you mention so maybe it feels the increase a bit more than the more current CPUs.[/quote]
Your logic seems right. For example, I'm running a Core 2 Duo 3.0 GHz with 4 GB of RAM. The newest complex, the 1htf, is taking me about 3 hours to complete. Hang in there on those long tasks. :o) |
||
ID: 4990 | Rating: 0 | rate: / | ||
I often saw something similar on another BOINC project where the part of the programs for actually doing the work was working properly, but in a way that made reporting how much work was already done so difficult that there were problems with measuring progress. Could the problems be related to that? |
||
ID: 4994 | Rating: 0 | rate: / | ||
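One way a progress bar can sit still while work continues, as suggested in the post above, is when progress is only updated at the end of each unit of work. The sketch below illustrates that general effect under stated assumptions. It is not the Docking@Home or CHARMM implementation, and `progress_timeline`, its parameters and the example durations are all hypothetical.

```python
def progress_timeline(attempt_durations, sample_every=1.0):
    """Sample (time, fraction_done) when progress advances only at the
    end of each attempt (an assumed reporting scheme, for illustration).

    attempt_durations -- time taken by each attempt, in arbitrary units
    sample_every      -- interval between progress readings
    """
    if not attempt_durations:
        return []
    total = len(attempt_durations)
    # Cumulative finishing time of each attempt.
    end_times = []
    t = 0.0
    for duration in attempt_durations:
        t += duration
        end_times.append(t)
    samples = []
    now = 0.0
    while now <= end_times[-1]:
        finished = sum(1 for e in end_times if e <= now)
        samples.append((now, finished / total))
        now += sample_every
    return samples


# Three attempts; the middle one takes six times longer than the others.
for when, done in progress_timeline([1, 6, 1]):
    print(f"t={when:.0f}  progress={done:.1%}")
```

In this toy run the reported progress does not change while the long middle attempt is running, even though the task is busy the whole time.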
[quote]1hsg are longer than previous workunits (same number of conformations and rotations, but they take longer). Most of them take from 3.4 to 5.8 hours.[/quote]
Well, I just aborted 3 of this type (1hsg), as all had reached 1 hour 6 minutes or thereabouts but were not doing anything. They were running at high priority and time to completion was still going up (I think), but CPU time was not moving on any of the three. I suspended and resumed them, and they were still not moving, so I killed them.
There also seems to be a bug in the result tables: after a job is aborted and sent back, it shows that it ran for ZERO time. This was the case for all three of the ones I have just reported (which all ran for over an hour). I also noticed it on a previous faulty WU (one of those that were cancelled) that had run for over 51 HOURS and showed zero time as well. I am unsure how you can fault-find when this is happening.
____________ |
||
ID: 4996 | Rating: 0 | rate: / | ||
[quote]1hsg are longer than previous workunits (same number of conformations and rotations, but they take longer). Most of them take from 3.4 to 5.8 hours.[/quote]
My guess is that this is the BOINC policy for aborted WUs. Let me see if I can find where to change it, so that in the case of 1hpv people are not penalized. |
||
ID: 4997 | Rating: 0 | rate: / | ||
I got a "1hvk" task that had run for 104 hours and had 140 hours left before estimated finish, before I aborted it. I guess this is CPU time (electricity/money) down the drain :( |
||
ID: 5007 | Rating: 0 | rate: / | ||
Having same issue now with 1ohr_mod0014_45* WU's.
|
||
ID: 5041 | Rating: 0 | rate: / | ||
Hi Ron and Shauge,
|
||
ID: 5043 | Rating: 0 | rate: / | ||
Not sure where to post this, but I am running BOINC with SETI@HOME and DOCKING@HOME loaded and running. SETI seems to run like it should: the work unit in progress shows progress, elapsed time and time to completion, and it runs. DOCKING, however, shows no progress bar, and the elapsed time (14 hrs) is way past what it should have taken to complete the work unit. The time to completion is sitting at what it was when the work unit started, around 3:32. I even suspended work on SETI and Docking still isn't getting anywhere. I first noticed this when I saw a message that said my units were overdue and probably wouldn't be counted as completed. I have already done the normal uninstall and reinstall of BOINC, etc. What's wrong?
|
||
ID: 5567 | Rating: 0 | rate: / | ||
Just as Chamberlain reported, I have been running D@H along with 2 other projects. For the past 4 weeks, all docking work units are running at least 12-14 hours with no progress shown. I have had to abort every one of the units so other work would continue. World Community and Rosetta are functioning properly. Windows 7, x64 / BOINC 6.10.18 / Charmm 34a2 6.23 (most recent work unit).
|
||
ID: 5568 | Rating: 0 | rate: / | ||
As noted in my original post, on this particular project I still cannot complete work units or get any percent of completion shown as time elapses. Since I was running BOINC on Windows 7 x64, I decided to try adding Docking to my laptop running BOINC on XP x32. Surprisingly, the unit showed work, elapsed time, and percent completed right away. I would therefore surmise that Docking at Home is having issues running on the x64 version of BOINC, or on BOINC running on x64 Windows 7. I have to say that since no one has posted any response to my issues, I feel it necessary to drop Docking at Home and run those projects that work, and that I get support for.
|
||
ID: 5580 | Rating: 0 | rate: / | ||
I had 3 tasks that ran for 21h, 19h and 17h on a Q9400 @ 3.46 GHz.
|
||
ID: 5582 | Rating: 0 | rate: / | ||