[Aces-support] PBS Errors
Ryan Abernathey
rpa at MIT.EDU
Fri Nov 2 17:02:38 EDT 2007
I'm getting this type of message in my PBS stdout log:
p0_15212: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
MPI failure at Fri Nov 2 16:19:39 EDT 2007: not advancing to next
iteration
(It's happening during an mpirun call of MITgcm.)
Here is all the relevant information I can think of:
host system: geo
nodes: aE34-500-023 aE34-500-024 aE34-500-025 aE34-500-026
aE34-500-027 aE34-500-028 aE34-500-029 aE34-500-030 aE34-500-031
aE34-500-032 aE34-500-033 aE34-500-034 aE34-500-035 aE34-500-036
aE34-500-037 aE34-500-038
pbs jobscript: /net/ds-09/scratch-0/rpa/ACC/jobscripts/2D_solver_level.ACES
The full log is contained below. It should give you a better idea of
the exact times of the failures.
I experienced this error before. It was never resolved. I switched my
jobs to Columbia and it worked fine. Now we're out of time on
Columbia, so I'm back on ACES. Really hoping to run these jobs over
the weekend...
##########
Fixing velocity fields for timestep 480
p0_15630: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:24:45 EDT 2007: not advancing to next
iteration
##########
Fixing velocity fields for timestep 480
p0_16051: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:29:50 EDT 2007: not advancing to next
iteration
##########
Fixing velocity fields for timestep 480
p0_16499: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:35:03 EDT 2007: not advancing to next
iteration
##########
Fixing velocity fields for timestep 480
p0_16940: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:40:10 EDT 2007: not advancing to next
iteration
##########
Fixing velocity fields for timestep 480
p0_17361: p4_error: Timeout in making connection to remote process
on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:45:14 EDT 2007: not advancing to next
iteration
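
Every timeout quoted above names the same node, aE34-500-031. For a longer log, a tally of the hosts named in the p4_error lines might help show whether one node is responsible. The following is only a minimal sketch, not part of the job script: the function name count_timeout_hosts is hypothetical, and the pattern assumes the message always has the wording shown in this log, possibly wrapped across two lines.

```python
import re
from collections import Counter

# Assumption: the timeout message always reads as in the log above.
# \s+ lets the pattern match whether or not the line is wrapped before "on".
TIMEOUT_RE = re.compile(
    r"p4_error: Timeout in making connection to remote process\s+on\s+([\w.-]+):"
)


def count_timeout_hosts(log_text):
    """Return a Counter mapping host name to number of connection timeouts."""
    return Counter(TIMEOUT_RE.findall(log_text))


# Example usage on a log that has already been read into a string:
#   with open("pbs_stdout.log") as f:      # hypothetical file name
#       for host, n in count_timeout_hosts(f.read()).most_common():
#           print(host, n)
```

If a single host dominates the tally, excluding that node from the job's node list would be one thing to try.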