[Aces-support] PBS Errors

Ryan Abernathey rpa at MIT.EDU
Fri Nov 2 17:02:38 EDT 2007


I'm getting this type of message in my PBS stdout log:

p0_15212:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
MPI failure at Fri Nov 2 16:19:39 EDT 2007: not advancing to next iteration

(It's happening during an mpirun call of MITgcm.)

Here is all the relevant information I can think of:
host system: geo
nodes: aE34-500-023 aE34-500-024 aE34-500-025 aE34-500-026  
aE34-500-027 aE34-500-028 aE34-500-029 aE34-500-030 aE34-500-031  
aE34-500-032 aE34-500-033 aE34-500-034 aE34-500-035 aE34-500-036  
aE34-500-037 aE34-500-038
pbs jobscript: /net/ds-09/scratch-0/rpa/ACC/jobscripts/2D_solver_level.ACES

The full log is contained below. It should give you a better idea of  
the exact times of the failures.

I experienced this error before. It was never resolved. I switched my  
jobs to Columbia and it worked fine. Now we're out of time on  
Columbia, so I'm back on ACES. Really hoping to run these jobs over  
the weekend...

##########
Fixing velocity fields for timestep 480
p0_15630:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:24:45 EDT 2007: not advancing to next iteration
##########
Fixing velocity fields for timestep 480
p0_16051:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:29:50 EDT 2007: not advancing to next iteration
##########
Fixing velocity fields for timestep 480
p0_16499:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:35:03 EDT 2007: not advancing to next iteration
##########
Fixing velocity fields for timestep 480
p0_16940:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:40:10 EDT 2007: not advancing to next iteration
##########
Fixing velocity fields for timestep 480
p0_17361:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0
mpirun exit status: 1
MPI failure at Fri Nov 2 16:45:14 EDT 2007: not advancing to next iteration
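
Every timeout in the log above names the same node, aE34-500-031. For a longer log, that kind of pattern can be checked with a small script. The following is only a minimal Python sketch, not part of the job script: the function name count_timeout_hosts and the sample string are hypothetical, and the sample reuses two entries from the log above. It assumes the log is available as plain text and tolerates the host name having wrapped onto the next line.

```python
import re
from collections import Counter

# Matches the p4_error timeout lines from the log and captures the host name.
# \s+ allows the "on <host>" part to sit on the same line or on the next one.
TIMEOUT_RE = re.compile(
    r"p4_error: Timeout in making connection to remote process\s+on\s+(\S+):"
)


def count_timeout_hosts(log_text):
    """Return a Counter mapping host name -> number of timeout errors."""
    return Counter(TIMEOUT_RE.findall(log_text))


# Two entries copied from the log above: one wrapped, one on a single line.
sample = (
    "Fixing velocity fields for timestep 480\n"
    "p0_15630:  p4_error: Timeout in making connection to remote process\n"
    "on aE34-500-031: 0\n"
    "mpirun exit status: 1\n"
    "Fixing velocity fields for timestep 480\n"
    "p0_16051:  p4_error: Timeout in making connection to remote process on aE34-500-031: 0\n"
    "mpirun exit status: 1\n"
)

# Print each host with its timeout count, most frequent first.
for host, count in count_timeout_hosts(sample).most_common():
    print(host, count)
```

A single host dominating the tally would suggest checking that node first, though the log alone does not say why the connection to it times out.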



