Restart an AWS PAT?
Hello,
My connection to AWS seems to time out. I've been trying to do some fairly large Design of Experiments-based algorithmic runs, in some cases where I need them to run overnight. Twice now PAT has stopped midstream. The connection to AWS remains open, but I am no longer getting progress on models. I'm using PAT version 2.7.0 and AMI 2.7.1. In this case it hung at around 22% complete of a 10,240-case simulation. Thank you for any insight.
Regards.
Is it just PAT that is unresponsive, or is the server on AWS also unresponsive when you open it in a web browser?
What's your server instance type? How many datapoints are you running, and how big is each datapoint file?
The server instance was m3.2xlarge, with 10,240 datapoints, and each run is approx. 50-100 MB.
The server seems alive; from the EC2 monitoring I can see the CPU activity is at rest. It seems like PAT just stops sending more run instructions. The Resque monitoring shows no activity.
Also try '2.7.1-largescale1'. This AMI has some load-balancing changes to keep the server node from getting overloaded with worker processes, which can make it unresponsive.
Well, 10,240 datapoints at 50 MB each is 512 GB, so your instance probably ran out of disk space. You can verify that by using the server.pem key to ssh into the server node and running 'df -h'. The user name is 'ubuntu'.
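The arithmetic above can be sketched in a few lines of Python, as a rough planning aid rather than anything PAT itself provides. The function names are hypothetical, decimal units are assumed (1 GB = 1000 MB), and the disk check assumes the results land on the filesystem mounted at "/", which may not be true on your server node.

```python
import shutil

def estimate_output_gb(n_datapoints, mb_per_datapoint):
    """Total output size in GB, assuming decimal units (1 GB = 1000 MB)."""
    return n_datapoints * mb_per_datapoint / 1000

def datapoints_that_fit(free_gb, mb_per_datapoint):
    """Number of datapoints of the given size that fit in free_gb of disk."""
    return int(free_gb * 1000 // mb_per_datapoint)

# Sizes quoted in this thread: 10,240 datapoints at roughly 50-100 MB each.
low_gb = estimate_output_gb(10240, 50)
high_gb = estimate_output_gb(10240, 100)
print(f"Estimated output: {low_gb:.0f}-{high_gb:.0f} GB")

# Free space on the local filesystem (assumption: results are stored under "/").
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free space: {free_gb:.0f} GB, "
      f"room for about {datapoints_that_fit(free_gb, 100)} datapoints at 100 MB")
```

Run on the server node, this gives the same kind of answer as 'df -h', plus an estimate of how many more datapoints would fit before the disk fills.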
Thank you! Now I know. I was zeroing in on that as an answer. I must confess I don't actually manage my AWS connection; I piggyback on the rest of my group's account, where my usage doesn't normally show up. In this case, though, I was starting to dream bigger! I will work on some compression strategies. Thanks again for the help.
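One possible compression strategy, sketched here with Python's standard gzip module, is to compress each result file once its run finishes. This is only an illustration: `gzip_file` is a hypothetical helper, not part of PAT or the server AMI, and how much space it saves depends entirely on what the output files contain.

```python
import gzip
import shutil
from pathlib import Path

def gzip_file(src, remove_original=False):
    """Compress src to src + '.gz' and return the path of the new file.

    The original is kept unless remove_original is True.
    """
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream the data in chunks so large files are not loaded into memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if remove_original:
        src.unlink()
    return dst

# Example (hypothetical path): gzip_file("results/datapoint_0001.out", remove_original=True)
```

Deleting the original only after the compressed copy has been written means an interrupted run leaves the uncompressed file in place.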
Anytime. The 2.7.1-largescale1 AMI has been pretty stable for large simulations. I also use the c3 or d2 8xlarge instances for larger runs; those have 320 and 2000 GB storage. We are looking forward to adding the more recent instance types to the available pool in the next few months... hopefully.