====== Jay Lofstead's notes on his ORNL internship (Summer 2007) ======

I got some good insights from Scott Klasky and others while I was there. It is probably a good idea to share these with the others on this list, as they directly relate to our work. Scott can correct/amend these as he sees fit.

To start with, watching Scott talk to other non-computer scientists, I became confident that his ideas about what we need to do to have an impact in this area are really in touch with the scientists working there across multiple fields. For example, his seemingly lofty performance goals look way beyond what users need today. However, with the new machines on the near horizon and the way codes are trying to scale up, these numbers don't seem so lofty anymore. The other interesting bit is that if you tell the scientists you have a way for them to do IO much more cheaply, they get excited and then say they want to write a LOT more often. It then becomes a question of where you put all of that data and how you use it afterward (or during the run). With only 250 TB of disk space right now, some jobs can fill that in a hurry.

In general, he sees a few key things: 1) unless we can give them a compelling "cost" reason to use our stuff, they won't; 2) unless it is easier than HDF-5/netCDF/MPI-IO, they won't use it; 3) once you have them interested through the "cost" reductions, give them other things, like processing inline to storage and runtime selection of data storage paths, to seal the sale; and 4) try to stay integrated with the tools they already use (like Kepler for workflow and offline processing) unless you have a really compelling reason for them to want to change.

One of his mantras is to "do the math" to understand the per-node and aggregate numbers and how they compare to system capabilities. Once you have a good performance metric, other services can be offered to sweeten the deal.
Without the reductions, you can't get anyone interested, because with tools like Kepler they can already do pretty much everything they need today.

The others I talked to agreed that the external XML configuration file for the IO was a good idea, because it gave them a way to describe their data externally and helped keep their IO code simple and compact. This is good for us because we can use it downstream to configure some of our processing.

Another interesting insight was that single-node data operations are essentially free unless they are VERY expensive. Operations that require communication have a cost and are candidates for in-stream processing. In-stream processing will only be used if you can do something compelling beyond the disk file-based Kepler solution they use today. For example, custom lossy data compression would have to be done in stream at some point, or they would just use a Kepler workflow to do it. Placing it on the compute node may be necessary, but it is not sufficient. Splitting the work between the compute node and the stream is probably ideal, since you can do local compression and then work it out over the aggregate of the whole dataset. Some of these operations are still not cost effective. In short, finding things that are compelling for people to do in stream is hard at best. Unless it is an FFT-level computation, or you can take Fortran code and move it, or you can do some fancy data analysis requiring communication offline, it isn't going to be seen as useful or as a valuable contribution.

Also related to this is the Spider storage system that is going in. ORNL is moving to a shared Lustre system with 100 GB/s of bandwidth to all of the major compute machines. This will make some of our proposed savings even harder to sell unless we can demonstrate savings over this data rate. While 100 GB/s is the ideal-scenario bandwidth, on the current system they are getting better than 40 GB/s out of what I think is a listed 45 GB/s.
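
To make the "do the math" mantra concrete, here is a minimal back-of-the-envelope sketch in Python. The 250 TB capacity and the 40 GB/s and 100 GB/s bandwidths are the figures quoted in these notes; the node count and the per-node output size are hypothetical placeholders chosen only for illustration, and the function names are made up for this sketch.

```python
# Back-of-the-envelope IO arithmetic ("do the math").
# Capacity and bandwidth figures come from the notes; the node count and
# per-node output size are HYPOTHETICAL placeholders.

TB = 1e12  # decimal terabyte, in bytes
GB = 1e9   # decimal gigabyte, in bytes


def per_node_bandwidth(aggregate_bytes_per_s, nodes):
    """Fair share of the aggregate file system bandwidth per writing node."""
    return aggregate_bytes_per_s / nodes


def write_time_s(bytes_per_node, nodes, aggregate_bytes_per_s):
    """Seconds for all nodes to write one output step at the aggregate rate."""
    return bytes_per_node * nodes / aggregate_bytes_per_s


def seconds_to_fill(capacity_bytes, aggregate_bytes_per_s):
    """Seconds of sustained writing needed to fill the given disk capacity."""
    return capacity_bytes / aggregate_bytes_per_s


if __name__ == "__main__":
    capacity = 250 * TB        # from the notes
    current_rate = 40 * GB     # observed rate on the current system (notes)
    spider_rate = 100 * GB     # ideal Spider rate (notes)
    nodes = 10_000             # hypothetical job size
    step_bytes = 100e6         # hypothetical 100 MB written per node per step

    for label, rate in (("current", current_rate), ("Spider", spider_rate)):
        print(label)
        print("  per-node share (MB/s):", per_node_bandwidth(rate, nodes) / 1e6)
        print("  one output step (s):  ", write_time_s(step_bytes, nodes, rate))
        print("  fill 250 TB (hours):  ", seconds_to_fill(capacity, rate) / 3600)
```

Even at the observed 40 GB/s, sustained writing would fill the whole 250 TB in under two hours, which is why cheaper IO immediately turns into a question of where the data goes.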
I went to a scaling seminar they held over 3 days and got a bunch of good insights from that. CFS (the Lustre people) were there on another occasion for a 1-day presentation about how things work now (e.g., ext3 for file storage and a single metadata server) and the changes coming in the next version (e.g., ZFS for file storage and multiple metadata servers). Both gave lots of good hints about what the problems are and how people are working around them (or not) today.

Information was also presented about the new machines and architectures being brought in. The Baker-class machine with the Gemini (instead of SeaStar) network will not support Portals anymore. It will have a custom programming interface available at some point closer to the release. It is being built in such a way that small RDMA transfers are MUCH cheaper than they are today.

One of the more interesting topics was how they worked around the metadata server limitations by staggering IO start times and by choosing precise parameters for the various file operation calls to reduce the impact of the metadata bottleneck.

As of January, they will have moved to Compute Node Linux for all of the big machines. This is currently a slow process of testing, and it is showing promise of working as well as hoped.
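
The idea of staggering IO start times can be sketched as follows. This is only an illustration of the general technique, not the scheme presented at the seminar: the notes do not record the actual parameters, so the wave size, the interval, and the function names below are all hypothetical. Each rank delays its file open by a multiple of a fixed interval, so that a single metadata server sees only one "wave" of opens at a time.

```python
# Sketch of staggered IO start times (illustrative only).
# ranks_per_wave and wave_interval_s are HYPOTHETICAL tuning knobs; the
# notes do not say which values were actually used.

from collections import defaultdict


def stagger_delay_s(rank, ranks_per_wave, wave_interval_s):
    """Delay (seconds) before this rank opens its file.

    Ranks are grouped into consecutive waves of ranks_per_wave; wave k
    waits k * wave_interval_s, so at most ranks_per_wave opens reach the
    metadata server in any one interval.
    """
    if ranks_per_wave <= 0:
        raise ValueError("ranks_per_wave must be positive")
    return (rank // ranks_per_wave) * wave_interval_s


def build_schedule(num_ranks, ranks_per_wave, wave_interval_s):
    """Map each start delay to the list of ranks that open at that time."""
    schedule = defaultdict(list)
    for rank in range(num_ranks):
        delay = stagger_delay_s(rank, ranks_per_wave, wave_interval_s)
        schedule[delay].append(rank)
    return dict(schedule)


if __name__ == "__main__":
    # Ten ranks, four opens per wave, half a second between waves.
    for delay, ranks in sorted(build_schedule(10, 4, 0.5).items()):
        print(f"t+{delay:.1f}s: ranks {ranks}")
```

In a real code each rank would sleep for its computed delay before issuing the open call; the trade-off is a longer total IO phase in exchange for fewer simultaneous metadata requests.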