> Pls. add abstract and headers.
Thanks for the suggestion. We have added keywords and a full structure, with an Abstract and the standard
sections: Introduction, Methods, Conclusions, Software Availability and Acknowledgements.
> Explain more, what the outcome was of former evaluations of this code with other software (which
> commercial SW? which open stack implementation?).
No prior evaluation of the software exists, except for the published paper, Alvioli and Baum (2016),
partially reported here. The performance of the previous serial code (Baum et al., 2008) was never
benchmarked, and its running speed can be assumed to be identical to that of the current serial version.
In fact, the minor improvements to the serial code mentioned in the text were only introduced to make
the parallelization with MPI libraries practical. The performance gain due to these improvements is
negligible; the only substantial gain is attainable using the parallel version of the code. Moreover,
we do not compare the performance of our code with any other commercial or open-source software. The
only purpose of our performance assessment is to measure the gain due to the parallelization of TRIGRS,
with respect to the serial version. Please note that the serial version is also bundled with the latest
version of the code.
> Explain more that you are comparing code from 2008 with your enhancement in a first section.
For the reasons above, we do not believe this point deserves a dedicated section. We have added
a sentence clarifying this point in the Methods section of the revised version.
> In the second section you are comparing your code in a cloud setup: Explain that before hand.
In the revised version, we have added this information to the Abstract.
> Discuss the results also with quantitative numbers.
We believe the results are given in the only possible way, which is indeed quantitative.
One can see, for example, that Fig. 1 reports the results of the performance assessment
as total running time (in the range from 10^2 to 10^5 seconds in Figs. 1A and 1D), speedup
(in the range from 1 to 30 in Fig. 1B and from 1 to 140 in Fig. 1E, right) and efficiency (in
the range from 0.3 to 1.0 in Fig. 1C and from 0.3 to 1.1 in Fig. 1F). All other figures
explicitly show the numbers involved. The only quantitative number that one might ask for is
the number of cells in the study area, which was reported in Alvioli and Baum (2016) but was
indeed missing here. We have added this number (N = 13,410,000 valid cells) to the caption
of Fig. 3.
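For clarity, the speedup and efficiency values plotted in Fig. 1 follow the standard definitions
used in parallel performance analysis (a reference formulation; the paper's own notation may differ):

```latex
% Standard metrics, with T(1) the serial running time and T(n) the
% running time on n processes:
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
% E(n) close to 1 indicates near-ideal scaling; values slightly above 1
% (as in Fig. 1F) correspond to superlinear speedup, e.g. from cache effects.
```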
> Content/Approach/Relevance: Justify why an own library and why Fortran was used a) instead of
> existing libs (like OpenMP) and b) of more "modern" languages than Fortran.
This is exactly what was done here, i.e. using the general-purpose Message Passing
Interface (MPI) library, which serves a purpose similar to OpenMP. The usage of OpenMP and MPI is,
however, very different: while parallelization with OpenMP can be easier to implement in some cases,
MPI is more powerful since it does not require the multiple instances of the code to run on
shared-memory machines, allowing MPI-parallel codes to run on any multi-node machine. Moreover,
it is well known to the high-performance computing community that FORTRAN has been, for many years now,
as "modern" as C or similar programming languages, at least as far as code speed is concerned.
As a matter of fact, both OpenMP and MPI fully support C and FORTRAN, the latter arguably being
the most widely used programming language for scientific high-performance computing.
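To illustrate the approach, a minimal MPI-parallel Fortran sketch distributing independent grid
cells across processes is given below. This is an illustration only, not the actual TRIGRS domain
decomposition, which is described in Alvioli and Baum (2016); the contiguous block assignment and
the placeholder loop body are assumptions made for the example:

```fortran
program mpi_cells_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: i, istart, iend, chunk
  integer, parameter :: ncells = 13410000   ! valid cells in the study area
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  ! assign each rank a contiguous block of cells (hypothetical scheme);
  ! the last rank picks up any remainder
  chunk  = ncells / nprocs
  istart = rank * chunk + 1
  iend   = merge(ncells, istart + chunk - 1, rank == nprocs - 1)
  do i = istart, iend
     ! cell-wise infiltration/stability computation goes here;
     ! cells are independent, so no communication is needed in the loop
  end do
  call MPI_Finalize(ierr)
end program mpi_cells_sketch
```

Because each cell is processed independently, such a scheme scales across any number of nodes
without shared memory, which is the key practical difference from an OpenMP-only parallelization.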
> What's the advantage of this multi-node/core approach versus co-processors/GPU?
GPU parallelization is not discussed here for a number of reasons. The first is that, since
TRIGRS is an existing code, its parallelization within any of the GPU programming schemes
is virtually impossible, due to the complexity of the operations required to adapt the
model-specific computation to the massively parallel, but rather rigid, numerical processing
offered by GPUs. The second reason is that a GPU-enabled code would most probably require extra
effort from end users, while keeping the code simple was one of the prerequisites of this work.
We have added a sentence in the Conclusions briefly illustrating this point.
> Justify the use of virtual machines in the second part: How did you exclude other unwanted
> processes and side effects from virtual machines?
We did not. The cloud approach was introduced in this paper to perform a blind
comparison with dedicated hardware, with the ultimate goal of evaluating the
cost/benefit of private cloud usage, meant as freely accessible virtual machines, for our use
case. At this stage of our work, we deliberately chose not to exclude possible
concurrent processes. The rationale of our strategy is the following: excluding concurrent
processes in a cloud paradigm requires either the dedicated setup of a privately owned cloud or
the usage of a public infrastructure, possibly at extra charge. We have clarified this point
in the Conclusions, arguing that such straightforward use of virtual machines, which is what
users get when they gain access to research or commercial clouds, may be highly inefficient,
at least for our code. Additional tuning of the cloud software and hardware might be required
to actually obtain a substantial performance gain.