/****************************************************************************
**  SCALASCA    http://www.scalasca.org/                                   **
**  KOJAK       http://www.fz-juelich.de/jsc/kojak/                        **
*****************************************************************************
**  Copyright (c) 1998-2010                                                **
**  Forschungszentrum Juelich, Juelich Supercomputing Centre               **
**                                                                         **
**  Copyright (c) 2003-2008                                                **
**  University of Tennessee, Innovative Computing Laboratory               **
**                                                                         **
**  See the file COPYRIGHT in the package base directory for details       **
****************************************************************************/

                        SCALASCA v1.3 OPEN ISSUES
                        =========================

                                                           Status: Apr 2010

This file lists known limitations and unimplemented features of various
SCALASCA and KOJAK components.

--------------------------------------------------------------------------------

* Platform support

  - SCALASCA has been tested on the following platforms:
    + IBM Blue Gene/P
    + IBM SP & BladeCenter clusters
    + Cray XT5
    + SGI Altix
    + NEC SX-8
    + SiCortex systems
    + various Linux/Intel (x86/x64) clusters
    The supplied Makefile definition files may provide a good basis for
    building and testing the toolset on other systems.

  - The following platforms have not been tested recently:
    + IBM Blue Gene/L
    + Cray XT3/4
    + Sun Solaris/SPARC-based clusters
    + other NEC SX systems
    However, the supplied Makefile definition files might still work on
    these systems.

  - Automatic hardware topology recording is currently only implemented
    for IBM Blue Gene and Cray XT systems.

  - Each toolset installation can only support one MPI implementation
    (because MPI is source-code compatible but not binary compatible).
    If your system supports more than one MPI implementation (as Linux
    clusters often do), a separate build has to be installed for each
    MPI implementation.

  - The same is true if your system features more than one compiler
    supporting automatic function instrumentation (see also the next
    section).

  - When using IBM XL Fortran compilers (on AIX or Linux PPC): as the IBM XL
    Fortran compilers encode subroutine names in lower case without
    additional underscores, SCALASCA/KOJAK measurement (which is implemented
    in C) of Fortran applications will fail if the application uses Fortran
    subroutine names identical to those of common C standard routines
    (e.g., open, close, fopen, fclose, rename).

  - On platforms where we need to generate wrappers for MPI Fortran routines,
    such as IBM POE, Cray MPI and Intel MPI, the wrappers may result in value 
    corruption when using the constants MPI_BOTTOM (with types), MPI_IN_PLACE
    (with collectives), and MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE (with
    tests of non-blocking communications).  Furthermore, LOGICAL parameters
    may not be handled correctly.

--------------------------------------------------------------------------------

* Automatic instrumentation via "skin" or "kinst" based on (often undocumented)
  compiler switches

  - GNU  : tested with GCC 3 and higher
  - PGI  : on Cray XT, cannot handle compiling and linking in one step
  - Sun  : only works for Fortran (not C/C++)
  - IBM  : only works for xlc/xlC version 7.0 and xlf version 9.1 and higher
           and corresponding bgxl compilers on BlueGene systems
  - Intel: only works with Intel icc/ifort version 10 and higher compilers
  - PathScale: works as for GCC 4

  Support for Intel 10+ and PGI 8+ compilers is based on (older) vendor-specific
  interfaces, which are configured by default.  These newer compilers also support
  the GNU instrumentation interface which can be configured manually by copying the
  compiler interface configuration section of mf/Makefile.defs.linux-gnu into the
  generated Makefile.defs.  (Intel 9 compilers may be able to use the Intel interface,
  but would be restricted to a single compilation unit.)

  Measurement filtering can only be applied to functions instrumented by the
  IBM, GNU/Pathscale, Intel and PGI compilers.  (Filtering of MPI functions, OpenMP
  and user-instrumented regions is always ineffective.)

  Function instrumentation based on using the GNU interface has the limitation
  that instrumented functions in dynamically loaded (shared) libraries are not
  measured (i.e., implicitly filtered).  When using the Intel interface,
  instrumented functions that are in dynamically loaded (shared) libraries are
  measured, however, they cannot be filtered (i.e., filters for them are ineffective).

  Because not all compilers support function instrumentation, and most of them
  have various limitations, an alternative is to use "skin -pomp" together with
  "POMP directives" for function instrumentation (see the Scalasca User Guide,
  instrumentation section), which works portably on all supported platforms.

  "skin -pomp" (or "skin -comp=none") can also be used when automatic function
  instrumentation is not desired, such as for a measurement of only MPI functions.
  (In this case, only the final link command needs to be prefixed.)
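  As a sketch (compiler and file names are placeholders), an MPI-only
  measurement build might look like this:

```shell
# Hypothetical sketch: measure only MPI functions without automatic
# function instrumentation -- only the final link step is prefixed.
mpicc -c foo.c
mpicc -c bar.c
skin -comp=none mpicc -o app foo.o bar.o
```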

--------------------------------------------------------------------------------

* The instrumenter utilities ("skin", "kinst" & "kinst-pomp") attempt to
  determine the appropriate EPIK measurement library to link by parsing the
  compilation/link commands.

  If the compiler/linker is not recognised as an MPI compiler front-end
  and no MPI library is explicitly linked, an EPIK measurement library
  without MPI support is used, and measurement will appear to consist of
  independent (non-MPI) processes.  A workaround in this case is to
  explicitly specify a redundant "-lmpi" or "-lmpich" when linking
  (which needs to be specified in such a location on the link line that
  it comes before a second one implicitly inserted by the linker).
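  A sketch of this workaround (compiler, object and library names are
  placeholders; whether "-lmpi" or "-lmpich" is appropriate depends on the
  MPI installation):

```shell
# Hypothetical sketch: a generic compiler front-end is not recognised as
# MPI-aware, so name the MPI library explicitly on the link line, early
# enough to precede the copy the linker inserts implicitly.
skin cc -o app app.o -lmpi
```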

  This workaround is not required on Cray XT systems, as these are explicitly
  configured to *always* link an EPIK measurement library with MPI support.

  During installation, a default build mode (32- or 64-bit) is determined,
  and this is then implied when a build mode is not explicitly specified
  during compilation and linking.  On systems with different defaults for
  MPI and non-MPI builds, the build mode determined by the instrumenter
  may be wrong and linking will fail with incompatible object formats.
  A workaround in this case is to explicitly specify the build mode when
  linking to ensure that the correct measurement library version is chosen.
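  For example (a sketch assuming IBM XL compilers, where "-q64" selects the
  64-bit build mode; compiler and flag are placeholders for your system):

```shell
# Hypothetical sketch: state the build mode explicitly at link time so
# the instrumenter selects the matching EPIK measurement library.
skin mpcc_r -q64 -o app app.o
```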

--------------------------------------------------------------------------------

* SCAN collection & analysis launcher

  This utility attempts to parse MPI launcher commands to be able to launch
  measurement collection along with subsequent trace analysis when appropriate.
  It also attempts to determine whether measurement and analysis are likely
  to be blocked by various configuration issues, before performing the actual
  launch(es).  Such basic checks might be invalid in certain circumstances,
  and inhibit legitimate measurement and analysis launches.

  While it has been tested with a selection of MPI launchers (on different
  systems, interactively and via batch systems), it is not possible to test
  all versions, combinations and configuration/launch arguments, and if the
  current support is inadequate for a particular setup, details should be
  sent to the developers for investigation.  In general, launcher flags that
  require one or more arguments can be ignored by SCAN if they are quoted,
  e.g., $MPIEXEC -np 32 "-ignore arg1 arg2" target arglist
  would ignore the "-ignore arg1 arg2" flag and arguments.
  
  Although SCAN parses launcher arguments from the given command-line (and
  in certain cases also launcher environment variables), it does not parse
  launcher configurations from command-files (regardless of whether they are
  specified on the command-line or otherwise).  Since the part of the
  launcher configuration specified in this way is ignored by SCAN, but will
  be used for the measurement and analysis steps launched, this may lead to
  undesirable discrepancies.  If command-files are used for launcher
  configuration, it may therefore be necessary or desirable to repeat some of
  their specifications on the command-line to make them visible to SCAN.

  SCAN uses getopt_long_only (typically from "liberty") to parse launcher
  options.  Older versions seem to have a bug that fails to stop parsing when
  the first non-option (typically the target executable) is encountered: a
  workaround in such cases is to insert "-- " in the command line before the
  target executable, e.g., scan -t mpirun -np 4 -- target.exe arglist.

  If an MPI launcher is used that is not recognised by SCAN, such as one
  that has been locally customized, it can be specified via an environment
  variable, e.g., SCAN_MPI_LAUNCHER=mympirun, to have SCAN accept it.
  Warning: In such a case, SCAN's parsing of the launcher's arguments may fail.
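  A sketch of this, using the hypothetical launcher name from above:

```shell
# Hypothetical sketch: declare a locally customized MPI launcher to SCAN
# ("mympirun" and the argument list are placeholders).
SCAN_MPI_LAUNCHER=mympirun scan -t mympirun -np 4 target.exe arglist
```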

  Some MPI launchers result in some or all program output being buffered
  until execution terminates.  In such cases, SCAN_MPI_REDIRECT can be set
  to redirect program standard and error output to separate files in the
  experiment archive.

  If necessary, or preferred, measurement and analysis launches can be performed
  without using SCAN, resulting in "default" measurement collection or explicit
  trace analysis (based on the effective EPIK configuration).

  Automatic measurement and analysis configuration through SCAN via environment
  variables does not yet work if the Cobalt launcher is used. As a workaround,
  the configuration currently needs to be manually specified using an EPIK.CONF
  configuration file.
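  A minimal EPIK.CONF for such a case might look like this (a sketch; all
  values are placeholders, and only variables relevant to your measurement
  need to appear):

```shell
# Hypothetical EPIK.CONF fragment, placed in the working directory
EPK_TITLE=myrun
EPK_GDIR=/scratch/myuser/experiments
ELG_BUFFER_SIZE=10000000
```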

--------------------------------------------------------------------------------

* EPIK measurement system

  - The EPIK runtime measurement system produces experiment archives that
    can only be analyzed with the SCOUT parallel analyzer (by default).
    For sequential analysis with EXPERT, the generated per-process traces
    have to be merged first using the "elg_merge" utility.

  - The EPK_GDIR configuration variable specifies the directory containing
    the EPIK measurement archive (epik_<EPK_TITLE>).  An additional variable,
    EPK_LDIR, allows a temporary location to be used as intermediate storage,
    before the data is finally archived in EPK_GDIR.  Generally, the file I/O
    overhead of transferring data from the intermediate storage location
    is best avoided by leaving EPK_LDIR set to the same location as EPK_GDIR,
    so that files are written directly into the experiment archive.

  - The buffers used by EPIK for definitions records are sized according to
    the configuration variable ESD_BUFFER_SIZE.  If any of these buffers fill
    during measurement, the resulting experiment archive will not be
    analyzable: in such cases, it will be necessary to repeat measurement
    having configured a larger ESD_BUFFER_SIZE as indicated by the associated
    EPIK message during measurement finalization.  (In some cases, even this
    larger ESD_BUFFER_SIZE may need to be increased a second time.)
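    As a sketch, the repeated measurement could be configured via the
    environment (launcher, process count and buffer size are placeholders;
    use the size suggested by the EPIK finalization message):

```shell
# Hypothetical sketch: re-run the measurement with an enlarged
# definitions buffer, as reported at measurement finalization.
ESD_BUFFER_SIZE=4000000 scan mpirun -np 32 ./app
```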

  - The storage capacity for call-path tracking and associated summary
    measurement data is controlled via the configuration variable ESD_PATHS.
    If additional call-paths are encountered during measurement, these are
    distinguished with an "unknown" marker, however, the precise number of such
    call-paths cannot be determined.  If unknown call-paths are reported at
    measurement finalization or appear in the resulting analysis report,
    it is advisable to increase (e.g., double) ESD_PATHS and re-measure.
    After a successful measurement (with no unknown call-paths), ESD_PATHS
    can be reduced to the actual number of unique call-paths to reduce memory
    requirements for subsequent measurements.
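    A sketch of the doubling step (launcher, process count and the ESD_PATHS
    value are placeholders):

```shell
# Hypothetical sketch: unknown call-paths were reported, so double
# ESD_PATHS and repeat the measurement.
ESD_PATHS=8192 scan mpirun -np 32 ./app
```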

  - Note that function filtering (where supported) or selective function
    instrumentation often significantly reduces the number of unique call-paths.
    It is therefore often advisable to examine measurement reports containing
    unknown call-paths for undesirable functions.  Highly-recursive functions are
    particularly undesirable since they result in significant measurement bloat.

  - If call-path tracking inconsistencies are reported during measurement,
    these may need careful examination by one of the toolset developers,
    as they can indicate problems with the compiler-generated instrumentation.
    On the other hand, applications which abort or explicitly exit prematurely
    (without returning from "main") will also result in measurement warnings
    which can often be ignored.

  - No measurement is possible for MPI applications which abort or otherwise
    fail to call MPI_Finalize on all processes.

  - Measurement of MPI application processes which do not call MPI_Init
    (or MPI_Init_thread) or where the EPIK library linkage has not allowed
    interposition on MPI calls will abort with the message
    "MPI_Init interposition unsuccessful!"

  - Newer MPICH1 versions (and derivatives) provide MPI_Init_thread, but no
    other functions from MPI-2, and instrumented applications using it will
    abort with "MPI_Init interposition unsuccessful!" since no MPI adapter
    wrapper is built for it.  As a workaround, the wrapper for MPI_Init_thread
    provided in epik/adapter/MPI/epk_mpiwrap.c (which is only automatically
    enabled for MPI-2) can be explicitly enabled.

  - The C++ compiler on Cray XT systems sometimes forks additional processes
    resulting in the above abort messages, which can be ignored.

  - The EPIK MPI adapter can only handle a fixed, compile-time number of
    communicators, windows and window accesses (set to 50 in the sources),
    and measurement will be aborted if this limit is exceeded.

  - The EPIK MPI adapter does not support the MPI C++ language bindings
    (deprecated since MPI 2.2) directly. Measurement of applications using
    the MPI C++ bindings will only work if the MPI library implements the
    C++ bindings as a lightweight wrapper on top of the MPI C bindings.

  - PAPI native counter names which include colon characters need to be
    escaped ("\:") to distinguish them from the colons used to separate
    counter names in EPIK metric specifications.
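    For example (a sketch; the native counter name "CPU:cycles" and the
    launcher arguments are placeholders):

```shell
# Hypothetical sketch: the second counter name contains a colon, which
# must be escaped; unescaped colons separate the metric names.
EPK_METRICS='PAPI_TOT_CYC:CPU\:cycles' scan mpirun -np 32 ./app
```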

  - EPIK and EPILOG library interfaces and file/directory formats
    are *UNSTABLE* and very likely to change.

--------------------------------------------------------------------------------

* EPILOG trace library and tools

  - The EPILOG trace tools (e.g., elg_print, elg_stat, elg_timecorrect,
    elg_merge, etc.) don't yet support the EPIK experiment archives.
    KOJAK components (i.e., EARL and EXPERT) also lack this support.

  - Care is required when selecting an appropriate buffer size for tracing.
    The default trace buffer size (ELG_BUFFER_SIZE) for each process is
    rather small and typically only adequate for very short traces. It is
    therefore recommended to set the trace buffer size as large as available
    memory permits: if too large a size is specified, the application may
    be unable to run or may fail to acquire the memory.
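    A sketch of a tracing run with an enlarged buffer (the 100 MB value,
    launcher and process count are placeholders; adjust to the memory
    actually available per process on the compute nodes):

```shell
# Hypothetical sketch: 100 MB trace buffer per process for a tracing
# ("-t") measurement.
ELG_BUFFER_SIZE=100000000 scan -t mpirun -np 32 ./app
```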

  - Whenever a process fills its trace buffer, the buffer contents are
    automatically flushed to file and the buffer emptied to allow tracing to
    continue.  While this flushing is marked as Overhead in subsequent
    analysis, generally all other processes will block at their next
    synchronisation and this is not distinguished in analysis.  Furthermore,
    processes typically don't all fill their trace buffers simultaneously,
    but their behaviour is often sufficiently similar that immediately
    following one flush, a chain reaction of (sequential) flushes occurs as
    each process in turn fills, flushes and empties its trace buffer, yet
    must subsequently block on a synchronisation with a process that is
    flushing.  This exponential perturbation typically compromises all
    timings in the resulting measurement/analysis, though it may still help
    to identify excessively visited callpaths and/or a more appropriate
    buffer size for subsequent instrumentation/measurement configuration. 

  - elg_timecorrect attempts to correct logical time inconsistencies in
    EPILOG trace files, however, it currently only recognizes point-to-point
    communication.  Traces containing collective communication and/or
    (OpenMP) multithreading may therefore be corrupted by elg_timecorrect.
    Measurements made on systems with accurately synchronised high resolution
    timers should not need post-measurement time correction.

--------------------------------------------------------------------------------

* SCOUT parallel trace analysis

  - The MPI and hybrid MPI/OpenMP versions of the SCOUT parallel analyzer must
    be run as an MPI program with exactly the same number of processes as
    contained in the experiment to analyse: typically it will be convenient to
    launch SCOUT immediately following the measurement in a single batch script
    so that the MPI launch command can be configured similarly for both steps.
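    As a sketch of such a batch script (archive name, launcher and process
    count are placeholders; SCAN normally launches SCOUT automatically, so
    the explicit second step is only needed when analysing separately):

```shell
# Hypothetical sketch: collect a trace, then analyse it with the same
# number of MPI processes as the measurement.
scan -t mpirun -np 256 ./app
mpirun -np 256 scout.mpi epik_app_trace
```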

  - SCOUT is unable to analyse incomplete traces or traces that it is unable
    to load entirely into memory.  Experiment archives are portable to other
    systems where sufficient processors with additional memory are available
    and a compatible version of SCOUT is installed, however, the size of
    such experiment archives typically prohibits this.

  - SCOUT requires user-specified instrumentation blocks to correctly nest
    and match for it to be able to analyse resulting measurement traces.

  - SCOUT may deadlock and be unable to analyse measurement experiments:
    should you suspect this to be the case, please save the experiment
    archive and contact the Scalasca development team for it to be investigated.

  - SCOUT ignores hardware counter measurements recorded in traces.

  - If measurement included simultaneous runtime summarization and tracing,
    the two reports can be combined and remapped via:

        cube3_merge -o merge.cube scout.cube epitome.cube
        cube3_remap -o trace.cube merge.cube

--------------------------------------------------------------------------------

* CUBE3 analysis report explorer

  - CUBE3 consists of libraries for producing analysis reports (cubefiles)
    and an (optional) GUI for analysis report browsing.  These library APIs
    and the resulting cubefiles are incompatible with previous CUBE versions,
    i.e., the CUBE3 GUI and algebra tools provided in this release cannot
    read cubefiles produced by earlier (KOJAK & CUBE) releases, and earlier
    versions of the GUI and tools cannot read cubefiles produced by the
    CUBE3-based tools in this release.

  - Only one topology can be visible at a time: to switch to another topology
    in the WX-based GUI, return to the System Tree and then select Topology
    View again.

  - On Cray XT systems, the physical machine topology is generated during the
    remapping of intermediate cube reports.  Quad-core nodes are assumed by
    default; this can be changed by setting the environment variable
    XT_NODE_CORES (e.g., XT_NODE_CORES=2).

  - CUBE3 library interfaces and file formats are *UNSTABLE* and very likely
    to change.

--------------------------------------------------------------------------------

* SHMEM communication analysis

  - Support for SHMEM communication analysis is based on the serial EXPERT
    trace analyzer.
  - SHMEM is currently only supported with IBM TurboSHMEM.  In particular,
    Cray SHMEM support is NOT implemented yet.
  - Due to the lack of freely available applications using the one-sided
    communication paradigm, these toolset components tend to be less well
    tested than others.

  It is planned to address these limitations in future releases.

--------------------------------------------------------------------------------

* OpenMP analysis

  - The measurement and analysis components cannot handle OpenMP programs which
    use nested or task parallelism.  Even disabling nesting may not help.

  - The same team of threads is expected to be used throughout execution,
    i.e., OMP_NUM_THREADS threads.  If a larger number is used for any parallel
    region (e.g., via the "num_threads(#)" clause or "omp_set_num_threads(#)"
    runtime function), the additional threads are not included in the
    measurement.  In such cases, a larger OMP_NUM_THREADS may be specified.
    Automatic trace analysis may not be possible if smaller teams are used or
    regions are not executed in parallel due to an "if" clause.

  - SCAN automatic trace analysis of hybrid MPI/OpenMP applications is
    primarily done with a hybrid MPI/OpenMP version of the SCOUT trace
    analyzer (scout.hyb).  When this is not available, or when the MPI-only
    version of the trace analyzer (scout.mpi) is specified, analysis results
    are provided for the master threads only.
    Alternatively, if the trace files can be merged (via elg_merge), the EXPERT
    trace analyzer may be used, and its report includes analysis results for all
    threads and additional OpenMP-specific performance metrics.

  - The OPARI preprocessor is used for instrumenting OpenMP applications.
    OPARI, being a simple source-to-source transformation tool, has the
    following OpenMP related restrictions (see also next section):

    All languages

      + The first SECTION directive inside a SECTIONS workshare directive
        is required (and is not optional as described in the OpenMP
        specification)

      + ORDERED, TASK and TASKWAIT constructs are not instrumented

      + OPARI processes source files before the compiler preprocessor,
        so macros and included files are not processed.

    Fortran 77/90:

      + The !$OMP END DO and !$OMP END PARALLEL DO directives are required
        (and are not optional as described in the OpenMP specification)

      + The atomic expression controlled by a !$OMP ATOMIC directive has
        to be on a line all by itself.

      + The functions "logical OMP_Test_lock" and "integer OMP_Test_nest_lock"
        may need to be explicitly defined in program units where they are used,
        otherwise compilation of the processed source files may fail reporting
        return type inconsistencies for the instrumentation substitutes.

    C/C++:

      + Structured blocks describing the extent of an OpenMP pragma need
        to be either compound statements {....}, while loops, or simple
        statements.  In addition, for loops are supported after omp for
        and omp parallel for pragmas.  Complex statements like if-then-else
        or do-while need to be enclosed in a block ( {....} ).

      + C99 6.10.9 _Pragma operators are not supported.

    It is planned to address these limitations in future releases.

--------------------------------------------------------------------------------

* OPARI

  This section provides some background on how OPARI works, so you can
  better understand how to use it for instrumenting "real" applications.
  Unfortunately, in its current state, it does not always work as
  automatically as would be desirable, and various workarounds are required.

  NOTE: In the following description "kinst" is used as a synonym for
        "scalasca -instrument" or "skin" and "kinst-pomp" as a synonym
        for "scalasca-instrument -pomp" or "skin -pomp".

  OPARI Basic Description
  -----------------------

  OPARI is used for two purposes:
  1) Instrumentation of OpenMP constructs
     (by "kinst" and "kinst-pomp" in OpenMP compilation mode)
  2) Activation of manual instrumentation using "POMP directives"
     (by "kinst-pomp" independent of compilation mode;
     also for MPI applications(!) but NOT by kinst(!))

  For each source file "<base>.<ext>", OPARI as called by the kinst and
  kinst-pomp scripts does the following:
  1) Create a modified (instrumented) file "<base>.mod.<ext>".
  2) Create an instrumentation descriptor file named "<base>.<ext>.opari.inc"
     which contains the corresponding instrumentation descriptor definitions.
  3) A so-called OPARI table file "opari.tab.c" is ultimately created
     which contains a compilable table of all instrumentation descriptors.
     For Fortran, it "#includes" all these instrumentation descriptor files.
     For C/C++, these are "#included" in the modified source files.
  4) OPARI needs to keep track of which files it already instrumented and
     the instrumentation points it found in each source file, in order to
     uniquely identify them. This information is stored (by default) in a
     file called "opari.rc" in the current directory. For each source file
     processed, "opari.rc" is updated.
  5) [kinst-pomp only] All temporary intermediate files (but not opari.rc)
     are automatically removed when they are no longer needed.

  Biggest problem
  ---------------

  The main problem is the fact that OPARI needs to keep track of the
  instrumentation status in "opari.rc". This has several implications:

  * You cannot use "parallel" makes (e.g., "gnumake -j") in the instrumentation
    process, as multiple active OPARI instances could corrupt "opari.rc"
    by (re)writing it at the same time.

    SOLUTION: compile with "normal/sequential" make

  * You will run into problems when you try to build/instrument two or more
    applications in the same directory which use different subsets of files.

    SOLUTION: remove "opari.rc" after building each application

  * You will run into problems when you try to build/instrument applications
    where the source files are spread over multiple directories, as each
    directory will have its own incomplete "opari.rc".

    SOLUTION: use the OPARI option "-rcfile <file>" to specify the path to
              a file which will be used as a master "opari.rc" for the whole
              build process (e.g., a file in the top directory of your project).

  HOW TO pass options to OPARI
  ----------------------------

  The kinst scripts (kinst, kinst-pomp) allow passing options to OPARI in the
  following way:

    kinst <opari-options> -- <compiler> ...

  i.e., the OPARI options are specified after the kinst command, terminated
  by a double hyphen ("--"). For example, if you want to use the -rcfile
  option, use

    kinst-pomp -rcfile /tmp/opari.rc --

  instead of

    kinst-pomp

  Other smaller issues
  --------------------

  * OPARI reports "ERROR: unbalanced pragma nesting".
    This is normally caused by missing do-loop end directives.  In contrast
    to the OpenMP standard, where they are optional, OPARI requires "!$OMP END
    DO" and "!$OMP END PARALLEL DO" directives for parallel loops in Fortran.
    The solution is to add them.

  * Some OpenMP compilers (e.g. PGI) are non-standard-conforming in the way
    they process OpenMP directives by not allowing macro replacement of OpenMP
    directive parameters. This results in error messages containing references
    to POMP_DLIST_##### where ##### is a five-digit number.  In this case, try
    to use the OPARI option "-nodecl". This is unfortunately not a perfect
    workaround, as this can trigger other errors in some rare cases.

  * Some Fortran compilers (e.g., Sun) don't fully support C preprocessor
    commands, especially "#line" commands.  In case you track a compilation
    error in an OPARI-modified/instrumented file down to such a statement, try
    using "-nosrc", as this suppresses the generation of "#line" statements.
    (With the Sun Fortran compiler, using "-xpp=cpp" is a better workaround.)

  * Sometimes instrumentation of OpenMP source files works, but the traces get
    enormously large because the application uses large numbers (millions)
    of small OpenMP synchronisation operations like atomics, locks or flushes,
    which are instrumented by default.  In that case, the instrumentation
    overhead might also become excessive.

    In that case, you can tell OPARI not to instrument these constructs
    by using the "-disable <construct>[,<construct>]..." option.
    Valid values for constructs are:

      atomic, critical, flush, locks, master, single

    or "sync" which disables all of the above.

    Of course, these constructs are then not measured, and you have to keep
    this in mind later when you analyze the results: although they do not
    show up in the CUBE analysis, the application might(!) still have a
    performance problem because of too many OpenMP synchronisation calls.
--------------------------------------------------------------------------------

