AIX

Connect with fellow AIX users and experts to gain knowledge, share insights, and solve problems.


#Power
  • 1.  SMT Performance Issues with SMT ON in Power5+ 575

    Posted Mon April 16, 2007 05:07 PM

    Originally posted by: michael-t


    After reading about all the nice advantages of SMT on Power5, we decided to try out SMT=ON on two out of 40 Power5+ 575 nodes in our cluster. However, for parallel applications with <= 16 threads (there are 16 Power5+ processors per node) we have observed that run times increased significantly. Another issue is that the workload distribution per processor has become uneven: code that used to distribute work perfectly evenly across all 16 processors now uses them very unevenly.

    Is this SMT behavior known or expected? If SMT ON indeed worsens the execution times of code with <= 16 threads, I see no advantage in using it unless the system is already overloaded (run queue > 16 on average).

    Do we have to do something special to take advantage of SMT? Any hint or pointer will be greatly appreciated!
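    For reference, here is how we have been checking and switching the SMT mode on the test nodes (AIX commands; as I understand it, the -w now flag applies the change immediately, without a reboot):

    ```shell
    # Show the current SMT mode and the logical processors
    smtctl

    # Disable SMT on this node, effective immediately
    smtctl -m off -w now

    # Re-enable SMT, effective immediately
    smtctl -m on -w now
    ```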

    Michael Thomadakis
    #AIX-Forum


  • 2.  Re: SMT Performance Issues with SMT ON in Power5+ 575

    Posted Mon April 16, 2007 06:23 PM

    Originally posted by: SystemAdmin


    It would be interesting to know more about the parallel application(s) that you are running. Are these MPI-Based and are you using IBM's Parallel Operating Environment (POE)? (http://www-03.ibm.com/systems/p/software/pe.html)

    It is hard to tell what is causing your performance problem without more information. In general, on a 16-CPU system running a parallel application with <= 16 threads, I would not expect much of a performance improvement from enabling SMT, since the application cannot benefit from more than 16 hardware threads (i.e., it will not run on more than 16 hardware threads even with SMT turned on). The exception is if you are running multiple parallel applications, or several other applications concurrently with the parallel application that is experiencing the degradation in performance.

    It sounds like the threads of your parallel job are not being scheduled onto the primary SMT thread of each CPU. Ideally, you want each thread to run on a separate CPU to keep it from contending for resources (such as cache) with the other threads of the parallel application. If you are running MPI jobs, you might want to look into the force_grq tunable available through the schedo command.
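    For example, something along these lines (a sketch; verify the exact behavior and legal values on your AIX level with the help text before changing anything):

    ```shell
    # Show the description, current value, and range of the tunable
    schedo -h force_grq

    # Enable it for the current boot (the value 1 is an assumption;
    # check the legal range in the -h output first)
    schedo -o force_grq=1

    # Revert the tunable to its default
    schedo -d force_grq
    ```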

    If the system load is not evenly balanced, then cache contention could become a problem since both SMT threads in each Power5 Core share the same L1 cache.

    To maximize performance of MPI jobs you will want to look into POE and the Co-scheduler. There is a little bit of information on this page: http://www-306.ibm.com/common/ssi/rep_ca/6/897/ENUS205-086/
    #AIX-Forum


  • 3.  Re: SMT Performance Issues with SMT ON in Power5+ 575

    Posted Tue April 17, 2007 12:35 PM

    Originally posted by: michael-t


    Thanks for the reply. What performance monitoring tools could I use to find out what is going on in POE/OMP SMT code?

    Pls see below some more info.

    regards!
    Michael

    > It would be interesting to know more about the
    > parallel application(s) that you are running. Are
    > these MPI-Based and are you using IBM's Parallel
    > Operating Environment (POE)?
    > (http://www-03.ibm.com/systems/p/software/pe.html)

    The code is MPI (POE) and it is launched via LL (v3.3.2): we have 2 out of 20 nodes with SMT ON, and when MPI code runs there we have seen that, when the runQ length is <= 16, the threads are not evenly distributed to the cores. The system is not loaded at that moment, and the code itself does dense matrix computation, distributing the matrix evenly to the "processors". With SMT OFF the same code is spread very evenly across processors. With SMT ON the imbalance leads to increases in the total wall-clock time.
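    For what it is worth, this is how we watch the per-logical-CPU distribution while the job runs (standard AIX tools; with SMT ON, mpstat -s breaks utilization down by hardware thread within each physical processor):

    ```shell
    # Per-logical-CPU utilization: 5 samples, 2 seconds apart
    sar -P ALL 2 5

    # SMT view: utilization of each hardware thread per physical
    # processor (AIX 5.3 and later)
    mpstat -s 2 5
    ```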

    >
    > It is hard to tell what is causing your performance
    > problem without more information. In general, I
    > would not expect that on a 16-CPU system running a
    > parallel application with <= 16 threads that you will
    > experience much of a performance improvement by
    > enabling SMT since the parallel application cannot
    > benefit from more than 16 hardware threads (i.e., it
    > will not run on more than 16 hardware threads even
    > with SMT turned on); that is unless you are running
    > multiple parallel applications or several
    > applications concurrently with the parallel
    > application that is experiencing a degradation in
    > performance.
    >

    It is only one application (the one above) running at the moment. What is strange is that there is a noticeable degradation in performance with SMT ON, and this IS what is bugging me. From the documentation one should not see any clear advantage (runQ <= 16), but the degradation shows that the amount of time two threads spend scheduled on the same core (versus each using a primary h/w thread) is significant.

    Is it possible that the time collection process is confused by the fact that now there are 2 threads / core? Wouldn't hpmcount give accurate clock cycle accounting with SMT ON?
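    For completeness, this is the kind of hpmcount run I have in mind (a sketch; ./matmul stands in for our actual binary, and the available counter groups depend on the installed hpmcount level, so check its usage message first):

    ```shell
    # Run the job under hpmcount; on exit it reports hardware
    # counter totals (cycles, instructions, etc.) for the process
    hpmcount ./matmul
    ```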

    > It sounds like the threads of your parallel job are
    > not being scheduled across the primary SMT thread of
    > each CPU. Ideally, you want each thread to run on a
    > separate CPU in order to keep it from contending for
    > resources (such as cache) with the other threads of
    > the parallel application. If you are running MPI
    > jobs, then you might want to look into the force_grq
    > tunable available through the schedo command.

    I saw these tunables with schedo, but it is not clear how the scheduler uses them to make dispatch decisions. Is there any documentation or article about how they affect the scheduler with respect to SMT?

    >
    > If the system load is not evenly balanced, then cache
    > contention could become a problem since both SMT
    > threads in each Power5 Core share the same L1 cache.
    >
    > To maximize performance of MPI jobs you will want to
    > look into POE and the Co-scheduler. There is a little
    > bit of information on this page:
    > http://www-306.ibm.com/common/ssi/rep_ca/6/897/ENUS205
    > -086/

    #AIX-Forum


  • 4.  Re: SMT Performance Issues with SMT ON in Power5+ 575

    Posted Tue April 17, 2007 03:01 PM

    Originally posted by: SystemAdmin


    Unfortunately I do not know much about the performance monitoring tools.

    Perhaps my first reply was somewhat misleading. I think that turning on SMT is allowing more processes to run in parallel, including the MP_CHILD processes of the MPI application plus system daemons and other miscellaneous background processes. The slowdown you are seeing could be due to the non-MPI jobs getting in the way. (Most likely, if you are not using the POE Co-scheduler (see below), you are encountering this problem when SMT is turned off as well, but for some reason it is exaggerated when SMT is turned on.)

    This interview discusses leveraging SMT in speeding up MPI applications:
    "In addition, there is work under way to efficiently exploit the background simultaneous multithreading (SMT) capability to hide the daemon activity at lower priority on these background threads so that they can run without interrupting the running application. This approach allows for the petascale systems to benefit from both a large breadth of ISV applications and libraries being available while effectively controlling the impact of OS jitter." Link: http://www.scidacreview.org/0701/html/interview.html

    It seems that running MPI jobs with SMT turned on might be rather uncharted waters. In fact, my understanding is that most research sites run their systems with SMT turned off in order to minimize the number of interrupts the MPI jobs encounter while they are running.

    These pages provide more information that might be of interest.
    http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe.doc/pe_422/am10200437.html
    http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe.doc/pe_421/am10100356.html

    In general you might want to look into the Co-scheduler that is provided as part of POE (now called PE). The Co-scheduler attempts to synchronize the times at which MPI processes and non-MPI processes are run. This minimizes the impact on individual MPI processes when a daemon fires and delays one of them in arriving at the next synchronization point in the MPI application. By making all non-MPI processes run during the same window, their impact on the synchronization of the MPI processes, and on overall performance, is minimized.

    The force_grq tunable ensures that, for the most part, only MPI processes (MP_CHILD) are placed into the local run queues of the hardware threads. It forces all other processes, with some exceptions, onto the global run queue, with the intent of distributing them more evenly across the node(s). Then, when the non-MPI processes are allowed to run, as a group they will finish faster than if they were localized to a few nodes.

    From the description of force_grq, it should affect scheduling the same way whether SMT is turned on or off.
    #AIX-Forum


  • 5.  Re: SMT Performance Issues with SMT ON in Power5+ 575

    Posted Thu April 19, 2007 05:42 PM

    Originally posted by: michael-t


    Brian,

    thanks again for the informative answer. I have opened a service call with IBM on this topic (PMR # is available).

    I did several experiments with identical POE (MPI) code that is scalable up to k=32 processes, running on SMT ON and SMT OFF nodes. What I saw is "interesting":

    For k threads, k <= 10, both SMT ON and OFF nodes execute the job in the same time (plus or minus a few seconds). For the SMT OFF node the running time keeps diminishing as we increase the number of processes all the way to k=16.

    For the SMT ON node, at around k=9 or 10 the execution time starts to increase (relative to the SMT OFF runs) until we hit k=16.

    Then, as we increase the number of processes further, for 17 <= k <= 32 the execution time on the SMT OFF node is much larger than that on the SMT ON node.
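    For the record, the scaling runs were driven by a simple loop of this shape (a sketch; the direct poe invocation and the ./matmul name are placeholders, since we actually submit through LoadLeveler):

    ```shell
    # Time the same MPI job for increasing task counts k
    for k in 1 2 4 8 9 10 12 16 20 24 28 32; do
        /usr/bin/time poe ./matmul -procs $k 2>> timings.txt
    done
    ```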

    The problem is how to get the scheduler on the SMT ON nodes to provide the same performance as on the SMT OFF nodes.

    It would be interesting to see if the co-scheduler can manage to spread the compute threads evenly on the primary h/w threads.

    I've seen a number of schedo parameters (including force_grq), but it is not clear how to set them properly to correct the dispatcher's behavior.

    Thank you again for all the feedback. This is an interesting topic, and I hope there will be a good way to tune the AIX short-term dispatcher for batch environments versus the interactive ones it appears to be tuned for.
    regards,
    Michael
    #AIX-Forum