Originally posted by: michael-t
Thanks for the reply. Which performance monitoring tools could I leverage to find out what is going on in the POE/OMP code with SMT?
Please see some more information below.
Regards,
Michael
> It would be interesting to know more about the
> parallel application(s) that you are running. Are
> these MPI-Based and are you using IBM's Parallel
> Operating Environment (POE)?
> (http://www-03.ibm.com/systems/p/software/pe.html)
The code is MPI (POE) and it is launched via LL (v3.3.2). We have 2 out of 20 nodes with SMT ON, and when the MPI code runs there we have seen that, when the run-queue length is <= 16, the threads are not evenly distributed across the cores. The system is not otherwise loaded at that moment, and the code itself does dense matrix computation, distributing the matrix evenly across the "processors". With SMT OFF the same code is spread very evenly across processors; with SMT ON the imbalance increases the total wall-clock time.
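For what it's worth, this is roughly how I plan to quantify the imbalance from inside the job (just a sketch; compute_block() is a stand-in for our actual matrix kernel, not the real code):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for the real dense-matrix kernel. */
static double compute_block(void)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 100000000L; i++)
        x += (double)i * 1.000001;
    return x;
}

int main(int argc, char **argv)
{
    int rank, nranks;
    double t0, t1, elapsed, tmin, tmax, tsum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    MPI_Barrier(MPI_COMM_WORLD);   /* start all ranks together */
    t0 = MPI_Wtime();
    compute_block();               /* the evenly distributed work */
    t1 = MPI_Wtime();
    elapsed = t1 - t0;

    /* Min/max/average wall time across ranks; a large max-to-min gap
       with SMT ON (and not with SMT OFF) would match what we observe. */
    MPI_Reduce(&elapsed, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("wall time per rank: min=%.3f  max=%.3f  avg=%.3f s\n",
               tmin, tmax, tsum / nranks);

    MPI_Finalize();
    return 0;
}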
>
> It is hard to tell what is causing your performance
> problem without more information. In general, I
> would not expect that on a 16-CPU system running a
> parallel application with <= 16 threads that you will
> experience much of a performance improvement by
> enabling SMT since the parallel application cannot
> benefit from more than 16 hardware threads (i.e., it
> will not run on more than 16 hardware threads even
> with SMT turned on); that is unless you are running
> multiple parallel applications or several
> applications concurrently with the parallel
> application that is experiencing a degradation in
> performance.
>
Only one application (the one above) is running at the moment. What is strange is that there is a noticeable degradation in performance with SMT ON, and this IS what is bugging me. From the documentation one should not see any clear advantage at a run-queue length <= 16, but the degradation suggests that the amount of time two threads spend scheduled on the same core (instead of each using a core's primary hardware thread) is significant.
Is it possible that the time-collection process is confused by the fact that there are now two threads per core? Wouldn't hpmcount give accurate clock-cycle accounting with SMT ON?
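Independent of hpmcount, one cross-check I can do from inside the code is to compare wall-clock time with user CPU time for the same region (sketch below; the loop is only a stand-in workload). If both grow with SMT ON, the ranks are on-CPU but running more slowly, which would fit two threads sharing one core's resources; if only wall time grows, the ranks are sitting runnable but not dispatched.

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static double user_cpu_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

int main(void)
{
    double w0, w1, c0, c1;
    volatile double x = 0.0;
    long i;

    w0 = wall_seconds();
    c0 = user_cpu_seconds();

    /* Stand-in workload; in the real job this is the matrix kernel. */
    for (i = 0; i < 200000000L; i++)
        x += (double)i;

    w1 = wall_seconds();
    c1 = user_cpu_seconds();

    printf("wall = %.3f s   user CPU = %.3f s\n", w1 - w0, c1 - c0);
    return 0;
}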
> It sounds like the threads of your parallel job are
> not being scheduled across the primary SMT thread of
> each CPU. Ideally, you want each thread to run on a
> separate CPU in order to keep it from contending for
> resources (such as cache) with the other threads of
> the parallel application. If you are running MPI
> jobs, then you might want to look into the force_grq
> tunable available through the schedo command.
I saw these tunables with schedo, but it is not clear how the scheduler uses them to make dispatch decisions. Is there any documentation or article about how they affect the scheduler with respect to SMT?
>
> If the system load is not evenly balanced, then cache
> contention could become a problem since both SMT
> threads in each Power5 Core share the same L1 cache.
>
> To maximize performance of MPI jobs you will want to
> look into POE and the Co-scheduler. There is a little
> bit of information on this page:
>
> http://www-306.ibm.com/common/ssi/rep_ca/6/897/ENUS205-086/
#AIX-Forum