AIX

Connect with fellow AIX users and experts to gain knowledge, share insights, and solve problems.


#Power
  • 1.  NJMON CPU data collection

    Posted Thu January 09, 2025 01:06 PM

    Hi Nigel,

    We are utilizing the NJMON tool for collecting system metrics. However, we have recently encountered a peculiar issue with its functionality. Specifically, the tool seems to have a limitation or is exhibiting unexpected behavior when gathering CPU-related data.

    The problem arises when the system's CPU count exceeds 256 cores. Beyond this threshold, NJMON is not collecting or reporting CPU metrics.

    We would appreciate your insights or suggestions on how to address this issue. Is this a known limitation of NJMON, or might there be a configuration setting or workaround to resolve this?

    Many thanks in advance for your support.

    Kind regards

    Yeswanth



    ------------------------------
    Yeswanth Jojode
    ------------------------------


  • 2.  RE: NJMON CPU data collection

    Posted Fri January 10, 2025 07:54 AM

    Hi Yeswanth,

    Not sure why you placed the question on this forum as njmon is open source but no harm done.

    The njmon repository is here https://nmon.sourceforge.io/pmwiki.php?n=Site.Njmon

    And I can be contacted here nigel ar griffiths at hotmail.com - you will have to correct the spaces and @ sign :-)

    It took a while but I found the problem.

    The njmon C code has a sanity check of 256 Logical CPUs numbered 0 to 255.

    • Clearly, computers keep getting bigger over time in the number of logical CPUs and physical CPUs.
    • I have never had access to a Linux computer this large before. 
    • Although the Power10 servers max out at 240 physical CPUs, with a maximum of 1920 logical CPUs (SMT=8).
    • There is also a logistics problem: graphs with more than 200 lines become a complete mess, as there are too many lines drawn one on top of the other and you can't see through the bird's nest.
    • You will notice that njmonchart has problems showing the key for the CPUs; in your case it comes out as 15 lists of CPUs.
    • Then there is the problem that the human eye can't accurately distinguish 200 different colours, so if you are looking for the "blue" CPU there are, say, 10 "blue" CPUs in the key list.
    • I do not have a simple solution for this problem. 

    When our grandchildren are running njmon :-) with 100,000 CPUs they will not want a line chart but will perhaps use a heat-map to see how many hot CPUs are active.
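As a rough illustration of the heat-map idea, the sketch below counts how many CPUs fall into each utilisation band instead of drawing one line per CPU. The function name, the five 20% bands and the input array of busy percentages are assumptions made for this example; they are not part of njmon or njmonchart.

```c
#include <assert.h>
#include <stddef.h>

#define HEAT_BANDS 5 /* assumed bands: 0-20%, 20-40%, 40-60%, 60-80%, 80-100% */

/* Count how many CPUs sit in each utilisation band.
 * busy_pct[i] is the busy percentage (0 to 100) of logical CPU i. */
void heat_histogram(const double *busy_pct, size_t ncpus,
                    size_t counts[HEAT_BANDS])
{
    size_t i;

    assert(busy_pct != NULL && counts != NULL);

    for (i = 0; i < HEAT_BANDS; i++)
        counts[i] = 0;

    for (i = 0; i < ncpus; i++) {
        double pct = busy_pct[i];
        size_t band;

        /* Clamp out-of-range samples so they cannot index outside counts[] */
        if (pct < 0.0)
            pct = 0.0;
        if (pct > 100.0)
            pct = 100.0;

        band = (size_t)(pct / (100.0 / HEAT_BANDS));
        if (band >= HEAT_BANDS) /* exactly 100% belongs to the top band */
            band = HEAT_BANDS - 1;
        counts[band]++;
    }
}
```

The output is the same five numbers whether the server has a dozen CPUs or 100,000, which is what makes this kind of summary easier to chart than per-CPU lines.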

    In addition, there is a programming problem in allocating memory structure space for high numbers of CPU (same goes for disks etc).

    That memory is wasted if the server only has a dozen CPUs.

    The GNU guys would call the 256 maximum a static magic number problem. Their coding standards forbid such magic numbers.

    In the short term we can change the maximum to say 2000 and recompile njmon but that just postpones the problem.

    In the medium term, njmon can count the number of CPUs and then allocate the correct memory size.
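A minimal sketch of the count-then-allocate approach, assuming a Linux or AIX system where sysconf(_SC_NPROCESSORS_CONF) is available. The struct cpu_stat layout and both function names are hypothetical and do not reflect njmon's real data structures. Note also that on Linux the CPU numbers reported in /proc/stat can have gaps when CPUs are offline, which this sketch does not handle.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-CPU record; njmon's real structure differs. */
struct cpu_stat {
    long long user, nice, sys, idle, iowait;
};

/* Ask the OS how many logical CPUs are configured (never less than 1). */
long count_logical_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_CONF);

    return (n < 1) ? 1 : n;
}

/* Allocate a zeroed array with exactly one slot per logical CPU.
 * Returns NULL if the allocation fails; the caller frees the result. */
struct cpu_stat *alloc_cpu_stats(long ncpus)
{
    assert(ncpus > 0);
    return calloc((size_t)ncpus, sizeof(struct cpu_stat));
}
```

With this shape a dozen-CPU server allocates a dozen slots and a 2000-CPU server allocates 2000, with no compile-time maximum involved.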

    But there is a further problem as the number of CPUs can dynamically change live! 

    Or let the user override the default 256 with a command line option or shell variable - but this requires the user reading the manual!!
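The shell variable override could look something like the sketch below. The variable name NJMON_MAX_CPUS and both function names are invented for this example; njmon does not necessarily offer such a variable. Anything missing, non-numeric or below 1 falls back to the default of 256.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_MAX_LOGICAL_CPU 256

/* Parse a user-supplied maximum; fall back to the default on bad input. */
long parse_max_cpus(const char *text)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return DEFAULT_MAX_LOGICAL_CPU;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1)
        return DEFAULT_MAX_LOGICAL_CPU;

    return value;
}

/* Read the (hypothetical) NJMON_MAX_CPUS shell variable, if it is set. */
long max_cpus_from_env(void)
{
    long value = parse_max_cpus(getenv("NJMON_MAX_CPUS"));

    assert(value >= 1);
    return value;
}
```

A user on a large server would then run something like `NJMON_MAX_CPUS=2048` before starting the collector, which is exactly the "read the manual" step mentioned above.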

    In the longer term, njmon will have to check every snapshot if the number of CPUs has gone up or down and adjust the memory sizes.
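The per-snapshot check could be sketched as follows: compare the CPU count seen in the current snapshot with the table's capacity and resize only when it differs. The struct names and resize_cpu_table() are hypothetical, not taken from the njmon source, and a real implementation would also need to decide what to do with the previous snapshot's counters when CPUs disappear.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-CPU record and table; not njmon's real layout. */
struct cpu_stat {
    long long user, sys, idle;
};

struct cpu_table {
    struct cpu_stat *stats; /* NULL until the first resize */
    size_t capacity;        /* number of slots in stats[] */
};

/* Grow or shrink the table to ncpus slots, keeping existing entries.
 * Returns 0 on success, -1 on allocation failure (table left unchanged). */
int resize_cpu_table(struct cpu_table *t, size_t ncpus)
{
    struct cpu_stat *p;

    assert(t != NULL && ncpus > 0);

    if (ncpus == t->capacity)
        return 0; /* common case: CPU count did not change */

    p = realloc(t->stats, ncpus * sizeof *p);
    if (p == NULL)
        return -1;

    /* Zero only the newly added slots so old counters survive */
    if (ncpus > t->capacity)
        memset(p + t->capacity, 0, (ncpus - t->capacity) * sizeof *p);

    t->stats = p;
    t->capacity = ncpus;
    return 0;
}
```

Called once per snapshot with the current CPU count, this costs one comparison in the normal case and one realloc() when CPUs are added or removed live.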

    I will release a new njmon for Linux ASAP as version njmon_linux_v85.c, including another fix for a big bug found yesterday, at https://nmon.sourceforge.io/pmwiki.php?n=Site.Njmon

    If you want to recompile your version, change njmon_linux_v83.c line 2170 from

    #define MAX_LOGICAL_CPU 256

    to

    #define MAX_LOGICAL_CPU 2048

    The utilisation structure grows from 20K to 164K = small beer these days.
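The 20K and 164K figures are consistent with a fixed cost of about 80 bytes per logical CPU (20480 / 256 = 80, and 2048 x 80 = 163840). That per-CPU size is inferred from the numbers quoted above, not read from the njmon source, so treat the constant below as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-CPU record size, inferred from 20K / 256 CPUs. */
#define ASSUMED_BYTES_PER_CPU 80

/* Size in bytes of a statically sized utilisation table. */
size_t utilisation_table_bytes(size_t max_logical_cpu)
{
    assert(max_logical_cpu > 0);
    return max_logical_cpu * ASSUMED_BYTES_PER_CPU;
}
```

Under that assumption, MAX_LOGICAL_CPU 256 gives 20480 bytes and MAX_LOGICAL_CPU 2048 gives 163840 bytes, so the table grows linearly with the maximum.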

    I would very much like an njmon capture sample from your Linux server, once you have the improved njmon version, for testing purposes and to code the medium-term solution.

    Thanks for the Post - a good community contribution helping other njmon users.



    ------------------------------
    Nigel Griffiths - IBM retired
    London, UK
    @mr_nmon
    ------------------------------



  • 3.  RE: NJMON CPU data collection

    Posted Mon January 13, 2025 03:44 PM

    Hi,

    I released the new version.

    Can you give it a try for me and let me know if this bug is fixed?

    I would still like a sample of your output to investigate a better fix for this and to think about how to graph the data better, for example as a heat map.

    Cheers, N



    ------------------------------
    Nigel Griffiths - IBM retired
    London, UK
    @mr_nmon
    ------------------------------



  • 4.  RE: NJMON CPU data collection

    Posted Tue January 14, 2025 08:01 AM

    OK fixed that link you pointed out - thanks for doing that.

    I hope you worked out you can also get the Source Forge files from here https://sourceforge.net/projects/nmon/files/



    ------------------------------
    Nigel Griffiths - IBM retired
    London, UK
    @mr_nmon
    ------------------------------



  • 5.  RE: NJMON CPU data collection

    Posted Tue January 14, 2025 08:19 AM

    Hi Nigel,


    Thank you for the easy fix and new release.

    Yes, the issue is fixed. We can now collect CPU data on systems with more than 256 CPUs.

    Will share the output JSON file soon.

    Thank you so much again for your help.



    ------------------------------
    Yeswanth Jojode
    ------------------------------