NjmonFAQ

njmon Frequently Asked Questions

Many questions below are from the Session for Power Systems VUG 2019-2-28 - njmon

  • Date: Thursday, February 28, 2019
  • Starting time: 9:32 AM
  • There is a Replay available on the POWER Systems Virtual User Group website

1 Question: Which is most widely used, nmon or lpar2rrd?

  • Answer: Both are widely used. nmon is included in AIX, so I believe it may be more popular.

2 Question: How do these compare?

  • Answer: nmon and njmon collect a lot more stats than lpar2rrd, but lpar2rrd organises per-server data well.
  • Answer: njmon with a time-series database and graphing tool offers great flexibility and tooling for a vast range of other stats to be integrated too.

3 Question: Is njmon available anywhere other than SourceForge? My employer blocks access.

  • Answer: At present, only from sourceforge.net. I suggest you download it when at home.

4 Question: Is nmonchart installed by default on AIX?

  • Answer: No. The nmonchart and njmonchart Korn shell scripts are not part of AIX, and they handle data from Linux too.
  • Answer: You can download these from http://nmon.sourceforge.net
  • Answer: The nmon Analyser is not part of AIX either - download it from the same SourceForge website (it moved there in Nov 2019).

5 Question: Having issues with nmon2json.

  • Answer: Correct. As documented, the first real version concentrated on AIX.
  • Answer: As this is a Korn Shell script feel free to participate in the development.

6 Question: Nice to see specific LPAR info, but also useful if we could get a Summary of all LPARs at a glance.

  • Answer: Yes this can be done and a very simple first cut example was shown = 2 Linux and 3 AIX LPARs.
  • Answer: Feel free to develop Grafana Templates and share them.

7 Question: Are there any plans to incorporate an njmon analyzer tool into the IBM HMC platform for Power Systems, so we don't have to offload the njmon files to a local system?

  • Answer: No plan. Please consider injecting the njmon data directly into a time-series database, so you don't create a data file management issue.
  • Answer: njmon and JSON data are to encourage admin experimentation, not a locked down environment.
  • Answer: Of course, that would be a question for the HMC developers and perhaps a question for later when the advantages are clear and proven.

8 Question: Is there any integration for Graphite and njmon?

  • Answer: Graphite tools are alternatives to InfluxDB and can also co-operate.
  • Answer: From a brief look, the Graphyte Python module makes it simple to insert data into the Graphite Carbon time-series database.
  • Answer: Grafana can extract data from Graphite carbon - I am not sure about the other way around.
  • Answer: As covered in the webinar I can't cover a dozen alternatives.
  • Answer: If you are a Graphite fan, please explore loading the JSON output files into Graphite and let me know.
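
As a starting point for that experiment, here is a minimal sketch of turning one JSON snapshot into Graphite Carbon plaintext lines (`<path> <value> <epoch-seconds>`). It uses a plain socket rather than the Graphyte module mentioned above. The snapshot layout, the `njmon.<host>` path prefix, and the function names are assumptions made for illustration, not the real njmon output format.

```python
import socket


def flatten(prefix, value):
    """Yield (dotted.path, number) for each numeric leaf of nested JSON data."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Carbon uses dots as path separators, so keep them out of key names.
            safe_key = str(key).replace(".", "_").replace(" ", "_")
            yield from flatten(f"{prefix}.{safe_key}", child)
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        yield prefix, value
    # Strings, booleans and lists are skipped in this sketch.


def to_carbon_lines(snapshot, host, timestamp):
    """Format one snapshot as Carbon plaintext lines: '<path> <value> <epoch>'."""
    return [
        f"{path} {value} {int(timestamp)}"
        for path, value in flatten(f"njmon.{host}", snapshot)
    ]


def send_to_carbon(lines, carbon_host, carbon_port=2003):
    """Push the lines to Carbon's plaintext listener (2003 is its usual default)."""
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    with socket.create_connection((carbon_host, carbon_port), timeout=10) as sock:
        sock.sendall(payload)


# Hypothetical snapshot; real njmon JSON has many more sections.
sample = {"cpu_total": {"user": 12.5, "sys": 3.0}, "os": "AIX"}
lines = to_carbon_lines(sample, "lpar1", 1551346320)
```

A host name containing dots would add extra levels to the metric path, so you may want to replace them before building the prefix.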

9 Question: Will there be anything in the future to track processes over time (historical research) or top processes?

  • Answer: njmon for AIX already saves process information and it is planned for njmon for Linux too. Done.
  • Answer: The process code is in nmon all ready to be moved to njmon for Linux.
  • Answer: You have to switch on the process stats and then be careful as there can be masses of data.

10 Question: Can the collector be configured to use a key instead of putting the password in the command args?

  • Answer: You set the environment variable NJMON_SECRET then that will get used.
    • NJMON_SECRET=abc123 njmon -s60 -c 1440 -p 8080 -i influx (assumes influx is a hostname for the InfluxDB server)
  • Answer: the njmon_collector is only 00 lines of C code - anything could be made possible!
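
If you start njmon from a script, the same idea can be sketched in Python: read the secret from a file only the monitoring user can read, and pass it through the environment so it never shows up in the command arguments. The njmon flags are copied from the example above; the secret file and the function names are hypothetical.

```python
import os
import subprocess


def build_njmon_command(influx_host, port=8080, seconds=60, count=1440):
    """Mirror the command line from the answer above; no password argument."""
    return ["njmon", f"-s{seconds}", "-c", str(count), "-p", str(port), "-i", influx_host]


def build_env(secret, base_env=None):
    """Copy the environment and add NJMON_SECRET for the child process only."""
    env = dict(os.environ if base_env is None else base_env)
    env["NJMON_SECRET"] = secret
    return env


def read_secret(path):
    """Read the secret from a file (hypothetical; keep it readable by its owner only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


def launch_njmon(secret_path, influx_host):
    """Start njmon with the secret in its environment, not in its arguments."""
    env = build_env(read_secret(secret_path))
    return subprocess.Popen(build_njmon_command(influx_host), env=env)
```

For example, `launch_njmon("/path/to/secret-file", "influx")` would run the same command as the one-liner above, assuming `influx` resolves to the InfluxDB server.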

11 Question: Do you have to customize the injector.py, or is it part of InfluxDB and knows how to connect?

  • Answer: The injector is only a small Python program, but it uses the InfluxDB Python module to connect to InfluxDB and send the data.
  • Answer: Other time-series databases have slightly different Python modules to do the same thing.
  • Answer: I got the Splunk one working within an hour or so of having access to Splunk.
  • Answer: The one we currently have a problem with is Prometheus. It uses a pull model (the central database polls the agents for information), whereas all the others (InfluxDB, ELK, Splunk, Graphite) allow data to be pushed into the database.
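
To give a feel for how small such an injector can be, here is a rough sketch, not the actual injector shipped with njmon. It converts one snapshot into the list of point dictionaries that the InfluxDB Python module's `write_points()` accepts. The snapshot layout, the `host` tag, and the database name `njmon` are assumptions for illustration.

```python
def snapshot_to_points(snapshot, host, timestamp):
    """Turn each top-level section of a snapshot into one InfluxDB point."""
    points = []
    for section, stats in snapshot.items():
        if not isinstance(stats, dict):
            continue  # skip plain values at the top level
        # Keep only simple values; nested sections are ignored in this sketch.
        fields = {
            key: value
            for key, value in stats.items()
            if isinstance(value, (int, float, str)) and not isinstance(value, bool)
        }
        if fields:
            points.append({
                "measurement": section,
                "tags": {"host": host},
                "time": timestamp,
                "fields": fields,
            })
    return points


def send_points(points, influx_host, database="njmon"):
    """Push the points using the third-party 'influxdb' module (pip install influxdb)."""
    from influxdb import InfluxDBClient  # imported here so the rest runs without it
    client = InfluxDBClient(host=influx_host, port=8086, database=database)
    client.write_points(points)


# Hypothetical snapshot for illustration only.
sample = {"cpu_total": {"user": 12.5, "idle": 80}, "hostname": "lpar1"}
points = snapshot_to_points(sample, "lpar1", "2019-02-28T09:32:00Z")
```

Swapping to another push-style database mostly means replacing `send_points()` with that database's own client call.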

12 Question: Is there a way to encrypt the password in a file, so it isn't stored in clear text on the server? Compliance won't allow at our shop with the password shown.

  • Answer: You have the ssh method to collect data - that is completely independent of njmon and production ready.
  • Answer: Let's not get too paranoid here - if "password" sets off alarm bells then operate with no password at all.
  • After all, this is just performance data - no user or sensitive data is collected.
  • Answer: njmon / collector uses very lightweight encryption for the initial connection (includes the password) - to deter network sniffing attacks.
  • Answer: I don't have the time or interest to develop more complex encryption (i.e. not a priority compared to other features)

13 Question: Can the Microsoft Excel analyzer tools process njmon data?

  • Answer: No and there is no plan. That would be considered a disaster - we are trying to get away from the limitations of Excel.

14 Question: Very interesting tool. Could we also collect the output of some log files, to keep an eye on an application or an alert.log file?

  • Answer: Yes and no - njmon will not do that.
  • Answer: But some of the tools, like ELK and Splunk, have lots of additional features to explore log files: you can define a wide variety of formats, name the fields, and then graph the stats.
  • Answer: Alternatively, if your application generates additional stats (like SQL commits, user counts, web hits etc.) they could also be sent to any of the time-series databases, like InfluxDB, and graphed along with njmon data.
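
As an illustration of sending your own application stats, here is a minimal sketch that formats one sample as InfluxDB line protocol (`measurement,tags fields timestamp`). The measurement name, tags, and stat names are made up for the example. In InfluxDB 1.x such lines are typically POSTed to the `/write` endpoint, which is left out here.

```python
def _escape(text, chars):
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Integers get an 'i' suffix, strings are double-quoted, floats are plain."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, epoch_seconds):
    """Build one line-protocol record with a nanosecond timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{head} {body} {int(epoch_seconds * 1_000_000_000)}"


# Hypothetical application stats, tagged with the same host name njmon reports.
line = to_line(
    "app_stats",
    {"host": "lpar1", "app": "web shop"},
    {"sql_commits": 42, "response_ms": 12.5},
    1551346320,
)
```

Using the same `host` tag value as the njmon data makes it easier to put application and system graphs side by side.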

15 Question: If you don't have InfluxDB or Grafana, is it useless to collect njmon data?

  • Answer: Ouch!
  • Answer: As explained, InfluxDB and Grafana are a 6-minute install away and many alternatives can be used too.
  • Answer: You can also use njmonchart to view the data.

16 Question: Are there downloadable Grafana templates that can be imported, like Nigel's earlier examples (in the webinar)?

  • Answer: Yes, these are downloadable from the Grafana website - search for njmon.
  • Answer: I guess a list of some of the best would be useful - please upload yours to Grafana. Learning from others is very effective.

17 Question: In our environment we are using Prometheus as the database with Grafana. Do you have an injector script for the Prometheus database?

  • I have looked into Prometheus and you may have noticed that it uses a pull mechanism - it pulls data from the endpoint when it needs it.
  • All the other tools have the endpoint push the data as soon as it's available - like InfluxDB, Elastic and Splunk.
  • I have not found a method to take njmon data and add it to the Prometheus database - there are some Python modules, but you will find, for example, that the time recorded will be the time of insertion, not the time the data was captured.
  • I am no Prometheus expert; we need to find one, as it is non-trivial to learn Prometheus and work this out.
  • As Grafana can extract Prometheus data and InfluxDB data, and there are tools (like Telegraf) to extract InfluxDB data and put it into Prometheus, I am not sure we have a big problem.
  • I have made changes to assist elastic and Splunk but I see no simple solution for Prometheus.
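
To make the pull model concrete, here is a small sketch of what an endpoint would have to offer: the latest snapshot rendered in the Prometheus text exposition format, ready to be fetched over HTTP whenever Prometheus decides to scrape. This is only an illustration of the mechanism, not a working njmon integration. The snapshot layout, the `njmon_` metric prefix, and the `host` label are assumptions.

```python
import re


def metric_name(section, stat):
    """Build a metric name using only characters Prometheus allows."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", f"njmon_{section}_{stat}")


def escape_label(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(latest, host):
    """Render the numeric stats of the latest snapshot as exposition text."""
    lines = []
    for section, stats in latest.items():
        if not isinstance(stats, dict):
            continue
        for stat, value in stats.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            name = metric_name(section, stat)
            lines.append(f"# TYPE {name} gauge")
            lines.append(f'{name}{{host="{escape_label(host)}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical snapshot for illustration only.
text = render_metrics({"cpu_total": {"user": 12.5}}, "lpar1")
```

Note that the rendered text carries no capture time, so each value is recorded against the moment it is scraped, which is the timing mismatch described above.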

18 Question: Can you improve the collector tool to reduce the number of collector processes?

  • So to reduce ps -ef output, you want me to re-engineer the collector - ha ha ha ha.
  • This is a large and complex piece of development and very hard to test.
  • Each collector is doing very little on the CPU so there is no harm.
  • It is on the to-do list but not a priority - it might be written in Python and merged with the Injector.
  • Perhaps you have not noticed, but Linux already has 1000s of kernel processes on servers with many CPUs.

19 Question: If the collector stops, it kills njmon on the remote endpoint server. Can you improve the collector tool?

  • That is the nature of pipes and sockets and makes the code small and clean = KISS.
  • Given infinite time and resources anything is possible but I am low on both.
  • The first approach is: don't stop the collector!
  • The second is minimising the outage: get cron to start njmon, say, every 30 minutes; if the new copy finds the old njmon still running, it simply exits.
  • This is already built in with the -k option.
  • Third, start njmon with cron at midnight; then, if you need to recycle the collector, do that at 2 minutes to midnight.
  • I could add buffering to njmon, but that adds more problems than it solves:
    • Do we save the stats in memory and slow the server down?
    • Do we save the data on disk and cause mayhem on a full disk?
    • How long do we keep trying to retransmit? Then what do we do?
    • What happens on the collector server when it returns? 3000 endpoints will try to send a few MB each and the collector server gets hammered.
  • One possibility is that njmon connects to the collector for every snapshot and then closes the socket.
    • This mode seems like the trend with micro-services these days!!
    • I need to study the compute cycles that consumes - is it too wasteful to have 3000 socket connections set up and torn down?
    • Then the same user will set up 1-second snapshots!
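
The worry can be put into rough numbers. Assuming every endpoint opens one new connection per snapshot, the collector sees the number of endpoints divided by the snapshot interval in new connections each second. The 3000 endpoints come from the answer above; the 60-second interval is the one used in the earlier njmon example.

```python
def connections_per_second(endpoints, snapshot_interval_seconds):
    """New socket connections per second if each snapshot opens its own connection."""
    return endpoints / snapshot_interval_seconds


for interval in (60, 1):
    rate = connections_per_second(3000, interval)
    print(f"{interval}-second snapshots: {rate:.0f} connections per second")
```

With 60-second snapshots that is 50 connection set-ups per second; with 1-second snapshots it becomes 3000 per second.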