This beginner's guide explains how to troubleshoot a general performance issue or one caught during regression testing. It assumes the performance tests use the Gatling and PyForge frameworks. The tests run nightly or weekly and can be found here: http://jenkins-fr.internal.forgerock.com:8080/view/AM%20Stress/job/AM-7.1.0/

Performance issues are also reported in the #am_performance Slack channel.

PyForge

See this page for information about PyForge: https://pyforge.engineering.forgerock.com/docs/getting-started

Config file parameters

Below is an example of the Stress parameters used by the tests. 

[Stress]
num_users = 100000
duration = 3600
concurrency = 10
max_throughput = -1
The duration is in seconds and is divided equally among the tests in the test suite. For example, with duration = 3600 and four tests in the suite, each test runs for 900 seconds.

If the performance drop was reported by a regression test, update the config to match the Jenkins job's configuration.

PyForge Test Command

This can be found in the Jenkins console output and looks like this:

./cleanup.py -f && ./run-pybot.py -n -v -c perf -s authn.TreesVChains -t RestLogin_perf_datastore_chain OpenAM

Results and Reports

PyForge generates a folder for each run under the results directory. The Gatling report can be found under <PYFORGE_HOME>/results/<TIMESTAMP>/<SUITE_PREFIX>/graph/
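If the box you ran the test on does not expose the results directory over HTTP, a quick way to view the report is to serve it yourself. This is a minimal sketch, assuming the standard Gatling report layout with an index.html at the report root:

cd <PYFORGE_HOME>/results/<TIMESTAMP>/<SUITE_PREFIX>/graph/
python3 -m http.server 8000    # then browse to http://<host>:8000/index.html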

Finding the commit

The first step in troubleshooting a regression issue is to find the offending commit. The performance regression tests run nightly, so there may be a few commits between each run, and a reported drop usually spans a few AM commits. If the offending commit is not obvious, you can use git bisect. If you build OpenAM locally at a specific commit, the resulting archive can be copied to the remote machine into the <PYFORGE_HOME>/archives folder.

See Replicating for how to run the tests. At each bisect step, record the throughput (req/s) and response times from the Gatling report; a sketch of the bisect loop follows below.
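A minimal git bisect sketch; the good/bad commit hashes, build command, archive path, and lab host are assumptions that will vary with your setup:

git bisect start
git bisect bad HEAD                # first build known to be slow
git bisect good <last-good-sha>    # last build known to be fast
# At each commit git checks out, build AM and copy the archive to the test box:
mvn clean install -DskipTests
scp <path-to-built-AM-archive> <lab-host>:<PYFORGE_HOME>/archives/
# Run the PyForge test, compare the Gatling report, then mark the commit:
git bisect good                    # or: git bisect bad
# When the offending commit is identified:
git bisect reset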

If the performance drop is spread across several commits, you may need alternative approaches such as profiling and sampling (see Profiling/Sampling below).

Replicating

It is better to use a Lab machine, as running locally can give inconsistent results. Apart from that, running locally and in the Lab is the same once you have PyForge cloned; however, there is a process for getting a Lab machine (see Run in the Lab).

Run in the Lab

Go to the page https://wikis.forgerock.org/confluence/pages/viewpage.action?spaceKey=QA&title=Grenoble+Lab and check whether any machines are available. If one is free, put your name against it and update the page. You can then ssh to the box; you will need to be on the VPN for this. The #grenoble-lab Slack channel can assist with credentials and any other problems.

Once you have the credentials and have logged in, create a directory under /external/testuser with your name and clone PyForge into it, as sketched below. Once a test has been run, the results can be accessed via HTTP. For example, http://gouda.internal.forgerock.com/external/testuser/ravigeda/pyforge/results/
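A minimal sketch of the setup; the user name and PyForge repository URL are placeholders for your own values:

ssh <user>@gouda.internal.forgerock.com
mkdir -p /external/testuser/<yourname>
cd /external/testuser/<yourname>
git clone <pyforge-repo-url> pyforge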

Profiling/Sampling

Flamegraphs

See this page for detailed information on flame graphs: http://www.brendangregg.com/flamegraphs.html

IntelliJ IDEA (Ultimate) has async-profiler and flame graph visualisation built in. You can kick off the PyForge performance test you need to investigate and attach the profiler.

https://www.jetbrains.com/help/idea/cpu-profiler.html

This is useful when the issue can be replicated locally, but sometimes the problem only shows up under load on a remote machine. In such cases, you can set up async-profiler on the box. See https://github.com/jvm-profiling-tools/async-profiler
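A minimal async-profiler sketch, assuming you have unpacked a release on the box; the duration, output path, and AM process ID are placeholders:

# Sample CPU for 60 seconds and write an HTML flame graph for the AM JVM:
./profiler.sh -d 60 -f /tmp/am-flamegraph.html <AM_JAVA_PID>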

There are other alternatives, such as using the perf utility together with a map file of JVM symbols.
Detailed steps are available here: https://maheshsenniappan.medium.com/java-performance-profiling-using-flame-graphs-e29238130375
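A rough sketch of the perf approach, assuming the JVM was started with -XX:+PreserveFramePointer, a JVM symbol map has been generated (for example with perf-map-agent), and Brendan Gregg's FlameGraph scripts are available on the box:

# Sample the AM JVM at 99 Hz for 60 seconds, capturing call graphs:
perf record -F 99 -g -p <AM_JAVA_PID> -- sleep 60
# Fold the stacks and render an SVG flame graph:
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > am-flame.svg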

VisualVM

If you need to monitor CPU usage, GC activity, memory usage, or live threads, you can use VisualVM. A thread dump can also be taken to diagnose any deadlocks. If running locally, run the performance test and open the process in VisualVM. If you are running the performance test on a remote machine, you need to enable a JMX connection. To enable it in the PyForge environment, add the following JMX arguments to java_args in the OpenAM section of the config:

java_args = ${Default:java_args} -server -Xmx2048m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

Once PyForge has started the test, you can connect VisualVM to the remote process via its JMX address, <host>:9010 in the example above.

PyForge restarts the server during its configuration, so you would need to connect after it has restarted.
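If the GUI is not practical, for example over a slow VPN link, a thread dump can also be taken directly on the box with the JDK's jcmd tool; a minimal sketch, with the AM process ID as a placeholder:

# Dump all thread stacks to diagnose deadlocks or stuck threads:
jcmd <AM_JAVA_PID> Thread.print > /tmp/am-threads.txt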