Process Discovery Analyzer cluster

Automatic process detection with machine learning algorithms and the growing flow of data in Process Discovery require more processing power. To meet the need for efficient and effective data processing, Process Discovery provides an Analyzer cluster configuration for high-performance analysis of large data volumes.

The Process Discovery Analyzer cluster is based on Apache Spark technology. A cluster is a distributed computing framework consisting of several computers, acting as cluster nodes, that run the Process Discovery Analyzer application. A Spark cluster consists of a master node and one or more worker nodes. The same Analyzer application is installed on the master node and on each worker node. The master node orchestrates the work of the cluster and also takes part in data processing by running an Analyzer worker process. Worker nodes run Analyzer worker processes to analyze the data. You can mix Analyzer cluster nodes running on different operating systems. The master node is assigned by specifying its address in the Management Console.

Apache Spark cluster technology assumes that the computers used as worker nodes are similar in processing power and memory. If the computers you want to use as cluster nodes differ greatly in RAM and processing power, run the Analyzer in containers (for example, Docker containers) on the computers with large amounts of RAM. This way you can create several cluster nodes with characteristics similar to the less powerful nodes, as the sizing sketch below illustrates.
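For example, suppose two machines have 32 GB of RAM and a third has 128 GB. The following sketch (an illustration only, with hypothetical machine names and RAM figures) computes how many equally sized containers the larger machine should run so that every node looks alike to Spark:

# Rough sizing helper: split machines with more RAM into containers that
# match the smallest machine, so all cluster nodes are similar in size.
# Machine names and RAM figures are hypothetical.
machines_ram_gb = {"node-a": 32, "node-b": 32, "node-c": 128}

target_gb = min(machines_ram_gb.values())  # size every node like the smallest machine

for host, ram in machines_ram_gb.items():
    containers = ram // target_gb
    print(f"{host}: run {containers} container(s) of ~{target_gb} GB each")

# node-c: run 4 container(s) of ~32 GB each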

We recommend using the most powerful computer among the nodes as the master node, because it carries the heavier load.

The minimum (and default) number of nodes is one. Depending on the amount of data the Process Discovery Agents collect and your requirements for analysis time, you can add as many cluster nodes as needed. For example, if you want to decrease the Analyzer data processing time, add a node to the cluster and check the processing time again. The processing time is presented in the Status report of the Process Discovery view in Kofax Analytics for RPA.

Note that adding twice as many nodes does not cut the processing time in half, because some time is spent coordinating the nodes and transmitting data between them. Therefore, reducing network latency increases cluster performance. Also, it is more efficient to add one powerful server with a large amount of RAM as a node than to add several less powerful ones. Because the Analyzer uses all available processing power of the system it runs on, we recommend using dedicated computers as cluster nodes.
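As a rough illustration of this diminishing return, the sketch below applies Amdahl's law with a hypothetical 90% parallel fraction; the actual fraction depends on your data, network, and settings:

# Amdahl's law illustration: speedup = 1 / ((1 - p) + p / n), where p is
# the fraction of work that parallelizes and n is the number of nodes.
# p = 0.9 is a hypothetical value, not a measured Analyzer figure.
p = 0.9

for nodes in (1, 2, 4, 8):
    speedup = 1.0 / ((1.0 - p) + p / nodes)
    print(f"{nodes} node(s): speedup x{speedup:.2f}")

# Doubling from 1 to 2 nodes yields about x1.82, not x2; 8 nodes yield
# only about x4.71.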

Set up Process Discovery Analyzer cluster
  1. Under Process Discovery Analyzer > Cluster settings in the Management Console, specify all necessary parameters: assign the master node, optionally specify a network pattern, and configure other settings. See Process Discovery Analyzer for details.

  2. Install, configure, and start Process Discovery Analyzer on the computers that you want to be Analyzer cluster worker nodes.

  3. After starting all worker nodes, install, configure, and start Process Discovery Analyzer on the computer that you want to be the Analyzer cluster master node.

  When starting the Analyzer instances on the nodes, specify the address of the Management Console where the Analyzer cluster settings are defined, and specify other parameters if necessary. See Process Discovery Analyzer for details. A minimal connectivity check that you can run before starting each node is sketched after these steps.

  Once all nodes are running, you can add, remove, and configure worker nodes as required. The changes are applied during the next Analyzer run.
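Before starting the Analyzer on a node, you may want to confirm that the node can reach the Management Console and the master address. The following is a minimal sketch of such a check; the hostnames and ports are placeholders for your own values, not product defaults:

# Minimal preflight connectivity check for a prospective cluster node.
# The hostnames and ports below are hypothetical placeholders; substitute
# your own Management Console and master node addresses and ports.
import socket

endpoints = [
    ("mc.example.com", 8080),  # Management Console (placeholder address)
    ("10.10.0.15", 7077),      # master node (7077 is Spark's usual default port)
]

for host, port in endpoints:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK      {host}:{port}")
    except OSError as exc:
        print(f"FAILED  {host}:{port} ({exc})")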

If any of the worker nodes fails, the underlying Apache Spark technology preserves the data and distributes the load among the remaining worker nodes. If you change any Analyzer settings, restart the master node. If you assign another computer as the master node, restart both the current master node and the newly assigned one. For example, suppose your cluster contains three nodes, A, B, and C, where A is the master node. If you assign B as the master node, restart both A and B.

If you encounter an out-of-memory error in the Analyzer log on the master node, open Settings > Process Discovery Analyzer > Cluster settings in the Management Console, increase the amount of memory in the Master memory (GB) setting, and restart the master node. See "Cluster settings" in Process Discovery Analyzer for details.

Monitor cluster nodes

An Apache Spark cluster includes a tool for monitoring node activity. Once you set up a cluster in your environment, open the master node dashboard to make sure all worker nodes are alive. The dashboard contains basic information about running and completed applications as well as a list of cluster workers. You can see the details of an application by clicking its application ID in the list. To open the master node dashboard in a browser, type the master address followed by the port number specified in the Master WebUI port option in the Management Console. For example:

10.10.0.15:8080

By clicking a worker ID in the list, you can open the worker dashboard. To open the worker dashboard directly in a browser, type the worker address followed by the port number specified in the Worker WebUI port option in the Management Console. For example:

10.10.0.11:8081
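If you prefer a scripted check to the browser dashboards, the Spark standalone master also serves its status as JSON at the /json path of its web UI. A minimal sketch, assuming the example master address above and a standard Spark standalone master:

# Query the Spark master's JSON status endpoint and list worker states.
# Assumes a standard Spark standalone master web UI, which serves cluster
# status (including the worker list) at /json.
import json
from urllib.request import urlopen

with urlopen("http://10.10.0.15:8080/json", timeout=5) as response:
    status = json.load(response)

for worker in status.get("workers", []):
    print(f"{worker['host']}: {worker['state']}")  # e.g. ALIVE or DEAD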

The master and worker node logs reside in the same location as the general Analyzer log file. To locate the log files on your system, see the Log files section in Process Discovery Analyzer.

For more information about Apache Spark clusters, see the Apache Spark documentation at https://spark.apache.org/.