High Performance Computing

High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC solutions from IBM meet today's business needs.


Monitoring and Debugging Spark Applications

By Archive User posted Tue January 31, 2017 10:26 AM

  

Originally posted by: Charlie_Yang


IBM Spectrum Conductor with Spark 2.2 enables users to closely monitor their submitted applications. Information is available at the driver and executor level, including a list of all submitted applications, resource usage, performance tracking, and log retrieval.

The examples in the figures below demonstrate some of the monitoring and debugging features in the IBM Spectrum Conductor with Spark 2.2 cluster management console.

Viewing Spark applications

After submitting Spark applications, users with the SPARK_APPLICATION_VIEW_ALL permission can view their applications from two locations in the cluster management console. The first location is the Workload > Spark > My Applications & Notebooks page; this page lists only the applications that the current user has submitted. You can also view Spark applications on the Applications tab when you view a Spark instance group; this page lists all of the applications that were submitted to the Spark instance group’s Spark masters. On this page, users with the SPARK_APPLICATION_VIEW_ALL permission are able to see all of the Spark applications that are submitted by any user to the current Spark instance group. In Figure 1, the administrator can navigate to the Applications tab to view all of the jobs that are submitted to the current Spark instance group.

Figure 1 - List Spark applications

Retrieving application runtime information

To view an application's runtime information, click an application hyperlink listed under the Applications tab, as shown in Figure 1 above.

Tip: You can immediately see where errors have occurred by examining the number of error messages in either the driver or executor stderr file. The Driver stdout section of the page lets you quickly view and download the full driver log by clicking the download icon in the bottom-right corner of the Overview tab. See Figure 2 below.

Figure 2 - Application page

Within the Spark application page, as seen in Figure 2, each tab displays a different type of information about the Spark application. To view the driver and executor information, click the Drivers and Executors tab, as seen in Figure 3. To retrieve the individual driver and executor logs, see the Debugging Spark Applications section later in this blog.

Figure 3 - Drivers and Executors

The Performance tab displays a performance portfolio for the Spark application. Figure 4 shows the charts that illustrate task execution behavior over the application's run. This tab monitors the running Spark application and provides real-time analysis, which helps you identify the job stages that took the longest to execute. You can also track task durations and break down each job stage. Combined with the visualization charts, this data helps you spot performance outliers so that you can evaluate and optimize your application.

Figure 4 - Performance
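The charts on the Performance tab are built from the same kind of stage and task data that the standard Apache Spark monitoring REST API exposes on the driver. If you also want to pull stage timings programmatically, a minimal sketch might look like the following; the driver host name is a placeholder and the default UI port 4040 is an assumption you should verify for your deployment:

```python
import json
import urllib.request

# Placeholder endpoint: the host where the application's driver is running,
# with the default Spark UI port.
DRIVER_UI = "http://driver-host:4040"

def fetch(path):
    """GET a path from the Spark monitoring REST API and parse the JSON response."""
    with urllib.request.urlopen(DRIVER_UI + "/api/v1" + path) as resp:
        return json.loads(resp.read().decode("utf-8"))

# List the applications served by this driver UI, then print per-stage timings.
for app in fetch("/applications"):
    for stage in fetch("/applications/{}/stages".format(app["id"])):
        # executorRunTime is the total executor run time for the stage, in milliseconds.
        print(app["id"], stage["stageId"], stage["name"], stage["executorRunTime"])
```

Sorting this output by executorRunTime surfaces the longest-running stages, which mirrors what the Performance tab highlights visually.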

The Resource Usage tab shows the resources that are used to execute the Spark application. IBM Spectrum Conductor with Spark monitors slot, memory, and CPU core usage over the entire runtime duration, as seen in Figure 5. The Resource Usage tab also provides the Slots & Executors performance chart, which shows how many executors were running at any given time and how many slots each executor used. From these charts, you can determine whether the application is memory, CPU, or slot intensive, and when it stresses memory and CPU the most.

Figure 5 - Resource Usage

IBM Spectrum Conductor with Spark uses Elasticsearch to store and maintain all application monitoring data. This data is preserved for a specified period of time before being deleted; the default retention period in IBM Spectrum Conductor with Spark 2.2 is 14 days, after which all of the data related to an application is deleted.
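Because this data lives in Elasticsearch, you can also inspect it directly through the standard Elasticsearch REST API. The sketch below simply lists the indices on the cluster; the host, port, and any Conductor-specific index naming are deployment details you would need to confirm for your installation:

```python
import urllib.request

# Placeholder endpoint: point this at the Elasticsearch instance used by your cluster.
ES_URL = "http://elasticsearch-host:9200"

# The _cat/indices API lists every index with its document count and size, which
# makes it easy to see which application-related indices exist and how large they are.
with urllib.request.urlopen(ES_URL + "/_cat/indices?v") as resp:
    print(resp.read().decode("utf-8"))
```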

Debugging Spark Applications

There are two types of logs that you can retrieve for debugging purposes: stdout and stderr. Stdout contains the standard output from the application, which includes the runtime result of the application. Stderr contains the standard error output from the application, as well as other debugging information.
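As a concrete illustration, here is a minimal, hypothetical PySpark driver showing how the two streams get populated: ordinary print output lands in the driver's stdout log, while messages written to standard error (and uncaught exceptions) land in the stderr log.

```python
import logging
import sys

from pyspark.sql import SparkSession

# Route diagnostic messages to stderr so they show up in the driver's stderr log.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("sample-app")

spark = SparkSession.builder.appName("stdout-stderr-demo").getOrCreate()

log.info("Starting the computation")   # appears in the driver stderr log
total = spark.range(1000).count()
print("Row count:", total)             # appears in the driver stdout log (the runtime result)

spark.stop()
```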

Download Driver and Executor Logs

If there are errors or debugging information available for an application, you will find a driver hyperlink and executor hyperlinks in the Drivers and Executors tab of the application page, as shown in Figure 6 and Figure 7. When you click either of these links, the Driver or Executor dialog appears. You can download the logs from this dialog by clicking the download icon to the right of the log file.

Figure 6 - Driver Information

Figure 7 - Executor Information
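Once you have downloaded a stderr file, a quick scan for common error markers helps you jump straight to the interesting parts. A minimal sketch, assuming the log was saved locally as driver-stderr.log:

```python
# Print every line of a downloaded stderr log that looks like an error or exception,
# together with its line number.
markers = ("ERROR", "Exception", "Traceback")

with open("driver-stderr.log", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if any(marker in line for marker in markers):
            print("{:6d}: {}".format(lineno, line.rstrip()))
```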

Retrieving the Spark master log

When you use a Spark instance group to run multiple workloads, the Spark batch master might encounter failures and get stuck in the ERROR state. To get your Spark instance group back up and running, you can retrieve the Spark batch master log from the Services tab when you view a Spark instance group, as seen in Figure 8.

Figure 8 - Sparkms-batch service

In the Instances tab of the sparkms-batch service, select the instance of interest and click View Logs to download the log and check it for errors. Refer to Figures 9 and 10.

Figure 9 - Sparkms-batch instances

Figure 10 - Spark batch master log

Retrieving the Dockercontroller logs

When you run Spark instance groups and Spark workloads inside Docker containers, you can use the Dockercontroller logs to find and debug Docker-related errors. The Dockercontroller logs are generated on all hosts that ran, or attempted to run, Spark instance groups or Spark workloads inside a Docker container. You can retrieve the Dockercontroller logs by navigating to the Resources > Hosts > All Hosts page and opening the Host Name hyperlinks, as shown in Figure 11. You can also access the host dialog from the Drivers and Executors tab when you view a Spark application: select a driver or executor and click the Host hyperlink.

Figure 11 - View Hosts

Once the host dialog is open, navigate to the Host Logs tab, as shown in Figure 12. For Docker-related issues, select dockercontroller.log and click Retrieve Log List. Finally, download the dockercontroller.log file that is populated in the Logs section.

Figure 12 - Dockercontroller log retrieval
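Alongside the Dockercontroller log, it can also help to cross-check the containers on the affected host with the standard Docker CLI. The sketch below only wraps docker ps and must be run on the host itself; nothing in it is Conductor-specific:

```python
import subprocess

# List all containers, including exited ones, so containers that failed to start
# or stopped unexpectedly are visible alongside the running ones.
result = subprocess.run(
    ["docker", "ps", "-a", "--format", "{{.ID}}\t{{.Status}}\t{{.Names}}"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```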

Put Your Knowledge to Use

Batch application failures happen occasionally, and it is very useful to know how to debug them and determine what went wrong with the application. One of the best ways to do so is to read the log files.

The following example demonstrates the process of debugging a failed application. Figure 13 shows a batch command before it is submitted. The command is about to run a Spark Pi application with the spark-examples_2.11-2.jar file. This file name is the result of a typo in the sample command; the correct file name is spark-examples_2.11-2.0.1.jar. Because spark-examples_2.11-2.jar does not exist, the Spark application will fail with an error.

Figure 13 - Bad application
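A failure like this one can be caught even before submission by verifying that the application file exists on the host. The path below is the one used in this example and is purely illustrative:

```python
import os
import sys

# The jar referenced in the mistyped batch command from Figure 13.
jar = "/var/platformconductor/spark-2.0.1-hadoop-2.7/examples/jars/spark-examples_2.11-2.jar"

if not os.path.isfile(jar):
    # With the typo in place, this branch fires; the correct file is spark-examples_2.11-2.0.1.jar.
    sys.exit("Application jar not found: " + jar)

print("Jar found, safe to submit:", jar)
```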

After the application is submitted, it attempts to run the Spark Pi job and soon falls into the ERROR state, as seen in Figure 14.

Figure 14 - Bad application went to error

To diagnose this issue, we need to retrieve the driver or executor runtime logs. As you can see in Figure 15, the error produces an “Uncaught exception” message inside the stderr log.

Figure 15 - Bad application driver log

The driver stderr log in Figure 16 shows that there is a “NoSuchFileException” as the driver attempts to find the /var/platformconductor/spark-2.0.1-hadoop-2.7/examples/jars/spark-examples_2.11-2.jar file. For exploration purposes, you can check the Spark UI, which will show the same error.

Figure 16 - Driver NoSuchFileException

Summary

Application monitoring and debugging are very useful skills to have when dealing with Spark applications. IBM Spectrum Conductor with Spark 2.2 offers these monitoring features to help you manage the large amount of Spark application data. For more information about monitoring Spark applications, see the IBM Spectrum Conductor with Spark Knowledge Center.

If you would like to try out IBM Spectrum Conductor with Spark 2.2, you can download an evaluation version here. If you have any questions or require other notebook samples, post in our forum!


#SpectrumComputingGroup


Comments

Wed February 16, 2022 11:14 PM

Not sure if it is okay to paste the external link here, but I found the following article also explains Spark Web UI in detail.

https://sparkbyexamples.com/spark/spark-web-ui-understanding/