1. Introduction
In my opinion, one of the greatest things about Java is the ability to get meaningful thread dumps from a running production environment without having to enable DEBUG mode. A thread dump is a snapshot of exactly what's executing at a moment in time. While the thread dump format and content may vary between the different Java vendors, at a bare minimum it provides you with a list of the stack traces for all Java threads in the Java Virtual Machine. Using this information, you can either analyze the problem yourself, or work with those who wrote the running code to analyze it.
2. What is a stack trace?
I mentioned earlier that the thread dump is just a list of all threads and the full stack trace of code running on each thread. If you are a J2EE Application Server administrator and you've never done development before, the concept of a stack trace may be foreign to you. A stack trace is a dump of the current execution stack that shows the method calls running on that thread from the bottom up. If you're unfamiliar with what a method is, please see:
http://en.wikipedia.org/wiki/Subroutine.
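If you want to see what a stack trace looks like before diving into an application server, you can print one from a few lines of your own code. Here is a minimal sketch (the class and method names are made up for illustration):

public class StackTraceDemo {
    public static void main(String[] args) {
        methodA();
    }

    static void methodA() {
        methodB();
    }

    static void methodB() {
        // Prints the current thread's stack trace to standard error:
        // methodB is on top, methodA below it, and main at the bottom.
        Thread.dumpStack();
    }
}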
Here is an example stack trace for a thread running in WebLogic:
"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=1 tid=0x0938ac90 nid=0x2f53 waiting for monitor entry [0x80c77000..0x80c78040]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:95)
- waiting to lock <0x8d3f6df0> (a weblogic.socket.PosixSocketMuxer$1)
at weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)
You don't need to understand what this thread is doing at the moment; the key point to remember is that a stack trace reads from the bottom up. This means execution started with weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117); that method called the "execute" method above it, which called the one above it, and so on.
The key here is knowing that what is currently running (or waiting) is always the top method. This will give you insight into what the threads are stuck on. I usually glance through the whole stack trace because developers usually provide clues in method names as to what is actually going on. In the above thread, you'll notice that the method currently running (or in this case, waiting on a Java lock) is weblogic.socket.PosixSocketMuxer.processSockets(). "Socket muxer" is a term used by BEA and others for the code that manages data on network sockets (data sent across the network interface).
3. Thread pools
Most application servers use thread pools to manage the execution of tasks of a certain type. A thread pool is merely a collection of threads set aside for a specific kind of work. In the example thread above, I've shown you a thread from the WebLogic thread pool (or queue) named "weblogic.socket.Muxer". A pool of these "Muxer threads" is set aside by WebLogic to manage reading and writing data for network connections coming into WebLogic.
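If you've never worked with thread pools directly, the following sketch (plain Java, not WebLogic code; all names are invented) shows the basic idea: a fixed set of named threads that pick up submitted tasks. The thread names are what make a pool easy to spot in a thread dump.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPoolDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        // Three worker threads, named so they are easy to find in a thread dump.
        ExecutorService pool = Executors.newFixedThreadPool(3, runnable -> {
            Thread t = new Thread(runnable,
                    "ExecuteThread: '" + counter.getAndIncrement() + "' for queue: 'demo.Default'");
            t.setDaemon(true);
            return t;
        });
        // Any task submitted here runs on one of the pool threads above.
        pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName()));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}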
In most cases, you won't be looking at the Muxer threads. When someone reports that the application server isn't responding, it usually means that the application code deployed to the application server isn't working right - and you'll need to figure out why. So you'll need to identify the thread pool that your application code runs on and find those threads in the Java thread dump to see what's going on.
Unless you've customized the execute queue/thread pool that your application gets deployed to, you'll want to look for the given application server's "Default" execute queue. In WebLogic, you'll look for the threads marked as 'weblogic.kernel.Default' to see what's running.
For WebLogic 10, here is an example of just such a thread:
"[ACTIVE] ExecuteThread: '12' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x091962f8 nid=0x2f95 in Object.wait() [0x7cd75000..0x7cd75ec0]
at java.lang.Object.wait(Native Method)
- waiting on <0x8ed19d28> (a weblogic.work.ExecuteThread)
at java.lang.Object.wait(Object.java:474)
at weblogic.work.ExecuteThread.waitForRequest(ExecuteThread.java:156)
- locked <0x8ed19d28> (a weblogic.work.ExecuteThread)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:177)
In WebLogic, when you see the keyword "waitForRequest" in a particular thread's stack trace, it means that the thread is idle and could potentially do work if a new request came in. Having idle threads is usually a good thing, but it doesn't always guarantee that your application is healthy or responsive. It just means that you have the potential for more throughput.
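To see why an idle thread shows a wait() at the top of its stack, here is a highly simplified sketch (this is not WebLogic's actual implementation) of a worker that parks on a monitor until someone hands it a request:

public class IdleWorker extends Thread {
    private Runnable request;              // the next piece of work, if any

    public synchronized void handRequest(Runnable r) {
        request = r;
        notify();                          // wake the idle worker
    }

    private synchronized Runnable waitForRequest() throws InterruptedException {
        while (request == null) {
            wait();                        // idle: this frame sits at the top of the stack in a dump
        }
        Runnable r = request;
        request = null;
        return r;
    }

    public void run() {
        try {
            while (true) {
                waitForRequest().run();    // busy: application code is now on top of the stack
            }
        } catch (InterruptedException e) {
            // interrupted, time to shut down
        }
    }
}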
Once you know which execute queue is used for the unresponsive application code, you know exactly where to look to figure out what's going on.
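On a busy server the dump can contain hundreds of threads, so it can help to filter it down to just the queue you care about. Here is a quick-and-dirty sketch (the input file name is just an example) that prints only the stanzas whose header line mentions the default queue; it relies on the fact that entries in a thread dump are separated by blank lines.

import java.nio.file.Files;
import java.nio.file.Paths;

public class GrepQueue {
    public static void main(String[] args) throws Exception {
        String dump = new String(Files.readAllBytes(Paths.get("threaddump.txt")));
        // Thread dump entries are separated by blank lines.
        for (String stanza : dump.split("\n\\s*\n")) {
            // The first line of each stanza is the thread header, which contains the queue name.
            String header = stanza.split("\n", 2)[0];
            if (header.contains("weblogic.kernel.Default")) {
                System.out.println(stanza);
                System.out.println();
            }
        }
    }
}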
4. Getting a thread dump
There are quite a few ways to obtain a thread dump from a running virtual machine. I've found that the easiest way is to send the QUIT (on Unix) or BREAK (on Windows) signal to the java process. In Unix, this is done by figuring out the PID (process ID) of the Java virtual machine and doing a "kill -3 <pid>" (signal 3 is QUIT).
In Windows, you can use the SendSignal.exe program to send a "break" signal to the process and obtain a thread dump.
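If you'd rather capture a dump from inside the JVM itself (for example, from an admin page you control), the standard Thread API can do it as well. A minimal sketch:

import java.util.Map;

public class ProgrammaticDump {
    public static void main(String[] args) {
        // Snapshot of every live thread and its current stack trace.
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            System.out.println("\"" + entry.getKey().getName() + "\"");
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }
}

The signal-based approach is still the one to reach for on a hung server, since it needs no code changes and works even when your own deployments won't respond.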
Where you go to find the thread dump usually depends on the Java implementation. For most vendors, you'll go to the "standard out" log file. In WebLogic, it's often referred to as "weblogic.out", "nohup.out" or something you've created yourself by redirecting standard output to a file. This is why it's critical to redirect standard output and standard error to a log file, so they are captured when you need a thread dump. See
http://www.javasanity.org/basicsetup for information on how to do this.
For the IBM JVM, the thread dump gets printed to a file with a prefix of "javacore" in its name. The file gets written to the Java process's working directory, so you may have to do some digging to find it.
5. Full thread dump
Take a look at this example thread dump taken from WebLogic 10 while running some sample requests. You can look at it to get familiar with the format of a WebLogic thread dump taken on the Sun JVM. Take a moment to search for the default execute queue and figure out what the code was doing when this thread dump was taken. Don't read the next section until you're sure you want the answer.
6. Summary
Being able to obtain and analyze Java thread dumps is critical for understanding what's really going on inside your J2EE application server. Without thread dumps, you're blind when it comes to trying to get to the root cause of an application server "hang" condition. I'd suggest always getting a thread dump before restarting a hung or misbehaving application server. It might not always be useful, but it doesn't hurt to get one in case it's needed later.
If you took the time to analyze the full thread dump in section 5, I hope you found the threads marked "weblogic.kernel.Default" to see what's going on. In this thread dump, there are only two types of requests running on the Default queue: idle threads (which are not the issue) and threads running my "insert_test.jsp" JavaServer Page.
If you found that they all were doing something within "Oracle" routines, kudos to you. If you understand network programming, you might even have realized that when the top function is doing a "socketRead", it's waiting on data to come across the network. Seeing most of your executing threads waiting on data from the network is usually a pointer to the next place you need to look to identify the cause of your problem.
What I provided here was a simulation of a database issue. In this case, all running "Default" threads were waiting on a response from the database. This response was never going to come, because I had locked the table in exclusive mode and the JSP was trying to do an insert. I solved this by killing the lock, which allowed the threads to complete. A poorly tuned database (missing or corrupt indexes, SGA sizing, disk I/O, high CPU on the DB, etc.) can cause a similar "socketRead" bottleneck within the JDBC drivers for your application. I've also found that poorly designed SQL code, or tables that were too large and never purged, can cause this as well.
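For the curious, here is a rough approximation of what a JSP like insert_test.jsp might be doing under the covers (the table name, connection URL and credentials below are invented). With another session holding an exclusive lock on the table, executeUpdate() never returns and the thread sits inside the JDBC driver's socketRead, exactly as the dump shows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertTest {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "scott", "tiger");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO test_table (id, payload) VALUES (?, ?)")) {
            stmt.setInt(1, 1);
            stmt.setString(2, "hello");
            stmt.executeUpdate();   // blocks here for as long as the table lock is held
        }
    }
}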
A special thanks to the author of the original post.
Regards
Kuldeep Sharma