Sunday, July 26, 2009

Logging, Debugging, Instrumentation and Profiling

It started with this tweet from Cary Millsap [company | book | blog | twitter ]



That took me to this article on his blog, Multitasking: Productivity Killer.

I then began to look through his blog history and one in particular caught my attention, On the Usefulness of Software Instrumentation, from February of this year. I left a comment:
I'm having a conceptual crisis.

Profiling. Instrumenting. Debugging. Logging.

What's the difference? What's the same?

I have one AOL friend that says adding lines to code is debug code (meaning bad), while others say it is instrumentation (meaning good). Now I've got profiling to worry about? Come on!

Can you help a poor guy out? Maybe do a post on it? If you already have, point me to it? Anything? Help? :)
To which he responded:
Hmm, good one. I don't have enough time left today to work on it before close of business, but let's chat about it via email. That ought to help you out, and then I'll be glad to keep thinking about it and write something up. Write me via http://method-r.com/contact-us. —Cary
What follows is the conversation between Cary and I, reprinted with here with his permission.

Me: July 15, 2009, 10:09 PM
As directed...;)

I'm a big believer in Instrumentation, until recently, when I got all confused (shocker).

I was having a conversation with an old colleague...I asked about Instrumentation, and he indicated that Instrumentation created the ability to essentially monitor your system so that operational folks could adjust (more hardware, bandwidth, etc). My clumsy example, logins typically take 1ms, now they are taking 10ms. He agreed, sort of. The best I could figure out, from his definition, was that Instrumentation of software was like SixSigma is for process. (I only have a vague understanding of SixSigma).

Then I stumbled on your article. BTW, I think this is the Tom Kyte article you were referring to.

**Logging, I get. I think. Logging allows you to capture exceptions or other "bad" things and then, optionally, notify someone if it is severe enough.
**Debugging. It's a superset of Logging, I think. You are technically logging, but you may or may not have the notifications or ability to take immediate action.
**Instrumentation. ?
**Profiling. ?

I used to think I knew what Instrumentation was, but now I'm not so sure.

I think I'm fairly good at designing, then writing code for Oracle database applications. I consider security in my design and code stages. I consider future enhancements in design and code. I try to think of it all. I've even used debug statements liberally applied to all of my code. I want to do it the right way though. So my company brings you in to take a look, an audit, a pre-emptive strike if you will, and you say, "Man, this is some pretty awesome code here! Where's my check?."

I would love to see you write something up, 5 different articles, one on each topic then put them all together in a final post. Just kidding of course. Whatever you do write though, I'll be there asking questions!
Cary: July 15, 2009 11:36 PM
Here are some thoughts. I'd like you to press back with your questions and comments before I consider putting this up as a blogpost.

I think a good definition for debug code is, it's the code you put into your application when you'd rather have your code tell you what it's doing than use a proper debugger (with breakpoints and watches and all that). It's the printf("Here I am\n") statements that you throw into your code to help you figure out why it's not doing what you thought it was going to do. I use both debuggers and debug code. I use debug code anywhere I feel like a program I'm shipping to someone may need to report some information to me after it's deployed. Actually, I take that back...

When debug code gets fancy and becomes a permanent fixture in your code, like "if ($debug =~ /[gex]/i) { sprintf("%.6f entering %s\n", gettimeofday(), caller()); }", that's when I think it's really logging code. I think of logging as a permanent part of your code that adds some kind of operational feature to your code. It's more for the person managing how your code runs than it is for the person who runs it as a business function. Logs answer questions like, "Show me when the different phases of last night's process ran." The Oracle alert file is a log. Things that log4j prints out are logs.

Instrumentation, I'd say, is logging with a time-oriented purpose. I think people use the word pretty loosely to mean anything they want it to mean, so maybe instrumentation includes everything that's not a core function of the application. But, I like the word instrumentation. It puts a picture in my mind of a Starrett micrometer measuring the thickness of some Swiss watch part to within 0.0001 of an inch. It seems to me like the kind of a word that should mean something more careful than just "all code that's not core function code." When I say instrumentation, I mean logging with a focus on where time has gone.

Now, profile code is a different matter entirely; it is a very carefully restricted subset of instrumentation. A profile is a spanning, non-overlapping account of response time. Spanning means that the sum of the elements has to be greater than or equal to the response time the profile describes. Non-overlapping means that the sum of the elements has to be less than or equal to the response time the profile describes. In other words, a profile has to describe exactly a response time event; not less time than that, nor more. (For tasks with parallelized subtasks in them, the overall task has a profile whose interesting part requires drilling down into the individual serial [but executed in parallel] subtasks.) So, profile code is a type of instrumentation that is logging code that explains time in such a manner that you can create a profile from it.

The 10046 trace code path inside the Oracle Database barely qualifies as profile code, because a profile can be constructed from its output data, but I say "barely," because it's very, very difficult. It may be easy to get it 80% right, like examples in Jeff's and my Optimizing Oracle Performance book explain, but the last 20% is a huge chore. That last 20% is most of why the Method R Profiler (which was called the Hotsos Profiler when I wrote about it in OOP) has so many man-years of labor in it. Good profiling code is difficult to write (especially because one of your constraints needs to be to minimize the measurement intrusion effect of the profiling code itself), but it's like any other functional requirement. First, you have to know what you want, which requires trial-and-error before you get the design right. Then it becomes a software development project for a specific function, which you build just like you'd build anything else.

Helpful?
Me: July 16, 2009 12:11 PM
OK, Debugging and Logging are much more clear now.

You said:

It seems to me like the kind of a word that should mean something more careful than just "all code that's not core function code."

Would you define core-function code? Would you want to Instrument everything? Or is that the piece I am missing?

Also:

Instrumentation, I'd say, is logging with a time-oriented purpose.

and then

When I say instrumentation, I mean logging with a focus on where time has gone.

I'm not sure I'm following you with the time piece (pun intended). If I have Debug code which has now become Logging code, the table (or file, etc) would also capture the current date and time (or timestamp) right? You could use that to get metrics and such. Or am I missing something again?

On Profiling:

(especially because one of your constraints needs to be to minimize the measurement intrusion effect of the profiling code itself)

Kind of like introducing a TV crew or video camera? Where people act different because the camera is there? Isn't that an anthropology concept as well?

Profiling code lives outside of the code? Is that the concept? You have to be at a pretty low level to capture that stuff don't you?

I've got a ways to go.

If I didn't say it, I truly appreciate you taking the time for this.
Cary: July 16, 2009 11:11 AM
Inline below...

Cary

On Thu, Jul 16, 2009 at 11:11 AM, chet justice wrote:
I'm going to take a gander first, without help.

OK, Debugging and Logging are much more clear now.

You said:

It seems to me like the kind of a word that should mean something more careful than just "all code that's not core function code."

Would you define core-function code? Would you want to Instrument everything? Or is that the piece I am missing?


I would define "core-function code" loosely as the stuff that non-technical people talk about when they describe what your code does.

There are engineering trade-offs at every turn. Oracle example: We have 10046 tracing, which turns on instrumentation for every dbcall and oscall executed by the Oracle kernel. Some people shy away from using this instrumentation level because they're worried about the measurement intrusion effect. I believe that the intrusion is almost always well worth the cost. Oracle also provides 10200 tracing, which turns on instrumentation for every consistent read executed by the Oracle kernel. Now there is some measurement intrusion, because it makes the kernel emit a long line of trace data every time it executes the CR code path. Oracle provides both (and a couple hundred more, too), because sometimes you need both. But, thankfully, they gave us the ability to turn them on/off independently, so the run-time operational manager gets to make the trade-off decision about how much measurement intrusion he's willing to live with.

So, in response to your "instrument everything?" question, it's an artistic answer with an economic constraint: You need to instrument everything that will be in your business's best interest to have instrumented when the code goes out. The economic constraint is that you don't want to instrument something you'll never need to activate in the field: that's a waste of development time that could have been better allocated to writing some other function. I say artistic because you can't always know what parts of your code are going to need the instrumentation. You have to feel your way through that, and sometimes events over time will very much surprise you.

Also:

Instrumentation, I'd say, is logging with a time-oriented purpose.

and then

When I say instrumentation, I mean logging with a focus on where time has gone.

I'm not sure I'm following you with the time piece (pun intended). If I have Debug code which has now become Logging code, the table (or file, etc) would also capture the current date and time (or timestamp) right? You could use that to get metrics and such. Or am I missing something again?


No, I think you're fine there. What I really need to say, I guess, is that the word instrumentation might have so many different loose definitions that it might be a good word to stay away from, except as a general category marker.

On Profiling:

(especially because one of your constraints needs to be to minimize the measurement intrusion effect of the profiling code itself)

Kind of like introducing a TV crew or video camera? Where people act different because the camera is there? Isn't that an anthropology concept as well?


Yes. Except with machines, the intrusion effect is much easier to measure and predict. There's the occasional race condition problem that the presence of instrumentation can make more difficult to reproduce, so it's still an interesting, potentially nonlinear effect. (I'm talking about the times when running without tracing takes 10 seconds. Then you turn trace on, and your thing never takes more than 2 seconds. That happens sometimes.)

Profiling code lives outside of the code? Is that the concept? You have to be at a pretty low level to capture that stuff don't you?

There are two parts to profiling: (1) the emission of detailed event timings, and (2) the aggregation of those timings into something usable for an analyst. Part #1 is what happens when you run a.out after doing a "cc -pg hello.c". The execution writes out detailed timing data to a file called gmon.out. Then part #2 happens when you run gprof upon gmon.out to produce a readable, analyzable report. With Oracle, part #1 is turning on trace; part #2 is running something like our Profiler. Part #1 happens inside your code. Part #2 happens outside your code.

I've got a ways to go.

If I didn't say it, I truly appreciate you taking the time for this.


You're welcome. I learn by explaining; I'm glad you consider me a good person to learn with this week.
Me: July 16, 2009 2:30 PM
Ah ha! I think I have it now.

Profiling is using Logging (possibly Debugging) information that has been turned on (possible performance impact, which is at least measurable), reading that Data and then turning it into Information (report)?

When you say artistic, do you mean something like processing transactions (not database transactions but financial)? That would be a good candidate wouldn't it?

Cary: July 16, 2009 3:10 PM
What I mean by artistic is that there aren't any hard-and-fast rules that a programmer can use to determine what code paths to instrument, and which not to. And there are no hard-and-fast rules describing how to segment the instrumentation; e.g., 10046 and 10200 should be separately controlled? or should both instrumentation types be combined into one switch? I use the "art" word to denote tasks for which there's no deterministic list of steps that an inexperienced, untalented person can follow to do just as good a job as an experienced, talented one.

You might like to read the paper that I've attached.
I'm still looking for the public link to this paper. I will update this post when I find it.


Me: July 21, 2009 12:47 AM
Interesting read. I haven't quite wrapped my head around, but I've only read it once.

So to sum it up (sort of), I, as a PL/SQL Developer can't really do that much in regards to profiling? When you talk about profiling, you're talking at the kernel level right? If not...then I'm lost! :)

Cary: July 21, 2009 9:03 AM
Well, as a PL/SQL programmer, you have 10046 data, which can be assembled into a profile. And you have DBMS_PROFILER, which gives you a response time profile partitioned by PL/SQL line number. As Jeff and I suggested in our book, the analysis process there is:

1. Create a response time profile with 10046 data.
2. If the majority of your time is spent in an EXEC call upon a PL/SQL block (as opposed to a PARSE, EXEC, or FETCH of a SQL statement within the block), then use DBMS_PROFILER to figure out which line(s) of PL/SQL are causing your performance problem.
Me: July 21, 2009 11:36 AM
duh...dbms_profiler. I am a bit slow on occasion.

Very cool stuff. Part of my lack of understanding is that I've never really used a debugger or profiler. I think my first exposure to it (looking back now) was just a couple of months ago and JRockit, the ability to "look" and see what was running in the JVM. I've done Java programming before, but I never advanced past the barely intermediate stage. PL/SQL doesn't (or hasn't to my knowledge until now) made things easy...from a command line anyway.

So I profiled a simple session calling DBMS_LOCK.SLEEP. Pretty cool stuff.
SYS@TESTING>BEGIN
2 dbms_profiler.start_profiler
3 ( run_comment => 'JUST A TEST' );
4
5 dbms_lock.sleep(60);
6
7 dbms_profiler.stop_profiler;
8 END;
9 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SYS@TESTING>SELECT * FROM plsql_profiler_runs;

RUNID RELATED_RUN RUN_OWNER RUN_DATE RUN_COMMENT
---------- ----------- ------------ --------- ------------
0 SYS 21-JUL-09 JUST A TEST

1 row selected.

SYS@TESTING>SELECT * FROM plsql_profiler_units;

RUNID UNIT_NUMBER UNIT_TYPE UNIT_OWNER UNIT_NAME
---------- ----------- ----------------- ------------- -----------
1 1 PACKAGE BODY SYS DBMS_PROFILER
1 2 ANONYMOUS BLOCK
1 3 PACKAGE SPEC SYS DBMS_LOCK
1 4 PACKAGE BODY SYS DBMS_LOCK

4 rows selected.

SYS@TESTING>SELECT * FROM plsql_profiler_data;

RUNID UNIT_NUMBER LINE# TOTAL_OCCUR TOTAL_TIME MIN_TIME MAX_TIME SPARE1 SPARE2
---------- ----------- ---------- ----------- ---------- ---------- ---------- ---------- ----------
1 1 1 0 0 0 0
1 1 8 0 0 0 0
1 1 16 0 1069 1069 1069
1 1 32 0 654 654 654
1 1 41 0 0 0 0
1 1 49 0 0 0 0
1 1 57 0 0 0 0
You can find more information on DBMS_PROFILER here.

I have not yet completely wrapped my mind around this yet. I think it might be in part due to the fact that I have never really programmed in another language like C or Java (well, a little) where you run the debugger and profiler against your code.

I don't know if this conversation makes me look like a complete dolt either...I don't really care if it does. Cary graciously made himself available to answer my questions and tolerated my...inexperience in this area. It's no wonder he is held in such high regard in the Oracle community. Not only does he know his stuff, he's more than willing to share it.

Hopefully some of you will get something out of this conversation. I'm still hoping that Cary will do his own posts in the future, but I am grateful he let me share this with you.

1 comment:

Duke Ganote said...

I think we're in agreement here, but my 2 cents anyways:

* INSTRUMENTATION. I use it in two senses:
a. DBMS_APPLICATION_INFO to instrument "what is happening now"
b. logging to show the "time trail" of what has happened.
* DEBUGGING. Action to correct an erroneous implementation, although instrumentation may be of use.
* LOGGING. the "time trail" flavor of instrumentation (as mentioned above)
* PROFILING. Examining the "time trail" or code paths to see how time was spent.

"Debugging" reminded me of the joke about the Performance Consultant brought on a task. The Consultant's first question: "are the results correct now?". Answer: "well, not really." Consultant's statement: "if you don't care about correctness, I can really make this fly!"