The "Database is a Bucket" Mentality
Front and center again...I just woke up from a nap, I'm grumpy, so I must write. Besides, I haven't had a good rant in quite some time. Friend of mine asked me last week for some advice, specifically asking if there was a tool to convert Oracle SQL Syntax to the ANSI SQL syntax. (A quick search turned up this (it was the first result), if you're interested). I had to ask why. Client is switching to an open source database, i.e. "free." Oracle licensing is way too pricey. I'm sure Oracle costs a lot of money, it's pretty darn good software. Quite possibly the best in the world especially in the database realm. I've written about the incredibly feature rich goodness that is the Oracle database here here...actually, just trust me. It's in my name. Why is there even a comparison? Could it be that everyone out there believes that the sole purpose of a database is to store data? That it can't do anything else? The storage and retrieval of data...that's all it does of course. It's like saying the Democrats and Republicans are the same...at face value, perhaps, but the devil is in the details. This, this "Bit Bucket" mentality is what is so incredibly frustrating. I am no position to argue the differences between the various flavors of database, I lack the experience. But if I were using SQL Server, I would leverage the shit out of it's capabilities. If I were using MySQL, I would leverage the shit out of it's capabilities. If I were using Firebird, I would leverage the shit out of it's capabilities. Same goes for every single flavor out there. Get my point here? The database is NOT a bit bucket!Do I need to use more 4-letter words? I know that Oracle is feature rich and that 99% percent of your code can live in the database...think APEX and PL/SQL. You could probably put ALL of your code inside the database if you wanted to put the javascript in BLOBs as well. Please, please please quit telling me they are the same... they are not. Follow up rant by Mr. O'Neill can be found on this following post Everything is a Bit BucketLabels: database, design, development, oracle, rant
Database Application Security - A Visualization
Security, in regards to the database, is pretty broad. There are a multitude of ways to secure your database. From physical or logical access to SQL Injection. This is a (not so) pretty picture of how I visualize security from an application point of view.  The base of any application are the tables. Access to those tables should be limited to database views and PL/SQL packages (API) that are owned by the application (schema). SELECT privileges on those views should be given to only users who need it (users also includes other applications). Views present the data in the format that is required by the application. VPD (column or row) can be used to further restrict access to sensitive data. EXECUTE privileges on the APIs (PL/SQL) should only be given to those applications that need it. Under no circumstances, should direct INSERT/UPDATE/DELETE privileges be given to another user. Much of my thoughts on this matter are directly traceable to reading AskTom for so many years. I can't point to any one specific article, I think it's the accumulation/internalization of reading for so long. It makes sense though. Who better to decide the functions (INSERT/UPDATE/DELETE) that can be performed against a given set of tables than the person(s) who created them? If said person has left, make someone else responsible for them. I've fought hard over the years to implement this "pattern" and have been fairly unsuccessful. It is difficult to go against years of doing it one way (full access to the underlying tables) to forcing someone or something to use the API. The idea to visualize it came to me recently and I need to put it down. What do you think? Am I full of it? Is this reasonable? Anything I left out? Labels: database, design, oltp, security
PARALLEL Rant?
Let's say you have DEGREE set at the table and index level. I ask you if this is appropriate, instead offering up using whatever tool accessing the data to provide the PARALLEL hint. The DOP is set pretty low, given our current system. But it's still set and can't be easily turned off. I'm all for PARALLEL, but it's been beaten into my head that it should be used, specifically for batch operations. In other words, transforming or loading data. With the DOP set at the table or index level, it is not (necessarily, see resource limits below) controllable. If you have 1 or 2 users issuing SELECT statements against the table, it's not a big deal. Let's say the DOP is set to 8. 8 processes are spun off for each user. That's 16 processes now running that SELECT statement. Now let's say you have 1000 users. You probably won't make it to 8000 processes...your machine will probably keel over and die...or worse, just sit there forever. But we have to set DOP at the table/index level...our users don't know how to write SQL. Fair enough...teach them how to write it. That takes too much time. How do you ever expect them to learn? It might be a good short term solution, but is it really a good long term solution? Teaching your users how to write better SQL would be in everyone's best interest. - Initially, you'll be swamped with "How do I?" type questions.
- Then the questions will only trickle in.
- You'll have much more savy business users who can now probably articulate their needs much better which will lead to
- Better design documents
- Better requirements
- You can finally begin to push off more of this "reporting" type functionality out to the business (where it should be in my opinion).
OK, that might be a bit of a fantasy. What about setting up resource profiles for the users? I've never used them, but I was reading up on them tonight for this post. What can you control with this feature? - Limiting the User Session Level
- Limiting Database Call Levels
- Limiting CPU Time
- Limiting Logical Reads
- Limiting Other Resources
- You can limit the number of concurrent sessions for each user
- You can limit the idle time for a session.
- You can limit the elapsed connect time for each session.
- You can limit the amount of private System Global Area (SGA) space (used for private SQL areas) for a session
What's the point of all this? I'm not too sure. It's a rant I guess. I just got done reading Dom Brooks recent post and reminded me of this conversation...which I have all too often. Labels: design, development, parallel, random
OBIEE and Source Control?
One very difficult aspect of using OBIEE (or APEX for that matter) is that it doesn't lend itself very well to source control. There is one RPD (metadata) file in use at any given time. Changes to this environment will affect anyone using or developing on the presentation server layer (Dashboards/Answers/etc). For the web catalog (Answers/Dashboards/Prompts/etc) you can make changes, but again, it will affect anyone who is also using the tool. If you want to tweak a report that has prompts or filters, you need to save everything off to your own folder in order to work on it or risk colliding with others or worse, messing up the report beyond repair (also known as FUBAR). Developers usually need to break things to fix them and giving them an environment where they can do this (also known as tinkering), without repercussions, should be high on the list of must-haves. Inspired by a meeting I attended last week and Jake's recent welcoming of VirtualBox into the Oracle fold, I decided to think (yikes) my way through a possible solution. After the meeting last week, I was convinced I could build an environment using subversion as the source control tool. Tie that in with Jira, Fisheye and ultimately Bamboo and you'd have a pretty sweet environment to work in. How to do it though? 1. Set up source control. SVN is free and runs on Linux. Free. 2. Convince a multitude of developers to install and configure OBIEE on their own workstations. Yeah...not so sure about that. I accidentally said in that meeting, "I don't know a single, good developer who doesn't have a local install of Oracle (the database)" May have been just a tad hyperbolic...I like to tinker and appreciate those environments which allow me to do so. Having a local sandbox has been indispensable for me. 3. Not sure where or what 3 is. That's where I got hung up...until reading Jake's post. How about this then? 1. No subversion (for the time being). 2. Virtualize the development environment. 3. Hand out the VDIs to the developers, and let them run with it. When they make changes, they can promote them through the usual channels. Once those changes are accepted/merged with the development environment, a new snapshot is taken and distributed. Having the snapshot of dev is the key I think. Those who don't like to tinker, who just like to get their job done, won't have to worry about configuring their environment. They'll just fire up the VM and do their work. For those that do like to tinker, they can fiddle with the VM as much as they want without fear of breaking things for everyone else. If they need a fresh start, just get the original VDI and go crazy again. This hasn't been completely thought at (if you couldn't tell). I haven't considered passwords or other such sensitive data. It sounds good in my head though. What's the worst that could happen? Labels: design, obiee, source control
How Do You Normalize a Tweet?
Second post by Mr. Myers, you can read his first one, How To Kill a Code Review here. I have always liked design topics, I don't think they are covered enough on the web, which is why I liked this one. I have often wondered what trade-offs designers make for these types of applications (Twitter, Facebook, etc). Are they even "designed" by a data modeler? Or are they created by application developers? Not really sure it matters to those companies as they are successful (in a strange, no business model kind of way) and, I don't believe, represent many of the realities that we as Oracle professionals are likely to deal with.Firstly, I don't tweet. Alex (Gorbachev) mentioned it at a Sydney meetup. I had a look, but didn't get entrenched and I assume there will be others out there who aren't tweeters. Suffice to say a 'tweet' is a message broadcast by a twitter user to the twitter consumers. They are up to 140 characters long. So what's to normalise ? Isn't it just a value ? Even Oracle 6 could cope with VARCHAR2(140). But actually, a tweet isn't just a simple value. A search for "beer" would turn up all messages that included #beer. Similarly, the @ sign followed by a username allows users to send messages directly to each other. A message with @example would be directed at the user [example] although it can still be read by anyone. Source: WikipediaFirst normal form states There's no left-to-right ordering to the columns.
Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps]. Source: WikipediaThe problem is that the tweet " @tom Come for a #beer or #burger. Don't let @harry come" definitely has hidden components, but there is a sequencing in the message that is just as important. In a practical implementation, we would probably have the following tables: TWITTER_USERS : username (eg @tom), date_joined, email.... TAGS : tag (eg #beer) TWEETS : tweet_id (surrogate key), created_by (referencing twitter_user), created_timestamp, tweet_text... TWEET_TAGS : tweet_id, tag (eg #beer) TWEET_DESTINATIONS : tweet_id, username (eg @tom) Our message would have the two child tag records (#beer and #burger) and two child destination records (@tom and @fred). At the logical level, we are not properly normalized because we have the tweet_text duplicating information from the child entities and the potential for inconsistencies between them. We can say that the tweet just seems to contain duplicate information but it is really different. Is that just being picky ? I am not suggesting the relational model is wrong, broken, incomplete or inadequate. Quite the reverse, in fact. In this case the value of the model is that it tells us the problems that will arise when we denormalise data. For example, if @harry deletes his twitter account (because he was never invited for beers), do we delete the tweet_dest that referred to him or do we keep it and not enforce that referential integrity constraint ? If we delete the tweet_dest, we have an inconsistency between the tweet_text attribute and the tweet_dest child entities. Or maybe we delete the tweet entity itself and all its children. Those are really choices for the business (possibly with some legal implications though). I don't have a solution to the logical model representation, and would be interested in feedback. But not by twitter please :) Labels: design, gmyers
Database Cleanup: Metrics
Before my current refactor/redesign goes to production, I would like to capture some metrics. I'm fairly limited in what I can actually do (i.e. I can't use DBMS_PROFILER in production). So far, this is what I have come up with: 1. Lines of Code (LOC) - I don't believe this is necessarily a reflection of good or bad code. For instance, I can take that 2 line INSERT statement and turn it into 20 lines. WasINSERT INTO my_table(id, col1, col2, col3, col4, col5, col6, col7, col8, col9 ) VALUES ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ); IsINSERT INTO my_table ( id, col1, col2, col3, col4, col5, col6, col7, col8, col9 ) VALUES ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ); That's a pretty sane example. The 2 line version isn't all that bad, but it does run off the page. The point I am trying to make is that "cleaning" up can actually add more lines to your code. In my opinion, when more than one person is going to support the code, readability is a nice thing. Whether or not you like my style, it is (more) readable. So LOC is not necessarily a great metric, but it can give you an idea which way you are going (after it has been properly formatted anyway). 2. COMMITs - Many argue that there should (almost) never be commits ( an exception is logging with the AUTONOMOUS_TRANSACTION pragma) in the database. The calling application should perform the commits. Unfortunately that general rule is not always followed. I've added it to my list of metrics because it is pertinent to our environment. Of course I have gone to great pains to make sure that the removal of one commit will not impact the entire system...that possibility does exist when you have commits everywhere. 3. Text - This was a real stretch. What is the size of the individual procedure, package or function? I wouldn't have considered it (I never have until now), but I was desperate to define something...anything. How do you determine that? SELECT name, type, SUM( LENGTH( text ) ) t FROM dba_source WHERE owner = 'MY_OWNER' AND name = 'MY_NAME'; 4. Dependencies - Also known, to me, as modular code. Why have 200 INSERT statements into a single table when you could wrap that up into a single procedure and call that? If you add a column, you'll still have to go through and fix all those occurences (if it's not defaulted to something). But if you remove a column from that table, it can easily be hidden from the calling code, thus you only have to change it in one place. Of course you wouldn't want to leave it there forever, but it can be done piece-meal, bit by bit as you work on affected parts. Have you ever thought about this before? What kind of metrics would you suggest? I know mine are a bit of a stretch...so please share. Labels: design, development
The Database Cleanup
I found a recent discussion on Ask Tom about " Unused Objects" via David Aldridge's post, Metacode Gone Wrong. The original poster's question: We are in a process of removing unused objects (tables/functions/procedures/packages) from the database. Is there any script(suggestions) or short-cut method to find these unused objects (tables/functions/procedures/packages not used in ddl/dml/select statements for more than 3 months).
There are more than 500 objects(tables/functions/procedures/packages) in our database.
At least PLEASE help me in finding unused TABLES.For other objects I'm thinking to check manually in the application code(using find and grep commands)
Please Help me. To which Tom replied: You'll have to enable auditing and then come back in 3 months to see.
We don't track this information by default -- also, even with auditing, it may be very possible to have an object that is INDIRECTLY accessed (eg: via a foreign key for example) that won't show up.
You can try USER_DEPENDENCIES but that won't tell you about objects referenced by code in client apps or via dynamic sql
I'm always perplexed by this. How does one get into a production environment where by they don't know what the objects are used by/for? No documentation or anything? I've been in several environments where production was not documented very well (if at all). I guess that's fortunate for me, as there is always work to be done. At my last gig, I went about an effort to clean up the database. We had close to 600 tables in a single schema. The one good thing (for me anyway), what that those tables were not accessible outside of the database, they were called through PL/SQL. Finding dependencies (DBA_DEPENDENCIES) was fairly easy...but I also ran across the caveat that he mentions, Dynamic SQL. Nothing strikes fear in you quicker than the realization that all of your work might be nullified because you didn't consider the use [Dynamic SQL] up front. I would complain during code review/architectural sessions about the use of Dynamic SQL...not sure if it was listened, but I got my opinion in. Documentation of a database, in my recent experience anyway, is the last thing on anyone's mind. It's seen as time-consuming and un-important. I like what David says in the Ask Tom comments: I think that if documenting code makes people sad then they ought to be in their bedroom writing card games in VB. The sad thing is that it doesn't have to be a huge overhead, it just has to be well thoughtout and #actually done#. How To Clean Up The DatabaseSince I've had so much experience at this of late, I'll list the steps I have taken in the hopes that you can find something useful yourself. 1. DBA_DEPENDENCIES - It's a great place to start, but it's not a panacea. You can get 90% of everything you need here. It's that last 10% that is the hardest. For example, I'll focus in on one table, query DBA_DEPENDENCIES, and then put that list into a spreadsheet where I can then track my progress. Usually I'll add "fixed", "fixed date" and "comments" columns so I'll know that I have addressed it. I'll typically have a worksheet for each table. 2. Privileges - Specifically in relation to tables. Do other database users have DML access? SELECT is one thing (still important) but INSERT/UPDATE/DELETE is entirely different. If other users do have access, are they service accounts (used by outside applications) or are they solely database users (another application)? 3. Auditing - I had never thought to use auditing for this purpose, but it might be helpful in the future. 4. Logging - If you suspect a piece of code is no longer used (naturally there is no documentation), but are not sure, you can add a bit of logging code to it. It's not the best method in the world, but it works. With all things, it's not a 100% guarantee either, the code may be called once a year, there is really no way to tell. 5. Thorough and Meticulous Analysis - This isn't really a method but it's going in here anyway. Document everything you can which includes everything you've done. At the very least, you'll have some documentation to show for it. At the very least, you'll have a much better understanding of your application and it's inner-workings. Update 09/17/2009 12:30 PMDom Brooks reminded me of DBA_SOURCE in the comments so I'm adding that in.6. DBA_SOURCE - A case-insensitive search of DBA_SOURCE is a must have as well. Allows you to find all the references to a table/procedure/etc. Some may have just be in comments, but some may also be contained in Dynamic SQL. Labels: design, documentation
Database Tutoring
Last week a friend of mine sent over a craigslist posting, someone looking for a tutor. Here's the ad: Looking for an experienced SQL Database Systems analyst to help with homework assignments for a graduate level database course. Would like to meet 2 times per week (for 2 hours each session) over the next four months. Evening, weekends or Wednesdays preferred. The candidate must be able to explain the technical to the non-technical. Please reply with resume and availablility.
Course Topics Are: * Relational Model and Languages ( SQL) * Database Analysis and Design * Methodology (Conceptual and Logical Design) * Social, Legal, etc. Issues * Distributed DBMSs and Replication * Object DBMSs * The Web and DBMSs * Business Intelligence I replied immediately and heard back the next day. I sent my resume but I thought the blog would be more appropriate. Apparently it was enough. We spoke on Saturday for about an hour and I received all the materials necessary to start doing research including a sample database (in Access). I have to say I'm pretty excited about it. I thoroughly enjoy trying to explain database concepts so that others (non-techies) can understand. It's a Masters level class filled with students from Computer Science and from an Education Technology tract. Bet you can guess which side my "student" falls in. Seems a little odd that the Educational Technology folks are in the class, but I think it's a good thing. When they need an application in the future, they'll have a much better grasp of what to ask for and hopefully they'll be more involved in the process. I'll use this space both for reporting on progress and helping to explain things. Wish us luck! Labels: database, design, sql
Data vs. Information
Last week in The Case For Views on the very last line I said Records in a table typically constitute data. Tables, joined together, in a view, tend to turn that data into information. That elicited a very, very strong reaction from a good friend and mentor. In the comments he left this Turn data into information? That doesn't make a whole lot of sense to me-- All data is information. Can you clarify that statement a little? On the face of it, that's not a very strong reaction. He tends to be a lurker though, rarely leaving comments. Then there was twitter, where he sent me a few more links on the subject. I'm pretty sure he was fired up. Once a week or so, we'll get together over beers and have excellent conversations. Occasionally, I'll try to hold my ground from the database perspective. Last week we had a discussion about whether the database should be making web service calls. Security aside, I thought it was appropriate given the size and skills of the shop, but he and our other friend staunchly disagreed. Point is, we have some great conversations. It has never come down to "You are stupid!" or anything like that, it's a conversation with each side presenting their arguments. Since my friend has like 28 degrees in Engineering, I've learned to give him the benefit of the doubt, so I wanted to study up on it. I asked the oracle-l mailing list on Friday. My contention, or what I have heard and read, is that a database stores data, only through the use of SQL or some reporting tool, does that data get turned into information. I don't know where I heard or read that for the first time, but I've probably been saying it for years. Through my friends response and others on the mailing list, I probably need to rethink that particular statement. Here are some relevant links provided by my friend and others on the oracle-l mailing list: Principles of Communication Enginnering, By John M. Wozencraft, Irwin Mark Jacobs Information Theory and Reliable Communication, By Robert G. Gallager Nuno Suto, aka Noons suggested Fabian Pascal, which can be read here. He also suggested reading up on Chris Date and Ted Codd as well as Conceptual Schema and Relational Database Design: A Fact Oriented Approach, By G. M. Nijssen, T. A. Halpin Have you ever used the phrase, "data into information" or some derivation there of? I'd like to track down where I first came across it if possible. Thoughts on Data vs. Information as separate entities? Labels: database, design
The Case For Views
I recently had to "defend" my use of views. To me, they seem natural. Using them is almost always a good thing to do. I've met those that don't really care for them...I just never understood why. Then again, those same people are still not convinced of PL/SQL APIs. Maybe there is something to that mindset... Being forced to articulate one's views is a good thing, it's part of why I blog. I won't lie though, it gets frustrating to have do this, seemingly, all the time. I'm going to do it here, again. Complex JoinsNo, I'm not afraid of joins. I am afraid of others who are afraid of joins though. More specifically, I'm afraid of those who aren't proficient at writing SQL. Let me do it, once, and let everyone else have access to the view. Besides, I'm the subject matter expert (SME) on the given set of tables, so it follows that I should create the interface to those tables. Yes, I said interface. It's exactly what a view is and interface to the underlying data. EncapsulationWrite it once and let it propogate everywhere. When I had to "defend" my use of views, I mistakenly used the example of adding columns. Oops. That would (possibly) require changes throughout the system. I meant to say remove columns, in which case you could keep the placeholder in the view using NULL without having to change all of the code. This does not mean that proper analysis does not need to be performed, it does, but you could possibly get away with not having to change everything that references the view. My second example was a derived value. This makes more sense to some people thankfully. I've seen the same calculation done on a specific field done 10s, even 100s of times throughout the code. Why not do it one time? Perfect use for views. SecurityFollowing the least privileges necessary to perform a given action, views allow you to give access to the data without direct access to the tables. Views can also be used to hide or mask data that certain individuals should not have access to. In conjunction with VPD or Application Contexts, it's a powerful way to prevent unauthorized access. MaintenanceMaintenance has been alluded to above, but not explicitly stated. For derived values: If you have a derived or calculated value and that calculation is performed all over the place, what happens when it changes? You have to update it everywhere. If you had used a view, change it once and it propogates everywhere. What was once a project is now a "simple" code change. This affects IT in how they choose and assign resources as well as the Business. For complex joins: What if one table is no longer used or needed? What if that table is littered throughout the code base? You have a project on your hands. If that table were part of a view, you could "simply" remove it, keep the columns in the view and you're done. There might be places where code needs to be adjusted, but overall, you have a much smaller impact. That's a good thing. OtherI tried putting the following statement in a category up above, but couldn't make it fit. Records in a table typically constitute data. Tables, joined together, in a view, tend to turn that data into information. Labels: design, development, rant, views
PL/SQL: Coding Practices
I've struggled this week to write. Lately I've had lots of technical content and not much philosophical content. I have lots of philosophical content now, but the blog isn't the place for it...at least not now. The big project I'm working on now is refactoring our payment processing system. We interface with multiple gateways for redundancy purposes. The code is comprised of 3 stand-alone procedures. One of those procedures has (had) 17 private procedures/functions in it. On one hand, it made sense, since everything was a stand-alone procedure or function, you didn't want all these dependencies on other objects. On the other hand, testing was virtually impossible. If you needed to make a change to private procedure #13, you had to run through an almost infinite number of test cases to ensure that you covered that particular case. I've now moved those 3 procedures into a package. Theoretically, it's hot deployable now as long as we don't change the package signature. That's a big win in my book. I've also moved those 17 private functions into their own individual procedures and functions that can be exposed via the package specification for unit testing purposes (in production they will be private as nothing else needs access). The hardest part of that effort was there were no parameters being passed to the procedure, it just relied on the declared variables. So for each and every one of those I had to figure out what it relied on to work and what variables it set. No small task. Here is the table: CREATE TABLE t ( id NUMBER PRIMARY KEY, col_1 NUMBER(1) DEFAULT 0 NOT NULL, col_2 NUMBER(1) DEFAULT 0 NOT NULL, col_3 NUMBER(1) DEFAULT 0 NOT NULL, start_date DATE DEFAULT SYSDATE NOT NULL, end_date DATE ); and a procedure (which does nothing obviously) CREATE OR REPLACE PROCEDURE update_t ( p_id IN NUMBER, p_col_1 IN INTEGER DEFAULT 0, p_col_2 IN INTEGER DEFAULT 0, p_col_3 IN INTEGER DEFAULT 0 ) IS BEGIN NULL; END update_t; What's the best way to integrate the procedure into your code? I've seen this: CREATE OR REPLACE PROCEDURE some_other_procedure ( p_id NUMBER, p_variable VARCHAR2(1) ) IS BEGIN IF p_variable = 'A' THEN update_t ( p_id => p_id, p_col_3 => 1 ); ELSIF p_variable = 'B' THEN update_t ( p_id => p_id, p_col_2 => 1, p_col_3 => 1 ); ELSIF p_variable = 'C' THEN update_t ( p_id => p_id, p_col_1 => 1 ); ELSIF p_variable = 'D' THEN update_t ( p_id => p_id, p_col_1 => 1, p_col_3 => 1 ); END IF; END some_other_procedures; Since I default the input parameters to 0, I didn't have to specify each individual parameter every time I called it. I like that. I don't much like having 4 separate calls to UPATE_T though. 1. It makes it difficult (without further logging), to determine where exactly it's being called in the control statement. 2. Seems like a waste of space. P_ID is always going to be the same, why set it 4 times? I decided to make just one call to UPDATE_T. I create local variables, then set them in the control statement, and then make the call to UPDATE_T. CREATE OR REPLACE PROCEDURE some_other_procedure ( p_id NUMBER, p_variable VARCHAR2(1) ) IS l_col_1 INTEGER := 0; l_col_2 INTEGER := 0; l_col_3 INTEGER := 0; BEGIN IF p_variable = 'A' THEN l_col_3 := 1; ELSIF p_variable = 'B' THEN l_col_2 := 1; l_col_3 := 1; ELSIF p_variable = 'C' THEN l_col_1 := 1; ELSIF p_variable = 'D' THEN l_col_1 := 1; l_col_3 := 1; END IF;
update_t ( p_id => p_id, p_col_1 => l_col_1, p_col_2 => l_col_2, p_col_3 => l_col_3 ); END some_other_procedures; Not much savings in space (and sometimes you'll actually have more), but for me, this is much easier to read. If I have to debug this, it feels a lot easier to concentrate on the control statement without the calls to UPDATE_T. What do you do in these kinds of situations? Same as me? Different? Think I'm off my rocker (yeah, I know some of you do)? Labels: design, development, plsql
Baseball Data Modeling
Anyone out there like baseball? Ever had a desire to model a baseball game? I do and I have tried a few times in the past. It get's pretty hairy down at the game game/inning/player level. If I remember correctly, substitutions tripped me up a bit. There there's the whole datawarehouse side, I'd like that to be part of the project as well. I started a project on Google Code here, the name is pretty vanilla, baseball-database. If you join you can suggest a better name. I'd like to try and talk Oracle into giving a few licenses to the recently released production version of SQL Developer Data Modeler. I've hit up @krisrice on Twitter, but he has no control over licenses, just development. I've also hit up Justin Kestelyn ( @oracletechnet) who said he would look into it. I've had no time lately to bother him; perhaps with a few more people... If you are interested, just drop me a line chet at oraclenerd or message me through twitter. Labels: baseball, datawarehouse, design, oltp
Large-Scale Solutions for Small Enterprises: a Brief How (and Why) To
This is the first (technically, second I guess) in what will (hopefully) be a series of guest posts, from Ted and others.
One thing I really liked was the small shop that Ted talks about. So often you see IT become the cost center typically due to a lack of planning. 5 people supporting 3000? That's pure awesome. Speaks volumes for Ted and the choices he has made.Wouldn't it be great if you could effectively run a major ERP suite on a tight budget with only a handful of staff? That is one of the things I do, so Chet asked me to write a bit about it. Specifically, he asked me "how have you integrated Oracle into a small shop? How much work? Hard? Easy?" Hopefully this post will cover those questions. The WhatThe enterprise I manage runs a large and scalable ERP application suite: Oracle PeopleSoft Financials, HCM, Campus Solutions, and Enterprise Portal (all on application version 9.0 on PeopleTools 8.49). We have those applications running on an essentially Microsoft technology stack (Windows servers, SQL Server, Active Directory, ILM, and so forth). Our environment is heavily virtualized using VMware. Our mission has always been to maximize functionality and minimize cost. We do a fair job at that, as we deliver this ERP suite to a user base of ~3000 with an IT staff of 5 and a very small budget. The HowOur modus operandi is basically to run our applications very near vanilla and to keep them as current as possible. This allows us to take full advantage of vendor and peer support and leverage the latest delivered functionality. Staying vanilla keeps operating costs down (no development costs, quick patch cycles, etc.) and allows rapid upgrades. We also do not diversify the technology within our enterprise architecture unless absolutely necessary. That allows our staff to use one skill set across multiple applications. That, in turn, allows us to hire ambitious generalists who are comfortable moving across applications. We minimize training costs by maximizing peer collaboration (very common in the Education and Research industry) and staying very active in user groups (our industry has an exceptional user group, -1 for my bias). We also rarely use consultants and never use implementation/upgrade partners anymore. The WhyA colleague from a large university asked me recently why, for an organization our size, we did not use a much smaller solution (I think he suggested QuickBooks, +1 for snark). It is a fair question with a simple answer. While our organization is smaller than a larger university, the complexity of our business requirements is comparable. For example, our payroll contains all of the variations of a larger university: full time and part time staff, faculty contracts of every imaginable period of time, student employees, contingent workers, and so on. To use a smaller solution to handle that complexity would require a larger and far more specialized payroll staff, at least some custom application development, and a different IT support structure. All of those things are costly. Instead, we let Oracle worry about providing the functionality to meet those requirements, we let our payroll staff adjust their business processes around that functionality (we are mean that way), provide general IT support from our small pool of staff, and leverage our user group for strategic direction and answers to tough questions. So, to answer my colleague's question, we use a large solution like Oracle PeopleSoft to accomplish our mission: to maximize functionality and minimize cost. The ImportanceI am no analyst, but it does seem to me that business requirements, regulations, and compliance issues are getting more complex. It is probably safe to assume that they will not get simpler any time soon. We are also in a massive recession and resources are scarce. It is probably safe to assume this won't change anytime very soon, either. One way to meet these challenges is to take advantage of the benefits of larger enterprise solutions (to handle increasing complexity) and operate them as efficiently as possible (to handle decreasing resources). The major enterprise technology players (Oracle, Microsoft, SAP, IBM) are spending time and money focusing on small and medium enterprises ( SME) recently. I think this is mostly for sales (what isn't?) but I also get the sense that they want to know how SMEs actually run these applications. I wrote a white paper last year ( Effective ERP Practices for the Small Institution) that got a good bit of attention from Oracle. I also spoke at Open World last year about this stuff [http://www.slideshare.net/badgerworks/higher-education-acheives-oracle-peoplesoft-roi-presentation] (and went for free, in true oraclenerd style). I hope that answered some of Chet's questions and maybe raised some new ones. If you have questions or comments drop them here. You might also want to hit my blog for the top 5 FAQs that I get when I speak, write, or consult on this topic. Thanks for reading! AboutTed [ linkedin | twitter ] is Vice President for Communications and Membership at the Higher Education User Group, MBA and MSIS student at the Johns Hopkins University, Director of Administrative Systems at MICA, and blogger at badgerworks. Labels: design, tsimpson
Design: Persist or Derive? Part II
Part I is here. I realized near the end of that post that I had completely screwed it up. I think some of the intent was conveyed, but not really what I wanted. I'm going to try it again. New sample tables: ONLINE_STORES and ONLINE_STORE_STATUS. ONLINE_STORES has 4 columns: - ONLINESTOREID - sequence generated surrogate key
- NAME - String, can be anything really
- START_DATE - When this store went online
- END_DATE - When this store went offline
ONLINE_STORE_STATUS - The purpose of this table is to capture the history of a store's status, either up (available) or down (unavailable). A record is created when the store goes down, it can be either a pre-determined amount of time where END_DATE is set to a time in the future (say 30 minutes) or the END_DATE can be NULL and must be manually re-activated. It has the following columns: - ONLINESTORESTATUSID - sequence generated surrogate key
- ONLINESTOREID - FK referencing ONLINE_STORES PK
- REASON - Why was the store de-activated or brought down. Typically I would supply a list of known reasons, but it's unnecessary for my purposes.
- START_DATE - Time the online store was de-activated.
- END_DATE - Time the online store was re-activated.
What about Persist or Derive?Similar to the example from Part I, you could do this (find an online store's status) another way be storing the status inline in ONLINE_STORES. Add a column ( Persist) STATUS with a check constraint that limits the values to either UP or DOWN, along with a NOT NULL constraint of course. In a pure OLTP environment that is probably the most efficient solution. Reporting on down times, or better, how long has an online store been UP, is sometimes an afterthought. This can be handled by a shadow/history/audit/logging table. Those have always felt clunky to me. Many systems are a hybrid of OLTP and reporting (DW/DSS/etc.). My approach has been to tie the two tables together using a View ( Derive) to get the answer to whether an online store is UP or DOWN. There might even be a name for that in the modeling books...I should read up. Here are the scripts necessary for my demo: CREATE TABLE online_stores ( onlinestoreid NUMBER(10) CONSTRAINT pk_onlinestoreid PRIMARY KEY, name VARCHAR2(100) CONSTRAINT nn_name_onlinestores NOT NULL, start_date DATE DEFAULT SYSDATE CONSTRAINT nn_startdate_onlinestores NOT NULL, end_date DATE );
CREATE TABLE online_store_status ( onlinestorestatusid NUMBER(10) CONSTRAINT pk_onlinestorestatusid PRIMARY KEY, onlinestoreid CONSTRAINT fk_onlinestoreid_oss REFERENCES online_stores( onlinestoreid ) CONSTRAINT nn_onlinestoreid_oss NOT NULL, reason VARCHAR2(100) CONSTRAINT nn_reason_oss NOT NULL, start_date DATE DEFAULT SYSDATE CONSTRAINT nn_startdate_oss NOT NULL, end_date DATE );
INSERT INTO online_stores ( onlinestoreid, name ) VALUES ( 1, 'online store #1' );
INSERT INTO online_stores ( onlinestoreid, name ) VALUES ( 2, 'nerds r us' );
INSERT INTO online_stores ( onlinestoreid, name ) VALUES ( 3, 'geeks rule' );
CJUSTICE@TESTING>SELECT * FROM online_stores;
ONLINESTOREID NAME START_DAT END_DATE ------------- --------------- --------- --------- 1 online store #1 30-JUL-09 2 nerds r us 30-JUL-09 3 geeks rule 30-JUL-09
3 rows selected. Now I just create a simple View on top of these 2 tables: CREATE OR REPLACE VIEW vw_active_stores AS SELECT os.onlinestoreid, os.name, os.start_date, os.end_date FROM online_stores os WHERE NOT EXISTS ( SELECT NULL FROM online_store_status WHERE onlinestoreid = os.onlinestoreid AND ( end_date IS NULL OR SYSDATE BETWEEN start_date AND end_date ) ); Create a record in ONLINE_STORE_STATUS: INSERT INTO online_store_status ( onlinestorestatusid, onlinestoreid, reason ) VALUES ( 1, 1, 'maintenance' ); Select from the View: CJUSTICE@TESTING>SELECT * FROM vw_active_stores;
ONLINESTOREID NAME START_DAT END_DATE ------------- --------------- --------- --------- 3 geeks rule 30-JUL-09 2 nerds r us 30-JUL-09 Voila! I Derived! As a brief sanity check, I created a record that had a pre-determined re-activation date (1 hour forward). INSERT INTO online_store_status ( onlinestorestatusid, onlinestoreid, reason, start_date, end_date ) VALUES ( 2, 2, 'the geek sliced us...', SYSDATE - ( 1 / 24 ), SYSDATE + ( 1 / 24 ) );
CJUSTICE@TESTING>SELECT * FROM vw_active_stores;
ONLINESTOREID NAME START_DAT END_DATE ------------- --------------- --------- --------- 3 geeks rule 30-JUL-09
1 row selected. I'm really not sure which way is better/worse, as with anything I guess "It depends." The semaphore (flag) in ONLINE_STORES is a perfectly viable solution. It is the easiest solution, admittedly. Part of my thinking as well, and this relates back to the question I posed once before, UPDATEs in OLTP: A Design Flaw?. If I UPDATE the record in ONLINE_STORES, it has meaning. Typically it would either be to change the name or set the END_DATE. The UPDATE in ONLINE_STORE_STATUS means something else, it's just telling me the stop time of the DOWN time. Or I am just overthinking this? Is this too much idealism? Is it idealism at all?I've talked about it so much lately I can't remember which way is up. What do you do? Labels: design
Design: Persist or Derive?
The easy answer is, "It depends." Derive: to derive (third-person singular simple present derives, present participle deriving, simple past and past participle derived)
1. (transitive) To obtain or receive (something) from something else. 2. (transitive, logic) To deduce (a conclusion) by reasoning. 3. (transitive, linguistics) To find the derivation of (a word or phrase). 4. (transitive, chemistry) To create (a compound) from another by means of a reaction. 5. (intransitive) To originate or stem (from). Persist(ence): To persist data in the database world is to store/write it to a table. When you begin to model a database application, you have to make decisions on whether or not to persist the data (store it in a table) or derive the data on the fly (using SQL, Views, etc.) Age is a good candidate. Do you store the person's actual age or do you calculate it on the fly? If you store it, you'll need to create a process (daily, weekly, etc.) to ensure that it is accurate right? My birthday is on November 12th. Prior to that, I will be 37. After that, I will be 38. How do you know when to update it? On top of that, do you want to know how old I am in months? Days? Minutes? Or seconds? Do you store all of those? Assuming you've captured the my date of birth (DOB), a simple query (better, a view) will calculate all those on the fly. WITH dob AS ( SELECT TO_DATE( '11/12/1971', 'MM/DD/YYYY' ) m_dob FROM dual ) SELECT m_dob dob, TRUNC( ( SYSDATE - m_dob ) / 364.25 ) age_in_years, TRUNC( ( SYSDATE - m_dob ) / ( 364.25 / 12 ) ) age_in_months, TRUNC( SYSDATE - m_dob ) age_in_days, TRUNC( ( SYSDATE - m_dob ) * ( 24 ) ) age_in_hours, TRUNC( ( SYSDATE - m_dob ) * ( 24 * 60 ) ) age_in_mins, TRUNC( ( SYSDATE - m_dob ) * ( 24 * 60 * 60 ) ) age_in_secs FROM dob;
DOB: 11/12/1971 AGE_IN_YEARS: 37 AGE_IN_MONTHS: 453 AGE_IN_DAYS: 13,775 AGE_IN_HOURS: 330,617 AGE_IN_MINS: 19,837,026 AGE_IN_SECS: 1,190,221,597 So that was a pretty simplistic example. Here's a harder one and the inspiration behind this post. Let's say you have 2 tables, PEOPLE and PEOPLE_NAMES.  A simpler design would be to store the names inline. This has a couple of benefits: 1. It's all in one table. No need to "complicate" things by joining to another table (yes, I know, that's what Oracle does) 2. Fairly straight-forward code to maintain that table. CREATE/UPDATE_PERSON would handle everything just fine. OK, So #2 wasn't a great one. Hopefully you'll have a better explanation of benefits.Heres' what it looks like:  So back to the first example, PEOPLE and PEOPLE_NAMES. 1. you can only have one name at a time (nick names is a different story). 2. you want to see the history and search on history (i.e. Maiden name). Let me back up a little bit. I approach design from the reporting angle, not the functional angle. I believe the two go hand in hand though (naturally, you can't have reporting on a non-functional system). I think this comes from my background as an end-user. Without information, what good is the data? Background over. With Rule #1, you'll have only one record "active/current" at any given time. This will be maintained by the application layer. I've seen some interesting ways to maintain that rule with the use of triggers and/or materialized views, but I'm not going there. Of course I just realized that it's missing the start_date and end_date columns. Just assume they are there. So how do you mimic the SELECT from the second example (i.e. names inline)? A View silly. CREATE OR REPLACE VIEW vw_people AS SELECT p.peopleid, p.dob, p.gender, pn.title, pn.first_name, pn.middle_name, pn.last_name, pn.suffix, pn.printed_name, pn.start_date, pn.end_date FROM people p, people_names pn WHERE p.peopleid = pn.peopleid AND pn.end_date IS NULL; In your reports (user interfaces), you use the view as opposed to SELECTing directly from the table. VW_PEOPLE now resembles PEOPLE from the second example. What does this have to do with Persist or Derive? Well, I forgot...wait, did I just completely screw up this example? I'm not really deriving the name, I'm storing it, just in a different table. This post will definitely have a Part II. Apologies for wasting your time. Update Part II can be found here. Hopefully it makes up for this one.Labels: design
IDs (NUMBER) vs. Codes (VARCHAR2)
A slightly different way of putting that is Surrogate vs. Natural keys. Not that I really want to get into the entire discussion...but I will...a little. On Friday having a discussion with a developer friend after I showed him my Party model. It's all about boozing it up. OK, wishful thinking on my part, it's just a database design based on the Party (or Entity) Model. I've pondered it here and here. I'm not planning on debating the merits of the Party Model. I do like it though, it seems to me a natural end point. It does not take into account testing of the design, nor does it take into account places where you might denormalize...and many choose to denormalize by keeping addresses in line with people or organizations. On to my point. To maintain data integrity without using Check constraints, I'll create a small lookup table. - ADDRESS_CODES
- PHONE_CODES
- EMAIL_CODES
- STATE_CODES
Perhaps a few others. ADDRESS_CODES, as the name suggests, would use a code as the key, a VARCHAR2. HOME, WORK, OTHER, etc. Same goes for the others. I posted my query on twitter Friday morning and received a pretty solid response from 4 people.  The response was swift:  Which was exactly what my conversation with my friend had led to. In this case, why would Natural keys, VARCHAR2s, be bad. 1. If a user enters them in, they could misspell something, CELL would be CLEL or something. Fair enough. 2. If #1 happens, you just fix the name field instead of the key. So I started to come around to it, until I got to STATES. States have a 2 letter abbreviation. Why not use that? It's not like they'll change (hah!) right? Perhaps a distinction is in order then. If it's user entered content, then use a surrogate key. If not, use the supplied code. I don't know. Is this just my personal preference? Is there a standard? What do you do? Labels: constraints, design
Classic: Application Developers vs. Database Developers II
The original (with all the fun comments) can be found here. Originally posted on December 9, 2008. This is the "followup" to yesterday's post.You can read the first article here. My application developer friend, Mr. M, emailed me and another fine gentleman this little blurb recently: Mr. M: OH YEAH BABY!!! TEN TIMES FASTER!!!! YEAH!!!!!!!!
Hey seriously, what a tub of shit Oracle is. Where does this myth come from that it's such a great platform? Their client tools suck balls and it's generally just a pain in the ass to work with from a developer's point of view. But devs for some reason are under this impression that from thew server perspective it's rock solid and performant. Well, it may be solid, but it's a fucking turd. Our dba here - definitely an Oracle guy - has been tasked with looking into moving us onto an oss db. He basically refuses to even look at MySQL, stating that it's a mickey mouse worthless pile of shit (can't really argue with him there lol), so that basically leaves Postgres. So it's been a few weeks now, and he will basically admit now that Postgres completely waxes Oracle as far as performance goes. LOL We run RAC in production too. He's looking at failover, replication, blah blah blah now, we'll see what the verdict on that is. Oh, and Oracle AQ? That's a worthless pile of shit too. Why do they tack a fucking message queue that doesn't play nice with jms onto the fucking database, which is probably already overworked? Oh wait, that's right, they're in the business of selling per cpu licenses! Cocksuckers. This was prompted by a recent Oracle email blast about the Exadata storage system/Warehouse. As I did before, I'll just put the email here. Me: Agreed, their client tools aren't all the great. Which ones are you using?
I use SQL*Plus (naturally), SQL Developer and JDeveloper. The latter 2 tie in very nicely with Subversion. With JDeveloper, when I want to try out Java, it's got a pretty good suite of tools.
Oracle starting out catering to businesses, Microsoft started with the consumer. Oracle has made pretty good strides in making their software more usable while Microsoft has made theirs more scalable.
I haven't used AQ a whole lot and definitely not with Java. I do know that it's billed as JMS compliant.
Postgres has it's place and so does Oracle. It is a great platform if you know how to leverage it. PL/SQL is a 3GL (or is it 4) and you can write entire applications using it...and if you do that right, an incredibly robust and scalable application at that. Mr. M: "It is a great platform if you know how to leverage it. PL/SQL is a 3GL (or is it 4) and you can write entire applications using it...and if you do that right, an incredibly robust and scalable application at that."
NO!!! NO!!! NOOOOO!!!
I want to beat people like you who say this with a ball pean hammer. There are only a select few individuals on this earth who can write and read application logic written in SQL. AVOID THIS ANTI-PATTERN AT ALL COSTS! What is it with you f_cking database guys??? Why do you always want to stuff as much crap into the db as possible?
DUDE! SQL IS A PILE OF SHIT FOR EXPRESSING APPLICATION LOGIC!!!
It's fine for querying and manipulating sets of data, in a relational database. But it is a worthless sack of shit for expressing application logic!
I'm having to dig through this f_cking abortion right now because some Oracle f_ckhead thought "you can write entire applications using it!" Blog that, mofo!
This was followed by a package he had been working on. I wouldn't say it was the greatest, but it wasn't all bad either. Me: goodness gracious.
"DUDE! SQL IS A PILE OF SHIT FOR EXPRESSING APPLICATION LOGIC!!!"
disagree (naturally). It's incredibly easy to do, you just don't know how yet...and it seems even the Oracle professionals out there don't either.
I'll tell you this, the crazier the SQL or PL/SQL needed to derive and manipulate data the poorer the design. Start with a good design and it all becomes very simple. Of course note the use of "naturally" in my lexicon. Thanks Jake. Mr. M: well dude, we are back to our old discussion - you arguing that procedural sql code is perfectly fine for building apps, and by extension, that the last 20 years of computer science have basically been a misguided lost journey down the meandering, fruitless trail of oop. um.....no. select this from that. otherwise keep that sql crap caged up where it belongs.
btw, do the db guys here suck? seriously. i'm not competent enough to judge. (to be fair, apparently that crap i sent you is fairly old stuff, but still....) Me: I would say, based on limited knowledge of software development, that the OOP movement was started because the database (specifically Oracle) was not mature enough to do what was needed. Plus, I seem to recall that the OOP movement was supposed to have solved all the world's problems by now.
It's further propogated due to the needs you specified in our discussion that day at WellCare (i.e. performance). I still believe that if application developers better understood how a DB works (Oracle, MySQL, SQLServer, etc) they would end up writing less code. Database constraints alone force you to write less (and better) code simultaneously ensuring good data.
The code that I did look at (first 1000 lines or so) isn't great. 1. With all those IF THEN ELSE statements it's telling me that there's probably a better way to store the data. Looks like they're missing an attribute that should be contained with a track. 2. using Object (PL/SQL) types to store data in the way they seem to be doing it is not the best way. Again, probably a design issue. 3. When you do something like this: UPDATE pb_album_metadata SET primary_digital_flag = 0 WHERE album_id IN (SELECT b.album_id FROM (SELECT a.album_id AS album_id, MAX(a.album_id) OVER (PARTITION BY a.standard_upc) AS latest_album_id FROM pb_album_metadata a WHERE a.standard_upc = g_album_tab(1).standard_upc ) b WHERE b.album_id <> b.latest_album_id ) AND primary_digital_flag <> 0;
They should probably have considered end_date as an attribute of the album metadata. While analytic functions are pretty cool, they're more for analyzing (OLAP) and not OLTP environments.
That's for starters and without table definitions... Me (again): oh yeah...and PL/SQL is/was built on top of ADA, FYI. Mr. M: "I still believe that if application developers better understood how a DB works (Oracle, MySQL, SQLServer, etc) they would end up writing less code. Database constraints alone force you to write less (and better) code simultaneously ensuring good data."
Huh? What are we not understanding? What would be an example of a constraint that would force us to write less and better code? Me: CHECK, NOT NULL (same as CHECK) and FOREIGN KEY constraints all fit into that category.
Instead of having to check if a value is NULL when submitting it to the database, just submit and let the database throw an error, which you'll conveniently capture in your Exception block. Same goes for CHECK constraints, columnA should be either one of three values (test, test1, test3), you won't have to check that in your application code, just catch the exception. FKs insure good data (proper relationships).
A different perspective. If you are going to pay for it, use that damn thing. If you don't care, don't do it. But over the course of an "enterprisey" application, you'll end up writing more code and make maintenance that much more difficult (did I do it there?). Just those constraints will force you and the teams of application developers to put good data into the database.
You can still do it in the application of course (form validation is a good place)... Mr. M: Ahh, jeez dude, I wasn't sure if you were referring to the literal "constraint" or not.
Dude, even f_cksticks like redacted I think have a decent enough understanding of when and how to use db constraints. It's when you get into things like cursors or cost plans of subselects and anonymous tables (i think that's the name for it - where I say select something from (select some other crap). Then we defer to db gurus like yourself.
But dude....."you won't have to check that in your application code, just catch the exception".......uh, don't ever repeat that off our little email group. And definitely don't go posting that on your blog. F_ck me man, it's a damn good thing we keep you db folks caged up in that rdbms box.... Me: So we've had this discussion at work...for a high transaction system, do Java/C/etc handle exceptions well or what?
Why is it bad to deal with exceptions rather than coding to avoid them?
I highly doubt even redacted understood database constraints...him and his cohorts believed all database software would be commoditized and MySQL would be king in short order. Mr. M: "for a high transaction system"
Or for any system really....
To use your example of check constraints (is this value one of....) or not null constraints, checking these rules in the Java code and in the database code would seem to violate DRY. But we do that alot, and it is acceptable in certain cases. For instance, we also probably violate DRY if we're checking this same rule say in Javascript on the front end. But we accept this tiny violation of DRY because it tends to make for a better user experience and as a performance gain too, for we avoid the round trip to the server. Now, what your advocating here is close to the same thing. You're basically saying, don't check for the not null constraint in Java code, just go ahead and hit the database, let the database throw back an exception to the Java code, presumably correct the problem, and then make another roundtrip to the database again. Dude, what are you thinking?!? This to say nothing of the fact that this also could be considered a violation of Fail Fast, and a violation of Defensive Programming - what happens if the dba forgot to add the not null constraint in production?
Dude, listen to this guy. For a "high transaction system" basically you want to treat the database, as much as you can, as just a dumb data holder. A f_cking dumpster, that you just throw sh_t into and pull shit back out, using no vendor-specific features if at all possible.
Again, for we've had this discussion, but even in everyday apps, not just on Wall Street, the database is the bottleneck. And the database is the hardest layer to scale. So given those facts, you should design your app to do as little work in the database as possible. I was laughing at this point because the link above points to one of our consulting architects (I'm not really sure what his role is at this point). Me: i agree in any application that you want to minimize the number of round trips...
shocker...he's one of our architects. he's spot on in many instances, but...
database is the bottleneck because people don't know how to write SQL. I'll certainly concede the wall street applications (for the time being anyway), but the rest of us with what we do on a daily basis...Oracle will scale far beyond the demands they place. When that bottleneck shows up, 9 times out of 10 some dumb-ass c#/java guy thought he could write a better query than i. besides, what's the idiot doing anything but a simple join anyway? probably poor data model to start with...and we're right back where we started (sort of). Mr. M: "database is the bottleneck because people don't know how to write SQL.....some dumb-ass c#/java guy thought he could write a better query than i."
Dude, I'll grant you, people don't know how to write SQL, myself included. But that's not always why the database is the bottleneck. I think it's safe to say that's not even the majority of the reason. Yes, there are apps written by people who were just idiots, and they needlessly pummel the database, but that's definitely not the majority of scenarios. At my work the database is the bottleneck, and we run RAC in production. It's my understanding that even with RAC, there is a limit to how much you can scale that out. But any layer up from the database we are basically unlimited in how much we can scale that out. So it's very easy to stick another Apache box in front, or fire up another Weblogic server. But I can't do that with the database. We have 24 Weblogic servers for redacted. The database is the bottleneck. And we don't have shitty sql code in the app. In fact, we have very few hand-written queries anywhere in the app. Persisting something to a database is really a low-level concern that as an application developer I shouldn't even have to be bothered with, except for the rare corner cases where the persistence abstraction I'm using is just a little bit too generic to handle things effectively. And we don't use these ORMs because we don't know how to write sql. To effectively use an ORM tool requires a deeper understanding of sql and databases than simply being able to write solid SQL code. (Not saying Java devs who use ORMs know SQL better than a dba, just that it requires better sql skills than is required of a Java dev to simply write JDBC/SQL.) Now, before you try to tell me that my ORM library isn't able to write as efficient of sql code as a dba of your caliber, keep in mind that ORM tools are pretty advanced. They're able to intelligently do things like batch sql updates, and let you apply transactional semantics much easier than with raw jdbc/sql. But the overwhelming reason developers have so thoroughly adopted ORM is because Structured Query Language is such a nasty piece of shit for expressing application logic. SQL is a declarative, procedural language. It's totally unsuited for writing application logic! This, more than anything else, is why organizations and dev teams should seek to restrict what is handled within a relational database as much as possible - because the programming interface to it is a fucking ancient backward dying dinosaur. Mr. V (note, not Mr. M): My 2 canadian cents: The polyglot approach "... use different languages for different domain ..." Database was developed to manipulate data and should remain there. General purpose language was developed to encapsulate logic and should remain in that domain. You should not use DB to encapsulate business logic (in my opinion) no more than you would use HTML to create complex logic.
While Java, C#, Python, etc are described as General Purpose languages, they, too, are really domain-constrained. Their expressiveness are confined (and should be) to express low-level, tersed, explicit, verbose, and repetive logic (if that makes any sense). Languages such as these are more suitable for low-level abstraction on top of which, richer, more expressive languages can be built. We are seeing this now with the emergence of languages on the JVM (and arguably on the .Net's CLR).
I think SQL as we know will take a back seat and a smart developer somewhere will create a new domain-specific language that lives on the VM and will push the SQL expression out of the RDBMS and closer to the code that needs it. We are not there yet, but Microsfot is trying (see LINQ and all ORM papers). This is not to say that there won't be isntances where tuning in the SQL-native language won't be necessary. However, 80 to 90% of simple CRUD cases will be handled closer to the logic code that uses the data code.
Again, that's my 2 canadian cents... I could go on. But I have a meeting with redacted. Mr. V, I believe, is a little bit more sensible. Mr. M on the other hand is just trying to rile (sp?) me up. Me: Someone will probably create something like that, but it still gets at the heart of one of my arguments, many developers don't know how to use a database thus will go to any means to circumvent it. Embrace it I say.
Ultimately for me, it comes down to simplicity. I can write an application using PL/SQL that will scale very well for 90% of the solutions. Of course that doesn't include the "fancy" javascript/Ajax stuff needed for the UI. In my ever so humble opinion, this is a much better solution for a business in the long run. 1. You're more likely to change the middle tier than the database tier. Java, asp, Ruby, what's next? 2. Fewer moving parts means lower cost of ownership, even with the CPU costs. One person, one, was able to build and maintain a 350 page website. Be hardpressed to do that with the more expressive languages.
I think all of us are purists and very passionate about what we do. It's probably the main reason we get along. I thoroughly enjoy these conversations because it does force me to think...and that's always a good thing. Mr. V: Haaa chet. You sorta made my point than diverged away. Building an app in all PL/SQL is dangerous. It's no safer than me building an app in all Java. I can build very limited app in all Java. The moment I need to interact with other domain (UI, data, low-level native code, etc), I have to switch to something that is closer to what I am trying to do. If I need to create UI, I will pick a ui-centric environment, when I need to talk to DB, I will pass connect to a db and send in my SQL, and so forth. I will use Java as the glue to bring it all togher.
In the end, we may be saying the same thing, but using different accent. O well. And that's where it ended. I must say it's always fun. Mr. M and Mr. V are both very smart individuals and I highly respect what they do. We have different perspectives...but I think they listen, if only a little, as I listen to them. Their voices creep up on me especially now...which is definitely a good thing. Labels: design, development, funny, humility, ideas, java, plsql
Classic: Application Developers vs. Database Developers
One of my favorite "articles" of all time. I love these types of conversations, in the DB or the Middle Tier? For the vast majority of us, the Database will do just fine. As I've learned more about the data grids and the like, there are trade-offs, which aren't often discussed. One way or another you lose data (say you decide to only UPDATE one time instead of 60). Originally posted on February 20, 2008. Enjoy.It started innocently enough with this article. I sent it out to about 20 colleagues. The best line from the article: "Jerry: "Yeah, databases cause lots of headaches. They crash all the time, corrupt data, etc. Using text files is better." One of my more recently arrived colleagues (I'll call him Mr. M) replied to everyone with this statement: "Kind of funny actually, databases are less and less important at the large investment banks, where they basically load everything up into a data grid across a several hundred node cluster. Writing to the db is way too slow." This started a day long exchange of emails. What follows is the entire thread (up until my last post tonight). Me: "I would just argue that they don’t necessarily know how to write to databases. I would however love to see benchmarking done on both methods. Would be an interesting test..." Mr. M: "Well, my understanding is they just can’t scale out the db enough. Even something like Oracle RAC won’t work. And outside of the military, these are probably the top 1% of programmers in the world building this stuff." Me: "A benchmark would be the only way I would believe it. If you said the top 1% of database developers tried it and failed, I would be more likely to agree. My experience is that application developers != database developers. Different type of thinking involved." Mr. M: "'A benchmark would be the only way I would believe it.' Do you need a benchmark before you would believe in-memory retrieval is faster than disk retrieval? Essentially, this is what we’re talking about. 'If you said the top 1% of database developers tried it and failed, I would be more likely to agree. My experience is that application developers != database developers. Different type of thinking involved.' Why? It’s an issue to do with application performance not simply database performance. Database concerns are a subset of application concerns, essentially a specialization, requiring less encompassing knowledge. ;) From the article you linked to (http://www.watersonline.com/public/showPage.html?page=432587) "Better data management is the answer, says Lewis Foti, manager of high-performance computing and grid at The Royal Bank of Scotland (RBS) global banking and markets. "For very large compute arrays, the key issue is data starvation and saturation. This problem requires data grids with high bandwidth and scalable, parallel access, ... Banks are learning that data management in a distributed grid environment is very different from online transaction processing. "With so many data sources, distribution channels, demands for aggregation and analytics, surges in data volumes and complex dynamics between the flows, we need to manage 'data in motion' and give up the notion that data is somehow stored. It's dynamic, not static," says Michael Di Stefano, vice president and architect for financial services at GemStone Systems ... There is even some debate over how small a unit of work can be put on today's grids. Di Stefano at GemStone, for example, says, "One client has gone from 200 trades per second in a program trading application to more than 6,000 trades per second. This shows what the technology can do." Yep, the writing is on the wall. Oracle knows it too. http://www.google.com/search?hl=en&q=oracle+buys+tangosol&btnG=Google+Search" Me: "Good points. If it is in-memory it would be faster. I have not had the pleasure to work on such a system. I do disagree with the database concerns being a subset of application concerns. The data drives the app. We’re probably getting religious at this point (or am I)." Mr. M: "‘The data drives the app.” Exactly, but who’s to say where the data comes from or in what format? My application data may reside completely in xml files, or maybe I get it from some third party web services a la the en vogue “mashup.” Heck, I may not even need to worry about a database anymore…. http://www.amazon.com/gp/browse.html?node=16427261 The database is only one particular concern of the overall application. And it’s the application that matters. Data is useless if it just sits on a disk somewhere. It’s the ways in which the application lets the users view and manipulate the data that adds value to the business. Yep, definitely a different type of thinking between application developers and database developers." Me: "Definitely religious now. Applications come and go, data stays the same. Think Green Screens, EJBs, Ruby…what’s next?" Mr. M: "'Applications come and go' Exactly. Businesses are not static, nor are the markets they compete in. Changing applications are a function of changing business processes and changing markets. 'data stays the same.' Nonsense. Otherwise UPDATE would not be an SQL reserved word. If you mean database technology stays the same, well, I’m more inclined to agree with that. 'Think Green Screens, EJBs, Ruby...what’s next?' Whatever comes along to let the business more effectively respond to current market realities. Application platforms have evolved much faster than database platforms have. They’ve had to, their sphere of operation is much broader than that of databases, this is only natural, they deal with much broader concerns than do databases. Databases in the internet era function in essentially the same role they did in the era of dumb terminals. Clearly application platforms have evolved orders of magnitude more. Hence the statement, database concerns are a subset of application concerns. Here’s a simple test….if I take some business application and I’m forced to throw away one or the other, either the database or the appl- wait a second, it doesn’t even make sense to finish it, does it? The business can live without the database. I could do all kinds of things with the data, I could stick it anywhere. The business can’t live without the application though. Another way to look at is, what do the business users look at, test, approve, and use? The database? Of course not, they look at the application. They could care less whether the data sits on disk in an RDBMS, xml, or flat files." Me: "We obviously violently disagree. Without the database (and I use database and data interchangebly), the business could no longer function. The app is meaningless. How would you contact your customer? You couldn’t find it. 'Exactly. Businesses are not static, nor are the markets they compete in. Changing applications are a function of changing business processes and changing markets.' Poorly designed applications…that is all." A Feisty Colleague: "Using data and database interchangeably is incorrect. A database is a mechanism for data storage. XML data sets and flat files are mechanisms for data storage, too. So is a file cabinet, because, the data doesn’t have to be electronic, it could be … gasp! … on paper, and the application to use that data would be hands for holding the paper and a pencil to update and add data to the page." Me: "No it isn’t. I take into account xml files, flat files, web services (but not paper, unless it’s scanned) and all that. It would be consumed by the database and then accessed by the application via SQL. (that’s for Mr. M and the feisty one)" At which point someone forwarded the home page for Oracle's TimesTen In-Memory Database. Me: "A database on/in the mid-tier...Perfect!" Mr. M: "Implicit acknowledgment that disk IO operations that come with traditional database access simply can’t match the performance of in-memory data access (a point which you previously were unconvinced of but now seem perfectly accepting of the idea once you see it’s got Oracle’s imprimatur on it). Of course, why any application developer would want to program against an SQL interface if they weren’t forced to is beyond me. It is orthogonal to the programming model of most application platform languages. Surely Oracle recognize this fact too or they wouldn’t be buying Tangosol and other data grid technologies. Of course, most of those products are far more technically advanced than TimesTen or anything Oracle has in that space. Incidentally, it’s illustrative to note that Coherence and other products like it were for the most part designed and built by application programmers. The development of all these products is pretty much driven by the needs of the large investment banks on Wall Street. These trading applications simply had too many concurrent transactions to use an RDBMS (a problem quite a number of public domains now share, most famously google.com, nope, no RDBMS there, yet miraculously there is still data). The database just simply would not scale to such a degree. So the application developers, by necessity, came up with an alternate solution that did work, a fully transactional cache of data replicated across a cluster with node numbers in the thousands, and no relational model whatsoever to speak of. A perfect example of how database concerns are only one, sometimes small, concern amongst many that application developers must be aware of and ready to solve." Me: "Like you said initially, the top 1%. Many of us will never touch a system like this. I will certainly concede that it is faster (still would love to see benchmarking though), but that still leaves 99% of the applications out there that do not require that kind of performance." Me (again): "And don’t forget, I use data and database interchangeably. Applications are nothing without the data right? As to the object/relational impedance mismatch...well, more people that don’t know how to work in sets. Looping is what they understand. I understand the application side more than you seem to give me credit for. I’m not saying applications aren’t important, they are. Data (databases) and applications go hand in hand. If the application went away though, they could still access their data via SELECT statements (yes, via an application client tool), however painful that may be. Applications make retrieving data that much easier for our users. If anyone wants to unsubscribe from this mailing list, just let us know. This is fun for me (I’m guessing Mr. M too)." Needless to say it was a fun day. It didn't get [too] personal. More than anything I'm happy to have an equally passionate colleague. Besides, he claims he was just fracking around with me. ;) Labels: coherence, database, design, oracle
Data Grids for Database Developers
Back at COLLABORATE, I met one Patrick Peralta, member of the Coherence team (reason #109 to go to conferences). I spent about 2 hours with Mr. Peralta that day and had quite a bit of fun. I learned a couple of new tricks ( JRockit for one) and got a slightly better understanding of Coherence. I emphasized "slightly" because I still don't quite know how to use it. Caching data, I get that part. I guess I do understand it from a high-level perspective...it's the where part that I haven't quite grasped yet. One thing I asked Mr. Peralta to do for me was to right it up, blog it. Show me (us) how and when to use it. Through twitter, he's shared with me some links. Specifically this one on how to define a Data Grid. Finally (he did not mention that he likes to procrastinate), he's written something up, An Introduction to Data Grids for Database Developers. Smarter Caching
An obvious (or maybe not so obvious depending on who you ask) first step in scaling a database application is to cache as much as you can. This is fairly easy to do if you have a single app server hitting a database. It becomes more interesting however as you add more app servers to the mix. For instance:
* Is it OK if the caches on your app servers are out of sync? * What happens if one of the app servers wants to update an item in the cache? * How do you minimize the number of database hits to refresh the cache? * What if you don’t have enough memory on the app server to cache everything? It's a very interesting technology. I would highly encourage you to click through and give it a go. Leave comments here or abroad. Labels: coherence, database, design, oracle
UPDATEs in OLTP: A Design Flaw?
This one has been on my mind for the past year or so... Should you do UPDATEs in an OLTP environment? The answer is maybe, or it depends, of course. When I ask this question I ask it in relation to INSERTs. There is a difference between creating a record and updating a record. I think it comes down to a design decision. For instance, I once used a table to track sessions (web sessions, using Application Context in the database to "reconnect"), it was defined something like the following: An Acceptable Time to use UPDATEA session defaulted was created with a 30 minute window. Each page the user would hit would update the END_TIME to SYSDATE + 30 Minutes. If they had no activity for 30 minutes, the END_TIME would be less then the current time (SYSDATE) and they would be logged out. This table did exactly what I needed, UPDATEs worked perfectly here. What if I wanted to track page hits though? Could I do that with the current table? I could possibly add PAGE_ID or something I suppose, but then I would have to write this "complicated" query to find the first START_TIME and then compare that with the last END_TIME. Kind of changes the meaning of the table right? An Unacceptable Time to use UPDATEYou have a table of scheduled transactions or some sort of recurring billing. You use this table by having a job that looks at the DATE_TO_POST, takes those and attempts to post them. If there is a problem, you update the DATE_TO_POST column to the next date/time (based on rules somewhere) and you increment the TIMES_POSTED counter. That doesn't sound so bad right? If you don't have any processing or billing failures, it's not. But if you do, you lose some valuable data, in my opinion. First, get rid of the TIMES_POSTED column. You don't need it. Create a record for every single transaction you have posted. This obviously changes the meaning of the table and will force you to change your code. Here's why it's good though (to remove the UPDATE): You now have a single record for every attempt. Your reporting off of this becomes much easier and is performed with SQL. With the UPDATE, you have to maintain some PL/SQL code ( TIMES_POSTED + 1). With the INSERT, you don't perform a calculation at all. A slightly different example and one that may be more familiar to everyone: This kind of structure is more suited for a data warehouse. The columns have been pivoted to make reporting easier. In an ideal environment, this wouldn't exist on the OLTP server. However we don't all work in ideal environments and this type of structure is often employed. I pretty much hate it (in OLTP, not DW). The volume and complexity of code needed to maintain this type of table is large. There are all sorts of computations that must be performed and then someone, like me, has to come in and maintain it. I always vote for scrapping it in favor of the following type structure(s): Now you have INSERT statements as opposed to one or more UPDATE statements. Reporting tools can then handle the pivoting or whatever else you want to do to the data. UPDATEs in an OLTP are not a flaw, but when about to write one, think of the implications. Is that what you really want to do? Wouldn't you rather just do an INSERT and be done with it? Labels: database, design, rant
Indexing High Transaction Tables
I haven't had many opportunities for the past couple of months to write code, something I sorely miss. Since I changed jobs, I've been doing mostly analysis (aka learning all about the data). I think the last time I wrote any PL/SQL was March...wow. One thing I have been tasked with doing is re-designing how we process payments. Many of the core tables have been bastardized in that entities are mixed up with attributes...or at least I think that's how you describe it. I'll show you an example and maybe someone can tell me what I am trying to explain. Let's take a lookup table like payment types. You'd have Check, ACH and Credit Card. I'm sure there are others, I'm just trying to get the point across. Fairly simple right? Well, what has happened is that the business has expanded and in order to handle it, new payment types were created. Now we have Company 1 Check, Company 1 ACH and Company 1 Credit Card. Those lookup records have now been imbued with a different meaning. What's the term for that? Oh, you were interested in the title? I'm getting there. We store a payments in a "transaction" table with a few different lookup tables (like payment types) hanging off of it. We're not talking about Visa volume or anything, but a modest multiple of 10 thousand transactions a day. We hope to be getting a lot more soon. When considering the design of these types of tables, I have tended towards denormalization. You want to get in and out real quick without the need to lookup (I typically use codes for the lookup keys as opposed to the number, "ACH" or "CREDIT_CARD" for example) or write data to a bunch of different tables. One INSERT and done. Hold on, I'm almost there. As I was analyzing the data, I found that many of the columns I was grouping on or trying to filter on were not indexed. I suppose someone could make the case that leaving indexes off of this type of table would be good. My intuition is that they should be indexed, that the "overhead" of the index is nominal, but of course I have no proof one way or another. I'd like to know what the DBAs out there think. A small aside, I consider myself a "dba" (note the lowercase) in that I have some crossover skills, but they tend to be from the application standpoint. "DBA" are people that keep the database running all the time. The ones that can perform a recovery (I can do backups, but I have yet to perform a recovery). The ones that love to dig into the internals of Oracle to see what's going on. What do the DBAs think? Index a high volume table or leave them off? Either way, please explain yourself so I might learn something. ;) Labels: database, design, indexes
Mentoring
At my previous job, I was very fortunate to be surrounding by some incredibly smart people. Start-ups tend to attract those types on both the business and IT side. Now those of you who work for Oracle or who work for some of the larger companies out there may be surrounded by these types. I've typically worked at small companies and I think that makes it harder. My first boss I believe would fall into the "incredibly smart" category. It should go without saying that the other people are not smart as well, I'm talking the best of the best. People you could truly learn from. There's always a catch though...what if they don't like to teach or mentor? My first boss hired me as a reports developer and allowed me a fair amount of room to grow. I felt like it was too slow so I moved on. Looking back on it, I think I made the right decision but I probably didn't go about it the right way. One of the big reasons I chose to go to Revolution Money was the opportunity to do and learn cutting edge stuff. The goal (of the company) was to match Visa in transaction per second (if I remember correctly that was set around 5 or 6K). They had people who had built high-transaction/high-volume applications all over the place. "All over the place" may be an exaggeration, IT was only 20 or 25 strong at the time. "Relatively speaking," how about that? There was a high percentage of these people. John DaviesPreviously global head of architecture at BNP Paribas and JP Morgan Chase, CTO and co-founder of C24 recently sold to IONA Technologies (Nasdaq: IONA). Author of several Java books published by Wrox, veteran speaker at technical and banking conferences world-wide, expert in high performance/low latency enterprise and global architectures Edward KatzinHe was the CTO for Madison Tyler LLC a proprietary trading company that is making the most of the capital markets shift towards electronic trading platforms in the United States and abroad.
Edward was Vice President – Technology Strategy with Visa before joining Madison Tyler and was responsible for leading the development of Visa’s technology strategy to address ever evolving challenges specific to ensuring the reliability, security, and cost effective operation of Visa’s systems, network, and application infrastructure.
As a consultant for DiamondCulster (now Diamond Management & Technology Consultants Inc.), Edward was a Principal and Senior Technical Architect responsible for designing and deploying large scale information systems solutions that align the deployment of technology solutions with business strategy and maximize the return on investments in technology. Miladin Modrakovic [ blog ] Again, nothing good in his LinkedIn profile so I get to make it all up. I've talked about Miladin before here and here. He's the kind of DBA who you actually believe practices the dark arts. I'm sure I could do what he does given an almost infinite amount of time. He was never afraid to show me his work, i.e. how he arrived at an answer. That's a great trait to have. Bob KernerBob doesn't have a nice summary of his skills on LinkedIn so I'm going to have to make stuff up. As far as street cred goes, he was the Chief Architect at AOL from 1999 to 2005. Despite what you may think of the company, it is still a force out there. He's also done stuff which I probably should not or can't mention else he'll have to kill me. I have a great amount of respect for this guy. Not only does he know what he's talking about he'll show you exactly how he came to the conclusion. He held a few lunch and learns from XML (schemas) to cryptography (how and why). It's not been often in my career that I've come across someone with so much experience that is as willing as he is to teach. My daily interactions were a constant opportunity to learn something new. Cyclomatic Complexity was from him. He also spoke of Information Theory and Marshalling. There's my ode to mentors. I try and do the same for people when the opportunity presents itself. For some reason I tend to focus on those on the business side who show a serious aptitude for data related activities. So if you have someone in your organization who shows a higher than normal interest, take them under your wing. If you are a newbie, throw yourself at their feet and ask them to help you. Labels: design, development, discipline
|