From: Subject: Intelligent Enterprise: Data Warehouse Designer: Asymmetric Aggregation (printable version) Date: Thu, 28 Mar 2002 09:40:36 +0100 MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01C1D63C.9D35B8D0"; type="text/html" X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 This is a multi-part message in MIME format. ------=_NextPart_000_0000_01C1D63C.9D35B8D0 Content-Type: text/html; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable Content-Location: http://www.iemagazine.com/cgi-bin/printable.cgi Intelligent Enterprise: Data Warehouse Designer: = Asymmetric Aggregation (printable version)

March 28, 2002

http://ww= w.iemagazine.com/020328/506warehouse1_1.shtml

Asymmetric Aggregation

What to do if your data mart isn't a shiny dimensional diamond =

By Gabriel Tanase

In last issue's column I tried, without sounding too = pessimistic, to=20 describe several real-world threads most data mart designers are = likely to=20 encounter that make the data mart more complex than the shiny = dimensional=20 "diamond" we all dream about. (See "Real-Life= =20 Data Mart Processing.") These threads include: three distinct = usage=20 styles (ad hoc querying, standard reporting, and using the data = mart as an=20 intermediate step in the delivery of data to a serious downstream=20 application); the existence of complex nonlinear aggregated = measures; and=20 the conflict between source-aggregated and data mart-aggregated = data that=20 should produce the same results but don't.

Now, as promised, I'll draw these threads together and show you = an=20 adaptable architecture you can use to tackle these issues if they = arise in=20 your environment. I'll suggest how to create a specific asymmetric = aggregate level of the data mart. And I'll recommend building a = custom=20 user interface for storing the underlying component factors of = your=20 nonlinear aggregations in recognizable data mart dimensions.

I have argued that:=20

  • In an environment where a data mart is used mostly via = standard=20 reports, its aggregated information output is likely to be = trusted,=20 required, and used as a source for downstream applications.=20
  • Some aggregate-level measure calculations are nonlinear and=20 dependent on business rules.=20
  • Sometimes data sourcing or measure calculations can take = place only=20 at a certain aggregate level: Either there are no source = transactions=20 available to be aggregated or no allocation and aggregation = makes good=20 business sense.

Telltale Requirements

There are a few telltale signs that you'll require an = asymmetric=20 aggregate layer in your data mart, which are relatively easy to = catch=20 during the requirements-gathering phase.=20

  • Some of the future users of the data mart periodically = reenter=20 values from existing reports or OLTP extracts into spreadsheets = or=20 similar tools, obtaining processed figures that they send to = other=20 coworkers or directly to recipients outside the company (the = last being=20 a dangerous situation). The desktop processing rules are = understood by=20 only a handful of people, if at all.=20
  • Users tell you during interviews that they need reports with = a lot=20 of figures that are only slightly aggregated; they say they = themselves=20 will take care of the higher-level aggregation and calculation=20 requirements.=20
  • They tell you that a major requirement for the reporting = tool is to=20 enable easy download into Excel.=20
  • During the reporting tool selection process, key users = wonder=20 whether it would be capable of performing "complicated = calculations."=20
  • An existing application takes extracts from the OLTP system = for=20 further processing, or an application needing summarized data is = scheduled to be developed in the near future.=20
  • Users describe data items they say they will absolutely need = in the=20 data mart, but the description makes it clear that the items are = not at=20 a transaction level.=20
  • Users say that they would need a process of "adjusting" or = "filling=20 in" data in the data mart after aggregation takes place, because = they=20 "will always have missing, estimated, or unavailable data = items." They=20 may also say that they want their reports produced by including = or=20 excluding the "adjusted" figures, for comparison and auditing = purposes.=20

Designing for IT

After establishing that business needs and limitations demand a = customized aggregate layer, your next area of concern is an = appropriate=20 design.

It should be obvious that the variety of possible circumstances = precludes anyone from offering simple recipes. Hence, I will try = to=20 suggest just some starting points by asking and answering two = questions=20 that I trust designers would find asking themselves in such a=20 situation.

The fundamental prerequisite is that the "asymmetric" aggregate = layer=20 must consist of real tables, not views or materialized views. You = cannot=20 create these aggregate measures at run time. You need to put those = special, aggregate-only feeds somewhere.

Question: Should I keep measures aggregated from = transactions=20 and externally sourced aggregate-only measures in separate = tables?

Answer: It depends on whether the values from the = external feed=20 must be added to measures already aggregated from transactions or = do not=20 interfere with each other.

If they must be added, a final aggregate measure M must = be=20 computed as M =3D I + E, where I is a value = aggregated from=20 transactions and E is an externally sourced value at the = same=20 summary-level grain.

If the external values and preexisting aggregates do not = interfere with=20 each other, there exists:=20

  • A first set of measures IM1, IM2, ..., IMp obtained = purely by=20 aggregation from transactions=20
  • A nonoverlapping second set of measures EM1, EM2, ..., = EMq=20 obtained from the external feed.

Again, both the IM1 and EMj measures must be = defined at=20 the same summary-level grain. Here the two sets of measures could = be=20 vertically partitioned in two tables. But, ultimately the two data = flows=20 must come together into only one set of measures. So you should = start with=20 them in separate tables and combine them in the final step.

If the external feed E is manually produced and subject = to=20 repeated amendments until declared ready, it is likely that those = manual=20 changes must be logged for approval and audit purposes. Figure 1 presents a = structure=20 enabling such a process.

For each aggregated table (AGG_Final) where the final measure = values=20 are stored, you need three other tables to store the same measure = set at=20 different steps in the amendment process:=20

  • System-aggregated values (AGG_Sys)=20
  • Manually entered "amendment" values (AGG_Manual)=20
  • Applied "amendment" values (AGG_Applicable).

This structure's design enables the application of constant = amendments=20 to aggregated measures over more than one time period. Supposing = that the=20 target aggregated table is at a week time grain; this structure = would=20 enable the entry of one set of amendment values applicable for = several=20 weeks. This could be necessary when estimates in lieu of missing = data must=20 to be kept constant over several weeks, until real data comes in. = In this=20 case users would not want to reenter the same values each = week.

The system loads data mart aggregates from available = transactions into=20 the AGG_ Sys table. This table has key columns (symbolized as = Key_S in the=20 figure) per its specific combination of dimensions and their = grains.

The summary-level data that is available only via manual entry = =97 that=20 is, the amendments (at the same grain as in AGG_Sys) =97 is = initially stored=20 in the AGG_Manual table via a custom user interface.

The set of key columns for the AGG_Manual table must include, = in=20 addition to the keys in Key_S (except the time key) the following: =

  • Date and time of entry=20
  • Amendment set version number=20
  • User ID (of the user who entered the data set)=20
  • First period for which the amendment will be applied=20
  • Last period for which the amendment will be applied=20
  • "Ready/approved" flag.

Once an amendment set is flagged as "Ready/approved," a process = will=20 clone it into the AGG_Applicable table in as many copies as time = periods=20 to which the set is to be applied, differentiated only by the time = key.

The AGG_Applicable table has exactly the same columns in Key_A = as=20 AGG_Sys and AGG_Final do in Key_S and Key_F, respectively.

At the end of each relevant time period, a subprocess of the=20 extraction, transformation, and loading batch would simply add up = the=20 manually entered measures held in AGG_Applicable to the = system-aggregated=20 measures held in AGG_Sys, storing the results into AGG_Final. If = there=20 were no data in AGG_Applicable for a certain time period, the = values in=20 AGG_Final would be identical with the system-aggregated values = from=20 AGG_Sys.

Question: If some aggregations are nonlinear complex=20 calculations, how should their parameters be stored? Should they = be fully=20 and exclusively included in the dimensional database design or = kept=20 completely separate in a fully normalized area of the data = mart?

Answer: No surprises here, the answer again being, "It=20 depends."

First, you generalize Figure=20 1 by allowing the results of the nonlinear aggregation = process to=20 enter your data mart design at the same place as the manual = aggregations=20 discussed in the previous section. But there must be additional = underlying=20 tables (not shown in the figure) designed to store parameters that = drive=20 the nonlinear calculations. Again, you probably should populate = these=20 tables via a custom user interface. These driver tables will have = to=20 support versioning, including timestamps and an audit of changes. = If there=20 are relationships that have to be enforced among calculation = parameter=20 entities, then a third-normal-form, history-aware model may be the = best=20 choice.

However, it is likely that some of these parameters that drive = the=20 nonlinear aggregations, especially non-numeric ones (such as flags = and=20 codes), will also happen to be attributes of some dimensions. Even = if this=20 were not the case, it would still be wise to collect them all into = a=20 "Calculation Parameters" dimension so that users can analyze the = results=20 while having access to underlying parameters in the familiar = dimensional=20 framework.

I have tried in these two articles to describe some typical = issues that=20 arise in complex data mart environments. The asymmetric aggregate=20 architecture illustrated in Figure=20 1 is flexible. You should be able to support all three = usage=20 styles listed at the beginning of this column, and at the same = time allow=20 linear aggregations and nonlinear aggregations to coexist in a way = that=20 the end user can understand.


Gabriel Tanase [gabriel@gabrieltanase.com] = is=20 a system designer based in Ireland. He has worked on several = business=20 intelligence projects for a leading European insurance=20 provider.

Return = to=20 Article

------=_NextPart_000_0000_01C1D63C.9D35B8D0 Content-Type: image/gif Content-Transfer-Encoding: base64 Content-Location: http://www.iemagazine.com/gifs2/2001_logo_2.gif R0lGODlhPAFPALMAAAEBAf///88AKnJycpycnMfHx+d+k9gwUuybq99UcPXK03wBGS4tLd3d3fK5 xPfm6SH5BAAAAAAALAAAAAA8AU8AAAT/MMhJq7046827/2AojmRpnmiqrmzrvnAsz3Rt33iu73zv /8CgcEgsGo/IpHLJbDqf0Kh0Sq1ar9isdjsxCL4CB5fjMJjPZvEEgUY/LF6wekxnxcHhOuaOFxgo B31gChV8X3N6iSMPgl+KhY1+gJGEE4yNj5kgCJSaAYZgfxOBjZUSoF9vnqt7nTqceTaokqOup7as GZeiXLCCqji+pjSzvAGkgsO+fcC5F5e0Y40Hr3jDM8WTpRWNCc66eMZb0F/UOcsC1zLZtdsU5ALm 3xbw4r1mCDoKyTfsEsh91AVgYyDfvAuC7B1cAe+LQBj+juFayAHgF4UUTzR02C+SMYuD/zJ2ABlN JIqNHGV51MbPZAaSJV0GeEDQgIJmGFCmIqGgJgKcFyLCfCiiJz6iHx4oQIP0w9KjIyPFzBIxDtBZ Uy1J3eroAoIEUr1lEIoLJsYACsB2s1BuA9YDBjF8EUshbaQERGFyjSWlqpw1e/lSCCzVArqwrRp9 LLvygtqtoqDJs/CYK6LBbSU80Atm8oTKhENO8QsmH+itcf+FxsSS8NXGqt3FTkgv9AEHeDxL2Lea Lua2OgV5Pr26aRLSnVena6ccT+vQOMnKlqi4QvC9ug8H1t28jzzixakg794Va2jmzSFVR2/teagK nLFX4N3ddwDyePLRx28cyXjyYphH2P8tUiHQE0yTSdfSbH0YI6B8v01jIFaI4OfcfgCKB9snXB1g AHgXzaTAiDfBZCCJKIqBIR72gWeKggG5F+IEW+Fll1STgZgadYJEeJcBnEngAIrg2YjiiJdB8R8z WkWiG48xZgDek3d1seFQMkaDlW8gyqNTkiseQiNiVkaSZERaLAlGdIVZgGVObRoWSZm0sScag+GM OSc3TgI2jVxrSUCmj3mqVycXak4VZlfwTUSBdt2pAWN7ds6o044D9Yknfnr26OaGBB66RaL2LCrA BW86ZmGhk94JZaGQnorQn4KuCkanfQQFKofrIbohmjPFmWVets7YakqbGissoZldF97/fXsa2iAc u2JB6qysNTqdtraKcuxyldIyy5OvZmZqc5UsG+q00hbqa6+8ioqrc59um2Vz3l7J2HrjYkCSOecq N++t1MK7rruj/lqtuuUiW69wbUQccSXfDmPWwfSyRaupNUks8cCMUgAssNYqbHCt2Q4rZbUdVJxl vtHyKRzIAgDlAcMjs2zFtRqnHG6SdHoqgsvhisKxnLRCe/LNMQeNMMYzvisqyQxfDGckmLas73RW K+2z1959xnAHOO/arwsPFNADzzILLfLYKDeS9Sj2xctusuDizYuOb2sarFQ2C5k3zbqejJujmpFA AACMB46NyfLGnSvSSWsmSqxf7Hhj/0xE691kY+ZNxlkClyGADE5lL73VNZwA3QEDjAMwwA5sMxuy 7WAkYAYp4RKWWucNlwQiYZM5yxWfCwAwebvvPTyNGY/1dwHsjBNAO+R3457YgJqtOgfwXUvenGeY B0ZB8owv33fkaCknvQUNMM4AD7UT7u95n3e3fvapBi+O8TiiDH58gz7lZWx/T3MaV96XgwIwgAEN 2F72qNa0th3vHfEJlAKjFjyL7QqAlfOTcnZUQPVtMCt4s5cQHvhACSKMgl/DoPkKRpgzbW1B/ntG BpNWALVpZocLiMcFSnhAKwUxexUowPA6AwIffqAAEeRADy3AgAHIzoXNg1pW4CZCxf84znQ1mtu3 EBG+CoBxGk4iAPViNzu0gIWIBWQjBR4YuzoCYAEVUKMdj0iLATxwdlaUHyli94U4ys+JEmCh9QIJ AOsFQI0trAAjD0mPSVZvAn70oxNo8qE0OM46ZfiQ7mzCBE6OMh8bMWQd54fJ2B3AjnaMIixXOYH4 zZJxFaBeFe04v8UxzgCqlOME2LjGSzKSAracJSJ9OUtWBmCKMmnCYeI4gB4y0pmWlF0BCDBJVloz dg3oITQLUMdqbrOOc7wlANRGTvnJjwDfFGYA1AkACRxzmLGDZwGs6MwATNKc14wmEYwDmgL2cyb5 tGc5LVBHJ97zHXWMIj6vmMg6OnL/AI5sZ0IhGjsf2tF6DWCADx+qUYlaQKMHfUBHBRoERtStLoJA pwUCycpJXkCjbfRn7Ga60jzutKKMyykFNErRJMaOlRa9AEl7egGZSlJ+HKDePK0IyHrOEwDzyyc5 q3jVg7ZSpAGIZFAdiNV9YhWeQb2p7NbpS502cp6we6o20xrIZ571qvCEHVivGMjZWTWsZTVUAoCi HXki86eMdCRDoapQXFbgqBgAJ1ABYNIJEDUDa5SAU+X612RS9gLJFKplf6oByvo1rY2bp2rDys5q viGrGBhANTVbz/g1wJFWlS07RepVuzIufv7kZjWz2oDKyjasIVVtPUXqQB86cJia/4mrbuv5VwbA s2cs2l0kCohICqyUkd2dgC5b6djRXnJ6582sWtMa24RCVqmkBaxhG/vZpqZWA1uFK1fvyl6wbtV6 7OXp/NoZQSviVqEQbKdihwrOeso2kGrDqmgDidYrOhit/nWmVfWo07hWF6vW6U4wb6m2h1KRsW79 qwSYSc/+otioAT6pHN/L0/JOFsQUyKY6w5tL5aYNo1XMqiZZ+8ycVnF2xZVkG/1YxQYMV7P2bKMD rftMkzowpOQMqXDXuVolS8CBx6VuOJ35XCiH1Y/CVa54K9s95YxYmSnGLIpNnOIWU3S86+UyBnB6 1aJy9sQNFa+dmdpevz6Tm9abn/+iC7xbvip3wY+8aHFnB7sDP7LS+0zuWy0LyNuCmAB5DWtcKQDq Lxt6cbPb5zCpvNpeApnKdNxnpVUVGkaGU5y4Fid95YzjOOdYq7nOdRTxLGPG8fjLM47xr21s3p6u 8dbB1jUObvuCbdZhiZJg5CfJq+Jczjm+vtVzVF/M4PNegJkAVja3ee1YOrNUmgmwSALiwmcOuFvQ vaazStWNAWLD2M9/7rNo152BxD5yo+/mwmYLDu505rvh6h13r//dbe/+dL4Ez8B7F55wLDBy4DWu OL6dyczuMrO3gEY5UUFucIG318Zsfu/HO65wGteS1fcGqjNDC2issvm4Ot+zHUXTe/KJgtzXB1/m RvdN7meCleZQ8KyEZSvTnMu3n+WkuiOLeWQ/SvbqQndnUGW7Rqzz29cnJ/tmpd51jkN9CQ8oJiwv 2vCRHxaW2KSnkZuO7N/qeOIuh2955X7UyoaUnpB+exOcXMzZchro55YtpB9wzQE0g/Jch7RwE993 LpuVjeF98LH3CfkAMP6onDd12R2v+NavQKPHdr3s6QD72dteEbW/ve7HkPvd+x4Lvf+98KcQ/OEb 3wnFP77yk5D85Tv/+dCPvvSnT/3qW//62M++9lkaAQA7 ------=_NextPart_000_0000_01C1D63C.9D35B8D0--