Real Time DSS: Rebuilding dimensions with huge fact tables

Hi,

I have a dimension which is used in two Measure Groups. It has many levels (let's say 10). One fact table handles level 1-8, the other 8-10.

The facts are huge, about 2 billion records for the first table (for level 1-8) and 100 million records for 8-10.

We discussed a fitting partitioning schema and this will not be the big issue.

Our problem is, that in the last levels (9-10) members can move from one parent to another. That means that the dimension structure changes, it has to be updated which also means a reprocession of all partitions related to this dimension.

This is really a problem because of the first fact table (2 billion records). I shouldn't be that big deal for the second table.

My question is: Since the structure of level 1-8 is quite fixed is there any way to get around reprocessing the partitions of the first fact table? Their aggregations will not change! It's only the aggregations of the lower level which are not stored in that fact table!

Any idea?

You have a classic case of slowly changing dimension. I dont remember which number 1, 2, or 3 given to this case. You can search for a term "slowly changing dimensions" you should get a lot of information for this design.

To make a story short: You can avoid re-processing your partitions by doing Update to dimension instead of Full Process. When issuing ProcessUpdate for your dimension, you can still get your members moving from one parent to another, but the data in the partitions will not be lost.

Edward.
--
This posting is provided "AS IS" with no warranties, and confers no rights

|||

Edward,

thanks for your reply. This is a SCD Type 1 scenario since I don't save any history...

Are you sure that an update of a dimension doen't affect the aggregations? That's what BOL says:

Process Update

Forces a re-read of data and an update of dimension attributes. Flexible aggregations and indexes on related partitions will be dropped. For example, this processing option can add new members to a dimension and force a complete re-read of the data to update object attributes. This processing option is supported for dimensions and mining models.

So the question might be what "flexible aggregations" are?! But I guess that all aggregations are flexible...

Regards,

|||

That correct. Updating of dimension will cause Analysis Server dropping flexible aggregations.

But that is way different from what your initial concern of having fully to re-process all the partitions in your cube.
Processing of aggregations will not require Analysis Server to read data again from relational database. Aggregations are built based on the data that already exists in your partitions.

If you think of it. It is logical that Analysis Server drops aggregations. Aggregations are being pre-calculated totals are no longer valid as soon as you move memeber from one parent to another.

Now about difference between flexible and rigid aggregations. It has to do with exactly the topic of this discussion. When defining relationship between attributes as flexible, you're telling the server that members of the child attribute could move from one parent to another and therefore any aggregaton build based on these attribures will be dropped in case of incremental update. You have an option to mark relationship as ridid. In this case if you try to move the member, you will get an error during processing.

Edward.
--
This posting is provided "AS IS" with no warranties, and confers no rights

|||

Edward,

thanks for your reply, things are getting clearer now...

So if you have a processed partition with aggregations defined and you do a "process update" of one or more dimensions, you'll still have the data in the partitions but then without aggregations and indices. I guess you can still query the cube but it will be slow. To speed it up again you have to do a "process index" to recreate aggregations (I still get a litte bit confused about idecies and aggregations, I understand that you always have indeces but you can add aggregations in a partition if you want to).

How big is the performance inpact on having flexible dimensions (if there is any)?

So it's right that aggregations on rigid relationships are not dropped in an update?! So having them i.e. on a time dimension keeps aggregations on time even when you "process update" the dimension?! What about other dimensions? If you "process update" your product dimension are aggregations on the region dimension also dropped? You only wrote that you get an error when you move a member in a rigid relationship. But there has to be an advantage as well ;-)

I'm really missing something like a "incremental" update of a dimension... This would mean that keys have to keep the same, only member names can be change. New members can be added but existing member couldn't move. That would be perfect for lots of cases, because you can add new members to a dimension without doing any processing of aggregations, just because any existing data will not be changed. But you can i.e. process the partition with the "actual" data and then you could take advantage of the new added members...

Thanks,

|||

<E.M> You got this one right.

How big is the performance inpact on having flexible dimensions (if there is any)?

<E.M> There is no performance difference per-se between flexible and rigid aggregations. It is just that flexible once are being dropped during update of the dimension.

So it's right that aggregations on rigid relationships are not dropped in an update?! So having them i.e. on a time dimension keeps aggregations on time even when you "process update" the dimension?!

<E.M> Right here.

What about other dimensions? If you "process update" your product dimension are aggregations on the region dimension also dropped? You only wrote that you get an error when you move a member in a rigid relationship. But there has to be an advantage as well ;-)

<E.M> The aggregations are not per dimension. A single aggregation could will be defined to have aggregates across several dimensions. For instance totals for Product category and Year could be one aggregation. For aggregation to be rigid, it should be based on attributes with Rigid relationships across all dimensions "participating" in it.

<E.M> You are correct here again. Many users find "incremental" update of the dimension very useful. But in many cases you still need to have ability to move memeber from one parent to another, delete a member. Sometimes you have a lot of historical data in many partitions and cannot afford to re-process.

Edward.
--
This posting is provided "AS IS" with no warranties, and confers no rights

Friday, March 23, 2012

Rebuilding dimensions with huge fact tables

No comments:

Post a Comment

Real Time DSS

Blog Archive

About Me