Pentaho

 View Only

 Mondrian 4: Performance and high cardinality dimensions

  • Data Management and Analytics
  • Data Management and Analytics
  • Pentaho
  • Pentaho
Werther Dryden's profile image
Werther Dryden posted 03-20-2021 15:20

Hello,

 

I have a Customer Dimension with about 900'000 entries and a fact table with about 1.5 million entries. Everything implemented in a star schema with a simple Mondrian 4 Schema structure. In the MDX Query below I am (actually it is automatically generated by Saiku) trying to display all the customer numbers with their ordered quantity for a specific date (10.12.2018).

 

 

 

WITHSET [~OrderDate_Month_Year] AS Exists({[OrderDate].[Month].[Jahr].Members}, [~OrderDate_Month_Month])SET [~OrderDate_Month_Month] AS {[OrderDate].[Month].[2018].[Q4].[Dezember]}SET [~OrderDate_Month_Day] AS Exists({[OrderDate].[Month].[2018].[Q4].[Dezember].[10.12.2018]}, [~OrderDate_Month_Month])SET [~ROWS_OrderDate_Month] AS Hierarchize({[~OrderDate_Month_Jahr], [~OrderDate_Month_Month], [~OrderDate_Month_Day]})SET [~ROWS_Customer_OriginalCustomer] AS {[Customer].[OriginalCustomer].[CustomerNumber].Members}SELECTNON EMPTY {[Measures].[quantity ordered]} ON COLUMNS,NON EMPTY NonEmptyCrossJoin([~ROWS_OrderDate_Month], [~ROWS_Customer_OriginalCustomer]) ON ROWSFROM [Orderpositions]

 

 

 

The result set has about 3000 rows. The sql-queries which Mondrian sends perform very fast (each a matter of seconds), but after Mondrian is done retrieving the data from the database it takes a very long time until the result is shown in the UI . Mostly the timeout which I´ve configured (5 Minutes) kicks in before that and mondrian stops all operations. In this case, If I re-run the query then no more database queries are performed (I guess because everything needed is already in the cache) but it still takes another 3 minutes until I see the result and the application starts eating a lot of memory.

I thought that when one uses a NonEmptyCrossJoin and has the following mondrian properties set:

mondrian.native.nonempty.enable=true

mondrian.native.crossjoin.enable=true

Mondrian creates a crossjoin in the background only for the elements which are needed in the result (?)

But I have the feeling that a crossjoin is made which includes all customers (900'000). When I delete all the customers from the customer dimension table which are not referred to by the fact table (in this case only 60'000 entries are left in the customer dimension) Mondrian performs very fast to get desired result set (with 3000 entries) and I have none of the above mentioned problems.

So the customer dimension table´s size is definitely the problem here.

What do I have to set or configure so that Mondrian makes a clean and fast non empty cross join when using high cardinality dimensions? Am I missing something?

 

(Postgresql 10 + Saiku + Mondrian 4)

 

 

 

 

 

 

 

 

 

 


#DataManagementandAnalytics
#Pentaho
Carlos Lopez's profile image
Carlos Lopez

Just a warning Mondrian 4 is not fully implemented or production ready. It would be interesting to see how this MDX query behaves using Mondrian as shipped by the CE edition.

Is that something you could check?

Werther Dryden's profile image
Werther Dryden

Thanks @Carlos Lopez​  for the feedback :) We will try it next week with Mondrian 3 and I will let you know.

Carlos Lopez's profile image
Carlos Lopez

Sounds good; make sure to attach the Mondrian SQL and MDX logs

Werther Dryden's profile image
Werther Dryden

mondrian.properties

Werther Dryden's profile image
Werther Dryden

mondrian3_schema.xml

Werther Dryden's profile image
Werther Dryden

mondrian.log

Werther Dryden's profile image
Werther Dryden

postgres.log

Werther Dryden's profile image
Werther Dryden

Hello Carlos,

 

 

I built a small Mondrian 3 Cube containing only the necessary Dimensions and Measures: mondrian3_schema.xml (see above)

Here my mondrian.propertis (see above) file which I´ve used for this test. I know it´s quite messy because I´ve tried a lot of things to solve my performance problem J

The Query which Saiku builds here is a bit different from the Query above from Mondrian 4, but the effect is the same and we experienced the same behaviour.

I didn´t just paste the mdx-query in the mdx editor but gradually selected my way to the desired result. Meaning, first I only displayed the ordered quantity per year (see line 3 in mondrian.log), selected only the year 2018, added the month to the query (line 412), filtered the month for april, added the day (line 999)…etc…(see saiku_ui.png below)

  

 

But the final query starts at 2011.

There seem to actually be two reasons why this query takes so long:

1.      There are many SQL-Queries which are sent. Several queries at several levels are being peformed (I guess for caching) which prolong the execution time. See In postgreql.log (114-131) or mondrian.log (2107-2411). The final query would normally take 3-4 seconds. With these additional queries the whole sql-part takes 25 seconds*. Is there a way to turn off these additional queries in the properties? 

2.      After the SQL-queries are done (see mondrian.log around 2411) it still takes another good 75 seconds until the result is shown in saiku (you see in the picture below (saiku_ui.png) that the total execution time was 99 sec). During this time the memory usage increases a lot (a couple of Gigs). Do you know what is happening here or what I can do that this phase doesn´t take this long?

 

 Cheers,

Werther

 

 

 

* In another environment (different Cube), additionally to these higher level queries, many queries are sent because segments of the result set are loaded. Is there are way of changing the properties (they use a different mondrian.properties than the one I´ve attached here) that only one query is sent instead of several similar queries which return a subset of the main query.

Werther Dryden's profile image
Werther Dryden

saiku_ui.png

Werther Dryden's profile image
Werther Dryden

@Carlos Lopez​ : Hi 😃 , just wanted to ask if there is any update on this topic?