Data Brewery

analytical data streams & online analytical processing Python frameworks

    by Stefan Urbanek
  • Page Redesign, Leaving Tumblr

    The Data Brewery home page is being redesigned. I would like to thank Andrej Sykora, who did a great job with the new look and the migration of the old blog posts.

    Why?

    The main reason for the redesign was to provide more content for each project. Another was to design the site so that future projects can be easily added – with one subdomain for each project.

    Important: Blog Moving

    The Data Brewery blog is moving away from Tumblr. New blog posts will be generated as static pages using Pelican. The base URL will stay the same: blog.databrewery.org.

    The old blog URLs are being redirected to the new URLs. There are still a few blog posts that need to be migrated, but we hope to have these finished soon.

    If you are following the blog with a feed reader, here is a link to the new feed.

    If you can see this sentence, then you are reading this post on Tumblr. The blog at databrewery.tumblr.com is not going to be updated; this is the last message. Here is the link to the new blog.

  • Tags: announcement by Stefan Urbanek
  • Cubes 0.10.2 Released – Even More Hierarchies, Formatters and Docs

    After a few months and some gloomy winter nights, here is a humble update of Cubes, the lightweight analytical framework. No major feature additions nor changes this time, except for some usability tweaks and fixes.

    The documentation was updated to contain relational database patterns for the SQL backend. See the schemas, models and illustrations in the official documentation.

    There are also improvements in cross-referencing various documentation parts through see-also links, to keep relevant information at hand.

    Thanks and credits for support and patches go to:

    • Jose Juan Montes (@jjmontesl)
    • Andrew Zeneski
    • Reinier Reisy Quevedo Batista (@rquevedo)

    Summary

    • many improvements in handling multiple hierarchies
    • more support for multiple hierarchies in the slicer server, either as a parameter or with the dimension@hierarchy syntax (see the sketch after this list):
      • dimension values: GET /dimension/date?hierarchy=dqmy
      • cut: get first quarter of 2012 ?cut=date@dqmy:2012,1
      • drill-down on hierarchy with week on implicit (next) level: ?drilldown=date@ywd
      • drill-down on hierarchy with week with explicitly specified week level: ?drilldown=date@ywd:week
    • order and order attribute can now be specified for a Level
    • optional safe column aliases (see docs for more info) for databases that have non-standard requirements for column labels even when quoted
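
    For illustration, here is a minimal Python sketch (not part of the release) of querying a locally running slicer server with the dimension@hierarchy syntax above; the base URL, dimension name and hierarchy names are taken from the examples in this list and are assumptions about your setup:

    import json
    import urllib.request

    BASE = "http://localhost:5000"  # assumed local slicer server address

    def get(path):
        with urllib.request.urlopen(BASE + path) as response:
            return json.load(response)

    # dimension values on an explicit hierarchy
    values = get("/dimension/date?hierarchy=dqmy")

    # first quarter of 2012, cut on the "dqmy" hierarchy
    quarter = get("/aggregate?cut=date@dqmy:2012,1")

    # drill down to the "week" level of the "ywd" hierarchy
    weekly = get("/aggregate?drilldown=date@ywd:week")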

    New Features

    • added order to Level object - can be asc, desc or None for unspecified order (will be ignored)
    • added order_attribute to Level object - specifies the attribute to be used for ordering according to order. If not specified, the first attribute is going to be used (see the model sketch after this list).
    • added hierarchy argument to AggregationResult.table_rows()
    • str(cube) returns cube name, useful in functions that can accept both cube name and cube object
    • added cross table formatter and its HTML variant
    • GET /dimension accepts hierarchy parameter
    • added create_workspace_from_config() to simplify workspace creation directly from slicer.ini file (this method might be slightly changed in the future)
    • to_dict() method of model objects now has a flag create_label which provides label attribute derived from the object’s name, if label is missing
    • #95: Allow charset to be specified in Content-Type header
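
    As an illustration (not taken from the release itself), a level definition using the new keys might look like this in a model description; the level and attribute names are made up:

    {
        "name": "month",
        "attributes": ["month", "month_name"],
        "order": "asc",
        "order_attribute": "month_name"
    }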

    SQL:

    • added option to SQL workspace/browser safe_labels to use safe column labels for databases that do not support characters like . in column names even when quoted (advanced feature, does not work with denormalization)
    • browser accepts include_cell_count and include_summary arguments to optionally disable/enable inclusion of respective results in the aggregation result object
    • added implicit ordering by levels to aggregate and dimension values methods (for list of facts it is not yet decided how this should work)
    • #97: partially implemented sort_key, available in aggregate() and values() methods

    Server:

    • added comma separator for order= parameter
    • reflected multiple search backend support in slicer server

    Other:

    • added vim syntax highlighting goodie

    Changes

    • AggregationResult.cross_table is deprecated, use the cross table formatter instead
    • load_model() loads and applies translations
    • slicer server uses new localization methods (removed localization code from slicer)
    • workspace context provides proper list of locales and new key ‘translations’
    • added base class Workspace which backends should subclass; backends should use workspace.localized_model(locale)
    • create_model() accepts list of translations

    Fixes

    • browser.set_locale() now correctly changes browser’s locale
    • #97: Dimension values call cartesians when cutting by a different dimension
    • #99: Dimension “template” does not copy hierarchies

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes olap release by Stefan Urbanek
  • Cubes 0.10.1 Released – Multiple Hierarchies

    Quick Summary:

    • multiple hierarchies (see the sketch after this list):
      • Python: cut = PointCut("date", [2010,15], hierarchy='ywd') (docs)
      • Server: GET /aggregate?cut=date@ywd:2010,15 (see docs - look for aggregate documentation)
      • Server drilldown: GET /aggregate?drilldown=date@ywd:week
    • added result formatters (experimental! API might change)
    • added pre-aggregations (experimental!)
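
    Here is a minimal Python sketch of slicing on an explicit hierarchy; the PointCut call is the one from the summary above, while the model file, database URL, cube name and the browser-creation call are assumptions for illustration:

    from cubes import load_model, create_workspace, PointCut, Cell

    model = load_model("model.json")                     # assumed model file
    workspace = create_workspace("sql.star", model, url="sqlite:///data.sqlite")
    cube = model.cube("sales")                           # assumed cube name
    browser = workspace.browser(cube)                    # browser-creation call assumed

    # year 2010, week 15 on the year-week-day ("ywd") hierarchy
    cut = PointCut("date", [2010, 15], hierarchy="ywd")
    cell = Cell(cube, cuts=[cut])

    result = browser.aggregate(cell, drilldown=["date"])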

    New Features

    • added support for multiple hierarchies
    • added dimension_schema option to the star browser – use this when you have all dimensions grouped in a separate schema from the fact table
    • added HierarchyError - used for example when drilling down deeper than possible within that hierarchy
    • added result formatters: simple_html_table, simple_data_table, text_table
    • added create_formatter(formatter_type, options …)
    • AggregationResult.levels is a new dictionary containing levels that the result was drilled down to. Keys are dimension names, values are levels.
    • AggregationResult.table_rows() output has a new variable is_base to denote whether the row is a base row or not with regard to the table_rows dimension.
    • added create_server(config_path) to simplify wsgi script

    • added aggregates: avg, stddev and variance (works only in databases that support those aggregations, such as PostgreSQL)

    • added preliminary implementation of pre-aggregation to the SQL workspace:

      • create_conformed_rollup()
      • create_conformed_rollups()
      • create_cube_aggregate()

    Server:

    • multiple drilldowns can be specified in single argument: drilldown=date,product
    • there can be multiple cut arguments that will be appended into a single cell
    • added requests: GET /cubes and GET /cube/NAME/dimensions

    Changes

    • Important: Changed the string representation of a set cut: it now uses a semicolon ‘;’ as the separator instead of a plus symbol ‘+’ (see the example after this list)
    • aggregation browser subclasses should now fill result’s levels variable with coalesced_drilldown() output for requested drill-down levels.
    • Moved coalesce_drilldown() from star browser to cubes.browser module to be reusable by other browsers. Method might be renamed in the future.
    • if there is only one level (default) in a dimension, it will have same label as the owning dimension
    • hierarchy definition errors now raise ModelError instead of generic exception
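
    For example (the values are illustrative), a set cut for the years 2010 and 2011 in a server request would now read:

        ?cut=date:2010;2011

    instead of the former ?cut=date:2010+2011.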

    Fixes

    • order of joins is preserved
    • fixed ordering bug
    • fixed bug in generating conditions from range cuts
    • AggregationResult.table_rows now works when there is no point cut
    • get correct reference in table_rows – now works when simple denormalized table is used
    • raise model exception when a table is missing due to missing join
    • search in slicer updated for latest changes
    • fixed bug that prevented using cells with attributes in aliased joined tables

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes olap release by Stefan Urbanek
  • Cubes 0.10 Released

    After a while, here is an update to Cubes – a lightweight Python OLAP framework for multidimensional modeling. Included are some changes that were mentioned in the EuroPython talk, such as table_rows and cross_table.

    I recommend looking at the updated examples in the Github repository. The Flask example is now a “real” example instead of a “sandbox” example, and you can see how to generate a simple table for dimension hierarchy browsing.

    There is also a more complex example with a star-like schema dataset in the cubes-examples github repository. Follow the instructions in the README files to get it running.

    There are some backward incompatible changes in this release – until 1.0 the “point” releases will contain changes of this kind, as the framework is still evolving. You can find more information below.

    Quick Summary

    • The way a model is constructed has changed. The designated methods are create_model() or load_model()
    • A dimension definition can have a “template”. For example:
        {
          "name": "contract_date",
          "template": "date"
        }
    
    • added table_rows() and cross_table() to the aggregation result for more convenient table creation (see the sketch after this summary). The table_rows method takes care of providing the appropriate dimension key and label for the browsed level.
    • added simple_model(cube_name, dimension_names, measures)

    Incompatibilities: use create_model() instead of Model(**dict); if you were using just load_model(), you are fine.
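
    A minimal sketch of the new table helpers, assuming a browser and a cell already exist for a cube with a “date” dimension (the names are illustrative):

    result = browser.aggregate(cell, drilldown=["date"])

    # each row carries the key and label of the drilled-down "date" level
    for row in result.table_rows("date"):
        print(row.key, row.label)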

    New Features

    • To address issue #8, create_model(dict) was added as a replacement for Model(**dict). Model() from now on will expect correctly constructed model objects. create_model() will be able to handle various simplifications and defaults during the construction process.
    • added info attribute to all model objects. It can be used to store custom, application or front-end specific information
    • preliminary implementation of cross_table() (interface might be changed)
    • AggregationResult.table_rows() - new method that iterates through drill-down rows and returns a tuple with key, label, path, and rest of the fields.
    • dimension in model description can specify another template dimension – all properties from the template will be inherited in the new dimension. All dimension properties specified in the new dimension completely override the template specification
    • added point_cut_for_dimension
    • added simple_model(cube_name, dimensions, measures) – creates a single-cube model with flat dimensions from a list of dimension names and measures from a list of measure names. For example:
    model = simple_model("contracts", ["year","contractor", "type"], ["amount"])
    

    Slicer Server:

    • /cell – return cell details (replaces /details)

    Changes

    • creation of a model from a dictionary through Model(dict) is deprecated, use create_model(dict) instead. All initialization code will be moved there. Deprecation warnings were added. Old functionality is retained for the time being. (important)
    • Replaced Attribute.full_name() with Attribute.ref()
    • Removed Dimension.attribute_reference() as same can be achieved with dim(attr).ref()
    • AggregationResult.drilldown renamed to AggregationResult.cells (important)

    Planned Changes:

    • str(Attribute) will return ref() instead of attribute name as it is more useful

    Fixes

    • order of dimensions is now preserved in the Model

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes release olap by Stefan Urbanek
  • Data Types: From Storage to Analysis

    What is the data type of 10? It depends on who you are and what you are going to do with it. I would expect my software friends to say that it is an “integer”. Why might this information not be sufficient or not relevant? How do analysts see the data?

    Storage Data Type

    If we say “data type”, engineers would name types they know from typed programming languages: small integer, double precision float, character. This data type comes from how the data are stored in memory. The type specifies what operations can be done with the data stored at that particular place and how much memory is taken. To add two integers on an Intel processor there is an instruction called ADD, to add two floats there is a different instruction called FADD (Dear kids: this used to be on a separate chip in PCs!). To add an integer to a float, a conversion has to be done. Database people would say decimal, date or string. As with memory data types, each type has its allowed set of operations and the size it takes in the database. Both are one kind of data type: storage data types.

    Storage data type, as the name suggests, is used by software (compiler, database system) to know how much memory it takes to store the value of that type and to select appropriate operations (or operation variants).

    Concrete vs. Generic

    The number of storage data types and their differentiation is exhausting. To name a few:

    • C language has more than 25 concrete numeric types and differentiates by floatness, size and sign flag
    • PostgreSQL has 9 numeric types, differentiates by size and floatness
    • NumPy differentiates not only by size and sign, but also by byte order

    Do I need all that information about the data type when working with data? In most cases I don’t; it is information for the machine, not for me as a data analyst/scientist. There are cases when knowing about storage data types might be handy, like optimisation (for memory consumption, for example) or error prevention (of some kind) by type checking in typed languages.

    For simplification, some tools use generic data types and hide the concrete storage type: integer, float (or real), string, … No storage size, no byte order. Those are low level details.

    For reading the data, no input from the user is required, as short int is an integer and double is a real. The problem with generic data types is that there might be multiple options for how to store a generic integer.

    Analytical Data Types

    When doing data analysis I think about variable values and what I can do with them. In data analysis, adding two integers or two floats is the same: it is just a + b. There is only one kind of addition: + (remember ADD and FADD?). However, there might be numbers for which adding them together has no meaning, like adding two invoice numbers or two years.

    To specify how the values should be treated during data analysis, there is another kind of data type: the analytical data type, also called the variable type. They are:

    Set (or Nominal Variable)
    Values represent categories, like colors or contract types. Fields of this type might be numbers which represent, for example, group numbers, but have no mathematical interpretation. For example, addition of the years 1989 and 2012 has no meaning.
    Ordered Set (or Ordinal Variable)
    Similar to the set field type, but the values can be arranged in a meaningful order.
    Flag (or Binary)
    A special case of the set type where values can be one of two kinds, such as 1 or 0, ‘yes’ or ‘no’, ‘true’ or ‘false’.
    Discrete
    Set of integers - values can be ordered and one can perform arithmetic operations on them, such as: 1 apple + 2 apples = 3 apples.
    Range
    Numerical value, such as financial amount, temperature

    The analytical data types are distinct from storage data types. Take, for example, just an integer: it can come from a set without any arithmetic operations (an ID, a year), it can be a discrete number (a count of something), or a flag with binary values of 40 and 50. An integer as a set can be ordered (a set of product sizes) or unordered (IDs or category numbers, where the categories are rather ordered by their names).

    In addition to the mentioned data types, it is sometimes useful to specify that the tool or algorithm should just ignore a field/column/variable. For that purpose the typeless analytical data type might be used.

    Here is an example of storage and analytical data types:

    The idea behind analytical data types is described, for example, in a nice introductory data mining book [1] and also in [2]. [1] differentiates measures as interval-scaled variables and ratio-scaled variables. It is interesting that [2] describes the “set”, which they call a “categorical variable”, as a “generalization of the binary in that it can take on more than two states”, not the other way around.

    [1] Max Bramer: Principles of Data Mining, Springer-Verlag London Limited 2007, p. 12.

    [2] Jiawei Han and Micheline Kamber: Data Mining – Concepts and Techniques, Elsevier 2006, p. 392.

    Keep the metadata with you

    As data are passed through algorithms and blocks of processing code, data types (along with other relevant metadata) should be passed with them. Data types can in some cases be guessed from the data stream or explicitly expressed by a user, and sometimes they can be reflected (as in a database). It is good to keep them, even if sometimes it is not possible to maintain accuracy or compatibility of data types between data sources and targets.

    If done right, even after a couple of transformations, one can say to a function/algorithm that accepts analytical metadata: “get averages of this dataset” and it will understand it as “get averages of amounts in this dataset”.

    Basic metadata that should be considered when creating data processing or data analysing interfaces are:

    • number of fields
    • field names (as an analyst I would rather refer to fields by name than by index, as field positions might sometimes differ among source chunks)
    • field order (for tabular data it is implicit, for document based databases it should be specified)
    • storage data types (at least generic, concrete if available or possible)
    • analytical datatype

    The minimal metadata structure for a dataset, relevant both to analysts who use the data and to engineers who prepare it, would therefore be a list of tuples: (name, storage type, analytical type).
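
    In Python, such a structure could be as simple as the following sketch; the field names and storage types are made up, and the analytical types are the ones listed above:

    fields = [
        # (name, storage type, analytical type)
        ("id",           "integer", "typeless"),
        ("year",         "integer", "set"),
        ("category",     "string",  "set"),
        ("t_shirt_size", "string",  "ordered set"),
        ("is_paid",      "integer", "flag"),
        ("quantity",     "integer", "discrete"),
        ("amount",       "float",   "range"),
    ]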

    Conclusion

    Typeless programming languages allow programmers to focus on structuring the data and remove the necessity of fiddling with the physical storage implementation. Hiding concrete storage types from data analysts allows them to focus on the properties of their data relevant to the analysis. Less burden on the mind definitely helps our thinking process.

    Nevertheless, there are more kinds…

    Links

    Data Brewery documentation of metadata structures.

  • Tags: brewery data metadata by Stefan Urbanek
  • Using Pandas as Brewery Backend

    UPDATE: Added info about caching.

    The first time I looked at Pandas (a Python data analysis framework) I thought: that would be a great backend/computation engine for Data Brewery.

    To recap the core principle of Brewery: it is a flow-based data streaming framework with processing nodes connected by pipes. A typical node can have one or multiple inputs and has an output. Source nodes have no inputs, target nodes have no outputs.

    The current Brewery implementation uses one thread per node (it was written at a time when Python was new to me and I did not know about the GIL and such). It can be considered just a prototype…

    I have had this idea in mind for quite some time; however, coming from the database world, the only potential implementation I saw was through database tables, with nodes performing SQL operations on them. I was not happy with the requirement of an SQL database server for data processing, not to mention the speed and function set (well, OK, Pandas is missing the non-numeric stuff).

    Here is a draft of the idea of how to implement data transfer between nodes in Brewery using tables. The requirements are:

    • follow data modeller’s workflow
    • do not rewrite data – I want to be able to see what the result was at each step
    • have some kind of provenance (where does this field come from?)

    See larger image on imgur.

    A table represents a pipe: each pipe field is mapped to a table column. If a node performs only field operations, then the table can be shared between nodes. If a node affects rows, then a new table should be considered. Every “pipe” can be cached, and the stream can be re-run from the cached point if the computation takes longer than desired during the model development process.

    Pandas offers a structure called DataFrame, which holds data in a tabular form consisting of a series of Series (fancier array objects). Each of the series represents a collection of a field’s values for an analytical/computational step. Nodes that share the same field structure and the same records can share the series, which can be grouped in a table/DataFrame.

    Node can:

    • create completely new field structure (source node, aggregation, join, …)
    • add a field (various derive/compute nodes)
    • remove a field (field filter, field replacement)

    Just adding or removing a field does not affect the series, therefore nodes can just point to the series they “need”. Aggregation or join nodes generate not only a new field structure, they affect the number and representation of records as well, therefore the field series differ from their respective source series (compare: “year” in invoices and “year” in a summary of invoices). For those kinds of nodes a new table/DataFrame should be created.

    Sampling or selection nodes can generate an additional Series with boolean values based on the selection. Each node can have a hidden input column representing the selection.
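
    Here is a rough Pandas sketch of the idea above (illustrative only – the node logic is reduced to plain DataFrame operations; the column names and values are made up):

    import pandas as pd

    # a "pipe": one DataFrame, one column (Series) per field
    invoices = pd.DataFrame({
        "year":   [2010, 2011, 2011, 2012],
        "amount": [100.0, 250.0, 80.0, 40.0],
    })

    # derive node: adds a field; the existing series are reused, not copied
    invoices["amount_with_vat"] = invoices["amount"] * 1.2

    # selection node: a boolean Series acting as the hidden selection column
    selection = invoices["year"] >= 2011

    # a downstream node sees only the selected records
    selected = invoices[selection]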

    There are a couple of things I am missing so far: a DataFrame that would be a “view” of another DataFrame – that is, a DataFrame that does not copy the series, only references them. Another feature is more custom metadata for a table column (a DataFrame series), including the “analytical datatype” (I will write about this later, as it is not crucial in this case). They might be there, I just have not discovered them yet.

    I am not an expert in Pandas, I have just started exploring the framework. It looks very promising for this kind of problem.

  • Tags: brewery idea rfc by Stefan Urbanek
  • Cubes and Slicer are going to EuroPython 2012

    Cubes is going to EuroPython 2012.

    EDIT: Added "Need help?".

    There are going to be two sessions. First there will be a talk introducing light-weight OLAP with Cubes and Slicer, on Friday at 9:45 in room Tagliatelle (add to calendar). Afterwards there will be a longer, more in-depth and hands-on training about Slicing and Dicing with Cubes, on Friday at 14:30 in room Pizza Napoli (add to calendar).

    In the talk I will introduce the framework and explain the reasons for its existence. Then I will dig into the architecture and features, and briefly show examples of how to use it for slicing and dicing. Newbies are welcome.

    The training will go into more details and the participants will learn:

    • how to prepare data for aggregated browsing - star and snowflake schemas
    • how to create a logical model, define cubes, dimensions and hierarchies
    • how to browse aggregated data and how to slice and dice cubes from within Python
    • how to create a WSGI OLAP server (“in 15 minutes” style) for aggregated data browsing and how to use it in your web application for providing (browsable) data to end-user reports
    • how to provide localized reporting

    If time permits, we can look at the anatomy of the framework and see how to implement a backend for another kind of data store.

    I will be focusing on the existing SQL (relational OLAP) backend.

    Customized examples

    You might use the training session (and not only the session) to solve your problem - just bring your own sample data, if you like.

    Do you have any data that you would like to slice and dice? Do you have a database schema and do not know how to create a logical model? You can send me a data sample or a schema, so I can prepare examples based on the problem you are solving.

    Please, do not send any confidential data or schemas under NDA.

     

    Need help?

    If you have any questions or would like help with your data – from data preparation, through data modeling, to slicing and dicing – you can grab me during the whole event. If you cannot find me, just tweet me: @Stiivi.

    Participation

    Anyone interested in participating in the project is welcome. There are some features that are either outside the scope of my skills – where I would like to cooperate with someone more experienced – or for which I do not have the resources.

    I am also very open to new feature suggestions and feature change requests. Just a little note: Cubes is meant to be small and simple, at least for now. There are plenty of complex and feature-rich solutions out there. If we can add new, more complex features as non-critical, optional plug-ins, that would be great.

    Links and Calendar Events

    You can add the talks to your calendar by following the links:

  • Tags: cubes olap talk announcement by Stefan Urbanek
  • School of Data: The Pipeline

    "We are missing data literacy" was mentioned by Tim Berners-Lee at the Open Government Data Camp 2010 in London. Here we are in 2012 in Berlin, together with OKFN, P2PU and their friends, preparing the content of the School of Data from the very beginning.

    Based on our lively discussions led by Rufus Pollock, and reviews by Peter Murray-Rust and Friedrich Lindenberg, I’ve created the pipeline-based skill map that I will talk about here.

    The Pipeline

    Data have many forms: from ore-like solid (think of the web and documents) and crystallized solid (think of databases), through flowing liquid (think of data being processed), to evaporating gas (think of data nobody pays attention to). The best way of looking at data is to look at them in all their stages as they go through a connected, but dismantle-able, processing pipeline:

    The flow is divided into the following parts:

    • discovery and acquisition – covers data source understanding, ways of getting data from the web and knowing when we have gathered enough
    • extraction – when data have to be scraped from unstructured documents into structured tables, or loaded from a tabular file into a database
    • cleansing, transformation and integration – the majority of data processing skills, from understanding data formats, through knowing how to merge multiple sources, to process optimization
    • analytical modeling – changing data to be viewed from an analytical point of view, finding various patterns
    • presentation, analysis and publishing – mostly non-technical or only slightly technical skills for storytellers, investigators, publishers and decision makers

    There are two more side-pipes:

    • governance – making sure that everything goes well, that the process is understandable and that the content meets expectations
    • tools and technologies – from SQL to Python (or other) frameworks

    Here is the full map:

    Download the PNG image or PDF.

    Modules or skills

    The pipeline skills, or rather modules, are based mostly on experience from projects in the open-data domain, with inspiration from best practices in the corporate environment.

    We tried to cover most of the necessary knowledge and concepts so that potential data users would be able to dive into their problem and get some reasonable (sometimes even partial) result at any stage of the pipe. Some of the corporate best practices are too mature at this moment to be included; some of them were tuned with different semantics or different grouping. This was done intentionally.

    Most of the modules will be based on hands-on problem solving. They will provide a source dataset (or a bunch of unknown sources, for the purpose of discovery), a sandbox or playground environment, and a few questions to be answered. The learner will try to solve the problem using guiding lecture notes. In an ideal module, the dataset would come from an existing open-data project, so the learner would be able to see the big picture as well.

    Next Steps

    An outline in the form of a pipeline is nice and fine … as a guideline. Content has to follow, and content will follow. If you would like to be involved, visit the School of Data website. Follow @SchoolofData on Twitter.

    Questions? Comments? Ideas?

    Links:

  • by Stefan Urbanek
  • Cubes 0.9.1: Ranges, denormalization and query cell

    The new minor release of Cubes – a light-weight Python OLAP framework – brings range cuts, denormalization with the slicer tool and cells in the /report query, together with fixes and important changes.

    See the second part of this post for the full list.

    Range Cuts

    Range cuts were implemented in the SQL Star Browser. They are used as follows:

    Python:

    cut = RangeCut("date", [2010], [2012,5,10])
    cut_hi = RangeCut("date", None, [2012,5,10])
    cut_low = RangeCut("date", [2010], None)
    

    To specify a range in slicer server where keys are sortable:

        http://localhost:5000/aggregate?cut=date:2004-2005
        http://localhost:5000/aggregate?cut=date:2004,2-2005,5,1
    

    Open ranges:

        http://localhost:5000/aggregate?cut=date:2010-
        http://localhost:5000/aggregate?cut=date:2004,1,1-
        http://localhost:5000/aggregate?cut=date:-2005,5,10
        http://localhost:5000/aggregate?cut=date:-2012,5
    

    Denormalization with slicer Tool

    Now it is possible to denormalize your data with the slicer tool. You do not have to denormalize using a Python script. Data are denormalized the way the denormalized browser expects them to be. You can tune the process using command line switches if you do not like the defaults.

    Denormalize all cubes in the model:

    $ slicer denormalize slicer.ini
    

    Denormalize only one cube:

    $ slicer denormalize -c contracts slicer.ini
    

    Create materialized denormalized view with indexes:

    $ slicer denormalize --materialize --index slicer.ini
    

    Example slicer.ini:

    [workspace]
    denormalized_view_prefix = mft_
    denormalized_view_schema = denorm_views
    
    # This switch is used by the browser:
    use_denormalization = yes
    

    For more information see the Cubes slicer tool documentation.

    Cells in Report

    Use cell to specify all cuts (type can be range, point or set):

    {
        "cell": [
            {
                "dimension": "date",
                "type": "range",
                "from": [2010,9],
                "to": [2011,9]
            }
        ],
        "queries": {
            "report": {
                "query": "aggregate",
                "drilldown": {"date":"year"}
            }
        }
    }
    

    For more information see the slicer server documentation.

    New Features

    • cut_from_string(): added parsing of range and set cuts from a string; introduced a requirement on key format: keys should now consist only of alphanumeric characters or underscores if they are going to be converted to strings (for example when using the slicer HTTP server)
    • cut_from_dict(): create a cut (of appropriate class) from a dictionary description
    • Dimension.attribute(name): get attribute instance from name
    • added exceptions: CubesError, ModelInconsistencyError, NoSuchDimensionError, NoSuchAttributeError, ArgumentError, MappingError, WorkspaceError and BrowserError

    StarBrowser:

    • implemented RangeCut conditions

    Slicer Server:

    • /report JSON now accepts cell with full cell description as dictionary, overrides URL parameters

    Slicer tool:

    • denormalize option for (bulk) denormalization of cubes (see the slicer documentation for more information)

    Changes

    • important: all /report JSON requests should now have their queries wrapped in the key queries. This was the originally intended way of use, but it was not correctly implemented. A descriptive error message is returned from the server if the key queries is not present. Despite being rather a bug-fix, it is listed here as it requires your attention and a possible change to your code.
    • warn when no backend is specified during slicer context creation

    Fixes

    • Better handling of missing optional packages; also fixes #57 (now works without sqlalchemy and without werkzeug, as expected)
    • see change above about /report and queries
    • push more errors as JSON responses to the requestor, instead of just failing with an exception

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: announcement release cubes olap by Stefan Urbanek
  • Cubes 0.9 Released

    The new version of Cubes – a light-weight Python OLAP framework – brings the new StarBrowser, which we discussed in previous blog posts:

    The new SQL backend is written from scratch; it is much cleaner, transparent, configurable and open for future extensions. It also allows direct browsing of a star/snowflake schema without denormalization, therefore you can use Cubes on top of a read-only database. See DenormalizedMapper and SnowflakeMapper for more information.

    Just to name a few new features: multiple aggregated computations (min, max,…), cell details, optional/configurable denormalization.

    Important Changes

    Summary of most important changes that might affect your code:

    Slicer: Change all your slicer.ini configuration files to have a [workspace] section instead of the old [db] or [backend]. A deprecation warning is issued; it will still work if not changed.
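
    For example, a minimal section might look like this (the url option is only an illustration – keep whatever backend options you had in your old section):

    [workspace]
    # formerly [db] or [backend]
    url = sqlite:///data.sqlite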

    Model: Change dimensions in the model to be an array instead of a dictionary. The same applies to cubes. Old style: "dimensions" = { "date" = ... }, new style: "dimensions" = [ { "name": "date", ... } ]. It will work if not changed, just be prepared.

    Python: Use Dimension.hierarchy() instead of Dimension.default_hierarchy.

    New Features

    • slicer_context() - new method that holds all relevant information from the configuration. It can be reused when creating tools that work in a connected database environment
    • added Hierarchy.all_attributes() and .key_attributes()
    • Cell.rollup_dim() - rolls up a single dimension to a specified level. This might later replace the Cell.rollup() method
    • Cell.drilldown() - drills down the cell
    • create_workspace(backend,model, **options) - new top-level method for creating a workspace by specifying backend name. Easier to create browsers (from possible browser pool) programmatically. The backend name might be full module name path or relative to the cubes.backends, for example sql.star for new or sql.browser for old SQL browser.
    • get_backend() - get backend by name

    • AggregationBrowser.cell_details(): New method returning values of attributes representing the cell. Preliminary implementation, return value might change.

    • AggregationBrowser.cut_details(): New method returning values of attributes representing a single cut. Preliminary implementation, return value might change.

    • Dimension.validate() now checks whether there are duplicate attributes

    • Cube.validate() now checks whether there are duplicate measures or details
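
    A minimal sketch of creating a workspace by backend name, as mentioned in the list above; the model file name and database URL are assumptions:

    from cubes import load_model, create_workspace

    model = load_model("model.json")          # assumed model file
    workspace = create_workspace("sql.star", model, url="sqlite:///data.sqlite")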

    SQL backend:

    • new StarBrowser implemented:
      • StarBrowser supports snowflakes or denormalization (optional)
      • for snowflake browsing no write permission is required (does not have to be denormalized)
    • new DenormalizedMapper for mapping logical model to denormalized view
    • new SnowflakeMapper for mapping logical model to a snowflake schema
    • ddl_for_model() - get schema DDL as string for model
    • join finder and attribute mapper are now just Mapper - class responsible for finding appropriate joins and doing logical-to-physical mappings
    • coalesce_attribute() - new method for coalescing multiple ways of describing a physical attribute (just attribute or table+schema+attribute)
    • dimension argument was removed from all methods working with attributes (the dimension is now required attribute property)
    • added create_denormalized_view() with options: materialize, create_index, keys_only

    Slicer tool/server:

    • slicer ddl - generate schema DDL from model
    • slicer test - test configuration and model against database and report list of issues, if any
    • Backend options are now in [workspace]; the configurability of a custom backend section was removed. Warnings are issued when the old section names [db] and [backend] are used
    • server responds to /details which is a result of AggregationBrowser.cell_details()

    Examples:

    • added simple Flask based web example - dimension aggregation browser

    Changes

    • in Model: dimension and cube dictionary specification during model initialization is deprecated, a list should be used (with the attribute “name” explicitly mentioned) – important
    • important: Now all attribute references in the model (dimension attributes, measures, …) are required to be instances of Attribute() and the attribute knows its dimension
    • removed hierarchy argument from Dimension.all_attributes() and .key_attributes()
    • renamed builder to denormalizer
    • Dimension.default_hierarchy is now deprecated in favor of Dimension.hierarchy(), which now accepts no arguments or the argument None - returning the default hierarchy in both cases
    • metadata are now reused for each browser within one workspace - speed improvement.

    Fixes

    • Slicer version is now the same as the Cubes version: the original intention was to have a separate server, therefore it had its own versioning. Now there is no reason for a separate version; moreover, it can introduce confusion.
    • Proper use of database schema in the Mapper

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes announcement release