What is a Data Set
A Data Set is a tabular result set that is stored in the Content Store. Data Sets can be created from a Framework Manager Package or a Data Module. They are saved as parquet files, a compressed binary columnar format; more information on parquet files is available here. Because of the compression, the parquet file may be smaller than the raw input data. Although these files are stored in the Content Store, once requested by the query service (DQM) they are stored locally on the server so that query planning can leverage caching opportunities. With Cognos Analytics 11.1, DQM communicates with a new Spark-based service that runs in its own JVM to process these parquet files. A Data Set can be used to create a Report, Dashboard, or Exploration.
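The size advantage of a columnar format like parquet comes from storing each column's values contiguously, so repetitive values compress far better than when they are interleaved row by row. The following sketch illustrates that general principle with Python's standard zlib compressor on toy data; it is an illustration of columnar layout in general, not of Cognos's actual parquet encoding.

```python
import random
import zlib

random.seed(0)

# Toy table: a low-cardinality dimension column and a varying measure column,
# as might appear in a Data Set.
regions = [random.choice(["North", "South", "East", "West"]) for _ in range(10000)]
amounts = [f"{random.randint(0, 999999):06d}" for _ in range(10000)]

# Row-oriented layout: each row's values stored together, columns interleaved.
row_major = "".join(r + a for r, a in zip(regions, amounts)).encode()

# Column-oriented layout (as in parquet): each column stored contiguously.
col_major = ("".join(regions) + "".join(amounts)).encode()

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))

# The repetitive region column compresses to almost nothing when stored
# contiguously, so the columnar layout ends up smaller overall.
print(f"row-oriented: {row_size} bytes, column-oriented: {col_size} bytes")
```

Running this shows the column-oriented layout compressing noticeably smaller than the row-oriented one, which is why a parquet file can be smaller than the raw input it was built from.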
How should you use a Data Set
There are a number of good reasons to leverage a Data Set, and some instances where it may not be an ideal solution. First, since the Data Set is stored in the Content Store, you need to be conscious of the impact on database sizing. To assist with that, there are governor settings that limit the size of each file and the total size of all files per user. When performance is critical, and the database is large or complex enough that queries against it are not as fast as desired, consider a Data Set.

When creating the Data Set, be aware of the number of rows, the number of columns, and specifically which columns are actually used. Although up to 2000 columns can be defined in a Data Set, including columns that are not used is inefficient. A Data Set includes aggregations to improve overall performance, but including unused aggregations may have a negative impact. As a general rule, tens of millions of rows is feasible, whereas hundreds of millions may not provide the performance you need; consider aggregate tables in your database for those large data volumes. When creating the Data Set, include sorting on frequently used columns, as that improves performance given the columnar organization of the data in the parquet file and how it is accessed. Also, as with any data source design, including frequently computed expressions in the data is more efficient than computing them at runtime.

As mentioned previously, the parquet files are retrieved from the Content Store, but at query execution time they are stored on the file system in the ../data location; placing that location on fast disk also improves performance. The parquet files are stored on disk instead of being passed directly to the query service to allow reuse, and files that have not been used for a period of time are automatically purged to conserve disk space. These files are encrypted by default, but that can be changed in the system configuration.
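The recommendation to sort on frequently used columns can be illustrated generically: sorted column data collapses into long runs that compress far better, and in columnar formats like parquet, per-chunk min/max statistics on sorted columns let the reader skip chunks that cannot match a filter. The sketch below demonstrates only the compression effect, using Python's standard zlib on a toy low-cardinality column; it illustrates general columnar behavior, not Cognos's specific implementation.

```python
import random
import zlib

random.seed(1)

# A low-cardinality column as it might appear in a Data Set,
# first in arrival order, then sorted.
values = [random.choice(["Bronze", "Silver", "Gold", "Platinum"]) for _ in range(10000)]

unsorted_blob = "".join(values).encode()
sorted_blob = "".join(sorted(values)).encode()

unsorted_size = len(zlib.compress(unsorted_blob))
sorted_size = len(zlib.compress(sorted_blob))

# Sorting groups identical values into long runs, which compress
# to a fraction of the unsorted layout's size.
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

The same grouping that shrinks the file also benefits access: when the values for a given key are physically adjacent, fewer chunks of the file need to be read to satisfy a query that filters or groups on that column.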
The parquet files are created by Cognos and include additional metadata; therefore, Cognos cannot consume parquet files created externally.
To aid in creating the Data Set, pressing CTRL ALT M (CTRL OPT M on Mac) enables the Report Authoring toolbar.
Logging to assist with diagnosing Data Set queries is set via the JDBC event in the xqe.diagnosticlogging.xml file. Also, in the configuration under Diagnostic logging, you can enable logging of the Flint service, which is the compute service that processes the parquet files. This generates files (flint-console.log and dataset-service.log) in the ../log directory. Additionally, you can enable the Spark console in flint-app.properties.
What’s different between 11.0 and 11.1 Data Sets
In Cognos Analytics 11.0.x, DQM processes the parquet files itself, whereas in Cognos Analytics 11.1 a new compute service offloads that processing. The new service uses Spark, which allows for greater parallelism; benchmarking has shown significant improvements both in the size of the files that can be handled and in performance. With 11.1, the format of the parquet file has also changed. All new Data Sets use the new format, and a Data Set refresh updates existing files to it. There is also a utility, <install>/bin64/parquetUpgrade, that scans the Content Store and updates the files. When the compute service receives a request for a file that is not in the new format, the data is processed as-is and you may not realize the improvements.
Documentation on Data Sets can be found here.