Fast manipulation of large tables
Jul 13, 2017
Mendix applications can accumulate large sets of data, originating from various sources and created for various reasons. Working efficiently with large datasets requires extra attention to performance. This post provides recommendations for speed and performance when modelling manipulations of large datasets.
Use batches (limits and offsets) #
Batch processing means executing an action on a set, or “batch”, of inputs rather than on a single input. In a microflow that manipulates large amounts of data (whether changing, committing or deleting), the input for the manipulation is not fetched in one single retrieve action; instead, the processing is split into multiple retrieves, each with a limit and an offset. Batch processing is an efficient way to manipulate large tables. Creating a processing microflow with batches is well explained in the Mendix documentation, and it is the first and most important step towards fast manipulation of large datasets.
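The limit/offset pattern behind such a microflow can be sketched as a plain loop. This is a minimal sketch, not Mendix API code: `retrieveBatch` is a hypothetical stand-in for a Mendix retrieve activity with a limit and an offset, operating here on an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchProcessor {
    static final int BATCH_SIZE = 1000;

    // Stand-in for a Mendix retrieve activity with limit and offset.
    static List<Integer> retrieveBatch(List<Integer> source, int offset, int limit) {
        int end = Math.min(offset + limit, source.size());
        if (offset >= end) return new ArrayList<>();
        return new ArrayList<>(source.subList(offset, end));
    }

    // Processes the whole table one batch at a time; returns the number of records handled.
    static int processAll(List<Integer> table) {
        int offset = 0;
        int processed = 0;
        while (true) {
            List<Integer> batch = retrieveBatch(table, offset, BATCH_SIZE);
            if (batch.isEmpty()) break;   // empty retrieve: all records are done
            for (Integer record : batch) {
                processed++;              // the change/commit/delete would happen here
            }
            offset += batch.size();       // advance the offset for the next retrieve
        }
        return processed;
    }
}
```

Because each iteration only holds one batch of objects, memory use stays bounded by the batch size rather than by the table size.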
The batch process in Mendix manipulates the data within each batch and then sends its queries to the database. The objects used within a batch (between the merge and the split) can be cleaned up by the garbage collector after each batch. This prevents the Mendix application from running into a java.lang.OutOfMemoryError: GC overhead limit exceeded error. However, the transaction in the database will not end until the microflow has processed all the records: although the performance of the Mendix runtime improves, the load at the database level remains high.
Use separate transactions #
The performance of batch-processing microflows can be further improved by adding an “End transaction” to the microflow. This ends (commits) the current transaction in the database and prevents an overload at the database level. End transaction is a Java action in Community Commons. Alternatively, you can use the Java action RunMicroflowAsyncInQueue or other Java actions that create their own database transaction. An end transaction is not necessary after every batch; every 100,000 objects, for example, is enough (find a sweet spot depending on the number of app engines available).
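The “commit every N objects, not every batch” idea can be sketched as a simple counter. This is an assumption-laden sketch: `endTransaction` is a placeholder for the Community Commons “End transaction” Java action, and the 100,000 threshold is the example interval from this post, not a fixed rule.

```java
public class TransactionBatcher {
    static final int COMMIT_INTERVAL = 100_000;

    // Placeholder for the Community Commons "End transaction" Java action,
    // which would commit the current database transaction.
    static void endTransaction() { }

    // Walks through the workload batch by batch and returns how many
    // explicit end-transactions were issued.
    static int run(int totalRecords, int batchSize) {
        int commits = 0;
        int sinceCommit = 0;
        for (int done = 0; done < totalRecords; done += batchSize) {
            sinceCommit += Math.min(batchSize, totalRecords - done);
            if (sinceCommit >= COMMIT_INTERVAL) {  // not after every batch
                endTransaction();
                commits++;
                sinceCommit = 0;
            }
        }
        return commits;                            // the final commit happens when the microflow ends
    }
}
```

With 1,000-record batches and a 100,000-object interval, a 350,000-record table triggers three explicit commits; the remaining 50,000 records are committed when the microflow itself finishes.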
In order to understand the implications of separate transactions, you need a basic understanding of the different levels of isolation that handle transaction integrity. The Mendix platform uses transactions, which means that every microflow, commit, and delete happens in a (database) transaction. The transaction is initialised as soon as the microflow executes its first write to the database (mx); retrieve activities will never start a transaction.
In database systems, the isolation property defines the level of isolation, i.e. how and when the changes made by one operation become visible to others (more…). The Mendix platform uses the isolation level Read Committed, which means, as the name suggests, that only committed objects are visible outside the transaction (mx).
Ending a database transaction before the microflow ends implies that:
- Data within that transaction will be available outside the context of the processing microflow
- A rollback (in case of an error or a rollback activity) can only undo changes that have not yet been committed
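Both implications can be illustrated with a toy model of Read Committed semantics. This is purely illustrative code (not how the Mendix runtime or a real database is implemented): committed rows are visible to everyone, and a rollback can only discard the changes of the still-open transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Read Committed: only committed rows are visible outside
// the transaction, and a rollback cannot undo rows already committed.
public class TxModel {
    final List<String> committed = new ArrayList<>(); // visible to other sessions
    final List<String> pending = new ArrayList<>();   // changes in the open transaction

    void write(String row) { pending.add(row); }

    // The mid-microflow "End transaction" point: pending changes become permanent.
    void endTransaction() {
        committed.addAll(pending);
        pending.clear();
    }

    // A rollback only discards uncommitted changes.
    void rollback() { pending.clear(); }

    List<String> visibleOutside() { return new ArrayList<>(committed); }
}
```

Writing row A, ending the transaction, then writing row B and rolling back leaves A visible and permanent while B is gone, which is exactly the behaviour the two bullet points above describe.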
A note on Mendix 7 #
Mendix 7 has an issue with the garbage collector: objects created in microflows that use end transaction are garbage-collected only after the microflow ends (the problem exists at least up to the current version, 7.13.1). The workaround is to use the old ExecuteMicroflowInBatches Java action from Community Commons. This issue should be fixed in version 7.15 or 7.16 (check the release notes).