Migrating a table from SQL Server to SingleStore rowstore: large size increase

Hello. I’m new to SingleStore and have been running experiments/a PoC on the Managed SingleStore S00 tier.

The table we’re experimenting with (SingleStore definition below), when populated with approximately 100k records on SQL Server, uses about 89 MB for row/data storage and 97 MB for all indexes, so overall the table uses ~186 MB of disk storage.

When I create the equivalent table in SingleStore rowstore per the definition below and use a pipeline to populate it with the same 100k records, the SingleStore table is consistently 2.4x - 3.2x larger in SingleStore memory.

Here’s a summary of our tests. Against the SQL Server scenario of data + primary key only = 89 MB:

  • SingleStore SPARSE table + primary key index only = 293 MB (3.2x larger)

  • SPARSE table + primary key + converting all empty strings to NULL = 242 MB (2.7x larger)

Then, testing what will probably be the production-equivalent scenario, where the SQL Server side is data + primary key + secondary indexes = 186 MB:

  • SPARSE + primary key + secondary indexes + converting all empty strings to NULL = 463 MB (2.4x larger)

If this is “by design” that’s fine, but we’re trying to understand capacity planning, and our rough estimate is that any SQL Server tables we move into rowstore will be anywhere from 2.4x to 3.2x larger.

We realize this could be created as a columnstore, but this table is updated several times a minute, and we often select many columns for data exports.

CREATE ROWSTORE TABLE dataBooking (
`BookingID` INT NOT NULL,
`BookingSeqNumber` SMALLINT NULL DEFAULT 1 COMMENT 'Invoice Booking order number (Will be used for key if there is a duplicate Document ID)',
`BookingTypeID` SMALLINT NULL COMMENT 'Category  (Air, Car, Hotel, Rail) From Trams',
`BookingSubTypeID` INT NULL COMMENT 'Travel Type ID from TRAMS',
`NetTransactionCount` SMALLINT NULL,
`DocumentTypeCode` VARCHAR(1) NULL COMMENT 'Arc, CommTrack, Supplier, Voucher',
`TransactionTypeCode` VARCHAR(2) NULL COMMENT 'S=Sale R=Refund X=Exchange M=Debit/Credit Memo T=TAAD V=Void O=Other',
`InvoiceNumber` INT NULL,
`AccountID` INT NULL COMMENT 'Link to Company File',
`VendorID` INT NULL COMMENT 'Vendor Table Link',
`BranchID` INT NULL,
`InsideAgentID` INT NULL COMMENT 'Inside Agent ID',
`OutsideAgentID` INT NULL COMMENT 'Outside Agent ID',
`TicketingAgentID` INT NULL COMMENT 'ID for Ticketing Agent',
`BookingAgentID` INT NULL COMMENT 'ID for Booking Agent',
`ReasonCode` VARCHAR(50) NULL COMMENT 'Reason Code',
`VendorCode` VARCHAR(3) NULL COMMENT '2 Letter Vendor Code (val al, car agency, hotel code)',
`OriginCityCode` VARCHAR(10) NULL COMMENT 'Origin City, Hotel, Car',
`DestinationCityCode` VARCHAR(10) NULL COMMENT 'First Destination City ',
`FOPCode` VARCHAR(2) NULL COMMENT 'Form of Payment CA, CH, CC',
`IntDomCode` VARCHAR(1) NULL COMMENT 'International, Domestic, Trans-border, Trans-pacific, Trans-atlantic',
`IssuedDate` DATETIME NULL COMMENT 'Ticketing Date',
`BookingDate` DATETIME NULL COMMENT 'First Date of Contact, PNR Creation Date',
`StartDate` DATETIME NULL COMMENT 'Departure, Pickup, In Date',
`EndDate` DATETIME NULL COMMENT 'Return Date, Drop Off, Out Date',
`NumberOfUnits` SMALLINT NULL COMMENT 'Number of Rooms, Cars',
`Duration` INT NULL COMMENT 'Number of Days, Nites, trip duration',
`Commission` DECIMAL(19,4) NULL,
`CommissionRate` DECIMAL(19,4) NULL,
`AgentInsideCommission` DECIMAL(19,4) NULL,
`AgentInsideCommissionRate` DECIMAL(3,2) NULL,
`AgentOutsideCommission` DECIMAL(19,4) NULL,
`AgentOutsideCommissionRate` DECIMAL(3,2) NULL,
`CoachFare` DECIMAL(19,4) NULL,
`LowFare` DECIMAL(19,4) NULL,
`ContractFare` DECIMAL(19,4) NULL,
`NegotiatedFare` DECIMAL(19,4) NULL,
`FaceValue` DECIMAL(19,4) NULL,
`BaseFare` DECIMAL(19,4) NULL,
`TotalTax` DECIMAL(19,4) NULL,
`TotalPaid` DECIMAL(19,4) NULL,
`BookingRate` DECIMAL(19,4) NULL,
`PNRLocator` VARCHAR(20) NULL,
`Sort1` VARCHAR(1024) NULL,
`Sort2` VARCHAR(1024) NULL,
`Sort3` VARCHAR(1024) NULL,
`Sort4` VARCHAR(1024) NULL,
`Sort5` VARCHAR(1024) NULL,
`TravelerName` VARCHAR(50) NULL,
`BookingClass` VARCHAR(25) NULL COMMENT 'Class of service, Room Type, Car Type, Etc',
`DocumentNumber` VARCHAR(255) NULL COMMENT 'Ticket Number, Confirmation, Voucher',
`OriginalDocumentNumber` VARCHAR(255) NULL,
`PaymentNumber` VARCHAR(200) NULL COMMENT 'Check,CC Number',
`Routing` VARCHAR(255) NULL COMMENT 'Routing',
`TourCode` VARCHAR(25) NULL,
`TicketDesignator` VARCHAR(25) NULL,
`BookingRemarks` VARCHAR(1024) NULL COMMENT 'Name Field Remarks',
`VendorRemarks` VARCHAR(1024) NULL,
`PaymentRemarks` VARCHAR(1024) NULL,
`VendorPaymentStatus` VARCHAR(1) NULL COMMENT 'O=Open, C=Closed, N=Not Applicable, V=Void',
`ClientPaymentStatus` VARCHAR(1) NULL COMMENT 'O=Open, C=Closed, N=Not Applicable, V=Void',
`EticketIndicator` TINYINT(1) NULL,
`ReIssued` TINYINT(1) NULL COMMENT 'Need to determine if this ticket number is referenced in another booking Orig Ticket Field',
`OriginalOriginalDocumentNumber` VARCHAR(50) NULL,
`Department` VARCHAR(50) NULL,
`BookingRateType` VARCHAR(10) NULL,
`PenaltyAmount` DECIMAL(19,4) NULL,
`InvoiceRemarks` VARCHAR(1024) NULL,
`GSTAmt` DECIMAL(19,4) NULL,
`AltVendorName` VARCHAR(80) NULL,
`AltVendorPhone` VARCHAR(25) NULL,
`AltVendorAddress` VARCHAR(80) NULL,
`AltVendorAddress2` VARCHAR(80) NULL,
`AltVendorCityStateZipCountry` VARCHAR(255) NULL,
`TAX1AMT` DECIMAL(19,4) NULL,
`TAX2AMT` DECIMAL(19,4) NULL,
`TAX3AMT` DECIMAL(19,4) NULL,
`TAX4AMT` DECIMAL(19,4) NULL,
`QSTAMT` DECIMAL(19,4) NULL,
`CurrencyCode` VARCHAR(3) NULL,
`DataSourceID` INT NULL,
`TIME_STAMP` DATETIME NULL,
`InvoiceGroup` VARCHAR(50) NULL,
`GSANumber` VARCHAR(50) NULL,
`PurchaseOrder` VARCHAR(50) NULL,
`UnUsedEticket` TINYINT(1) NULL,
`UnUsedEticketUsedDate` DATETIME NULL,
`UnUsedEticketExpirationDate` DATETIME NULL,
`UnUsedEticketGDS` VARCHAR(15) NULL,
`ARCNumber` VARCHAR(25) NULL,
`Sort6` VARCHAR(1024) NULL,
`Sort7` VARCHAR(1024) NULL,
`Sort8` VARCHAR(1024) NULL,
`Sort9` VARCHAR(1024) NULL,
`Sort10` VARCHAR(1024) NULL,
`CalcCarrierSeq` VARCHAR(100) NULL,
`CalcFareBasisSeq` VARCHAR(255) NULL,
`CalcClassSeq` VARCHAR(100) NULL,
`CalcTripMiles` INT NULL,
`CalcClassType` CHAR(1) NULL,
`InvoiceNumberText` VARCHAR(20) NULL,
`BookingSource` VARCHAR(100) NULL,
`BackofficeBookingID` VARCHAR(20) NULL,
`Division` VARCHAR(50) NULL,
`TravelArranger` VARCHAR(100) NULL,
`VendorPaymentDate` DATETIME NULL,
`ClientPaymentDate` DATETIME NULL,
`PaymentExpDate` VARCHAR(10) NULL,
`PenaltyCommission` DECIMAL(19,4) NULL,
`Commissionable` TINYINT(1) NULL,
`CorpRateCode` VARCHAR(20) NULL,
`VendorLoyaltyCode` VARCHAR(20) NULL,
`GovRateCode` VARCHAR(20) NULL,
`BookingStatus` VARCHAR(10) NULL,
`ExtraDayChargeAmount` DECIMAL(19,4) NULL,
`ExtraHourChargeAmount` DECIMAL(19,4) NULL,
`VoidedDate` DATETIME NULL,
`GDS` VARCHAR(20) NULL,
`PCC` VARCHAR(20) NULL,
`CostItemAmount` DECIMAL(19,4) NULL,
`TripDescription` VARCHAR(250) NULL,
`TripId` VARCHAR(25) NULL,
`SavingsComment` VARCHAR(50) NULL,
`ReasonCode2` VARCHAR(10) NULL,
`PaymentNumberEnd` VARCHAR(4) NULL,
`FlightPass` TINYINT(1) NULL,
`RoutingCode` VARCHAR(2) NULL,
`AccountingPeriod` DATETIME NULL,
`BookingIsLocked` TINYINT(1) NULL,
`TravelerEmail` VARCHAR(500) NULL,
`CalcServiceFees` DECIMAL(19,4) NULL,
`ContractLevel` INT NULL,
`TripTypeCode` VARCHAR(2) NULL,
`SourceFileName` VARCHAR(100) NULL,
`FILEGRP` VARCHAR(100) NULL,
`POSCountryCD` CHAR(2) NULL,
`PaymentNumberHash` VARCHAR(125) NULL,
`CalcClassSeqDesc` VARCHAR(255) NULL,
`CalcTripTime` INT NULL DEFAULT 0,
`PaymentNumberMask` VARCHAR(50) NULL,
`BookingLocator` VARCHAR(20) NULL,
`ValidationStatus` TINYINT UNSIGNED NOT NULL DEFAULT 0,
`SyncFileGroup` VARCHAR(100) NULL,
`CalcClassDesc` VARCHAR(100) NULL,
`CalcCityNameItinerary` VARCHAR(300) NULL,
`CalcCountryNameItinerary` VARCHAR(300) NULL,
PRIMARY KEY (`BookingID`),
INDEX `IX_dataBooking_VendorID` (`VendorID` ASC, `BookingTypeID` ASC),
INDEX `IX_dataBooking_TransactionTypeCode` (`TransactionTypeCode` ASC),
INDEX `IX_dataBooking_BookingTypeID` (`BookingTypeID` ASC, `BookingID` ASC),
INDEX `IX_dataBooking_IssuedDate` (`IssuedDate` ASC),
INDEX `IX_dataBooking_VendorCode` (`VendorCode` ASC),
INDEX `IX_dataBooking_IntDomCode` (`IntDomCode` ASC),
INDEX `IX_dataBooking_OriginalDocumentNumber` (`OriginalDocumentNumber` ASC, `BookingID` ASC),
INDEX `IX_BackOfficeBookingID` (`BackofficeBookingID` ASC, `BookingID` ASC),
INDEX `IX_dataBooking_UnusedEticket` (`UnUsedEticket` ASC),
INDEX `IX_dataBooking_AccountID` (`AccountID` ASC),
INDEX `IX_dataBooking_StartDate` (`StartDate` ASC),
INDEX `IX_dataBooking_EndDate` (`EndDate` ASC),
INDEX `IX_dataBooking_TravelerName` (`TravelerName` ASC),
INDEX `PMI_dataBooking_PaymentNumberEnd` (`PaymentNumberEnd` ASC, `TransactionTypeCode` ASC, `InvoiceNumber` ASC),
INDEX `PMI_dataBooking_InvPnrTraveler` (`PNRLocator` ASC, `InvoiceNumber` ASC, `TravelerName` ASC),
INDEX `IX_InvoiceNumber_AccountID_BranchID` (`InvoiceNumber` ASC, `AccountID` ASC, `BranchID` ASC, `BookingTypeID` ASC, `IssuedDate` ASC, `DataSourceID` ASC, `TransactionTypeCode` ASC, `DocumentNumber` ASC, `BookingID` ASC),
INDEX `IX_BookingID_DocumentNumber` (`BookingID` ASC, `DocumentNumber` ASC, `OriginalDocumentNumber` ASC, `ReIssued` ASC, `OriginalOriginalDocumentNumber` ASC, `BookingTypeID` ASC)
)
COMPRESSION=SPARSE;

I can’t speak to exactly why we use more storage space than SQL Server in this scenario; it depends on how you define the table in SQL Server. I would suggest you try the columnstore, or what we are now calling Universal Storage. We have given it that different name because it is more than just a regular columnstore: the intent is that a Universal Storage table has the analytics power of a columnstore and the operational characteristics of a rowstore, all in one. It handles inserts, updates, deletes, and point queries much faster than your average columnstore. (This is done through a couple of interesting mechanisms I can explain in more detail if you want.) It won’t be as fast as using a rowstore, but you may find it is fast enough for your scenario, and the benefits of column compression and the overall TCO of using less memory may be worth the tradeoff.

The breakdown of memory use by data type can be found here:

Each secondary key adds 40 bytes of overhead per row (it doesn’t matter which columns the key is on; the overhead is fixed). Your table has a number of secondary keys, so that likely explains most of the overhead you’re seeing.
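As a rough back-of-envelope check using those numbers: your posted definition has 17 secondary indexes, so at 40 bytes per row per index, 17 × 40 bytes × 100,000 rows = 68,000,000 bytes, or about 65 MB of fixed index overhead on its own, before counting the row data itself.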

How were you determining the size of your table? Via information_schema.table_statistics?

-Adam

@adam One of my experiments was with the primary index only, no secondary indexes, which came out to 293 MB as I mentioned above. The table definition SQL I posted was the full scenario, which came out to 463 MB. Yes, I was pulling the usage from information_schema.table_statistics as described in the documentation. I wasn’t sure whether the size increase is “normal”/expected or whether I was missing something. I definitely understand they are very different systems, so I’m not expecting apples to apples, but the size increase was much higher than I expected.
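For reference, the sizing query I ran was roughly along these lines (summing across partitions; memory_use and rows are the column names I found in the table_statistics docs):

SELECT database_name, table_name,
SUM(`rows`) AS total_rows,
SUM(memory_use) / 1024 / 1024 AS memory_mb
FROM information_schema.table_statistics
WHERE table_name = 'dataBooking'
GROUP BY database_name, table_name;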

@rick The SQL Server table definition is nearly identical. Since I posted above, I read through the announcement of the columnstore changes in 7.0, which sounds awesome. Given that columnstore is the new go-to, in what use cases would the in-memory rowstore still make sense? I’m still of the classic thinking that a columnstore can’t support rapid updates, and our intention is to use pipelines and hook this table up to a Kafka topic that will be changing quite frequently.

I did recreate this table as a columnstore and loaded it with the same 100k rows, and the table size dropped to 27 MB, which is amazing.
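For anyone following along, the recreate was basically the same definition with ROWSTORE and COMPRESSION=SPARSE dropped and a clustered columnstore key added, something like this (abridged to two columns here, and the key choice is just illustrative, not necessarily what we will settle on):

CREATE TABLE dataBooking_cs (
`BookingID` INT NOT NULL,
`IssuedDate` DATETIME NULL,
SHARD KEY (`BookingID`),
KEY (`IssuedDate`) USING CLUSTERED COLUMNSTORE
);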

Thanks guys.

@chrisgrasp In our managed service the default table type is Universal Storage. (Note: in our software offering it will become the default table type in the next release, coming out later this year.) I am deliberately calling it Universal Storage and not “column store”: calling it just a columnstore undersells its value and leads to the “classic thinking” you mention. While the technology we started with is a columnstore, we have evolved it beyond that.

For example, we keep a small rowstore alongside the columnstore (hidden from the user). Small inserts go into that rowstore first and are moved over to the column-based table later; queries are a union over the row table and the main column table. This is how we make point inserts fast. Deletes are a metadata-only operation (with cleanup of the data happening asynchronously). We have changed the locking model and how we access the segments (the chunks of data) to limit the amount of data that has to be touched on disk, making point queries and row updates fast. We now support multiple indexes over the table.

We have done all this deliberately to let the query patterns typically associated with a rowstore table perform well, without compromising the TCO or the fast, large-scale aggregation queries that users expect of a column table. It won’t be as fast as the in-memory rowstore table, but it will be fast enough for most use cases. We truly believe you will only need this one table type for the vast majority of use cases going forward, instead of having to make hard tradeoffs between row- and column-oriented tables.
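To make the hidden rowstore segment concrete, here is a minimal sketch (table name taken from the abridged example above; the background flusher normally moves data over for you, so the explicit FLUSH is only for illustration):

-- A small insert lands in the hidden in-memory rowstore segment first.
INSERT INTO dataBooking_cs (`BookingID`) VALUES (123456);

-- Force the in-memory segment into columnstore format on disk.
-- (Normally the background flusher does this asynchronously.)
OPTIMIZE TABLE dataBooking_cs FLUSH;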

I am curious to see how well it works for you and your use case.

Thanks @rick, that was very informative.

The “small row store alongside the column store” you mentioned reminds me of SQL Server’s delta store that’s created when you add a clustered columnstore index to a table, albeit disk-based and B-tree based. Is that a similar concept to what you described?

Chris

Yes, our writeable in-memory row store segment in our columnstore tables is similar to the delta store in SQL Server columnstores, though as you said, our writeable segment is entirely in memory, not a disk-based B-tree.

@hanson The documentation above mentions that writing >= 16 MB of data triggers an “optimized insert”. Does this mean it bypasses the in-memory rowstore and writes directly to disk? And if so, does that imply a performance penalty when writing batches >= 16 MB?

We’re trying to understand what batch size we should choose when writing data to SingleStore for historical data loads. For example, should we keep batches under 16 MB so they go to the in-memory rowstore and write performance stays fast, or is the performance difference negligible?

In the vein of this thread: we know that inserts and updates to columnstores within SQL Server come with performance penalties.

Presumably you are referring to this documentation page:

The heuristics used to decide whether to insert into the in-memory rowstore segment or do a direct-to-disk columnstore insert cut over when the overall cost of a direct-to-disk insert is most likely going to be less. So no, there should not be a performance penalty.

Most apps should not have to care about this. But the cutover decision is inexact, so if you are trying to squeeze out every millisecond, you may need to test and choose your batch size so that inserts always go to the in-memory rowstore segment, or always direct to disk.
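If you want to verify where a given test batch landed, one rough way (my suggestion, not an official recipe) is to compare the table’s in-memory footprint before and after the load, since the in-memory rowstore segment is reflected in the table’s memory use:

-- Run before and after a test batch; growth suggests the batch went to
-- the in-memory segment, a flat number suggests a direct-to-disk insert.
-- (Table name is illustrative.)
SELECT SUM(memory_use) AS bytes_in_memory
FROM information_schema.table_statistics
WHERE table_name = 'dataBooking_cs';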

The 16 MB threshold is the default value of this expression involving two global variables:

columnstore_flush_bytes * columnstore_disk_insert_threshold
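You can check the live values on your cluster with standard variable inspection:

SHOW GLOBAL VARIABLES LIKE 'columnstore_flush_bytes';
SHOW GLOBAL VARIABLES LIKE 'columnstore_disk_insert_threshold';

If memory serves, the stock defaults are 32 MB (33554432 bytes) and 0.5, which multiply out to the 16 MB figure you saw.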

See a discussion of that here:

@hanson thanks, will give it a read.