Rethinking Lambda Architecture for Real-Time Analytics

Big data, as a concept and practice, has been around for quite some time now. Most companies have responded to the influx of data by adapting their data management strategies. However, managing data in real time still poses a challenge for many enterprises. Some have successfully incorporated streaming or processing tools that provide instant access to real-time data, but most traditional enterprises are still exploring their options. Complicating matters further, most enterprises need access to both historical and real-time data, which require distinct considerations and solutions.

Of the many approaches to managing real-time and historical data concurrently, the Lambda Architecture is by far the most talked about today. Like the shape of the Greek letter it is named for, the Lambda Architecture forks into two paths: one is a streaming (real-time) path, the other a batch path. Thus, it accommodates a real-time, high-speed data service alongside an immutable data lake. Oftentimes a serving layer sits on top of the streaming path to power applications or dashboards.

A Fork in the Road

Many Internet-scale companies, like Pinterest, Zynga, Akamai, and Comcast, have chosen SingleStore to deliver the high-speed data component of the Lambda Architecture. Some customers have chosen to fork the input stream in order to push data into SingleStore and a data lake, like HDFS, in parallel. The Comcast Lambda Architecture (diagrammed in the original post) is one example of this parallel fork.
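To make the fork concrete, here is a minimal sketch of that dual-write pattern using Spark Structured Streaming: a single stream read from Kafka is appended both to an HDFS-backed data lake and to SingleStore over JDBC. The topic, paths, connection URL, and table name are assumptions for illustration only, not details of the Comcast deployment.

// A minimal sketch (not Comcast's actual implementation) of forking one input
// stream into both a data lake and SingleStore, using Spark Structured Streaming.
// The Kafka topic, HDFS paths, JDBC URL, and table name are assumed for illustration.
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForkedIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lambda-fork").getOrCreate()

    // Read the raw event stream once from Kafka.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Batch path: append immutable Parquet files to the HDFS data lake.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data-lake/events")
      .option("checkpointLocation", "hdfs:///checkpoints/lake")
      .start()

    // Speed path: write the same stream into SingleStore over JDBC
    // (SingleStore speaks the MySQL wire protocol).
    val toSingleStore = (batch: DataFrame, batchId: Long) => {
      batch.write
        .format("jdbc")
        .option("url", "jdbc:mysql://singlestore:3306/analytics")
        .option("dbtable", "events")
        .option("user", "root")
        .mode("append")
        .save()
    }
    events.writeStream
      .foreachBatch(toSingleStore)
      .option("checkpointLocation", "hdfs:///checkpoints/speed")
      .start()

    spark.streams.awaitAnyTermination()
  }
}

Each path can then be served independently: dashboards hit the SingleStore tables while batch jobs recompute views over the Parquet files in HDFS.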
Customer Guest Post: Learning the SingleStore JSON Column Type

This post originally appeared on the Novus Tech Blog.

More and more, companies are migrating from NoSQL databases back to relational, ACID-compliant databases that also offer high availability and horizontal scalability – aka "NewSQL." Novus Partners is no exception, having recently completed our migration from MongoDB to SingleStore.

Of course, NoSQL developers often come to depend on rapid development cycles and "schema-less" data models, which can make the transition back to relational – including the prospect of devising a relational model for your schemaless data – daunting. It turns out, however, that SingleStore offers a feature that enables you to keep many of those benefits and ease the NoSQL-to-NewSQL transition: the JSON column type.

In this article, we'll get up and running quickly with SingleStore, and then immediately learn about its native JSON support and what it can do.

Getting Started with SingleStore

Although SingleStore is a proprietary technology (binary download only), they recently released a free Community Edition that is perfect for developer testing. All you need is a 64-bit Linux environment and the MySQL client (SingleStore made a strategic decision to implement the MySQL client protocol "bug-for-bug").

After downloading and installing the Community Edition and the MySQL client, you will want to alias memsql to the MySQL client command as below (I have this alias in my ~/.local.bash):

$ alias memsql
alias memsql='mysql -u root -h 127.0.0.1 -P 3306 --prompt="memsql> "'
$ memsql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 5.5.8 SingleStore source distribution (compatible; MySQL Enterprise & MySQL Commercial)

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

If you already know MySQL, you already know most of SingleStore's commands and syntax. SingleStore adds some proprietary commands and syntax, mostly related to its replication and sharding functionality, which we won't discuss here. If you want to learn more, you can find all their documentation online.

Let's start out by inspecting the default databases, and then define our own "test" database where we can explore SingleStore's JSON support:

memsql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| memsql             |
| sharding           |
+--------------------+
3 rows in set (0.00 sec)

memsql> create database test;
Query OK, 1 row affected (2.21 sec)

memsql> use test;
Database changed

Prototyping a Schema

Perhaps in our existing NoSQL database, we had a document collection named tasks that contained, among other things, unstructured data related to our distributed compute application.
Our developers added and removed fields from their Task objects as needed for different compute jobs, which was great from a development perspective because it meant rapid development cycles without the need for frequent database schema changes and migrations. Fortunately, we don't have to leave that agility behind when we transition from NoSQL back to SingleStore's relational model.

The simplest version of our new tasks table has two columns: a bigint primary key and the column for JSON data, which has the type… JSON:

memsql> create table tasks ( task_id bigint not null primary key auto_increment, task json not null );
Query OK, 0 rows affected (15.53 sec)

memsql> describe tasks;
+---------+------------+------+------+---------+----------------+
| Field   | Type       | Null | Key  | Default | Extra          |
+---------+------------+------+------+---------+----------------+
| task_id | bigint(20) | NO   | PRI  | NULL    | auto_increment |
| task    | JSON       | NO   |      | NULL    |                |
+---------+------------+------+------+---------+----------------+
2 rows in set (0.00 sec)

Now, let's insert some data into this table. The task documents themselves are just JSON literals inside single-quotes, like any other data type:

memsql> insert into tasks (task) values ('{"user" : "nzucker", "uid" : {"clientId" : 1, "which" : "P", "id" : 205} }');
Query OK, 1 row affected (0.00 sec)

memsql> insert into tasks (task) values ('{"user" : "nzucker", "uid" : {"clientId" : 7, "which" : "P", "id" : 1009} }');
Query OK, 1 row affected (0.00 sec)

memsql> insert into tasks (task) values ('{"user" : "bqunibi", "uid" : {"clientId" : 9, "which" : "P", "id" : 327} }');
Query OK, 1 row affected (0.00 sec)

memsql> select * from tasks;
+---------+----------------------------------------------------------------+
| task_id | task                                                           |
+---------+----------------------------------------------------------------+
|       2 | {"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}   |
|       4 | {"uid":{"clientId":9,"id":327,"which":"P"},"user":"bqunibi"}   |
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"nzucker"}  |
+---------+----------------------------------------------------------------+
3 rows in set (0.00 sec)

An important note: SingleStore will not let you insert invalid JSON documents. Attempting to do so yields an error:

memsql> insert into tasks (task) values ('{"user" : ');
ERROR 1844 (HY000): Leaf Error (10.20.79.111:3307): Invalid JSON value for column 'task'

JSON Field Predicates

Suppose we want to select rows based on a specific JSON document field.
Even though the SingleStore JSON column type is something like a LONGTEXT column in terms of data format and limitations, the database query engine understands the JSON specification and how to navigate the document tree.

For example, we can select all the tasks that I previously inserted having a user of 'nzucker':

memsql> select * from tasks t where t.task::$user = 'nzucker';
+---------+----------------------------------------------------------------+
| task_id | task                                                           |
+---------+----------------------------------------------------------------+
|       2 | {"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}   |
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"nzucker"}  |
+---------+----------------------------------------------------------------+
2 rows in set (2.71 sec)

The clause t.task::$user navigates to the user field of the JSON document, with the $ prefix ensuring that the value is evaluated as a STRING data type (it is actually shorthand for the JSON_EXTRACT_STRING function).

Naturally, SingleStore supports predicates that use nested JSON fields. For example, we can also query by the uid.clientId field:

memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+----------------------------------------------------------------+
| task_id | task                                                           |
+---------+----------------------------------------------------------------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"nzucker"}  |
+---------+----------------------------------------------------------------+
1 row in set (0.00 sec)

The % prefix on ::%clientId ensures that the field is interpreted as a double (the only numeric data type supported by the JSON standard), which is important if you are doing numeric comparisons in your queries.

memsql> select * from tasks t where t.task::uid::%clientId < 9;
+---------+----------------------------------------------------------------+
| task_id | task                                                           |
+---------+----------------------------------------------------------------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"nzucker"}  |
|       2 | {"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}   |
+---------+----------------------------------------------------------------+
2 rows in set (0.00 sec)

You can also select "sub-documents" from JSON columns using a similar syntax:

memsql> select task::uid from tasks;
+---------------------------------------+
| task::uid                             |
+---------------------------------------+
| {"clientId":7,"id":1009,"which":"P"}  |
| {"clientId":1,"id":205,"which":"P"}   |
| {"clientId":9,"id":327,"which":"P"}   |
+---------------------------------------+
3 rows in set (2.73 sec)

Because we're extracting JSON documents from our JSON column, no $ or % prefix is required in the task::uid clause.

Persisted Columns

SingleStore also supports persisted columns (also known as "computed columns") extracted from JSON document fields.
This feature is very convenient if you find yourself repeatedly querying the same deeply nested JSON document fields.

memsql> alter table tasks add column client_id as task::uid::clientId persisted bigint;
Query OK, 0 rows affected (8.90 sec)
Records: 0  Duplicates: 0  Warnings: 0

memsql> select * from tasks;
+---------+----------------------------------------------------------------+-----------+
| task_id | task                                                           | client_id |
+---------+----------------------------------------------------------------+-----------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"nzucker"}  |         7 |
|       2 | {"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}   |         1 |
|       4 | {"uid":{"clientId":9,"id":327,"which":"P"},"user":"bqunibi"}   |         9 |
+---------+----------------------------------------------------------------+-----------+
3 rows in set (2.75 sec)

Here we persisted the client_id field, extracted from the task object. Note that if our task documents had a suitable primary key field (say, a field named _id), we could extract that field to populate the task_id column.

Now let's select using this persisted column:

memsql> select count(*) from tasks where client_id = 9;
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (2.27 sec)

If you plan your persisted columns in advance, you will make life easier for developers or analysts who may not be familiar with the JSON structure. Also, you can create indexes on these JSON-derived persisted columns, which obviously has a huge benefit.

Updating JSON Document Fields "In-Place"

If you have used JSON persistence from an object-oriented language, you might have written code like the following:

1. Fetch the entire JSON document out of the store.
2. Deserialize the JSON document into an object.
3. Update a single field on the object.
4. Serialize the entire object back to the store as JSON.

That's quite a bit of data transfer just to manipulate a single field. Well, since we can select database rows by JSON fields, why not update individual fields of JSON documents as well? This too is possible:

memsql> update tasks set task::uid::$which = 'F' where task_id = 3;
Query OK, 1 row affected (0.00 sec)

memsql> select * from tasks where task_id = 3;
+---------+----------------------------------------------------------------+-----------+
| task_id | task                                                           | client_id |
+---------+----------------------------------------------------------------+-----------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"F"},"user":"nzucker"}  |         7 |
+---------+----------------------------------------------------------------+-----------+
1 row in set (0.00 sec)

It's worth mentioning that SingleStore doesn't support JSON field-level validation, so "in-place" updates such as this run the risk of generating data that violates your domain model. For example, if the task::uid::which field is required, but you set task::uid::$which = NULL, your application may encounter errors due to the missing field.
So, use this feature with caution (and perhaps a robust set of integration tests).

Manipulating JSON Arrays

Another great feature of SingleStore's JSON support is the ability to manipulate JSON arrays in-place, using the JSON_ARRAY_PUSH_<type> functions (for example, JSON_ARRAY_PUSH_STRING).

Let's continue with our "task" documents example by defining their domain model in Scala as follows:

case class UID(clientId: Int, which: String, id: Int)
case class Task(user: String, uid: UID, history: List[String])

As with any other JSON field, SingleStore's array manipulation functions enable us to add entries to a JSON array "in-place." Using our example data set, we first update one of our documents to have a field called "history," initialized with an empty JSON array:

memsql> update tasks set task::history = '[]' where task_id = 2;
Query OK, 1 row affected (2.96 sec)

We can then insert into the array and observe the results:

memsql> update tasks set task::history = JSON_ARRAY_PUSH_STRING(task::history, "New") where task_id = 2;
Query OK, 1 row affected (0.00 sec)

memsql> select * from tasks;
+---------+----------------------------------------------------------------------------------+-----------+
| task_id | task                                                                             | client_id |
+---------+----------------------------------------------------------------------------------+-----------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"F"},"user":"nzucker"}                    |         7 |
|       2 | {"history":["New"],"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}   |         1 |
|       4 | {"uid":{"clientId":9,"id":327,"which":"P"},"user":"bqunibi"}                     |         9 |
+---------+----------------------------------------------------------------------------------+-----------+
3 rows in set (0.00 sec)

memsql> update tasks set task::history = JSON_ARRAY_PUSH_STRING(task::history, "InProgress") where task_id = 2;
Query OK, 1 row affected (0.00 sec)

memsql> select * from tasks where task_id = 2;
+---------+----------------------------------------------------------------------------------------------+-----------+
| task_id | task                                                                                         | client_id |
+---------+----------------------------------------------------------------------------------------------+-----------+
|       2 | {"history":["New","InProgress"],"uid":{"clientId":1,"id":205,"which":"P"},"user":"nzucker"}  |         1 |
+---------+----------------------------------------------------------------------------------------------+-----------+
1 row in set (0.00 sec)

Finding the Last Element of the Array

Perhaps we want to find the most recent update in our task document's history array. One (inefficient) approach is to extract the entire array, deserialize it into an array object, and find the last element:

memsql> select task::history from tasks where task_id = 2;
+----------------------+
| task::history        |
+----------------------+
| ["New","InProgress"] |
+----------------------+
1 row in set (2.18 sec)

This certainly works in a pinch, but a better solution is to obtain the last element – representing the most recent status – directly in the SQL. Thank you to Justin Fu at SingleStore Support, who provided this solution:

memsql> select task_id, JSON_EXTRACT_JSON(task::history, JSON_LENGTH(task::history) - 1) as latest_status
    -> from tasks
    -> where task_id = 2;
+---------+---------------+
| task_id | latest_status |
+---------+---------------+
|       2 | "InProgress"  |
+---------+---------------+
1 row in set (1.24 sec)

Performance Considerations

A quick word about performance: JSON is a plain-text-based data format and therefore will always be slower than an optimized, binary format (see Thrift and Protocol Buffers).
At Novus, our migration path was to continue using Salat for Scala case class-to-JSON serialization, which in turn uses json4s for JSON AST operations. For most cases, this was painless and Just Worked.

However, in some cases, we encountered human-perceptible performance degradation after moving from MongoDB to SingleStore. Typically, this occurred when attempting to deserialize thousands of JSON documents into Scala objects while processing a SELECT result. Although many developers know Mongo as "just JavaScript," it's actually not: the storage and wire format is the carefully planned and optimized BSON format. Given that our application was now transferring and parsing JSON text rather than BSON, this slowdown was completely understandable.

So, the trade-off was to sacrifice human-readability for performance, reverting to a binary format for the performance-sensitive functionality. We successfully used Twitter Chill for this purpose, storing the objects in a BLOB database column. Another option is to bite the bullet and devise a relational model for your object model, particularly if the design is stable. Either way, be sure to focus on functionality and domain modeling first, before turning to performance optimizations.

Thank you to Carlos Bueno (@archivd) and the rest of the SingleStore team for feedback on earlier drafts of this article.
Tapjoy is Powering its Mobile Ad Platform with SingleStore

Over the past several months, we worked closely with the Tapjoy data science and engineering team to implement SingleStore as the database powering their Mobile Marketing Automation and Monetization Platform. In order to deliver optimized ads to over 500 million global users and support over one million transactions per minute, Tapjoy needed a database that could enable HTAP, a Gartner term we refer to frequently at SingleStore, which stands for Hybrid Transactional and Analytical Processing.

Two Use Cases

Real-Time Ad Optimization

What does ad optimization look like to a user? Essentially, the Tapjoy Mobile Marketing Platform serves relevant ads from active ad campaigns based on how the ad has performed recently and how well the user fits the target profile of the ad campaign. Users are able to engage with ads and earn reward points, and advertisers generate revenue. For ad optimization to occur in real time, data needs to be made usable very quickly. The combination of high performance and low latency makes SingleStore a good fit for this need. Across the cluster, Tapjoy is able to process 60,000 queries per second with a response time of less than ten milliseconds.

Overlap Analysis

The second use case for Tapjoy and SingleStore is overlap analysis, or ad targeting based on a user falling into multiple market segments. For example, it is easier to target a user if you can categorize them in several ways, like a sports fan from Texas with kids. SingleStore provides a real-time stream of data and a familiar SQL interface, enabling Tapjoy to analyze user interest data quickly and efficiently to serve the right ad to the right person at the right time. In short, if you know SQL, you can do big data. We detailed more of this use case in a previous blog post: Boost Conversions with Overlap Ad Targeting.

Realizing HTAP

SingleStore provides several important benefits when it comes to enabling HTAP, beyond the SQL interface. Unlike HBase, Tapjoy's prior solution, SingleStore allows for updates and deletes. In addition, SingleStore can be easily integrated with standard tools and APIs. Interoperability with hardware is something that we see as a boon to our customers – they can build their own white box solution, leverage commodity hardware, and deploy in public or private clouds. Finally, SingleStore embodies HTAP by combining transactions and analytics in one system, achieving high throughput and low latency.

With a slew of competitors in the digital advertising space, having instantaneous access to data up to the last click, along with the capability to concurrently analyze that data, is a key differentiator for Tapjoy. For more information on Tapjoy and SingleStore at the In-Memory Computing Conference, visit: singlestore.com/events. Read the official announcement here.
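To illustrate the overlap idea, here is a small, hypothetical sketch of a segment-intersection query run over JDBC. The user_segments table and its columns are invented for the example and are not Tapjoy's actual schema; because SingleStore implements the MySQL protocol, a standard MySQL driver and connection string are used.

// A rough, hypothetical illustration of overlap analysis: find users who belong to
// all of several market segments at once. The user_segments table, its columns, and
// the connection details are invented for this sketch; they are not Tapjoy's schema.
// SingleStore speaks the MySQL protocol, so a standard MySQL JDBC driver is used.
import java.sql.DriverManager

object OverlapAnalysis {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3306/ads", "root", "")

    // Users tagged with all three segments: sports fans, in Texas, with kids.
    val overlapQuery =
      """SELECT user_id
        |FROM user_segments
        |WHERE segment IN ('sports_fan', 'texas', 'has_kids')
        |GROUP BY user_id
        |HAVING COUNT(DISTINCT segment) = 3""".stripMargin

    val rs = conn.createStatement().executeQuery(overlapQuery)
    while (rs.next()) println(rs.getLong("user_id"))
    conn.close()
  }
}

The HAVING COUNT(DISTINCT segment) = 3 clause is what enforces the overlap: only users matching all three segments survive the query.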
Essential Resources for Apache Spark

There's no doubt about it: Apache Spark is well on its way to becoming a ubiquitous technology. Over the past year, we've created resources to help our users understand the real-world use cases for Spark, as well as to showcase how our technologies complement one another. Now, we've organized and consolidated those materials into this very post.

Videos

Pinterest Measures Real-Time User Engagement with Spark
A demo of a real-time data pipeline processing and analyzing re-pins across the United States.
How Pinterest Measures Real-Time User Engagement with Spark

Setting the Stage for Spark

With Spark on track to replace MapReduce, enterprises are flocking to the open source framework in an effort to take advantage of its superior distributed data processing power. IT leads who manage the infrastructure and data pipelines of high-traffic websites are running Spark (in particular, Spark Streaming, which is ideal for structuring real-time data on the fly) to reliably capture and process event data and write it in a format that can immediately be queried by analysts. As the world's premier visual bookmarking tool, Pinterest is one of the innovative organizations taking advantage of Spark. Pinterest found a natural fit in SingleStore's in-memory database and Spark Streaming, and is using these tools to find patterns in high-value user engagement data.

Pinterest's Spark Streaming Setup

Here's how it works:

1. Pinterest pushes event data, such as pins and repins, to Apache Kafka.
2. Spark Streaming ingests the event data from Apache Kafka, then filters it by event type and enriches each event with full pin and geo-location data.
3. Using the SingleStore Spark Connector, the data is then written to SingleStore, with each event type flowing into a separate table. SingleStore handles record deduplication (Kafka's "at least once" semantics guarantee fault tolerance but not uniqueness).
4. As data streams in, Pinterest is able to run queries in SingleStore to generate engagement metrics and report on event data such as pins, repins, comments, and logins.

(A rough code sketch of this pipeline appears at the end of this post.)

Visualizing the Data

We built a demo with Pinterest to showcase the locations of repins as they happen. When an image is repinned, circles on the globe expand, providing a visual representation of the concentration of repins by location.
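For readers who want to see roughly what such a pipeline looks like in code, here is a minimal sketch. It is not Pinterest's production implementation: it uses Spark Structured Streaming and a plain JDBC write in place of the SingleStore Spark Connector, and the topic, table, and column names are assumptions made for the example.

// A minimal sketch of the pipeline described above (not Pinterest's production code).
// Spark Structured Streaming and a plain JDBC write stand in for the SingleStore
// Spark Connector; topic, table, and column names are assumed for illustration.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.get_json_object

object EngagementPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("pinterest-engagement").getOrCreate()
    import spark.implicits._

    // 1. Ingest raw event JSON from Kafka.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // 2. Filter by event type; a real pipeline would also enrich each event
    //    with full pin metadata and geo-location data at this step.
    val repins = raw
      .withColumn("event_type", get_json_object($"json", "$.type"))
      .withColumn("pin_id", get_json_object($"json", "$.pin_id"))
      .filter($"event_type" === "repin")

    // 3. Write repins to their own SingleStore table. (In the real pipeline,
    //    SingleStore de-duplicates Kafka's at-least-once delivery on the database
    //    side; a plain JDBC append as shown here does not capture that.)
    val toTable = (batch: DataFrame, batchId: Long) => {
      batch.write
        .format("jdbc")
        .option("url", "jdbc:mysql://singlestore:3306/analytics")
        .option("dbtable", "repins")
        .option("user", "root")
        .mode("append")
        .save()
    }

    // 4. While the stream runs, engagement metrics can be queried from the
    //    repins table with ordinary SQL.
    repins.writeStream
      .foreachBatch(toTable)
      .option("checkpointLocation", "hdfs:///checkpoints/repins")
      .start()

    spark.streams.awaitAnyTermination()
  }
}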
ESG Lab Validates SingleStore’s Real-Time Big Data Analytics Solution

Enterprise Strategy Group (ESG) recently performed a benchmark that established SingleStore's distributed in-memory database as the standard for real-time Big Data analytics. ESG validated that SingleStore delivers outstanding performance and linear scalability, superior reliability and durability, and the ability to support rapid growth.

Tony Palmer, a senior ESG Lab analyst who performed the benchmark, said: "Organizations today increasingly need a fast and easily scalable database to query Big Data in order to meet the real-time demands of their business. Throughout our tests, SingleStore's distributed in-memory database demonstrated outstanding ingest and query performance, with ingest and queries occurring concurrently. As a result, we would recommend businesses take a serious look at SingleStore as a solution that provides highly available, high-performance, real-time, Big Data analytics at scale."

SingleStore was designed for today's Big Data analytical needs and strives to solve the problems that plague traditional databases, which simply cannot compete in high-velocity, high-volume data environments. SingleStore delivers performance, complex analytical capabilities, and a highly reliable and durable real-time database for Big Data analytics.