Five Sessions to Attend at AWS re:Invent 2017

Amazon will finish an exciting year by bringing together thousands of people to connect, collaborate, and learn at AWS re:Invent from November 27 – December 1 in Las Vegas. Whether you are a cloud beginner or an experienced user, you will learn something new at AWS re:Invent. This event is designed to educate attendees about the AWS platform, and help develop the skills to design, deploy, and operate infrastructure and applications. SingleStore is exhibiting in the Venetian Sands Expo Hall, so stop by our booth #1200 to view a demo and talk to our subject matter experts.

This year, AWS re:Invent will offer even more breakout sessions led by AWS subject matter experts and top customers. This informative mixture of lectures, demonstrations, and guest speakers is geared towards keeping attendees informed on technical content, customer stories, and new product announcements. Here are our suggested top breakout sessions for you to attend at the event.

Big Data Architectural Patterns and Best Practices on AWS
Siva Raghupathy – Sr. Manager, Solutions Architecture, Amazon

In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.

Business and Life-Altering Solutions Through AI and Image Recognition
Bob Rogers – Chief Data Scientist, Intel
Julie Cordua – CEO, Thorn
Nikita Shamgunov – CEO, SingleStore

Artificial intelligence is going to be part of every software workload in the not-too-distant future.
Partnering with AWS, Intel is dedicated to bringing the best full-stack solutions to help solve business and societal problems by helping turn massive datasets into information. Thorn is a non-profit organization, co-founded by Ashton Kutcher, focused on using technology innovation to combat child sexual exploitation. It is using SingleStore to provide a new approach to machine learning and real-time image recognition by making use of the high-performance Intel SIMD vector dot product functionality. This session covers machine learning on Intel Xeon processor-based platforms and features speakers from Intel, Thorn, and SingleStore.

Five Ways Artificial Intelligence Will Reshape How Developers Think
Noelle LaCharite – Sr. Technical Program Manager, Amazon
Jenn Jinhong – Developer Advocate, Amazon

Thinking in terms of AI and conversation changes the way you approach building web services and customer experiences. In this session, we discuss five trends that we’re seeing right now in artificial intelligence and conversational UI as we work with people building new experiences with Alexa.

Self-Service Analytics with AWS Big Data and Tableau
Anna Terp – BI Solutions Architect, Expedia
Tad Buhman – Sr. Business Analyst II, Expedia

As one of the thought leaders in Expedia’s cloud migration, the Expedia Global Payments Business Intelligence group architected, designed, and built a complete cloud data mart solution from the ground up using AWS and Tableau Online. In this session, we will discuss our business challenge, the journey to the solution, the high-level technical architecture (using S3, EMR, data pipelines, Redshift, and Tableau Online), and lessons learned along the way, including best practices and optimization methods.
Leveraging a Cloud Policy Framework – From Zero to Well Governed
Vikram Pillai – Chief Architect, Director of Engineering, CloudHealth Technologies

Governing cloud infrastructure at scale requires software that enables you to capture and drive management from internal policies, best practices, and reference architectures. A policy-driven management and governance strategy is critical to successfully operate in cloud and hybrid environments. As infrastructure grows, you might leverage knowledge that extends beyond the organization. An open-source “cloud policy framework” enables users to leverage a community that can help define and tune best practice policies, and helps SaaS vendors and ISVs capture the best way to manage an application and share it with customers. A well-defined management and governance strategy enables you to put automation in place that keeps your cloud running securely and efficiently without having to take it on as a full-time job. This session discusses the development of a “cloud policy framework” that provides open-source rule definitions organizations can use to govern their cloud. Learn best practice policies for managing all aspects of services, applications, and infrastructure across cost, availability, performance, security, and usage.
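The Thorn session above turns on one primitive: comparing image feature vectors with a dot product. As a rough illustration only (the vectors, names, and dimensions here are invented, and production systems run SIMD-accelerated dot products inside the database rather than in Python), cosine similarity over image embeddings works like this:

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by magnitudes; 1.0 means identical direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 4-dimensional embeddings; real image embeddings have
# hundreds or thousands of dimensions but the comparison is the same.
query_image = [0.1, 0.9, 0.2, 0.4]
candidates = {
    "img_001": [0.1, 0.9, 0.2, 0.4],   # identical to the query
    "img_002": [0.9, 0.1, 0.4, 0.2],   # dissimilar
}

matches = {name: cosine_similarity(query_image, vec) for name, vec in candidates.items()}
best = max(matches, key=matches.get)
print(best)  # img_001 -- the exact match scores 1.0
```

The appeal of pushing this into the database engine, as the session describes, is that the similarity search runs next to the stored vectors instead of shipping every embedding to an application server.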
Five Sessions to Attend at Strata Data Conference New York

Strata Data Conference in New York brings together thousands of companies across the globe that build their businesses with data. The developers responsible for this revolution need a place to share their experiences on this journey. This year, Strata Data Conference will offer even more breakout sessions led by data engineers and scientists. This informative mixture of lectures, demonstrations, and guest speakers is geared towards keeping attendees informed on technical content, customer stories, and new launch announcements. SingleStore is exhibiting in the Sponsor Expo, so stop by our kiosk #2 to view a demo and speak with our subject matter experts. Here are our top session picks for you to attend at the event.

Geospatial big data analysis at Uber
Zhenxiao Luo (Uber), Wei Yan (Uber)
11:20am–12:00pm, Wednesday, September 27, 2017 – Location: 1A 23/24

Uber’s geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao and Wei will start with an overview of Uber’s big data infrastructure before explaining how Uber models geospatial data and outlining its data ingestion pipeline. They will then discuss geospatial query performance improvement techniques and experiences, focusing on geospatial data processing in big data systems, including Hadoop and Presto. Zhenxiao and Wei will conclude by sharing Uber’s use cases and roadmap.

Building advanced analytics and deep learning on Apache Spark with BigDL
Yuhao Yang (Intel), Zhichao Li (Intel)
1:15pm–1:55pm, Wednesday, September 27, 2017 – Location: 1A 12/14

The rapid development of deep learning in recent years has greatly changed the landscape of data analytics and machine learning and helped empower the success of many applications for artificial intelligence.
BigDL, a new distributed deep learning framework on Apache Spark, gives users easy, seamlessly integrated big data and deep learning capabilities. Yuhao Yang and Zhichao Li will share real-world examples of end-to-end analytics and deep learning applications, such as speech recognition (e.g., Deep Speech 2), object detection (e.g., Single Shot Multibox Detector), and recommendations, on top of BigDL and Spark, with a particular focus on how users leveraged the BigDL models, feature transformers, and Spark ML to build complete analytics pipelines. Yuhao and Zhichao will also explore recent developments in BigDL, including full support for Python APIs (built on top of PySpark), notebook and TensorBoard support, TensorFlow model R/W support, better recurrent and recursive net support, and 3D image convolutions.

When models go rogue: Hard-earned lessons about using machine learning in production
David Talby (Pacific AI)
5:25pm–6:05pm, Wednesday, September 27, 2017 – Location: 1A 06/07

Much progress has been made over the past decade on process and tooling for managing large-scale, multi-tier, multicloud apps and APIs, but there is far less common knowledge on best practices for managing machine-learned models (classifiers, forecasters, etc.) once they are in production, especially beyond the modeling, optimization, and deployment process. Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
Topics include:
- Concept drift: Identifying and correcting for changes in the distribution of data in production, which cause pretrained models to decline in accuracy
- A/B testing challenges: Recognizing common pitfalls like the primacy and novelty effects, and best practices for avoiding them (like A/A testing)
- Offline versus online measurement: Why both are often needed, and best practices for getting them right (refreshing labeled datasets, judgement guidelines, etc.)

Data futures: Exploring the everyday implications of increasing access to our personal data
Daniel Goddemeyer (OFFC NYC), Dominikus Baur (Freelance)
11:20am–12:00pm, Thursday, September 28, 2017 – Location: 1E 15/16

Increasing access to our personal data raises profound questions around ownership, ethics, and the resulting sociocultural changes in our everyday lives. Recent legislation that allows the reselling of our personal browser histories without our explicit consent proves the increased need to explore and investigate the consequences that these developments may bring about. Data Futures, an MFA class in which students observed each other through their own data, explores the social impacts that this informational omnipresence of our personal data may have on our future interactions. In the course, students are guided through a succession of exercises in which they observe each other through their personal data trails to derive assumptions about one another and their class. The changing social dynamics that are exposed by these intimate data exercises showcase how the social behavior of a whole group is affected once personal information becomes accessible. Inspired by this experiential understanding of their data, students then speculate around the future impacts of this knowledge ubiquity by telling visual interaction stories, exemplifying the implications of increasing access to our data.
Daniel Goddemeyer and Dominikus Baur share the findings from Data Futures and demonstrate the results with a live experiment with the audience that showcases some of the effects when personal data becomes accessible.

Executive Briefing: Machine learning—Why you need it, why it’s hard, and what to do about it
Mike Olson (Cloudera)
1:15pm–1:55pm, Thursday, September 28, 2017 – Location: 1E 12/13

Companies have been capturing and analyzing data with sophisticated tools for a long time. In recent years, though, two forces have combined to change what’s possible: we can collect and store vastly more data than ever before, and we have powerful new capabilities, like machine learning, to analyze it. Companies that do this well benefit in many ways. Mike Olson will share examples of real-world machine learning applications, explaining how they matter to business. Mike will also explore a variety of challenges in putting these capabilities into production—the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and will outline proven ways to meet them.
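The "concept drift" topic in David Talby's talk can be made concrete with a common production guardrail: score the model's accuracy over a sliding window of recently labeled examples and flag it when accuracy decays. This sketch is ours, not the speaker's code; the window size and threshold are arbitrary assumptions:

```python
from collections import deque

class DriftMonitor:
    """Flags a deployed model whose rolling accuracy falls below a threshold."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(1 if predicted == actual else 0)

    def drifting(self):
        # Wait for at least half a window of labels before judging the model.
        if len(self.results) < self.results.maxlen // 2:
            return False
        return sum(self.results) / len(self.results) < self.threshold

monitor = DriftMonitor(window=10, threshold=0.8)
for _ in range(9):
    monitor.record("spam", "spam")      # model starts out accurate
print(monitor.drifting())               # False: rolling accuracy is 100%
for _ in range(10):
    monitor.record("spam", "not_spam")  # distribution shifts; model is now wrong
print(monitor.drifting())               # True: rolling accuracy fell to 0%
```

The hard part in practice, as the abstract notes, is getting fresh labels at all; the monitoring arithmetic itself is this simple.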

Five Sessions to Attend at Join Conference in San Francisco

Join Conference in San Francisco brings together thousands of companies across the globe that build their businesses with Looker. The developers responsible for this revolution need a place to share their experiences on this journey. This year, Join will offer even more breakout sessions led by Looker subject matter experts and top customers. This informative mixture of lectures, demonstrations, and guest speakers is geared towards keeping attendees informed on technical content, customer stories, and new launch announcements. SingleStore is exhibiting in the Sponsor Expo, so stop by our kiosk #2 to view a demo and speak with our subject matter experts. Here are our top session picks for you to attend at the event.

Driving Adoption — Tips, Tricks & Everything in Between
Thursday, September 14, 11:15am – 12:00pm
Lucas Thelosen, VP, Professional Services, Looker

Looker Professional Services will present an overview of challenges faced when releasing Looker to a team of end-users, both technical and non-technical, and how they have seen solutions working out. Gain insight into how we teach everything from the simplest concepts (what is your data?) to advanced explores for the highly curious.

Turn Looker Into Your To Do List
Thursday, September 14, 1:00pm – 1:45pm
Jamie Davidson, VP Platform, Looker

Step up your game by using Looker as a platform for operations. Learn from the experts on Looker integrations that will allow you to turn your data insights into action items.

Scaling to 800 Looker Users at BuzzFeed
Thursday, September 14, 1:25pm – 1:45pm
Nick Hardy, Senior Business Analyst, Data Science, BuzzFeed

BuzzFeed has one of the largest internal Looker user bases. From data pipelines and permissions, to models and PDTs, discover what they’ve found works (and doesn’t!) on their path to 800 users.
Analyzing & Optimizing Redshift Performance
Friday, September 15, 2:00pm – 2:20pm
Fabio Beltramini, Customer Success Analyst, Looker

Learn how you can leverage our Redshift Performance blocks to analyze your Redshift usage and decrease query latency.

How Blue Apron Managed Rapid Growth with Looker and Google BigQuery
Friday, September 15, 11:00am – 11:20am
Sam Chase, Tech Lead Data Operations, Blue Apron
Andrew Rabinowitz, Data Engineer, Blue Apron

In the past five years, Blue Apron has grown from three founders hand-packing boxes to a public company serving millions. But as they grew, so did their data, and suddenly performance was a huge problem. Hear how Blue Apron uses Kafka, Google BigQuery, and Looker to store, stream, and analyze data.
Five Sessions to Attend at Kafka Summit San Francisco

Kafka Summit San Francisco brings together thousands of companies across the globe that build their businesses on top of Apache Kafka. The developers responsible for this revolution need a place to share their experiences on this journey. This year, Kafka Summit San Francisco will offer even more breakout sessions led by Kafka subject matter experts and top Kafka customers. This informative mixture of lectures, demonstrations, and guest speakers is geared towards keeping attendees informed on technical content, customer stories, and new launch announcements. SingleStore is exhibiting in the Sponsor Expo, so stop by our kiosk #109 to view a demo and speak with our subject matter experts. Here are our top recommended sessions for you to attend at the event.

Efficient Schemas in Motion with Kafka and Schema Registry
10:30 am – 11:10 am, Pipelines Track
Pat Patterson, Community Champion, StreamSets Inc.

Apache Avro allows data to be self-describing, but carries an overhead when used with message queues, such as Apache Kafka. Confluent’s open source Schema Registry integrates with Kafka to allow Avro schemas to be passed ‘by reference’, minimizing overhead, and can be used with any application that uses Avro. Learn about Schema Registry, using it with Kafka, and leveraging it in your application.

Kafka Stream Processing for Everyone
12:10 pm – 12:50 pm, Streams Track
Nick Dearden, Director of Engineering, Confluent

The rapidly expanding world of stream processing can be confusing and daunting, with new concepts to learn (various types of time semantics, windowed aggregate changelogs, and so on) but also new frameworks and programming models. Multiply this by the operational complexities of multiple distributed systems and the learning curve is steep indeed. Come hear how to simplify your streaming life.
From Scaling Nightmare to Stream Dream: Real-time Stream Processing at Scale
1:50 pm – 2:30 pm, Pipelines Track
Amy Boyle, Software Engineer, New Relic

On the events pipeline team at New Relic, Kafka is the thread that stitches our micro-service architecture together. We receive billions of monitoring events an hour, which customers rely on us to alert on in real time. Facing more than tenfold growth in the system, learn how we avoided a costly scaling nightmare by switching to a streaming system based on Kafka. We follow a DevOps philosophy at New Relic, so I have a personal stake in how well our systems perform. If evaluation deadlines are missed, I lose sleep and customers lose trust. Without necessarily setting out to do so from the start, we’ve gone all in, using Kafka as the backbone of an event-driven pipeline, as a datastore, and for streaming updates to the system. Hear about what worked for us, what challenges we faced, and how we continue to scale our applications.

Kafka Connect Best Practices – Advice from the Field
2:40 pm – 3:20 pm, Pipelines Track
Randall Hauch, Engineer, Confluent

This talk will review the Kafka Connect Framework and discuss building data pipelines using the library of available Connectors. We’ll deploy several data integration pipelines and demonstrate:
- best practices for configuring, managing, and tuning the connectors
- tools to monitor data flow through the pipeline
- using Kafka Streams applications to transform or enhance the data in flight

“One Could, But Should One?”: Streaming Data Applications on Docker
5:20 pm – 6:00 pm, Use Case Track
Nikki Thean, Staff Engineer, Etsy

Should you containerize your Kafka Streams or Kafka Connect apps? I’ll answer this popular question by describing the evolution of streaming platforms at Etsy, which we’ve run on both Docker and bare metal, and what we learned on the way.
Attendees will learn about the benefits and drawbacks of each approach, plus some tips and best practices for running your Kafka apps in production.
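The Schema Registry session's "by reference" trick is worth seeing concretely: instead of shipping a full Avro schema with every message, producers prepend a small header carrying a schema ID, and consumers fetch each schema from the registry once and cache it. The sketch below mimics that framing in pure Python; the magic byte and 4-byte big-endian ID follow Confluent's documented wire format, but the payload here is a stand-in byte string rather than real Avro binary:

```python
import struct

MAGIC_BYTE = 0  # marks a Schema-Registry-framed message

def frame(schema_id, payload):
    """Prepend the 1-byte magic marker and 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message):
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Schema-Registry-framed message")
    return schema_id, message[5:]

framed = frame(42, b"avro-encoded-record-bytes")
schema_id, payload = unframe(framed)
print(schema_id, len(framed) - len(payload))  # 42 5 -- only 5 bytes of overhead
```

Five bytes per message, versus a full schema repeated on every record, is the overhead reduction the session abstract is referring to.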
5 Sessions to See at AWS Summit Chicago

The Amazon Web Services Summit will bring members of the cloud computing community together to connect, collaborate, and learn about AWS from July 26-27, 2017 in Chicago. Whether you are a cloud beginner or an experienced user, you will learn something new at the AWS Summit. This free event is designed to educate attendees about the AWS platform, and help develop the skills to design, deploy, and operate infrastructure and applications. SingleStore is exhibiting in the HUB, Partner & Solutions Expo, so stop by our booth #405 to view a demo and speak with our subject matter experts.

This year, AWS Summit Chicago will offer even more breakout sessions led by AWS subject matter experts and top AWS customers. This informative mixture of lectures, demonstrations, and guest speakers is geared towards keeping attendees informed on technical content, customer stories, and new launch announcements. Here are our top breakout session picks.

DEM314 – Real-time Dashboards with Kinesis Analytics & Amazon Rekognition

In this demo, we’ll show you how to easily aggregate and display streaming data from multiple mobile devices on a real-time dashboard. First, we’ll use Amazon Rekognition to find and analyze faces in photos. Then we will stream the metadata to Amazon Kinesis Streams and aggregate it with Amazon Kinesis Analytics.

Allan MacInnis – Amazon
Ryan Nienhuis – Amazon
Wednesday, July 26, 3:00 PM – 3:30 PM – Day 1 Theater, The HUB

SEC309 – Secure Your Cloud Investment: Mastering AWS Identity Access Management (IAM)

The landscape of IT and data security has changed vastly since the advent of the cloud. Savvy technology leaders know that they must have visibility and control over their environment to fully leverage their cloud investments. Tools like IAM give teams indispensable capabilities to proactively manage and protect their cloud environment.
Join CloudCheckr CEO Aaron Newman to learn tips for effective and secure cloud deployments that you can implement today, including:
- How to address requirements of the AWS Shared Responsibility Model
- Why anticipating internal and external threats is crucial for mitigating security risks in the cloud
- IAM overview and how it helps ensure secure and compliant deployments
- Features and policies, as well as how to apply them to users and groups
- Advice for leveraging IAM roles to mitigate potential security risks
- Best practices for using IAM to configure user permissions, and other important considerations

Aaron Newman – CEO, CloudCheckr
Wednesday, July 26, 11:30 AM – 12:30 PM – E351

DEM201 – Transforming Your Business with Analytics on the Cloud

The cloud and analytics capabilities have matured to the point that two out of every three enterprises have identified business use cases, such as increasing ROI, agility, and pace of innovation, and have started the journey to the cloud. Analytics on the cloud offers a faster path to insight-driven decision-making and business outcomes. This session is intended to increase participants’ knowledge of the cloud analytics value proposition, business use cases, common pitfalls and challenges, and best practices for migrating to the cloud. At the end of the session, attendees will be equipped with:
- A better understanding of what AWS offers in the world of data analytics, machine learning, and AI
- Effective strategies that avoid common mistakes in a cloud journey
- A self-assessment (readiness) approach and a high-level understanding of the total cost of ownership structure of cloud analytics

Prasanna Rajagopalan – Analytics and Information Management, Trianz
Wednesday, July 26, 1:30 PM – 2:00 PM – Kumo Theater, The HUB

DEM200 – Cloudreach: Building an Effective Cloud Operating Model on AWS

In this session, you will learn how to build a cloud operating model on AWS.
When moving to the cloud, it is imperative that you have a plan which encompasses where you are today (your current state) and where you want to get to (your future state). The Cloud Operating Model is designed to help you identify what your target future looks like and the steps you should take along the way to get there. We will cover how to define your cloud strategy and vision, the design principles you need to capture to make it a reality, how to transform your organization to meet the new reality of getting things done, and finally how to ensure you have an operational plan in place to manage your environment.

Jeff Armstrong – Cloud Architect, Cloudreach
Wednesday, July 26, 12:45 PM – 1:15 PM – Kumo Theater, The HUB

DEM316 – Real-Time Streaming Analysis & Visualization using Apache Flink on Amazon EMR and Kibana

See how a real-time data stream can be processed, analyzed, and visualized in real time using a combination of open source technologies and managed services. In this demo, we will use Apache Flink on Amazon EMR to process NYC taxi traffic data in real time. Then we will use the Amazon Elasticsearch Service and Kibana to analyze and visualize the data.

Keith Steward – Specialist (EMR) Solutions Architect & AI SME, Amazon Web Services
Thursday, July 27, 10:45 AM – 11:15 AM – Kumo Theater, The HUB
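The Flink-on-EMR demo above boils down to windowed aggregation over an unbounded stream. As a rough, framework-free illustration (the events and one-minute window are invented, and Flink's real API additionally handles event time, watermarks, and distribution), a tumbling count window looks like this:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Count events per fixed, non-overlapping time window.

    events: iterable of (epoch_seconds, key) pairs, e.g. taxi pickups by borough.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical pickup events: (timestamp, borough)
events = [(0, "Manhattan"), (30, "Manhattan"), (59, "Brooklyn"), (61, "Manhattan")]
print(tumbling_counts(events))
# {(0, 'Manhattan'): 2, (0, 'Brooklyn'): 1, (60, 'Manhattan'): 1}
```

Each window's counts are what the demo then ships to Elasticsearch for Kibana to chart.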
The Machine Learning Track at Spark Summit

Spark Summit 2017 kicks off in less than two weeks with a program that includes more than 175 talks led by top experts in the Apache Spark ecosystem. From developer tutorials and research demos to real-world case studies and data science applications, these 5 sessions will take your machine learning skills to the next level.

5 Machine Learning talks to check out at Spark Summit 2017:

Apache Spark MLlib’s Past Trajectory and New Directions (Joseph Bradley, Databricks) – This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. Review the history of the project, including major trends and efforts leading up to today.

Embracing a Taxonomy of Types to Simplify Machine Learning (Leah McGuire, Salesforce) – Salesforce has created a machine learning framework on top of Spark ML that builds personalized models for businesses across a range of applications.

Extending Spark Machine Learning: Adding Your Own Algorithms and Tools (Holden Karau & Seth Hendrickson, IBM) – Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and looks at how to extend them with your own custom algorithms.

How Apache Spark and AI Powers UberEATS (Chen Jin & Xian Xing Zhang, Uber) – The overall relevance and health of the UberEATS marketplace is critical to making and maintaining it as an everyday product for Uber’s users. Chen and Xian Xing explain their implementation of Apache Spark and AI to increase user retention.

Real-Time Image Recognition with Apache Spark (Nikita Shamgunov, SingleStore) – Nikita’s session will examine the image recognition techniques available with Apache Spark, and how to put those techniques into production.

Don’t miss the chance to see a live demo at the SingleStore kiosk! Visit us at kiosk #7 in the Moscone West expo hall or book a demo here.
Five Talks for Solving Big Data Challenges at Strata+Hadoop World

Strata+Hadoop World in San Jose kicks off next week on March 14, offering data engineers and business intelligence professionals a place to gather and learn about the most challenging problems, engaging use cases, and enticing opportunities in data today. SingleStore will be showcasing real-time data as a vehicle for operationalizing machine learning, exploring advanced tools including TensorFlow, Apache Spark, and Apache Kafka. We will also be demonstrating the power of machine learning to effect positive change in our keynote: Machines and the Magic of Fast Learning. See everything we have planned for Strata+Hadoop World, including customer talks, events, demos, and giveaways, at: singlestore.com/events

Hand-Picked Talks at Strata+Hadoop World

With over 250 speaking sessions to choose from, deciding the right session to check out at Strata can be a bit overwhelming. To help you out, we have mapped out our top five recommended sessions for the conference:

Tuesday, March 14

Developing a modern enterprise data strategy
Edd Wilder-James (Silicon Valley Data Science), Scott Kurth (Silicon Valley Data Science)
1:30pm–5:00pm Tuesday, March 14, 2017 – Location: 210 B/F

A data strategy should guide your organization in two key areas—what actions your business should take to get started with data and where to start to realize the most value. Edd Wilder-James and Scott Kurth explain how to create a modern data strategy to power data-driven business.
O’Reilly Radar Podcast: The 2017 Machine Learning Outlook

O’Reilly Media editor Jon Bruner recently sat down with SingleStore VP of Engineering Drew Paroski and SingleStore CMO Gary Orenstein to discuss the rapid growth and impact that machine learning will have in 2017. In this podcast, Paroski and Orenstein share examples from companies using real-time technologies to power machine learning applications. They also identify key trends driving the adoption of machine learning and predictive analytics.

Listen Here
Five Talks for Building Faster Dashboards at Tableau Conference

Tableau Conference 2016 kicks off in Austin, Texas on November 7-11, offering data engineers and business intelligence pros a place to gather and learn how to utilize data to tell a story through analytics and visualizations. SingleStore will be exhibiting its native connectivity and high-performance partnership with Tableau, using the Tableau SingleStore connector, at TC16. Additionally, SingleStore will present a new showcase application, SingleStore Springs: Real-Time Resort Demographic Analysis. This demonstration showcases live customer behavior by demographic across resort properties, visualized with a Tableau dashboard. Attendees can learn how to natively connect Tableau to SingleStore for enhanced dashboard performance and scalability by visiting the SingleStore booth.
Using SingleStore and Spark for Machine Learning

At Spark Summit in San Francisco, we highlighted our PowerStream showcase application, which processes and analyzes data from over 2 million sensors on 200,000 wind turbines installed around the world. We sat down with one of our PowerStream engineers, John Bowler, to discuss his work on our integrated SingleStore and Apache Spark solutions.

What is the relationship between SingleStore and Spark?

At its core, SingleStore is a database engine, and Spark is a powerful option for writing code to transform data. Spark is a way of running arbitrary computation on data either before or after it lands in SingleStore.

The first component of SingleStore and Spark integration is the SingleStore Spark Connector, an open-source library. Using the connector, we are able to use Spark as the language for writing distributed computations, and SingleStore as a distributed processing and storage engine. For those familiar with Spark, here is how the SingleStore Spark Connector allows tight integration between SingleStore and Spark:
- Using `SingleStoreContext.sql("SELECT * FROM t")`, you can create a DataFrame in Spark that is backed by a SingleStore table. When you string together a bunch of SparkSQL operations and call collect() on the result, these DataFrame operations will actually run in the SingleStore database engine as direct SQL queries. This can give a major performance boost due to the SQL-optimized nature of SingleStore.
- Using `df.saveToSingleStore()`, you can take a DataFrame and persist it to SingleStore easily.

The second component of SingleStore and Spark integration is Streamliner, which is built on top of the Spark Connector. Streamliner enables you to use Spark as a high-level language to create Extract, Transform, Load (ETL) pipelines that run on new data in real time. We built Streamliner around a ubiquitous need to ingest data as fast as possible and query the information instantly.
With Streamliner, you can write the logic of your real-time data analytics pipeline, such as parsing documents, scoring a machine-learning model, or whatever else your business requires, and instantly apply it to your SingleStore cluster. As soon as you have raw analytics data available for consumption, you can process it, see the results in a SQL table, and act on it.

What type of customer would benefit from the SingleStore Streamliner product?

A customer who is already using Kafka to collect real-time information streaming from different sources can use Streamliner out of the box. Without writing any code, you can take all the data in a Kafka topic and append it to a SingleStore table instantly. SingleStore will automatically place this in a JSON format by default, so no additional work is required. However, if you want to take semi-structured or unstructured “messages” and turn them into “rows” for SingleStore, you can write arbitrary code in the Streamliner “Transform” step. Streamliner also allows you to do this inside the web browser console.

Consider this example: suppose you want to make a dashboard that will monitor data from your entire company and produce real-time visualizations or alerts. Your production application is inserting into a production database, emitting events, or outputting logs. You can optimize this dashboard application by taking all of this data and routing it to a distributed message queue such as Kafka, or writing it directly to a SingleStore table. You can then write your data-transformation or anomaly-detection code in Spark. The output of this is data readily available in SingleStore for any SQL-compatible Business Intelligence tool, your own front-end application, or users in your company running ad-hoc queries.

What is PowerStream?

PowerStream is a showcase application that we built on top of Streamliner. It’s an end-to-end pipeline for high-throughput analytics and machine learning.
We have a simulation of 20,000 wind farms (200,000 individual turbines) around the world in various states of disrepair. We use this simulation to generate sample sensor data, at a rate of 1 to 2 million data points per second. Using a co-located Kafka-Spark-SingleStore cluster, we take these raw sensor values and run them through a set of regression models to determine 1) how close each turbine is to failing, and 2) which part is wearing down.

In your opinion, what is the most interesting part of the PowerStream showcase application?

I am personally interested in the data science use case. PowerStream demonstrates how we can deploy a machine learning model to a cluster of nodes, and “run” the model on incoming data, writing the result to SingleStore in real time. Data science is a big field, and running machine learning models in production is an important part, but of course not the whole picture. Data exploration, data cleaning, feature extraction, and model validation, both interactively (offline) and in production (online), are all parts of a complete data science workflow.

Watch the SingleStore PowerStream session at Spark Summit with CTO and Co-founder, Nikita Shamgunov.
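The pipeline John describes, raw sensor messages in and wear scores out, can be approximated in a few lines. Everything below is a hypothetical stand-in: the field names, coefficients, and failure threshold are ours, and the real PowerStream system runs regression models inside Spark rather than plain Python.

```python
import json

# Invented linear model: wear score from vibration and temperature readings.
COEFFICIENTS = {"vibration": 0.6, "temperature": 0.01}
FAILURE_THRESHOLD = 0.9

def transform(message):
    """Streamliner-style Transform step: raw Kafka message in, scored row out."""
    reading = json.loads(message)
    score = round(sum(coeff * reading[field] for field, coeff in COEFFICIENTS.items()), 4)
    return (reading["turbine_id"], score, score >= FAILURE_THRESHOLD)

row = transform('{"turbine_id": "t-42", "vibration": 1.4, "temperature": 8.0}')
print(row)  # ('t-42', 0.92, True) -- this turbine is close to failing
```

In the real pipeline this function would run per-message inside the Transform step, and the resulting rows would land in a SingleStore table for dashboards and ad-hoc SQL.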