Polyglot Databases Session @BOSS-VLDB

POLYGLOT DATA MANAGEMENT SESSION – September 9th, 2016

Organizers: Marta Patiño (TU Madrid-UPM) & Patrick Valduriez (INRIA)

The traditional labels for databases have become obsolete as new data management technologies emerge that combine traditional and new capabilities and are transforming the database world. Many open research challenges arise in settings where many different data stores coexist, including ecosystems and architectures built on multiple data stores, such as the lambda architecture. The new technologies have brought new data models and query languages better suited to certain kinds of problems, as in key-value data stores, document data stores, and graph databases. In this context, developers must deal with this data diversity in order to get insights and knowledge from the different data stores. Moreover, guaranteeing consistency across data stores has become a major issue due to the polyglot persistence trend. Also, the frontier between OLTP and OLAP databases is blurring, resulting in near-real-time and real-time analytical databases.

This session will present different technologies in this evolving arena of new data management technologies combining multiple capabilities.

This session is part of the BOSS Workshop (http://boss.dima.tu-berlin.de). Following the great success of the first Workshop on Big Data Open Source Systems (BOSS'15), the second Workshop on Big Data Open Source Systems (BOSS'16) will again give a deep-dive introduction into several active, publicly available, open-source systems. The systems will be presented in tutorials by experts in those systems. The tutorials will give details on installation and non-trivial usage examples of the presented systems.

The BOSS workshop is held in conjunction with the

42nd International Conference on

VERY LARGE DATA BASES

New Delhi, India - September 5 - 9, 2016

Session Program (9th September):

12:00 - 12:30	Big Data processing using Polybase. Karthik Ramachandra (Microsoft Gray Systems Lab)
12:30 - 14:00	Lunch Break
14:00 - 14:30	Multistore Systems: Retrospection on CloudMdsQL. Jose Pereira (Univ. do Minho & INESC)
14:30 - 15:00	Exploiting the data center in contemporary commodity boxes: The scaling-in approach. Jignesh Patel (Univ. of Wisconsin-Madison)
15:00 - 15:30	LeanBigData: Blending OLTP and OLAP to Deliver Real-Time Analytical Queries. Ricardo Jimenez-Peris (LeanXcale)

Talk Details:

Title: Big Data processing using Polybase

Speaker: Karthik Ramachandra

Abstract: To make good decisions, business decision makers need to analyze both relational data and other data that is not structured into tables – notably, data stored in Hadoop and other similar Big Data systems. This is difficult to do unless there is an efficient way to process queries that access data across these different types of data stores. PolyBase bridges this gap by operating on data that is external to Microsoft SQL Server. PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server. It allows queries on external data in Hadoop or Azure blob storage. The queries are optimized to push computation to Hadoop when beneficial. The talk will give an overview of PolyBase and describe its architecture in SQL Server 2016. Some of the key technical challenges and design approaches will also be discussed.

Bio: Karthik Ramachandra is a Senior Scientist at the Microsoft Gray Systems Lab in Madison, WI. His areas of research include query processing and optimization in large-scale data management systems.

He completed his Ph.D. in Computer Science at IIT Bombay, where his work focused on improving the performance of database applications using techniques that lie at the intersection of databases and compilers/programming languages. His work received an honorable mention for the 2015 ACM SIGMOD Jim Gray Doctoral Dissertation Award. Prior to his Ph.D., Karthik spent 5 years at ThoughtWorks Inc., where he led teams designing and developing enterprise software systems.

Title: Multistore Systems: Retrospection on CloudMdsQL

Speaker: Jose Pereira (Univ. do Minho, INESC)

Co-authors: B. Kolev, P. Valduriez, C. Bondiombouy, R. Jimenez, R. Pau

Abstract: The proliferation of different cloud data management infrastructures has made multistore systems a major topic in today's cloud landscape. In this paper, we give an overview of the Cloud Multidatastore Query Language (CloudMdsQL) and the implementation of its query engine. CloudMdsQL is a functional SQL-like language capable of querying multiple heterogeneous data stores (relational, NoSQL, HDFS) within a single query that can contain embedded invocations to each data store’s native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores by simply allowing some local data store native queries (e.g., a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g., by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping.

Bio: José Pereira is a Professor in Computer Science at the University of Minho and Senior Researcher at INESC TEC in Portugal. His research is centered on reliable distributed systems, in particular on communication protocols for large-scale systems and on distributed data management. Recently he has been focusing on efficient distributed processing for databases combining SQL and NoSQL. He is also the CTO of LeanXcale, a startup devoted to providing an ultra-scalable database that manages OLTP and OLAP workloads and has polyglot capabilities.

Title: Exploiting the data center in contemporary commodity boxes: The scaling-in approach
Speaker: Jignesh Patel (Univ. of Wisconsin-Madison)

Abstract: Modern servers pack storage and computing power that just a decade ago was spread across a modest-sized cluster. In addition, we are on a technological roadmap in which the storage and compute densities of individual server nodes will continue to increase at a faster rate than the networks that connect the nodes. Thus, we must complement methods that focus on "scaling-out" by also developing methods to "scale-in" to fully exploit the hardware capabilities that are packed in each server node. This is especially true for an important class of real-time in-memory analytic data applications. The recent Apache-incubated Quickstep project focuses on this scaling-in aspect. Quickstep uses novel methods for organizing data (including columnar and hybrid storage organization), template metaprogramming for vectorized query execution, and a query execution paradigm that separates control-flow from data-flow. Collectively, these methods produce more than an order-of-magnitude performance improvement over many existing open-source platforms.

Bio: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison. His papers have been selected as the "best papers in the conference" at VLDB (2012), SIGMOD (2011) and ICDE (2010, 2011). He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997. In 2007 he founded Locomatix, which became part of Twitter in 2013 and seeded the technology that became Heron. Heron now powers all real-time services at Twitter. His last company, Quickstep Tech., was acquired by Pivotal in 2015. He founded the NEST entrepreneurship contest at the U. Wisconsin in 2009. This contest has contributed to the creation of a number of startups that collectively have created over 100 jobs in the city of Madison. Jignesh was named one of the top technology entrepreneurs in Madison in 2013. He also enjoys teaching and is the recipient of the Wisconsin “COW” Teaching Award and the U. Michigan College of Engineering Education Excellence Award. He is an ACM Fellow and serves on the board of Lands’ End and a number of technology startups. He blogs at http://bigfastdata.blogspot.com

Title: LeanBigData: Blending OLTP and OLAP to Deliver Real-Time Analytical Queries

Speaker: Ricardo Jimenez-Peris

Abstract: Traditionally, OLTP and OLAP workloads have been served by different kinds of database systems: transactional databases and data warehouses. This separation has made it necessary to set up a process, known as extract-transform-load (ETL), that copies the data from the operational database into the data warehouse. This process is estimated to cost 80% of the budget of doing business analytics. LeanXcale is a NewSQL database that scales transactional processing in a linear manner to 100s of nodes, which makes it an ultra-scalable OLTP database. Thanks to its ability to scale the OLTP engine as much as needed, an OLAP engine has been built that works over the operational data, thereby delivering real-time analytical queries.

Bio: Ricardo Jiménez Peris is CEO & Co-Founder of LeanXcale. LeanXcale is a startup devoted to providing an ultra-scalable database that manages OLTP and OLAP workloads and has polyglot capabilities. Formerly, he was director of the Distributed Systems Lab at Technical University of Madrid (UPM). He spent his whole scientific career aiming at scaling transactional processing and data management until he found the solution and left the university to incorporate LeanXcale. He is now fully engaged in this new mission to solve the major performance and scalability issues of data management. He is the recipient of two awards: one for LeanXcale as one of the best startup ideas in Idea Challenge 2014, the European startup competition organized by the European startup accelerator EIT Digital, and one for LeanBigData as the best European project coordinated by a Madrid organization.
