The workshop BOSS'20 will be held in conjunction with the
46th International Conference on
Very Large Data Bases
Held online due to the COVID-19 pandemic, August 31 - September 4, 2020
Due to the current situation with COVID-19, the VLDB Conference 2020 will not be taking place in Tokyo. In line with VLDB, BOSS 2020 will take place virtually on September 4.
Following the great success of the previous BOSS workshops co-located with VLDB since 2015, the sixth Workshop on Big Data Open Source Systems (BOSS'20) will again give a deep-dive introduction to several active, publicly available, open-source systems.
The Zoom and Slack links for the tutorials are available here.
UTC time | CEST time | Activity description
08:00 - 08:15 | 10:00 - 10:15 | Welcome and introduction
08:15 - 09:45 | 10:15 - 11:45 | Tutorial: Managing the end-to-end machine learning lifecycle with MLflow
Building and deploying a machine-learning model can be difficult. Enabling data scientists to reproduce ML pipelines is equally challenging. Moreover, these difficulties can impact data science teams' productivity, leading to a significant waste of time and resources. Getting models up to speed in the first place takes enough effort that it can be easy to overlook long-term management.
What does this involve in practice? In essence, we have to compare the results of different versions of ML models, track what's running where, and redeploy and roll back updated models as needed. Each of these tasks requires its own specific tools, and it is these requirements that make the ML lifecycle so challenging compared to traditional software development lifecycle (SDLC) management.
Presenters: Paulo Gutierrez, Databricks; Denny Lee, Databricks
Links: Full description | Slides | Material | Video
10:00 - 12:00 | 12:00 - 14:00 | Tutorial: Gradoop
Gradoop is a distributed, open-source framework for complex graph analytics built on top of Apache Flink. It provides a graph data model that extends the widespread property graph model with the concept of graph collections and a set of operators. Operators, e.g., for pattern matching, graph transformations, and structural grouping, as well as graph algorithms, can be applied to logical graphs and graph collections. The flexible combination of these operators allows a declarative definition of graph analytical workflows that can be easily executed on a shared-nothing Apache Flink cluster.
In the Gradoop tutorial you will learn about the key ideas in distributed graph analytics. For that purpose, we have prepared several chapters that shed light on how to get started, how to use our analytical operators, and how to execute analytical pipelines with Gradoop. Each chapter contains two parts: first, an introduction and demonstration of the current topic; second, practice time in which you can apply what you have learned to solve example analytical questions. During the hands-on practice time, we are available to help and to answer your questions. At the end of the tutorial session we will take a deeper look at Gradoop's future development goals and our ongoing research.
Presenters: Kevin Gomez, University of Leipzig; Christopher Rost, University of Leipzig
Links: Full description | Slides | Material | Video
15:00 - 15:55 | 17:00 - 17:55 | Keynote: Stateful functions on Apache Flink
Abstract: Orchestration frameworks like Kubernetes have made dealing with stateless applications very easy. But for stateful applications, we are still clinging to the ancient wisdom that state shall be someone else's problem: just put it in a database! Because of that, we are still struggling with the same issues of data consistency and complex failure semantics as decades ago. Developing stateful applications in a scalable and resilient way is still hard, especially when they span multiple (micro)services.
Stream processors, like Apache Flink, have solved similar problems in the area of event processing. By rethinking the relationship between state, messaging, and computation, stream processing applications are scalable and consistent out of the box. Is it possible to bring some of these ideas to the space of general-purpose applications and (micro)services?
The Apache Flink project has recently added a new subproject called "Stateful Functions" (https://statefun.io/) that tries to achieve exactly that. In Stateful Functions, Flink effectively becomes an event-driven database that works together with containerized event-driven functions to form a new building block for scalable and consistent applications.
In this talk, we present the Stateful Functions project. We show how its small change in responsibilities between database and applications goes surprisingly far in solving the problem of consistency and failure semantics for applications, and additionally makes it blend very well with current serverless technologies, like AWS Lambda, Knative, etc.
Speaker: Stephan Ewen, Ververica
Stephan Ewen is one of the original creators and PMC Chair of Apache Flink, and CTO and co-founder of Ververica (founded as "data Artisans", acquired by Alibaba Group). He leads the efforts on systems and architectures for data analytics and distributed applications, powered by stream processing technology. Before working on Apache Flink, Stephan worked on in-memory databases, query optimization, and distributed systems. He holds a Ph.D. from the Berlin University of Technology.
Links: Slides | Video
16:00 - 17:30 | 18:00 - 19:30 | Tutorial: Delta Lake
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this 1.5-hour session you will learn about:
- The data engineering pipeline architecture
- Data engineering pipeline scenarios
- Data engineering pipeline best practices
- How Delta Lake enhances data engineering pipelines
- The ease of adopting Delta Lake for building your data engineering pipelines
Presenters: Kate Sullivan, Databricks; Emma Freeman, Databricks
Links: Full description | Slides | Material | Video
17:30 - 19:30 | 19:30 - 21:30 | Tutorial: JSON Analytics with Apache AsterixDB
Apache AsterixDB is a Big Data Management System (BDMS) with a feature set chosen to target use cases such as web data warehousing and social media data analysis.
This tutorial will explain how Apache AsterixDB's SQL++ language can be used to analyze large bodies of semistructured (JSON) data. The focus will be on the analytical features of SQL++, which include CTEs and functions, aggregation and grouping, advanced grouping (Grouping Sets, Rollup, Cube), and window functions. Attendees will have the opportunity to use AsterixDB and SQL++ hands-on during the tutorial.
Presenters: Michael Carey, Ian Maxon, Till Westmann, Dmitry Lychagin, Phanwadee (Gift) Sinthong, Glenn Galviso
Links: Full description | Slides | Material | Video
Proposals for tutorials are accepted until June 5, 2020 (Extended!)
Accepted presenters will be notified by July 3, 2020
⇒ BOSS'15 on September 4, 2015, in conjunction with VLDB 2015
⇒ BOSS'16 on September 9, 2016, in conjunction with VLDB 2016
⇒ BOSS'17 on September 1, 2017, in conjunction with VLDB 2017
⇒ BOSS'18 on August 27, 2018, in conjunction with VLDB 2018
⇒ BOSS'19 on August 26, 2019, in conjunction with VLDB 2019