About BOSS

The workshop BOSS'20 will be held in conjunction with the 46th International Conference on Very Large Data Bases, held online due to the COVID-19 pandemic, August 31 - September 4, 2020.

Message on COVID-19 (SARS-CoV-2) and BOSS 2020

Due to the current situation with COVID-19, the VLDB Conference 2020 will not be taking place in Tokyo. Following VLDB, BOSS 2020 will take place virtually on September 4.

Workshop Date
  September 4th, 2020

Following the great success of the previous BOSS workshops co-located with VLDB since 2015, the sixth Workshop on Big Data Open Source Systems (BOSS'20) will again provide deep-dive introductions to several active, publicly available, open-source systems.

  • The systems will be presented in tutorials by experts in those systems.
  • The tutorials will cover installation as well as non-trivial usage and examples of the presented systems.

Workshop Program

The Zoom and Slack links for the tutorials are available here.


UTC time | CEST time | Activity
08:00 - 08:15 | 10:00 - 10:15 | Welcome and introduction
08:15 - 09:45 | 10:15 - 11:45 | Tutorial: Managing the end-to-end machine learning lifecycle with MLflow

Building and deploying a machine-learning model is difficult. Enabling data scientists to reproduce ML pipelines is equally challenging, and failing to do so hurts data science teams' productivity, leading to a significant waste of time and resources.

Getting models up and running in the first place is demanding enough that it is easy to overlook long-term management. What does this involve in practice? In essence, we have to compare the results of different versions of ML models to track what is running where, and to redeploy and roll back updated models as needed. Each of these tasks requires its own specific tools, and it is this combination that makes the ML lifecycle so challenging compared to traditional software development lifecycle (SDLC) management.
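
To make the tracking part of this concrete, here is a minimal sketch using MLflow's Java tracking client; it is not taken from the tutorial material, and the tracking-server URL, parameter name, and metric value are placeholders. It records one run so that its parameters and results can later be compared against other runs of the same experiment.

    import org.mlflow.api.proto.Service.RunInfo;
    import org.mlflow.api.proto.Service.RunStatus;
    import org.mlflow.tracking.MlflowClient;

    public class TrackingExample {
        public static void main(String[] args) {
            // Assumes an MLflow tracking server is reachable at this placeholder URL.
            MlflowClient client = new MlflowClient("http://localhost:5000");

            // Start a run, then record what was tried and how well it did,
            // so different model versions can be compared side by side later.
            RunInfo run = client.createRun();
            String runId = run.getRunId();

            client.logParam(runId, "alpha", "0.5");   // hyperparameter for this run
            client.logMetric(runId, "rmse", 0.76);    // evaluation result for this run

            // Mark the run as finished; it now shows up in the tracking UI
            // alongside earlier runs of the same experiment.
            client.setTerminated(runId, RunStatus.FINISHED);
        }
    }

A run tracked like this can then be registered in the MLflow Model Registry, which is where the redeploy-and-rollback workflow mentioned above is managed.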

Presenters:
Paulo Gutierrez, Databricks
Denny Lee, Databricks

Links: Full description | Slides | Material | Video
10:00 - 12:00 | 12:00 - 14:00 | Tutorial: Gradoop

Gradoop is a distributed, open-source framework for complex graph analytics built on top of Apache Flink. It provides a graph data model that extends the widespread property graph model with the concept of graph collections, together with a set of operators. Operators, e.g., for pattern matching, graph transformation, and structural grouping, as well as graph algorithms, can be applied to a logical graph and to graph collections. The flexible combination of these operators allows a declarative definition of graph analytical workflows that can be easily executed on a shared-nothing Apache Flink cluster.
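
To make the operator idea concrete, the following minimal sketch (not taken from the tutorial material; exact package names vary slightly across Gradoop versions, and the paths and property key are placeholders) reads a logical graph from Gradoop's CSV format, applies the structural grouping operator, and writes the summarized graph back out.

    import java.util.Collections;

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.gradoop.flink.io.impl.csv.CSVDataSink;
    import org.gradoop.flink.io.impl.csv.CSVDataSource;
    import org.gradoop.flink.model.impl.epgm.LogicalGraph;
    import org.gradoop.flink.util.GradoopFlinkConfig;

    public class GroupingExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            GradoopFlinkConfig config = GradoopFlinkConfig.createConfig(env);

            // Read a logical graph stored in Gradoop's CSV format (placeholder path).
            LogicalGraph graph =
                    new CSVDataSource("/path/to/input-graph", config).getLogicalGraph();

            // Structural grouping: condense the graph to one summary vertex
            // per distinct value of the "city" property (hypothetical property key).
            LogicalGraph summary = graph.groupBy(Collections.singletonList("city"));

            // Register the sink and trigger the distributed Flink job.
            summary.writeTo(new CSVDataSink("/path/to/output-graph", config));
            env.execute("Gradoop grouping example");
        }
    }

Further operators, e.g., pattern matching or transformation, would be chained between the source and the sink in the same way to form an analytical workflow.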

In the Gradoop tutorial you will learn about the key ideas of distributed graph analytics. For that purpose, we have prepared several chapters that show how to get started, how to use the analytical operators, and how to execute analytical pipelines with Gradoop. Each chapter contains two parts: first, an introduction and demonstration of the current topic, and second, hands-on practice time where you can apply that knowledge to solve example analytical questions. During the hands-on time, we will be available to help and answer your questions. At the end of the tutorial session we will take a deeper look at Gradoop's future development goals and our ongoing research.

Presenters:
Kevin Gomez, University of Leipzig
Christopher Rost, University of Leipzig

Links: Full description | Slides | Material | Video
15:00 - 15:55 | 17:00 - 17:55 | Keynote: Stateful Functions on Apache Flink

Abstract
Orchestration frameworks like Kubernetes have made dealing with stateless applications very easy. But for stateful applications, we are still clinging to the ancient wisdom that state shall be someone else's problem: just put it in a database! Because of that, we are still struggling with the same issues of data consistency and complex failure semantics as decades ago. Developing stateful applications in a scalable and resilient way is still hard, especially when they span multiple (micro)services. Stream processors, like Apache Flink, have solved similar problems in the area of event processing. By rethinking the relationship between state, messaging, and computation, stream processing applications are out-of-the-box scalable and consistent. Is it possible to bring some of these ideas to the space of general-purpose applications and (micro)services? The Apache Flink project has recently added a new subproject called "Stateful Functions" (https://statefun.io/) that tries to achieve exactly that. In Stateful Functions, Flink effectively becomes an event-driven database that works together with containerized event-driven functions to form a new building block for scalable and consistent applications. In this talk, we present the Stateful Functions project. We show how its small change in the responsibilities of database and application goes surprisingly far in solving the problems of consistency and failure semantics for applications, and additionally makes it blend very well with current serverless technologies, like AWS Lambda, Knative, etc.
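
As a small taste of the programming model (illustrative only, not part of the keynote; the function name, state name, and message handling are made up), the sketch below uses the Stateful Functions Java SDK. The function keeps a persisted counter that the Flink runtime stores and checkpoints together with the messaging, instead of pushing that state into an external database.

    import org.apache.flink.statefun.sdk.Context;
    import org.apache.flink.statefun.sdk.StatefulFunction;
    import org.apache.flink.statefun.sdk.annotations.Persisted;
    import org.apache.flink.statefun.sdk.state.PersistedValue;

    public class GreeterFunction implements StatefulFunction {

        // State is declared next to the function; the runtime persists it
        // consistently with the messages the function processes.
        @Persisted
        private final PersistedValue<Integer> seenCount =
                PersistedValue.of("seen-count", Integer.class);

        @Override
        public void invoke(Context context, Object input) {
            // Each invocation is scoped to one logical address (function type + id).
            int seen = seenCount.getOrDefault(0) + 1;
            seenCount.set(seen);

            // Reply to the caller; state updates and messaging stay consistent
            // even across failures, thanks to Flink's checkpointing.
            context.reply("Hello for the " + seen + "th time!");
        }
    }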

Speaker: Stephan Ewen, Ververica
Stephan Ewen is one of the original creators and PMC Chair of Apache Flink, and CTO and co-founder of Ververica (founded as "data Artisans", acquired by Alibaba Group). He leads the efforts on systems and architectures for data analytics and distributed applications, powered by stream processing technology. Before working on Apache Flink, Stephan worked on in-memory databases, query optimization, and distributed systems. He holds a Ph.D. from the Berlin University of Technology.

Links: Slides | Video
16:00 - 17:30 | 18:00 - 19:30 | Tutorial: Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

In this 1.5-hour session you will learn about:
  • the data engineering pipeline architecture,
  • data engineering pipeline scenarios,
  • data engineering pipeline best practices,
  • how Delta Lake enhances data engineering pipelines,
  • and the ease of adopting Delta Lake for building your data engineering pipelines.
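
As a flavor of how Delta Lake plugs into existing Spark code, here is a minimal sketch (not taken from the session material; it assumes a local SparkSession with the delta-core package on the classpath, and the table path is a placeholder). Two writes to the same path become two ACID table versions, and the older one can be read back via time travel.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DeltaExample {
        public static void main(String[] args) {
            // Assumes the delta-core package is on the Spark classpath.
            SparkSession spark = SparkSession.builder()
                    .appName("delta-example")
                    .master("local[*]")
                    .getOrCreate();

            String path = "/tmp/delta-table";  // placeholder location on the data lake

            // Each write to a Delta table is an ACID transaction.
            Dataset<Row> v0 = spark.range(0, 5).toDF();
            v0.write().format("delta").mode("overwrite").save(path);

            // Overwriting records a new table version rather than losing history.
            Dataset<Row> v1 = spark.range(5, 10).toDF();
            v1.write().format("delta").mode("overwrite").save(path);

            // Read the current version, then "time travel" back to the first one.
            spark.read().format("delta").load(path).show();
            spark.read().format("delta").option("versionAsOf", 0).load(path).show();
        }
    }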

Presenters:
Kate Sullivan, Databricks
Emma Freeman, Databricks

Links: Full description | Slides | Material | Video
17:30 - 19:30 | 19:30 - 21:30 | Tutorial: JSON Analytics with Apache AsterixDB

Apache AsterixDB is a Big Data Management System (BDMS) with a feature set chosen to target use cases such as web data warehousing and social media data analysis. Its notable features include:
  • A NoSQL-style data model based on extending JSON with object database concepts;
  • A declarative query language, SQL++, that supports a broad range of queries against multiple semi-structured datasets;
  • A query optimizer for parallel queries and an efficient dataflow execution engine for partitioned-parallel query execution;
  • Partitioned and LSM-based native storage and indexing for large datasets;
  • Support for querying of external data (e.g., data on AWS S3) as well as natively stored data;
  • Rich data type support, including numeric, textual, temporal, and simple spatial data;
  • Secondary indexing through B+ trees, R-trees, and inverted keyword indexes;
  • Basic NoSQL-like transactional capabilities.


This tutorial will explain how Apache AsterixDB's SQL++ language can be used to analyze large bodies of semi-structured (JSON) data. The focus will be on the analytical features of SQL++, which include SQL++ CTEs and functions, aggregation and grouping in SQL++, advanced grouping (Grouping Sets, Rollup, Cube), and window functions. Attendees will have an opportunity to use AsterixDB and SQL++ hands-on during the tutorial.
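
To give a sense of how such analytical queries are submitted, the sketch below posts a small SQL++ query, combining grouping with a window function, to AsterixDB's HTTP query service (which a local instance exposes on port 19002 by default). The dataverse, dataset, and field names are hypothetical and only illustrate the shape of a query; this is not taken from the tutorial material.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SqlppExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical dataverse "Demo" with a dataset "Tweets" of JSON objects:
            // count tweets per language, then rank the languages by that count.
            String statement =
                "SELECT g.lang, g.cnt, RANK() OVER (ORDER BY g.cnt DESC) AS rnk " +
                "FROM (SELECT t.lang AS lang, COUNT(*) AS cnt " +
                "      FROM Demo.Tweets AS t GROUP BY t.lang) AS g;";

            // Queries are sent to the query service as a form-encoded "statement" parameter.
            String body = "statement=" + URLEncoder.encode(statement, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:19002/query/service"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response body is JSON containing the results and execution status.
            System.out.println(response.body());
        }
    }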

Presenters:
Michael Carey
Ian Maxon
Till Westmann
Dmitry Lychagin
Phanwadee (Gift) Sinthong
Glenn Galviso

Links: Full description | Slides | Material | Video

Workshop Organization

Workshop Chairs:
  • Eleni Tzirita Zacharatou, TU Berlin, eleni.tziritazacharatou@tu-berlin.de
  • Pedro Silva, HPI, pedro.silva@hpi.de

Advisory Committee:
  • Tilmann Rabl, HPI
  • Michael Carey, UC Irvine
  • Volker Markl, TU Berlin


Call for Tutorials
  • Important Dates:

    Proposals for tutorials are accepted until June 5, 2020 (Extended!)

    Accepted presenters will be notified by July 3, 2020

    In order to propose a tutorial, please email
    • a short abstract with a brief description of the system,
    • an outline of the planned tutorial,
    • the technology used for the hands-on tutorial,
    • a list of presenters involved,
    • and a link to the website of your system

    to vldb.boss.workshop@gmail.com

    Note that the standard tutorial duration is 1.5 to 2 hours.

Selection Process for Tutorials

The proposals will be evaluated by the chairs and the advisory committee for system readiness, relevance, timeliness, and perceived interest from the conference participants.

Previous Editions

BOSS'15 on September 4, 2015, in conjunction with VLDB 2015

BOSS'16 on September 9, 2016, in conjunction with VLDB 2016

BOSS'17 on September 1, 2017, in conjunction with VLDB 2017

BOSS'18 on August 27, 2018, in conjunction with VLDB 2018

BOSS'19 on August 26, 2019, in conjunction with VLDB 2019