
What is Presto and How Does It Work?

By DIVYA LOKESH posted Fri April 12, 2024 08:21 AM

  


PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against a wide range of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you query data where it lives, across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components.
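To make "query data where it lives" concrete, here is a minimal sketch in Python. It is not Presto code: two independent in-memory SQLite databases stand in for separate sources, and the names (users_db, events_db, clicks_per_user) are hypothetical. The point is only that the engine reads from each source at query time and combines the rows itself, without owning any of the data.

```python
import sqlite3

# Two independent in-memory SQLite databases stand in for separate sources
# (for example a MySQL instance and a Hive warehouse). All names are hypothetical.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")])

events_db = sqlite3.connect(":memory:")
events_db.execute("CREATE TABLE clicks (user_id INTEGER, page TEXT)")
events_db.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/docs"), (2, "/home")],
)


def clicks_per_user(users_conn, events_conn):
    """Join rows from two sources inside the 'engine'; nothing is persisted."""
    # Read the lookup side from the first source.
    names = dict(users_conn.execute("SELECT id, name FROM users"))
    counts = {}
    # Stream the second source and combine it with the lookup in memory.
    for user_id, _page in events_conn.execute("SELECT user_id, page FROM clicks"):
        name = names[user_id]
        counts[name] = counts.get(name, 0) + 1
    return counts


report = clicks_per_user(users_db, events_db)
print(report)
```

In Presto itself, the same join would be written as a single SQL statement that names tables from different catalogs, and the engine would handle reading from each source.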

Is Presto a database?

No, PrestoDB is not a database. You can’t store data in Presto, and it does not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between Presto and other forks?

PrestoDB originated at Facebook and was built specifically for Facebook. It is the original Facebook open source project and is backed by the Linux Foundation’s Presto Foundation. Other versions are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Presto runs inside JVMs. Depending on query size and the complexity of tasks, you can allocate more or less memory to the JVMs. PrestoDB itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data in its buffer cache for the duration of a task. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query disparate data sources.

What is the Presto query execution model?

The query execution model is organized into a few levels: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the query engine, it parses the statement and converts it into a query. PrestoDB executes the query by breaking it up into multiple stages. Stages are then split up into tasks that run across multiple workers. Tasks are the units that do the actual processing, and each task operates on splits, which are sections of a larger data set. Tasks use an Exchange to pass their output to the tasks of other stages.
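The hierarchy above can be sketched as a toy model in Python. This is an illustration under simplifying assumptions, not Presto's implementation: the function names (make_splits, scan_task, final_task, run_query) are hypothetical, a thread pool stands in for worker nodes, and a plain list stands in for the exchange between stages. The query being modeled is a sum of amounts grouped by country.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input rows of (country, amount). In a real deployment the rows
# would live in an external source such as HDFS or MySQL.
ROWS = [("US", 10), ("DE", 5), ("US", 7), ("IN", 3), ("DE", 1), ("IN", 9)]


def make_splits(rows, split_size):
    """Cut the data set into splits, the sections of data a task works on."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def scan_task(split):
    """Leaf-stage task: partial sum of amount per country for one split."""
    partial = Counter()
    for country, amount in split:
        partial[country] += amount
    return partial


def final_task(partials):
    """Downstream-stage task: merge the partial results from the leaf tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def run_query(rows, workers=3, split_size=2):
    splits = make_splits(rows, split_size)
    # The thread pool stands in for worker nodes; the list of partial results
    # stands in for the exchange that moves data between stages.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_task, splits))
    return final_task(partials)


result = run_query(ROWS)
print(result)
```

The two-stage shape (partial aggregation per split, then a final merge) is one common way to fan out a grouped aggregation; the actual plan Presto chooses depends on the query and the data.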

Does Presto Use MapReduce?

Hive’s execution model breaks down a query through MapReduce to work on constituent data in HDFS. PrestoDB similarly breaks down and fans out the work of a given query, but it uses its own mechanism and does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including:

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

Presto data sources are systems that connect to PrestoDB and that you can query. The PrestoDB ecosystem includes many of them, such as AWS S3, Redshift, MongoDB, and more.

What Is Presto Hive?

Presto Hive typically refers to using PrestoDB with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be in different formats and are typically stored in HDFS or an S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This metadata is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, via the Hive connector, is able to access both of these components. One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.
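The split between data files and metadata can be illustrated with a small Python sketch. It is a toy, not the Hive metastore API or the Presto Hive connector: the metastore here is a plain dictionary, the data files are CSV files in a temporary directory, and the names (create_table, scan_table, orders) are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def create_table(metastore, name, columns, location):
    """Register metadata only: the column schema and where the data files live."""
    metastore[name] = {"columns": columns, "location": Path(location)}


def scan_table(metastore, name):
    """Read every data file under the table's location and apply the schema."""
    meta = metastore[name]
    names = [col for col, _ in meta["columns"]]
    casts = [cast for _, cast in meta["columns"]]
    for data_file in sorted(meta["location"].glob("*.csv")):
        with data_file.open(newline="") as handle:
            for raw in csv.reader(handle):
                yield {n: cast(v) for n, cast, v in zip(names, casts, raw)}


with tempfile.TemporaryDirectory() as warehouse:
    # Data files: plain files in a directory, with no schema of their own.
    table_dir = Path(warehouse) / "orders"
    table_dir.mkdir()
    (table_dir / "part-0.csv").write_text("1,US,10.5\n2,DE,4.0\n")
    (table_dir / "part-1.csv").write_text("3,US,7.25\n")

    # Metadata: maps the table name to a schema and a file location.
    metastore = {}
    create_table(
        metastore,
        "orders",
        [("id", int), ("country", str), ("amount", float)],
        table_dir,
    )

    # The "engine" needs both pieces to answer a query.
    rows = list(scan_table(metastore, "orders"))
    us_total = sum(r["amount"] for r in rows if r["country"] == "US")

print(rows)
print(us_total)
```

Because the table definition lives apart from the files, more than one engine can read the same table, which is why the same Hive-defined table can be queried from Presto or from Hive's own engine.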

Does Presto Use Spark?

PrestoDB and Spark are two different query engines. At a high level, Spark supports complex, long-running queries, while Presto is better for short interactive queries. This article provides a good high-level overview comparing the two engines.

Does Presto Use YARN?

PrestoDB does not depend on YARN as a resource manager. Instead, it uses a very similar architecture, with dedicated Coordinator and Worker nodes that do not need Hadoop infrastructure in order to run.

At IBM, we’ve built a managed service for Presto on IBM Cloud. Presto is very powerful, but it’s also complex and can be complicated to manage. We abstract those complexities and take care of configuring, tuning, and managing Presto under the hood so you can focus on driving data analytics for your organization. If you want to check it out, we have a trial experience for watsonx.data: get started with the watsonx.data lite plan.


