Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, PrestoDB will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.
What Is Presto In Big Data?
Big data encompasses many different things, including:
- Capturing data
- Storing data
- Analysis
- Search
- Sharing
- Transfer
- Visualization
- Querying
- Updating
Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software.
Presto queries data. Competitors in the space include technologies like Hive, Pig, Hbase, Druid, Dremio, Impala, Spark SQL. Many of the technologies in the querying vertical of big data are designed within or to work directly against the Hadoop ecosystem.
Presto data sources are sources that connect to PrestoDB and that you can query. There are a ton in the PrestoDB ecosystem including AWS S3, Redshift, MongoDB, and many more.
What Is Presto Hive?
Presto Hive typically refers to using PrestoDB with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and typically are stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This data is stored in a database such as MySQL and accessed via the Hive metastore service. Presto MySQL via the Hive connector is able to access both these components. One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly though the Hive CLI.
Does Presto Use Spark?
PrestoDB and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.
Does Presto Use YARN?
PrestoDB is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.