The top 5 reasons a data engineer should use Presto

By DIVYA LOKESH posted Fri April 12, 2024 10:20 AM

  

Open source software has become very widely used over the last decade or so. It’s not just developers and engineers adopting open source: companies ranging from the Fortune 500 to 10-person startups rely on open source software for many different parts of their business. In fact, as Ted Dunning recently put it, “every company in the world that uses software uses open source software”.

In this blog, we'll share more about the open source SQL engine Presto and why thousands of engineers and companies use Presto for ad hoc and interactive analytics. In fact, we'll give you five reasons why you should use Presto in your data platform.

At a high level, the Presto engine was built to process data in memory, minimizing disk input/output (I/O) during query execution. Because Presto is largely agnostic to the storage you use, you can quickly join or aggregate datasets across a range of data sources and query them through a single, unified view of your data.

Why does this matter? If you can read data faster, your queries run faster, which is always a good thing when business analysts, executives, and customer reports depend on results being available regularly.

Let’s look at 5 reasons why you should consider Presto for your data platform.

1. Query federation

This architecture depicts a federated query across multiple data sources with Presto

Presto provides a single unified SQL dialect that abstracts all supported data sources. This is a powerful feature that eliminates the need for users to understand the connections and SQL dialects of the underlying systems.

With Presto, we can write queries that join multiple disparate data sources without moving the data; everything is queried through the same ANSI SQL.
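To make the idea concrete, here is a minimal sketch of what a federated query can look like. Presto addresses tables as catalog.schema.table, so a single statement can reference tables that live in different systems. The catalog, schema, table, and column names below (hive.sales.orders, mysql.crm.customers, and so on) are hypothetical and only serve as an illustration; the snippet just builds the SQL text, which you would then submit through whichever Presto client you use.

```python
# Hypothetical catalog, schema, table, and column names; adjust to your deployment.
def qualified(catalog, schema, table):
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


orders = qualified("hive", "sales", "orders")       # e.g. files in S3 or HDFS
customers = qualified("mysql", "crm", "customers")  # e.g. an operational database

# One SQL statement joins both sources; no data is copied beforehand.
federated_sql = f"""
SELECT c.region, SUM(o.amount) AS total_amount
FROM {orders} AS o
JOIN {customers} AS c ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_amount DESC
""".strip()

print(federated_sql)
```

The only thing that tells the two sources apart in the query is the catalog prefix; the rest is ordinary SQL.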

2. Fast, reliable, and efficient

Data infrastructure costs can explode, especially with proprietary systems such as data warehouses (e.g., Snowflake, Redshift), as data sizes and user workloads grow.

Presto is battle-tested at Meta and Uber and can scale to meet growing data sizes and workloads. It is optimized for large numbers of small queries, which makes it fast and efficient for interactive work, so you can query data at better price-performance than with proprietary systems.

3. Presto is not tied to storage

Because Presto’s design separates the query engine layer from your storage layer (HDFS, MySQL, S3, etc.), you can scale either layer independently, depending on what your workload needs. Storage and compute are not tied to one another, which gives data engineers and architects tremendous flexibility.
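In practice, this separation shows up in how data sources are registered: each one is described to Presto as a catalog in a small properties file, independent of the engine itself. The sketch below renders two such files from Python dictionaries. The property keys are the standard ones for the Hive and MySQL connectors, but the host names, user, and password are placeholder assumptions, not values from any real deployment.

```python
# Placeholder hosts and credentials (assumptions for illustration only).
catalogs = {
    "hive": {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": "thrift://metastore.example.internal:9083",
    },
    "mysql": {
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://mysql.example.internal:3306",
        "connection-user": "presto_reader",
        "connection-password": "change-me",
    },
}


def render_catalog(props):
    """Render a dict of settings as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


for name, props in catalogs.items():
    # Each catalog would live in its own file under etc/catalog/.
    print(f"# etc/catalog/{name}.properties")
    print(render_catalog(props))
```

Adding or removing a storage system is then a matter of adding or removing a catalog file, without changing how the compute layer is sized.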

4. Unified SQL interface

SQL is one of the oldest and most widely used languages for data analysis. Analysts, data engineers, and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, Power BI, and Looker.

Presto can query data not just from distributed file systems, but also from other sources: NoSQL stores like Cassandra and Elasticsearch, relational databases, and even message queues like Kafka. All of these data sources are queryable with standard ANSI SQL.

5. Open Source

Presto is open source. The project is hosted by the Presto Foundation under the Linux Foundation, a gold standard for open source projects. It is neutrally governed, meaning no one company or individual can dictate or control the roadmap. In fact, the Presto Foundation is made up of a consortium of industry leaders (Meta, Uber, Alibaba, Twitter, Intel, HPE and many more) who work together to advance the project. For many, an important factor in deciding whether to use an open source project is ensuring that it is not solely corporate-backed and is truly open to the broader community.

Presto: The Open Source SQL Query engine for better insights

We’ve talked about how Presto enables self-service ad hoc analytics on large amounts of data, how you can query data where it lives, how storage and compute scale independently, and how its SQL interface keeps things intuitive. These are the reasons a data engineer should consider using Presto for their data architecture.

At IBM, we’ve built a managed service for Presto on IBM Cloud. Presto is very powerful, but it can also be complex to manage. We abstract those complexities and take care of configuring, tuning, and managing Presto under the hood so you can focus on driving data analytics for your organization. If you want to check it out, we have a trial experience for watsonx.data: get started with the watsonx.data lite plan.

NOTE: This blog was originally published by Arpan Roy on the Ahana site.

