Global AI and Data Science

Should I use Spark for data science?

By Jacques Roy posted Fri December 14, 2018 11:31 AM

  

I was talking to a friend of mine, Robert, about the different tools available for manipulating data and creating models. Our discussion quickly turned to the question:

Should we use Spark or not?

For those who don't remember, Spark made a big splash in the big data world around 2010 as a general-purpose cluster computing system. It describes itself as a unified analytics engine for large-scale data processing.

The following image got a lot of coverage and is still used today on the Spark page.
[Image: Spark vs. Hadoop]
To be more precise, the comparison is against the MapReduce programming model that is included in the Hadoop distribution. The thinking was that we are in the era of big data, and being able to scale to a large cluster of commodity hardware is essential to the success of these types of projects.

The other big benefit of Spark is that it comes with a set of classes that anticipate a lot of the processing that needs to be done. It also now includes a full SQL processing engine.

All that is really nice but do I need it? Of course, the answer is: "it depends".
You may have a lot of corporate data. You may augment it with tons of social media data and other external sources such as weather information or government data.

The thing is that a lot of people don't have a huge amount of data to process at once. It may be in the hundreds of megabytes. A laptop can easily have gigabytes of memory, and a large server can have a few terabytes, so that much data fits comfortably on a single machine without a cluster.
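To make that sizing argument concrete, here is a rough back-of-the-envelope check in Python. It is only a sketch: the function name `fits_in_memory`, the `overhead_factor` of 5 (an assumed allowance for data taking more room in memory than on disk), and the `ram_fraction` of 0.5 are illustrative assumptions, not measured figures.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def fits_in_memory(data_bytes, ram_bytes, overhead_factor=5.0, ram_fraction=0.5):
    """Rough check: does the loaded data fit in a share of the machine's RAM?

    overhead_factor and ram_fraction are assumed rules of thumb; tune them
    for your own data and tools.
    """
    estimated_in_memory = data_bytes * overhead_factor
    usable_ram = ram_bytes * ram_fraction
    return estimated_in_memory <= usable_ram


# Hundreds of megabytes on a laptop with 16 GB of memory.
laptop_ok = fits_in_memory(500 * MB, 16 * GB)

# 200 GB of data on a large server with 2 TB of memory.
server_ok = fits_in_memory(200 * GB, 2 * TB)

# 50 GB of data on the same laptop: this is where a cluster starts to matter.
laptop_too_small = not fits_in_memory(50 * GB, 16 * GB)
```

With these assumed numbers, the first two cases fit on a single machine and the third does not.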

So I see two reasons to use Spark:

  • As insurance: if you need to go to a cluster later, you can do it with little effort
  • As a standard analytics engine
Standardizing on Spark could reduce the number of technologies your team needs to learn. It could also make it easier to move people from one project to another.
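As a minimal sketch of the "insurance" idea, assuming PySpark is installed: the same code can run on a laptop in local mode or against a cluster by changing only the master URL. The helper names (`choose_master`, `build_session`, `average_temperature_by_city`, `run_demo`), the sample weather rows, and the example cluster URL are all hypothetical and only for illustration.

```python
def choose_master(cluster_url=None):
    """Return the cluster URL if one is given, otherwise local mode on all cores."""
    return cluster_url if cluster_url else "local[*]"


def build_session(cluster_url=None, app_name="spark-insurance-sketch"):
    # Imported here so choose_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(choose_master(cluster_url))
        .appName(app_name)
        .getOrCreate()
    )


def average_temperature_by_city(spark, rows):
    """Use Spark's SQL engine on a small DataFrame; unchanged on a cluster."""
    df = spark.createDataFrame(rows, ["city", "temp_c"])
    df.createOrReplaceTempView("weather")
    result = spark.sql(
        "SELECT city, AVG(temp_c) AS avg_temp_c "
        "FROM weather GROUP BY city ORDER BY city"
    )
    return [(row["city"], row["avg_temp_c"]) for row in result.collect()]


def run_demo(cluster_url=None):
    """Run the query locally by default, or on a cluster if a URL is passed."""
    spark = build_session(cluster_url)
    try:
        rows = [("Austin", 30.0), ("Austin", 34.0), ("Oslo", 12.0)]
        return average_temperature_by_city(spark, rows)
    finally:
        spark.stop()
```

Calling `run_demo()` runs everything on the local machine; calling it with a standalone master URL of the form `spark://host:7077` (a placeholder here) would send the same query to a cluster.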

Of course, Spark is not the only technology needed but it can provide a good base.

Take a look at the latest byte-size data science video (on Spark) at: bit.ly/byte-size-data-science