Dask

Released: October 28, 2018

GitHub issues: 517
GitHub stars: 6,600
Days since last commit: 1
Stack Overflow questions: 2,569

Dask in one line: it works with data that's too big for memory while keeping Pandas syntax.

What is Dask?

Dask is a Python library for parallelizing and scaling data-wrangling programs. It is used to build programs that process big data or have long computation times, and it makes it easy to distribute those workloads across multiple cores, either on a single machine or across a compute cluster.

Dask aims to be both simple and powerful. It implements the same APIs as Pandas, NumPy, and scikit-learn, which many programmers already know well. As a result, Dask code is often faster to write and easier to read than code for other frameworks, such as Spark or Hadoop, that are also commonly used to scale software across compute clusters.
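The snippet below is a minimal sketch of that idea using the public dask.dataframe API; the file name and column names (transactions.csv, customer_id, amount) are hypothetical placeholders, not part of any real dataset.

```python
import dask.dataframe as dd

# Lazily open a CSV as a partitioned Dask DataFrame; the file and
# column names are hypothetical examples.
df = dd.read_csv("transactions.csv")

# Pandas-style syntax builds a task graph instead of computing eagerly...
mean_amount = df.groupby("customer_id")["amount"].mean()

# ...and .compute() runs that graph, in parallel across partitions.
print(mean_amount.compute())
```

If you already know the Pandas version of this code, the only new concepts here are partitions and the final .compute() call.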

What problems does Dask solve?

When choosing a framework for a software development or data analysis task, you usually have to make tradeoffs. Frameworks like Spark often bring you more power and scalability, but at the cost of added complexity. You could call this “making the easy things hard and the hard things easy.” You wouldn’t want to use Spark to read a CSV file of 1000 rows: it would create a lot of overhead and complexity, and it wouldn’t offer any benefits. You’d use Pandas instead. But you wouldn’t want to use Pandas to read and analyze 1 million CSV files per hour. It wasn’t designed to be used on that scale.

Dask aims for a middle ground in this tradeoff: the simplicity and familiarity of Pandas, NumPy, and scikit-learn, combined with Spark-like power to parallelize across cores.
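As a sketch of that middle ground, the example below treats a whole directory of CSV files as one logical DataFrame and runs on all local cores; the glob pattern and the scheduler address in the comment are assumptions for illustration, not a specific deployment.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local "cluster" that uses every core on this machine. Pointing
# Client at a real scheduler (e.g. Client("tcp://scheduler:8786"))
# would distribute the same analysis code across a compute cluster.
cluster = LocalCluster()
client = Client(cluster)

# One call reads many CSV files as a single logical DataFrame;
# the glob pattern is a hypothetical example.
df = dd.read_csv("data/2024-*.csv")

# Aggregations run in parallel across partitions and workers.
print(df["amount"].sum().compute())
```

The design point is that the analysis code stays the same whether it runs on a laptop or a cluster; only the Client configuration changes.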

Used by:
Oxylabs, Data Science, Gitential