From 2ad6f4d374aad331271fb32c66158fc5dd88d15e Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Sun, 9 May 2021 12:48:23 -0400
Subject: [PATCH 1/7] Adding Cylon under out of core

---
 doc/source/ecosystem.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index d53d0556dca04..03cec831169e2 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -405,6 +405,11 @@ Blaze provides a standard API for doing computations with various
 in-memory and on-disk backends: NumPy, pandas, SQLAlchemy, MongoDB, PyTables,
 PySpark.
 
+`Cylon `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cylon is a fast, scalable, distributed memory parallel runtime with a Pandas like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache Arrow format to represent the data in-memory. Cylon DataFrame API implements most of the core operators of Pandas such as merge, filter, join, concat, group-by, drop_duplicates, etc. These operators are designed to work across thousands of cores to scale applications. It can interoperate with Pandas DataFrame by reading data from Pandas or convert data to Pandas so users can selectively scale parts of their Pandas DataFrame applications.
+
 `Dask `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 

From df2ed43ae3823fcf2610666167bc8ab6b620727b Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Sun, 9 May 2021 13:17:17 -0400
Subject: [PATCH 2/7] DOC: Fixed a type in cylon decscription

---
 doc/source/ecosystem.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index 03cec831169e2..19676a5177a60 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -408,7 +408,7 @@ PySpark.
 `Cylon `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Cylon is a fast, scalable, distributed memory parallel runtime with a Pandas like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache Arrow format to represent the data in-memory. Cylon DataFrame API implements most of the core operators of Pandas such as merge, filter, join, concat, group-by, drop_duplicates, etc. These operators are designed to work across thousands of cores to scale applications. It can interoperate with Pandas DataFrame by reading data from Pandas or convert data to Pandas so users can selectively scale parts of their Pandas DataFrame applications.
+Cylon is a fast, scalable, distributed memory parallel runtime with a Pandas like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache Arrow format to represent the data in-memory. Cylon DataFrame API implements most of the core operators of Pandas such as merge, filter, join, concat, group-by, drop_duplicates, etc. These operators are designed to work across thousands of cores to scale applications. It can interoperate with Pandas DataFrame by reading data from Pandas or converting data to Pandas so users can selectively scale parts of their Pandas DataFrame applications.
 
 `Dask `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From d75e3b09169ba988e611cf0665d884dc0c2213eb Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Tue, 11 May 2021 20:02:14 -0400
Subject: [PATCH 3/7] DOC: Adding a Cylon example and Style fixes

---
 doc/source/ecosystem.rst | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index 19676a5177a60..a2dc05d1dd660 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -406,9 +406,34 @@ in-memory and on-disk backends: NumPy, pandas, SQLAlchemy, MongoDB, PyTables,
 PySpark.
 
 `Cylon `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cylon is a fast, scalable, distributed memory parallel runtime with a pandas
+like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache
+Arrow format to represent the data in-memory. Cylon DataFrame API implements
+most of the core operators of pandas such as merge, filter, join, concat,
+group-by, drop_duplicates, etc. These operators are designed to work across
+thousands of cores to scale applications. It can interoperate with pandas
+DataFrame by reading data from pandas or converting data to pandas so users
+can selectively scale parts of their pandas DataFrame applications.
+
+.. code:: python
+
+    from pycylon import read_csv, DataFrame, CylonEnv
+    from pycylon.net import MPIConfig
+
+    # Initialize Cylon distributed environment
+    config: MPIConfig = MPIConfig()
+    env: CylonEnv = CylonEnv(config=config, distributed=True)
+
+    df1: DataFrame = read_csv('/tmp/csv1.csv')
+    df2: DataFrame = read_csv('/tmp/csv2.csv')
+
+    # Using 1000s of cores across the cluster to compute the join
+    df3: Table = df1.join(other=df2, on=[0], algorithm="hash", env=env)
+
+    print(df3)
+
-Cylon is a fast, scalable, distributed memory parallel runtime with a Pandas like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache Arrow format to represent the data in-memory. Cylon DataFrame API implements most of the core operators of Pandas such as merge, filter, join, concat, group-by, drop_duplicates, etc. These operators are designed to work across thousands of cores to scale applications. It can interoperate with Pandas DataFrame by reading data from Pandas or converting data to Pandas so users can selectively scale parts of their Pandas DataFrame applications.
 
 `Dask `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From ba959d5f9e38a06a7a467358676bc3abd4ea8d3c Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Tue, 11 May 2021 20:09:19 -0400
Subject: [PATCH 4/7] DOC: Removed extra line break

---
 doc/source/ecosystem.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index a2dc05d1dd660..c439260f20602 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -434,7 +434,6 @@ can selectively scale parts of their pandas DataFrame applications.
 
     print(df3)
 
-
 `Dask `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 

From de0e0c7010011cd50d3fd8a46e9084695cedccec Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Tue, 11 May 2021 20:37:19 -0400
Subject: [PATCH 5/7] DOC: Removed spaces in blank lines

---
 doc/source/ecosystem.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index c439260f20602..e33ecf7dd0236 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -428,10 +428,10 @@ can selectively scale parts of their pandas DataFrame applications.
 
     df1: DataFrame = read_csv('/tmp/csv1.csv')
     df2: DataFrame = read_csv('/tmp/csv2.csv')
-
+
     # Using 1000s of cores across the cluster to compute the join
     df3: Table = df1.join(other=df2, on=[0], algorithm="hash", env=env)
-
+
     print(df3)
 
 `Dask `__

From a6b3db79fdaf4aab0d40a21714d231dda52043e6 Mon Sep 17 00:00:00 2001
From: Chathura Widanage <7312649+chathurawidanage@users.noreply.github.com>
Date: Tue, 11 May 2021 22:53:04 -0400
Subject: [PATCH 6/7] DOC: Removed a whitespace

---
 doc/source/ecosystem.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index e33ecf7dd0236..b5248935d7514 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -422,7 +422,7 @@ can selectively scale parts of their pandas DataFrame applications.
     from pycylon import read_csv, DataFrame, CylonEnv
     from pycylon.net import MPIConfig
 
-    # Initialize Cylon distributed environment
+    # Initialize Cylon distributed environment
     config: MPIConfig = MPIConfig()
     env: CylonEnv = CylonEnv(config=config, distributed=True)
 

From b4c9ba377ee5164bc7ed2ecf39638fb4e554da14 Mon Sep 17 00:00:00 2001
From: Chathura Widanage
Date: Wed, 12 May 2021 00:18:01 -0400
Subject: [PATCH 7/7] remove trailing spaces

---
 doc/source/ecosystem.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index b5248935d7514..bc2325f15852c 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -408,13 +408,13 @@ PySpark.
 `Cylon `__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Cylon is a fast, scalable, distributed memory parallel runtime with a pandas
-like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache
-Arrow format to represent the data in-memory. Cylon DataFrame API implements
-most of the core operators of pandas such as merge, filter, join, concat,
-group-by, drop_duplicates, etc. These operators are designed to work across
-thousands of cores to scale applications. It can interoperate with pandas
-DataFrame by reading data from pandas or converting data to pandas so users
+Cylon is a fast, scalable, distributed memory parallel runtime with a pandas
+like Python DataFrame API. ”Core Cylon” is implemented with C++ using Apache
+Arrow format to represent the data in-memory. Cylon DataFrame API implements
+most of the core operators of pandas such as merge, filter, join, concat,
+group-by, drop_duplicates, etc. These operators are designed to work across
+thousands of cores to scale applications. It can interoperate with pandas
+DataFrame by reading data from pandas or converting data to pandas so users
 can selectively scale parts of their pandas DataFrame applications.
 
 .. code:: python
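
The documented example passes `algorithm="hash"` to `join`, which names a hash join. The following is a rough single-process sketch of that technique in plain Python. It is not Cylon's implementation, which the patch describes as C++ on Apache Arrow and distributed across cores. The names `hash_join`, `left_rows` and `right_rows` are hypothetical, and reading `on=[0]` as "join on the first column" is an assumption.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Tuple  # assumption for this sketch: a row is a plain tuple of column values


def hash_join(left: Sequence[Row], right: Sequence[Row], on: int = 0) -> List[Row]:
    """Inner-join two sequences of rows on the column at index `on` (hypothetical helper)."""
    # Build phase: index the right-hand rows by their join key.
    index: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in right:
        index[row[on]].append(row)

    # Probe phase: look up each left-hand row's key in the index.
    joined: List[Row] = []
    for row in left:
        for match in index.get(row[on], []):
            # Keep the whole left row, then the right row without its key column.
            joined.append(row + match[:on] + match[on + 1:])
    return joined


left_rows = [(1, "a"), (2, "b"), (3, "c")]
right_rows = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
print(hash_join(left_rows, right_rows, on=0))
```

The build and probe phases each make one pass over their input, which is why hash joins are a common choice when neither side is sorted. How Cylon lays out the joined columns may differ from this sketch.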