Analysing Large Avocado Dataset Using Dask on Low Memory

Abstract

Analysing data on computer systems with low memory has been a challenge for years. Performing even the simplest machine learning tasks on such systems is difficult, and this has severely limited data scientists working on machines with restricted computational resources.

To address this issue, we explore two approaches: incremental learning and the Dask parallel computing library in Python. We also use boosting libraries such as XGBoost and LightGBM to support the incremental learning process when handling big data.

We apply both methods to the same dataset and use the resulting models for prediction.

Having compared the two approaches, we find that Dask performs the computations far more efficiently than the alternative. Dask uses parallel computing and works together with the pandas and NumPy modules to handle such large data smoothly and efficiently, reducing the demand on system resources.

Introduction

The explosion of big data in recent times has prompted the development of systems capable of supporting ultra-low-latency services and real-time data analytics.

In the modern world, the need for parallelization in machine learning algorithms has increased exponentially. [1] This is mainly because of the exponential growth in data sizes and model sizes. Python has long been well suited to data munging and preparation, but less so to data analysis; pandas filled this gap very well. Pandas is an open-source Python library that provides high performance, makes data structures easy to use, and is a great tool for data analysis. It allows the analyst to focus on research rather than programming, and supports flexible reshaping and pivoting of datasets. It is used across academic and commercial domains worldwide, including finance, statistics, economics, neuroscience, web analytics, advertising and more. [2]

Data analysts have relied on tools such as NumPy, pandas, scikit-learn and many other Python libraries because they are effective, efficient, widely trusted and intuitive. However, for large datasets, most of these libraries were not designed to scale beyond a single machine. This forces analysts to rewrite their code in a more scalable tool, often in a different language altogether, which slows down both discovery and computation. [3]

This is the situation for which Dask was developed, and it has strongly influenced how these Python libraries are used. Dask provides a way to scale pandas, NumPy and scikit-learn workflows with minimal rewriting of code. Internally, Dask reuses the data structures of these libraries and mirrors most of their APIs, so it integrates with them very well and even helps them evolve consistently. It also reduces the load on a resource-constrained machine by moving work to a multi-core workstation or a distributed cluster. Dask contains three parallel collections, known as DataFrames, Arrays and Bags, which can store data larger than RAM. All three collections can operate on data partitioned between RAM and the hard disk, as well as data distributed across the nodes of a cluster. Processing data on a multi-core workstation or distributed cluster reduces execution and waiting time, which leaves more time for the analysis itself and hence produces better results. Moreover, in a distributed environment it is possible to combine the memory of several server nodes so that, together, they can hold all the data for very large-scale applications (for example, at Facebook). [4]

Related Work

Pandas also supports computation by breaking a data file into smaller chunks, with optional iteration. While parsing a large file, pandas can process it internally in smaller chunks, which keeps memory usage low. By default, reading a file with pandas loads its entire contents into a single DataFrame, but passing the iterator or chunksize parameter to read_csv returns the data in chunks instead (the related low_memory parameter, which defaults to True, controls the internal chunked parsing). This approach is very useful when a machine has little memory. For example, if only 100 MB of working memory is available, it may be impossible for such a resource-limited system to analyse and compute over a dataset that is too big for it, say one of 10,000 rows. The chunksize parameter lets pandas break the dataset into smaller chunks (for instance, 500 rows per chunk) so the machine can still perform the computations and analysis, making the work feasible even with low memory and a huge dataset. [5]
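A minimal sketch of this chunked-reading pattern is shown below; the chunk size and the aggregation are illustrative only:

import pandas as pd

# Read the CSV in chunks of 500 rows instead of loading it all at once.
total_volume = 0.0
row_count = 0
for chunk in pd.read_csv("avocado.csv", chunksize=500):
    # Each chunk is an ordinary pandas DataFrame that fits in memory.
    total_volume += chunk["Total Volume"].sum()
    row_count += len(chunk)

print("Mean Total Volume:", total_volume / row_count)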

Background Study:

The main technology used in this project is Dask, a flexible parallel computing library for analytic computing. It supports dynamic task scheduling optimized for computation as well as big data collections, and handles large data in a sequentially parallel manner. It can be used for data wrangling and model building on large datasets using small, less powerful machines, together with the pandas DataFrame and NumPy data structures. Dask uses the multiple CPU cores of a single machine to perform parallel computation. To use memory effectively, it keeps the complete dataset on disk and works on chunks of it; intermediate values are deleted as soon as they are no longer needed, freeing space and reducing memory usage.
Challenges with common Data Science Python Libraries (Numpy, Pandas, Sklearn):

Python is a programming language widely used for data analysis. NumPy, scikit-learn and pandas are Python libraries that are easy to understand and are used for performing data science tasks.

When applied to huge datasets or big data, they require a lot of processing and running time because of their high memory requirements, and they are not designed to scale beyond a single machine.

Why Dask:

Dask keeps the complete dataset on disk and works on chunks of it; intermediate values are deleted as soon as possible to free space and reduce memory usage. It can run on a local computer or be scaled up to run on a cluster. With minimal code changes, code can run in parallel and take advantage of the processing power of the system. Its execution and waiting times are lower, leaving more time for analysis. [6]
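As a rough illustration of how little the code changes between a laptop and a cluster, the sketch below runs an out-of-core aggregation with a local scheduler; pointing the Client at a remote scheduler address instead would run the same code on a cluster (the worker counts and block size here are illustrative):

import dask.dataframe as dd
from dask.distributed import Client

# Local scheduler; swap in a scheduler address to target a real cluster.
client = Client(processes=False, n_workers=4, threads_per_worker=2)

# Lazily partition the CSV; nothing is loaded until compute() is called.
df = dd.read_csv("avocadotrain.csv", blocksize=25e6)

# Partitions are processed in parallel and intermediate results are
# discarded once the final aggregate has been combined.
print(df.groupby("year")["Total Bags"].mean().compute())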

Dask has three parallel collections, DataFrames, Bags and Arrays, which can store data larger than RAM. These collections can use data divided between the hard disk and RAM, as well as data distributed across multiple nodes of a cluster. Dask uses TLS/SSL certificates to support encryption and authentication. Because it is resilient and elastic, it can take advantage of new nodes added dynamically and handles the failure of worker nodes gracefully. It also provides a real-time, responsive dashboard showing current progress.

Data Types in Dask:

Dask provides several data types that are distributed versions of existing ones: the DataFrame from pandas, the list from Python and the ndarray from NumPy. In this project we have used three of them: the array, the bag and the DataFrame.

Array:

Fig 1. Multiple Numpy arrays in a grid as Dask Array

A Dask array is formed by grouping together many smaller NumPy arrays, i.e. it is a distributed NumPy array. When the array is created, a chunk size is defined that determines the size of the underlying NumPy arrays. A Dask array can use all the cores of the system, and it makes it possible to work on datasets that occupy more space than the memory available on the machine.
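A small sketch of creating a chunked Dask array (the shape and chunk size are illustrative):

import dask.array as da

# A 100,000 x 100 array split into 10,000-row NumPy blocks; only the
# blocks needed for the current task are held in memory at once.
x = da.random.random((100000, 100), chunks=(10000, 100))

# Operations mirror the NumPy API and run block-by-block across cores.
column_means = x.mean(axis=0)
print(column_means.compute()[:5])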

Dask Data-frame:

Fig 2. Representation of Dask Dataframe

A Dask DataFrame is made up of multiple small pandas DataFrames, created by splitting the large DataFrame row-wise. These smaller DataFrames can live on a single machine or on multiple machines, which makes it possible to store datasets larger than memory. Dask operates by parallelizing work over these existing pandas DataFrames. [7]
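A minimal sketch of splitting an in-memory pandas DataFrame into a partitioned Dask DataFrame (the toy data and partition count are illustrative):

import pandas as pd
import dask.dataframe as dd

# Split an in-memory pandas DataFrame row-wise into partitions.
pdf = pd.DataFrame({"AveragePrice": [1.2, 1.5, 0.9, 1.1],
                    "Total Volume": [100.0, 250.0, 80.0, 120.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# The familiar pandas API is parallelized over the partitions.
print(ddf.npartitions)
print(ddf["AveragePrice"].mean().compute())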

Bag:

A Dask bag parallelizes computation over a Python list-like object whose elements may have many different data types. It is mostly used with semi-structured data such as JSON blobs or log files. Data is read line by line, and a specified number of elements can be inspected with the take method. Bags are useful in Python projects that need operations such as map, filter and fold, and they use Python iterators to work in parallel over data that does not fit in memory.
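A small sketch of the bag API on a toy list of JSON-like records (in practice the records would come from something like db.read_text on log files); the data here is illustrative:

import dask.bag as db

# A bag over a small in-memory list of dictionaries.
records = db.from_sequence(
    [{"type": "organic", "price": 1.5},
     {"type": "conventional", "price": 0.9},
     {"type": "organic", "price": 1.7}],
    npartitions=2)

# map/filter-style operations run lazily and in parallel.
organic_prices = records.filter(lambda r: r["type"] == "organic") \
                        .map(lambda r: r["price"])
print(organic_prices.take(2))        # peek at the first two elements
print(organic_prices.mean().compute())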

Methodology

Many reports suggest that data sizes are growing exponentially and will continue to grow at the same rate or faster in the future. This is not just a statistic; it is the beginning of a revolution that will affect every business and every life around the globe.

With this vast amount of data, high-end systems are needed to process it, but it is not always practical to arrange such systems for the task.

This is the situation where Incremental Learning or Dask provides relief.

1) Data Collection and modification

Installing all the necessary libraries for the task.

!pip -q install --user --upgrade --ignore-installed numpy pandas scipy sklearn
!pip -q install catboost
!pip -q install lightgbm
!pip -q install "dask[complete]"
!pip -q install "dask-ml[complete]"
!pip -q install graphviz
!pip -q install pydot

Importing all the necessary libraries.

import pandas as pd
import numpy as np
from multiprocessing import Pool
from pandas.io.json import json_normalize
from sklearn.pipeline import make_pipeline, Pipeline
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from threading import Thread as trd
import queue
import json
import gc
gc.enable()

Importing the dataset and checking its size and format.

part = pd.read_csv("avocado.csv")
part.shape

column = part.columns
column

Storing the DataFrame in a variable xtrain:

xtrain = pd.DataFrame(part, index=np.arange(len(part)), columns=part.columns)
xtrain.head(2)

Creating a Dictionary of the second and third columns of the dataset to be used for pre-processing.

new_ = xtrain.iloc[:, 1:3].to_dict('list')
new_row = []
for i in range(len(xtrain)):
    dic_ = {}
    for key in new_.keys():
        dic_[key] = new_[key][i]
    new_row.append(str(dic_))
xtrain['Dict'] = new_row

xtrain.head(2)

Fig 3. Dictionary created

The dataset we chose, avocado.csv, was downloaded from kaggle.com and contains 18,249 rows. Since we want to analyse how the workflow behaves on big data, we double the data seven times (giving 2^7 = 128 copies), so the new dataset has 2,335,872 rows. [8]

for i in range(7):
    xtrain = pd.concat([xtrain, xtrain], axis=0, sort=False).reset_index(drop=True)

xtrain.shape
xtrain.reset_index().to_csv("avocadotrain.csv", index=False)

We store this data in a file named avocadotrain.csv.

2) Data Exploration

To get a glimpse of what the dataset looks like, we examine its first few rows.

part = pd.read_csv("avocadotrain.csv", nrows=10)
part.shape

We print the first two rows and check the columns present and their data types.

columns = part.columns

part.head(2)

We group the column names into lists based on the kind of data they contain (numerical, dictionary, etc.), so that the same function can be applied to similar columns.

id_columns = ["index"]
num_columns = ["AveragePrice", "Total Volume", "PUL-4046", "PUL-4225", "PUL-4770", "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "year"]
obj_columns = []
dict_columns = ["Dict"]
complex_columns = ["Date"]

index_col = "index"
columns_visited = ["AveragePrice", "Total Volume"]

Now we explore the data in the Dict column:

col = "Dict"
df = pd.read_csv("avocadotrain.csv", usecols=[12])

df[col] = df[col].map(lambda x: json.loads(x.replace("'", '"')))

column_as_df = json_normalize(df[col])
column_as_df.head()

Fig 4. Parsed data table using Dictionary

Calculating the standard deviation to check whether the values are consistent:

column_as_df.std()

Creating the kernel density estimation graph to visualize the probability density function of the data:
column_as_df.plot.kde()

Fig 5. Average price and Total volume graph

Selecting the columns to be dealt with:

columns_selected = ["AveragePrice", "Total Volume", "PUL-4046", "PUL-4225", "PUL-4770", "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "year"]

We keep a dictionary with the column names as keys and, for each column, a list of the methods applied to it, acting as a per-column pipeline.

preprocessing_pipeline = {col: [] for col in columns_selected}
data = pd.read_csv("avocadotrain.csv")
for col_ in columns_selected:
    rsc = RobustScaler()
    rsc.fit(data[col_].values.reshape(-1, 1))
    preprocessing_pipeline[col_].append(rsc)
preprocessing_pipeline

Fig 6. Pipeline created

We pickle dump the dictionary for future use:

import pickle
with open("pipeline.pickle", "wb") as fle:
    pickle.dump(preprocessing_pipeline, fle)

If the data is too large for the memory of the system we are working on, we can open each column incrementally and apply the functions to it, or we can use Dask instead.

3) Pre-processing

We use the dictionary we created earlier which has columns as keys and methods to be applied on them as values.

This function will be called for each chunk of data when using the incremental learning process; it is not needed for Dask.

def preprocess(df):
    df.reset_index(drop=True, inplace=True)
    # columns_to_drop_completely is assumed to be defined earlier in the notebook
    df = df.drop(columns_to_drop_completely, axis=1)

    for col in dict_columns:
        if col not in df.columns:
            continue
        col_df = json_normalize(df[col])
        col_df.columns = [f"{col}.{subcolumn}" for subcolumn in col_df.columns]
        selected_columns = [c for c in columns if c in col_df.columns]
        to_drop = [c for c in col_df.columns if c not in selected_columns]
        col_df = col_df.drop(to_drop, axis=1)

        df = df.drop([col], axis=1).merge(col_df, right_index=True, left_index=True)

    for col_ in columns_selected:
        rsc = preprocessing_pipeline[col_][0]
        df[col_] = rsc.transform(df[col_].values.reshape(-1, 1)).reshape(1, -1)[0]
    return df

At each incremental step we transform the data with the scalers fitted earlier. Because individual chunks might be missing some values, to be on the safe side the scalers were fitted on the complete data.

From this step we can use either incremental learning (using Pandas) or use Dask.

4a) Using Incremental Learning

We read the data from the file incrementally using the chunksize parameter, which specifies the number of rows to read at a time.

import lightgbm as lgb
import xgboost as xgb
import catboost as cb
from sklearn.model_selection import train_test_split

incremental_dataframe = pd.read_csv("avocadotrain.csv", chunksize=100000)

Now we train the model using LightGBM or XGBoost.

lgb_params = {
    'objective': 'regression',
    'verbosity': 0,
}

xgb_params = {
    'update': 'refresh',
    'process_type': 'update',
    'refresh_leaf': True,
    'silent': True,
}
After each step we save the estimator and pass it on for the next iteration.

lgb_estimator = None
xgb_estimator = None

for df in incremental_dataframe:
    df = preprocess(df)

    xtrain, xvalid, ytrain, yvalid = train_test_split(
        df.drop(['Total Bags'], axis=1), df['Total Bags'])

    lgb_estimator = lgb.train(lgb_params,
                              init_model=lgb_estimator,
                              train_set=lgb.Dataset(xtrain, ytrain),
                              valid_sets=[lgb.Dataset(xvalid, yvalid)],
                              valid_names=["Valid"],
                              early_stopping_rounds=50,
                              keep_training_booster=True,
                              num_boost_round=70,
                              verbose_eval=50)

    del df, xtrain, ytrain, xvalid, yvalid
    gc.collect()

After this we run the loop above with a chunk size of 100,000 and use the trained model to predict the column "Total Bags". We then draw a scatter plot of the relation between the predicted values and the real values.

test = pd.read_csv("avocadotrain.csv", nrows=10000)

preds = lgb_estimator.predict(preprocess(test).drop(['Total Bags'], axis=1))
preds = preprocessing_pipeline['Total Bags'][0].inverse_transform(preds.reshape(-1, 1)).reshape(1, -1)[0]
true = test['Total Bags']
plt.scatter(preds, true)

Fig 7. Scatter Plot between trained and real values

4b) Using Dask

We import the Dask library to handle the big data in a sequentially parallel manner on limited resources.

We define the parallel execution attributes: the number of workers, the threads per worker and the memory limit per worker.

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client(processes=False, threads_per_worker=4, n_workers=4, memory_limit='8GB')
client

Since the memory limit per worker is 8 GB and there are 4 workers, the total cluster memory is 32 GB.

The above is the structure of the client that we have created for the parallel processing task.

Now, we read the data and visualize it using GraphViz library.

df = dd.read_csv("avocadotrain.csv", blocksize=25e6)
df.npartitions
df.visualize(size="7,5!")

Fig 8. Visualizing the data
df.head()

Fig 9. Head of Avocado data

Now, we convert the Dict column into JSON format and define keys to it.

df['Dict'] = df['Dict'].apply(lambda x: json.loads(x.replace("'", '"')), meta=('Dict', 'f8'))
dict_col_keys = {
    'Dict': ['AveragePrice', 'Total Volume']
}
dict_col_keys
for dic_col in dict_col_keys:
    for key in dict_col_keys[dic_col]:
        df[f'{dic_col}.{key}'] = df[dic_col].to_bag().pluck(key).to_dataframe().iloc[:, 0]
df.head()

Then we add two new columns using the parsed JSON values.

Fig 10. Head after adding two new columns using JSON

Now, we again visualize the data.
df.visualize(size="20,10!")

Fig 11. Visualizing data

Having used these columns to build the new ones, we no longer need them, so we drop them.

The new data frame looks like:

columns_to_drop = ['index', 'Dict', 'Dict.AveragePrice', 'Dict.Total Volume', 'Date']
df = df.drop(columns_to_drop, axis=1)
df.head()

Fig 12. Dataset after dropping irrelevant columns

Checking whether there is any null data in the data frame.
df.isnull().sum().compute()

Computing the length of each partition so that the data frame can be converted into an array:

lengths = []
for part in df.partitions:
    l = part.shape[0].compute()
    lengths.append(l)
    print(l, part.shape[1])

Splitting the data frame into two Dask arrays: one with all the columns except 'Total Bags', and one with only that column.

X, y = df.drop(['Total Bags'], axis=1).to_dask_array(lengths=lengths), df['Total Bags'].to_dask_array(lengths=lengths)

The two resulting Dask arrays are X and y.

We transform the data by fitting a scaler to each of the two arrays:

Xo = dask.array.zeros((X.shape[0], 1), chunks=(200000, 1))
from dask_ml.preprocessing import RobustScaler
for i, col_ in enumerate(df.columns):
    if col_ == "Total Bags":
        rsc = RobustScaler()
        y = rsc.fit_transform(y.reshape(-1, 1)).reshape(1, -1)[0]
    else:
        rsc = RobustScaler()
        temp = rsc.fit_transform(X[:, i-1].reshape(-1, 1))
        Xo = dask.array.concatenate([Xo, temp], axis=1)

Now we rechunk both arrays into uniform chunks of 200,000 rows:

Xo = Xo[:, 1:]   # drop the initial zeros column used to seed the concatenation
Xo[-5:].compute()
Xo = Xo.rechunk({1: Xo.shape[1]})
Xo = Xo.rechunk({0: 200000})
y = y.rechunk({0: 200000})

Fig 13. Xo after pre-processing the data

Now we train a model on these arrays, applying a linear regression model to the data.

75% of the dataset is used to train the model.
tr_len = int(0.75 * Xo.shape[0])
xtrain, ytrain = Xo[:tr_len], y[:tr_len]
xvalid, yvalid = Xo[tr_len:], y[tr_len:]
xtrain.shape, ytrain.shape, xvalid.shape, yvalid.shape

from dask_ml.linear_model import LinearRegression
est = LinearRegression()
est.fit(xtrain, y=ytrain)
preds = est.predict(xvalid)

We now draw a scatter plot of the relation between the predicted values and the actual values on the validation data.

preds[0:10].compute()
plt.scatter(preds.compute(), yvalid.compute())

Fig 14. Scatter Plot between trained and real values using Dask

Discussion:

We know that data sizes are increasing exponentially day by day, so to analyse them we need a technology that can work on them quickly and efficiently. For this purpose we have used Dask. It splits huge data into chunks and works on them in parallel using the multiple cores available on the system. This project was carried out in four phases. The first phase was data collection and modification: we collected the dataset and prepared the environment by installing the necessary libraries and packages. In the second phase, data exploration was carried out, which gave the initial characteristics of the dataset. The next phase was data pre-processing: the relevant columns were converted into dictionary format and scalers were fitted to handle any missing or inconsistent values. The last phase was incremental learning and Dask: the huge dataset was split into chunks, which the Dask environment processed in parallel, with 75% of the data used for training. The output clearly shows that Dask can handle very large data even on low-memory machines. We therefore conclude that big data analysis can be done efficiently on low memory using Dask.

References

[1] https://towardsdatascience.com/how-to-learn-from-bigdata-files-on-low-memory-incremental-learning-d377282d38ff

[2] https://towardsdatascience.com/speeding-up-your-algorithms-part-4-dask-7c6ed79994ef

[3] http://docs.dask.org/en/latest/why.html

[4] http://docs.dask.org/en/latest/dataframe-performance.html

[5] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

[6] https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b

[7] https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/

[8] https://www.kaggle.com/neuromusic/avocado-prices

