Comma-Separated Values File Format & Parquet

Image from https://www.pexels.com/@pixabay/

There are many discussions and articles about storing data in CSV (Comma-Separated Values) versus Parquet file format: file size, read and write times, and so on. To me, the most important difference is that Parquet stores the column data types alongside the data. With CSV, we lose all of this information.
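
As a quick illustration, here is a minimal sketch (pandas with pyarrow installed; the column names are made up). A round trip through CSV degrades the dtypes to plain objects, while Parquet preserves them:

    import pandas as pd

    # A small frame with non-string dtypes.
    df = pd.DataFrame({
        "signup": pd.to_datetime(["2023-01-15", "2023-02-20"]),
        "tier": pd.Categorical(["free", "paid"]),
        "score": [1.5, 2.5],
    })

    df.to_csv("users.csv", index=False)
    df.to_parquet("users.parquet")

    print(pd.read_csv("users.csv").dtypes)          # signup and tier come back as object
    print(pd.read_parquet("users.parquet").dtypes)  # datetime64[ns], category, float64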

Let's look at some statistics on file size and on read and write times.

1. Experiment Setup

The file formats are CSV, Parquet, Parquet compressed with gzip, and Parquet compressed with snappy.
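
In pandas terms, the four variants map to the calls below (a sketch; the file names are placeholders, and df stands for any DataFrame):

    import pandas as pd

    df = pd.DataFrame({"a": range(3)})  # stand-in for the benchmark data

    df.to_csv("data.csv", index=False)
    df.to_parquet("data.parquet", compression=None)          # uncompressed Parquet
    df.to_parquet("data_gzip.parquet", compression="gzip")
    df.to_parquet("data_snappy.parquet", compression="snappy")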

We want to measure the following:

  1. The time to write data in each format
  2. The time to read data in each format
  3. The file size in each format

The numbers of rows are 10,000, 50,000, and 100,000.

We use Faker to generate dummy data and pandas (with pyarrow) to read and write data.
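
A sketch of the data generation (the exact schema used in the benchmark is an assumption; the source code linked below has the real one):

    import pandas as pd
    from faker import Faker

    fake = Faker()

    def make_rows(n: int) -> pd.DataFrame:
        # Build n rows of fake personal records.
        return pd.DataFrame([
            {
                "name": fake.name(),
                "address": fake.address(),
                "email": fake.email(),
                "birthdate": fake.date_of_birth(),
            }
            for _ in range(n)
        ])

    df = make_rows(10_000)  # repeat with 50_000 and 100_000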

The Parquet files are written as single files, NOT multi-part.

Tests are performed on my laptop (MacBook Pro, 2.6 GHz 6-Core Intel Core i7).

The source code is provided here.
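
The timing itself can be as simple as wrapping each call in time.perf_counter (a sketch, reusing df from the generation step above; values are in milliseconds):

    import time

    import pandas as pd

    def time_ms(fn) -> float:
        # Wall-clock duration of fn() in milliseconds.
        start = time.perf_counter()
        fn()
        return (time.perf_counter() - start) * 1000

    print("write csv   :", time_ms(lambda: df.to_csv("data.csv", index=False)))
    print("write snappy:", time_ms(lambda: df.to_parquet("data.parquet", compression="snappy")))
    print("read csv    :", time_ms(lambda: pd.read_csv("data.csv")))
    print("read snappy :", time_ms(lambda: pd.read_parquet("data.parquet")))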


2. Results

2.1 File Size (in MB)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |       1.012 |       5.065 |       10.124
Parquet        |       1.121 |       5.217 |       10.062
Parquet gzip   |       0.491 |       2.238 |        4.183
Parquet snappy |       0.720 |       3.335 |        6.326

Observations

  1. CSV and uncompressed Parquet take up almost the same space.
  2. gzip achieves a better compression ratio than snappy.
  3. The chart shows that file size grows linearly with the number of rows.


2.2 Time to write the file (in milliseconds)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |      45.384 |     229.439 |      447.032
Parquet        |      46.879 |      50.240 |       79.409
Parquet gzip   |      74.706 |     321.714 |      597.501
Parquet snappy |      13.529 |      58.604 |       94.954

Observations

  1. Writing CSV and Parquet gzip is slow (especially the latter).
  2. Parquet snappy is a good trade-off: compressed, yet fast to write.
  3. The chart shows that write time grows linearly with the number of rows.


2.3 Time to read the file (in milliseconds)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |      24.762 |     110.409 |      242.348
Parquet        |      28.930 |      54.046 |      145.249
Parquet gzip   |      15.210 |      57.490 |      124.406
Parquet snappy |      12.299 |      46.559 |      106.755

Observations

  1. Reading small Parquet files is not great: at 10,000 rows, Parquet (28.930 ms) is slower than CSV (24.762 ms).
  2. Parquet snappy is the winner again.
  3. The chart shows that read time grows linearly with the number of rows.


3. Summary

Depending on your use case, CSV may not be the best option. As the number of rows and columns grows, Parquet, a columnar storage format, becomes the better choice.
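
One practical consequence of the columnar layout: Parquet can read just the columns you need, skipping the rest on disk, while CSV has to parse every row in full even when you only want one field. A sketch (the column name is hypothetical):

    import pandas as pd

    # Only the "email" column is read from disk; other columns are skipped.
    emails = pd.read_parquet("data.parquet", columns=["email"])

    # CSV can also select columns, but the whole file is still parsed.
    emails_csv = pd.read_csv("data.csv", usecols=["email"])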

