Comma-Separated Values File Format & Parquet

Image from https://www.pexels.com/@pixabay/

There are many discussions and articles about storing data in CSV (Comma-Separated Values) versus Parquet file format: file size, read and write times, and so on. To me, the most important difference is that Parquet stores the column data types alongside the data. With CSV, we lose all of this information.
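
As a quick illustration, here is a minimal sketch (pandas with pyarrow installed; the column names are made up). A round trip through CSV degrades the dtypes to plain objects, while Parquet preserves them:

    import pandas as pd

    # A small frame with non-string dtypes.
    df = pd.DataFrame({
        "signup": pd.to_datetime(["2023-01-15", "2023-02-20"]),
        "tier": pd.Categorical(["free", "paid"]),
        "score": [1.5, 2.5],
    })

    df.to_csv("users.csv", index=False)
    df.to_parquet("users.parquet")

    print(pd.read_csv("users.csv").dtypes)          # signup and tier come back as object
    print(pd.read_parquet("users.parquet").dtypes)  # datetime64[ns], category, float64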

Let's look at some statistics on file size and on read and write times.

1. Experiment Setup

The file formats are CSV, Parquet, Parquet compressed with gzip, and Parquet compressed with snappy.
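
In pandas terms, the four variants map to the calls below (a sketch; the file names are placeholders, and df stands for any DataFrame):

    import pandas as pd

    df = pd.DataFrame({"a": range(3)})  # stand-in for the benchmark data

    df.to_csv("data.csv", index=False)
    df.to_parquet("data.parquet", compression=None)          # uncompressed Parquet
    df.to_parquet("data_gzip.parquet", compression="gzip")
    df.to_parquet("data_snappy.parquet", compression="snappy")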

We want to measure the following:

  1. The time to write data in each format
  2. The time to read data in each format
  3. The file size in each format

The numbers of rows are 10,000, 50,000, and 100,000.

We use Faker to generate dummy data and pandas (with pyarrow) to read and write data.
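
A sketch of the data generation (the exact schema used in the benchmark is an assumption; the source code linked below has the real one):

    import pandas as pd
    from faker import Faker

    fake = Faker()

    def make_rows(n: int) -> pd.DataFrame:
        # Build n rows of fake personal records.
        return pd.DataFrame([
            {
                "name": fake.name(),
                "address": fake.address(),
                "email": fake.email(),
                "birthdate": fake.date_of_birth(),
            }
            for _ in range(n)
        ])

    df = make_rows(10_000)  # repeat with 50_000 and 100_000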

The Parquet files are written as single files, NOT multi-part.

Tests are performed on my laptop (MacBook Pro, 2.6 GHz 6-Core Intel Core i7).

The source code is provided here.
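
The timing itself can be as simple as wrapping each call in time.perf_counter (a sketch, reusing df from the generation step above; values are in milliseconds):

    import time

    import pandas as pd

    def time_ms(fn) -> float:
        # Wall-clock duration of fn() in milliseconds.
        start = time.perf_counter()
        fn()
        return (time.perf_counter() - start) * 1000

    print("write csv   :", time_ms(lambda: df.to_csv("data.csv", index=False)))
    print("write snappy:", time_ms(lambda: df.to_parquet("data.parquet", compression="snappy")))
    print("read csv    :", time_ms(lambda: pd.read_csv("data.csv")))
    print("read snappy :", time_ms(lambda: pd.read_parquet("data.parquet")))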


2. Results

2.1 File Size (in MB)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |       1.012 |       5.065 |       10.124
Parquet        |       1.121 |       5.217 |       10.062
Parquet gzip   |       0.491 |       2.238 |        4.183
Parquet snappy |       0.720 |       3.335 |        6.326

Observations

  1. CSV and uncompressed Parquet take up almost the same space.
  2. gzip achieves a better compression ratio than snappy.
  3. The chart shows that file size grows linearly with the number of rows.


2.2 Time to write the file (in milliseconds)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |      45.384 |     229.439 |      447.032
Parquet        |      46.879 |      50.240 |       79.409
Parquet gzip   |      74.706 |     321.714 |      597.501
Parquet snappy |      13.529 |      58.604 |       94.954

Observations

  1. Writing CSV and Parquet gzip is slow (especially the latter).
  2. Parquet snappy is a good trade-off: compressed, yet fast to write.
  3. The chart shows that write time grows linearly with the number of rows.


2.3 Time to read the file (in milliseconds)

File type      | 10,000 rows | 50,000 rows | 100,000 rows
CSV            |      24.762 |     110.409 |      242.348
Parquet        |      28.930 |      54.046 |      145.249
Parquet gzip   |      15.210 |      57.490 |      124.406
Parquet snappy |      12.299 |      46.559 |      106.755

Observations

  1. Reading small Parquet files is not great: at 10,000 rows, Parquet (28.930 ms) is slower than CSV (24.762 ms).
  2. Parquet snappy is the winner again.
  3. The chart shows that read time grows linearly with the number of rows.


3. Summary

Depending on your use case, CSV may not be the best option. As the number of rows and columns grows, Parquet, a columnar storage format, becomes the better choice.
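
One practical consequence of the columnar layout: Parquet can read just the columns you need, skipping the rest on disk, while CSV has to parse every row in full even when you only want one field. A sketch (the column name is hypothetical):

    import pandas as pd

    # Only the "email" column is read from disk; other columns are skipped.
    emails = pd.read_parquet("data.parquet", columns=["email"])

    # CSV can also select columns, but the whole file is still parsed.
    emails_csv = pd.read_csv("data.csv", usecols=["email"])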

