Comma Separated Values File Format & Parquet
Image from https://www.pexels.com/@pixabay/
There are many discussions and articles about whether to store data in CSV (Comma-Separated Values) or Parquet file format, covering file size, read and write times, and so on. I feel that the most important difference is that Parquet stores the column data type information with the data. When we use CSV, we lose all this important information.
Let's look at some statistics on file size and on the time to read and write each format.
1. Experiment Setup
The file formats are CSV, Parquet, Parquet (compressed with gzip), and Parquet (compressed with snappy).
We want to measure the following
- The time to write data to the file format
- The time to read data from the file format
- The file size of the file format
The row counts are 10000, 50000, and 100000.
We use Faker to generate dummy data and pandas (with pyarrow) to read and write data.
Each Parquet file is written as a single file, NOT as multiple parts.
Tests are performed on my laptop (MacBook Pro, 2.6 GHz 6-Core Intel Core i7).
The source code is provided here.
2. Results
2.1 File Size (in MBytes)
File type | 10000 rows | 50000 rows | 100000 rows |
---|---|---|---|
CSV | 1.012 | 5.065 | 10.124 |
Parquet | 1.121 | 5.217 | 10.062 |
Parquet gzip | 0.491 | 2.238 | 4.183 |
Parquet snappy | 0.720 | 3.335 | 6.326 |
Observations
- CSV and uncompressed Parquet take up almost the same space.
- gzip has a better compression ratio (compared with snappy).
- File size grows roughly linearly with the number of rows in all four formats.
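To put a number on the compression ratio, the short calculation below divides each compressed size by the uncompressed Parquet size for the same row count. The figures are copied from the table above. The `relative_size` helper is just an illustrative name.

```python
# File sizes in MB, copied from the table above (10000, 50000, 100000 rows).
row_counts = [10000, 50000, 100000]
sizes_mb = {
    "Parquet": [1.121, 5.217, 10.062],
    "Parquet gzip": [0.491, 2.238, 4.183],
    "Parquet snappy": [0.720, 3.335, 6.326],
}


def relative_size(label):
    """Size of `label` as a fraction of uncompressed Parquet, per row count."""
    return [c / u for c, u in zip(sizes_mb[label], sizes_mb["Parquet"])]


for label in ("Parquet gzip", "Parquet snappy"):
    formatted = ", ".join(
        f"{n} rows: {ratio:.0%}"
        for n, ratio in zip(row_counts, relative_size(label))
    )
    print(f"{label}: {formatted}")
```

On these numbers, gzip files come out at roughly 42 to 44 percent of the uncompressed Parquet size, and snappy files at roughly 63 to 64 percent.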
2.2 Time to write the file (in milliseconds)
File type | 10000 rows | 50000 rows | 100000 rows |
---|---|---|---|
CSV | 45.384 | 229.439 | 447.032 |
Parquet | 46.879 | 50.240 | 79.409 |
Parquet gzip | 74.706 | 321.714 | 597.501 |
Parquet snappy | 13.529 | 58.604 | 94.954 |
Observations
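- CSV and Parquet gzip are slow to write (especially the latter).
- Parquet snappy is a good trade-off: the file is compressed, yet it writes much faster than CSV or Parquet gzip.
- Write time grows roughly linearly with the number of rows for CSV, Parquet gzip, and Parquet snappy. Uncompressed Parquet is the exception: it takes 46.879 milliseconds for 10000 rows but only 79.409 milliseconds for 100000 rows.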
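One quick way to check the linearity claim against the table is to compare the write time at 100000 rows with the write time at 10000 rows. With ten times as many rows, perfectly linear scaling would give a factor of 10. The values are copied from the table above. The `scaling_factor` helper is an illustrative name.

```python
# Write times in milliseconds, copied from the table above
# (10000, 50000, 100000 rows).
write_ms = {
    "CSV": [45.384, 229.439, 447.032],
    "Parquet": [46.879, 50.240, 79.409],
    "Parquet gzip": [74.706, 321.714, 597.501],
    "Parquet snappy": [13.529, 58.604, 94.954],
}


def scaling_factor(times):
    """How many times longer 100000 rows take than 10000 rows."""
    return times[-1] / times[0]


factors = {label: scaling_factor(times) for label, times in write_ms.items()}
for label, factor in factors.items():
    print(f"{label}: {factor:.1f}x")
```

CSV lands close to 10x and the two compressed Parquet variants at about 7x to 8x. Uncompressed Parquet is below 2x, which may mean a fixed overhead dominated its 10000-row measurement.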
2.3 Time to read the file (in milliseconds)
File type | 10000 rows | 50000 rows | 100000 rows |
---|---|---|---|
CSV | 24.762 | 110.409 | 242.348 |
Parquet | 28.930 | 54.046 | 145.249 |
Parquet gzip | 15.210 | 57.490 | 124.406 |
Parquet snappy | 12.299 | 46.559 | 106.755 |
Observations
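- Reading small uncompressed Parquet files is not great: at 10000 rows, Parquet takes 28.930 milliseconds versus 24.762 milliseconds for CSV, so reading Parquet is slower than CSV.
- Parquet snappy is the winner again: it is the fastest to read at every row count.
- Read time grows roughly linearly with the number of rows.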
3. Summary
Depending on your use cases, CSV may not be the best option. As the number of rows and columns increases, Parquet (which is a columnar storage format) is a good option.