Hacker News

This is exactly my go-to move as well.

pandas.read_hdf has been faster than ray.dataframe.read_csv on the few files I've tested so far. But I imagine the flexibility CSVs have over HDF files (I've never used a Unix command to edit an HDF file, for example) is why this new approach could get some traction.



Try Parquet if your data is tabular. pyarrow and related tools are getting Parquet up to speeds pretty comparable to HDF5, with arguably more flexibility and a better multithreading story.



