In the two previous posts we saw how disk IO and network IO affect our ETLs. For both cases we covered several techniques that can drastically improve performance and lead to more efficient resource usage:
- Avoid disk IO altogether.
- Use the buff/cache properly if disk IO cannot be avoided.
- Optimize data downloads by choosing the right file format, using Keep-Alive properly, and parallelizing network operations.
In this post we are going to put together network and processing operations to see the improvement in a complete workflow.
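As a rough idea of what combining the two can look like, here is a minimal Python sketch, not the actual workflow from this series. It downloads several resources in parallel and processes each payload in memory as soon as it arrives, so nothing is written to disk. The names `fetch`, `download_and_process`, `fetcher` and `max_workers` are hypothetical, and the default `fetch` is only a placeholder: it opens a new connection per request, so a fetcher built on a shared, Keep-Alive-capable HTTP client would be the better choice in practice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, Tuple, TypeVar
import urllib.request

T = TypeVar("T")


def fetch(url: str) -> bytes:
    """Placeholder fetcher (assumption): one new connection per request."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_and_process(
    urls: Iterable[str],
    process: Callable[[bytes], T],
    fetcher: Callable[[str], bytes] = fetch,
    max_workers: int = 8,
) -> Iterator[Tuple[str, T]]:
    """Yield (url, processed result) pairs in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front so they run concurrently.
        futures = {pool.submit(fetcher, url): url for url in urls}
        # Process each payload in memory as soon as its download finishes,
        # while the remaining downloads are still in flight.
        for future in as_completed(futures):
            yield futures[future], process(future.result())
```

Passing the fetcher as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to try out with a fake download function.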