Asdrubal Chirinos · @achirinos
0 followers · 2 posts · Server mastodon.cloud

Did you know that Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers? It is a key technology in the world of big data. 🐘🌐
Subscribe to Código ergo sum
achirinos.substack.com/

#apachehadoop #curiosidadestecnologicas

Last updated 1 year ago

Chris Wensel · @cwensel
150 followers · 1067 posts · Server fosstodon.org

For those a little familiar with Cascading, it was originally designed to run on Apache Hadoop, and then Apache Tez, but it also has a local planner.

This lets developers create non-clustered data applications without the dependencies or runtime of Hadoop, Tez, etc.

I've been using the local planner in production for over 5 years now.

But Parquet requires Hadoop libraries. That is ok: there is a shim between the libraries that allows Parquet and S3AFileSystem to be used locally.

#apachehadoop #apachetez

Last updated 1 year ago

Chris Wensel · @cwensel
150 followers · 1066 posts · Server fosstodon.org

A little more color on this announcement:
fosstodon.org/@cwensel/1105490

First, Apache Parquet removed Cascading support, so I had to splice the original source into Cascading. But the ParquetScheme didn't fully honor type information, so there is a new TypedParquetScheme that has native support for JSON and Timestamps.

Second, Parquet requires the Hadoop FileSystem, which means we get the wonderful S3A implementation. But we also get a 331 MB jar dependency with the AWS bundle.

#apacheparquet #cascading #apachehadoop

Last updated 1 year ago