FedSearch - Federated network search engine

Asdrubal Chirinos · @achirinos

0 followers · 2 posts · Server mastodon.cloud

Código ergo sum | Asdrúbal Chirinos | Substack

Did you know that Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers? It is a key technology in the big data world. 🐘🌐 #ApacheHadoop #CuriosidadesTecnológicas
Subscribe to Código ergo sum
https://achirinos.substack.com/
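
To make "distributed processing" a bit more concrete: Hadoop's classic programming model is MapReduce. What follows is only a single-process Python sketch of the map, shuffle, and reduce steps for a word count. It does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence, all in one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data in clusters"]
    print(word_count(sample))
```

Each record is mapped independently and each key is reduced independently, which is what lets this kind of job be split across a cluster.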

Last updated 1 year ago

Chris Wensel · @cwensel

150 followers · 1067 posts · Server fosstodon.org

For those a little familiar with Cascading, it was originally designed to run on #ApacheHadoop, and then #ApacheTez, but it also has a local planner.

This lets developers create non-clustered data applications without the Hadoop/Tez dependencies or runtime.

I've been using the local planner in production for over 5 years now.

But Parquet requires Hadoop libraries, and this is ok: there is a shim between the libraries that allows Parquet and S3AFileSystem to be used locally.

Last updated 1 year ago

Chris Wensel · @cwensel

150 followers · 1066 posts · Server fosstodon.org

A little more color on this announcement:
https://fosstodon.org/@cwensel/110549001614086663

First, #ApacheParquet removed #Cascading support, so I had to splice the original source into Cascading. But the ParquetScheme didn't honor type information fully. So there is a new TypedParquetScheme that has native support for JSON and Timestamps.

Second, Parquet requires the #ApacheHadoop FileSystem, which means we get the wonderful S3A implementation. But we also get a 331MB jar dependency with the AWS bundle.

Last updated 1 year ago
