Chris Wensel · @cwensel
176 followers · 1510 posts · Server fosstodon.org

@Cmastication depends on how you access it? If via a query, only partition on the most common predicates.

Repartitioning data for different access patterns is a key use case behind clusterless and tessellate. See bio for links.

Otherwise yeah, partition via hash to get equal sized bits. Reminds me to add a hash transform to tessellate.
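
A minimal sketch of the hash partitioning idea, not tessellate's implementation. The key names and the partition count are made up for illustration. It uses a stable digest rather than Python's built-in hash(), which is salted per process for strings.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical high-cardinality keys; any evenly distributed field would do.
keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(partition_for(k, 8) for k in keys)
```

With a reasonable hash, the counts per partition come out close to equal, which is the point when no predicate dominates the queries.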

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
170 followers · 1395 posts · Server fosstodon.org

just sayin', if you find chaining sql statements into a data processing dag a bit of a drag, I suggest you spend some time with clusterless

github.com/ClusterlessHQ

declarative decentralized heterogeneous flows (in AWS today)

#clusterless #data #aws

Last updated 1 year ago

Chris Wensel · @cwensel
171 followers · 1385 posts · Server fosstodon.org

gah, finally. now have an arc that will automatically update glue table partitions to include newly arrived partitions.

will clean up the aws s3 logs example to include this tomorrow.

the example will show how to provision a database, table with schema (generated from tessellate), and deploy arcs that convert csv to parquet and keep the table partitions up to date.

#clusterless #aws

Last updated 1 year ago

Chris Wensel · @cwensel
169 followers · 1347 posts · Server fosstodon.org

Asked to start either Gitter or Discord channel for the tessellate project by a team in the UK.

I'd prefer to use Github discussions to keep things co-located, more discoverable, and more async. But I've never actually used it for anything. Or seen it used much.

Gitter would be my second choice. but I see now Gitter is on Matrix. don't yet have an opinion on that.

The Wardley Mapping community just moved to Discord from Slack. Seems to work for that community.

Thoughts?

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
169 followers · 1347 posts · Server fosstodon.org

So clusterless uses the AWS CDK and not terraform. Though I did plan to use the version of the CDK that wraps terraform to manage non-AWS infrastructure.

Curious how that community is going to react.

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
169 followers · 1347 posts · Server fosstodon.org

some time ago I re-implemented Twitter-Snowflake github.com/twitter-archive/sno

Was just about to reboot that effort for use in Tessellate for clusterless, but a quick search found TSID Creator
github.com/f4b6a3/tsid-creator

This is great if you need a locally unique id that fits in 64 bits.
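
For anyone curious what such an id looks like, here is a rough sketch of a Snowflake-style generator. It is not TSID Creator's API and not the re-implementation mentioned above. The 41/10/12 bit split follows the layout commonly described for the original Snowflake, and the class name, epoch, and clock parameter are assumptions for illustration.

```python
import threading
import time


class SnowflakeLikeIds:
    """Ids built from milliseconds since an epoch, a 10-bit node, and a 12-bit sequence."""

    NODE_BITS = 10
    SEQ_BITS = 12

    def __init__(self, node_id, epoch_ms=1_600_000_000_000, clock=None):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id must fit in 10 bits")
        self._node_id = node_id
        self._epoch_ms = epoch_ms  # hypothetical custom epoch
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            # Never step backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << self.SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now += 1
            else:
                self._seq = 0
            self._last_ms = now
            # Stays within 63 bits while elapsed is below 2**41 ms (about 69 years).
            elapsed = now - self._epoch_ms
            return (
                (elapsed << (self.NODE_BITS + self.SEQ_BITS))
                | (self._node_id << self.SEQ_BITS)
                | self._seq
            )
```

Ids from one generator are strictly increasing, and ids from different nodes never collide as long as node ids are assigned uniquely, which is what "locally unique" buys you without any coordination.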

#clusterless #java

Last updated 1 year ago

Chris Wensel · @cwensel
165 followers · 1301 posts · Server fosstodon.org

adding Glue database and table support to clusterless

wish me luck.

fwiw, outside of simply provisioning a database or table with a schema, an Arc will be provided to add partitions to the catalog as they arrive, keeping the table up to date (and blind to corrupt or partial data)

this will make it trivial to have a table around every stage/arc of a pipeline dataset for ad-hoc queries and quality checks.
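
A rough sketch of the partition bookkeeping such an arc needs, assuming Hive-style key=value paths. This is not the clusterless implementation; the function names and paths are hypothetical, and the actual catalog call (for example the Glue BatchCreatePartition API) is left out.

```python
def partition_values(key, table_prefix):
    """Extract ((name, value), ...) pairs from a Hive-style object key."""
    relative = key[len(table_prefix):] if key.startswith(table_prefix) else key
    directories = relative.strip("/").split("/")[:-1]  # drop the file name
    return tuple(tuple(d.split("=", 1)) for d in directories if "=" in d)


def new_partitions(arrived_keys, table_prefix, known):
    """Return partitions seen in arrived_keys that the catalog does not know yet."""
    found = []
    for key in arrived_keys:
        spec = partition_values(key, table_prefix)
        if spec and spec not in known and spec not in found:
            found.append(spec)
    return found


# Hypothetical arrivals, taken from a completed manifest.
arrived = [
    "logs/year=2023/month=07/part-0.parquet",
    "logs/year=2023/month=07/part-1.parquet",
    "logs/year=2023/month=08/part-0.parquet",
]
already_registered = {(("year", "2023"), ("month", "07"))}
to_register = new_partitions(arrived, "logs/", already_registered)
```

Only partitions listed in a completed manifest get registered, so a query never sees a partition that is still being written.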

#aws #clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
163 followers · 1273 posts · Server fosstodon.org

In a dream last night I tried to explain clusterless.

It’s like assembling a train but running it on someone else’s track.

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1230 posts · Server fosstodon.org

I just pushed a new sample application that will continuously convert AWS S3 Access Logs into Apache Parquet. This is an end-to-end real-world application using Clusterless and Tessellate together.

github.com/ClusterlessHQ/aws-s

Check it out and provide feedback.

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1230 posts · Server fosstodon.org

@Hhildebrand this is great news, I really enjoyed using it on the MR simulation.

I think now I'd like to PoC it to simulate latencies/lag and cost-to-serve in data pipelines.

the hard/interesting part is properly instrumenting a clusterless deployment to get real data for the simulator to use.

getting cost estimates before a deployment would be pretty cool.

#clusterless #aws

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1197 posts · Server fosstodon.org

The stack could use some work, still unsure about the workflow manager.

But the scenario testing tools really help build confidence in the pipelines after a bit of refactoring.

esp when run as a github action:
github.com/ClusterlessHQ/clust

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
159 followers · 1189 posts · Server fosstodon.org

a core design element behind clusterless is that all data artifacts must be inventoried and all work is driven by the inventories.

data (files, objects, etc) can arrive at any rate. but once a manifest of arrivals is complete, a pipeline of work can start. and each workload is responsible for providing an inventory of completed work artifacts.

this allows for any new workload to be injected into the system (event dag), and to be backfilled if needed.
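
To make the inventory idea concrete, here is a toy in-memory sketch. It is not how clusterless is built; the class names, the "lot" id, and the URIs are all hypothetical. Each completed manifest triggers the workloads listening to that dataset, each workload publishes its own manifest, and a workload added later is backfilled from the manifests already on record.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Manifest:
    dataset: str
    lot: str      # hypothetical id for one completed batch of arrivals
    uris: tuple


@dataclass
class Workload:
    name: str
    source: str   # dataset this workload listens to
    sink: str     # dataset this workload publishes
    transform: Callable[[str], str]
    done: set = field(default_factory=set)

    def run(self, manifest):
        outputs = tuple(self.transform(uri) for uri in manifest.uris)
        self.done.add(manifest.lot)
        # The workload reports an inventory of what it completed.
        return Manifest(self.sink, manifest.lot, outputs)


class Inventory:
    def __init__(self):
        self.manifests = []   # append-only record of completed manifests
        self.workloads = []

    def publish(self, manifest):
        self.manifests.append(manifest)
        for workload in list(self.workloads):
            if workload.source == manifest.dataset and manifest.lot not in workload.done:
                self.publish(workload.run(manifest))

    def add_workload(self, workload, backfill=True):
        self.workloads.append(workload)
        if backfill:
            # Replay manifests that arrived before this workload existed.
            for manifest in list(self.manifests):
                if manifest.dataset == workload.source and manifest.lot not in workload.done:
                    self.publish(workload.run(manifest))


inventory = Inventory()
inventory.publish(Manifest("raw", "lot-1", ("s3://example-bucket/a.csv",)))
to_parquet = Workload("to-parquet", "raw", "parquet", lambda u: u.replace(".csv", ".parquet"))
inventory.add_workload(to_parquet)  # backfills lot-1
inventory.publish(Manifest("raw", "lot-2", ("s3://example-bucket/b.csv",)))
```

Nothing here polls for files or tracks task state: the manifests are the only source of truth, so a missing manifest is what signals missing data.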

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
159 followers · 1189 posts · Server fosstodon.org

Last night I pushed a new build of Tessellate for download: github.com/ClusterlessHQ/tesse

This release includes MVEL template support in the JSON pipeline definition.

#clusterless #aws #dataengineering

Last updated 1 year ago

Chris Wensel · @cwensel
159 followers · 1189 posts · Server fosstodon.org

guess I'll head to the pool while I wait for all my new Cascading builds to complete. fixes and features for 4.5 and 4.6 coming.

Also hope to have a new tessellate build out this evening with direct support of MVEL expression templates pre-processed in the JSON pipeline file. includes some handy intrinsics for file naming and partitioning.

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1145 posts · Server fosstodon.org

So Tessellate inherits lots of support for various data formats from Cascading
github.com/cwensel/cascading

Even though Parquet dropped Cascading support, we were able to port it over.

Now that Parquet is native to Cascading, it should be easier to add Iceberg support.

This would allow clusterless to convert data as it arrives into Iceberg continuously for use in Athena or other data front-ends.

Anyone interested in a challenge?

#apacheparquet #ApacheIceberg #clusterless #aws #java

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1145 posts · Server fosstodon.org

Still working through this concept, but Tessellate will have native, built-in support for common schema formats that can be referenced by name.

For example, the AWS S3 Access log format: github.com/ClusterlessHQ/tesse

AWS CloudFront logs are next up for support.

Any other log formats that would be nice to have native support for?

#clusterless #aws #dataengineering

Last updated 1 year ago

Chris Wensel · @cwensel
161 followers · 1145 posts · Server fosstodon.org

Pushed a new build of Tessellate at github.com/ClusterlessHQ/tesse

Now has support for some basic transforms like coerce, rename, discard, copy, and insert.
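
For a feel of what these five transforms do, here is a sketch over plain dict records. It only illustrates the semantics as I would guess them from the names; it is not Tessellate's pipeline syntax or implementation, and the field names are made up.

```python
def coerce(record, name, to_type):
    """Convert one field to another type, e.g. str to int."""
    return {**record, name: to_type(record[name])}


def rename(record, old, new):
    """Move a value to a new field name."""
    out = dict(record)
    out[new] = out.pop(old)
    return out


def discard(record, name):
    """Drop a field."""
    return {k: v for k, v in record.items() if k != name}


def copy(record, source, target):
    """Duplicate a value under a second field name."""
    return {**record, target: record[source]}


def insert(record, name, value):
    """Add a field with a literal value."""
    return {**record, name: value}


# Hypothetical log record with everything read as strings.
record = {"status": "200", "ts": "2023-07-01", "debug": "x"}
steps = [
    lambda r: coerce(r, "status", int),
    lambda r: rename(r, "ts", "time"),
    lambda r: discard(r, "debug"),
    lambda r: copy(r, "status", "code"),
    lambda r: insert(r, "source", "s3"),
]
result = record
for step in steps:
    result = step(result)
```

Each step returns a new record, so the original input stays untouched and the steps can be reordered or dropped independently.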

When back from camping, plan to have a working example to share and some reasonable documentation.

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
154 followers · 1105 posts · Server fosstodon.org

Just added some simple console metrics to Tessellate. Tuples read/written and durations. Adding rates is a todo.

github.com/ClusterlessHQ/tesse

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
148 followers · 1010 posts · Server fosstodon.org

I've pushed up the start of the documentation to docs.clusterless.io/

Thanks to the Antora project for providing the doc framework: antora.org

We also now have downloadable package/releases on Github: github.com/ClusterlessHQ/clust

Additional thanks to JReleaser for implementing the packaging functionality: jreleaser.org

#clusterless

Last updated 1 year ago

Chris Wensel · @cwensel
148 followers · 991 posts · Server fosstodon.org

Here is an arc where the workload failed.

clusterless doesn't care about errors as much as people do, but it does care about missing data more.

errors could be a reason for missing data, but there may be other reasons that aren't errors that need human attention.

#clusterless

Last updated 1 year ago