Zane Selvans · @ZaneSelvans
907 followers · 1444 posts · Server social.coop

Now that we're putting all our denormalized output tables and analyses into the DB, we've got a lot more to manage, and are trying to figure out how to best combine existing tools to do it.

GitHub Discussion: github.com/orgs/catalyst-coope

Currently we store column, table, and dataset level information in big JSON-ish data structures, which are converted into objects using @pydantic models based (loosely) on the tabular data package abstractions.
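A minimal sketch of the kind of conversion described above, not PUDL's actual code: nested dict metadata is turned into validated objects shaped loosely like tabular data package resources and fields. Standard-library dataclasses stand in for pydantic models here so the example has no dependencies, and every name in it (`Field`, `Resource`, `Package`, `build_package`, and the sample columns) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Column-level metadata (hypothetical subset of a table schema field)."""
    name: str
    type: str
    description: str = ""


@dataclass(frozen=True)
class Resource:
    """Table-level metadata: a named table, its columns, and its primary key."""
    name: str
    fields: tuple
    primary_key: tuple = ()

    def __post_init__(self):
        # Reject primary key columns that the table does not define.
        known = {f.name for f in self.fields}
        missing = [k for k in self.primary_key if k not in known]
        if missing:
            raise ValueError(f"{self.name}: unknown primary key columns {missing}")


@dataclass(frozen=True)
class Package:
    """Dataset-level metadata: a named collection of tables."""
    name: str
    resources: tuple


def build_package(name, field_meta, resource_meta):
    """Convert JSON-ish dicts into Package / Resource / Field objects."""
    resources = []
    for table_name, spec in resource_meta.items():
        # Column definitions are shared across tables and looked up by name.
        fields = tuple(Field(name=col, **field_meta[col]) for col in spec["fields"])
        resources.append(
            Resource(
                name=table_name,
                fields=fields,
                primary_key=tuple(spec.get("primary_key", ())),
            )
        )
    return Package(name=name, resources=tuple(resources))


# Hypothetical metadata, invented for illustration only.
FIELD_META = {
    "plant_id": {"type": "integer", "description": "Plant identifier."},
    "report_year": {"type": "integer"},
    "capacity_mw": {"type": "number", "description": "Capacity in MW."},
}
RESOURCE_META = {
    "plants": {
        "fields": ["plant_id", "report_year", "capacity_mw"],
        "primary_key": ["plant_id", "report_year"],
    },
}

package = build_package("example", FIELD_META, RESOURCE_META)
```

Defining each column once and referencing it by name from several tables keeps types and descriptions consistent across the whole dataset, which matters more as the number of output tables grows.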

#pudl #metadata #python #frictionlessdata #datadon

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
907 followers · 1439 posts · Server social.coop

The @dagster folks interviewed us and did a write-up of our migration from a messy DIY ETL to their orchestration framework, which has thus far been a very positive experience. Unlike most of their users, we are producing outputs. Very curious to see if other non-profit / open-data users will adopt the platform:

dagster.io/blog/catalyst-coope

#pudl #python #opendata #dataengineering #datadon #energymastodon #opensource #energytransition

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
890 followers · 1378 posts · Server social.coop

As PUDL moves toward distributing only data (and much more of it) rather than expecting everyone to run the software (with its 500+ dependencies...), we're going to deprecate our output management layer.

We see two possible deprecation paths. Should we go slow? Or rip the band-aid off now?

Discussion on GitHub here: github.com/orgs/catalyst-coope

#pudl #opendata #energytransition #opensource #energymastodon #datadon #pydata #energytwitter

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
885 followers · 1349 posts · Server social.coop

I did not realize you can post up to 100GB of data to Kaggle, and that they provide access to computational resources and notebooks.

We're thinking about automatically posting all our data there, and maybe running community competitions to help solve entity matching, anomaly detection, and imputation problems. Is there any downside to doing this?

kaggle.com/datasets/zaneselvan

#kaggle #jupyter #pudl #opendata #machinelearning #datascience #energytransition #energytwitter #energymastodon

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
877 followers · 1308 posts · Server social.coop

A few announcements!

Our migration to @dagster is progressing rapidly. If you use PUDL and run the ETL yourself, and need help getting Dagster set up, feel free to sign up for office hours:

calendly.com/catalyst-cooperat

Or ask for help in our GitHub discussions:

github.com/orgs/catalyst-coope

github.com/orgs/catalyst-coope

#pudl #opensource #opendata #energytransition #energymastodon #datadon #dataengineering #energytwitter

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
860 followers · 1214 posts · Server social.coop

We finally have the whole data pipeline running in @dagster and the visualizations make it very clear where we need to parallelize stuff. 🐌

Anybody else running particularly large or complex DAGs with these tools? We'd love to compare notes.

It would also be cool if there were some way to expose all this information to our users in a read-only form, so they can see what's happening with the nightly builds too.

#pudl #datadon #opendata #opensource

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
844 followers · 1088 posts · Server social.coop

This kind of analysis has been done for the US ISO/RTO regions before, but this is the first publicly available analysis of non-ISO/RTO regions like the Southeast and West. The data illuminates massive opportunities for reduced reliance on coal and increased customer savings over time. A lot of the ongoing plant capital expenses and non-fuel O&M costs come from FERC Form 1 data liberated by

#CatalystCoop #pudl

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
803 followers · 953 posts · Server social.coop

The Open Grid Emissions Initiative, which uses PUDL data as one of its main inputs, is trying to do this for historical analyses. Even if it's never useful for dispatching demand, it'll be hugely valuable for modeling the feasibility of 24/7 renewables.

github.com/singularity-energy/

#CatalystCoop #pudl

Last updated 2 years ago

Zane Selvans · @ZaneSelvans
348 followers · 365 posts · Server social.coop

The 2021 Form 1 data has been more recalcitrant, since they've switched to using XBRL for reporting (after 27 years of Visual FoxPro...), but we're close!

It looks like there's enough structured information in the XBRL taxonomies that we can reproduce all the calculations and tag the data with the relevant FERC accounting categories.

#pudl #finance #utility #ferc1 #XBRL #ferc

Last updated 2 years ago