Now that we're putting all our denormalized output tables and analyses into the #PUDL DB, we've got a lot more #metadata to manage, and are trying to figure out how to best combine existing tools to do it.
GitHub Discussion: https://github.com/orgs/catalyst-cooperative/discussions/2546
Currently we store column-, table-, and dataset-level information in big JSON-ish #python data structures, which are converted into objects using @pydantic models based (loosely) on the #FrictionlessData tabular data package abstractions.
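A minimal sketch of that pattern, not PUDL's actual code: a nested dict is turned into validated objects loosely shaped like a Frictionless tabular data resource. Standard-library dataclasses stand in for pydantic here so the example has no dependencies, and every table name, field name, and class below is hypothetical. The allowed types are only a subset of the Frictionless Table Schema field types.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical raw metadata, written as a JSON-ish Python structure.
RAW_METADATA = {
    "name": "example_plants",
    "description": "Illustrative table, not a real PUDL table.",
    "fields": [
        {"name": "plant_id", "type": "integer", "description": "Plant ID."},
        {"name": "capacity_mw", "type": "number", "unit": "MW"},
    ],
    "primary_key": ["plant_id"],
}

# A subset of the Frictionless Table Schema field types.
ALLOWED_TYPES = {"integer", "number", "string", "boolean", "date", "datetime", "year"}


@dataclass(frozen=True)
class Field:
    """Column-level metadata."""

    name: str
    type: str
    description: str = ""
    unit: Optional[str] = None

    def __post_init__(self):
        # Reject anything outside the allowed type vocabulary.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"{self.name}: unsupported type {self.type!r}")


@dataclass(frozen=True)
class Resource:
    """Table-level metadata holding a tuple of Field objects."""

    name: str
    description: str
    fields: tuple
    primary_key: tuple = ()

    def __post_init__(self):
        # Every primary key column must be a declared field.
        names = {f.name for f in self.fields}
        missing = [key for key in self.primary_key if key not in names]
        if missing:
            raise ValueError(f"{self.name}: unknown primary key columns {missing}")

    @classmethod
    def from_dict(cls, raw: dict) -> "Resource":
        """Build a validated Resource from the raw nested dict."""
        return cls(
            name=raw["name"],
            description=raw.get("description", ""),
            fields=tuple(Field(**f) for f in raw["fields"]),
            primary_key=tuple(raw.get("primary_key", ())),
        )


table = Resource.from_dict(RAW_METADATA)
```

Validating at construction time means a typo in the raw structure fails loudly when the metadata is loaded rather than when a table is written.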
#pudl #metadata #python #frictionlessdata #datadon
The @dagster folks interviewed us and did a write-up of our migration of #PUDL from a messy DIY #Python ETL to their orchestration framework, which has thus far been a very positive experience. Unlike most of their users, we are producing #OpenData outputs. Very curious to see if other non-profit / open-data users will adopt the platform:
https://dagster.io/blog/catalyst-cooperative-case-study
#DataEngineering #datadon #EnergyMastodon #OpenSource #EnergyTransition
As #PUDL moves toward distributing only data (and much more of it) rather than expecting everyone to run the software (with its 500+ dependencies...), we're going to deprecate our output management layer.
We see two possible deprecation paths. Should we go slow? Or rip the band-aid off now?
#OpenData #EnergyTransition #OpenSource #EnergyMastodon #datadon #pydata #EnergyTwitter
Discussion on GitHub here: https://github.com/orgs/catalyst-cooperative/discussions/2503
I did not realize you can post up to 100GB of data to #Kaggle, and that they provide access to computational resources and #Jupyter notebooks.
We're thinking about automatically posting all our #PUDL data there, and maybe running community competitions to help solve entity matching, anomaly detection, and imputation problems. Is there any downside to doing this?
#OpenData #MachineLearning #DataScience #EnergyTransition #EnergyTwitter #EnergyMastodon
https://www.kaggle.com/datasets/zaneselvans/catalyst-cooperative-pudl
A few #PUDL announcements!
Our migration to @dagster is progressing rapidly. If you use PUDL and run the ETL yourself, and need help getting Dagster set up, feel free to sign up for office hours:
https://calendly.com/catalyst-cooperative/pudl-office-hours
Or ask for help in our GitHub discussions:
https://github.com/orgs/catalyst-cooperative/discussions
#OpenSource #OpenData #EnergyTransition #EnergyMastodon #datadon #DataEngineering #EnergyTwitter
https://github.com/orgs/catalyst-cooperative/discussions/2475
#matylda #labrador #labradosti #joybrador #labradorretriever #labradors #brownlabrador #blacklabrador #labradorlove #lovelabrador #labdogs #doglovers #labrador_lovers #labradortime #labradorworld #labradorpuppies #labradores #blackbrador #pudlak #pudl
We finally have the whole #PUDL data pipeline running in @dagster and the visualizations make it very clear where we need to parallelize stuff. 🐌 #datadon
Anybody else running particularly large or complex #OpenData / #OpenSource DAGs with these tools? We'd love to compare notes.
It would also be cool if there were some way to expose all this information to our users in a read-only form, so they can see what's happening with the nightly builds too.
This kind of analysis has been done for the US ISO/RTO regions before, but this is the first publicly available analysis of non-ISO/RTO regions like the Southeast and West. The data illuminates massive opportunities for reduced reliance on coal and increased customer savings over time. A lot of the ongoing plant capital expenses and non-fuel O&M costs come from FERC Form 1 data liberated by #CatalystCoop #PUDL.
The Open Grid Emissions Initiative, which uses #CatalystCoop #PUDL data as one of its main inputs, is trying to do this for historical analyses. Even if it's never useful for dispatching demand, it'll be hugely valuable for modeling the feasibility of 24/7 renewables.
The 2021 #FERC Form 1 data has been more recalcitrant, since they've switched to using #XBRL for reporting (after 27 years of Visual FoxPro...), but we're close!
It looks like there's enough structured information in the XBRL taxonomies that we can reproduce all the calculations and tag the data with the relevant FERC accounting categories.
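A rough sketch of what reproducing a calculation could look like, assuming the taxonomy's calculation relationships have already been flattened into a mapping from each parent concept to its weighted children. The concept names, values, and tolerance are made up for illustration and are not taken from the FERC taxonomy.

```python
# Hypothetical calculation relationships: parent -> [(child, weight)].
# A weight of -1.0 means the child is subtracted from the parent total.
CALCULATIONS = {
    "net_utility_plant": [
        ("utility_plant", 1.0),
        ("accumulated_depreciation", -1.0),
    ],
}

# Hypothetical reported facts for a single filing.
REPORTED = {
    "utility_plant": 1000.0,
    "accumulated_depreciation": 250.0,
    "net_utility_plant": 750.0,
}


def rolled_up(concept, calculations, facts):
    """Recompute a concept from its weighted children, recursing to the leaves."""
    children = calculations.get(concept)
    if children is None:
        # Leaf concept: use the reported value directly.
        return facts[concept]
    return sum(
        weight * rolled_up(child, calculations, facts)
        for child, weight in children
    )


def check_calculations(calculations, facts, tolerance=0.5):
    """Return {parent: (reported, recomputed)} for totals that do not match."""
    mismatches = {}
    for parent in calculations:
        recomputed = rolled_up(parent, calculations, facts)
        if abs(recomputed - facts[parent]) > tolerance:
            mismatches[parent] = (facts[parent], recomputed)
    return mismatches


mismatches = check_calculations(CALCULATIONS, REPORTED)
```

The same parent-to-children mapping could also be used to tag each leaf value with the accounting category it rolls up into.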
#pudl #finance #utility #ferc1 #XBRL #ferc