Kedro · @kedro
63 followers · 59 posts · Server social.lfx.dev

A QuantumBlack team helped bring PySpark support to pandera 👏🏼 we are so proud of this open source contribution and hope to keep them coming!

kdnuggets.com/2023/08/data-val

#python #pydata #pandera #pyspark

Last updated 1 year ago

Ankur · @ankur
0 followers · 6 posts · Server masto.ai

Unlocking the power of PySpark can be a 🧩. github.com/ankur-gupta/pyspark is a 🗺️ to easier installation w/ instructions on installing the right versions of Python, Java, Scala, & dependencies. Or, use the starter code as a template.

#python #pyspark #bigdata

Last updated 1 year ago

Kedro · @kedro
62 followers · 56 posts · Server social.lfx.dev

New blog post: How to integrate Kedro and Databricks Connect 🔶

In this blog post, our colleague Diego Lira explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE.

kedro.org/blog/how-to-integrat

Install it with

```
pip install databricks-connect
```

#kedro #python #pydata #datascience #databricks #dbx #spark #pyspark

Last updated 1 year ago

Joel · @j
74 followers · 316 posts · Server moo.nz

Just optimised a Spark job to run in 10 minutes instead of 30.

The problem was the defaults for its new Adaptive Query Execution (on by default since 3.2.0).

I don't really understand having an adaptive planner if it isn't actually making good decisions. It just means a more complicated system that's harder to tune.

#pyspark

Last updated 1 year ago

Štěpán Rešl · @StepanResl
105 followers · 85 posts · Server techhub.social

🔥 Excited to share my latest article, "Lessons Learnt from PySpark Notebooks and Extracting APIs"! 🚀✨

📝 In this article, I explore my experience with PySpark Notebooks and the process of extracting APIs. 💡💻

🔎 Throughout my journey, I encountered challenges and gained valuable insights that I'd like to share with you. Whether you're a PySpark enthusiast or just getting started, there's something here for everyone! 📚🔬

🔗 Read the full article on my website Datameerkat and expand your PySpark skills today! Link: datameerkat.com/lessons-learnt

📖✨ I'd love to hear your thoughts, so feel free to leave a comment, and let's connect. 💬🤝

Looking forward to connecting with you all! 🌐📲 Let's dive into the world of PySpark together! 🚀💻

#datascience #pyspark #Notebooks #apis #datamanipulation #datavisualization #article #datameerkat #powerbi #microsoftfabric

Last updated 1 year ago

SQLAllFather · @SQLAllFather
869 followers · 3343 posts · Server techhub.social

Hey Fediverse - does anyone know how to work in Python (PySpark) or Scala with files that do not have a file extension?

I am working with a large number of tab-delimited text files that are produced by a 3rd party and which do not have any file extension.

For example, a file that would logically be called "customerdata.tsv" is instead called simply "customerdata".


In my notebook this works, but only if I manually rename the source file:

df = spark.read.csv("customerdata.tsv", sep=r'\t')

This does not work:

df = spark.read.csv("customerdata", sep=r'\t')


I'm hoping to avoid needing to rename all of the ~200 source files to get this to work. My public searching has not produced anything useful - can anyone here point me in the right direction?

Thanks in advance!
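
One possible direction, as a minimal sketch rather than a confirmed fix: as far as I know, `spark.read.csv` accepts a list of paths and does not itself require a file extension, so the files could be collected by name and passed in explicitly. The helper names `extensionless_files` and `read_tsv_files` are hypothetical, and the listing step assumes the files sit on a filesystem that Python's `pathlib` can see (cloud storage would need the platform's own listing call instead).

```python
from pathlib import Path


def extensionless_files(folder):
    """Return sorted paths of regular files in `folder` whose names have no extension."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        # Skip names starting with "." or "_", which Spark treats as hidden files.
        if p.is_file() and p.suffix == "" and not p.name.startswith((".", "_"))
    )


def read_tsv_files(spark, paths):
    """Load the given tab-delimited files into a single DataFrame.

    `spark` is assumed to be an existing SparkSession (as in a notebook).
    """
    # DataFrameReader.csv takes one path or a list of paths; sep="\t" is a tab.
    return spark.read.csv(paths, sep="\t")


# Example (hypothetical folder name; needs an active SparkSession called `spark`):
# df = read_tsv_files(spark, extensionless_files("/data/vendor_drop"))
```

If the read still fails with explicit paths, the cause is probably something other than the missing extension, and the full error message would be the next thing to look at.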


#python #pyspark #scala #Notebooks #help

Last updated 2 years ago

Kedro · @kedro
48 followers · 32 posts · Server social.lfx.dev

kedro-datasets 1.4.0 is out! 🔶 With a new SparkStreamingDataSet!

kedro-datasets is a separate PyPI package where Kedro datasets live. ⚠️ Notice that `kedro.extras.datasets` is deprecated and will be removed in Kedro 0.19, so install the new package now!

```
pip install "kedro-datasets==1.4.0"
```

#kedro #datascience #python #pydata #spark #pyspark

Last updated 2 years ago

PyData Granada · @pydatagrx
32 followers · 32 posts · Server masto.ai

This afternoon, at last! 🥳
---
RT @draxus
This afternoon we have a fantastic workshop at @ETSIIT_UGR on @databricks + PySpark, led by @nenetto. It promises to be magical... 🧙‍♂️ meetup.com/pydatagrx/events/29

@pydatagrx @PyData @NumFOCUS @OSLUGR @python_es
twitter.com/draxus/status/1651

#pyspark

Last updated 2 years ago

rmoff 🏃🏻 🍺 🥓 · @rmoff
1126 followers · 745 posts · Server data-folks.masto.host

🤔 I spent an hour randomly jiggling things to unbreak this… who wants to tell me why it did what it did? (I still don't know; I'm just blogging the error to help others)

✍🏻 Blogged: Using Delta from pySpark - java.lang.ClassNotFoundException rmoff.net/2023/04/05/using-del

#pyspark #deltalake #datadon

Last updated 2 years ago

Cheatography · @cheatography
5 followers · 159 posts · Server botsin.space

Just released: PySpark Fingertip Commands Cheat Sheet by shivprasadgadekar

Download it free at cheatography.com/shivprasadgad

Here's their description of it: This PySpark cheat sheet is designed for those who want to learn and practice and is most useful for freshers.

@cheatsheets

#cheatsheet #cheatsheets #python #spark #pyspark

Last updated 2 years ago

Ale Segura · @alesegura
467 followers · 179 posts · Server masto.ai

The last week has been an intensive self-course in PySpark. I use it for data engineering (cleaning, SQL-like stuff, etc.). Any functions or tricks that I should know about? Btw, if someone has a good resource for understanding how it does the distributed processing, and about Spark and Delta Lake in general, I would appreciate you leaving the references here! 😊

#pyspark #spark #deltalake

Last updated 2 years ago

Henry · @hl
82 followers · 267 posts · Server social.lol

@jake4480 @sysop408 I'm lucky that professionally I use PySpark and pandas, which are so much clearer and more expressive. But I've seen some 800-line SQL queries there which terrify me, which I suspect were written by a mad genius, and whose output I hope I never have to debug. Even on these simple little queries, if I ever have to change anything it always seems easier to start from scratch.

#pyspark #pandas

Last updated 2 years ago

Abid · @1abidaliawan
12 followers · 32 posts · Server data-folks.masto.host

Check out my latest tutorial on PySpark for Data Science! Learn how to leverage the power of distributed computing and perform large-scale data analysis with ease. Let's dive into the world of big data!
kdnuggets.com/2023/02/pyspark-

#pyspark #datascience #Tutorial

Last updated 2 years ago

António Domingues · @keyboardpipette
155 followers · 861 posts · Server genomic.social

I sometimes feel like I'm either very smart or a monkey randomly typing things.

One of those occasions was yesterday, modifying a function to add some fields to the output, a merge, and some string filtering. I don't know PySpark at all and only do Python on occasion. It was mostly copy, paste, modify, test. Repeat. It worked.

Smarter people would have read the entire Spark docs; I just got the job done and moved on. Felt smug and stupid at the same time.

#pyspark #spark #python

Last updated 2 years ago

Henry · @hl
37 followers · 81 posts · Server social.lol

Now I’ve been here a few weeks, found the fire exits, amenities and snacks, I should probably add an introduction post.

Hello 👋 I’m Henry, and professionally I do data science and data engineering in the aviation industry, mostly with Python and PySpark.

Unprofessionally I spend my time parenting two children along with my wife @sarajw and occasionally manage to find time to play 🎸 guitar, improve my use of Emacs and hack at too many half-forgotten maker projects.

#introduction #datascience #dataengineering #Aviation #python #pyspark #parenting #guitar #emacs #maker

Last updated 2 years ago

Henry · @hl
34 followers · 58 posts · Server social.lol

#pyspark #bigdata

Last updated 2 years ago

Brett Flippin · @bflipp
85 followers · 567 posts · Server vmst.io

Productive knowledge transfer session today. Getting the team up to speed using Git and moving changes into our environments.

#git #pyspark #etl

Last updated 2 years ago

Jamie Thomson · @jamiet
7 followers · 2 posts · Server hachyderm.io

Working on a thing, hoping to reach first proper release sometime soon github.com/jamiekt/jstark

#python #pyspark

Last updated 2 years ago

Brett Flippin · @bflipp
85 followers · 399 posts · Server vmst.io

Woof, file compaction with Delta Lake 1.x is the only way to make it usable. 50-100x performance improvements depending on data and partition sizes. The default merge and write operations are incredibly inefficient. I understand it's been greatly improved in 2.x. We're a couple of months from upgrading the platform though.

#deltalake #datawarehouse #datalake #spark #pyspark #aws #awsglue

Last updated 2 years ago