A QuantumBlack team helped bring PySpark support to pandera 👏🏼 we are so proud of this open source contribution and hope to keep them coming!
https://www.kdnuggets.com/2023/08/data-validation-pyspark-applications-pandera.html
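Here's a rough taste of the new API (a sketch only: the column names and checks below are made up, see the article for real examples):
```
from pyspark.sql import SparkSession
import pandera.pyspark as pa
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "widget")], ["id", "name"])

# Hypothetical schema: the columns and checks are illustrative only.
class ProductSchema(pa.DataFrameModel):
    id: T.LongType() = pa.Field(gt=0)
    name: T.StringType() = pa.Field()

validated = ProductSchema.validate(df)
```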
#python #pydata #pandera #pyspark
Unlocking the power of PySpark can be a 🧩. https://github.com/ankur-gupta/pyspark-starter is a 🗺️ to easier installation, w/ instructions for installing the right versions of Python, Java, Scala, & dependencies. Or use the starter code as a template. #Python #PySpark #BigData
New blog post: How to integrate Kedro and Databricks Connect 🔶
In this blog post, our colleague Diego Lira explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE.
https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect
Install it with
```
pip install databricks-connect
```
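Once it's installed, getting a remote session can be this small (a sketch assuming databricks-connect v13+, with cluster details already set up in a Databricks CLI profile or environment variables):
```
# databricks-connect >= 13 style; connection details are assumed to
# come from your Databricks config profile or environment variables.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
spark.sql("SELECT 1").show()  # executes on the remote Databricks cluster
```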
#kedro #python #pydata #datascience #databricks #dbx #spark #pyspark
Just optimised a Spark job to run in 10 minutes instead of 30.
The problem was the #pyspark defaults for its new Adaptive Query Execution (on by default since 3.2.0).
I don't really understand having an adaptive planner if it isn't actually making good decisions. It just means a more complicated system that's harder to tune.
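For anyone curious, these are the knobs I mean (a sketch; the values are illustrative, not recommendations):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn AQE off entirely if its runtime re-planning hurts more than it helps.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Or keep it on and steer it, e.g. the partition size AQE aims for when
# coalescing shuffle partitions (the value here is illustrative).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```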
🔥 Excited to share my latest article, "Lessons Learnt from PySpark Notebooks and Extracting APIs"! 🚀✨
📝 In this article, I explore my experience with PySpark Notebooks and the process of extracting APIs. 💡💻
🔎 Throughout my journey, I encountered challenges and gained valuable insights that I'd like to share with you. Whether you're a PySpark enthusiast or just getting started, there's something here for everyone! 📚🔬
🔗 Read the full article on my website Datameerkat and expand your PySpark skills today! Link: https://datameerkat.com/lessons-learnt-from-pyspark-notebooks-and-exctracting-apis
📖✨ I'd love to hear your thoughts, so feel free to leave a comment, and let's connect. 💬🤝
#DataScience #PySpark #Notebooks #APIs #DataManipulation #DataVisualization #Article #DataMeerkat #PowerBI #MicrosoftFabric
Looking forward to connecting with you all! 🌐📲 Let's dive into the world of PySpark together! 🚀💻
📝 "Bulk load to Elastic Search with PySpark"
👤 Valery C. Briz (@valerybriz)
🔗 https://dev.to/valerybriz/bulk-load-to-elastic-search-with-pyspark-2ohj
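The gist of the approach (a sketch based on the elasticsearch-hadoop connector rather than code from the article; host, port, and index name are placeholders):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "ada"), (2, "grace")], ["id", "name"])

# Requires the elasticsearch-hadoop connector jar on the Spark classpath.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")  # placeholder host
   .option("es.port", "9200")
   .mode("append")
   .save("people"))  # placeholder index name
```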
#pyladies #python #elasticsearch #spark #pyspark #bigdata
Hey Fediverse - does anyone know how to work in Python (PySpark) or Scala with files that do not have a file extension?
I am working with a large number of tab-delimited text files that are produced by a 3rd party and which do not have any file extension.
For example, a file that would logically be called "customerdata.tsv" is instead called simply "customerdata".
In my notebook this works, but only if I manually rename the source file:
```
df = spark.read.csv("customerdata.tsv", sep=r'\t')
```
This does not work:
```
df = spark.read.csv("customerdata", sep=r'\t')
```
I'm hoping to avoid needing to rename all of the ~200 source files to get this to work. My public searching has not produced anything useful - can anyone here point me in the right direction?
Thanks in advance!
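One variant I haven't ruled out yet is spelling the format out instead of using the read.csv shorthand (a sketch, with a placeholder path):
```
# Same read, written out explicitly; the format comes from the reader,
# not from the file extension.
df = (spark.read
      .format("csv")
      .option("sep", "\t")
      .load("customerdata"))
```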
#python #pyspark #scala #Notebooks #help
kedro-datasets 1.4.0 is out! 🔶 With a new SparkStreamingDataSet!
kedro-datasets is a separate PyPI package where Kedro datasets live. ⚠️ Notice that `kedro.extras.datasets` is deprecated and will be removed in Kedro 0.19, so install the new package now!
```
pip install "kedro-datasets==1.4.0"
```
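For a rough idea of the new dataset in code (a sketch: I'm assuming the constructor mirrors the package's other Spark datasets, and the path and format are placeholders):
```
from kedro_datasets.spark import SparkStreamingDataSet

# Assumed constructor arguments; filepath and file_format are placeholders.
stream_ds = SparkStreamingDataSet(
    filepath="data/01_raw/events/",
    file_format="json",
)
streaming_df = stream_ds.load()  # a streaming pyspark DataFrame
```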
#kedro #datascience #python #pydata #spark #pyspark
This afternoon, at last! 🥳
---
RT @draxus
This afternoon we have a fantastic @databricks + #PySpark workshop at @ETSIIT_UGR, led by @nenetto. It promises to be magical... 🧙♂️ https://www.meetup.com/pydatagrx/events/292425237/
@pydatagrx @PyData @NumFOCUS @OSLUGR @python_es
https://twitter.com/draxus/status/1651487583904432131
🤔 I spent an hour randomly jiggling things to unbreak this… who wants to tell me why it did what it did? (I still don't know; I'm just blogging the error to help others)
✍🏻 Blogged: Using Delta from pySpark - java.lang.ClassNotFoundException https://rmoff.net/2023/04/05/using-delta-from-pyspark-java.lang.classnotfoundexception-delta.defaultsource/
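(For the impatient: the classic trigger for a missing delta.DefaultSource is a session built without the Delta package and SQL extensions wired in. A sketch of the usual wiring; the version numbers are illustrative and must match your Spark/Scala build:)
```
from pyspark.sql import SparkSession

# delta-core's Scala suffix (_2.12) and version must match your Spark
# build, or the class won't be found at read/write time.
spark = (SparkSession.builder
         .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())
```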
Just released: PySpark Fingertip Commands Cheat Sheet by shivprasadgadekar
Download it free at http://www.cheatography.com/shivprasadgadekar/cheat-sheets/pyspark-fingertip-commands/?utm_source=mastodon
Here's their description of it: This PySpark cheat sheet is designed for those who want to learn and practice, and is most useful for freshers.
@cheatsheets #CheatSheet #CheatSheets #python #spark #pyspark
The last week has been an intensive self-taught crash course in #PySpark. I use it for data engineering (cleaning, SQL-like operations, etc.). Any functions or tricks I should know about?! Btw, if someone has a good resource for understanding how it does distributed processing, and about #Spark and #DeltaLake in general, I'd appreciate you leaving the references here! 😊
@jake4480 @sysop408 I'm lucky that professionally I use #PySpark and #pandas, which are so much clearer and more expressive. But I've seen some 800-line SQL queries there that terrify me; I suspect they were written by a mad genius, and I hope I never have to debug their output. Even on these simple little queries, if I ever have to change anything it always seems easier to start from scratch.
Check out my latest tutorial on PySpark for Data Science! Learn how to leverage the power of distributed computing and perform large-scale data analysis with ease. Let's dive into the world of big data! #PySpark #DataScience #tutorial
https://www.kdnuggets.com/2023/02/pyspark-data-science.html
I sometimes feel like I'm either very smart or a monkey randomly typing things.
One of those occasions was yesterday modifying a #PySpark function to add some fields to the output, a merge, and some string filtering. I don't know #spark at all and only do #python on occasion. It was mostly copy paste, modify, test. Repeat. It worked.
Smarter people would have read the entire Spark docs; I just got the job done and moved on. Felt smug and stupid at the same time.
Now that I’ve been here a few weeks and found the fire exits, amenities, and snacks, I should probably add an #introduction post.
Hello 👋 I’m Henry, and professionally I do #DataScience and #DataEngineering in the #Aviation industry, mostly with #python and #pyspark
Unprofessionally I spend my time #parenting two children along with my wife @sarajw, and occasionally manage to find time to play #guitar 🎸, improve my use of #emacs, and hack at too many half-forgotten #maker projects.
And today I discovered pyspark.sql.functions.transform(), for all my array cleaning needs #pyspark #bigdata
https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.transform.html#pyspark-sql-functions-transform
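A tiny sketch of what it does (made-up column and data):
```
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["  a ", "b  "],)], ["tags"])

# transform() maps a function over every element of an array column,
# no UDF needed -- here, trimming whitespace from each tag.
cleaned = df.withColumn("tags", F.transform("tags", lambda x: F.trim(x)))
cleaned.show(truncate=False)
```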
Working on a #python thing, hoping to reach a first proper release sometime soon https://github.com/jamiekt/jstark #pyspark
Woof, file compaction with #DeltaLake 1.x is the only way to make it usable. 50-100x performance improvements depending on data and partition sizes. The default merge and write operations are incredibly inefficient. I understand it's been greatly improved in 2.x. We're a couple of months from upgrading the platform, though. #datawarehouse #datalake #spark #pyspark #aws #awsglue
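Concretely, the 1.x compaction pattern I mean looks roughly like this (a sketch; path, partition value, and file count are placeholders):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bin-packing compaction on Delta 1.x: rewrite one partition into fewer
# files without logically changing the data (hence dataChange=false).
(spark.read.format("delta")
    .load("/data/events")
    .where("date = '2023-01-01'")
    .repartition(16)
    .write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "date = '2023-01-01'")
    .option("dataChange", "false")
    .save("/data/events"))
```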