I am developing a research application that requires very fast analysis of very large tabular data from sequencing experiments. While I eventually settled on #rstats #datatable, someone kindly suggested I check out what #porn IT does.
The porno servers handle a massive amount of data in real time, executing complex queries in response to what users are watching. That is a problem at least 3 orders of magnitude larger than mine. Here is what the pros do:
https://news.ycombinator.com/item?id=3597891
https://davidwalsh.name/pornhub-interview
Going back to our new #DataNotebook 📓 #ProDataTools 🛠️ development, and updating our generic #DataTableRenderers 🈸 for #VSCode Notebooks 📚 this week.
File your new feature requests and enhancements in our VS Code #DataTable ⊞ repo for now:
Imagine you have a bunch of data points and you want to know how many belong to different categories. This is where grouped counting comes in. We've got three fantastic methods for you to explore, each with its own flair: **`aggregate()`**, **`dplyr`**, and **`data.table`**.
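As a rough illustration of grouped counting, here is a minimal sketch using two of the three approaches, `aggregate()` and `data.table`. The data frame and its `category` column are made up for this example, and dplyr is left out only to keep the dependencies down.

```r
library(data.table)

# Hypothetical example data: one row per observation, with a category label.
df <- data.frame(category = c("a", "b", "a", "c", "b", "a"))

# Base R: aggregate() needs a column to aggregate, so sum a helper column of ones.
df_ones <- transform(df, n = 1L)
base_counts <- aggregate(n ~ category, data = df_ones, FUN = sum)

# data.table: .N is the number of rows in each group.
dt <- as.data.table(df)
dt_counts <- dt[, .(n = .N), by = category]

print(base_counts)
print(dt_counts)
```

Both results hold one row per category with its count in column `n`.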
Happy counting, fellow data explorer! 🎉🔍 #DataAnalysis #RProgramming #ExploreData #dplyr #aggregate #baser #r #rstats #datatable
Post: https://www.spsanderson.com/steveondata/posts/2023-08-10/
Group percentages in R with #baser #dplyr and #datatable
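A small sketch of what this could look like, assuming "group percentage" means each group's share of all rows. The `group` column and its values are invented for illustration, and only the base R and data.table variants are shown.

```r
library(data.table)

# Hypothetical data: five observations in two groups.
dt <- data.table(group = c("x", "x", "y", "y", "y"))

# data.table: count rows per group, then convert counts to percentages.
dt_pct <- dt[, .(n = .N), by = group][, pct := 100 * n / sum(n)]

# Base R: prop.table() turns a table of counts into proportions.
base_pct <- 100 * prop.table(table(dt$group))

print(dt_pct)
print(base_pct)
```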
#R #RStats #opensource
Spent 3 hours this evening trying to parse a deeply nested JSON file and convert it to an R data.table. Thought I'd encountered enough JSON data to be able to handle anything thrown at me but had to admit temporary defeat - I'll try again tomorrow. Maybe I'm just rusty with parsing JSON, or the two beers I had after work addled my brain. Anyone know of any good resources for handling deeply nested JSON? #rstats #json #datatable
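One possible starting point, assuming the file is an array of records whose nesting consists of objects rather than ragged arrays: `jsonlite::fromJSON()` with `flatten = TRUE` spreads nested objects into dot-named columns, and the result converts directly to a data.table. The JSON string and its field names below are made up for illustration; real files with nested arrays will likely need more work than this.

```r
library(jsonlite)
library(data.table)

# Hypothetical nested JSON: each record holds an object inside an object.
json_txt <- '[
  {"id": 1, "sample": {"name": "s1", "meta": {"depth": 30}}},
  {"id": 2, "sample": {"name": "s2", "meta": {"depth": 45}}}
]'

# flatten = TRUE turns nested objects into columns such as sample.meta.depth.
flat <- fromJSON(json_txt, flatten = TRUE)
dt <- as.data.table(flat)

print(names(dt))
print(dt)
```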
Occasionally, I think about how to work effectively with #rstats. Currently, I am teaching my #bioinformatics courses with #RKWard again. I try to do most of it with packages from the base installation. #datatable is an exception. But otherwise, I like to use #within (very fast) instead of #mutate.
But there are more approaches, which are often simpler, faster, or more stable:
- https://github.com/matloff/TidyverseSkeptic/blob/master/RDesign.pdf
- https://davidhughjones.medium.com/dont-forget-non-tidyverse-solutions-979c870c7f3e
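To illustrate the `within()` approach mentioned above, here is a minimal base R sketch that adds a derived column without any extra packages. The column names are hypothetical, and no timing claim is made here.

```r
# Hypothetical data: read counts and feature lengths in kilobases.
df <- data.frame(reads = c(120, 300, 80), length_kb = c(1.2, 3.0, 0.8))

# within() evaluates the expression inside the data frame and returns a
# modified copy, so new columns can refer to existing ones by name.
df <- within(df, {
  reads_per_kb <- reads / length_kb
})

print(df)
```

This plays the same role as `mutate(df, reads_per_kb = reads / length_kb)` in dplyr.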
Running #SQL queries in our new #DataNotebook 📓 @code extension, rendering results with simple #DataTable, #DataSummary 🈷️ & #FlatDataGrid 中 from our #DataTableRenderers 🈸 + CSV #DataExport from VSCode Notebook cell output all in one go. We doubt #MalloyData is as flexible. 😎 #DataTools 🔬 ...
I recently posted a benchmark of reading in a .csv file, but received an email over the weekend pointing out that I had omitted compressed inputs such as csv.gz files.
Functions tested in the benchmark:
✅ read.table
✅ read.csv
✅ fread
✅ vroom with altrep=false
✅ vroom with altrep=true
✅ read_csv
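For a sense of how such a comparison could be set up, here is a small sketch that times two of the readers on a gzip-compressed CSV. It is not the benchmark from the post: the test data is generated on the spot, `system.time()` stands in for a proper benchmarking package, and the `fread()` step assumes the data.table and R.utils packages are installed (it is skipped otherwise).

```r
# Hypothetical test data written to a temporary gzip-compressed CSV.
df <- data.frame(id = 1:1000, value = rnorm(1000))
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
write.csv(df, con, row.names = FALSE)
close(con)

# Base R: read the compressed file through a gzfile() connection.
t_base <- system.time(base_df <- read.csv(gzfile(path)))
print(t_base)

# fread() can read .gz files directly, but relies on R.utils to decompress.
if (requireNamespace("data.table", quietly = TRUE) &&
    requireNamespace("R.utils", quietly = TRUE)) {
  t_fread <- system.time(dt <- data.table::fread(path))
  print(t_fread)
}
```

A single run on 1,000 rows says little about speed; larger files and repeated runs would be needed for a meaningful comparison.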
Post: https://www.spsanderson.com/steveondata/posts/rtip-2023-03-27/
#data #help #softwaredevelopment #compression #gz #r #rstats #vroom #datatable #readr #tidyverse #baser #opensource #innovation #technology #software #benchmarking
Today I wanted to share some out-of-the-box benchmarking for reading in a square #matrix in #r. The idea behind this was to see how fast the default settings were for reading in these various files.
Post: https://www.spsanderson.com/steveondata/posts/rtip-2023-03-24/
#r #rstats #vroom #fst #arrow #datatable #opensource #opensourcesoftware #software #softwareengineering #technology #innovation
The original data.table function I wrote was slower than the original solution of tidy_bernoulli(), but thanks to users from Reddit, LinkedIn, and Mastodon, I got a few great improvements.
🙌 Reddit Help from: https://www.reddit.com/user/NewHere_Hi_everyone/
🙌 LinkedIn Help from: Chris Kypridemos
🙌 Mastodon Help from: @datamaps
Post: https://www.spsanderson.com/steveondata/posts/rtip-2023-03-09/
#innovation #opensource #opensourcesoftware #software #technology #datatable #benchmarking #RStats
I was recently challenged by a LinkedIn connection to get on with data.table. It had been on my radar for a while, but now it has my full interest and attention, so onward with it! Challenge accepted!
Post: https://www.spsanderson.com/steveondata/posts/rtip-2023-03-07/
#datatable #tidydensty #bernoulli #tibble #tidy #r #rstats #opensourcesoftware #opensource #software #softwareengineering #innovation #technology #distributions #improvement #engineering #data #bigdata #dataanalysis
Everybody knows (hopefully) that data.table is great. Today, I noticed that it comes with its own update mechanism.
data.table::update_dev_pkg()
That is really useful if you want a feature from the development version but otherwise prefer a tested package.
Well, that settles it then.
#RStats #dplyr #datatable #chatGPT
(Joking aside, it's spooky how well it responds to all kinds of questions students would throw at it.)
Our #DataTableRenderers 🈸 for #VSCode Notebooks 📚 has over 30,000 installs. It's one of the most widely used #dataNotebook 📓 extensions in the VS Code Marketplace. The extension includes scrollable #dataTable, #flatDataGrid & #dataSummary output renderers. Try it!
📥 https://marketplace.visualstudio.com/items?itemName=RandomFractalsInc.vscode-data-table
#dataTools 🛠️ 💎💎💎...
As part of improving documentation and encouraging best practices, I will be updating angular-datatables in the near future. There won't be major changes to the library source code.
Our goal is to shuffle the menu items and update GitHub Support templates to reflect these changes.
I'll share more details on my blog in a few weeks once I've made some progress. (hopefully!!!)
Happy new year (in advance) folks! 🎉
#newplans #opensource #datatable #angular
Unpopular opinion: {data.table} syntax is:
1. Confusing in comparison with the indexing notation for base R data.frames
2. Cryptic with all those [, x := fn(a), by=var]
3. Difficult for the uninitiated to understand (unlike SQL and dplyr)
I know {data.table} is a good package with a lot of very happy users, but for some reason these disadvantages are rarely mentioned. Alongside the lack of database backend, they're the main reason I don't use the package much.
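For readers unfamiliar with the notation in point 2, here is a minimal sketch of what `[, x := fn(a), by=var]` does, next to one base R way of getting the same column. The data and the choice of `mean()` as `fn` are illustrative assumptions.

```r
library(data.table)

# Hypothetical data: a grouping column and a numeric column.
dt <- data.table(var = c("g1", "g1", "g2"), a = c(1, 3, 10))

# data.table: := adds column x in place; by = var computes mean(a) per group.
dt[, x := mean(a), by = var]

# One base R equivalent: ave() returns a group-wise statistic aligned to rows.
df <- data.frame(var = c("g1", "g1", "g2"), a = c(1, 3, 10))
df$x <- ave(df$a, df$var, FUN = mean)

print(dt)
print(df)
```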
Every time I work with data.table I think what a great package. The speed alone is great. What I also like is the tibble-like behavior when displaying data.
There may be times when you want to get some sort of #summary #statistic like a #quantile or #IQR on your #distribution data.
With my #r #package {TidyDensity} this is possible given the data comes from a tidy_ distribution function. If you have a vector of data you can use tidy_empirical() as a cheat.
With this function you can get output as #sapply #lapply #tibble or a #tibble where #datatable is doing the work.
Post: https://www.spsanderson.com/steveondata/posts/weekly-rtip-tidydensity-2022-11-23/
See attached!
@chrisadamsecon Definitely faster than #dplyr on my laptop. More vectorized functions. #tidyverse like syntax and you can mostly mix and match with dplyr as you require. Another package, #tidytable is quite interesting also. You can just type regular dplyr and it will convert to #datatable without worrying about any extra steps like dtplyr. #rstats