PyData: A community for developers and users of open source data tools

pixelmonkey · on Nov 10, 2019

"PyData is an educational program of NumFOCUS, a 501(c)(3) nonprofit charity."

They put on the PyData conferences across the world -- for example, PyData NYC 2019 was last week[1].

Profit from the conferences goes toward supporting NumFOCUS programs, including many open source projects[2].

Little-known fact: despite the name, PyData is not just a community for Python. Though Python is probably the largest sub-community within the PyData umbrella, PyData actually covers the Julia and R communities, as well. This is similar to how Jupyter Notebooks, and the parent Jupyter project, though very commonly used by Python data analysts, are also used by R and Julia programmers. In fact, Jupyter is a kind of portmanteau for "(Ju)lia, (Pyt)hon, and (R)".

If your company is looking to hire people comfortable with an open source data analysis stack -- while also supporting the good cause of scientific and data-oriented open source projects -- then you should consider forwarding sponsorship information from PyData's website[3] to a relevant hiring or marketing manager.

[1]: https://pydata.org/nyc2019/schedule/

[2]: https://numfocus.org/sponsored-projects

[3]: https://pydata.org/sponsor-pydata/

kfk · on Nov 10, 2019

Just wanted to say few random things here. I have been moving my team of 8 analysts from Alteryx to Python/Jupyter for data analytics and preparation. While things have been a bit rough we are happy with how easy is to use those tools. A lot of this infrastructure makes sense and it’s b/s free (which is not the case of tools like Alteryx). We have been able to run parallel executions, automated triggered data flows, email notifications, ETL!, pdf reporting with Latex, and more. We can tell prospective students that want to work with us to check out our open source library on github and learn a bit of python before the interview. Students love this stuff because it’s free and there is abundant information on the web. I fear the day I am promoted or move to other company and have to go back to stuff like

What’s the catch? Alteryx and the like dedicate enormous budgets to marketing and sales. It’s hard to convince companies that open source data tools are actually a lot better than the expensive enterprise stuff. Indeed even in my company it’s still hard to get people on board with Python even though we are one of the best teams doing analytics.

mmsimanga · on Nov 10, 2019

I have worked in BI most of my life and concur with your sentiments. I have a slightly different take on the whole commercial vs Open Source. I think organisations should embrace both and make them both first class citizens. In BI space I would say go with Tableau/Power BI/Cognos for all your day to day dashboards and reporting but don't try use Tableau to create printed reports (still requirement in some industries). Tableau wasn't designed to create print PDF reports. Go with an Open Source solution like Jasper Reports[0] or BIRT[1] to create the printed reports. The essence of what I am saying is companies should use Open Source to fill in gaps not covered by the commercial software. Commercial software is good for end users. Commercial companies provide manuals and training is available from partner organisations. This means a lot for none technical users.

[0]https://community.jaspersoft.com/ [1]https://www.eclipse.org/birt/

kfk · on Nov 10, 2019

Yes for things that are easy to isolate like dashboards and pdf reports buying solutions is good. But for things that need orchestration of various tools open source tends to work better for me.

tomrod · on Nov 10, 2019

I think the key point here is isolation. Often presenting dashboards and reports lead to more questions than the isolating infrastructure can reasonably manage.

Vaslo · on Nov 10, 2019

Any reporting software like this based in Python?

mmsimanga · on Nov 10, 2019

I am not aware of tools similar to BIRT and Jasper Reports in Python.

My observation is the world has moved on from Crystal Reports type tools to Tableau like tools. BIRT and Jasper Reports were created at a time when Java was dominant and I don't see anyone creating similar tools going forward. That's my two cents.

ironchef · on Nov 10, 2019

There's a few like: * superset - by maxime (of airflow, etc....) ..word on the street is some big changes are going on with this * dash by plotly

starpilot · on Nov 10, 2019

Interesting. I heard at Boeing they're trying to convert all of their SAS models to Python, so that they can ditch their multimillion dollar SAS license. Did you considering Anaconda Enterprise or Dataiku by chance? They both have really good support for Python workflows, and smoothe out the deployment/governance aspects.

kfk · on Nov 10, 2019

We use anaconda just as a python distribution on Windows. I think if you want a real data analytics setup which works well on all individual platforms then it's best to just build a docker-compose file. I was given a Dataiku presentation in New York in September and I liked it. My concern with such tools is that companies buy them thinking they are a 1 shop solution for analytics. For instance, how does Dataiku work with AWS/S3 Athena? Or how does Dataiku work with github? Maybe it works very well, I don't know, but evaluating those tools can take months. A "raw" python setup is, I think, an initial good first step as it forces teams to deal with all the architecture from the very beginning and it works because it integrates with everything. When you buy from a vendor you put yourself in a contract situation that could make it very hard to iterate to a final solution that actually works for the company. Open source doesn't have those problems.

tomrod · on Nov 10, 2019

If your organization's SAS experience is like where I just came from:

- 90% of your SAS usage is ETL via SAS/ACCESS -- easy to write, but no lineage or real reporting without costly engineering and maintenance. Current ETL tooling must more mature than was SAS offers

- 5% is actually using SAS for what it is intended, canned Statistical Packages with indemnification if their calculations are incorrect (watch out for SAS/STAT 9.3 time series, there are a few PROCs that have incorrect results!)

- 5% of your users are insanely frustrated trying to build real things on top of SAS's broken model of macros, PROC IML, procedure creators, and similar when they really just need Python.

Anaconda Enterprise is a good product, but really the FLOSS underneath just works well. Watch out for dependency hell (be disciplined by using virtualenvs / docker contains) and you'll see dramatic improvement in workflows.

nabdab · on Nov 10, 2019

We’re working on a jupyter notebook setup as well, but for us it’s only analytical notebooks and dashboards (through Voila). How have you tied in triggered dataflows and ETL?

kfk · on Nov 10, 2019

Yes. Well the trick is using Dask with a @retry decorator. You can listen for changes on api's and sql tables easily with a while loop. Scheduling is easy, triggering, retrying and notifying are the harder parts.

BerislavLopac · on Nov 10, 2019

One potential approach is Apache Airflow.

tomrod · on Nov 11, 2019

The Airflow creators went on to make Prefect, I hear.

BerislavLopac · on Nov 12, 2019

I have worked with Airflow extensively, but this is the first time I heard about Prefect, and mind blown! Looking at the docs, it seems like they have resolved most of the things we had to work hard around in Airflow -- I definitely need to look more deeply into it. Thank you!

kfk · on Nov 11, 2019

Do you have a personal contact? You seem to also be interested in similar things as me. By the way if you are ok with a bit of custom code you can do a lot of what airflow/prefect do with dask.delayed

tomrod · on Nov 11, 2019

I'm a big fan of dask, though haven't used it in a deployment/scheduling contact.

I'd love to connect.

beefield · on Nov 10, 2019

To me Alteryx has similar issues than Excel. Both seemt be created to create you an illusion that you know how to use the tool with minimal training/understanding. Which results most people producing utter unmaintainable garbage with these tools, and them not even realizing how bad it is. Because, you know, they know how to use the tool!

xiaodai · on Nov 10, 2019

When u sale, you sale to ppl who don't use the stuff. Executives.

mrslave · on Nov 10, 2019

I've worked on many integration projects for bad fit software purchases. Your comment reminds me of this:

Why Enterprise Software Sucks https://news.ycombinator.com/item?id=21224209

xvilka · on Nov 10, 2019

Isn't NumFOCUS also the sponsor of the Julia language and infrastructure?

xapata · on Nov 10, 2019

They do a fair bit.