PyCon X

Firenze

2-5 May 2019

Scaling your Data infrastructure

This talk aims to answer a few questions:

  • What do you do when you need to move your model from your laptop to production?
  • Is “big data == I need to use the JVM” the right assumption?
  • How can I put my Jupyter notebook into production?
  • How do you apply the best software engineering practices (testing and CI, for example) inside your data science process?
  • How do you “decouple” your data scientists, developers and devops teams?
  • How do you guarantee the reproducibility of your models?
  • How do you scale your training process when it no longer fits in memory? (see the short sketch after this list)
  • How do you serve your models and provide an easy rollback system?
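To make the out-of-memory question a bit more concrete, here is a minimal sketch of out-of-core training with Dask and dask-ml; the file pattern, column names and choice of estimator are assumptions made up for illustration, not necessarily the approach covered in the talk:

    # Minimal out-of-core training sketch with Dask (hypothetical dataset).
    import dask.dataframe as dd
    from dask_ml.linear_model import LogisticRegression

    # Dask reads the CSV files lazily, one partition at a time,
    # so the full dataset never has to fit in memory.
    df = dd.read_csv("events-*.csv")
    X = df[["feature_a", "feature_b"]].to_dask_array(lengths=True)
    y = df["label"].to_dask_array(lengths=True)

    # dask-ml estimators accept Dask arrays and train on them
    # without materialising the whole dataset in memory.
    model = LogisticRegression()
    model.fit(X, y)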

The Agenda:

  • The Data Science workflow
  • Scaling is not just a matter of the size of your Data
  • Scaling when the size of your Data matters
  • DDS, Dockerized Data Science
  • Cassiny

I’ll share my experience, highlighting some of the challenges I faced and the solutions I came up with to answer these questions.

During this presentation I will mention libraries like Jupyter, Atom, scikit-learn, Dask, Ray, Parquet, Arrow and many others.
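As a small taste of how two of those pieces fit together, here is a hedged sketch of round-tripping a pandas DataFrame through Arrow and Parquet; the DataFrame contents and file name are invented for the example:

    # Round-tripping a pandas DataFrame through Arrow and Parquet.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

    # Arrow is the in-memory columnar representation,
    # Parquet the columnar file format on disk.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "scores.parquet")

    # Reading the file back gives an Arrow table that converts cheaply to pandas.
    restored = pq.read_table("scores.parquet").to_pandas()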

The principles and best practices I will share are things you can apply, more or less easily, if you are running, or are about to run, a production system based on the Python stack.

This talk will focus on (my) best practices for running the Python data stack, and I will also talk about Cassiny, an open source project I started that aims to simplify your life if you want a completely Python-based solution for your data science workflow.


Comments

  1. Hi Christian,
    I think the slides are not complete, can you upload the full version?
    — Martin De Luca
  2. Uploaded ;)
    — Christian Barra
