Hacker News
Types for Python HTTP APIs (instagram-engineering.com)
213 points by YoavShapira on Sept 13, 2019 | 49 comments


At our startup we use marshmallow [1] to validate data and make our REST APIs type aware: marshmallow models validate request/response data types against validation rules.

Where we need the client to be able to define the validation, the request/response consists of the data plus a validation JSON schema. In that case we use a jsonschema validator [2] [3].

So every JSON request/response payload is passed through a marshmallow model, which validates the JSON. In some cases, when we have dynamic JSON data with a schema definition for validation, we use the jsonschema validator instead.

For responses, the function's return value passes through a marshmallow model. We are also moving towards adding type annotations to all of our internal functions/methods so we can generate documentation with a Sphinx plugin [4].

We are not Instagram, but we are very happy with this setup, and we could replace Flask with bottle or another WSGI-compatible framework and it would still work.

[1] https://marshmallow.readthedocs.io/en/stable/

[2] https://python-jsonschema.readthedocs.io/

[3] https://pydantic-docs.helpmanual.io/

[4] https://pypi.org/project/sphinx-autodoc-typehints/#data


Marshmallow is good but it can be slow. We also use it over a Flask API for input/output serialization, and in the worst cases it can take a significant amount of time if the objects are large enough (maybe a hundred milliseconds). We also use the Marshmallow models in conjunction with a project called `flask-apispec`, by the same authors, to generate Swagger docs.

I've wanted to explore using the Typing module to replace Marshmallow since it started making the rounds to see if it results in better performance, but haven't had a chance. I would have liked to see Instagram release a library to go with this blog post so I don't have to do as much legwork.


> Toasted Marshmallow implements a JIT for marshmallow that speeds up dumping objects 10-25X (depending on your schema).

https://github.com/lyft/toasted-marshmallow


We are exploring using Python data classes and type hints directly. This way we can remove third-party dependencies and rely on the standard library.

As a first step we are exploring marshmallow with data classes, similar to the way we are dealing with jsonschema using pydantic.

In our product we prefer to use as few third-party packages as possible and rely on the standard library. When we want to use a third-party package, we look at the code first and only adopt it if my team can support and enhance it, except in cases where the package is clearly better than the standard library, like requests.
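A stdlib-only sketch of that idea (the `Post` model and `validate` helper are hypothetical, and this only handles plain concrete type annotations):

```python
# Hypothetical sketch: runtime validation driven by type hints using only the
# standard library (dataclasses + annotations), no third-party packages.
from dataclasses import dataclass, fields

def validate(instance):
    """Check each field's runtime value against its annotated type."""
    for f in fields(instance):
        value = getattr(instance, f.name)
        if isinstance(f.type, type) and not isinstance(value, f.type):
            raise TypeError(
                f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
            )

@dataclass
class Post:
    text: str
    user_id: str

    def __post_init__(self):
        validate(self)
```

`Post(text="hi", user_id="u1")` constructs normally, while `Post(text="hi", user_id=42)` raises `TypeError`. Generic annotations like `List[str]` need considerably more machinery, which is exactly what libraries like pydantic provide.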


Been doing this before the typing module was released, using hug: http://www.hug.rest/

FastAPI has a lot of advantages at this point for someone looking to do the same: https://github.com/tiangolo/fastapi

hug has some catching up to do, partly just because it got in so early (it needs to be updated to be compatible with mypy and the types defined in typing!). In any case, using typing on API endpoints, independent of how you feel about dynamic vs. static typing in general, just makes a lot of sense IMHO.


Thanks for the mention @timothycrosley! I feel honored :)


Looks like the FastAPI framework with handler type annotations.


I agree. While reading, I keep thinking of pydantic (https://pydantic-docs.helpmanual.io/) which FastAPI utilizes.


Exactly what I was thinking.

Plus the fact that it is built on top of Starlette and Uvicorn and is frequently near the top of benchmarks of Python-based framework performance.

Fairly complete starter projects.

https://github.com/tiangolo/full-stack-fastapi-postgresql/bl...


Yeah I feel FastAPI will become the new Flask in a few years. The growth this year has been amazing. I wish I had started my project with it, but I am still using Flask. I found this nice extension though called Flask-Rebar that does a similar task with Marshmallow.


> Yeah I feel FastAPI will become the new Flask in a few years.

What else does it offer besides performance (switching to async is no free lunch)?


I ran across FastAPI earlier this year and did a tiny prototype to play with it. Selling points for me:

- Integrates nicely with some existing libraries (Starlette, Pydantic)

- Well documented

- Auto-validation of endpoints from data models

- Auto-generation of OpenAPI schemas from those models

- Auto-serves live API docs from that schema

- Easy definition of sync and async endpoints

Again, it was just a tiny proof-of-concept prototype, but I'm sold on using it on future projects.


But I found one drawback of using FastAPI: defining all inputs in the handler function signature looks a bit complicated and verbose. I prefer a single input whose type is a structure of other types. Something like this:

  class CreatePostInput(InputClass):
      text: str
      user_id: str
      ...

  @handler.post('/post')
  def create_post(input: CreatePostInput):
      ...


> - Auto-generation of OpenAPI schemas from those models

Is there a package that can do the reverse? Can I give it an OpenAPI spec and get a code stub going?


OpenAPI Generator [0], written in Java, can produce Python stubs, but I never liked that approach.

Connexion [1] is built on top of Flask and does routing and validation based on an OpenAPI spec.

I've recently started developing Pyotr [2], which does the same, only based on ASGI and Starlette. It also includes a client module.

[0] https://openapi-generator.tech/

[1] https://connexion.readthedocs.io/

[2] https://github.com/berislavlopac/pyotr


https://github.com/zalando/connexion/

Solves most of the problems for that.


Depends on where you are coming from and what you need. Developer productivity is really high (real world experience). It has better defaults / builtins compared to flask (much faster to get to production quality). Faster to learn than Django / Django rest framework. My team has been using it for a couple of months now and it’s been really easy and fast to get an API up and running with api docs, data validation, middleware for auth and metrics, persistence, mix of async and sync functions and background jobs.


Nice writeup - I've been doing this type of thing for 10 years, since before type annotations, by passing the types to the decorator and then using kwargs to supply defaults.

  @annotate(foo=int, bar=str)
  def view_func(foo=0, bar='hi there'):
      ...

The types could also be arbitrary callables to parse things like datetime and whatnot. I'd parse the params from either the GET params or POST data (json, urlencoded, etc).
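A minimal sketch of such a decorator (a hypothetical implementation of the idea, not the commenter's actual code):

```python
# Hypothetical sketch of the decorator described above: converters (types or
# arbitrary parsing callables) are passed to the decorator and applied to
# incoming string parameters before the view function runs.
import datetime
import functools

def annotate(**converters):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**params):
            # Apply each registered converter to its raw parameter value.
            parsed = {
                name: converters[name](value) if name in converters else value
                for name, value in params.items()
            }
            return func(**parsed)
        return wrapper
    return decorator

@annotate(foo=int, when=datetime.date.fromisoformat)
def view_func(foo=0, when=None):
    return foo, when

print(view_func(foo="42", when="2019-09-13"))  # (42, datetime.date(2019, 9, 13))
```

In a real web framework the `params` dict would come from the query string or a decoded request body rather than keyword arguments.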

But now I'm using graphene and graphql to handle all this -- it's a better way to do all of it imho. Of course this all came along after instagram, so you didn't have that choice back then.


Rather than reimplementing this stuff, and assuming you are using Django, you could just use Django Rest Framework to get OpenAPI, typed serializers, etc.

In that framework your serializers are by default auto-generated from your model classes; this is convenient to get started, just like Django itself.


I'm always amazed that, in the majority of cases, we use formal specifications like OpenAPI for documentation instead of to guide our implementations. The common pattern is to build an API first and then extract the routing and validation information into an OpenAPI spec, which is then used to set up a test server for clients to develop on top of.

But if you have a relatively clear idea of what your API should look like, there are great benefits to be gained by writing the specification first. First, you don't need the API to be implemented before you start developing the clients, even by completely unrelated teams. Second, your spec will already include your routing and validation rules, so there is no need to manually specify e.g. Pydantic models.

I recently wrote a PoC of a framework [0] that uses the OpenAPI spec to easily implement a REST(ful) API; in a nutshell, you implement endpoint functions that correspond to the spec's `operationId` names, and it automatically routes each request to the right endpoint. It is fully ASGI compliant and has a bonus client module which allows you to make the requests.
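The core idea can be illustrated with a stdlib-only sketch (this shows only the operationId-dispatch concept; it is not Pyotr's actual API, and the spec fragment and handler are made up):

```python
# Hypothetical illustration of spec-first routing: a request is dispatched to
# a plain function by looking up the operationId declared in the OpenAPI spec.
spec = {
    "paths": {
        "/post": {"post": {"operationId": "create_post"}},
    },
}

def create_post(body):
    return {"id": 1, **body}

# Map operationId names to their implementing functions.
HANDLERS = {"create_post": create_post}

def dispatch(spec, path, method, body):
    operation_id = spec["paths"][path][method]["operationId"]
    return HANDLERS[operation_id](body)

print(dispatch(spec, "/post", "post", {"text": "hi"}))  # {'id': 1, 'text': 'hi'}
```

A real implementation would also validate `body` against the request schema in the spec before calling the handler.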

[0] https://github.com/berislavlopac/pyotr


I'd love to see some more details on the tooling they created to generate OpenAPI schemas by extracting types from their code.

Related: are there any good Python libs for doing request/response validation based on OpenAPI v3 schemas?


I have used Connexion (https://github.com/zalando/connexion) in the past. The only thing I didn't really like is that the OpenAPI file becomes rather unmaintainable over time if you have a big service.


No one is mentioning that the Instagram app, which doesn't have that many features and is quite poorly designed, somehow requires over 2000 API calls to function.

I would dare anyone to come up with even 100 for the Instagram app.


endpoints


I love types and I'm stoked about them coming back into style. For a while, it was a major trend to avoid static types. I think most people's problem ended up being not with types themselves but with poor type systems. For myself, I am particularly impressed with what TypeScript has done with its types; it's really amazing how expressive, flexible, and introspective it is. Anyway, I'm just happy with the trend of embracing types and hope it continues.


This is a bit lower level as it's only concerned with serializing, but a package I wrote called json_syntax[1] will take @dataclass or @attr.s classes decorated with type annotations and build encoders and decoders for them.

It also handles Union types reasonably well and lets you put in hooks to handle ugly cases. It's used in production on a system with a big complicated payload, and I designed it to be easily extensible if the standard rules don't work for you.

[1]: https://pypi.org/project/json-syntax/


I've been working on a similar solution for Flask at K Health. Currently we have great serialisation/deserialisation from types to JSON; the next step will be generating OpenAPI documents.


You should check out Flask-Rebar, I am using it a lot in my projects and it's very nice. Does all of this automagically.


I've been using type hints with my bottle projects for some time now. Makes it a lot easier to write JSON based apis.

https://github.com/theSage21/bottle-tools


When I read the headline and saw the source, I assumed this would be about GraphQL. I know Instagram utilizes GraphQL, for example on the web client, so now I'm wondering how that fits in.


Flask Swagger already has a way to annotate it quite well.


I find it really impressive that such a large organization runs anything on Python. Isn't the speed optimization potential immense? Does the amount of Python involved just not matter compared to image data?

Secondly, I find it really impressive that such a large company with so many smart people can produce such a mediocre application, make the experience extra terrible, and make me wonder what the absolute fuck is up with that company by trying to block desktop browsers from the perfectly usable-on-desktop web app (the one in which you can upload things).

Especially for a photo-centric application (one that has since begun to be used for original video production, of course hampered by the insane lack of any options, starting with the aspect ratio), one could expect a normal workflow to include transferring photos from a camera to a desktop computer. Making your browser pretend to be a mobile phone seems like a step that could maybe, if you really tried (to remove the arbitrary restriction that you must have explicitly added in the first place), be made unnecessary.

So that's my (condensed) rant about Instagram as a whole.


> I find it really impressive that such a large organization runs anything on Python. Isn't the speed optimization potential immense? Does the amount of Python involved just not matter compared to image data?

Usually, that's exactly the story: there are certain hot spots that account for the vast majority of your processing time.

I work on an optimization engine that evaluates financial plans.

We have a ton of business logic that is not performance sensitive, and we decompose those objects into flat, regular primitives so they can run in a tight loop that does the actual simulation.

We optimized that using numba and get quite acceptable performance.

But let me revisit what you asked:

> I find it really impressive that such a large organization

In a large organization, you need to find people with the right skillset, and if someone is a subject matter expert (mathematician, statistician, etc.) they often know Python. If they can read and understand your code, they can directly check it for correctness.

Or they can load modules into Jupyter and work with them. Having a lingua franca is itself very powerful.


Who cares if it's 10% slower running on Python if the whole thing is going to be distributed across 1000s of containers that are behind layers of caching. At that point developer productivity is more important than runtime performance.


What exactly are those developers doing all day, then? Because every time I open an Instagram.com link I'm struck by how bad it is. The images are small; the comment list is the same height as the image, so wide but short images get a short comment list; the comments are scrolled to the bottom, so the poster's description isn't visible without scrolling up; videos don't have any kind of playback controls, so you can't even see how long they are; and so on. It's a terrible web site and I don't understand where all the developer time and productivity is going.

And wouldn't a 10% hardware cost decrease be worth a lot at a company with such a high load, or is it all rendered so seldom because of the caching?


Really? You’re dogging the product because you don’t like how it looks on mobile? The article is about how they use python and types to improve their APIs.


I'm not sure I can say this with authority, but no one cares about the Instagram website. It's primarily a mobile app, so they've been ignoring the website's bugs for as long as I can remember using it.

Besides, most of those problems are UI-related; this article is about the backend, and Python really is perfectly fine for a backend with a huge focus on data (for obvious reasons).


The whole Type thing in Python and all the workarounds to make Python type-safe-ish look and feel absolutely pathetic. I have no idea why a company like Facebook takes so much time to apologize for the inadequacy of Python as a programming language in the broader sense. Especially since they are utilizing a service-based architecture, they could easily switch to a language that actually supports types. Also the obvious performance gains...


Just to clarify, Python has always been strongly typed. Every object has to have an explicit concrete type. However this type is only known at runtime and variables can be bound to any type of object. Aka dynamic typing.

What the "whole Type thing" adds is type hints. That is to say they tell the programmer what types are expected. These hints can also be used in static analysis to try to restrict the types that can be bound to any particular variable. This can be useful to detect bugs in large code bases.
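A tiny illustration of the distinction (a hypothetical example: the hint changes nothing at runtime, but a static checker such as mypy would reject the second call):

```python
# The annotation says x should be an int, but Python does not enforce it at
# runtime: calling with a str also runs, because str supports * as repetition.
def double(x: int) -> int:
    return x * 2

print(double(3))     # 6
print(double("ab"))  # "abab" at runtime; mypy would flag this call
```

The hints only become enforcement when a tool like mypy is run over the code, typically in CI.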


> Especially since they are utilizing a service based architecture

The article says it is mostly a monolith with several million lines of code.


Why does this sound pathetic for a company like Facebook, which essentially made their original PHP backend type safe and compiled by developing their own sub-language Hack and the HHVM compiler around it? Python has always supported dynamic types, checked at run time ;) and nowadays it supports static type hints as well.


Python does support types, though. Its type system is, IMO, more pleasant and powerful than Java's.


Can you elaborate on this? I’d love to actually use types in Python.



Static typing is exactly what Python doesn't have. It will remain dynamically typed. It has type hints, which can be used with external tools to enforce type correctness to some degree, but Python code will run even if you assign:

    def f(x: int):
        x = "str"
    f(42)


This is true of Java and C as well.

You can cast things or subvert the type system and the code will still run.

At least where I work, my build process prevents me from running Python code with invalid type annotations, so it's effectively the same as for Java or C++ or any other statically typed language.


In my opinion, the main difference between the two from a developer's perspective is this:

You can't grep for dynamic type errors. You can grep for a cast to see where you've made a mistake and find a way to do it without a cast. This of course goes for the named casts in C++; it is harder to grep for casts in C.


Right, but if you run mypy over your code, you're in the same situation as in Java. It's a one-liner to run mypy before you invoke your tests, and you have to cast to subvert the type system.

In C you have void pointers everywhere anyway.


Terminology: static vs dynamic typing != typed vs untyped. You can't not use types in Python.

Examples of untyped languages would be B, assembly, Forth(?) - where the system doesn't distinguish what type of objects are in memory or storage locations.



