Kotlin Serialization


#1

We are working on a generic Kotlin Serialization facility for some future release of Kotlin. Goals of this effort are:

  • Work with any Kotlin backend (JVM, JS, Native)
  • Be compatible with any data serialization format or library (kryo, json, xml, protobuf, Android’s Parcel, anything) including use-cases of loading configuration files from disk and loading/saving objects from/to relational databases
  • Support both static serialization formats (type information is taken from compile-time types) and dynamic serialization formats (ones that use run-time types)
  • Support both graph serialization formats (object identity is preserved) and value serialization formats
  • Support built-in arrays, collections and maps
  • Generate code for user-defined classes at compile-time

Designing a Kotlin-specific wire format is out of scope for this project. Support for some popular formats (like JSON) may be provided by a standard library, though. There is no commitment to any release time-frame at this phase.

This facility is going to supersede and resolve JS serialization issues that are tracked here:

This topic is to gather community use-cases and feedback.


#2
  • From my perspective it would be very helpful if this was extensible (users could specify their own serialization format / serializer). This would allow for user defined formats as well as handling of special cases.
  • Where possible, it would be interesting if it worked mainly compile-time instead of run-time. In embedded contexts class generation is preferable over reflection. Of course reflection / run-time detection where needed.
  • Avoid the problems with JAXB. It is too much of an all-or-nothing approach that doesn’t support embedded content very well, especially when it involves arbitrary XML with namespaces.
  • Make the system properly typesafe. The Java XML APIs are horrible, with sources and results that are not actually typesafe (and that are inconsistent across different parts of the APIs).
  • Serialization should support different methods of enablement: interface/annotation (for “just-works” serialization); external (either a simple map, or a resolver that resolves a serializer for a given object/type)
  • Deserialization should also support some form of in-class enablement that is as straightforward as possible. Interfaces don’t work here, because deserialization has to be static or a constructor in order to work with immutable objects. Annotations “work” somewhat, but not perfectly. The same external mapping should work for both serialization and deserialization.
  • Further wish: it would be very interesting if serialization would work such that a class would support “generic” serialization and a user would be able to specify a format and it “just worked”

#3

Thank you for your feedback. The plan is to make it extensible, compile-time and type-safe, via a “just-works” annotation or by providing external serializer code. It will definitely work with immutable objects. And yes, the plan is to make it “generic” so that the user is able to specify their own format.
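
To make the plan above more concrete, here is a minimal sketch of how an external, type-safe serializer for an immutable class could be kept independent of the format. All names in it (Encoder, Decoder, Serializer, PersonSerializer, MapEncoder, MapDecoder) are hypothetical and only illustrate the idea; they are not the design of the actual facility.

```kotlin
// Hypothetical interfaces for illustration only.
// A format implements these two; class serializers never see the wire format.
interface Encoder {
    fun encodeString(name: String, value: String)
    fun encodeInt(name: String, value: Int)
}

interface Decoder {
    fun decodeString(name: String): String
    fun decodeInt(name: String): Int
}

// Type-safe serializer for T that can live outside the class it serializes.
interface Serializer<T> {
    fun save(value: T, out: Encoder)
    fun load(input: Decoder): T
}

// An immutable class: no setters, state is set only through the constructor.
class Person(val name: String, val age: Int)

// What generated (or hand-written external) code could look like.
object PersonSerializer : Serializer<Person> {
    override fun save(value: Person, out: Encoder) {
        out.encodeString("name", value.name)
        out.encodeInt("age", value.age)
    }

    override fun load(input: Decoder): Person =
        Person(input.decodeString("name"), input.decodeInt("age"))
}

// A toy user-defined "format" backed by a map.
class MapEncoder : Encoder {
    val map = LinkedHashMap<String, Any>()
    override fun encodeString(name: String, value: String) { map[name] = value }
    override fun encodeInt(name: String, value: Int) { map[name] = value }
}

class MapDecoder(private val map: Map<String, Any>) : Decoder {
    override fun decodeString(name: String): String = map.getValue(name) as String
    override fun decodeInt(name: String): Int = map.getValue(name) as Int
}

fun main() {
    val out = MapEncoder()
    PersonSerializer.save(Person("Ann", 30), out)
    println(out.map) // {name=Ann, age=30}

    val copy = PersonSerializer.load(MapDecoder(out.map))
    println("${copy.name} ${copy.age}") // Ann 30
}
```

Swapping MapEncoder for another Encoder implementation changes the format without touching PersonSerializer, which is the kind of “generic” behaviour asked for in #2.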

Can you please elaborate on the JAXB problems that should be avoided?


#4

I have four main problems with JAXB:

  • The factory feature system is quite dissatisfactory, especially with lack of standardization of features.
  • The way that list properties are managed is limited.
  • Custom marshalling/unmarshalling is cumbersome and limited for non-trivial datatypes.
  • The case of handling arbitrary child data is broken (discussed further below).

Basically an example use case (my real one is even more complex) is the use of JAXB (or alternative approach) to deserialize SOAP envelopes. The body and headers can contain arbitrary children in a different namespace. In some cases it is necessary to extract the body of the envelope and store it in a non-dom format (say a string). There is namespace information there for namespace prefixes that were declared on (for example) the outer tag. With some JAXB providers it is possible to actually access the parser (in the unmarshaller) to get access to this information. Unfortunately when marshalling it is not possible to fully manage namespaces easily.

An added complexity here is that a subtree of the XML is filtered/transformed programmatically, all in a streaming way. JAXB with arbitrary children only works with DOM nodes. The way I see a solution is to allow for delegate serializers/deserializers. Given a (tree) structure, these would be responsible for handling a subtree and would have sufficient information to do so, hopefully in a typesafe way with the underlying type-specific handler (the type being JSON/XML/custom user-defined etc.).


#5

This addition would be very welcome as I’ve experienced some serialization challenges, particularly when it comes to Pairs and having to add default values to everything when I don’t want to. I’m aware of some libraries that seem to ease these problems for JSON, but not for runtime-typed YAML which is what I needed.

Question: how would the system work with standard Java objects? Would there be some more typical constraints, like them being Java Beans?


#6

Can you please provide more background on your challenges with Pairs and YAML?

As for Java objects: all the standard Java collections will definitely be supported, since they are heavily used in the Kotlin stdlib and are mapped to Kotlin collection interfaces. The level of support for third-party Java objects is TBD. The choices are:

  1. Support them only if user explicitly provides an implementation of serializer for them.
  2. Support them dynamically via reflection (similar to Gson and/or Kryo)
  3. Support them statically via javac compiler plugin for KSerializable annotation.

Maybe all three choices, or a subset of them.
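
As a rough illustration of how choices 1 and 2 could coexist, the sketch below looks up a user-provided writer first and falls back to Java Beans reflection. The Writers class and its methods are hypothetical; only java.beans.Introspector is a real JDK API, and this is JVM-only.

```kotlin
import java.beans.Introspector

// Hypothetical helper, not a real API: user-provided writers (choice 1)
// with a reflective Java Beans fallback (choice 2).
class Writers {
    private val explicit = HashMap<Class<*>, (Any) -> Map<String, Any?>>()

    @Suppress("UNCHECKED_CAST")
    fun <T : Any> register(type: Class<T>, writer: (T) -> Map<String, Any?>) {
        explicit[type] = writer as (Any) -> Map<String, Any?>
    }

    fun write(value: Any): Map<String, Any?> {
        // Choice 1: an explicitly registered writer wins.
        explicit[value.javaClass]?.let { return it(value) }
        // Choice 2: read every bean getter, stopping before java.lang.Object.
        return Introspector.getBeanInfo(value.javaClass, Any::class.java)
            .propertyDescriptors
            .filter { it.readMethod != null }
            .associate { it.name to it.readMethod.invoke(value) }
    }
}

// Kotlin vals compile to getWidth()/getHeight(), so this class looks like a bean.
class Box(val width: Int, val height: Int)

fun main() {
    val writers = Writers()
    writers.register(java.io.File::class.java) { mapOf("path" to it.path) }

    println(writers.write(java.io.File("a.txt"))) // {path=a.txt}
    println(writers.write(Box(2, 3)))             // width and height, read reflectively
}
```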

#7

Sure. I actually made a post about it a little while ago, but it appears it never got any traction. Here is the link to it:

Regarding Java objects, that sounds pretty reasonable. I would be partial to having some “just works” support in the case where custom Java objects are Java Beans since that is a typical standard for vanilla serialization methods.


#8

I don’t see any discussion of versioning. Is that deliberate?


#9

“I don’t see any discussion of versioning. Is that deliberate?”

I guess that from my perspective it isn’t. Versioning should be considered. It has to be optional (you want to support the lack of versioning information in both serialization and deserialization).

A way in which I would approach versioning is to consider it as state of a serialization driver. When an individual object is serialized or deserialized the handling code needs to be able to access the currently active version. Serialization defined through annotations can have version attributes specifying version ranges for the property/annotation (one property could have different names in different versions, so multiple annotations). The hardest aspect to handle without programmatic handlers is when the (older) version does not handle the entire range of values of the object that is serialized/deserialized. In case of serialization, some values may not be validly serialized in the old version. In the case of deserialization, default values need to be provided.


#10

The plan is to defer versioning to the actual serialisation format implementation. Various formats have varying approaches to versioning:

  • Don’t support versioning at all (both sides must have the same schema) like Kryo or other very compact formats
  • Use property names for versioning like JSON, XML, etc
  • Use integer tags for versioning like Protobuf, FIX, etc
  • Use some registered identifier for the class version that defines its schema
  • Describe the class schema the first time it appears in the stream, like Java Serialization

In order to make all those formats implementable, the serialization implementation will be given access to the serialized class descriptor, with property names and all their annotations.
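
A minimal sketch of that idea follows: a format receives a class descriptor and decides for itself whether to key values by property name or by an integer tag taken from an annotation. The Tag annotation, the descriptor classes and the helper functions are all hypothetical, and the descriptor is built by hand with JVM reflection here only for brevity, whereas the plan is to generate such code at compile time.

```kotlin
// Hypothetical annotation carrying a Protobuf-style integer tag.
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.FIELD)
annotation class Tag(val id: Int)

class Quote(@field:Tag(1) val symbol: String, @field:Tag(2) val price: Int)

// Hypothetical descriptor types: property names plus all their annotations.
class ElementDescriptor(val name: String, val annotations: List<Annotation>)
class ClassDescriptor(val elements: List<ElementDescriptor>)

// Assumption: built by hand via JVM reflection, for illustration only.
fun describe(type: Class<*>, vararg names: String) = ClassDescriptor(
    names.map { ElementDescriptor(it, type.getDeclaredField(it).annotations.toList()) }
)

// JSON-like format: key each value by property name.
fun nameKeys(d: ClassDescriptor): List<String> = d.elements.map { it.name }

// Protobuf-like format: key each value by the integer tag from its annotation.
fun tagKeys(d: ClassDescriptor): List<Int> =
    d.elements.map { e -> e.annotations.filterIsInstance<Tag>().single().id }

fun main() {
    val d = describe(Quote::class.java, "symbol", "price")
    println(nameKeys(d)) // [symbol, price]
    println(tagKeys(d))  // [1, 2]
}
```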


#11

This would be sensible. There is one caveat: versioning requires stateful serialization/deserialization. While stateful approaches have their limitations, a generic framework could support attaching state that could then be used later in the framework. Note that XML namespaces would also need support for state unless an underlying layer does the namespace management. Most parsers actually do this, but for some specialised purposes access to this state is beneficial, for example if you need to refer to a namespace prefix in an attribute or text value (e.g. an XPath query with namespace support).


#12

Stateful it will be. Providers of serialization format implementations will be required to implement an interface, and an instance of that implementation will be used during serialization/deserialization. This way, the implementation is free to store any state it needs in that instance.
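
As a small sketch of what that could mean in practice, the format below keeps its own private state (a nesting depth) in the instance that is used for one serialization run. The Output interface and IndentedTextOutput class are hypothetical names chosen for illustration.

```kotlin
// Hypothetical interface that a format provider would implement.
interface Output {
    fun beginObject(name: String)
    fun writeValue(name: String, value: Any)
    fun endObject()
}

// One instance per serialization run; the instance owns whatever state the format needs.
class IndentedTextOutput : Output {
    private val sb = StringBuilder()
    private var depth = 0 // format-private state, invisible to class serializers

    override fun beginObject(name: String) { line("$name:"); depth++ }
    override fun writeValue(name: String, value: Any) { line("$name = $value") }
    override fun endObject() { depth-- }

    private fun line(text: String) {
        sb.append("  ".repeat(depth)).append(text).append('\n')
    }

    fun result(): String = sb.toString()
}

fun main() {
    val out = IndentedTextOutput()
    out.beginObject("order")
    out.writeValue("id", 7)
    out.beginObject("customer")
    out.writeValue("name", "Ann")
    out.endObject()
    out.endObject()
    print(out.result())
}
```

The same slot could hold a current version number or a namespace prefix map, as discussed in #9 and #11.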


#13

Will this be used to persist coroutine’s state?


#14

Serialization of coroutines is a larger problem that is not directly related to an actual serialization format. Current implementation of coroutines is serializable with both Java Serialization (they implement java.io.Serializable) as well as with 3rd party Java libraries (I’ve tested it with Kryo and it works). However, neither is appropriate for long-running coroutines, like business processes, since their serial format evolution is not currently defined, e.g. a simple innocent refactoring in coroutine implementation can make its serialization format incompatible with a saved representation.

In this vein, we can definitely support “Kotlin Serialization” of coroutines, too, but if you try to serialize coroutine into some human-readable format like JSON, then you’re going to get quite illegible garbage with all the internal names exposed, even though it will deserialize its state properly, as long as you did not touch any code between saving and loading your coroutine.

I’m not sure if this kind of support makes sense until a stable representation for coroutine states is implemented, since Kotlin Serialization aims to support both binary formats and human-readable ones nicely.


#15

Support for multiple custom serialization schemes of a class.

Sometimes you want to serialize your object one way on one occasion and another way on a different occasion, e.g. if the API you’re talking to or implementing changes and you want to support both the new and the old version of the API. If serialization schemes only support a single custom serializer for your class, you need to muck around with wrappers that do the alternative serialization.


#16

We don’t currently have an elegant idea on how alternative representations might work. The best we can do in this respect is to support optional fields with defaults that will be used if a field is absent in the serialized representation (if the serialized representation is flexible in that respect, like JSON). We also plan to give serializers access to all the annotations that are defined on serial fields, so if you annotate your fields (properties? elements?) with something like @SinceVersion(3), then you can have your serializer implementation check what version is currently being read/written and skip fields that should not be present in this version.
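
A minimal sketch of the @SinceVersion idea, under the assumption that the writer knows which version it is producing and can see the annotations of each field. SinceVersion, VersionedWriter and save are hypothetical names, and the annotation lookup uses JVM reflection only to keep the example short.

```kotlin
// Hypothetical annotation, as suggested in the post above.
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.FIELD)
annotation class SinceVersion(val version: Int)

class Account(
    val id: Int,
    @field:SinceVersion(3) val nickname: String
)

// Hypothetical writer whose state is the version currently being written.
class VersionedWriter(private val version: Int) {
    val written = LinkedHashMap<String, Any>()

    fun write(owner: Class<*>, name: String, value: Any) {
        // Fields without the annotation are treated as present since version 1.
        val since = owner.getDeclaredField(name)
            .getAnnotation(SinceVersion::class.java)?.version ?: 1
        if (version >= since) written[name] = value
    }
}

// What generated code for Account could look like.
fun save(a: Account, out: VersionedWriter) {
    out.write(Account::class.java, "id", a.id)
    out.write(Account::class.java, "nickname", a.nickname)
}

fun main() {
    val account = Account(1, "ann")

    val v2 = VersionedWriter(2)
    save(account, v2)
    println(v2.written) // {id=1}

    val v3 = VersionedWriter(3)
    save(account, v3)
    println(v3.written) // {id=1, nickname=ann}
}
```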


#17

One feature that is really necessary for using serialisation in secure contexts is ensuring objects can only be resurrected via their constructor and/or public setters. Is the idea that the generated code works this way, or can be made to?


#18

Let me elaborate on security a little bit. There are two kinds of serialization.

  • In static serialization you invoke something like MyClass.load(someInput) and only classes explicitly and statically referenced by MyClass get loaded. There is no reflection or loading of classes by name.
  • In dynamic serialization, like Java Serialization, you invoke something like someInput.readObject(). Any class name can appear in the stream, and it will be dynamically looked up on the class path at run-time and loaded.

Any dynamic serialization scheme is inherently insecure. There is no way to make it secure by limiting resurrection to constructors and/or public setters only, since in a big application there is always a chance of a class somewhere on your classpath that does something weird, and even if you limit loaded classes with a whitelist, there are still issues. You can search for Java Serialization security issues for examples.

However, dynamic serialization is extremely useful in closed-world settings. Every modern JVM-based big-data distributed-computing framework uses it.

We plan to support both static and dynamic serialization in Kotlin.
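
The difference between the two kinds can be sketched as follows. The function names (loadPoint, resolveClass) and the optional whitelist parameter are hypothetical; only Class.forName is a real JVM API. The whitelist is shown because it comes up in this discussion, not as a claim that it makes dynamic loading safe.

```kotlin
class Point(val x: Int, val y: Int)

// Static: the caller names the type in code; nothing in the input can choose a class.
fun loadPoint(fields: Map<String, String>): Point =
    Point(fields.getValue("x").toInt(), fields.getValue("y").toInt())

// Dynamic: the input names the class, which is then looked up at run-time.
// Passing no whitelist accepts any class name found in the stream.
fun resolveClass(nameFromStream: String, whitelist: Set<String>? = null): Class<*> {
    require(whitelist == null || nameFromStream in whitelist) {
        "Class not allowed: $nameFromStream"
    }
    return Class.forName(nameFromStream)
}

fun main() {
    val p = loadPoint(mapOf("x" to "1", "y" to "2"))
    println("${p.x}, ${p.y}") // 1, 2

    val allowed = setOf("java.util.ArrayList")
    println(resolveClass("java.util.ArrayList", allowed).name)               // java.util.ArrayList
    println(runCatching { resolveClass("java.io.File", allowed) }.isFailure) // true
    println(resolveClass("java.io.File").name)                               // java.io.File
}
```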


#19

Yes, I am familiar with serialisation security, thanks. You obviously need to pair ‘dynamic deserialisation’ with a whitelist; some frameworks like Kryo support that already, and the default Java framework is getting support for that in Java 9.

Sometimes you don’t know ahead of time exactly what classes might be deserialised; any plugin architecture where plugins can serialise data into a stream is an example of that. If you allow plugins to extend the whitelists and take other precautions to prevent invalid streams being deserialised, the security of the two approaches ends up similar. Put another way, dynamic deserialisation is writing the same code that static deserialisation would, but it’s generated just in time instead of ahead of time.


#20

We are in somewhat uncharted terminology here. I’ve labeled as static serialization anything where you explicitly know what type you are reading at every point. This is typically how you deserialize JSON into a type-safe form, for example. It can be implemented via run-time reflection (with Jackson, for example), but, out-of-the-box, Jackson still does static deserialization, as it is fully driven by the type definitions in your code.

Deserialization is dynamic if you don’t need to know your types in advance. Out-of-the-box Kryo is fully dynamic, unless you explicitly configure a whitelist. It is extremely convenient for closed-world applications. It makes Kryo a fine replacement for Java serialization in Spark, for example.

Whitelists do blur the line between the two approaches. Protobuf’s Any is also borderline, even though I’d still consider it a fully static deserialization approach, because the Any type does not get deserialized by the protobuf framework, but is kept as an array of bytes for application code to deserialize if needed.

The approach that I am currently pursuing with respect to security is to default to the safe side, i.e. make the serialization fully static by default, but still support both pre-compiled deserialization code and run-time (reflection-based) deserialization for third-party library classes that you statically reference.

Dynamic serialization will be supported as an opt-in, and you could use full-world, black-list and/or white-list approaches, so in Kotlin serialization a white-list will be considered a variant of dynamic serialization. Both classes with pre-compiled deserialization code and run-time (reflection-based) deserialization shall be supported.

Of course, reflection will be supported only on the platforms that support reflection, and reflection always has adverse performance effects, so the primary effort is going to be focused on producing pre-compiled serialization/deserialization code for all serializable classes.