Sorted collection by compareTo, uniquely by hashcode

I think TreeSet and TreeMap only use compareTo to do 2 jobs:

  • sort items
  • make sure there’s no 2 equal items in the collection

I have class Event with 2 fields: id and time.
I want to sort collection of events by time, but make sure there’s no 2 items with the same id in the collection.
If I use TreeSet, it will sort by time for me because I only compare time in compareTo method.
But I must manually manage the unique of id when I add new items to the collection.

Is there a better way for this issue?

Example (simplified):

I have a list of events in the collection:
event1 = Event(id: 1, year: 2020)
event2 = Event(id: 2, year: 2019)
event3 = Event(id: 3, year: 2021)

The collection should be sorted in the following order:
event2 = Event(id: 2, year: 2019)
event1 = Event(id: 1, year: 2020)
event3 = Event(id: 3, year: 2021)

Now, I want to update event2 with new time: Event(id: 2, year: 2022).
The new collection should replace the old event2 and use the new one with new order:
event1 = Event(id: 1, year: 2020)
event3 = Event(id: 3, year: 2021)
event2 = Event(id: 2, year: 2022)

You can see the code, which does not ensure the uniqueness of id here:
https://pl.kotl.in/7KlbEeDQ3

class Event(val id: Int, val year: Int): Comparable<Event> {
    override fun equals(other: Any?): Boolean {
        return other is Event && id == other.id && year == other.year
    }

    override fun hashCode(): Int {
        return id
    }

    override fun toString(): String {
        return "Event(id: $id, year: $year)"
    }

    override fun compareTo(other: Event): Int {
        return year.compareTo(other.year)
    }
}


fun main() {
    val events = sortedSetOf(Event(1, 2020), Event(2, 2019), Event(1, 2021))

    println("before:")
	println(events.joinToString("\n"))

	events.addAll(listOf(Event(2, 2022), Event(2, 2022), Event(2, 2022)))

    println()
    println("after (2 items with id 2, not what I want):")
	println(events.joinToString("\n"))
}

Probably not. I don’t think there are any data structures out there that sort by one key and ensure uniqueness by another. So’ll probably you won’t find anything that is less work to use.

The only thing you could really improve on is performance. I’m sure there are ways to improve speed if you know how often you add new items, read the collection, etc, but that would require some good benchmarking and I’d suggest you start with a simple to implement solution and only improve on that if you notice that performance is a problem.

1 Like

You could maybe provide your own special addUnique method that only adds an item if every other element in the collection doesn’t have that id. You could even make this much much more performant if, on every add, you use a hashset to store the id alone and then you only add the elemeht to the treeset if the id was successfully added to the hashset

If you are more concerned about the object identities maybe you should use HashSet first and just create a sorted collection on demand?

1 Like

In my application, the list could be quite big, so I want to save some memory on user’s device. That’s why I don’t want to have 2 collections for a single task.
The list must always sorted, nowhere in the code should have the list in different ordering. In this app, reading will more often than writing. Therefore I want it sorted immediately when it’s added/updated/removed and multiple reads on the list will not need to sort the list again to save some CPU resource (and maybe avoid mistake when developer forget sorting).

My workaround for now is to check manually if there’s an existing id, I will remove it and add the new one.

But I think, in general, sorting and identifying are 2 different stuffs. Like when you design SQL database table, you want to make sure there are no duplicated records (normally with an id). But you allow sorting the query result whatever you want.
In many cases, we can do these 2 things with the same attributes. But other cases, we want to deal with them differently.

Should we introduce something better in the core API of the language to support this situation? It’s better to have it in Java, but before that happens, we can also do it in Kotlin first.
It would be nice if someone from Kotlin team says something here.

1 Like

You could potentially create then a custom collection which uses List as a base. Complexity-wise each operation would have a linear cost, but it would save some memory. You just need to check objects for their presense first before insertion and then apply sort on internal list when objects are actually added.

1 Like

In an SQL database, if you want it to be efficient, you need an index that is ordered by your sort key, AND you will need an index that is ordered or hashed by your ID. These indexes are separate data structures that are essentially equivalent to a TreeSet and a HashSet.

You should use a TreeSet<Event> to maintain your ordering, and a HashSet<Int> to maintain uniqueness. Each of these data structures is pretty much the minimum that you require to accomplish these tasks, and that is why TreeSet doesn’t let you sort and check uniqueness by different keys – it would require an internal HashSet that that would be unused in 99.9% of cases, and the interface would be complicated for no reason by mixing the two jobs together.

3 Likes

How do you know 99.9% will be the case of current TreeSet? When I work with list of records from database, I usually sort and identify records differently.

I can also manage 2 collections manually, but it’s just inconvenient.

Maybe we can introduce HashTreeSet, which inherit TreeSet. So sorting will be solved in TreeSet, uniqueness will be solved in HashTreeSet. So we don’t need to complicated the current TreeSet, and still keep the HashTreeSet simple because it only need to deal with uniqueness and reuse sorting from TreeSet.

I think there is an another way to accomplish this with TreeSet alone. The trick is that we need to write a custom comparator here:

object EventComparator : Comparator<Event> {
	override fun compare(event1: Event, event2: Event): Int {
		if (event1.id == event2.id) {
			return 0
		}
		return event1.year.compareTo(event2.year) // this also could be hacked if we want 2 different events with the same year
	}
}
2 Likes

Super simple and cool solution, I didn’t thing about it at all. Thanks!!!
But, I need to try if it works :slight_smile:

2 Likes

the idea works … but :neutral_face:, the tree will not allow replace the old element. Therefore, still need to remove then add item. If we must remove then add, there’s no benefit to have a custom comparator.

1 Like

Yes, but sorting of large lists should be done on the db side and not in your business logic since DB’s are optimized for exactly that and allow you to store the necessary internal data to do so in a fast way.

1 Like

Sometime, you don’t get data from db. For example, you must get it from 3rd party source and you want do it in memory.

Just wanted to respond to a few specific comments:

More collections does not necessarily mean worse for memory. The most basic example is that storing two lists of Ints isn’t noticeably worse in performance than a single list of Pair<Int, Int> with the same data.
Not the same as memory but I’ve found that in general performance, you really have to either work out the proper big-O notation or run benchmarks to know what is better or worse–assumptions with performance are more than often incorrect.

My ordering is to: Make the thing (don’t over pre-optimize), benchmark the thing, only optimize the top issues.

For minor things like two collections it may be easier to ignore that level of detail until you have a functioning project.

This isn’t something we really want to jump the gun with. Definitely not something to add to the core of the language (at least now). Adding some new APIs to the stdlib is probably more what you meant though.
Still, even adding it to the stdlib would be a big commitment. First the solution should be solved with a library and used extensively. After that, it might be considered important enough to add to the stdlib. Things like serialization, coroutines, HTML, are all excluded from the stdlib and provided as libraries–I suspect a library solution not in the stdlib would be good enough :slight_smile:

2 Likes

This is more of a Java question than a Kotlin one. I recommend reading the Javadoc for the classes in question. In particular, note the following from the documentation for TreeSet:

Note that the ordering maintained by a set (whether or not an explicit comparator is provided) must be consistent with equals if it is to correctly implement the Set interface. (See Comparable or Comparator for a precise definition of consistent with equals .) This is so because the Set interface is defined in terms of the equals operation, but a TreeSet instance performs all element comparisons using its compareTo (or compare ) method, so two elements that are deemed equal by this method are, from the standpoint of the set, equal. The behavior of a set is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Set interface.

The term consistent with equals is described in the documentation for both Comparator and Comparable:

The ordering imposed by a comparator c on a set of elements S is said to be consistent with equals if and only if c.compare(e1, e2)==0 has the same boolean value as e1.equals(e2) for every e1 and e2 in S .

This means that you have to be really careful if you use TreeSet.

As for how to solve your problem. First, like arocnies said, avoid premature optimizations. Don’t optimize before you know have a problem. Second, if you have a large amount of data that needs to be indexed in multiple ways, you really should start by considering if a database wouldn’t be more appropriate. This could be a classical server based database such as MySQL, PostgreSQL, Oracle etc., but it could also be a local file based or in-memory database such as H2.

If you are sure you actually need this, then I recommend using two datastructures; one for indexing by ID to ensure uniqueness, and one for indexing by year, to maintain ordering. Then you should encapsulate all of this into a single class. So something like this:

class EventCollection {
    private val byYear: TreeMap<Int, HashSet<Event>>
    private val byId: HashMap<String, Event>

    fun insert(event: Event) {
        val oldEvent = byId.put(event.id, event)
        if (oldEvent != null) {
            byYear[oldEvent.year]!!.remove(oldEvent)
        }
        byYear.getOrPut(event.year, { HashSet() }).add(event)
    }

    fun orderedByYear() = byYear.values.flatten()

   /* implement more methods as needed */
}
1 Like

I mostly agree with those points, but I wouldn’t call it a java question. Even though we use java classes here in the end there is still the question if this should be added to the std-lib or as a standalone library and in both cases this should be done with kotlin.

Also small nitpick on your code example. If memory size is an issue (imagine something like this running on a server for years, but sorted by year and month maybe) you might want to consider removing empty hash-sets from the byYear TreeMap.

Well, in Guava you could also use TreeMultimap to simplify this.

Agreed but your described example sounds like you are accumulating data over a longer period of time. Maybe I’m wrong but this is exactly what a db is for. Add the events into a db as you get them and then request the sorted list back when you need it.
That said maybe is misunderstand your usecase and a db is not possible in your case. I would suggest a solution like @Varia then, or just go with your initial idea.
Only important part here is that you encapsulate your access into a class you can properly test. Also you avoid accessing the data in a way that messes up your sort/identity requirements.

Never mind being consistent with equals; maxmax1028’s custom comparator isn’t even transitive, is it?

Say we have:

a = Event(1, 2019)
b = Event(2, 2020)
c = Event(1, 2021)

Then according to the comparator, a < b and b < c, but a = c — when transitivity would require a < c.