Keeping data in memory instead of using databases

Lately I have been developing server-side web apps where I keep all the data in memory instead of using databases. Sounds crazy, right? Of course, for apps with huge amounts of data this would not work. For apps with smaller amounts of data, it works pretty well. Where the threshold lies, I don’t know yet, but I imagine most SaaS apps could be developed this way:

Saving data
Dump the data to a JSON file before the app instance gets restarted. When you want to restart the app, always run a task that makes sure the data has been stored before the restart happens. Additionally, you need a backup in case something suddenly shuts down your app. Therefore, dump the data to the JSON file at certain intervals, for example every 10 seconds. The data is stored in one specific JSON file, but you should also keep some timestamped backup files.
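
A minimal sketch of what such an interval dump could look like (appData and saveOnShutdown are just illustrative names; assumes Jackson’s Kotlin module on the classpath):

import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import java.io.File
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.fixedRateTimer

// All app state in one map (hypothetical type; could be anything serializable)
val appData = ConcurrentHashMap<String, String>()

val mapper = jacksonObjectMapper()
val dataFile = File("data.json")
val backupDir = File("backups").apply { mkdirs() }

// Dump the state every 10 seconds: once to the main file,
// once to a timestamped backup file
val saveTimer = fixedRateTimer(name = "autosave", daemon = true, period = 10_000L) {
    val json = mapper.writeValueAsString(appData)
    dataFile.writeText(json)
    val stamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"))
    File(backupDir, "data-$stamp.json").writeText(json)
}

// Before a planned restart: cancel the timer and write one final dump
fun saveOnShutdown() {
    saveTimer.cancel()
    dataFile.writeText(mapper.writeValueAsString(appData))
}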

If data loss is not a big problem in your app, then save less frequently and skip the restart task.

Loading data
Load all the data from the JSON file on startup of the app. If loading fails, retry at certain intervals. For apps where data persistence is not important, you can continue with empty data if the loading fails, but you should then avoid saving data until it has been successfully loaded from the file.
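
A sketch of retry-on-failure loading (the map’s value type is just a placeholder; the caller decides what giving up means: start empty or exit):

import com.fasterxml.jackson.core.type.TypeReference
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Try to load the data, retrying a few times before giving up
fun loadData(file: File, maxAttempts: Int = 5, delayMillis: Long = 2_000): ConcurrentHashMap<String, String>? {
    repeat(maxAttempts) { attempt ->
        try {
            return jacksonObjectMapper().readValue(
                file.readText(),
                object : TypeReference<ConcurrentHashMap<String, String>>() {}
            )
        } catch (e: Exception) {
            println("Load attempt ${attempt + 1} failed: $e")
            Thread.sleep(delayMillis)
        }
    }
    return null
}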

Security
You don’t need to think about SQL injection :smiley:

You can encrypt/decrypt the JSON before it’s stored on disk or in the cloud. Passwords should, as in databases, be hashed, and login attempts checked against that hash.
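
For example, hashing could look like this (a sketch assuming the jBCrypt library; any proper password-hashing library works the same way):

import org.mindrot.jbcrypt.BCrypt

// Store only the hash, never the plaintext password
fun hashPassword(plain: String): String = BCrypt.hashpw(plain, BCrypt.gensalt())

// On login, check the attempt against the stored hash
fun checkPassword(attempt: String, storedHash: String): Boolean =
    BCrypt.checkpw(attempt, storedHash)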

SAAS apps
Run one app instance, and have one JSON file for each customer. Make a specific subdomain or route for each customer, where their users log in.
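
A sketch of the per-customer split (assumes Ktor 2.x; CustomerData and loadCustomerData are illustrative placeholders):

import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import java.util.concurrent.ConcurrentHashMap

// One in-memory map per customer, each of which would be backed
// by its own JSON file
class CustomerData(val values: ConcurrentHashMap<String, String> = ConcurrentHashMap())

val customers = ConcurrentHashMap<String, CustomerData>()

fun loadCustomerData(name: String): CustomerData =
    CustomerData() // In a real app: parse "data/$name.json" here

fun main() {
    embeddedServer(Netty, port = 8080) {
        routing {
            // A route per customer; a subdomain lookup would work the same way
            get("/{customer}/hello") {
                val name = call.parameters["customer"]!!
                val data = customers.computeIfAbsent(name, ::loadCustomerData)
                call.respondText("Customer $name has ${data.values.size} entries")
            }
        }
    }.start(wait = true)
}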

So…
What do you think?
Any concerns?
How would this work for bigger SAAS apps?
Where is the limit for how much data can be stored in memory?

The huge advantage of this style is that we don’t need the pain of moving data from a DBMS, with its own data types, to Kotlin, with different data types. Testing is easier, and set-up is easier. I love this way of developing, but I don’t know how scalable it is. Would love to hear your thoughts.

Did you try Redis or Memcached? Also, many classic embedded databases support in-memory mode, for example: SQLite, H2, Berkeley DB.

I don’t really think the size of the data is the main factor or reason to store data on disk. As a matter of fact, I believe most classic databases like MySQL or PostgreSQL will load all the data into memory and work from it if they can fit it there.

The main reason to use the disk is persistence and reliability. Saving the data every 10 seconds is simply not enough for most production uses. Databases can at the same time store huge amounts of data, work very quickly, and guarantee data consistency in cases like sudden power failure.

I don’t really see how it changes anything here. We still need a database layer, we still need data classes for storing the data, and we still need to save and load at specific points in the code. We can’t simply take any domain objects and serialize and deserialize them, because there are circular references, there is data we don’t want to store, and there are objects that don’t make sense to serialize at all, e.g. some services. And if you already have a DB layer where you can safely serialize/deserialize the data to/from JSON, then, well, that means you can use any existing document database or even an SQL database with the same effort as your own database.

I think: don’t reinvent the wheel. Existing databases will do everything your own solution can do, they will do even more and they will do everything 10x better, faster or more reliably.

If it’s strictly for fun or learning purposes - sure, great idea; you can learn a lot and there will always be some room for improvement to have more fun. If it’s for production and “real” uses - no, don’t do this.


While it’s an attractive idea, there are many concerns, such as:

  • How much data might you lose if the process or machine goes down suddenly, and how much does that matter? (In some cases, losing data might not be a problem — but in others, e.g. financial transactions, it could be disastrous.)
  • Can multiple threads access the data at the same time (e.g. if multiple requests are processed simultaneously)? If so, how do you ensure that they don’t interfere, and that both see a consistent view of the data (especially if there are internal relationships and invariants)?
  • What happens when the data grows too much for the available memory? (Memory is always limited, and one app rarely gets a whole server to itself.) Do you drop some of it, or write it to disk?
  • What happens if your app/server isn’t fast enough to service all the incoming requests? How do you scale if your app (and server) can’t cope? (Not all data splits neatly into independent units.)
  • Does it matter if the data isn’t available for a period of time? If not, how do you do maintenance, upgrades, HA, etc.?

These are just some of the problems that database servers have been solving for decades. There are also frameworks (such as Spring/Hibernate) which take most of the pain out of transferring data between a DB and Kotlin objects.

Now, maybe none of those things are a concern in your case, and so a simple in-memory system could be suitable.

But even if you have a ‘toy’ system now, you’ll often find that it needs to handle far more data and/or processing than you plan for. And by the time you’ve enhanced it to address some of the issues that will arise, you’ll have reinvented many wheels, when you could simply be using an existing solution.


There is no need for a database layer. The loading of the data happens at the beginning, and the storing happens at the end, plus every 10 seconds as a backup in case storage on shutdown fails.

Example of a simple app where all data that needs to be stored is related to specific users.

import com.fasterxml.jackson.core.type.TypeReference
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import java.io.File
import java.util.concurrent.ConcurrentHashMap

val userSessions: ConcurrentHashMap<String, UserSession> = try {
    jacksonObjectMapper().readValue(
        File("src/main/resources/userSessions.json").readText(), // Or use cloud storage
        object : TypeReference<ConcurrentHashMap<String, UserSession>>() {}
    )
} catch (e: Exception) {
    logError("Exception while loading user sessions: $e")
    ConcurrentHashMap() // Alternatively exitProcess(1)
}

For saving the data:

val json = jacksonObjectMapper().writeValueAsString(userSessions)
File("src/main/resources/userSessions.json").writeText(json)

Then, on each HTTP request we find the user session (I use Ktor):

get("/") {
  val userSession = getUserSession(call)
  userSession.doWhateverYouWantAndItWillAllGetStoredAutomatically()
}

I find the correct UserSession based on the Ktor session, or create a new Ktor session and add a corresponding empty user session to the userSessions map:

fun getUserSession(call: ApplicationCall): UserSession {
    // Reuse the existing session if the cookie points at a known user
    // (assumes the Sessions plugin is configured with a String session
    // named "userId")
    (call.sessions.get("userId") as? String)?.let { userId ->
        userSessions[userId]?.let { return it }
    }

    // Otherwise create a new Ktor session plus an empty UserSession
    val userId = UUID.randomUUID().toString()
    call.sessions.set("userId", userId)
    return UserSession().also { userSessions[userId] = it }
}

So no database layer needed. You can edit userSessions as much as you want and everything will get stored to the JSON file.

You would lose the last few seconds before it went down. For financial transactions and such this would be a huge problem, yes.

I think this works fine when using ConcurrentHashMap, CopyOnWriteArrayList, etc. for the objects that are accessed. It’s just the same as for other data in the app.

That’s the $100 question. I have no idea what the maximum amount of memory is for which this will work. I would think that getting more memory is cheaper than paying people like us, as we earn extremely high salaries. Better that we can produce more software per hour we work, which is possible when we skip databases.

You should probably use databases if the app is likely to hit the limit for how much data you can store (whatever that limit is). On the other hand: if you are working for a startup, the chance of bankruptcy is pretty high, so maybe you just want to make a simpler solution first and rewrite later in the unlikely event of it becoming a success. The overall development cost would then probably be a bit higher (in the case of success), but the likelihood of success is so small that it’s better to save those hours in the beginning and get out a good MVP that increases the chance of success.

Great question. That’s one of the possible issues here. I think for SaaS apps it should work pretty well, as they can be split by customer, while in other cases this would be a problem.

Elaborate please. I don’t know if this would be different than for database backed apps.

It works perfectly for now, but I have very small amounts of data. I’m curious whether this could work for bigger apps, for example SaaS apps with lots of CRUD operations.

I’m probably missing something here, but if we can do this:

val json = jacksonObjectMapper().writeValueAsString(userSessions)
File("src/main/resources/userSessions.json").writeText(json)

then we can do this as well:

persistInTheDb(sessionId, jacksonObjectMapper().writeValueAsString(session))

What do you mean exactly by: “the pain of moving data from a DBMS, with its own data types, to Kotlin, with different data types.”?

Yes, but you would need to map to database types in persistInTheDb()? What if you have something like:

data class Patient(val id: String, val address: Address, val surname: String)
data class Address(val country: Country, val zipCode: Int, val streetName: String)
enum class Country {
    USA, MEXICO, SPAIN
}

In a database you would need either a separate addresses table or to split the address object into several fields in the patients table?

DBMSes have different types than Kotlin: String becomes varchar; Boolean sometimes becomes boolean and sometimes tinyint; Int can become int or bigint.

Also, the database doesn’t let you define custom types unless you make a separate table, as in the example above.

In a document db you could maybe store a JSON object indexed by session ID? If so, it could be done as you describe.

Yes. And as a matter of fact, this is one of the most common patterns when using NoSQL/document databases. JSON has also become a standard/native format for many such databases, and they support additional features like indexing items by parts of the JSON, etc.

For your example with patients, please note that if you had a Patient.doctor: Doctor many-to-one relation, then during serialization of the list of patients you would serialize the same Doctor tens of times. After deserializing, you would still keep all the instances separate. equals() should still work correctly when comparing a doctor to itself, but you consume tens of times more memory than before serialization. And this is still a pretty simple example; add a few more relations and you can end up serializing/deserializing the same instance thousands of times. Add a circular reference and you have a much bigger problem. This is what I mean by saying we can’t pick a tree of domain objects and serialize it just like that.
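
A tiny round trip illustrating this (assuming Jackson’s Kotlin module):

import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import com.fasterxml.jackson.module.kotlin.readValue

data class Doctor(val id: String, val name: String)
data class Patient(val id: String, val doctor: Doctor)

fun main() {
    val doc = Doctor("d1", "Dr. Who")
    val patients = listOf(Patient("p1", doc), Patient("p2", doc))

    val mapper = jacksonObjectMapper()
    // The same Doctor gets embedded twice in the JSON
    val json = mapper.writeValueAsString(patients)

    val restored: List<Patient> = mapper.readValue(json)
    // equals() still holds, but the shared instance has been duplicated
    println(restored[0].doctor == restored[1].doctor)  // true
    println(restored[0].doctor === restored[1].doctor) // false
}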


Great! Then that would be a nice option for storage. Would you then say it’s OK to do as I suggested: loading all the data on start-up and then saving everything before shut-down (plus every 10 seconds in case of sudden failures)?

Yeah, this is something I have been struggling with, but it’s not worse than in a relational database. I currently store just the ID, as we would in a relational database, and then make a method to find the actual object, like this:

class Patient(val doctorId: String) {
    // doctors is the app's in-memory list of all Doctor instances
    fun doctor(): Doctor = doctors.first { it.id == doctorId }
}

Can this be done easier in a document database? You would have the same duplication issues there if it’s stored as JSON, right?

Sounds like you’re trying to avoid the complexity and time cost of having to use a traditional database. Which is a great idea when those costs are high.

I suspect you’ll get at least two types of responses:

  1. The high cost is worth it in many use-cases
  2. The cost isn’t high when you use the right tools

What you’re creating is essentially a custom NoSQL database. Since your NoSQL database is your own, you’ll learn a ton, know how to use it well, and it’ll have few features (which isn’t bad if you’re wanting it to stay simple).

I’d argue for #2 in most cases. You can find existing databases and database libraries that provide what you’ve created. Some offer more features or are more complex. You might find it’s more flexible to use a library, since switching from an in-memory/filesystem store to an external DB can be as easy as a config line.

Either way, it’s fun to create your own DB. If your use case doesn’t require the reliability provided by existing tools and you’re willing to spend the development time, it’ll probably be rewarding.


I meant it is a common pattern to store JSONs under some keys. Not that it is common to load everything into memory and flush to disk every 10 seconds.

Answering your question with another one: what are the benefits of this compared to the classic approach where we persist the data when needed? With the classic approach, we can perform operations in transactions, our changes are atomic and isolated, we can easily roll back, and the database guarantees our data is safe. If modifying objects directly in memory, other threads could see our partially applied changes, so they observe inconsistent data; we can’t easily roll back in the case of failures; and the data we persist may get lost due to power failure. So what are the benefits of this, other than the fact that we don’t have to invoke a persist() function because it is automatic?

No, it is not worse. My point is: whatever approach we choose for storing the data, we have to design some kind of DB layer, often with separate classes, because graphs of objects stored in memory are not that easy to serialize and deserialize. It could be a little easier to implement with your approach, but we can’t entirely avoid this problem. Also, even when using relational databases, ORMs are quite good at solving this problem, and they often let you avoid having separate classes for the DB. They have other cons, though.

What you described here is a kind of document database, and yes, document databases share the same problem. Often, depending on the specific case, correlated data is stored together in a single JSON document, but there are also “links” to other documents. However, “joins” in document databases are much slower than in relational databases, so frequent jumps between documents should be avoided.


I see. If you want to read/write the database on every data change, that’s more complicated than what I outlined. I need to look more into document databases, but what I have seen so far seems pretty complicated compared to my approach. With databases you need a DTO or ORM or otherwise a database layer. I don’t have any database or database layer (maybe depending on the definition). I simply modify the UserSession directly, and it’s pretty chill. I know it won’t work for every app, and the question is how scalable it is.

I guess this approach could be an alternative if my app outgrows the current pattern:

get("/") {
  val userSession = getUserSession(call)
  userSession.changeSomeOfTheData()
  persistInDb(userSession)
}

But this app is pretty simple, only the user itself can modify their UserSession. In the case of SAAS apps, where different users can modify the same data, there would need to be only one running instance per customer for this to work. Company A has one instance, company B another instance etc.

I might sound arrogant, but we don’t need a database layer, at least for apps with small amounts of data. I don’t have a database in my solution, and it works fine, at least for now.

I know that serialization and deserialization can sometimes be a problem. A function, for example, is not serialized, at least not with regular Jackson. Ideally there would be some Java/Kotlin native way to dump an object (or maybe all the memory) to a file. Then we wouldn’t have the problem of some objects not being serialized, and we could keep shared object references instead of duplicated nested objects. What do you think about this?
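
For reference, Java’s built-in serialization is close to this: within one stream it preserves shared references (and cycles), though classes must implement Serializable. A quick sketch:

import java.io.*

data class Doctor(val name: String) : Serializable
data class Patient(val doctor: Doctor) : Serializable

fun main() {
    val doc = Doctor("Dr. Who")
    val patients = listOf(Patient(doc), Patient(doc))

    // Dump the object graph to a file
    ObjectOutputStream(FileOutputStream("data.bin")).use { it.writeObject(patients) }

    @Suppress("UNCHECKED_CAST")
    val restored = ObjectInputStream(FileInputStream("data.bin")).use {
        it.readObject() as List<Patient>
    }
    // The shared Doctor instance survives the round trip
    println(restored[0].doctor === restored[1].doctor) // true
}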

I will learn more about document databases to be able to better assess this alternative.

Most of the time you give user sessions as the example. This is a very specific case; it is much less complicated than the usual needs for databases. When working with sessions, we modify only a single “row” of a single entity at a time, so we don’t have to keep the data consistent between multiple rows/entities. In many cases we don’t need persistence at all; we don’t care about corrupted data, as we can pretty much trash everything and start from scratch. This is one of the reasons why sessions are often stored in a separate DB like Redis or Memcached, not in the “main” DB of an application.

We don’t really have to discuss banking applications, where data inconsistency/corruption could mean that we performed a money transfer partially, so we deducted money from account A but didn’t add it to account B. Even if we talk about something trivial like a discussion forum, we still need guarantees about change atomicity, data consistency, etc. For example, we first create a new topic and then add the first post to it (which should be done together), but between creating the topic and the post, other requests could see the topic without any posts, which is considered an error. If we write to a file in that time window, we have actually stored a corrupted data state, and if we restore from it later, it will be corrupted forever. If two users add a post to the same topic at exactly the same time, then depending on the implementation it could work correctly, but one of the requests may crash or one of the posts may be overwritten by the other.

We can fix such problems by synchronizing threads, but this would be very hard to do properly, and DBs provide ready-to-use solutions for them. Additionally, using mutexes would probably be much less efficient than what DBs do. Also, if you modify any data while it is being serialized to be written to a file, that could crash the serialization process. So we would probably have to block all write operations every 10s for, say, 1s. And that also adds complexity to the code, because such a global write lock won’t be easy to implement if everything has direct access to the data structures and can modify them at any time.
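
To illustrate the kind of global lock I mean, a sketch (the inversion is deliberate: mutators share the “read” lock so they can run concurrently with each other, while the snapshot takes the exclusive “write” lock):

import java.util.concurrent.locks.ReentrantReadWriteLock
import kotlin.concurrent.read
import kotlin.concurrent.write

val stateLock = ReentrantReadWriteLock()

// Every request that mutates the in-memory data wraps its changes in this
fun <T> mutateState(block: () -> T): T = stateLock.read { block() }

// The periodic save takes the exclusive lock, so nothing mutates the data
// mid-serialization; serialize/writeOut are placeholders
fun saveSnapshot(serialize: () -> String, writeOut: (String) -> Unit) {
    stateLock.write { writeOut(serialize()) }
}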

So the question is: when you say this approach works for you so far, what size of application do you mean? Is this a publicly available site with multiple users using it at the same time? Or does it work in your tests, but you haven’t used it in practice, or you used it on sites that are rarely used by multiple users at once? Also, do you use this already for the main data of the application or only for user sessions?


Yeah, that’s right. We would need to require Topic to receive an opening post.

Wouldn’t it work fine with a concurrent list (e.g. CopyOnWriteArrayList)?

Interesting. Does that apply to jacksonObjectMapper().writeValueAsString(userSessions)? If the object is modified during the write process, does it crash?

My apps are used only by a few users. All the data that needs persistence is stored in the userSessions map, and there is no interaction between users. Other data is constant, either stored in code or in CSV files.

Either I overcomplicate things or you oversimplify :wink: Web applications are not just a single list or map and that’s it. In web applications we have complicated graphs of correlated data, and usually we don’t only add a new item somewhere; we do much more. Concurrent data structures are not magic bullets that solve all concurrency problems.

But well, maybe this approach will work for you :slight_smile:

I don’t know whether writeValueAsString() could crash or not. If the documentation doesn’t explicitly say it can handle the data being modified while being serialized, I would definitely be careful.

It works perfectly for now, but let’s see if I’m able to get more users to my apps and how it goes then.

If concurrency is a problem, with different users editing at the same time, it would in many cases be possible to queue up the requests and handle only one request at a time. Example: a SaaS for smaller veterinary clinics. It should work completely fine to queue up other requests while the first one is handled. Maybe this sounds crazy, but if handling a request takes 100 ms on average, it won’t happen often that two of the 10 veterinarians at a clinic have a request delayed. And if it happens, the request takes maybe 200 ms instead of 100 ms. No big deal. So a possible solution is to queue up the requests, as in the sketch below. If serialization of the data is a problem with regard to concurrency, that task could also place the other requests in a queue. Of course this solution won’t work for apps where thousands of users are communicating, but for SaaS apps for smaller companies, where each company can be completely separated (Clinic A doesn’t communicate with Clinic B), this could be a possible solution.
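
Since Ktor is coroutine-based, the queueing could be a per-customer Mutex from kotlinx.coroutines (clinicId and withClinicQueue are illustrative names):

import java.util.concurrent.ConcurrentHashMap
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// One mutex per clinic: requests for the same clinic are handled one at a
// time, while requests for different clinics proceed in parallel
val clinicLocks = ConcurrentHashMap<String, Mutex>()

suspend fun <T> withClinicQueue(clinicId: String, block: suspend () -> T): T {
    val mutex = clinicLocks.computeIfAbsent(clinicId) { Mutex() }
    return mutex.withLock { block() }
}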

Just throwing it out there: if you really want to reduce the cost of implementing a backend, a BaaS would likely be more effective than a custom solution if you are planning on putting your app into production.

Here’s a survey of a few.

Kotlin should be compatible with most of them due to the language interop. I believe Appwrite and Firebase (maybe AWS Amplify as well?) have idiomatic Kotlin APIs.

I have a “backend”, if by that you mean server-side code. I have very little JavaScript; almost all the logic is done on the server. I don’t use React or any other JS framework, but generate the HTML on the server (like we all successfully did until things got extremely complicated from 2013 onward).

It seems like Firebase and the other BaaS offerings are more for JS apps.

what happened in 2013?

The pattern where you have an SPA on the front-end (React, Svelte, Vue) and a JSON API on the server (plus a lot of microservices if you want to make it even worse).