Hi Kashyap,
I have a bit of experience with ActiveRecord and some more with Entity Framework.
As said in the other responses, the big fundamental difference between
the PicoLisp database architecture and ORMs is that in PicoLisp the
application layer and the database layer are the same layer; there are
not two very different conceptual systems (OOP vs. relational) with a
"translation system" in between.
This results in several benefits when developing applications using
PicoLisp database:
* higher productivity
o in PicoLisp, the database "table definition" and the application
class are the same thing. In other stacks you have to do the
"object definition" multiple times: in the database schema, in an
application class, and usually you additionally define a "business
model" object (or even several). Every time it is basically the
same "grouping of some properties/columns" with only small (but
substantial) deviations.
Even when these frameworks generate some or all of this extra
code for you, in practice it is often necessary to look at this
code and adjust it so it works as intended, especially when making
changes to an existing application.
* less code, less space for bugs
o It is easier to analyze and debug what is happening with the
data between input/storage and storage/output without having to
track the values being copied from one intermediate (layer
transition) object to the next.
* simpler mental model
o as fewer screws and mechanics are present in the PicoLisp stack,
it is easier to understand and imagine in your head what happens,
and easier to anticipate which potential effects you need to
take care of.
o in mainstream stacks, the conceptual and technical barrier
between OOP application programming and relational SQL database
programming is often very pronounced. In my experience,
application programmers these days often have a very incomplete
understanding of how a database works (e.g. they don't know
about transactions). In the best case this results in
insufficient usage of the DBMS (e.g. amateurishly re-implementing
functionality in the application which would be available
from the DBMS in a tested, optimized and reliable way); in the
worst case it results in widespread disregard of fundamental
data integrity vulnerabilities, which depending on the
application might cause anything from minor annoyances up to
complete and utter failure of the whole thing (e.g. losing vital
data, which might bankrupt a company).
* queries and application cache
o As you saw, it is possible to write very involved and powerful
queries on the application level in PicoLisp.
These queries make use of the database data cached on the
"application layer", meaning the database is only asked for
records which were not already accessed in the same session.
o Most (maybe all?) ORMs cannot take advantage of the records they
have already cached, because the ORM has only a very
limited implementation of SQL: the query (e.g. written with
LINQ on the application level) gets translated to SQL and sent
to the database to be executed there, and then the results are
sent back to the application - even data which is already
cached on the application layer is sent back and forth.
+ e.g. Entity Framework only uses its own cache when directly
fetching a record by primary key (id),
all other queries are executed on the database as Entity
Framework has no query engine.
+ this hurts performance even more when the database is
located on a different server than the application, as is
often the case.
PicoLisp database is not relational:
* there is no "primary key" in PicoLisp
o while you might often have a (rel id (+Key +Number)) to have a
simple index to find all records of the same entity, such a
property is completely optional
o a record only needs to be in at least one index or be linked
to from another record, so that it can be found and is not seen as
garbage (as in garbage collection: an object to which nothing points)
* every object has a globally unique identifier
o global per database
o this is the name of the external symbol
+ which encodes the physical location in the database file
where the record is stored
o this makes it possible to directly link a record to another
(also many to many)
+ in relational databases this requires a foreign key (which
is an index, roughly comparable to +Ref)
+ in relational databases, many-to-many relations require an
interim table
# this interim table is usually hidden by the ORM
* lazy loading is possible because every record has a globally
unique identifier
o in relational databases, the "link" is storing the value of the
primary key of another record
o retrieving the complete record in a relational database means:
+ 1) find the corresponding entry in the specific primary key
index
+ 2) from this the database retrieves the pointer to the
physical location in the database file
+ 3) read the record from the database file
o retrieving the complete record in PicoLisp database means:
+ 1) read the record from the database file
o scenario: you have a number of interlinked records (as is
natural in OOP), and you want to have them available on the
application layer
+ ORM:
# 1) get a list of records from the database (or a single
record)
# 2) create for each of them an application object,
copying all properties/fields from the database record
to the application object
# 3) when the property/field is a foreign key, retrieve
the corresponding record from the database, create
another application object and link it (by memory
pointer) to the other related application object(s)
# 4) repeat 2) and 3) for every such indirectly retrieved
record because it was referred to (by foreign key) from
the previous processed record
# this is eager loading
* loading everything reachable from the "starting
point" records returned by the query
# hope you have no indirect circular relations - direct
circular relations might be handled by the ORM, but
indirect ones often aren't - resulting in endless recursion
until you get a stack overflow within the ORM code
* I just recently managed to do this by accident with
an Entity Framework application
# obviously it often leads to "over"-loading:
* retrieving more records from the database than
actually required/used for the current operation in
the application
* creating all those extra application objects and
doing all these copying of property/field values
# ORMs do allow customizing this "automatic default loading"
* extra work, extra source of bugs, ...
* when looking at an application you haven't written
yourself, you have to dive into this code to know
what is loaded, when and how.
+ PicoLisp database:
# 1) get a list of records from the database (or a single
record)
# 2) when the property/field is a +Link (or +Joint), we do
nothing special with it - the value just happens to be
the name of an external symbol (or multiple)
# only when the related record is accessed in the
application is it retrieved from the database
* e.g. accessed via (get This address street name)
# this is lazy loading
* only load when we actually use it
# circular relations are no problem, no matter how
indirect/nested
* on first access the record is loaded
* on further access the already cached object is re-used
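The lazy-loading behaviour described in this scenario can be sketched in Python. This is a toy model, not PicoLisp's actual implementation; `DB`, `LazyRecord` and the `"ref:<id>"` convention are invented for illustration (the `ref:` strings play the role of external symbol names). A record is read from the "database" only on first attribute access, links are followed lazily, and an identity map makes circular relations harmless:

```python
# Toy model of lazy loading via globally unique identifiers.
# DB plays the on-disk database; "ref:<id>" values play external symbols.
DB = {
    1: {"name": "Alice", "address": "ref:2"},
    2: {"street": "Main St", "resident": "ref:1"},  # circular link back
}

class LazyRecord:
    _cache = {}  # identity map: every id is materialized at most once

    def __new__(cls, oid):
        if oid not in cls._cache:
            obj = super().__new__(cls)
            obj._oid = oid
            obj._data = None          # nothing read from "disk" yet
            cls._cache[oid] = obj
        return cls._cache[oid]

    def __getattr__(self, field):
        if self._data is None:        # first access: load the record
            self._data = DB[self._oid]
        value = self._data[field]
        if isinstance(value, str) and value.startswith("ref:"):
            return LazyRecord(int(value[4:]))  # follow the link lazily
        return value
```

`LazyRecord(1).address.resident` follows the circular link and arrives back at the very same cached object, so no infinite recursion can occur.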
PicoLisp database is a multi-paradigm database system:
* direct support for OOP
o application objects can easily be persisted, without copying
from/to another structure
o inheritance and polymorphism can be applied to "tables" (as
Entities are real OOP classes)
* each record can have its own specific schema
o same as "NoSQL" document databases (e.g. MongoDB)
o maintenance nightmare when used in typical business
applications, but extremely useful for e.g. archive systems (old
records may keep their old schema)
* records can be directly interlinked, and these links can be
traversed easily and efficiently
o same as in graph databases
* no limits on field (property) size or record size
o well, there are (high) limits eventually, but if you hit them,
you most likely did something very badly wrong in your design
o text/strings can be of arbitrary length
o numbers can be of arbitrary length
o so some rather typical issues of applications using SQL databases
simply don't exist
+ e.g. my name/street/city/comment/whatever is 21 characters
long but the database only allows 15
+ e.g. we have to change the primary key from an int to a
long, because we have so many records we reached the size
limit of the int....
# yep, this happens.
+ this feature is even more advantageous than one might think
at first; it is really very useful in practice
# typical applications require various changes to the
application and the database whenever such a limit has
to be increased
# and all the ORM-generated code usually has to be adapted too
o relational databases have limits on the maximum possible size of
an individual record
+ practically this means: the number of columns in a table is
limited, as the combined maximum size of all columns must
not exceed the record size limit
+ no such limit in PicoLisp (as long as the symbol fits into
memory and you can still come up with names for all these
properties...)
* support for various indexing strategies:
o unique index
+ +Key
o non-unique index
+ +Ref
o various text index strategies
+ +Idx
+ +IdxFold
+ +Fold +Idx
+ +Sn (soundex, find by phonetic similarity)
o index based on multiple properties (columns)
+ +Aux
o spatial index
+ +UB
+ for geospatial or other data with multiple numeric dimensions
o and more...
+ e.g. creating your own custom indexing mechanism is quite
possible in PicoLisp
# not so much in usual relational database systems
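To give a feeling for what a phonetic index like +Sn does, here is a minimal sketch of the classic Soundex encoding in Python. This is the textbook algorithm for illustration only; PicoLisp's +Sn uses its own soundex-style variant:

```python
def soundex(name: str) -> str:
    """Classic 4-character Soundex code, grouping similar-sounding
    consonants, e.g. "Robert" and "Rupert" both encode to "R163"."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    out = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h and w are transparent
        code = codes.get(ch, "")      # vowels get "" and break runs
        if code and code != prev:     # collapse adjacent duplicates
            out.append(code)
        prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]
```

A phonetic index stores this code instead of (or next to) the literal spelling, so a search for "Rupert" also finds "Robert".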
* +Swap for lazy loading fields/properties
o in relational databases, you do this with a separate table and a
1-to-1 relation
+ an ORM often doesn't know better than to do eager loading,
so when using an ORM this method usually needs additional
tweaking of the ORM
+ 1-to-1 relations are often vehemently discouraged in SQL
teachings, but in practice often a meaningful method to
improve performance by preventing "loading too much"
# it violates the beauty of the relational model
# fun fact: the founders of the relational model (the
theory applied in "relational" databases) argue that the
current implementations do not adhere to the relational
model
* so these "relational" databases are not even really
relational in pure theory
* full ACID <https://en.wikipedia.org/wiki/ACID> support
o NoSQL databases usually lack it, basically shoveling the
responsibility for data integrity onto the application layer
+ the application programmers are usually not aware of this
+ more than one company went bankrupt because of this
o mainstream relational databases have full ACID support
+ often the default transaction guarantees are rather weak, as
they are optimized for performance benchmarks
# but strong guarantees can be easily activated, if you
know the right magic words
* the right magic words differ between different
database management systems, and the words
themselves are often somewhat misleading
+ support for transactions in ORMs is usually rather weak
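The value of atomic transactions is easy to demonstrate with any ACID-compliant database; here is a minimal sketch using Python's built-in SQLite. A money transfer "crashes" halfway through, and atomicity guarantees the half-done change is rolled back rather than silently corrupting the data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
con.commit()

try:
    with con:  # one atomic transaction: commit on success, rollback on error
        con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
        # ... "crash" before the second leg of the transfer:
        raise RuntimeError("power failure")
except RuntimeError:
    pass

balances = dict(con.execute("SELECT id, balance FROM account"))
# the debit was rolled back, so no money disappeared: {1: 100, 2: 0}
```

Without the transaction, account 1 would have lost 50 with nobody receiving it, which is exactly the kind of integrity bug an application-level re-implementation tends to have.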
* the block size can be configured at a fine granularity
o see (dbs)
o most database systems store data in evenly sized slots in the
database file
+ filesystems (which are a kind of database system) work the
same (e.g. sectors on a harddrive)
+ because this allows quickly jumping to a specific location
within the database file, e.g. the Nth entry = N multiplied
by the blocksize (same as pointer arithmetic)
+ in contrast to a text file in which the lines have varying
lengths separated by a linebreak character
# so to get to the Nth line, you have to go from the start
through every character one by one, and count the linebreaks
o so a database record typically occupies at minimum one such block
+ or two, or many, but it can't occupy only half a block
+ unused space in a block is overhead (e.g. many filesystems
allow you to see the "real file size" vs. "size on disk",
where "size on disk" is always equal to or bigger than the
real size, rounded up to the filesystem blocksize)
o so you want the blocksize to fit the typical (median) size of
your database records
+ of course this is specific to every application
o in most database systems, you can have multiple database files
(or "partitions"), each with a fixed blocksize, and tables get
assigned to those files/partitions
+ same in PicoLisp database
o in most database systems, this blocksize is something to
configure in the database software, by an expert for that
database software, not by the typical programmer
+ so usually this is actually not adjusted to the use case,
but the database software defaults are used
o most database systems start with blocksizes of 4kB or higher
+ afaik because this is usually the page-size of the
filesystem and/or the OS caching of files
o a table in a relational database is kinda one big array
o in PicoLisp database the blocksize can be much smaller, as data
is not stored in fixed arrays but in external symbols (random
access)
o in relational databases, a table usually has to be stored
completely in one database file, as a consequence there is only
one blocksize for the whole table
o in PicoLisp, every record could be (created) in another database
file
+ moving a record later to another database file is not
feasible, because this would change the name of its external
symbol, so all references to that record would need to be
changed.
# if a re-adjustment of the blocksize has to be done on a
PicoLisp database, the usual way is to export and
freshly import the data
o so if the blocksize is properly defined in *Dbs for every
Entity, index and +Swap, then:
+ the PicoLisp database might have less storage overhead due
to padding in ill-fitting blocks (in the database file) than
a comparable relational database
+ (automatic) caching by the OS / filesystem might be more
efficient (as more records fit into the cache)
+ though this effect is highly dependent on the size and usage
patterns of the data (so all in all this point might be
rather academic)
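The block arithmetic behind all of the above can be sketched in Python (BLOCKSIZE is an arbitrary example value, not a PicoLisp default): the offset of the Nth slot is plain multiplication, and a record always occupies whole blocks, so the padded tail of the last block is the storage overhead:

```python
BLOCKSIZE = 64  # hypothetical block size in bytes (application-specific)

def block_offset(n: int) -> int:
    """Byte offset of the Nth fixed-size slot: plain pointer arithmetic."""
    return n * BLOCKSIZE

def size_on_disk(record_bytes: int) -> int:
    """A record occupies whole blocks; the padded tail is overhead."""
    blocks = -(-record_bytes // BLOCKSIZE)  # ceiling division
    return blocks * BLOCKSIZE
```

A 100-byte record in 64-byte blocks occupies 128 bytes on disk, 28 of which are padding; a larger blocksize would waste more per record, a smaller one would force typical records to span more blocks.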
* Support for Blobs (files, or other arbitrary binary data)
o PicoLisp actually stores Blobs not in the database files, but as
separate files, referenced from the database
o this is usually much more efficient than pressing such Blobs
into the database blocksize
o such files can easily (and usually efficiently) be accessed
o but such files are not cached in the application/database cache
+ this can be an advantage or a disadvantage, depending on
the use case
+ typically Blobs are used to store files which often are not
much processed by the application, mostly just imported (or
generated) and exported to the user
# then this mechanic for storing Blobs is clearly an advantage
+ in other scenarios such Blobs might be Bit-Tables used in
the application, which would better be cached
# though in PicoLisp such Bit-Tables could be stored as
+Number, no need to use +Blob for this
While PicoLisp database is well-suited for business applications, it is
less ideal for storing huge amounts of strictly uniformly structured
data (e.g. raw data for machine learning).
Here it has two disadvantages compared to relational databases:
* PicoLisp stores the property names for every non-NIL property on a
record
o so this is a small overhead per record and property, which might
accumulate to unfavorable levels when storing huge amounts of
similar data (similar to JSON, XML, ...)
o relational databases store data as arrays, so the field names
are not stored on every record (similar to CSV)
o this is why Alex uses very short property names (e.g. "nm"
instead of "name")
+ readability of the code is a habit thing
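The per-record overhead of stored property names is easy to make visible with Python's standard library: serializing the same 1000 records with per-record field names (JSON, similar to PicoLisp's stored properties) versus a one-line header (CSV, similar to a relational table's column definitions):

```python
import csv
import io
import json

# the same 1000 records, serialized two ways
rows = [{"name": f"user{i}", "city": "Berlin"} for i in range(1000)]

# field names repeated per record (like JSON, or stored properties):
json_size = len(json.dumps(rows))

# field names stored once in a header line (like CSV, or a table):
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "city"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue())
# json_size is noticeably larger, purely due to the repeated key names
```

Shorter property names shrink exactly this repeated-per-record part, which is the rationale behind names like "nm".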
* External symbols are good for random access, not so good for bulk
read/export of data
o this point is also valid for bulk insert, to a degree
o a relational database stores data similar to an array - in
contiguous blocks (per table)
o PicoLisp database stores data in external symbols
+ external symbols stored next to each other might be
completely unrelated (except same blocksize)
o so when a database query happens to request the data exactly in
the order as it is physically stored in the relational database
(usually by primary key, Microsoft calls this the "clustered"
key), then the relational database can very efficiently just
dump the whole part of the database file to the client, without
having to do any (or much) processing of that data.
+ though using an ORM probably ruins this advantage, as the
ORM is processing each record individually and copying the
data into application objects
I hope I could give you a good insight into how the PicoLisp database
compares to mainstream database stacks. In the end you have to decide
what best fits your requirements and the nature of the application/data.
I've now worked for several years in parallel on business applications
implemented in C#/MVC .NET/MSSQL and in PicoLisp, and I observe much
higher productivity and maintainability (flexibility, ease of making
changes to an existing application) and better performance (for the
application user) in the PicoLisp stack - but of course it also depends
on the actual software architecture and database design used.
And of course PicoLisp programming requires capable developers who are
interested in digging into the stuff, while the popularity and the "hand
holding" features/restrictions of those mainstream enterprise stacks
make a kind of minimum productivity possible even when no capable
developer is available, yadda-yadda...
Feedback and corrections are very welcome.
Kind regards,
beneroth
On 28.11.19 18:06, C K Kashyap wrote:
> Hi Alex,
> There is a plethora of ORM systems such as ActiveRecords (in
> Ruby/Rails) or Microsoft EntityFramework and similar solutions in
> other languages where Objects are mapped to SQL DB records.
>
> I'd love to know your thoughts about how PicoLisp's approach is
> similar/different from them.
>
> Regards,
> Kashyap