Re: [phplib] default authentication, anyone using it? From: Kristian Koehntopp (kris <email protected>)
Date: 08/05/01

On Sun, Aug 05, 2001 at 11:44:27AM +0200, giancarlo pinerolo wrote:
> > [ "vertical" and "horizontal" queries on user data ]

> That is why I push for XML representation of user data. You
> keep the free layout of user data, but can run 'vertical'
> queries on it.

Now, that pressed a button on me. Sorry, but I have got to rant.

You cannot run queries on text efficiently. Or in other words,
you are confusing an abstract data structure (e.g. the idea of a
Users preferences record as a data structure with certain names,
types and values) and the physical representation of a data
structure (e.g. the XML representation of that User preferences
record as opposed to the CSV representation of that same record
as opposed to the internal PHP in memory representation of that
same record and so on).

See, with the invention of LDAP and XML, people are falling all
over themselves to throw away their hard earned relational
database knowhow and go back to hierarchical structures. They
are chasing buzzwords, and this is going to cost them.

What do LDAP and XML have in common, and what sets them as a
pair apart from SQL and the relational model?

Both, LDAP and XML, are essentially tree structures, in the case
of XML even without a preparsed data representation and without
predefined indices. Both, LDAP and XML have no JOIN operation,
so you have to anticipate later joins and store prejoined data,
and they have only weak integrity checking and weak data types.
Both, LDAP and XML, are a big step backwards from the relational
model in terms of flexibility _and_ strangely rigidity at the
same time, also in terms of efficiency and expressiveness of the
available structures and operators.

If you assume that to be true for the moment (I'll show you that
it actually is true later below), how come LDAP and XML are to
popular in their respecive domains, and what particular
advantages do they have over existing relational implementations
like MySQL or Oracle?

Or in other words, why aren't we seeing login authentication or
mail routing against Oracle servers, while we do see login
authentication and mail routing against LDAP servers? This
particular case is easily answered: Oracle connections are slow
and carry a large overhead, even more so for simple queries such
as "does the combination of user x and password y exist in the
password database?" or "to which set of local or remote
addresses does the delivery address x <email protected> resolve?".

Also, the oracle client libraries are propietary, and they are
large. LDAP queries of this type are faster, and the respective
protocol is better documented and implemented in a much smaller
footprint. There are even free implementations of LDAP
available, while there is no free Oracle client library (or at
least, none that I know of).

So LDAP wins for this particular application case despite being
critically inferior in features because the wire format is
standardized and interoperable implementations exist, among them
free ones. This is a potentially fatal development: Companies
are starting out to implement mission critical databases such as
an employee register with access rights and login data on top of
of LDAP servers, with no integrity checking and no
crossreferencing (i.e. employee records referencing
manager-employee records, and employee records referencing room
records, and equipment records).

This being fatal is not just an idea of mine. I have actually
seen more than one company which is using LDAP servers in
production as the _leading_ database (the "leading" database in
an heterogenous environment is the one that holds the
authoritative data, and all other databases are fed from or
synced against the leading database) for their employee data,
and their sysadmins are running nightly integrity checking
scripts to find for example equipment records that are no longer
being referenced by any employees, and other scripts that find
employees that have no manager-employee.

Even in MySQL such a script is done with a single LEFT JOIN and
a HAVING ISNULL() clause, as in

select
  emp.*, equip.*
from
  equip left join emp on equip.emp_id = emp.emp_id
having
  isnull(emp.emp_id)

but since LDAP does have no joins, this ends up in enumerating
all equipment and generating an employee query for each piece of
equipment. So the LDAP server may be faster in answering a
casual and simple "does the username/password combination match"
query, but for all nonlinear queries that do not exactly match
the particular denormalized tree structure that has been
implemented, it is _by_ _far_ less efficient.

With Oracle you would not even have to execute the above
queries, but define the matching foreign key constraint as part
of the data model, so that there cannot be any equip record that
does not belong to some emp record.

Compare LDAP and MySQL instead. MySQL has much less connection
overhead, and the implementation is freely available. There are
actually pam modules for MySQL authentication, and sendmail
extensions for mail routing based on MySQL tables. Using MySQL
here is much better that using LDAP, for several reasons: MySQL
does not only allow you to store employee and equipment data in
different unrelated tables, and relate these tables at will in
queries of your design (that is, MySQL allows you to work on
normalized data), but MySQL also has much better data access and
even a smaller memory footprint than all LDAP server
implementations I know of.

NetUSE itself, and one of the larger customers of NetUSE have in
fact migrated from flat file structures to LDAP and then to
MySQL. Migrating from flat files to LDAP was a benefit because
of being able to distribute user data centrally, but was a pain
since we weren't able to use LDAP as a leading database
efficiently. This was, because we could not properly validate
the integrity of our data. By switching from LDAP to MySQL, we
were able to centrally distribute (and as of lately, mirror and
replicate) our data, but we can also check the integrity of our
data structures with a few queries. Using Oracle, we could build
this integrity checking even into our database model, but at
(too high) a price. MySQL is as good as any LDAP server in
casual queries, but MySQL does non-linear queries nicely as
well - LDAP doesn't do non-linear queries at all.

Now, this relates to XML in a similar way. Just like LDAP is a
denormalized tree structure, so is each XML file just a
serialized representation of a DOM (document object model) tree,
which is in turn again a denormalized represenation of some
data. As an XML file, the data is just a BLOB and unqueryable,
because it is unparsed, unvalidated and unindexed. To get a
structure that can be queried, you need to unserialize the data,
by parsing the actual XML into a DOM tree, and then run your
queries against the DOM.

So nobody every runs queries on actual XML data, but on
unserialized and DOMified data. I stress this point because
serializing and unserializing (parsing) from and to XML is an
expensive operation in terms of CPU cycles and memory. It is not
done casually, and you want to save the result of this parsing
and indexing operation. Also, it is usually done in RAM, which
limits the amount of data that can be handled at once.

Compare this to data stored in MySQL or Oracle tables. This data
is preparsed, and stored in some internal format that suits the
purposes of the databases query engine best. It is stored on
disk, with tuneable buffers in RAM. Thus, while more RAM
essentially benefits the query engine, the amount of data that
can be processed at once is in no way limited by the amount of
RAM you actually have. Also, indices readily exist on disk in
SQL databases, while there are no such premade indices in XML
data or DOM trees - there exist limited methods for the
specification of indices within XSLT, but these specifiers may
or may not be used by your XSLT processor and the indices must
be regenerated each time the data is being parsed again.

So what, one may ask, is the benefit of XML over data stored in
SQL tables? Again, the aspect of interoperability comes to mind.
XML data is just text, and while unuseable but for the simplest
of purposes in this form, anything can parse XML even without
having to write a parser for it - XML parsers are premade
generic parsers, just as a schema description. Also, by using
XML-Schema or a DTD, the format is self-describing. SQL tables
usually aren't, with MySQL being a noteable and extremely useful
exception

        mysqldump is _such_ a useful tool, enabling a developer
        to make changes to a schema on the fly and later
        recreate a complete source file to recreate the changed
        schema. It is also able to extract data from a live
        database as well, and in a form that is extremely
        self-describing. Oracle is lacking such a tool in the
        default install, so you can put data into an Oracle
        database, but you cannot get data out of an Oracle base
        in a self-descriptive form, nor can you get a reuseable
        schema description from an Oracle.

So XML has its purpose and benefit as a data-interchange and
data-transport format, it is an encapsulation representation of
data. Such a representation need not be normalized, and need not
be readily accessible nor does it have to contain integrity
checking, as such representation is used only in transit and no
operations are ever performed on it except "export" and
"import".

But, does an application want to store its internal data in XML
representation?

Certainly not.

An application is a closed and consistent set of functions, and
these functions just know what data they work on and how the
data is being represented in memory or on disk. There is no need
within a single application for a format like XML that has been
designed with a maximum of portability and transparency in mind.
Using an internal format that is readily available for
processing operatings is much more attractive, saving all that
parse and serialize operations.

Does an application want to be able to talk XML or any other
self-describing format to other applications for that very same
data?

Sometimes yes.

If the data in question is being shared by many applications,
and these applications need to be able to work on this data in
some unsynchronized batch-type fashion, and these applications
do not need to know of each others existence, XML is the way to
go. The application we are talking about should be able to dump
the respective data into some XML format, and be able to
reimport that XML dump into its internal format, even if it has
been processed by some other application inbetween. This import
requires not only parsing, but integrity checking as well, as
the importing application cannot know if the foreign application
knew anything about the particular constraints and requirements
of the importing application. For all the importing application
knows, the foreign application may even be actively hostile and
generate bad or unuseable data.

Is XML a generalized mechanism for applications to share live
data?

No, not at all.

There is no automatic support for locking and concurrent access.
XML data is just files.

There is no notion of individual storage for single records -
XML data is just files, not records, so the minimal unit of
locking is the file, unless you want to mark byte ranges as
locked. But then, a single operation may need to lock many, many
byte ranges in an atomic operation, if the operation is vertical
such as "lock all age fields in all user records, then find the
average age, unlock all". This cannot be done efficiently
outside of SQL databases.

There is no automatic support for versioning of records, for
rollbacks, and for transactional modifications.

So if your set of applications is going to share a set of _live_
data, you should design an SQL data model, and build the
necessary integrity checking logic into your data model, not the
individual applications. This way, the data is defended by the
data-keeping application, the SQL server itself. The application
may be activle hostily, but the data model will simple not
accept any data that does not satisfy the constraints you
defined.

Your applications should then talk to the database, which
handles the concurrency and transactional aspects of your data.

XML sucks here, but databases are specialized into handling this
type of situation (and this is where Oracle really shines, and
MySQL sucks only slightly less than XML files).

So what are the advantages and disadvantages of XMLified User
data over the current "User" BLOB?

Advantage:

- documented format that is easily parseable by any application
  (as opposed to some PHP program that represents the data and
  which can only be parsed by a PHP parser).

Disadvantages:

- data is still a blob, and one that needs a parser different
  from the builtin PHP parser.

- data cannot be queried natively by any database. There are
  no inexpensive vertical queries on XML data BLOBs.

And what are the advantages and disadvantages of SQLified
User data over the XMLified BLOB?

Advantages:

- vertical queries are just as efficient as horizonal queries.
- data can be accessed partially in horizonal and vertical mode
  ("give me just the background color for user x, not all user x
  preferences" is actually possible, as well as "give me the
  average age for all users that have logged in within the
  last 3 weeks.")

Disadvantage:

- SQLified table structure is more rigid. In order to introduce
  a new field you need to change the table structure.

- (without mysqldump): data is not easily exportable in a
  nonpropietary format.

To sum it up:

Yes, XML is better and more expressive than CSV for an export
format.

No, XML sucks raw eggs as the native data representation for an
application. It is just text.

Yes, normalization is something you don't want to give up
easily, especially when designing the leading database for a
certain set of data. If you need trees, see
http://www.koehntopp.de/tree.phps and
http://www.koehntopp.de/kris/artikel/sql-self-references/, and
do not give up the relational model.

Yes, concurrent access and integrity checking are hard problems,
and databases have been designed to solve exactly these
problems. You do not want to lose these solutions by chosing
tools that don't make use of this.

Yes, MySQL has many applications where it is actually suited
much better than Oracle for the job and vice versa (Oracle sucks
at casual access, and MySQL sucks at concurrent access).

And finally, returning to your argument, I fail to see the
actual benefit of using XML over a rigid SQL table structure
given the above ideas and experiences of mine. I still stand by
design decision:

> > Im my personal opinion User should be replaced by a class that
> > is not derived from Session or any Session superclass. It should
> > not use the Session storage mechanism. There should be a User
> > class that uses a structured table for preferences, and that
> > allows horizontal and vertical queries,

If you see this as:

> I feel this too rigid

then there should be a mechanism in that class that allows you
to change the table structure with a single call. Also, you'd
need at least a command line tool if not a set of methods in the
class that allow you to export and import data from that User
table or tables to and from XML.

The second one does not belong into PHPLIB at all. Instead,
someone needs to marry mysqldump to expat and add XML export and
import to that tool (it already can to CSV and SQL). Also, you
need a complete oracledump (there exists a perl script that
emulates basic mysqldump behaviour for Oracle, but it is very
limited, or at least it was last time I looked).

> > and a User subclass for
> > nobody users, which transparently hides the fact that for
> > nobodies the User data is being kept in a reserved session
> > variable.
>
> I second this, and an 'auto-somebodify' (pardon me that) method when a
> nobody becomes a somebody.

Yes.

Kristian

-- 
Abbestellen mit Mail an:   phplib-unsubscribe <email protected>
Kommandoliste mit Mail an: phplib-help <email protected>