Re: Architecture Overview

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 27 Aug 2008 01:23:17 +1200

Alex Rousskov wrote:
> On Tue, 2008-08-26 at 12:15 +1200, Amos Jeffries wrote:
>>> On Tue, 2008-08-26 at 02:23 +1200, Amos Jeffries wrote:
>>>> Okay, got the pretty-picture drawn up.
>>>>
>>>> NP: this is drawn up as a high-level flow from my accumulated view of
>>>> all our work to date and where its heading. That includes Adrian's
>>>> Squid-2 work and where I see it most efficiently mapping into Squid-3.
>>>>
>>>> It should be very similar to how squid currently works. With a few major
>>>> differences that we have all spoken and planned things around already.
>>> Thank you for working on this picture. I am not quite sure I interpret
>>> it correctly, but I do see a few distinct objects there: Data Pump, HTTP
>>> Parser, and Store. This is more or less clear, at least at this high
>>> level.
>>>
>>> It is not clear to me whether the other blobs such as Protocol
>>> Processing and Protocol Handling are flows, objects, or something else.
>>> I am also not sure whether the arrows represent passing message data,
>>> passing processing responsibility, or something else. Do different
>>> colors and blob shapes mean something?
>>>
>>> If we want an architecture picture, I think it would be great if we can
>>> formulate it in terms of objects and flows among them. This should make
>>> roles and boundaries much more clear.
>> Okay. The clouds are where I'm uncertain of the distinct content not
>> knowing everything about Squid yet.
>>
>> - The protocol processing cloud is modules such as FTP, Gopher, HTTP,
>> HTTPS?. Each being separate, but performing a 1-1 relationship with the
>> request. A flow handled by a protocol 'manager' object.
>
> Is HTTP protocol processing module a single class implementing both
> client- and server-side processing?

Good question. I think it's probably best not to at this point. Though
with the overall design it does not really matter whether they are one
class or separate but communicating modules.

>
>> Forwarding Logic looks at the request just enough to decide where to shove
>> it and passes it on to one of these.
>
> Does it stay in the loop to, say, try another forwarding path?

No. If another path is needed, the responsible module needs to
explicitly pass it back into forwarding logic, with whatever new state
the FL might need to deal with it properly (an error-page result being
one such case).

Same goes for any module handing off responsibility to a non-specific
destination.
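To illustrate the hand-back (a rough sketch only; every name here is hypothetical, not real Squid code): a module that fails its path re-enters forwarding logic carrying the new state the FL needs, here just the list of paths already tried.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: forwarding logic picks a path and hands off; on
// failure the module explicitly passes the request back into forwarding
// logic, with the new state (paths already tried) the FL needs.
struct RequestState {
    std::string url;
    std::vector<std::string> triedPaths; // state the FL needs on re-entry
};

static bool tryPath(const std::string &path, RequestState &) {
    return path == "origin"; // pretend only the origin path succeeds
}

static bool forward(RequestState &req) {
    for (const std::string path : {"cache", "peer", "origin"}) {
        bool seen = false;
        for (const auto &t : req.triedPaths)
            if (t == path)
                seen = true;
        if (seen)
            continue; // FL skips paths already attempted
        req.triedPaths.push_back(path);
        if (tryPath(path, req))
            return true;
        // module failed: explicitly pass responsibility back into the FL
        return forward(req);
    }
    return false; // all paths exhausted: error page
}
```

The point of the sketch is only that the FL does not stay in the loop; it is re-entered explicitly, with state, by whichever module gave up.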

> Does
> forwarding logic know about caching?

Only as the cache is one possible path to completion.

>
>> - The second cloud is the ACL handling, redirectors. We came up with some
>> ideas at AusMeet that make all that a single object flow manager.
>> Efficiency of that still needs to be checked.
>
> Will those ideas be documented/discussed? Or is the current plan to test
> performance first?

Yes, eventually. I'm looking for time to write up a new feature page.

>
>> - Arrows are callback/responsibility flow.
>
> Callbacks and "responsibility" can be rather different things and can
> "flow" in different directions. Perhaps the arrows can be removed (for
> now) to avoid false implications?

True, but in Squid at present, responsibility for operations on state
flows down through the callbacks.

>
>>>> 1) Data Pump side-band. This is a processing-free 'pump' for any
>>>> requests which do not actually need to go through Squid's twisted
>>>> logics. With all logics compiled out it should be equivalent to a NOP.
>>>> But can be used by things such as adaptation components as an IO
>>>> abstraction.
>>> What data does the Data Pump pumps? Message bodies? What are the valid
>>> ends of a pump? Can there be many Pumps per HTTP transaction? Does the
>>> Pump communicate any metadata to the other side?
>> Data pump moves bytes, from A to B. IO level provides all the hooks for it
>> to do so. A and B could be sockets, buffers, pipes, handles, whatever gets
>> micro-designed.
>
> A pipe moves something from A to B. Is Data Pump a pipe? Pipes connect
> two ends. Pumps have a single end that produces/generates/provides
> something. You can put something into a pipe and get it on the other
> end. You can only get something from a pump.
>
> If Data Pump is a pipe, please note that the current pipes are slaves
> (they are being told what to do). Are you proposing active pipes that
> use some kind of unified I/O APIs to suck data from one end and push it
> into the other?

Contrary to Adrian's latest pump statements, I'm still envisaging the
data pump as one-way: from source to sink, whatever those may be.
Yes, my vision of it is a slave: told where the source/sink/buffer is
and left at it to completion.

This lends itself to the HTTP model: one pump reads the headers into a
buffer and passes that in; then whatever logic handles the headers
asks a pump to read the body from source to a given sink (cache object,
adaptation buffer, or the client's TCP socket, for three likely examples).

>
> Does Data Pump/Pipe store/buffer the bytes to give the other end a
> chance to get ready for consumption?

I hope not; its sinks would ideally be straight into some type of
socket, but buffers need to be accommodated.

>
>> As for many pumps per transaction: Ideally 1 (zero-copy), realistically 2
>> (client-side, and server-side).
>
> I do not understand how a transaction can have one pump or even one pipe
> (unless the pipe is bi-directional). Is Data Pump a bidirectional pipe
> that can shovel bytes in both directions?

Purely stateless. Uni-directional, but ambiguous as to which direction
that is.

The core function of the pump would be to handle non-adapted tunnel
traffic or request bodies, which may be very large amounts of bytes
moved from socket A to socket B with nothing but size accounting or
speed delays in between.

Most modules really should be acting on a buffer pre-filled by a pump
somewhere, and passed without copying (excepting the adapters, of
course) as part of the request state.
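Roughly what I have in mind, as a sketch (all names hypothetical, none of this is existing Squid code): a stateless, uni-directional pump is handed a source and a sink once, then left to shovel bytes to completion, doing nothing but size accounting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of a one-way, stateless data pump: a slave told
// where the source and sink are, then left to run to completion.
// A Source returns bytes read into buf, or 0 at end of data.
using Source = std::function<size_t(char *buf, size_t max)>;
using Sink = std::function<void(const char *buf, size_t len)>;

class DataPump {
public:
    DataPump(Source src, Sink dst) : src_(std::move(src)), dst_(std::move(dst)) {}

    // Move bytes from source to sink until EOF; no parsing, no logic,
    // nothing but size accounting. Returns total bytes pumped.
    size_t run() {
        char buf[4096];
        size_t total = 0;
        while (size_t n = src_(buf, sizeof(buf))) {
            dst_(buf, n);
            total += n;
        }
        return total;
    }

private:
    Source src_;
    Sink dst_;
};
```

With everything compiled out, a pump between two sockets is as close to a NOP pass-through as the IO layer allows; sources and sinks could equally be buffers, cache objects, or adaptation hooks.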

>
> I apologize for so many questions, but the picture does not really
> define these things and without knowing what the blobs are, how one can
> evaluate the Architecture or one's compliance with it?

No worries. That's why it's still only a proposal.

>
>> Content-adaptation may need more to pump
>> bytes out to the ICAP helper and back etc.
>
>>> If adaptation components can use Data Pump as an I/O abstraction, should
>>> not all other high-level components processing the transaction do the
>>> same so that high-level I/O code could be reused among all the
>>> components?
>> Yes. The exception being quick forwarding logic which may handle accept()
>> before bootstrapping it into a protocol manager or a 'tunnel' pump.
>>
>>> The NOP equivalence mentioned above confuses me. Do you mean that the
>>> pump does not copy data if it does not have to?
>> Yes. As close to zero-copy as reasonably possible.
>>
>>>> 2) Client Facing IO is unwound from all processing logics. It's simply a
>>>> raw input layer to accept connections and interface to the clients.
>>> The "external" side of the Client Facing IO blob is socket API and such,
>>> right? What is the Squid-side interface of the Client Facing IO blob? A
>>> collection of portable socket-level routines? Some kind of a Transaction
>>> object?
>> Something. I'm not going into implementation details. I'm thinking the TCP
>> listening sockets themselves.
>
> I am not asking about implementation details. I am asking about
> high-level interfaces of the blobs on the picture. Without that
> knowledge, it is difficult to understand how the blobs are connected and
> what they send to each other.

Okay. The two yellow bars for IO are what's left of the comm layer (and
SSL layer) after it's been slimmed down to simply handle the sockets
and set up initial state objects on accept(). Everything else, from byte
reads to byte writes, lies between them in one place or another
(read/write as part of the pump).

>
>>> Is limiting the number of accepted connections a "processing logic" or
>>> "Client Facing IO" logic?
>> Limiting accepted connections? Why would we want to do that?
>
> Because we are running out of resources and do not want to accept more
> responsibility until we deal with what we already have? But this is not
> critical at this point, there are much bigger questions so let's ignore
> this one.

Understood. For that it would be part of the client facing IO. Or
possibly the queue (or later thread) processing priority code.

>
>> delay_pools moves to a governor feature slowing the data pump. ACLs stay
>> as forwarding logic assists on an if-needed basis.
>>
>>
>>>> 3) Server Facing IO is likewise unwound from processing logics AND from
>>>> client IO logics. Though in reality it may share lowest level socket
>>>> code.
>>>>
>>>> 4) Processing components are distinct. Each is fully optional and causes
>>>> no run-time delay if not enabled.
>>> What decides which processing components are enabled for a given
>>> transaction? Do processing components interact with each other or a
>>> central "authority"? What is their input and output? Can you give a few
>>> examples of components?
>> Forwarding Logic. Or possibly an ACL/helper flow manager. How its coded
>> defines whats done. Presently there is quite a chain of processing.
>
>> We talked of a registry object which was squid.conf given a list and order
>> for ACL, redirectors, etc. That would make the detailed state fiddling
>> sit behind the single manager API.
>
> Many processing decisions are not static so I doubt a registry object
> driven by squid.conf can handle this. In fact, I suspect no single
> object can handle this complexity so the responsibility to enable
> processing components would have to be spread around processing
> components (and forwarder), which makes things like a "single pipe with
> bumps" design difficult to implement.

I think one of us misunderstands. Adrian explained that the current flow
of security processing in Squid was something like:
  cachable -> http_access ACLs -> FXFF ACLs -> http_access ACLs -> blah blah

Currently that is a fixed order of operations, most of which can be
turned off in squid.conf, but the code runs through it all anyway; on a
quick path, but still through it.
The aim here was to reduce those mega-blocks down to a minimal ordered
list of calls as determined by the user, giving them the added benefit
of knowing (and, if they liked, changing) the exact order of processing.
i.e., is url_rewrite done before the http_access check or after caching?
Is caching done before auth? Is icp_access done before or after
cache_peer_access?
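As a sketch of that registry idea (all names hypothetical, and this glosses over the async/helper state fiddling the real manager would hide): squid.conf supplies the list and order of steps, and the manager runs exactly those, in that order, rather than fast-pathing through every disabled block.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a squid.conf-driven processing registry: the
// admin lists the steps and their order; disabled steps are simply
// never in the chain, so they cost nothing at run time.
struct Request {
    std::string url;
    bool denied = false;
};
using Step = std::function<void(Request &)>;

class ProcessingRegistry {
public:
    void define(const std::string &name, Step s) { steps_[name] = std::move(s); }

    // Takes the admin-chosen order, e.g. {"http_access", "url_rewrite"}.
    void configure(const std::vector<std::string> &order) {
        chain_.clear();
        for (const auto &name : order) {
            auto it = steps_.find(name);
            if (it != steps_.end())
                chain_.push_back(it->second);
        }
    }

    void run(Request &req) const {
        for (const auto &step : chain_) {
            step(req);
            if (req.denied)
                return; // stop at the first denial
        }
    }

private:
    std::map<std::string, Step> steps_;
    std::vector<Step> chain_;
};
```

Whether url_rewrite runs before or after the http_access check then becomes purely a matter of which order the admin listed them in.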

>
>>> The text on the picture seems to imply that there can be only one
>>> Processing Component active for a given transaction, which worries me,
>>> but perhaps I just do not understand what kind of Components you are
>>> describing here.
>> The finer details may run in parallel within a module. But the high level
>> processing sequence for any single request needs to be linear (or at least
>> representable in a linear fashion) to be understandable.
>
> I am not sure I agree, but let's wait until there is a processing
> sequence on the picture.
>
>>>> 5) Stores are an optional extra, if the configuration calls for caching.
>>>> But not needed for basic operations.
>>> Is there a single global index of stored responses? If yes, is it
>>> enabled only when caching is enabled?
>> That would be an implementation details inside the Store module top left
>> of the picture.
>>
>> IMO there should be a global API for storage. Whether that API loops
>> through a single index or a set of per-Cache ones is a detail choice.
>
> Sure, but it is important to decide whether the global index (or
> equivalent store interface) exists when caching is disabled. Currently,
> you get an index whether Squid (or a given caching scheme) needs it or
> not. Do you propose that there is no such index?

With this architecture you could disable caching entirely to the point
of not being compiled in. It's irrelevant outside the store module. All
the other modules need to see is a buffer of data or its absence.

>
>>> Do you consider request merging a form of caching?
>> I consider it a flow design issue. If forwarding wants to take a request
>> and point it at an already filled buffer (from store, from live stream, or
>> from /dev/zero) thats it's business.
>
> How will it find a "filled buffer" to merge with if there is no index?

I think we are clashing. From what I've been told the current design of
squid depends on in-transit objects being in cache storage. This does
not hold true in my proposed architecture.
The ForwardingLogic may have an internal hash/cache/index of in-transit
URLs if it really needs to, but it won't involve the store. It's in-flow
data.

(NP: this model also applies to broadcast streams if we want to go that
way eventually).

>
>>>> If we all agree and work towards this type of model and things are kept
>>>> modular isolated to the highest levels. I don't see the future
>>>> integration of either squid branch or CacheBoy as being a big task.
>>> I think we would need a more detailed or precise architecture
>>> description to be able to "work towards it" or, more precisely, to
>>> identify code that does not satisfy the architectural constraints.
>>> Otherwise, everybody will be claiming to conform to the Architecture
>>> principles but there will be no improvement as far as merging Squid2 or
>>> external code into Squid3.
>> Agreed. That detailing is what we are starting now.
>> The two orange clouds need to be fleshed out into named components, then
>> on to slightly finer details.
>
> All current blobs need better description/definition, IMO. A few more
> blobs may need to be added. The next step would be to define the flows
> (i.e., which blob talks to which and what they send to each other).

Yes.

>
>>> BTW, the text descriptions you gave above appear much more useful than
>>> the picture itself. Perhaps we can define the main objects and flows
>>> better and then redraw the picture to match the descriptions? Should
>>> this go into a wiki?
>
>> If you can't find any flaws with that highest level flow design. We can
>> wiki the progress so far and start iterating down to API definitions and
>> TODO lists.
>
> It is too early to find flaws. I do not understand the current picture
> yet. What I am saying is that it may be easier to ignore the picture for
> now, define a few blobs, and then try to draw it again.
>
> It would also be nice to agree on how distant is the future that the
> picture should reflect. Are we drawing Squid 3.3? Squid 10? The
> Architecture picture would be quite different for those two examples...

Really? A good architecture (which I am aiming for here) would look the
same for both, with possibly different names or larger numbers of blobs,
and maybe finer detail the older it gets.

There's firstly a lot of work to get to anything like this end product.
Though we could achieve it by 3.2 if we all agreed and set out to do
just that.

Afterwards, a vastly larger array of possible 'pluggable' bits can be
integrated, as individual implementations of the blobs.

Amos

-- 
Please use Squid 2.7.STABLE4 or 3.0.STABLE8
Received on Tue Aug 26 2008 - 13:23:21 MDT