How Does The Object Storage Scale?
The Pod Is The Unit
Gunkan defines the pod as an autonomous set of services that can run an object store. It is composed of stateless services responsible for the logic (load balancing, lookups, listing), and of data-bound services responsible for the persistence of the data and the metadata.
A pod has limited scalability because of its index. That limit depends on the expected performance of the index (latency and throughput), on the kind of workload and its access pattern, and obviously on the implementation used for the index.
Scale With A Federation of Pods
The idea is to break the scalability limit by multiplying pods instead of increasing the complexity required to manage huge scales.
A pod bounds the failure domain. Setting such a boundary helps identify the datasets, users, and applications impacted by the failure of an element.
A Pod Stores and Organizes Data
A pod's first job is to persist raw data, to keep track of it, and to provide easy operations for data management.
A pod's second job is to provide the object view of the pod's data. This includes the versioning of objects, the association of application properties, the enforcement of lifecycle rules, etc.
Consul: Discovery, Health Checks
A Consul agent is deployed on each host. It is used for its catalog of services, its health-check mechanisms and, to a lesser extent, its DNS load-balancing capabilities.
From within a pod, you have full knowledge of the whole platform. The enforcement of location constraints, inherited from OpenIO SDS, allows the optimization of data placement.
What Is The Architecture Of The Object Storage?
Data & Metadata Management in a Pod
Our experience with OpenIO SDS taught us that any metadata can be expressed as a set of <Key,Value> pairs. What we need is the set of PUT / GET / DELETE operations, plus a LIST operation returning sorted keys.
Data management has very similar requirements: PUT / GET / DELETE are both necessary and sufficient to serve the data. An additional LIST operation is welcome to crawl the chunks without direct access to the volume, but it requires no sorting.
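These two contracts can be summed up in a short sketch. The class and method names below are illustrative, not Gunkan's actual interfaces; plain in-memory dicts stand in for the real persistence layers.

```python
class Index:
    """Metadata store: PUT / GET / DELETE plus a LIST of sorted keys."""

    def __init__(self):
        self._kv = {}

    def put(self, key: str, value: bytes) -> None:
        self._kv[key] = value

    def get(self, key: str) -> bytes:
        return self._kv[key]

    def delete(self, key: str) -> None:
        self._kv.pop(key, None)

    def list(self, marker: str = "", limit: int = 100) -> list:
        # Keys come back in lexicographic order, strictly after the marker.
        return sorted(k for k in self._kv if k > marker)[:limit]


class BlobStore:
    """Data store: same verbs, but LIST gives no ordering guarantee."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob_id: str, payload: bytes) -> None:
        self._blobs[blob_id] = payload

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def delete(self, blob_id: str) -> None:
        self._blobs.pop(blob_id, None)

    def list(self, limit: int = 100) -> list:
        # Unsorted crawl: enough to enumerate chunks without scanning the volume.
        return list(self._blobs)[:limit]
```

The only difference between the two contracts is the sorting guarantee of LIST, which is exactly what makes the index the expensive part.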
This is how the services that make up Gunkan are designed: "Index" services are in charge of the metadata, "Blob / Part" services are in charge of the data.
Architecture Big Picture
What Are The Core Components Of The Object Storage?
A blob-store ... stores BLOBs :). Together, the blob-stores make a shared-nothing sub-system. Each service serves a ReST-like API: PUT, GET and DELETE requests, plus a paginated iterator over the (unsorted) stored elements.
One implementation is currently available, directly inspired by the RAWX of OpenIO SDS. It still uses a local filesystem; the major difference lies in the naming of chunks. The final names are decided by the service upon each upload and then returned to the client. This allows much more compact naming conventions and a more efficient management of directory entries, and it is SMR-ready.
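The server-side naming can be sketched as follows. The scheme shown here (a fixed-width hexadecimal counter) is hypothetical and only illustrates the principle: the service, not the client, decides the final name at upload time and returns it.

```python
import itertools


class NamingBlobService:
    """Illustrative blob service that assigns chunk names at upload time."""

    def __init__(self):
        self._seq = itertools.count()
        self._store = {}  # stands in for the local filesystem

    def put(self, payload: bytes) -> str:
        # Compact, fixed-width, monotonically increasing name: short
        # directory entries, and names arrive in append order.
        name = format(next(self._seq), "016x")
        self._store[name] = payload
        return name  # the client learns the name from the reply

    def get(self, name: str) -> bytes:
        return self._store[name]
```

Because the service controls the sequence, names stay dense and ordered, which is what makes the convention both compact and friendly to append-oriented media.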
The Data Gateway serves a ReST-like API that is very similar to the Blob Store's. It applies a data-protection algorithm to the data (erasure coding, replication).
Upon an upload, the Data Gateway enforces a placement policy to load-balance the requests within the pod. Upon a download, the content is located using the indexes of the pod.
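The upload / download flow can be sketched like this. Everything here is a hypothetical stand-in: plain dicts replace the real blob stores and index, a random spread replaces the real placement policy, and plain replication stands in for erasure coding, which follows the same locate-then-fetch pattern.

```python
import hashlib
import random


class DataGateway:
    """Illustrative gateway: place replicas on upload, locate on download."""

    def __init__(self, blob_stores: dict, index: dict, replicas: int = 3):
        self.blob_stores = blob_stores  # store_id -> {chunk_id: payload}
        self.index = index              # object name -> (chunk_id, [store_id])
        self.replicas = replicas

    def upload(self, name: str, payload: bytes) -> None:
        # Placement policy: a random spread over distinct stores.
        count = min(self.replicas, len(self.blob_stores))
        targets = random.sample(sorted(self.blob_stores), count)
        chunk_id = hashlib.sha256(payload).hexdigest()
        for store_id in targets:
            self.blob_stores[store_id][chunk_id] = payload
        # Record the locations so downloads can be resolved via the index.
        self.index[name] = (chunk_id, targets)

    def download(self, name: str) -> bytes:
        chunk_id, targets = self.index[name]
        for store_id in targets:
            payload = self.blob_stores[store_id].get(chunk_id)
            if payload is not None:
                return payload  # first replica still holding the chunk
        raise IOError("no replica available")
```

The key point is the division of labour: the gateway stays stateless, and all the location knowledge lives in the pod's indexes.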
The Index Store is data-bound and serves a gRPC API proposing the PUT / GET / DELETE calls on <key,value> pairs. Both the key and the value are expected to be small. Most importantly, the Index Store also proposes a LIST operation over the lexicographically sorted sequence of keys.
One implementation is currently available, backed by a local RocksDB database.
The Index Gateway is stateless and serves exactly the same gRPC API as the Index Store. It manages the sharding and the replication of the data over all the index stores. The Index Gateway is not mandatory, but it is a great help in migration scenarios.
Connectors of the Object Storage? Accessibility Layer?
Simple and open, I like it. However, its adoption is not really widespread.
If I have time.
It is my favourite joke. No way, we won't support CDMI, ever. Too complicated, too bloated; in other words, the typical design by committee.