> > We are storing raw MIDAS files to S3 Object Storage, but MIDAS files are not
> > optimised for readout from this kind of storage. Is there any ongoing work on
> > evolving the midas raw output or, beyond a simulated POSIX fs, on developing a
> > midas python library optimised to stream data from S3 (it is not really clear to me
> > if this is possible)?
>
> We have plans for adding S3 object storage support to lazylogger, but have not gotten
> around to it yet.
>
> We do not plan to add this in mlogger. mlogger works well for writing data to locally-
> attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and
> delays when writing to anything network-attached (even writing to NFS).
>
> I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
> be stored as an S3 object and there will be some kind of directory object
> to map object ids to run and subrun numbers.
>
> Choice of best file size is open; normally we use subruns to limit file size to 1-2
> Gbytes. If cloud storage prefers some other object size, we can easily go up to 10
> Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).
>
> Other than that, in your view, what else is needed to optimize midas files for storage
> in the Amazon S3 cloud?
>
> P.S. For reading files from the cloud, code needs to be written and added to
> midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
> attached files and dcache/dccp-attached files. (CERN EOS files can be read directly
> from POSIX mount point /eos).
>
> K.O.
Thanks,
actually I made a small workaround with the python boto3 library that works with files of any
size (with the obvious limitation of having to wait for the whole download), e.g.:
import gzip
import boto3
from io import BytesIO

key = 'TMP/run00060.mid.gz'

# 'creds' is our own helper that returns an authenticated boto3 session
aws_session = creds.assumed_session("infncloud-iam")
s3 = aws_session.client('s3', endpoint_url="https://minio.cloud.infn.it/",
                        config=boto3.session.Config(signature_version='s3v4'),
                        verify=True)

# download the whole object and keep it in memory
s3_obj = s3.get_object(Bucket='cygno-data', Key=key)
buf = BytesIO(s3_obj["Body"].read())

for event in MidasSream(gzip.GzipFile(fileobj=buf)):
    if event.header.is_midas_internal_event():
        print("Saw a special event")
        continue
    bank_names = ", ".join(b.name for b in event.banks.values())
    print("Event # %s of type ID %s contains banks %s" % (event.header.serial_number,
                                                          event.header.event_id, bank_names))
....
where in MidasSream I just bypass the open. The code works, but obviously this way I need to
keep the whole buffer in memory, and it takes time to fetch it all. I was interested to
understand whether someone has already developed event-by-event streaming (preferably in
python, but not mandatory). I'll have a look at the code you point to.
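In the meantime, one variant I will try, which should avoid holding the whole buffer in
memory: gzip.GzipFile only needs sequential read() calls, so it should be possible to pass the
boto3 StreamingBody to it directly and let the data be fetched from S3 as the decompressor
asks for more bytes. This is only an untested sketch, and it assumes MidasSream reads its
file object strictly sequentially (no seeks):

import gzip

key = 'TMP/run00060.mid.gz'
s3_obj = s3.get_object(Bucket='cygno-data', Key=key)   # same client as above

# StreamingBody.read(n) returns at most n bytes, which is all that
# gzip.GzipFile needs, so no intermediate BytesIO is required
with gzip.GzipFile(fileobj=s3_obj["Body"]) as stream:
    for event in MidasSream(stream):
        if event.header.is_midas_internal_event():
            continue
        print("Event # %s" % event.header.serial_number)

If MidasSream ever needs to seek backwards this will not work, and ranged GET requests or a
temporary local file would be needed instead.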
Thanks, G.
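P.S. About the directory object mapping run and subrun numbers to object ids: on our side
even a small JSON manifest per run, stored next to the data objects, would be enough.
Something like this (purely my guess at a possible layout; key and file names are invented):

import json

# hypothetical manifest object: one per run, maps each subrun to the S3 key of its raw file
manifest = {
    "run": 60,
    "subruns": {
        "000": "TMP/run00060sub000.mid.gz",
        "001": "TMP/run00060sub001.mid.gz",
    },
}
s3.put_object(Bucket='cygno-data',
              Key='TMP/run00060.manifest.json',
              Body=json.dumps(manifest).encode())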