# Tracking The Project

## Presentation

Is it okay to use the ServiceX graphic ([this one](https://iris-hep.org/assets/images/ServiceXWorkflow.png)) in my presentation?

Notes:

- Want to talk about
  - What the final product actually is: a C++ program and library built to connect ServiceX experiment data delivery as a data source for a ROOT RDataFrame.
  - ServiceX
  - The use case for the program
  - Show the Docker demo image that anyone can pull themselves and try out
  - Improvements and possible enhancements to make to it

Structure:

- Start with the "why?" Why we set out to make this in the first place.

The program, named XDataFrame, is a standalone C++ library that creates an analysis-ready ROOT RDataFrame object from ServiceX experiment data fetched from CERN Open Data.

## Note

For building the Docker image:

```
docker build -t xdataframe-demo-image -f dev/Dockerfile .
```

## TODO

- [DONE] Have the code check for the status "Canceled" and handle it accordingly.
- [DONE] Have the running output show files completed/remaining using the curl GET.
- Implement a checksum for the cache files, verified before using them in the RDataFrame. Checksums are computed after downloading for the first time and stored to a file in the cache directory. Use a map from file names to each file's checksum.
  - Actually computing the checksums of the ROOT files would be a bit tougher than anticipated. Might just make a list of files instead?
- [DONE] Use files from the cache instead of redownloading every time.
  - Add an option to XDataFrame() to always redownload files.
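The checksum TODO above could be sketched roughly as follows. This is an illustrative, stdlib-only sketch, not the XDataFrame implementation: FNV-1a stands in for a real digest like MD5, and all function names (`Fnv1a`, `ChecksumFile`, `CacheEntryValid`) are made up for the example.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <map>
#include <string>

// FNV-1a 64-bit hash, standing in for a real digest such as MD5.
inline std::uint64_t Fnv1a(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ull;  // FNV prime
    }
    return h;
}

// Compute a checksum of a file's contents; a missing or empty file
// hashes the same as the empty string.
inline std::uint64_t ChecksumFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return Fnv1a(bytes);
}

// Verify a cached file against the checksum stored when it was first
// downloaded; a mismatch means the entry should be re-downloaded before
// handing it to the RDataFrame.
inline bool CacheEntryValid(const std::map<std::string, std::uint64_t>& sums,
                            const std::string& path) {
    auto it = sums.find(path);
    return it != sums.end() && it->second == ChecksumFile(path);
}
```

The map itself would be serialized to a file in the cache directory, as the TODO suggests.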
```
curl -X GET https://cmsopendata.servicex.ssl-hep.org/servicex/transformation/<request_id>/status
```

which returns JSON like

```json
{
  "files-processed": 22,
  "files-remaining": 0,
  "files-skipped": 0,
  "finish-time": "2021-10-05T20:49:11.403295Z",
  "request-id": "969972dd-25f5-4bca-ae4d-88e8414d6032",
  "stats": {
    "avg-rate": 0.0,
    "avg-time": 24.545454545454547,
    "max-time": 32,
    "min-time": 15,
    "total-bytes": 3173862,
    "total-events": 0,
    "total-messages": 0,
    "total-time": 540
  },
  "status": "Canceled",
  "submit-time": "2021-10-05T20:29:17.033171Z"
}
```

## 2021-8-6

Meeting - code review? Improvements and design notes among the comments.

## 2021-7-30

Code review and Gordon install.

- First do a visual code review before trying to install.
- I've tried to make the CMakeLists.txt as good as possible; it works well on Linux, but Windows might be a different story.
- Try your best to get it working on Windows, even if you have to make changes to the CMakeLists.txt; we can use that for cross-platform compatibility.

----

For me: ran into a memory error, "double free or corruption (out)", after the job completes. I'll have to bug-hunt that.

## 2021-7-27

Short meeting.

* Installation methods for a user, so they are able to incorporate this project into their own projects.
  * Global installation + `find_package(XDataFrame)`
  * Git submodule?
  * `ExternalProject_Add(... GIT_REPOSITORY=<XDATAFRAME-repo>)`
* Important to remark that there are a handful of dependencies that this project requires, including:
  * AWS C++ SDK (core, dynamodb, s3)
  * ryml
  * c4core
  * jsoncpp
  * Boost
    * system, filesystem, unit_test_framework
  * and obviously ROOT
* Most of these can be downloaded from the package manager vcpkg, but building the AWS SDK from source provides the cleanest and most hassle-free experience.
* Some info on package managers in particle physics can be found [here](https://hepsoftwarefoundation.org/workinggroups/toolsandpackaging.html).
* Send a message when the code is ready to be reviewed (the C++ code).
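The status-polling flow (GET the transformation status until it is "Complete", bailing out on "Canceled") could be sketched like this. A real implementation would parse the JSON with jsoncpp; a naive string scan is used here just to show the control flow, and both function names are made up for the example.

```cpp
#include <stdexcept>
#include <string>

// Pull a string-valued field out of a JSON blob like the status response
// above. Naive: assumes the compact "key":"value" layout with no spaces.
inline std::string ExtractStringField(const std::string& json,
                                      const std::string& key) {
    const std::string needle = "\"" + key + "\":\"";
    auto pos = json.find(needle);
    if (pos == std::string::npos) return "";
    pos += needle.size();
    auto end = json.find('"', pos);
    return json.substr(pos, end - pos);
}

// Decide whether the polling loop should stop: the transform either
// finished, or was canceled (which the code now has to handle too).
inline bool TransformFinished(const std::string& statusJson) {
    const std::string status = ExtractStringField(statusJson, "status");
    if (status == "Canceled")
        throw std::runtime_error("ServiceX transform was canceled");
    return status == "Complete";
}
```

The caller would sleep between polls and, per the notes above, report files-processed/files-remaining as feedback rather than imposing a timeout.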
## 2021-7-22

### Features

* Feature idea: input string sanitization?
  * Currently the cache works by hashing exact strings, so if one space is different it makes a new request.
  * In Python we have exactly this feature, but not for spaces. You are a bit saved here: normally a human won't generate these strings, a computer will. But they could use a different name for something (like the `lambda` argument) which has no real meaning but would cause the hash to "fail". So sanitization is a good idea - we can add that as a "todo" item - a GitHub issue?
* List in the log/output how many files are completed/remaining while waiting on a job to finish.
  * This is perfect; then someone can look here to see why the job is taking so long.

New completed features:

* Wait until job status is "Complete" before attempting to download minio files.
* Demo with a request string, an applied filter, and a histogram drawn and saved to a PDF file.

### Remarks

* How long should the program wait for job completion? Sometimes they hang.
  * This may not need to be handled by the library and could just be handled by the user, since they can see the status of their job on the ServiceX webpage.
  * Gordon: I've seen jobs take several days when there is something off about I/O. I've seen the same job take 20 minutes. So, I agree - there should be no timeout. But there does need to be some feedback, perhaps? What is a good way to do this in C++?

### Tests

Some ideas for tests and possible points of failure:

* Want to test how the program handles incorrect or invalid request strings.
* Corrupted downloaded ROOT files.
* Verify the ROOT files' tree names give the expected tree. Make sure the program finds the ROOT file's tree name successfully.
* Make sure that data is not corrupted after an unexpected crash or internet outage.
* If the RDataFrame object is malformed or isn't created successfully.
  * Do we want to make sure it always returns an RDataFrame, even if it's an empty one or missing pieces that were requested?
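The whitespace half of the sanitization idea above could be sketched as below: collapse runs of whitespace and trim the ends before hashing, so two queries that differ only in spacing land on the same cache entry. Renamed `lambda` arguments would still defeat this; handling those needs a real qastle-level rewrite. The function name is made up for the example.

```cpp
#include <cctype>
#include <string>

// Normalize a query string before hashing it for the cache: runs of
// whitespace (spaces, tabs, newlines) become a single space, and leading
// and trailing whitespace is dropped.
inline std::string NormalizeQuery(const std::string& raw) {
    std::string out;
    bool inSpace = false;
    for (unsigned char c : raw) {
        if (std::isspace(c)) {
            inSpace = true;
            continue;
        }
        if (inSpace && !out.empty()) out.push_back(' ');
        inSpace = false;
        out.push_back(static_cast<char>(c));
    }
    return out;
}
```

The cache would then hash `NormalizeQuery(query)` instead of the raw string.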
## 2021-7-20

* Finally posted to the ROOT forum [about how to best package this](https://root-forum.cern.ch/t/packing-libraries-for-use-in-and-alongside-root/45932).

---

- New features I'll have to do:
  - When running the XDataFrame("...") function, it needs to wait until files are available to start downloading them. It currently tries to download files that aren't there, which results in a seg fault.
    - Feature: wait until job status is "Complete" before attempting download - DONE
- The CMake library works for now. Will be checking up on the ROOT forum post for responses, in case there are better ways of doing it.
  - CMake adds a library called "XDataFrameLib" with all the cpp and h files it needs. Then it makes the executable "Demo" and links the XDataFrameLib library to the Demo executable. This currently works.
- Demo with creating an RDataFrame and applying a filter - DONE

## 2021-7-14

OpenSSL 1.1.1, if a downgrade is needed:

```
wget https://www.openssl.org/source/openssl-1.1.1f.tar.gz
tar xzvf openssl-1.1.1f.tar.gz
cd openssl-1.1.1f/
./config -Wl,--enable-new-dtags,-rpath,'$(LIBRPATH)'
make -j4
sudo make install
openssl version
```

The version should be 1.1.1f.

---

For the aws-sdk, the new instructions to build and install the SDK are:

1. Ensure your OpenSSL version is 1.1.1, the LTS stable one. Do not use the latest 3.0.0; there are many deprecation errors.
2. Make a directory for it, like `mkdir aws-sdk-cpp`, and cd into it.
3. Clone the repo and make a build directory:

```
git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
mkdir build
cd build
cmake ../aws-sdk-cpp/ -DBUILD_ONLY="s3;dynamodb" -DMINIMIZE_SIZE=ON -DENABLE_TESTING=OFF
make -j8
sudo make install
```

The CMakeLists.txt should find it, since it will always install to the same place and the dir will be `/usr/local/lib/cmake/AWSSDK`. After this prerequisite is installed, you can build the project.
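The "wait until files are available" fix described in the 2021-7-20 notes could be sketched as a simple polling guard run before any download starts, instead of grabbing at files that are not there yet (the old seg-fault path). The predicate, function name, and retry policy are all illustrative assumptions, not the XDataFrame API.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll a caller-supplied readiness check (e.g. "job status is Complete and
// the bucket lists at least one object") before starting downloads.
// Returns false if the files never became available within maxAttempts.
inline bool WaitForFiles(const std::function<bool()>& filesReady,
                         int maxAttempts,
                         std::chrono::milliseconds interval) {
    for (int i = 0; i < maxAttempts; ++i) {
        if (filesReady()) return true;
        std::this_thread::sleep_for(interval);
    }
    return false;
}
```

Per the 2021-7-22 discussion, a real version would likely have no hard attempt limit and instead print files completed/remaining as feedback on each poll.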
## 2021-7-13

* Make an option to take a raw string.
* Fix bug - the RDataFrameHandler object needs to create the RDataFrame with a pointer, not on the stack.
  * No need to cache it, and it's not clear you still need the Display method there either. Just have it cache the list of files or similar.
* Demo: write it compactly.
  * Gets data using the query.
  * Makes a really simple selection on the returned pt's (say > 50, or 5000; I can't remember what the data looks like).
  * Creates and fills a histogram of the pt.
  * All this in RDF: show the full lifecycle of operations a user would want to do.
* Make an option to redownload files or keep the current ones.
* Come up with some test designs, not code yet.
* Project: ask on the ROOT forums what the best way is to implement a package that is easy to use with ROOT. ACTION: Gordon

Code link: https://github.com/decheine/XDataFrame/tree/v0.1-alpha

* The program creates an RDataFrame object from the files, and operations on the RDF work (mostly) as expected.
  * The issue I'm seeing is that a class cannot store a plain `ROOT::RDataFrame` object, but can store a `ROOT::RDataFrame*` pointer to one. But there have been some unexpected issues with seg faults when trying to access the pointer.
  * Also, when trying to do a `->Print()`, it will get through most of the entries but stop with a `malloc`, something-malformed error. This error seems to have stopped and I can't recreate it anymore. I'll be on the lookout for it.
* Want to write the CMakeLists.txt to make the library installable, so that it can be fetched in another program with something like `find_package(XDataFrame)`.

## 2021-7-8

Meeting, going over the points below in 7-6.

- An RDataFrame can be created from downloaded minio files.
  - For now, it waits until they are all there to create the RDataFrame.
  - This is fine to start out with - we'd love to change it so that it can start processing earlier, of course.
- Make a ROOT forum post about constructing a "dynamic" RDataSource with files loaded over time?
  - That is probably the best place to get this sort of information, so go for it.
- Make the demo its own executable - part of the library that users have the option of running, like an example.
- Should the RDataFrame be held as a direct object or a pointer? Poke around the tutorials to see best practice.
- While files are downloading, download them to a file with a '.temp' on the end, and rename it once it's fully downloaded, to avoid corruption.

Get code to Gordon soon-ish so that he can start evaluating the overall approach - we don't want to wait till the very end to do this, or more code might have to get thrown out.

## 2021-7-6

### Dev Notes

Using the Boost filesystem library for filesystem operations. The ROOT files will be stored in tmp folders, with each folder named by the hash of its `submit_request.json`.

* `MCache` class for handling the caching.
  * Builds the cache as part of the constructor.
  * Reads the filesystem to construct the hash map.
* The cache will live in `/tmp/XDataFrame`.
  * There will be subfolders titled with the hash of the submit.json.
  * Folders will contain a file with the request_id as well as the ROOT files.
* What about jobs that have the same submit_request.json but different request_id's? Like if the same submit_request.json was used multiple times.
  * Check for this: if the hash already exists but the request_id isn't found in the cache, add that job to the cache too. So one hash may have multiple request_id's. Maybe.
  * The same submit query on the same data files should always yield the same result. Thus, if someone resubmits a request that has already been run, the cached result should be returned.

### List of features to do

Implementing the RDataFrame part and bringing things together.
Need to make:

* Hashing of the submission json - DONE
* Saving of the hashed submit json - DONE
* Download files corresponding to objects in the request_id bucket - DONE
* Skip re-downloading ROOT files:
  * Check if the objectKey exists in the directory.
  * Could move ServiceXHandler::GetMinIOData() to Request::GetMinIOData() to be able to access the cache it's in too.
  * Split GetMinIOData(bucketName) into that and another function, GetMinIOObjects(bucketName). First get the buckets, then check if the files exist. If the "overwrite" option is on, continue and overwrite; otherwise do not download. If the files do not exist, download them.
* When downloading a file, download to a temp name, then rename once the file has been fully downloaded.
* Construct the RDataFrame from the ROOT files:
  * Create the RDataFrame.
  * x Write the RDataFrame to a ROOT file.
    * Since this will be run as a library, it is up to the person that calls into the library to do this (if they want).
  * x With the constructed RDataFrame, open a ROOT app at the end of the program.

Less important but still useful:

* Make a library to hold all the helper functions like StrToJson.

## 2021-7-2

Meeting topics:

* Class structure design.
  * I currently have some hacky solutions for various things like the following. All of them work.
    * Submitting a ServiceX request and returning the json.
    * Doing a GET request to get the status of a transformation with a request_id.
    * Parsing a yaml file, specifically the servicex.yaml.
    * A function that takes a string and returns a json object from that string.
    * Saving a json file to disk.
    * Starting a ROOT session, opening a TBrowser, and bringing up the ROOT prompt.
    * Getting a list of objects in a bucket labeled by request_id.
    * Fetching and saving those ROOT files to disk.
* Persistent user data/request data.
  * Will definitely go the route of saving the request files as <request_id>.json.
  * Idea for the structure: in the program's `bin` directory, have a `user` directory containing a requests folder, which contains a folder for each request titled by its id, each holding the request_id.json file for it. Visually, it looks like the tree structure below.
  * Need to handle the failures when fetching the data.
* Possible cache lookup workflow (how it currently works):
  1. Build the json query out of qastle, etc.
  2. Hash the json string, using a stable hash like md5.
  3. Look for a hash entry on disk; if there, get the request-id from the lookup.
     * Now use the request-id as normal.
  4. If no hash entry, submit a transform and get back a request-id.
  5. Write the hash and request-id to disk.
     * Now use the request-id as normal.

```
πŸ“¦bin
┣ πŸ“‚userdata
┃ β”— πŸ“‚requests
┃ ┃ ┣ πŸ“‚139f38de-6a17-4de7-9605-80e1c72abe39
┃ ┃ ┃ β”— πŸ“œ139f38de-6a17-4de7-9605-80e1c72abe39.json
┃ ┃ β”— πŸ“‚345974d4-d2ec-49bb-bef2-6683b7e461d5
┃ ┃ ┃ β”— πŸ“œ345974d4-d2ec-49bb-bef2-6683b7e461d5.json
β”— πŸ“œXDataFrame
```

* The C++ standard can be C++17, since we don't need to worry about older systems and ROOT is compatible with C++17.
* With these things in place, our current objective is to:
  * Organize these functional pieces in the project - encapsulation and such.
  * Parse the ROOT file names better; they are currently really long strings like the one below. The characters up until the last colon are the same for all objects in the bucket.

```
root:::eospublic.cern.ch::eos:opendata:cms:MonteCarlo2011:Summer11LegDR:SMHiggsToZZTo4L_M-125_7TeV-powheg15-JHUgenV3-pythia6:AODSIM:PU_S13_START53_LV6-v1:20000:F8DC5130-4E92-E411-BDCA-E0CB4E29C4BB.root
```

* They are readable by ROOT but ugly; perhaps we want to interpret the name into properties of the file or bucket?
  * On Windows, renaming has to be handled, since colons can't appear in file names. Sanitize file names.
  * Filename limits: CMS has really long filenames that can exceed the 256-character limit.

### Notes

...
[RDataSource docs](https://root.cern/doc/master/classROOT_1_1RDF_1_1RDataSource.html)
[code example](https://root.cern/doc/master/RRootDS_8cxx_source.html)

Sanitize names always. Get the RDataSource in place.

Cache storing: via a config file, the user can specify where the cache goes; it defaults to /tmp. This allows the user to target a scratch disk to write to if they want.

Build the request json, removing what's unnecessary. Take its hash (md5) to get the request_id; if the hash exists, you can use the guid. The program first checks the cache to see if the request was already made; if not, it sends it; if it exists in the cache, it gets the data from the cache.

Next steps:

* Last piece: load ROOT files into the RDataFrame on the fly.
* Flowchart of how everything is put together, starting from where the user gives the request to where the ROOT files are being fed to the RDataFrame.
  * Flowchart with 2 cases: the first time, and when it is run a second time with the same query.
* On dynamic TChains/RDataSource, ask the ROOT experts.

## 2021-7-1

My thoughts, among points I want to go over today:

* Design and planning of class structure and libraries.
  * I feel as though incorporating libraries is the best way to approach managing users and keeping track of requests.
  * I'm comfortable with this - but we need to make sure every library we include is solid and really required. ROOT, and its ilk, are more than 30 years old and will be around another 30 years.
* [Lucidchart for flowchart mapping](https://lucid.app/lucidchart/invitations/accept/inv_269e0fe3-872d-4cca-8668-aac78a843cfb). Not much there yet.
  * Looks good. Once you've got a hacky solution that can feed data through, we can then really talk about design - because then you'll have all the moving pieces in place - and it will be a question of arranging them.
* Saving user data to disk so that it can be read later.
  * This could be done with ROOT files? It might be worthwhile to incorporate ROOT's data structures like TTrees, TChains, maybe TFolders.
    * User data all stored in a ROOT file with a TTree, the TBranches holding things like requests.
      * One Request branch holds more branches, each corresponding to a request.
    * Or it could just save the json that gets returned by ServiceX when submitting, as a json file in an organized directory. Save it as <request_id>.json.
      * This option is likely the best and easiest.
    * This is how the Python version of it works:
      1. Store the request and the request id linked in a small series of files.
      2. Create a directory that is named by the request id, and store the data in there as it is streamed off the internet.
      3. Failures require special handling: you have to fetch the errors, and you have to remove the initial request-to-request-id linkage so ServiceX is re-queried when the user tries again.
* Testing. Using CMake's testing framework / CTest.
  * Have set up a simple, stub test file.
  * Need to figure out libraries before going further on this.
* C++ standard? Currently uses C++11, but there is a [filesystem library](https://en.cppreference.com/w/cpp/filesystem/create_directory) in C++17 that could be very useful. However, 17 is more to expect from a possible legacy system that could use this. Is this something to be concerned about, or is it fine to use C++17?
  * Could also use Boost for this. Boost has a lot of really nice utilities.
  * What are you referring to here with "currently uses" - your code? Or ROOT code? I think the only thing we need to conform to is the ROOT standard.
    * The CMake project configures the C++ standard as 11. But if we can do 17, that would provide more features, for instance the `filesystem` library, which can create directories.

## 2021-6-29

> Solution to the AWS SDK breaking curl: make sure the package libcurl4-openssl-dev is installed, and run `apt autoremove` too. It all works again now.

Code design notes:

Need to cache data like guid's. This will be done in the program's "bin" directory in its own folder.
* The `User` class should be able to:
  * Set and get endpoints, tokens, and types of ServiceX requests.
* Program state:
  * Start program.
  * Check if old state exists.
    * If yes: load data into program variables.
  * Get new user data and append it to the previous data if it existed.
  * Save current state.
* Need a `Request` object that contains endpoints, tokens, types, and pointers to the cached data.
  * To make a ServiceX request, it must create a Request object which, as part of its construction, does the curl request and such.

This is the curl request to submit a ServiceX job:

```
curl -X POST https://cmsopendata.servicex.ssl-hep.org/servicex/transformation -H "Content-Type: application/json" -d @../submit_request.json
```

----

The AWS SDK is definitely the way to go for this; after doing research into the options, there don't seem to be any viable alternatives.

- ROOT file download from a minio bucket is working!

> TODO: Implement encapsulation for data retrieval; allow iterating over the objects in a bucket.

- For S3 operations and code, see [the GitHub repo](https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/cpp/example_code/s3) for lots of good examples. What I've written so far is loosely based on examples there. Also, a snippet [here](https://www.programmersought.com/article/2070542330/) was very good.
- Once encapsulation is done, make tests for ServiceX query data.
  - Make a test ServiceX query and verify the integrity of the retrieved files (valid ROOT files, readable, nonzero data, ...).

### Design Remarks

- Do we want to have a GUI for it? The options are to use a GUI, or a config file that is read when the program starts and does what the config file says to do.
- A GUI would be the most user-friendly, and ROOT has a GUIBuilder that I've used before, so creating one wouldn't be too difficult.

## 2021-6-28

AWS SDK building.
Warning, jank solution: when building the project with cmake, the build would fail with a message like

```
[build] ninja: error: '/home/nick/IRIS-HEP/vcpkg/packages/aws-sdk-cpp_x64-linux/lib/libz.a', needed by 'XDataFrame', missing and no known rule to make it
```

So it needs the libraries in that directory. What I did was go to `/usr/lib/x86_64-linux-gnu` and copy the various libraries it needed with the `.a` extension. After putting them in that lib folder, the cmake build worked! This is clearly something to do with libraries not being loaded right, but for now it works.

----

Package for C++ to access the minio filesystem: Apache Arrow? See [here](https://arrow.apache.org/install/). Using vcpkg for package installations because it makes it easy.

The above is turning out to be a headache; it has so many dependencies. What I'm doing instead is following [this](https://docs.min.io/docs/minio-client-quickstart-guide#everyday-use) to use `mc` to manage minio. Doing

```
./mc share download minio/345974d4-d2ec-49bb-bef2-6683b7e461d5
```

creates URLs for downloading each of the files. But in order to get things from those URLs, it needs AWS4-HMAC-SHA256. So trying out the AWS SDK. [See here for an overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingAWSSDK.html).

Building the SDK failed, something to do with OpenSSL. I could look more into it, but first I want to explore other options. It looks like there aren't many, though.

https://docs.min.io/docs/s3cmd-with-minio.html

## 2021-6-24

- Code can now make a GET request and store the returned data to a variable.
- Stores the string into a Json object. It can be referenced with standard string indexing, like `jsonData["request_id"]`, and it will return the corresponding string.

---

List of external packages and libraries. Need to take care with CMake and decide how I want to incorporate them. For others to develop, they would also need to install them - an issue I can tackle later. Instead, add them as git submodules:
- [rapid yaml](https://github.com/biojppm/rapidyaml)
- [jsoncpp](https://github.com/open-source-parsers/jsoncpp)

Successfully parsed the endpoint, token, and api type from the servicex.yaml file. Setting the variables on the User class.

Might want to refactor the design of the project and split things up into more, smaller files. Want to get it functional first.

### Development notes

Example:

```yaml
foo: 1
bar: [2, 3]
sub:
  - sub2: sometext
```

To target the node that's a child, like `sub2`, do

```
tree["sub"].child(0)[0]
```

## 2021-06-22

### Program For Submission and Data

ServiceX is a stateless web API (RESTful service).

1. Submit the request.
   1. The contents of _selection_ come from the user, as does the _did_.
   2. Send the query to that web address (like the CMS open data one), as you do with curl.
   3. The return should be JSON, which contains the GUID for this request.
   4. This GUID needs to be cached somewhere in the code.
2. Getting the data.
   1. You need to poll the service to find out what the status is. Specifically, you are looking for failure.
   2. You need to poll the minio _bucket_ to find out if there are any files available for download (they are put there as they are completed by ServiceX).
      * The minio data model is simple: there are _buckets_, and inside buckets there are _files_. The files are binary blobs - minio doesn't care what the contents are. minio is basically a key-store database.
      * Minio should be API compatible with Amazon's S3 object store (at least, that is its design), but minio is open source and can be run on anything.
   3. When a minio file becomes available, it can be downloaded and sent through the RDataSource (or one could wait until all files are downloaded before streaming data - not sure how the RDF Source works).

How do you do these operations? I can point you to code that currently does them in Python - it works - so one option is to translate.
* ServiceX Communications
  * [Submitting a Query](https://github.com/ssl-hep/ServiceX_frontend/blob/316ee66d04774b1e65d7c5f9a55414d4445645f9/servicex/servicex_adaptor.py#L51) to ServiceX (this matches your curl command and returns the GUID).
  * [Query the status of the transform](https://github.com/ssl-hep/ServiceX_frontend/blob/316ee66d04774b1e65d7c5f9a55414d4445645f9/servicex/servicex_adaptor.py#L138). This includes code that looks for errors, and it comes back with lots of data about the transform.
* Minio
  * Usage in C++ is going to be different!
  * Step 1: find a minio access package in C++!
  * All of our minio access code can be found in [this one file](https://github.com/ssl-hep/ServiceX_frontend/blob/316ee66d04774b1e65d7c5f9a55414d4445645f9/servicex/minio_adaptor.py) (so you can see all the features you need).

curl request for transformation status:

```
curl -X GET https://cmsopendata.servicex.ssl-hep.org/servicex/transformation/{request_id}
# or
curl -X GET https://cmsopendata.servicex.ssl-hep.org/servicex/transformation/345974d4-d2ec-49bb-bef2-6683b7e461d5
```

Latest test request_id:

```
{"request_id": "345974d4-d2ec-49bb-bef2-6683b7e461d5"}
```

curl request for transformation submit:

```
curl -X POST https://cmsopendata.servicex.ssl-hep.org/servicex/transformation -H "Content-Type: application/json" -d @../submit_request.json
```

The ID output after a job submission will be used to retrieve that job's data.
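Tying the curl commands above to the C++ side, the endpoint strings could be composed like this. The base URL is the one from the notes; the function names are made up for illustration.

```cpp
#include <string>

// Base endpoint for submitting a transformation (the POST target above).
inline std::string TransformationUrl(const std::string& base) {
    return base + "/servicex/transformation";
}

// Per-request endpoint for querying a transformation's status
// (the GET target above).
inline std::string TransformationStatusUrl(const std::string& base,
                                           const std::string& requestId) {
    return TransformationUrl(base) + "/" + requestId;
}
```

Keeping the paths in one place like this avoids the stray double-slash that crept into one of the earlier curl commands.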
misc ref:

```
{
  "did": "cernopendata://1507",
  "selection": "(call ResultTTree (call Select (call SelectMany (call EventDataset 'ServiceXSourceCMSRun1AOD') (lambda (list e) (call (attr e 'TrackMuons') 'globalMuons'))) (lambda (list m) (call (attr m 'pt')))) (list 'mu_pt') 'treeme' 'file.root')",
  "result-destination": "object-store",
  "result-format": "root-file",
  "chunk-size": "1000",
  "workers": "20"
}
```

```
import logging
sh = logging.StreamHandler()
sh.setLevel(logging.DEBUG)
root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG)
root_logger.addHandler(sh)
```

### Discussion

Want to discuss and walk through the following issues: https://github.com/gordonwatts/pyhep-2021-SX-OpenDataDemo

### Issues at hand

Just submitted, successfully, a curl request to https://atlasopendata.servicex.ssl-hep.org/, but I don't know yet whether the jobs will finish or have similar problems as before.

Goal: to have a local deployment be able to use CERN open data, run from a curl request to localhost.

Currently, all attempts at creating the transformers fail; in their descriptions, the failure-to-mount reason is

```
Unable to attach or mount volumes: unmounted volumes=[x509-secret], unattached volumes=[generated-code kube-api-access-9vzw6 x509-secret]: timed out waiting for the condition
```

and

```
MountVolume.SetUp failed for volume "x509-secret" : secret "servicex-x509-proxy" not found
```

Is the trick in the helm installation / in the `my_values.yaml`?

* This may be a bug in the distro - I will follow up on making sure the ATLAS open data works.

---

For https://atlasopendata.servicex.ssl-hep.org/ I can submit jobs similarly, but still do not get a finished result. Going to the website and the dashboard in a web browser, I see the 3 jobs I've submitted so far, with a status of "Running", but with no progress being made, and a "Total Size" of 0 bytes. But supposedly the jobs are reported as completed when others view the site?
## 2021-06-16

### Issues

**[!]** Current problem: the transformer pods are hanging on ContainerCreating status.

Checking the logs of the servicex-did-finder-cernopendata pod shows that it looks like the data isn't being fetched. This is supported by the jobs in the local ServiceX dashboard: each of the jobs has a "Total Size" of 0 bytes. But it also says "Files: 1", "Files Remaining: 1", so it may not be getting to the lookup step at all. The last several lines indicate it did indeed connect, but the DID lookup request for dataset 3827 shows a metric like {timestamp: ..., file_path: ..., **"file_size": 0**, ...}.

I tried again with cernopendata://1507, and that one has 22 files, which was shown on the ServiceX dashboard, but again there were 0 bytes of data being transferred. This leads me to believe it is an **authentication issue**. I've done everything right, I believe.

* Getting a grid certificate may be necessary, which will be a pain.
  * It may not be needed if the servicex-did-finder handles authentication (or the lack thereof). I saw this was a fresh topic of discussion in Slack, so it may require authentication anyway even though it doesn't need to.
* The ServiceX release makes 5 transformer pods, but they never finish and keep crashing and restarting. The error might be due to x509-secrets, which will be a big problem if it is.
* The log of one of the transformers is

```
cp: cannot stat '/etc/grid-security-ro/x509up': No such file or directory
chmod: cannot access '/etc/grid-security/x509up': No such file or directory
usage: transformer.py [-h] [--brokerlist BROKERLIST] [--topic TOPIC]
                      [--chunks CHUNKS] [--tree TREE] [--path PATH]
                      [--limit LIMIT]
                      [--result-destination {kafka,object-store,output-dir}]
                      [--output-dir OUTPUT_DIR]
                      [--result-format {arrow,parquet,root-file}]
                      [--max-message-size MAX_MESSAGE_SIZE]
                      [--rabbit-uri RABBIT_URI] [--request-id REQUEST_ID]
transformer.py: error: argument --chunks: invalid int value: 'None'
```

**Note:** This doesn't seem to be the issue any more; see the topmost issue.

## 2021-06-15

### Status

curl commands are now working, thanks to including a missing @.

### Notes

- The MinIO accesskey and secretkey are found in the vscode menu `[cluster] > Configuration > Secrets > servicex-minio`.
- Next, understand the kubernetes pods for the transformer. Some are running, but they have many restarts.

### Brief Daily Setup Instructions

- Initialize Docker, and therefore Kubernetes.
- Open VSCode in the ServiceX directory.
- Check on the Workloads > Pods servicex pods.
- Helm:

```
helm dependency update servicex/
helm install -f ../my_values.yaml servicex servicex/
```

- Port forward servicex-app-...
- Port forward servicex-minio-...

Then, after making any modifications to `submit_request.json`, make the ServiceX request with

```
curl -X POST http://localhost:5000/servicex/transformation -H "Content-Type: application/json" -d @../submit_request.json
```

The transformer pods will then be created. To list the active deployments, do

```
kubectl get deployments --all-namespaces
```

To stop a job... To delete a transformer deployment, do

```
kubectl delete deployment <deployment name>
```

#### Questions

- Basic question, but how do you restart the servicex pods? They all have an age of 7d, but do they even need to be restarted?
> Ans:

## 2021-06-08

* Running a local instance of ServiceX.
  * Requires K8S - comes with Docker for Windows.
  * Install Helm.
  * The helm chart to use is from the [pr_multiple_did_finders](https://github.com/ssl-hep/ServiceX/tree/pr_multiple_did_finders) branch of the ServiceX chart repository.
* Steps to start/run:
  * Clone the ServiceX repo locally.
  * Switch to the proper branch (pr_multiple_did_finders) - because this is the only branch that contains open data.
  * On the command line, cd into the repository root directory.
  * `helm dependency update servicex/`
  * `helm install -f my_values.yaml servicex servicex/`
  * (Personal setup: make sure Docker has started and is running. Mine is kind-control-plane.)
  * To see if it all gets up and running: `kubectl get pods` - you can see which pods are running and which have crashed.
    * Hopefully the X509 container crashes won't affect you.
  * From VSCode, using the Kubernetes plug-in, expand the cluster, then the workloads, then the pods.
    * Right click on the `ServiceX_App` and click "Port Forward", then agree to the 5000:5000 mapping.
    * Right click on the `minio` app and click "Port Forward", then agree to the 9000:9000 mapping.
  * Now you can try a curl to the `localhost:5000` endpoint for an open data file.
* Your first query. SX queries are JSON, placed in the payload of an `http` `POST` request.
  * Use `curl` or `wget` (need the web api post command).
  * On Linux:
    * `curl -X POST http://localhost:5000/servicex/transformation -H "Content-Type: application/json" -d @submit_request.json`
    * Note: the `@` was the source of the problems we were having with the Linux curl command.
  * On PowerShell (Windows, but Linux too I guess, now):
    * `curl -Method Post -Uri http://localhost:5000/servicex/transformation -InFile submit_request.json -ContentType "application/json"`
  * The key is in the `submit_request.json` - see below.
  * To track its progress, open the open port in a browser at [`http://localhost:5000`](http://localhost:5000) - you should see a query with one file.

My `my_values.yaml` file I used to start ServiceX on my Windows machine:

```
# Used to run servicex locally on a windows desktop docker, in whatever machine
rabbitmq:
  persistence:
    enabled: false
  auth:
    erlangCookie: forkitover42
minio:
  persistence:
    enabled: false
# Turn off the static file demo mode (by making it empty); if we want to use it,
# then we should not specify it.
didFinder:
  rucio_host: https://voatlasrucio-server-prod.cern.ch:443
  auth_host: https://voatlasrucio-auth-prod.cern.ch:443
postgres:
  enabled: true
objectStore:
  publicURL: localhost:9000
codeGen:
  image: sslhep/servicex_code_gen_func_adl_uproot
transformer:
  pullPolicy: Always
  defaultTransformerImage: sslhep/servicex_func_adl_uproot_transformer:develop
autoscaler:
  minReplicas: 1
```

And a possible `submit_request.json`:

- Figure out what's wrong with the submit_requests.json

```
{
  "did": "cernopendata://3827",
  "tree-name": "nominal",
  "selection": "(Select (call EventDataset) (lambda (list e) (list (attr e 'jet_pt'))))",
  "result-destination": "object-store",
  "result-format": "parquet",
  "workers": 1
}
```
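A payload like the `submit_request.json` above could be assembled on the C++ side with plain string formatting (a real implementation would use jsoncpp). This sketch covers only a subset of the fields shown, assumes the selection string needs no escaping, and the function name is made up.

```cpp
#include <string>

// Build a minimal ServiceX submit-request JSON string from its fields.
// Fields not parameterized here ("result-destination", "result-format")
// are fixed to the values used in the example request above.
inline std::string BuildSubmitRequest(const std::string& did,
                                      const std::string& selection,
                                      int workers) {
    return std::string("{")
        + "\"did\": \"" + did + "\", "
        + "\"selection\": \"" + selection + "\", "
        + "\"result-destination\": \"object-store\", "
        + "\"result-format\": \"parquet\", "
        + "\"workers\": " + std::to_string(workers) + "}";
}
```

The resulting string is what would be POSTed to the `/servicex/transformation` endpoint in place of the `-d @submit_request.json` file.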