One of the first challenges researchers face when accessing data is determining whether a dataset has information relevant to their research question. The NIAGADS team is committed to reducing this burden for researchers through NIAGADS Open Access, a suite of knowledge bases that help users explore the NIAGADS data collection and annotations for Alzheimer’s Disease and Related Dementias (ADRD).
The newest addition to NIAGADS Open Access is the Data Discovery Portal, powered by Gen3. We caught up with Andy Wilk, the lead programmer for the Data Discovery Portal, to learn about the development process and how the portal can assist users.
What is the Data Discovery Portal?
It’s a tool that researchers can use in two ways. First, researchers who don’t yet have an approved data access request can use the portal to explore the types of data NIAGADS has. Right now, exploration is limited to the Alzheimer’s Disease Sequencing Project (ADSP) dataset (NG00067).
Second, researchers who do have an approved access request can use the portal to build a research cohort based on filters and download a file manifest for their selections, which they can then use in AWS to access the files of interest.
What considerations were there when developing the portal?
It was important that it be cloud-based and meet industry standards for data compliance. We also wanted a tool and framework that was already used by other members of the community and by other groups that collaborate with us, so that we would have the option to connect the systems together.
How did you decide on Gen3?
We started by searching through open-source solutions, focusing on one that was cost-effective and mature so that there would be less development work.
The team was already aware of Gen3, and once we started evaluating it against other options, it was the best fit. It was open source, had been widely adopted by other organizations (AnVIL, Kids First, etc.), particularly some that NIAGADS works with, and had extensive documentation. Also very important was that it had an active developer community, so we were able to connect with people who had experience working with the software. We also really liked how fast it was to apply filters and get nice visualizations of how the data was broken out.
What did the development process look like?
Initial deployment of the application was pretty quick, about 3 or 4 weeks to get the system up and running, all in the Amazon cloud. This included the architecture of several microservices that work together, like data submission, permissions, UI, search query and population. Then we were able to start loading test data with spreadsheets and build in a submission portal.
The transition from proof of concept to a working portal took about 3 years. Extensive work was needed to build all the technical infrastructure around it: updating service versions, maintaining databases, building the infrastructure to import information (data and user permissions) from the DSS, and translating data into the Gen3 format so that it could be viewed in Gen3.
The last step was setting up NIH authentication login, which took 2 years to get enabled. We needed to work with a lot of stakeholders at UPenn and the NIH CIT organization, which involved an onboarding process for RAS, developing a lot of documentation that had to go through review and approval, and developing a testing process before we got credentials.
Once we had credentials, we still had a decent amount of user interface (UI/UX) testing to do, including checking that all the permission changes were translated from the DSS to Gen3 correctly. We spent a lot of time tweaking the UI for the best user experience and conducting intrusion testing to make sure it was secure from cyberattacks.
Altogether, it took about 5 years to go from concept to the beta version we have up and running today, and it was a team effort. Wan-Ping Lee, PhD, directed the project with Youli Ren as our project manager. I led the application development and communication with stakeholders. Otto Valladares helped with technical assistance and with the services being deployed in AWS. Amanda Kuzma, Lauren Bass, Naveensri Saravanan, Joseph Manuel, and Heather Nicaretta helped with developing the data organization so we could translate the data relationships in the DSS to a graphical representation, as well as with the UI and permissions testing. Zivadin Katanic worked on the UI/UX development, and Conor Klamann provided insight on how to get an API to pull data from the DSS into Gen3. Finally, a big thanks to BioTeam, who helped us work on the custom functionalities we wanted to integrate for users.
What was challenging about the development process?
It was a complex system. To organize all the microservices, we had to use a lot of technologies and frameworks. There was a definite learning curve in adopting new technology, but the Gen3 developer community and Otto were super helpful.
Gen3 has excellent documentation for standing up standard Gen3, but it is so customizable that there can’t be documentation for everything. There were instances where we wanted to do something and knew it was technically possible, but we had to reverse engineer it without any insight into how it might be done. We spent a lot of time looking at a lot of code and reaching out to the developer community.
Any future plans?
Yes. Right now, the Data Discovery Portal only has the Alzheimer’s Disease Sequencing Project (ADSP, NG00067) loaded in, which is why we consider the current release a beta version. We intend to get all the NIAGADS DSS datasets into the Data Discovery Portal in the future.
We are also working on additional functionalities to deploy, like a workspace feature. Currently, users are responsible for downloading a manifest, downloading the data themselves, and then putting the data somewhere before they can start their analysis. The hope for the workspace feature is that it will allow researchers to build a cohort, download a file manifest, and then upload that manifest to the workspace so the data can be pulled directly into that environment for analysis.
How do you see the Data Discovery Portal having an impact on researchers?
We hope it allows researchers to easily navigate and discover the data we have, saving them time and helping them identify data of interest when they are writing grants.
Check out the Data Discovery Portal and its accompanying user guide to explore the data available in the ADSP dataset (NG00067).