The PubFlow Framework

PubFlow is a publication framework for scientific data, built on top of proven business workflow technologies such as BPMN 2.0, Apache ODE and JBoss jBPM. It brings automation and the division of work to the domain of scientific data management. PubFlow is based on the assumption that data managers know best which processes and guidelines have to be followed to publish research data to a publicly available archive. Unfortunately, the amount of scientific data is so overwhelming that data managers alone cannot curate each dataset and upload it to the archives.

Researchers, institutes and funding agencies, on the other hand, want their research data to be published as soon as possible. This requirement was taken into account when the PubFlow system was designed.

In PubFlow, the role of the data manager is to define the publication workflows and to take care of complex tasks. The publication process for a specific dataset, on the other hand, is triggered by a scientist. She chooses a predefined workflow that meets her requirements from a list and starts it through a common ticket system such as Jira, providing her dataset as input. PubFlow then executes the selected workflow on the dataset the scientist uploaded. Whenever a problem occurs and the workflow cannot continue, PubFlow creates a new ticket in the ticket system and assigns it to a data manager or to the researcher who uploaded the dataset. Once the problem described by the ticket is marked as solved, PubFlow continues the workflow execution.
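
The interaction described above can be illustrated with a minimal sketch. It is not PubFlow code and does not use the Jira, BPMN or jBPM APIs: every name in it (TicketSystem, Engine, StepFailed, the "publish" workflow and its two steps) is hypothetical, and the ticket system is an in-memory stand-in. The sketch also assumes that a failed step is retried once its ticket is marked as solved, which the description above does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class StepFailed(Exception):
    """Raised by a workflow step that cannot continue without human help."""


@dataclass
class Ticket:
    id: int
    summary: str
    assignee: str
    solved: bool = False


class TicketSystem:
    """Hypothetical in-memory stand-in for a ticket system such as Jira."""

    def __init__(self) -> None:
        self.tickets: List[Ticket] = []

    def create(self, summary: str, assignee: str) -> Ticket:
        ticket = Ticket(len(self.tickets) + 1, summary, assignee)
        self.tickets.append(ticket)
        return ticket


Step = Callable[[dict], None]


@dataclass
class WorkflowRun:
    steps: List[Step]
    dataset: dict
    position: int = 0                    # index of the next step to execute
    blocked_by: Optional[Ticket] = None  # open ticket that halts the run

    @property
    def finished(self) -> bool:
        return self.position == len(self.steps)


class Engine:
    def __init__(self, tickets: TicketSystem) -> None:
        self.tickets = tickets
        self.workflows: Dict[str, List[Step]] = {}

    def register(self, name: str, steps: List[Step]) -> None:
        """Data manager: predefine a workflow under a name."""
        self.workflows[name] = list(steps)

    def start(self, name: str, dataset: dict) -> WorkflowRun:
        """Scientist: pick a predefined workflow and provide a dataset."""
        return self.resume(WorkflowRun(self.workflows[name], dataset))

    def resume(self, run: WorkflowRun) -> WorkflowRun:
        """Execute steps until the run finishes or a problem occurs."""
        if run.blocked_by is not None and not run.blocked_by.solved:
            return run  # the ticket is still open, keep waiting
        run.blocked_by = None
        while not run.finished:
            try:
                run.steps[run.position](run.dataset)
            except StepFailed as problem:
                # Assumption: every problem goes to the data manager.
                run.blocked_by = self.tickets.create(str(problem), "data-manager")
                return run
            run.position += 1
        return run


def check_metadata(dataset: dict) -> None:
    if "author" not in dataset:
        raise StepFailed("metadata lacks an author")


def upload(dataset: dict) -> None:
    dataset["published"] = True  # placeholder for the archive upload


tickets = TicketSystem()
engine = Engine(tickets)
engine.register("publish", [check_metadata, upload])

run = engine.start("publish", {"name": "example-dataset"})
was_blocked = run.blocked_by is not None  # the first step failed

run.dataset["author"] = "A. Scientist"    # someone fixes the problem
run.blocked_by.solved = True              # and marks the ticket as solved
engine.resume(run)                        # execution continues
```

In this sketch the run stops at the metadata check, a ticket is created, and the workflow completes only after the ticket is marked as solved and resume is called again. A real deployment would poll or be notified by the ticket system rather than rely on an explicit resume call.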

As an evaluation scenario, we created a workflow that transfers data from an institutional repository, the OCN database at GEOMAR, to the WDC mare Pangaea. However, PubFlow is not restricted to this workflow. Data managers can add their own workflows to PubFlow and thereby support different phases of the lifecycle of a research dataset.



The picture below shows the architecture of the PubFlow system. We designed the system to be highly modular, so developers can adapt it to their specific requirements, integrate it into their IT landscape, replace complete modules, or reuse PubFlow modules in their own data management solutions.