An Exercise in Using ACSets as a Data Structure in Research
Jacob S. Zelko
Thu Aug 01 2024
I’ve been thinking for some time now about an idea: running a small multimodal health study using ACSets as the core data structure to relate the different modalities of data I have, in order to investigate a research question (any kind at the moment) as an exercise in what is possible with this data structure. To describe my data, here is the type of data that I have:
- Dataset 1: Climate data sampled at a regular time series
- Exists at geographic regions (regions, census groups, states, territories, counties, etc.)
- Sampling can occur hourly or even more precisely over long time horizons (i.e. decades)
- Exists as a database or file
- Dataset 2: Electronic health records sampled at irregular time series
- Exists at individual person level
- Sampling can occur very irregularly over long time horizons (i.e. decades)
- Exists as a large database
- Dataset 3: Census microdata sampled at regular time series over a long time horizon
- Can exist at general population levels, geographic regions, and more
- Sampling occurs regularly but is very sparse over decades worth of time
- Wide variety of data ranging from socioeconomic to demographic information
- Exists as a database or file
What I have been curious about is how ACSets could truly be operationalized in this problem space.
Suppose, for example, I am running a study where I am trying to predict which patients go on to have a heart attack after having a history of heat-related illnesses (or illnesses highly correlated with high temperatures). How I might do this normally is the following (bear with me, I am skipping over some of the specific nuance):
1. Define a patient population who have a history of heat-related illnesses but have not had a heart attack at some initial time point, t_1 (using Dataset 2)
2. Define a patient population who have had heart attacks at some later time point, t_2 (using Dataset 2)
3. Harmonize Datasets 1 and 3 against the patient population defined in 1 using some key or index (maybe geographic location or demographic group)
4. Construct a data frame with a variety of prediction features and an outcome column from the population defined in 2
5. Run a simple logistic regression to see how well I can predict, using my features, which patients from my initial group go on to have heart attacks (a rough sketch of this pipeline is below)
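To make that concrete for myself, here is a minimal sketch of this workflow in Julia. The tables, column names, and values are all hypothetical stand-ins for my real datasets, not the actual data:

```julia
using DataFrames, GLM

# Hypothetical stand-ins for Datasets 1-3, already keyed by county:
ehr = DataFrame(patient = 1:8, county = [1, 1, 2, 2, 3, 3, 4, 4],
                heat_illness = [true, true, true, false, true, true, true, true],
                heart_attack_t2 = [true, false, true, false, false, true, true, false])
climate = DataFrame(county = 1:4, mean_summer_temp = [33.1, 29.4, 31.2, 35.0])
census = DataFrame(county = 1:4, median_income = [48_000, 61_000, 52_000, 45_000])

# Steps 1-2: the cohort with a heat-illness history; outcome observed at t_2
cohort = filter(:heat_illness => identity, ehr)

# Step 3: harmonize the climate and census data against the cohort via a key
features = leftjoin(cohort, climate; on = :county)
features = leftjoin(features, census; on = :county)

# Steps 4-5: feature/outcome frame, then a simple logistic regression
model = glm(@formula(heart_attack_t2 ~ mean_summer_temp + median_income),
            features, Binomial(), LogitLink())
```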
In this situation, I would do a lot of manual harmonization and some feature engineering. Although this example is somewhat contrived, it is similar to some questions I would love to explore. Additionally, I know a possible critique of my line of thought is, “well, if this approach works for you, why don’t you just continue using it?” The whole reason I am going through this mental exercise is that I am very curious to see what sort of questions or process improvements the adoption of ACSets in my workflow could enable or produce.
I know that @slwu has been looking at ACSet use here and there and I’ve come across some of the applied category theory work of Simon Frost and, of course, Nathaniel Osgood’s work. Finally, I am just very curious about pushing the bounds of what can be done with this data structure.
Reading up on the original ACSet paper, Categorical Data Structures for Technical Computing, I know the following about ACSets:
- Act as a unifying abstract data type
- Are particularly useful for data structures such as graphs and data frames
- Combinatorial data could be thought of as data that:
  - Exists solely within a graph structure
  - Defines vertices in a graph structure
  - Defines edges in a graph structure
  - Is defined only up to isomorphism of the set of all vertices and the set of all edges, as long as edge-vertex relationships are maintained
- Attribute data has something concrete that describes it apart from the graph structure
  - It encodes symmetries or relationships in the data that are important to that data
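To check my own understanding, here is a guess at what a fragment of my toy study could look like as an acset in Catlab; the schema and all names are my own invention, just a sketch:

```julia
using Catlab

# Guessed schema: patients live in regions, and climate readings are
# taken at regions; all names here are invented for illustration.
@present SchStudy(FreeSchema) begin
  (Patient, Region, Reading)::Ob
  lives_in::Hom(Patient, Region)
  site::Hom(Reading, Region)
  (TimeT, TempT)::AttrType
  taken_at::Attr(Reading, TimeT)
  temp::Attr(Reading, TempT)
end
@acset_type Study(SchStudy, index=[:lives_in, :site])

study = Study{Float64, Float64}()
add_parts!(study, :Region, 2)
add_parts!(study, :Patient, 3, lives_in=[1, 1, 2])
add_parts!(study, :Reading, 2, site=[1, 2],
           taken_at=[0.0, 1.0], temp=[30.5, 28.1])

incident(study, 1, :lives_in)  # all patients living in region 1
```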
The paper also mentions that this could lead to the development of novel data structures, but at the moment I am not quite seeing how I could make such structures. What else am I missing or not understanding?
In conclusion, the following questions still are floating in my mind:
- How could I use ACSets to associate my datasets together?
- How can I perform a statistical method on top of the relations I defined with an ACSet?
- What am I missing in my mental model of developing a study with ACSets as a core data structure? What other categorical knowledge should I recall about this work?
Like I said, this really is more of an exercise to see what is possible as well as for me to “push the envelope” in my own research. If I can come out of this exercise with more knowledge about even greater visions for ACT applications, that would be a huge win for me. A paper or at least blog post would be another interesting outcome too!
Thanks all!
jz
P.S. I was also inspired by your work on ODEs and ACSets @kris-brown so I am going to CC you and @owenlynch here.
Sean
Thu Aug 01 2024
Hi @TheCedarPrince, thanks for the tag! I’ll just jot down some pretty disjointed thoughts as a user of acsets in industry with very limited ACT knowledge.
I am using acsets in one “production” application at Merck. I had intended to use them in more, but a few things have prevented my wider adoption of them, which I’ll list below.
- Acsets (at least the ones generated with `@acset_type`; I have not used the others) are a bit unwieldy to use in the exploratory phase, when the number of tables, number of columns, types of columns, etc. are still being decided upon. I still use `DataFrames` for much exploratory work, and even for “production” problems where the structure (the C in C-Set) is liable to change, because it is a lot of work to adapt to changes in the acset. I know there was some work on pushouts of FinCats and FinFunctors at some point that could maybe make combining acsets easier in exploratory work (like using the various `join` functions on data frames).
- There is no way to declare that an `Ob` in a schema is the product of two others and use the projections to index into the product. That is obviously very, very useful in practice but is not yet in acsets (though it is not that awful to write an intersection of multiple calls to `incident`). I think Kris has some work in GATlab here, Initial work on combinatorial models of GATs by kris-brown · Pull Request #135 · AlgebraicJulia/GATlab.jl · GitHub, which may at some point be the future for combinatorial data structures, but I do not know.
- There is a `Tables` interface for acsets, but then you have the acset and the tables living in your Julia session, which I suppose isn’t too bad. Anyway, this also may be a possible weakness at the moment. I don’t know if DataMigrations could be used for some of these operations; I would guess the answer is yes, but there isn’t enough documentation for me to be able to use it currently.

The application where acsets are being used has the virtue of a very clear data structure (a graph with other doodads and tables hanging off it) that will be stable over time. The graph is used to structure an optimization problem that is passed to JuMP. Something very lovely about acsets + JuMP is being able to generate anonymous variables or constraints and store them directly in the attributes of the acset. In my case, think of having decision variables at each node and edge. When generating the constraints, I iterate over edges and look up the associated decision variables of sources/targets, which is very elegant and readable in what would otherwise be a difficult-to-understand integer linear problem. I think using acsets as a structure to build mathematical programs and link them to data in the same structure is a really killer combination of tools. And being able to formulate queries as UWDs is absolutely fantastic for clarity in code, and for visualizing them in presentations.
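To illustrate the acsets + JuMP pattern, here is a toy sketch (not the actual application; the schema and all names are invented for illustration):

```julia
using Catlab, JuMP, HiGHS

# A graph whose vertices and edges carry JuMP decision variables
# as acset attributes.
@present SchDecisionGraph <: SchGraph begin
  Var::AttrType
  vvar::Attr(V, Var)
  evar::Attr(E, Var)
end
@acset_type DecisionGraph(SchDecisionGraph, index=[:src, :tgt])

model = Model(HiGHS.Optimizer)
g = DecisionGraph{VariableRef}()
add_parts!(g, :V, 3,
           vvar=[@variable(model, lower_bound = 0, upper_bound = 1) for _ in 1:3])
add_parts!(g, :E, 2, src=[1, 2], tgt=[2, 3],
           evar=[@variable(model, lower_bound = 0) for _ in 1:2])

# Generate constraints by iterating over edges and looking up the
# decision variables stored on their source/target vertices.
for e in parts(g, :E)
  @constraint(model, g[e, :evar] <= g[g[e, :src], :vvar] + g[g[e, :tgt], :vvar])
end
@objective(model, Max, sum(g[e, :evar] for e in parts(g, :E)))
optimize!(model)
```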
Jacob S. Zelko
Sat Aug 03 2024
Hey @slwu,
Wow! Thanks for the awesome thoughts and feedback here; let me see if I can respond to each point:
Hunh! This is a bit unexpected, as I was under the impression that I could use acsets themselves for exploration. So, would you generally recommend reaching for acsets only once the exploratory phase of work is done, since otherwise they might be a bit too cumbersome?

This rather leads into an interesting question for me: what problems are acsets best suited for? It doesn’t seem like the toy problem I enumerated above is well-suited – or perhaps it is?
Could you actually say why this would be so useful? It’s unclear to me. Are you suggesting that this would give the means to indicate in the acset what features are in fact “the same” across data resources quickly?
Yea, that’s very fair I feel. I really like the fact that there does exist a tables interface for acsets, as that should give the ability to “just use” packages that expect `Tables` objects. And interesting about the in-memory nature of the interface; I thought the filtering and other table operations were applied lazily rather than first having to materialize the underlying data objects in a session.

Oh, that sounds absolutely great! Do you happen to have any of those examples public somewhere? I’d love to see them – I seem to vaguely recall some code you and Simon Frost worked on regarding operationalizing ACT for epidemiology work, but I am not sure.
Otherwise, this was a really interesting perspective, Sean! The feeling I get is that when I have a study and want to run a graph-based method on my data, it might be better to reach for acsets as the underlying representation for my study. But then, separately, some outstanding questions I still had after your explanation were:
I know that last question is vague and seems odd but I am still trying to figure out an exercise I could perform myself with the data structure to answer a question I might be interested in.
Thanks!
P.S. I am going to CC @epatters and @jpfairbanks as you all may find this discussion interesting. Sorry if it is just noise otherwise!
Sean
Sat Aug 03 2024
Sure, you absolutely can. It’s just that in the case where the C of the C-Set is still in flux, and I’m looking at a lot of possible columns across my tables coming from some database system I don’t understand or don’t have access to explanatory documents for, the added weight of having to design a schema for every possible schema that could represent the data at hand is too cumbersome for me. It’s often easier for me in practice to figure out those problems using `DataFrame` objects and the usual `join`, `crossjoin`, etc., and then move to acsets once I’m somewhat more certain what the schema will be moving forward.

Also, again, this is using `@acset_type`. I don’t know if dynamic acsets or the third type of acset whose name escapes me would be better suited for this. They may be, as they don’t store the schema information at the type level, but since they use the same interface which allows them to connect to the categorical machinery in Catlab, they may also be inconvenient (for me) in the same ways. I haven’t experimented yet.

But for your case it sounds like you are interested in applying acsets from inception to completion of a data science project in the health sciences domain, so it could be really interesting and helpful, not just for yourself but for AlgebraicJulia in general, to hear your feedback on another start-to-finish application.
As to the interesting question, well, as Kris pointed out, acsets (C-Sets) can handle terms with unary or nullary constructors, so currently they are not well suited to things with higher order constructors. Which leads to the next point…
I know you have a lot of experience in data engineering so I think you might have just misunderstood slightly what I meant. I mean by that statement “multicolumn indexing” as in SQL: Creating Multicolumn Indexes in SQL | Atlassian. You cannot currently create “multicolumn indexes” (indexing into the apex of a span by morphisms into the feet) with acsets. Here is the relevant issue in acsets Multicolumn indexes · Issue #17 · AlgebraicJulia/ACSets.jl · GitHub and @kris-brown’s experiments on generalizations of C-Sets that would address such things Initial work on combinatorial models of GATs by kris-brown · Pull Request #135 · AlgebraicJulia/GATlab.jl · GitHub.
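Concretely, the workaround is intersecting `incident` calls along each leg of the span. Here is a toy sketch (the schema is made up for illustration):

```julia
using Catlab

# Toy span: an Enrollment relates a Student and a Course; indexing into
# Enrollment by both legs at once is the "multicolumn index" being emulated.
@present SchEnrollment(FreeSchema) begin
  (Student, Course, Enrollment)::Ob
  student::Hom(Enrollment, Student)
  course::Hom(Enrollment, Course)
end
@acset_type EnrollmentDB(SchEnrollment, index=[:student, :course])

db = EnrollmentDB()
add_parts!(db, :Student, 2)
add_parts!(db, :Course, 2)
add_parts!(db, :Enrollment, 3, student=[1, 1, 2], course=[1, 2, 2])

# All enrollments with student 1 AND course 2, via intersected lookups:
hits = intersect(incident(db, 1, :student), incident(db, 2, :course))
```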
I think you are right. It just feels a little annoying to me to go that extra step. I think it would be helpful and fun to look into implementing as many of the `DataFrames` convenience functions as make sense for acsets, in a way that is performant and also “feels” like using them on a data frame.

Nothing in a nice format. Here is some code I wrote that uses the `Catlab.Graphs` interface (a thin wrapper over acsets) to formulate and solve (using JuMP + HiGHS) a basic project scheduling problem: Orcas.jl/src/BasicSchedule.jl at main · slwu89/Orcas.jl · GitHub. Here is something I wrote that compares using acsets to data frames to structure and build an optimization problem: optimex/gams.md at main · slwu89/optimex · GitHub

I think you absolutely could run regressions, etc. on top of the acset data structure. Currently I am doing nearly zero statistical work (in the sense of fitting models and interpreting their coefficients/parameters), so I’d be very interested to see what your experiments with acsets for health research look like. All of my modeling work is pure optimization focused: building integer linear programming models of business or logistics processes, or sometimes connecting to a constraint problem solver to find sets of feasible solutions (e.g. all subgraphs that fulfill some properties).
Jacob S. Zelko
Mon Aug 12 2024
Hey Sean,
Sorry to have vanished! Was in the midst of wrapping up a paper (submitted!), readjusting to my apartment after vacation (unpacked!), and scheming for this upcoming Fall semester (busy!). But, perhaps it was good to step away for a moment as I have been thinking quite heavily about our discussion and your idea here:
So, I spent a lot of time thinking about this and yea, I am going to try something. Here’s the sketch of the idea:
Inspired by thoughts from @bgavran’s papers in applying Category Theory to Machine Learning (namely Category Theory in Machine Learning and Position: Categorical Deep Learning is an Algebraic Theory of All Architectures) and Sean’s thoughts, my tentative proposal is to:
So as for each component of this study, here’s where things are at:
So that said, in terms of next steps: this is the start of the project, which I’ll be working on within this repository: TheCedarPrince/CompositionalMLStudy · GitHub (if it is alright, I am terming this a Compositional Machine Learning project, completely inspired by the language of @bgavran’s 2019 thesis, Compositional Deep Learning, as “categorical machine learning” is already overloaded terminology).

I’ll work on 1 in the interim and think about 2 some more. Some final open questions for folks as I move along:
Thanks Sean – you gave me much to think about these past few days!
Hopefully I can make something good on these ideas!
Cheers!
~ jz
Jacob S. Zelko
Wed Aug 14 2024
Hey @slwu and folks, just following up here with some more thoughts:
I’ll keep thinking some more on this problem space!
Sean
Wed Aug 14 2024
@TheCedarPrince, regarding data exchange, I’ve used both the JSON serialization and Excel import/export (well, it’s popular) features of acsets in my work. The tests should be enough to figure out the API: ACSets.jl/test/serialization at main · AlgebraicJulia/ACSets.jl · GitHub. This issue has been an annoyance in using them before: Serialize concrete acset schemas · Issue #63 · AlgebraicJulia/ACSets.jl · GitHub. I do not know if InterTypes addresses it yet, as I haven’t looked at that part of the package.
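For reference, a minimal JSON round trip looks something like the sketch below (from memory, so double-check against the linked tests):

```julia
using Catlab

# Build a small graph acset and round-trip it through JSON.
g = @acset Graph begin
  V = 3; E = 2
  src = [1, 2]; tgt = [2, 3]
end

write_json_acset(g, "graph.json")
g2 = read_json_acset(Graph, "graph.json")  # the concrete acset type must be
@assert g == g2                            # supplied, hence issue #63 above
```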
Regarding harmonizing data of different temporal resolutions in an acset framework, I haven’t needed to worry about that yet. @jpfairbanks, do you have any thoughts on this? I seem to recall someone in your lab is working on time series data analysis with acsets, or at least in an ACT framework?
James
Thu Aug 15 2024
InterTypes definitely addresses the problem of deserializing highly structured data from JSON. The idea is that with InterTypes you can have ADTs in your attributes and ACSets in your ADTs. So you could have a Graph where the edges have Formulas on them, or Formulas that have DWDs as their function definitions.
Wilmer Leal and @Benjamin_Bumpus wrote this paper on temporal sheaves of data, which @Matt_Cuffaro has been implementing. There is change of resolution as a tool you can use in that framework if you want to scale the granularity of the time index.