An Exercise in Using ACSets as a Data Structure in Research
Jacob S. Zelko
Thu Aug 01 2024
I’ve been thinking for some time now about an idea: running a small multimodal health study using ACSets as the core data structure to relate the different modalities of data I have, in order to investigate a research question (any kind at the moment) as an exercise in what is possible with this data structure. To describe my data, here is the type of data that I have:
- Dataset 1: Climate data sampled at a regular time series
- Exists at geographic regions (regions, census groups, states, territories, counties, etc.)
- Sampling can occur hourly or even more precisely over long time horizons (i.e. decades)
- Exists as a database or file
- Dataset 2: Electronic health records sampled at irregular time series
- Exists at individual person level
- Sampling can occur very irregularly over long time horizons (i.e. decades)
- Exists as a large database
- Dataset 3: Census microdata sampled at regular time series over a long time horizon
- Can exist at general population levels, geographic regions, and more
- Sampling occurs regularly but is very sparse over decades worth of time
- Wide variety of data ranging from socioeconomic to demographic information
- Exists as a database or file
What I have been curious about is how ACSets could truly be operationalized in this problem space.
Suppose, for example, I am running a study where I am trying to predict which patients go on to have a heart attack after having a history of heat-related illnesses (or illnesses highly correlated with high temperatures). How I might do this normally is the following (bear with me, I am skipping over some of the specific nuance):
1. Define a patient population who have a history of heat-related illnesses but have not had a heart attack at some initial time point, t_1 (using Dataset 2)
2. Define a patient population who have had heart attacks at some later time point, t_2 (using Dataset 2)
3. Harmonize Datasets 1 and 3 against the patient population defined in 1 using some key or index (maybe geographic location or demographic group)
4. Construct a data frame with a variety of prediction features and an outcome column from the population defined in 2
5. Run a simple logistic regression to see how well I can predict, using my features, which patients from my initial group go on to have heart attacks (a rough sketch of this pipeline is below)
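To make that concrete for myself, here is a minimal sketch of this workflow in Julia. The tables, column names, and values are all hypothetical stand-ins for my real datasets, not the actual data:

```julia
using DataFrames, GLM

# Hypothetical stand-ins for Datasets 1-3, already keyed by county:
ehr = DataFrame(patient = 1:8, county = [1, 1, 2, 2, 3, 3, 4, 4],
                heat_illness = [true, true, true, false, true, true, true, true],
                heart_attack_t2 = [true, false, true, false, false, true, true, false])
climate = DataFrame(county = 1:4, mean_summer_temp = [33.1, 29.4, 31.2, 35.0])
census = DataFrame(county = 1:4, median_income = [48_000, 61_000, 52_000, 45_000])

# Steps 1-2: the cohort with a heat-illness history; outcome observed at t_2
cohort = filter(:heat_illness => identity, ehr)

# Step 3: harmonize the climate and census data against the cohort via a key
features = leftjoin(cohort, climate; on = :county)
features = leftjoin(features, census; on = :county)

# Steps 4-5: feature/outcome frame, then a simple logistic regression
model = glm(@formula(heart_attack_t2 ~ mean_summer_temp + median_income),
            features, Binomial(), LogitLink())
```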
In this situation, I would do a lot of manual harmonization and some feature engineering. Although this example is somewhat contrived, it is similar to some questions I would love to explore. Additionally, I know a possible critique of my line of thought is, “well, if this approach works for you, why don’t you just continue using it?” The whole reason I am going through this mental exercise is that I am very curious to see what sort of questions or process improvements the adoption of ACSets in my workflow could enable or produce.
I know that @slwu has been looking at ACSet use here and there and I’ve come across some of the applied category theory work of Simon Frost and, of course, Nathaniel Osgood’s work. Finally, I am just very curious about pushing the bounds of what can be done with this data structure.
Reading up on the original ACSet paper, Categorical Data Structures for Technical Computing, I know the following about ACSets:
- Act as a unifying abstract data type
- Are particularly useful for data structures such as graphs and data frames
- Combinatorial data could be thought of as data that:
  - Exists solely within a graph structure
  - Defines vertices in a graph structure
  - Defines edges in a graph structure
  - Is defined only up to isomorphism of the set of all vertices and the set of all edges, as long as edge-vertex relationships are maintained
- Attribute data has something concrete that describes it apart from the graph structure
  - It encodes symmetries or relationships in the data that are important to that data
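To check my own understanding, here is a guess at what a fragment of my toy study could look like as an acset in Catlab; the schema and all names are my own invention, just a sketch:

```julia
using Catlab

# Guessed schema: patients live in regions, and climate readings are
# taken at regions; all names here are invented for illustration.
@present SchStudy(FreeSchema) begin
  (Patient, Region, Reading)::Ob
  lives_in::Hom(Patient, Region)
  site::Hom(Reading, Region)
  (TimeT, TempT)::AttrType
  taken_at::Attr(Reading, TimeT)
  temp::Attr(Reading, TempT)
end
@acset_type Study(SchStudy, index=[:lives_in, :site])

study = Study{Float64, Float64}()
add_parts!(study, :Region, 2)
add_parts!(study, :Patient, 3, lives_in=[1, 1, 2])
add_parts!(study, :Reading, 2, site=[1, 2],
           taken_at=[0.0, 1.0], temp=[30.5, 28.1])

incident(study, 1, :lives_in)  # all patients living in region 1
```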
The paper also mentions that this could lead to the development of novel data structures, but at the moment I am not quite seeing how I could make such structures. What else am I missing or not understanding?
In conclusion, the following questions still are floating in my mind:
- How could I use ACSets to associate my datasets together?
- How can I perform a statistical method on top of the relations I defined with an ACSet?
- What am I missing in my mental model of developing a study with ACSets as a core data structure? What other categorical knowledge should I recall about this work?
Like I said, this really is more of an exercise to see what is possible as well as for me to “push the envelope” in my own research. If I can come out of this exercise with more knowledge about even greater visions for ACT applications, that would be a huge win for me. A paper or at least blog post would be another interesting outcome too!
Thanks all!
jz
P.S. I was also inspired by your work on ODEs and ACSets @kris-brown so I am going to CC you and @owenlynch here.
Sean
Thu Aug 01 2024
Hi @TheCedarPrince, thanks for the tag! I’ll just jot down some pretty disjointed thoughts as a user of acsets in industry with very limited ACT knowledge.
I am using acsets in one “production” application at Merck. I had intended to use them in more, but a few things have prevented my wider adoption of them, which I’ll list below.
- Acsets (at least the ones generated with `@acset_type`; I have not used the others) are a bit unwieldy to use in the exploratory phase, when the number of tables, number of columns, types of columns, etc. are still being decided upon. I still use `DataFrames` for much exploratory work, and even for “production” problems where the structure (the C in C-Set) is liable to change, because it is a lot of work to adapt to changes in the acset. I know there was some work on pushouts of FinCats and FinFunctors at some point that could maybe make combining acsets easier in exploratory work (like using the various `join` functions on data frames).
- There is no way to declare that an `Ob` in a schema is the product of two others and use the projections to index into the product. That is obviously very, very useful in practice but is not yet in acsets (though it is not that awful to write an intersection of multiple calls to `incident`). I think Kris has some work in GATlab here, Initial work on combinatorial models of GATs by kris-brown · Pull Request #135 · AlgebraicJulia/GATlab.jl · GitHub, which may at some point be the future for combinatorial data structures, but I do not know.
- There is a `Tables` interface for acsets, but then you have the acset and the tables living in your Julia session, which I suppose isn’t too bad. Anyway, this also may be a possible weakness at the moment. I don’t know if DataMigrations could be used for some of these operations; I would guess the answer is yes, but there isn’t enough documentation for me to be able to use it currently.

The application where acsets are being used has the virtue of a very clear data structure (a graph with other doodads and tables hanging off it) that will be stable over time. The graph is used to structure an optimization problem that is passed to JuMP. Something very lovely about acsets + JuMP is being able to generate anonymous variables or constraints and store them directly in the attributes of the acset. In my case, think of having decision variables at each node and edge. When generating the constraints, I iterate over edges and look up the associated decision variables of sources/targets, which is very elegant and readable in what would otherwise be a difficult-to-understand integer linear problem. I think using acsets as a structure to build mathematical programs and link them to data in the same structure is a really killer combination of tools. And being able to formulate queries as UWDs is absolutely fantastic for clarity in code, and for visualizing them in presentations.
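To illustrate the acsets + JuMP pattern, here is a toy sketch (not the actual application; the schema and all names are invented for illustration):

```julia
using Catlab, JuMP, HiGHS

# A graph whose vertices and edges carry JuMP decision variables
# as acset attributes.
@present SchDecisionGraph <: SchGraph begin
  Var::AttrType
  vvar::Attr(V, Var)
  evar::Attr(E, Var)
end
@acset_type DecisionGraph(SchDecisionGraph, index=[:src, :tgt])

model = Model(HiGHS.Optimizer)
g = DecisionGraph{VariableRef}()
add_parts!(g, :V, 3,
           vvar=[@variable(model, lower_bound = 0, upper_bound = 1) for _ in 1:3])
add_parts!(g, :E, 2, src=[1, 2], tgt=[2, 3],
           evar=[@variable(model, lower_bound = 0) for _ in 1:2])

# Generate constraints by iterating over edges and looking up the
# decision variables stored on their source/target vertices.
for e in parts(g, :E)
  @constraint(model, g[e, :evar] <= g[g[e, :src], :vvar] + g[g[e, :tgt], :vvar])
end
@objective(model, Max, sum(g[e, :evar] for e in parts(g, :E)))
optimize!(model)
```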
Jacob S. Zelko
Sat Aug 03 2024
Hey @slwu,
Wow! Thanks for the awesome thoughts and feedback here; let me see if I can respond to each point:
Hunh! This is a bit unexpected, as I was under the impression that I could use acsets themselves for exploration. So, would you generally recommend reaching for acsets only once the exploratory phase of work is done, since otherwise they might be a bit too cumbersome?

This rather leads into an interesting question for me: what problems are acsets best suited for? It doesn’t seem like the toy problem I enumerated above is well-suited – or perhaps it is?
Could you actually say why this would be so useful? It’s unclear to me. Are you suggesting that this would give the means to indicate in the acset what features are in fact “the same” across data resources quickly?
Yea, that’s very fair I feel. I really like the fact that there does exist a tables interface for acsets, as that should give the ability to “just use” packages that expect `Tables` objects. And interesting about the in-memory nature of the interface; I thought the filtering and other table operations were applied lazily rather than first having to materialize the underlying data objects in a session.

Oh, that sounds absolutely great! Do you happen to have any of those examples public somewhere? I’d love to see them – I seem to vaguely recall some code you and Simon Frost worked on regarding operationalizing ACT for epidemiology work, but I am not sure.
Otherwise, this was a really interesting perspective, Sean! The feeling I get is that when I have a study and want to run a graph-based method on my data, it might be better to reach for acsets as the underlying representation for my study. But then, separately, some outstanding questions I still had after your explanation were:
I know that last question is vague and seems odd but I am still trying to figure out an exercise I could perform myself with the data structure to answer a question I might be interested in.
Thanks!
P.S. I am going to CC @epatters and @jpfairbanks as you all may find this discussion interesting. Sorry if it is just noise otherwise!
Sean
Sat Aug 03 2024
Sure, you absolutely can. It’s just that in the case where the C of the C-Set is still in flux, and I’m looking at a lot of possible columns across my tables coming from some database system I don’t understand or don’t have access to explanatory documents for, the added weight of having to design a schema for every possible schema that could represent the data at hand is too cumbersome for me. It’s often easier for me in practice to figure out those problems using `DataFrame` objects and the usual `join`, `crossjoin`, etc., and then move to acsets once I’m somewhat more certain what the schema will be moving forward.

Also, again, this is using `@acset_type`. I don’t know if dynamic acsets or the third type of acset whose name escapes me would be better suited for this. They may be, as they don’t store the schema information at the type level, but since they use the same interface which allows them to connect to the categorical machinery in Catlab, they may also be inconvenient (for me) in the same ways. I haven’t experimented yet.

But for your case it sounds like you are interested in applying acsets from inception to completion of a data science project in the health sciences domain, so it could be really interesting and helpful, not just for yourself but for AlgebraicJulia in general, to hear your feedback on another start-to-finish application.
As to the interesting question, well, as Kris pointed out, acsets (C-Sets) can handle terms with unary or nullary constructors, so currently they are not well suited to things with higher order constructors. Which leads to the next point…
I know you have a lot of experience in data engineering so I think you might have just misunderstood slightly what I meant. I mean by that statement “multicolumn indexing” as in SQL: Creating Multicolumn Indexes in SQL | Atlassian. You cannot currently create “multicolumn indexes” (indexing into the apex of a span by morphisms into the feet) with acsets. Here is the relevant issue in acsets Multicolumn indexes · Issue #17 · AlgebraicJulia/ACSets.jl · GitHub and @kris-brown’s experiments on generalizations of C-Sets that would address such things Initial work on combinatorial models of GATs by kris-brown · Pull Request #135 · AlgebraicJulia/GATlab.jl · GitHub.
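Concretely, the workaround is intersecting `incident` calls along each leg of the span. Here is a toy sketch (the schema is made up for illustration):

```julia
using Catlab

# Toy span: an Enrollment relates a Student and a Course; indexing into
# Enrollment by both legs at once is the "multicolumn index" being emulated.
@present SchEnrollment(FreeSchema) begin
  (Student, Course, Enrollment)::Ob
  student::Hom(Enrollment, Student)
  course::Hom(Enrollment, Course)
end
@acset_type EnrollmentDB(SchEnrollment, index=[:student, :course])

db = EnrollmentDB()
add_parts!(db, :Student, 2)
add_parts!(db, :Course, 2)
add_parts!(db, :Enrollment, 3, student=[1, 1, 2], course=[1, 2, 2])

# All enrollments with student 1 AND course 2, via intersected lookups:
hits = intersect(incident(db, 1, :student), incident(db, 2, :course))
```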
I think you are right. It just feels a little annoying to me to go that extra step. I think it would be helpful and fun to look into implementing as many of the `DataFrames` convenience functions as make sense for acsets, in a way that is performant and also “feels” like using them on a data frame.

Nothing in a nice format. Here is some code I wrote that uses the `Catlab.Graphs` interface (a thin wrapper over acsets) to formulate and solve (using JuMP + HiGHS) a basic project scheduling problem: Orcas.jl/src/BasicSchedule.jl at main · slwu89/Orcas.jl · GitHub. Here is something I wrote that compares using acsets to data frames to structure and build an optimization problem: optimex/gams.md at main · slwu89/optimex · GitHub

I think you absolutely could run regressions, etc. on top of the acset data structure. Currently I am doing nearly zero statistical work (in the sense of fitting models and interpreting their coefficients/parameters), so I’d be very interested to see what your experiments with acsets for health research look like. All of my modeling work is pure optimization focused: building integer linear programming models of business or logistics processes, or sometimes connecting to a constraint problem solver to find sets of feasible solutions (e.g. all subgraphs that fulfill some properties).
Jacob S. Zelko
Mon Aug 12 2024
Hey Sean,
Sorry to have vanished! Was in the midst of wrapping up a paper (submitted!), readjusting to my apartment after vacation (unpacked!), and scheming for this upcoming Fall semester (busy!). But, perhaps it was good to step away for a moment as I have been thinking quite heavily about our discussion and your idea here:
So, I spent a lot of time thinking about this and yea, I am going to try something. Here’s the sketch of the idea:
Inspired by thoughts from @bgavran’s papers in applying Category Theory to Machine Learning (namely Category Theory in Machine Learning and Position: Categorical Deep Learning is an Algebraic Theory of All Architectures) and Sean’s thoughts, my tentative proposal is to:
So as for each component of this study, here’s where things are at:
So that said, in terms of next steps: this is the start of the project, which I’ll be working on within this repository: TheCedarPrince/CompositionalMLStudy · GitHub (if it is alright, I am terming this a Compositional Machine Learning project, completely inspired by the language of @bgavran’s 2019 thesis, Compositional Deep Learning, as “categorical machine learning” is already overloaded terminology).

I’ll work on 1 in the interim and think about 2 some more. Some final open questions for folks as I move along:
Thanks Sean – you gave me much to think about these past few days!
Hopefully I can make something good on these ideas!
Cheers!
~ jz
Jacob S. Zelko
Wed Aug 14 2024
Hey @slwu and folks, just following up here with some more thoughts:
I’ll keep thinking some more on this problem space!
Sean
Wed Aug 14 2024
@TheCedarPrince, regarding data exchange, I’ve used both the JSON serialization and Excel import/export (well, it’s popular) features of acsets in my work. The tests should be enough to figure out the API: ACSets.jl/test/serialization at main · AlgebraicJulia/ACSets.jl · GitHub. This issue has been an annoyance in using them before: Serialize concrete acset schemas · Issue #63 · AlgebraicJulia/ACSets.jl · GitHub. I do not know if InterTypes addresses it yet, as I haven’t looked at that part of the package.
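For reference, a minimal JSON round trip looks something like the sketch below (from memory, so double-check against the linked tests):

```julia
using Catlab

# Build a small graph acset and round-trip it through JSON.
g = @acset Graph begin
  V = 3; E = 2
  src = [1, 2]; tgt = [2, 3]
end

write_json_acset(g, "graph.json")
g2 = read_json_acset(Graph, "graph.json")  # the concrete acset type must be
@assert g == g2                            # supplied, hence issue #63 above
```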
Regarding harmonizing data of different temporal resolutions in an acset framework, I haven’t needed to worry about that yet. @jpfairbanks, do you have any thoughts on this? I seem to recall someone in your lab is working on time series data analysis with acsets, or at least in an ACT framework?
James
Thu Aug 15 2024
InterTypes definitely addresses the problem of deserializing highly structured data from JSON. The idea is that with InterTypes you can have ADTs in your attributes and ACSets in your ADTs. So you could have a Graph where the edges have Formulas on them, or Formulas that have DWDs as their function definitions.
Wilmer Leal and @Benjamin_Bumpus wrote this paper on temporal sheaves of data, which @Matt_Cuffaro has been implementing. There is change of resolution as a tool you can use in that framework if you want to scale the granularity of the time index.