Simplifying Working with GIS


At my current workplace, we regularly need to work with GIS data. This data cannot come from just anywhere; it should come from official sources, i.e. city, regional, or sometimes national departments.

A note on that: official data isn't necessarily better than, say, OpenStreetMap! Often we simply have to use it because we are working together with those departments. Other times, it may be better in some respects and worse in others.

What happens is something like the following: We get some data from one city department in SHP and DXF format; the two should contain the same information but in fact don't. We get another SHP file from another department. Finally, we get a CSV file from a regional department.

To keep things simple, I will say SHP file when I mean a Shapefile (.shp) together with its companion files (.shx, .dbf, etc.). I will also only talk about DXF files, as DXF is an open format, as opposed to DWG.

To be able to use this data, we will now need to process it into something that suits our needs.

For that, we need to search through the data, understand it, parse it, and put it into a format that fits our needs. The first steps, i.e. searching through the data, understanding it, and parsing it, are menial tasks that take time.

Tools like QGIS help with this, but to use QGIS one first needs to understand it. And to understand QGIS, a basic understanding of how GIS data works in general is needed as well.

Often we work with people who know neither. Some of them are computer science students, who would probably pick these things up fairly quickly. Others are architecture students. They are mostly interested in computers and often know some programming, but usually not a lot.

The question is: How can we make it easier for these people to work with this seemingly random mess of data, and to easily make sense of it?

Query Languages

At first, I was thinking in the direction of OpenStreetMap. There, the data (tags) is not prescribed up front, but by now many tags are almost standardized, while very specific ones can still exist. (For example, main streets are tagged mostly the same way all over the world, but building types that are particular to one region can still exist and new ones can still be added. One example of that is building=trullo.)

In OpenStreetMap, it is useful to query data using Overpass Turbo, which mostly uses Overpass QL as its query language. It is also possible to query OpenStreetMap data using SPARQL. Here is a comparison of SPARQL and Overpass QL.

My idea was to create something like this (most likely something like Overpass QL, as that seems easier) for local data.

Basically, something where I point a program or a library to the root directory of my data, the program/library understands the data automatically, and then lets me search it using this query language.
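As a first step towards "pointing at the root directory", here is a minimal sketch of what the discovery part could look like. It only uses the Python standard library and does not parse any geometry. The names catalog, csv_columns and FORMATS are made up for this illustration, and the list of supported suffixes is an assumption.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file suffix to a format label; extend as needed.
# Shapefile companion files (.shx, .dbf, ...) are not listed here, so they
# are skipped and only the main .shp file represents the dataset.
FORMATS = {".shp": "shapefile", ".dxf": "dxf", ".csv": "csv"}


def catalog(root):
    """Walk `root` recursively and group dataset paths by detected format."""
    found = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in FORMATS:
            found[FORMATS[suffix]].append(path)
    return dict(found)


def csv_columns(path):
    """Return the header row of a CSV file as a first hint at its schema."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])
```

Calling catalog("path/to/data") would return a dictionary such as {"shapefile": [...], "csv": [...]}. Actually understanding the contents of each file, and querying across them, is the hard part and is not covered by this sketch.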

This idea is (just like any idea here) not fully thought out, of course. For one, our data will probably be very different from OpenStreetMap data. Also, I haven't yet had the time to look into SHP or DXF files in depth, so I don't yet have the deep understanding of them that I will need here.

Code Generation

After I had this first idea, I wasn't fully satisfied. I do think that a tool like this doesn't really exist yet: I don't know of one that allows me to query heterogeneous datasets just by pointing to the root folder.

Actually, maybe Apache Spark fits this description? I know too little about it to say for sure, but I do remember that Apache Spark should be able to query heterogeneous data. (And now that I think of it, I think DuckDB also tries to achieve this.)

Nonetheless, it was still not really what we wanted.

We expect our users to be able to program just a little. The way the tool will be used is for data-processing purposes, which means that if we exclude GIS tools like QGIS, some form of programming will be necessary.

However, since the users won't have too much programming experience, the tool should be easy. Giving them a full-blown query language might not exactly fit this need.

I had another idea: One very nice thing about Django is that I don't need to handle SQL myself. While knowing SQL is certainly good, it's not what I want to work with when I just quickly want to create a small CRUD backend application. Django completely abstracts that away from me, the user of Django. (Even migrations are abstracted for me, which is extra nice!)

Wouldn't it be nice to have something like that for GIS data?

Currently, if we have this folder full of heterogeneous GIS files, we need to handle each of them not only separately, but also according to its specific format. Thus, if we have a Shapefile, we need to do import shapefile, then shapefile.get_fields(...). For DXF, it's the same again: import dxf, ... (This is just pseudocode; the exact code varies from one programming language / library to another, of course.)

All that while what we actually want is something like this:

# Wishful API: none of these functions exist yet.
street = get_street(street_name="My Street Name")
near_trees = street.get_near_objects(max_distance=100, object_type=Tree)
city = street.get_city()
connected_streets = street.get_connected_streets()

I really like this! And I do wonder if something like this is feasible.
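To make the wish a bit more concrete, here is a minimal sketch of how such an interface could behave on in-memory data. Everything in it is an assumption for illustration: the Street and Tree classes, the use of plain (x, y) coordinates in metres, and the fact that the data is passed in explicitly instead of being loaded from files. Distance is measured to the street's vertices only, which is a simplification of real line geometry.

```python
import math
from dataclasses import dataclass


@dataclass
class Tree:
    x: float
    y: float


@dataclass
class Street:
    name: str
    points: list  # (x, y) tuples, assumed to be in a projected CRS in metres
    city: str = ""

    def get_near_objects(self, objects, max_distance, object_type):
        """Objects of `object_type` within `max_distance` of any street vertex."""
        return [
            obj
            for obj in objects
            if isinstance(obj, object_type)
            and min(math.dist((obj.x, obj.y), p) for p in self.points) <= max_distance
        ]

    def get_connected_streets(self, streets):
        """Other streets that share at least one vertex with this one."""
        own = set(self.points)
        return [s for s in streets if s is not self and own & set(s.points)]


def get_street(streets, street_name):
    """Look up a street by its exact name."""
    for street in streets:
        if street.name == street_name:
            return street
    raise KeyError(street_name)
```

In a real tool, the lists of streets and trees would come from the SHP, DXF and CSV files, and the user would never see how they were read. A proper geometry library would also replace the vertex-based distance check.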

I want to see if it is. What I want to try is: Is it possible to generate code that reads from the files and wraps these reads in functions or classes that hide the exact details from the user of the code, exposing only the relevant semantic access code?
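A very small sketch of the generation step could look like the following. It assumes that the column names of a dataset are already known (for example from a CSV header) and produces Python source code for a class with one property per column. The names generate_class and to_identifier and the shape of the template are invented for this example; duplicate column names are not handled.

```python
import keyword

TEMPLATE = '''\
class {class_name}:
    """Generated accessor for rows of {source!r}."""

    def __init__(self, row):
        self._row = row
{properties}
'''

PROPERTY = '''
    @property
    def {attr}(self):
        return self._row[{column!r}]
'''


def to_identifier(column):
    """Turn a raw column name into a usable Python attribute name."""
    name = "".join(ch if ch.isalnum() else "_" for ch in column.strip().lower())
    if not name or name[0].isdigit() or keyword.iskeyword(name):
        name = "f_" + name
    return name


def generate_class(class_name, source, columns):
    """Return Python source for a class exposing `columns` as properties."""
    properties = "".join(
        PROPERTY.format(attr=to_identifier(col), column=col) for col in columns
    )
    return TEMPLATE.format(class_name=class_name, source=source, properties=properties)
```

For example, generate_class("Tree", "trees.csv", ["ID", "Species"]) returns the source of a Tree class with id and species properties, which could be written to a .py file and imported by the user. Generating the semantic methods (nearby objects, connected streets) is the open question and is not attempted here.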

To some extent, this also reminds me of scaffolding, i.e. the process of generating boilerplate code automatically. One widely used tool that does this is openapi-generator, which can generate, among other things, a client for a RESTful backend API, given just the OpenAPI specification file of that backend.

I think it is possible that something similar can be created for GIS data.