[Image of a split circle appears with photos in each half of the circle flashing through of various CSIRO activities and the circle then morphs into the CSIRO logo]
[Image changes to show a new slide showing a line graph of real-time data and Peter Baker can be seen talking inset in the top right and text appears: Good, bad and ugly real-time data – how to tell the difference?, Vacation Scholar – Peter Baker]
Peter Baker: Yeah, so I’ve been on Digiscape GBR which is the Great Barrier Reef sub-project of Digiscape. In particular, I’ve been working on the 1622WQ App which is a water quality management application which focuses on posting real-time data streams online on an innovative digital platform. So I’ve been doing this work as a vacation scholar for the last two or so months.
[Image changes to show a new slide showing a photo of the coast of Queensland and text appears on the right: 1622WQ Project Objectives, Present real-time data on innovative digital platform, Provide objective feedback on water quality to – Farmers, Scientists, Government Agencies, Provide a platform that is robust and scalable]
So the project objectives of 1622WQ, so we’re all on the same page. Primarily, we want to produce an online platform, which is actually currently viewable, that hosts data in an interactive way. It’s sort of geolocated data. So on the right here you can see a kind of snapshot of a part of the website, and as you zoom in you’d see sensors that are located on the map at a particular location, and primarily those sensors are either rainfall sensors that are pulled in from various sources, or the one I’ve been focussing on, which is nitrate data.
So what we did was we put a physical sensor inside a water body, and then we read out real-time nitrate sensor data. The point of this is that we want to give an objective feedback cycle for farmers primarily, who are potentially fertilising soils near very sensitive ecosystems, namely the Great Barrier Reef. Excessive fertilisation can cause run-off into the waterways, which can lead to coral bleaching and other effects due to excessive nitrate in the water. And my part of that is mostly the data, but that plays into a larger picture of creating a robust and scalable system.
[Image changes to show a new slide showing symbols of a hand holding a leaf, an arrow trending upwards and across on a computer screen, and a handshake, and text appears on the slide: Why do we care about data?, Data integrity is critical, Environmental impact, Business impact, Trust impact, Water quality feedback from sensor data, Balancing act of risk and responsibility, Relationships built on honesty and integrity]
So why do we care about data integrity on this platform? There are sort of three big factors here. One’s environmental impact, which kind of explains itself: these farmers are making decisions which can have direct environmental impacts, so if the data we provide is misleading or inaccurate or unreliable, that has some concerning implications for our sort of second-hand environmental impact. So, we want to do our best to make sure that the data’s accurate.
There’s a business impact. Sometimes farmers have to make decisions about how they balance risks and responsibilities. So they will feel, “I don’t want to damage the waterway. I love the environment, I love where I live, and I respect it”, but you know, on the same basis, they actually want to make money and they want to not take too much risk with crops. So, fertilisation is a big decision and again, we need to act responsibly. And we need trust. So this is a platform where we’re displaying data; if that data seems unreliable or untrustworthy then the whole point of this platform falls apart, so data plays a central role in this.
[Image changes to show a new slide showing three different line graphs showing nitrate concentrations, and text appears: But what’s wrong with the data?, Often noisy, Sometimes misleading, Occasionally unavailable]
OK, so what’s wrong with all the data? There’s a few issues, some of them larger than others. On the left, we can see the data is routinely extremely noisy. The data is occasionally misleading. So, on the right side you can see, well, it’s one thing having data, that’s fantastic. Often, we don’t have data. What do we do when it’s not there?
[Image changes to show a new slide showing two Isolation Forest Anomaly Detection diagrams, and text appears: Isolation Forest Anomaly Detection (Liu), Outliers are less collocated and therefore take fewer cuts to isolate, Points become window properties, How do we make a static outlier detector work for moving data?, Rolling windows + statistical aggregates which describe the window – Mean, Std dev, Variance, Range etc.]
So the existing system uses a pretty basic approach, and my job was, OK, you’ve just arrived here, now try and improve things. So, I went and deep-dived into some anomaly detection. That’s one approach to filtering data: tag the data as being anomalous or unusual and then make a decision on that basis. So one particular algorithm that people have developed, quite a novel approach, is called Isolation Forest Anomaly Detection, and the paper is linked at the top right there, if you’re interested. The idea here is that if you have this sort of cluster, we’re visualising it on an x and y cartesian plane, we make cuts through that plane and we keep doing that until every single point is by itself in its own little square.
Now, how do we tell if something is an outlier? If it takes very few cuts to isolate a point, then there’s a good chance that it’s an outlier, because it’s sitting by itself; it’s not as co-located as other points. OK, so fantastic, what we’ve done is develop an outlier algorithm, but how does that relate to real-time data, which is sequential? Well, one approach that kind of brings it into the static space is to look at your data as windows instead of individual points.
So when you bring it into a window you have to ask, “What’s interesting about this window of data, this sort of collection of data?”. One thing we can do is compute statistical aggregates, so the mean, standard deviation, variance, those kinds of factors. Then we feed those into our outlier detection algorithm, and it says to us, “OK, this particular range, this particular time period seems anomalous”.
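The windowed approach described here can be sketched roughly as follows. This is a minimal illustration using scikit-learn’s IsolationForest, not the 1622WQ code; the window size, feature set, and synthetic data are all assumptions for the sake of the example.

```python
# Sketch: turn a 1-D sensor series into rolling-window feature vectors,
# then score each window with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

def window_features(series, size):
    """Describe each rolling window by statistical aggregates."""
    feats = []
    for i in range(len(series) - size + 1):
        w = series[i:i + size]
        feats.append([w.mean(), w.std(), w.var(), w.max() - w.min()])
    return np.array(feats)

rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.05, 500)   # synthetic baseline "nitrate" signal
data[300:310] += 2.0                # injected anomalous event

X = window_features(data, size=12)
model = IsolationForest(random_state=0).fit(X)
scores = model.decision_function(X)  # lower score = more anomalous
worst = int(scores.argmin())         # index of the most anomalous window
```

The most anomalous window lands near the injected event, which matches the behaviour described: the detector flags that *something* happened in that time period, without saying what.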
[Image changes to show a new slide showing two line graphs showing two “Anomaly spikes” and then two “Real event? Unsure” and text appears: Did it work? Kind of…. Was good at detecting “events” in general, Couldn’t distinguish between real events and false events, Needs more contextual data – isn’t always available and is often poorly correlated]
Did it work? Well, it was interesting, but I don’t know how useful it was. That’s still being investigated. We have some graphs on the bottom left there: the bottom one shows the actual data that we got from our sensor, and the orange plot above it is what our anomaly detection algorithm thinks is going on. So, a score of about 0.5 or below is kind of normal behaviour, and as it goes above that, the algorithm thinks there’s an event going on or something unusual has happened.
So, we can see that it has picked up something unusual in this bottom left chart that says, OK, there’s something going on here, there’s something wrong with this data. However, whether that’s a real event, as in something’s happened in the waterway and there’s been a flood, or there’s been a bunch of rain, or we’re getting dry reads on the sensor, it’s impossible to distinguish from this algorithm alone the differences between those sort of things.
[Image changes to show a new slide showing symbols of a 3-D octagonal shape, an eye inside a head, and a book inside a head, and text appears: Key points, Complexity can be a curse, More to go wrong, More computation time (money), Less explainable and transparent, Configuration requires expertise, Anomaly detection is useful but isn’t ‘enough’, Real events look like fake ones, Fake events look like real ones, Is it enough to just raise a flag? What next?, More intelligence requires more information – More data, more labelling, More context – more sensors, Consistency]
So, on the outcomes of this anomaly detection approach, we also tried a few other methods and had similar results. A few key points: complexity sometimes is a curse. We’re doing everything in the Cloud, and Cloud computation time can be expensive. It also makes the system hard to maintain, and this sort of expert analysis requires configuration all the time as well, which is one of the big things we’re trying to improve. Often, this anomaly detection, it’s great, it’s useful, it’s interesting, but it’s not enough to filter data. We want to know, can we delete that data or do we keep that data? Or are there some modifications we need to make? And that all feeds back into our trust pathway.
So in this case, the real events look like fake ones on paper, and the so-called fake events look like real ones. Even sitting down and trying to work that out by hand is difficult, and that’s with sort of expert supervision, so asking an algorithm to tell the difference can be a bit much. And if we want more intelligence, we need more information. So, we need more data, and we need that data to be labelled. Often we don’t have tools to do that, otherwise we’d just use the tool, so it has to be done by hand.
[Image changes to show a new slide showing three different hand drawn diagrams demonstrating the Data Window, the Normal Distribution, and the Outlier, and Peter can be seen talking inset in the top right and text appears: Point noise filter, Filter eliminates point noise outliers, Uses rolling window, Based on normal distribution]
This is the current pipeline that we’ve been playing around with. There are some new tools, so I’ll talk you through those new approaches. As part of one of the papers previously mentioned, there’s the idea of this point noise filter. What we do there is again fall back to this window idea, so we’re rolling through our data in windows. We look at the data within that window, we fit a kind of normal distribution to it, and then we can do some statistics and that gives us an outlier boundary that moves along with your time-series. So, the idea of this is that we’re cleaning up that data.
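A minimal sketch of such a rolling-window point noise filter: here the boundary is assumed to be mean ± k standard deviations of each point’s window, which is one simple way to realise the normal-distribution idea; the project’s exact boundary isn’t stated in the talk.

```python
# Point noise filter sketch: flag a reading as noise when it sits far
# outside the normal-distribution boundary of its surrounding window.
import numpy as np

def point_noise_filter(series, size=11, k=3.0):
    out = series.astype(float).copy()
    half = size // 2
    for i in range(len(series)):
        w = series[max(0, i - half):i + half + 1]
        mu, sd = w.mean(), w.std()
        if sd > 0 and abs(series[i] - mu) > k * sd:
            out[i] = np.nan  # flag the outlier; interpolation fills it later
    return out

signal = np.ones(50)
signal[25] = 10.0                 # a single spurious spike
cleaned = point_noise_filter(signal)
```

Because the window rolls along with the series, the outlier boundary adapts to local behaviour rather than being a single global threshold.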
[Image changes to show a new slide showing two hand drawn line graphs and an arrow links the first graph to the second via a smoothing parameter type formula and a text heading appears: Smoothing]
I’m sure you’re all familiar with the kind of classical smoothing approaches. We’re using an exponentially weighted moving average in this case, which takes a scaled version of the original value at that time and also a scaled version of the previous smoothed value. The result of that is you get a reduction in noise and an improvement in the readability of your trends.
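That recurrence, s(t) = α·x(t) + (1 − α)·s(t−1), is only a few lines of code. The α value here is an arbitrary illustrative choice, not the one used on the platform.

```python
# Exponentially weighted moving average: each smoothed value mixes the
# current reading with the previous smoothed value.
def ewma(values, alpha=0.3):
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

noisy = [1.0, 1.2, 0.9, 1.1, 5.0, 1.0, 1.1]
out = ewma(noisy)  # the spike at 5.0 is damped but the trend survives
```

A larger α tracks the raw signal more closely (less smoothing, sharper peaks); a smaller α smooths harder, which is exactly the peak-amplitude trade-off mentioned later in the talk.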
[Image changes to show a new slide showing six line graphs showing Raw data, Interpretation for gap size 6 iteration, Quadratic, Cubic, Smoothed Quadratic, and Smoothed Cubic, and text heading and text appears: Interpolation example (Real data, Original Data), Old Approach, New Approach]
We also improved the interpolation approach used. Beforehand we had a plain old straight line, and through some experimentation we found that cubic spline interpolation, with some sort of smoothing on the back end, ended up working the best.
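For illustration, cubic spline gap-filling versus a straight line might look like this, using SciPy; the timestamps and values are made up, and this is just a sketch of the technique, not the project’s implementation.

```python
# Fill a gap in a time series with a cubic spline rather than a straight line.
import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 1.0, 5.0, 6.0, 7.0])  # sample times; readings at 2-4 missing
y = np.array([1.0, 1.4, 0.9, 1.1, 1.3])  # nitrate-like values (illustrative)

linear = np.interp([2.0, 3.0, 4.0], t, y)  # old approach: straight line
spline = CubicSpline(t, y)
cubic = spline([2.0, 3.0, 4.0])            # new approach: cubic spline

# The spline passes exactly through the known points and gives a smooth
# curve through the gap; smoothing can then be applied on top.
```

The spline respects the curvature on either side of the gap, where a straight line flattens it out, which matches the "Smoothed Cubic" panels described on the slide.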
[Image changes to show a new slide showing four line graphs showing Raw data, Interpolated data, Post noise removal, Smoothed results, and text heading and text appears: Full pipeline example, Original Data, Old Approach, Noise removed, Smoothed output, Trend clearer – no important data lost]
So here’s an example of the full data pipeline. In the top left, we have a sample I’ve chosen of original data, so that’s nitrate concentration data. You can see that the original filter system, which is in the top right, that’s where it would’ve ended up. We’ve filled in some gaps but that’s about it; our filters just weren’t sophisticated enough to reliably remove these points. In the bottom left, you can see this is after the point noise filter has gone over it and taken out a lot of those peaky values which were clearly inaccurate. And then in the bottom right, that’s the outcome of the noise filtering plus the smoothing and also the improved interpolation. So you can see that we can still keep that trend information and we don’t lose any important data.
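Chained together, the pipeline stages described above look roughly like this; each stage is a simplified stand-in for the real one, and the window size, threshold and smoothing factor are illustrative values.

```python
# Full pipeline sketch: fill gaps, drop point noise, re-fill, then smooth.
import numpy as np
import pandas as pd

def pipeline(series, window=11, k=2.0, alpha=0.3):
    s = pd.Series(series, dtype=float).interpolate(method="linear")  # fill gaps
    mu = s.rolling(window, center=True, min_periods=1).mean()
    sd = s.rolling(window, center=True, min_periods=1).std()
    s = s.mask((s - mu).abs() > k * sd)   # point noise -> NaN (NaN sd keeps point)
    s = s.interpolate(method="linear")    # re-fill the removed points
    return s.ewm(alpha=alpha).mean()      # exponential smoothing

raw = [1.0, 1.1, np.nan, 1.2, 9.0, 1.1, 1.0, 1.2, 1.1]
out = pipeline(raw)  # spike removed, gap filled, trend retained
```

The order matters: noise removal happens before smoothing, so a spurious spike is deleted outright rather than being smeared across its neighbours by the moving average.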
[Image changes to show a new slide showing two line graphs showing the New ~1.5mgN/L peak, and then the Old ~1.9mgN/L peak, and text appears on the right: Significant… readable data, Peak amplitude reduction due to smoothing (can be lessened or increased), Real event information retained]
So, sometimes this doesn’t work as well as we would like. You can see in the top that, fantastic, we have some awesome trends. It’s less noisy, and it’s more immediately apparent what’s happened. That’s actually a real event: around New Year’s, there were some very major rain events, if you were following Queensland rain, sort of 250mm or more in some of these catchments in a day. And the outcome of that is some major flushing of nitrate into the waterways, so you have these huge peaks. Alright, so you can see we get a peak reduction in the output there.
[Image changes to show a new slide and Peter can be seen inset talking in the top right, and text appears: Outcomes, No per stream parameter adjustment besides hardware limited min/max thresholds, Point noise is handled more robustly and data context (surrounding values) are taken into account, Interpolation is more accurate, Smoothing increases trend readability]
OK, so the outcomes, well, we’ve removed some of these parameters, so we don’t have to set as many parameters anymore. It’s kind of automatic. We’re improving point noise handling. Interpolation is better and our smoothing increases our readability at a slight cost of losing some accuracy on the peaks. So everything is a trade-off but we’re doing our best.
[Image changes to show a white screen and the CSIRO logo can be seen and text appears: CSIRO, Australia’s National Science Agency]