A quick update: I have been "out of the loop" because I was working on my dissertation and getting ready for my defense. I successfully defended my dissertation on 2 November 2017! I will be receiving my doctorate officially on 15 December 2017. It's been quite a ride, and I have been blessed by the help and advice of so many friends and family, as well as my advisor and supervisory committee members.
In the meantime, I have also been working on some other research, such as the bonefish and horseshoe crab modeling. Parts of my dissertation are also 'in the works' as publications. I will share all of this work as soon as it is ready to share, or becomes officially available. I will spend the next few weeks updating my websites and adding information on the things I'm working on. Stay tuned!
For a current research project, I am developing an occupancy model for horseshoe crabs in Florida. We are working with data from FWC, which is excellent long-term stratified random sampling data, but since it isn't specifically targeted at horseshoe crabs (it is for fish inventories), we are dealing with highly zero-inflated data – lots of non-detection. Hence, we can't in good conscience treat the data in its current form as a measure of abundance of the crabs. Cue occupancy modeling: this type of model considers two processes: occupancy and detectability. Occupancy refers to the presence or absence of a species during the sampling period / season, and detectability refers to whether or not a species is detected. Thus, there is a possibility that a species is present but not detected! This method also allows you to incorporate variables that could potentially impact either occupancy or detectability.
I am really excited about this approach! I first heard about it at a 2016 seminar talk by Dr Bliznyuk (UF, see here for his arXiv paper), and have been itching to apply it ever since! And now I finally have a chance.
However, I am new to this type of modeling, and also have not done much binomial modeling in the past. As I was going through a book, the R code and its manuals, making little examples to clarify formulas, I realized this might also benefit others – which is the reason I am blogging about this. I will post several installments as I progress from simple to more complex models. I will make R code available too.
First and foremost: all credit for theoretical aspects goes to the following publications:
- MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.
- MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L. L., & Hines, J. E. (2005). Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Elsevier.
Especially the latter, the book, has been incredibly helpful.
For coding, I am using the package "unmarked" in R: Fiske, I., & Chandler, R. (2011). unmarked: An R Package for Fitting Hierarchical Models of Wildlife Occurrence and Abundance. Journal of Statistical Software, 43(10), 1–23. http://doi.org/10.18637/jss.v043.i10
Let’s get started!
First a refresher: the binomial process describes a series of independent trials that each end in either success (1) or failure (0). The binomial distribution formula is

b(x; n, p) = (n choose x) · p^x · (1-p)^(n-x)

with n = number of trials, x = number of successes, p = probability of success and b = binomial probability. The first part, the binomial coefficient (n choose x), expresses how many ways x successes can be distributed over n trials. We then multiply by the probability of the x successes, p^x, and the probability of the n-x failures, (1-p)^(n-x). The expected mean of the binomial distribution is np and the variance is np(1-p).
EXAMPLE: So if for 29 trials (n) you know the probability of success is 0.5, this formula tells you that the probability of getting exactly 13 successes (x) is 0.13. There is a 0.14 probability of getting exactly 14 successes, and only a 0.02 probability of getting 20 successes. These values are points on the probability mass function, not areas under a curve. So if you know n and p, you can calculate b for any outcome x.
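The numbers in this example can be checked in R with the built-in binomial density function (the specific n and p here are just the values from the example above):

```r
# Binomial probabilities for the example: n = 29 trials, p = 0.5
n <- 29
p <- 0.5

dbinom(13, size = n, prob = p)  # probability of exactly 13 successes, ~0.13
dbinom(14, size = n, prob = p)  # probability of exactly 14 successes, ~0.14
dbinom(20, size = n, prob = p)  # probability of exactly 20 successes, ~0.02
```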
For occupancy modeling, we use the following terminology:
- The 'area' is the larger spatial area for which we want to draw an inference;
- It is divided into sampling units (S) – which could be grid cells or discrete units (like a pond);
- We select s units – sites – as random samples: the assumption is that S is very large in comparison to the sites s, and we are trying to say something about occupancy of S (not s);
- Surveys are done repeatedly at each site, K times, to record presence (1) or absence (0) of a species, creating a detection history (hi) for each site.
There are a few assumptions we need to take into account:
1) We assume occupancy for the period of analysis to be 'closed', i.e. it does not change: a species occupies a site or it does not. In practice this often means you have to develop a model for a 'season', which can differ in length for different species.
2) The probability of occupancy is the same for all sites;
3) The probability of detection is the same for all sites;
4) Detections in each survey are independent; and
5) Detection histories are independent.
Note: some of these assumptions can be relaxed with more complex models, but for now we are starting with simple situations!
The starting point for our first simple occupancy model is that we assume there is a common probability of sites being occupied by a species, e.g. ψ = 0.7. With the binomial formula (and the species detected perfectly), this would result in about 70 occupied sites (x) out of 100 sites (s), for instance. Note that this is the same formula as in the refresher earlier, just with different symbols (p = ψ and n = s).
But then we realize: detection is not perfect! If the probability of detecting a species at an occupied site in a single survey is p, then after k surveys the probability of detecting it at least once is p* = 1 - (1-p)^k. The second part, (1-p)^k, is the probability that the species is not detected at all in k surveys.
Examples:
- High p and high k: p* = 1 - (1-0.7)^30 ≈ 1
- High p and low k: p* = 1 - (1-0.7)^3 = 0.973
- Low p and high k: p* = 1 - (1-0.3)^30 ≈ 1
- Low p and low k: p* = 1 - (1-0.3)^3 ≈ 0.66 (i.e. there is a probability of 1 - 0.66 = 0.34 that the species goes undetected in all surveys)
This illustrates that with low probabilities of detection, you need more trials to detect at least once (obviously).
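The examples above are a one-liner in R (the p and k values are simply the ones used above):

```r
# Probability of detecting an occupied species at least once in k surveys
p_star <- function(p, k) 1 - (1 - p)^k

p_star(0.7, 30)  # high p, high k: ~1
p_star(0.7, 3)   # high p, low k:  0.973
p_star(0.3, 30)  # low p, high k:  ~1
p_star(0.3, 3)   # low p, low k:   0.657 (~0.66)
```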
Thus the probability of the species being present and being detected at least once becomes ψ·p*, and thus the estimator for the proportion of sites occupied (if we know p) is ψ-hat = (sD / s) / p*, with sD = the number of sites where the species is detected. As above, p* = 1 - (1-p)^k.
Example: With 30 sites (s) and detections at 10 of them (sD): if p = 0.7 and k = 30, p* ≈ 1 (see above), so ψ-hat = (10/30)/1 ≈ 0.33 should be close to the real proportion occupied. In this case, even if p is low, a high number of surveys k makes up for it, so the estimate is also close to the 'real' probability. But let's say k = 3 and p = 0.3 (so a low probability of detection, which makes p* = 0.66, as above): the same 10 detections now give ψ-hat = (10/30)/0.66 ≈ 0.51 (instead of 0.33).
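This estimator is easy to sketch in R (the numbers are the ones from the example above):

```r
# Estimator of the proportion of sites occupied, correcting for imperfect
# detection: psi-hat = (sD / s) / p*, with p* = 1 - (1 - p)^k
psi_hat <- function(sD, s, p, k) {
  p_star <- 1 - (1 - p)^k
  (sD / s) / p_star
}

psi_hat(10, 30, 0.7, 30)  # p* ~ 1, so estimate ~ raw proportion 0.333
psi_hat(10, 30, 0.3, 3)   # low p* inflates the raw 0.333 to ~0.51
```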
Generally though, we do not know what the detection probability is - so occupancy modeling is based on estimating both the detection probability (p) and the occupancy probability (ψ).
So, what are we calculating? 1) For a site where the species is detected at least once, the probability of a particular detection history, e.g. hi = 0101 (absence-presence-absence-presence), is Pr(hi = 0101) = ψ · (1-p1) · p2 · (1-p3) · p4, with ψ being the probability that the species occupies the site and pj the probability of detecting the species during survey j: to repeat, pj is the probability of success and 1-pj is the probability of failure. 2) It gets a little bit more complex if the species is not detected at all, since this does not necessarily imply absence. We add the possibility of non-detection to absence: Pr(hi = 0000) = ψ · ∏j (1-pj) + (1-ψ).
Here we add the probability that the species is present but undetected (the probability of occupancy multiplied by the product of all non-detection probabilities) to the probability of non-occupancy.
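A quick numeric check of the two cases (the ψ and p values here are arbitrary illustration values):

```r
# Probability of two example detection histories
psi <- 0.7  # occupancy probability (illustration value)
p   <- 0.6  # detection probability per survey (illustration value)

# Case 1: history 0101 - at least one detection, so the site must be occupied
pr_0101 <- psi * (1 - p) * p * (1 - p) * p

# Case 2: history 0000 - either occupied but never detected, or truly absent
pr_0000 <- psi * (1 - p)^4 + (1 - psi)

pr_0101  # 0.04032
pr_0000  # 0.31792
```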
Then, if these detection histories are constructed for all sites, we assume they're independent and use the model likelihood for the observed data: the likelihood of occupancy and detection given the available data is the product of all site probabilities: L(ψ, p | h1, ..., hs) = ∏ i=1..s Pr(hi).
In words this is 'the likelihood that we observe these detection histories, given occupancy probability ψ and detection probability p'.
In a more extensive form, using the formulas defined earlier, this becomes

L(ψ, p | h1, ..., hs) = [ψ^sD · ∏ j=1..K pj^sj · (1-pj)^(sD - sj)] × [ψ · ∏ j=1..K (1-pj) + (1-ψ)]^(s - sD)
Essentially, the second part represents the sites where there was no detection, so where we need to take absence and non-detection into account. This part is raised to the number of sites without detection. The first part is for where there is detection: sj is the number of sites where the species was detected at the jth survey. It again includes a calculation for success (presence) and for failure (absence). We first take the occupancy probability and raise it to the power of the number of sites where there was detection. We then multiply this with the product of the detection probabilities based on K surveys. This started to look like alphabet soup to me too, so I made an example with known probabilities.
Example: We make a detection survey with 4 sites (s, rows), 5 surveys (j = 1 to K, columns), ψ = 0.7 and p = 0.6:

Site 1: 1 1 1 0 0
Site 2: 0 0 0 0 0
Site 3: 1 0 1 0 1
Site 4: 1 1 0 0 0

sD = 3 (the number of sites where the species was detected at least once). We start with the survey-by-survey detection parts, pj^sj · (1-pj)^(sD-sj), counting only the sD sites with detections:
survey j=1: 0.6^3 = 0.216
survey j=2: 0.6^2 × 0.4 = 0.144
survey j=3: 0.6^2 × 0.4 = 0.144
survey j=4: 0.4^3 = 0.064 (this one is easy because there are no detections at all in this survey, only 'failure' terms)
survey j=5: 0.6 × 0.4^2 = 0.096
The occupancy part is ψ^sD = 0.7^3 = 0.343, and for site 2 – the one site with no detections at all – we have ψ·(1-p)^5 + (1-ψ) = 0.7 × 0.01024 + 0.3 = 0.307, raised to s - sD = 1. Putting it all together gives 0.343 × 0.216 × 0.144 × 0.144 × 0.064 × 0.096 × 0.307 ≈ 2.9 × 10^-6, for which the log-likelihood is about -12.75.
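The same number can be obtained site by site instead of survey by survey – the two factorizations give the same product. A minimal sketch in base R, using the detection matrix and the known ψ and p from the example:

```r
# Likelihood of the 4-site, 5-survey example with known psi = 0.7, p = 0.6
psi <- 0.7
p   <- 0.6

h <- rbind(c(1, 1, 1, 0, 0),   # site 1
           c(0, 0, 0, 0, 0),   # site 2 (never detected)
           c(1, 0, 1, 0, 1),   # site 3
           c(1, 1, 0, 0, 0))   # site 4

# Per-site probability of the observed detection history
site_lik <- apply(h, 1, function(hi) {
  if (any(hi == 1)) {
    psi * prod(p^hi * (1 - p)^(1 - hi))       # occupied and detected
  } else {
    psi * prod((1 - p)^(1 - hi)) + (1 - psi)  # occupied-but-missed, or absent
  }
})

L <- prod(site_lik)  # model likelihood, ~2.9e-6
log(L)               # log-likelihood, ~-12.75
```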
Since this has become a longer post than expected, I will leave more detail on log-likelihood and Maximum Likelihood Estimation (MLE) for the next post. I hope to also include a bit of R code in the next post – though admittedly I have found implementation a little bit challenging so far.
If you want to do some more reading on occupancy models, the USGS created a nice straightforward (short) document: https://fresc.usgs.gov/products/fs/fs2005-3096.pdf
For scientific publications, you need to submit high-quality images: high resolution, and preferably in vector format. I like to make most of my figures in R (RStudio) because of the amazing options that packages such as ggplot2, grid and cowplot offer. Unfortunately, standard outputs are not of great quality: 72 ppi, whereas most journals seem to prefer 300 ppi (PLOS ONE, Scientific Reports) or even 600 ppi (ESA). Fortunately, when using ggplot2, there are straightforward ways to achieve this, as well as output formats such as EPS (Encapsulated PostScript vector graphics).
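As a minimal sketch of the kind of fix involved (the filenames, sizes and dataset here are just illustrations), ggplot2's ggsave() lets you choose both the output device and the resolution:

```r
library(ggplot2)

# Any ggplot figure; mtcars is just a built-in example dataset
fig <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Vector output (EPS): resolution-independent, preferred by many journals
ggsave("figure1.eps", fig, width = 6, height = 4, device = "eps")

# High-resolution raster output at 300 ppi
ggsave("figure1.tiff", fig, width = 6, height = 4, dpi = 300)
```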
I didn't come up with this solution myself, but am grateful for finding it and would like to disseminate it wider - I know there are more of you struggling with the same problem! Dr. Hocking (Frostburg State University) published the solution on his website: click here.

I am working on a project that evaluates bonefish population dynamics in Florida Bay - super interesting, I'll try and post more about that some other time. Right now, I am reading up on some environmental variables that could be drivers for the dynamics we are witnessing: things like salinity, temperature, rainfall, algae, climatic conditions - and how a lot of these are associated with each other. One interesting component of the Florida Bay system is seagrass. It grows on the flats in the Bay and is an important habitat, especially for juvenile fish. It is of course also sensitive to the variables mentioned earlier, and seagrass die-off is a serious problem in the Bay.
I just found out that seagrass die-off is a problem around the world, and also that seagrass is far more amazing than I thought. So that's the reason for this post: to share an excellent New York Times article: Disappearing Seagrass Protects Against Pathogens, Even Climate Change, Scientists Find. Also, if you're in Gainesville (FL) and you want to learn more about seagrass, Savanna Barry (Nature Coast Biological Station) will be giving a talk on 1 March 2017, "Seagrasses of the Florida Big Bend: Ecology, resilience, and threats". This is part of the Water, Wetlands and Watersheds Seminar series by the UF Howard T. Odum Center for Wetlands (click here for details).
About me: I currently work as a Research Scientist at the University of Florida. I try to blog about interesting random science stuff. And because I do a lot of coding (in R), I will share coding tips I found useful.