statistics
nonparametric inference in "icky" spaces
the bulk of statistical theory and practice developed in the last century deals with Euclidean (e.g., vector-valued) data with "small p, large n"; that is, lots of simple data. the current century seems to be filled with a whole new class of data. some are "large p, small n", that is, still in Euclidean space, but high-dimensional. other data are even more complex, living in function space or graph space, for example. we develop theory and algorithms for performing inference in these more interesting (to us) data scenarios. one might think of this as "statistical manifold learning", if one were prone to making up new names for stuff.
graph statistics
a particularly interesting class of data for us is graph-valued data. graphs are fun because they are combinatorial objects, so the dimensionality of the space, while finite, explodes super-exponentially. the numbers are just silly: with only 10 vertices, there are already 2^45 (about 35 trillion) distinct labeled, undirected binary graphs, and with only 24 vertices there are more such graphs than the roughly 10^80 particles in the observable universe. we ask questions about inference in graphs or collections of graphs, such as: can we detect anomalous vertices? can we classify collections of graphs? can we label vertices? can we match graphs? we care about answering all these questions from a probabilistic perspective, and we like our strategies to scale up to very large collections of very large graphs (e.g., billions of graphs, each with billions of vertices). sometimes we endow our graphs with directed edges, multi-edges, attributed edges, etc.
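the counting is easy to check directly. the sketch below assumes labeled, undirected graphs with binary edges and no self-loops, so each of the n-choose-2 vertex pairs is either an edge or not. the function names and the 10^80 figure (a commonly quoted order of magnitude) are our own choices for illustration.

```python
from math import comb


def num_labeled_graphs(n: int) -> int:
    """number of labeled, undirected, binary graphs (no self-loops) on n vertices."""
    # each of the comb(n, 2) vertex pairs independently has an edge or not
    return 2 ** comb(n, 2)


def smallest_n_exceeding(bound: int) -> int:
    """smallest vertex count whose graph count is strictly larger than bound."""
    n = 1
    while num_labeled_graphs(n) <= bound:
        n += 1
    return n


# assumption: ~10^80 particles in the observable universe (rough order of magnitude)
PARTICLES_IN_UNIVERSE = 10 ** 80

print(num_labeled_graphs(10))                       # 35184372088832
print(smallest_n_exceeding(PARTICLES_IN_UNIVERSE))  # 24
```

allowing directed edges or self-loops only makes the count grow faster, since there are then more possible edges per vertex pair.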
graph inference
whether the raw data are time-series, repeated observations, or volumetric images, we often want to infer graphs underlying the measurements. to this end, we develop tools to perform this inference, including from calcium, diffusion magnetic resonance, and electron microscopy images of brains.
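as a toy illustration of what "inferring a graph from measurements" can mean, here is a minimal sketch that connects two nodes whenever their time-series are strongly correlated. this is a deliberately simple stand-in, not the method we actually use; the function names, the threshold of 0.8, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def infer_graph(series, threshold=0.8):
    """series maps node name -> list of measurements; returns a set of undirected edges."""
    nodes = sorted(series)
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            # assumption: |correlation| above the threshold counts as an edge
            if abs(pearson(series[u], series[v])) >= threshold:
                edges.add((u, v))
    return edges


# made-up toy measurements, for illustration only
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # closely tracks "a"
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly correlated with "a" and "b"
}

print(infer_graph(toy))  # {('a', 'b')}
```

real pipelines have to deal with noise, confounds, and scale, which is where the statistics comes in; thresholding a correlation matrix is just the simplest thing one could try.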
applications
neuroimaging analysis tools (aka, MR connectomes)
imaging the brain, whether using MRI, calcium, or some other modality, is becoming increasingly popular as camera speeds and resolution continue to increase. these data often live in "icky" spaces (e.g., graphs). therefore, we can utilize all the theory and algorithms that we develop to help answer questions about the brain, such as: how are brains wired up? how does brain wiring change with various psychiatric conditions? or how does the brain change after learning something new?
computer vision for electron microscopy (aka, EM connectomes)
high-throughput electron microscopy images of brains yield both beautiful data and a completely different set of inference tasks. in particular, we'd love to be able to infer a graph where vertices are neurons and edges are synapses. because the data are so large (e.g., 10TB) and noisy, this requires the development of novel computer vision strategies that are robust to noise and scale extremely well.
genomes
we've all got 'em, and they seem to determine some fraction of our experience in the world, and some of the diseases we get. but how much? we use our high-dimensional data analysis tools to address a number of pressing genetics questions, such as: what are the gene networks responsible for various forms of cancer?
shalomes
in light of the recent investigation of various 'omes, we coin the word "shalomes" to mean the complete "ome" of a person, including their connectome, genome, and mentalome. our interests here lie in parsing the relative contributions of genetics and learning (aka, nature and nurture). answering these questions relies upon our ability to deal with multiple disparate massive data sets, including genomes, connectomes, and mentalomes.
misc
statistical philosophy of science
historically, philosophical questions are addressed by intellectual arguments, not backed by data. but many philosophical questions can be mapped into statistical questions. one special case of this is mind-brain supervenience: can a pair of mental-states differ without their corresponding pair of brain-states differing? we are interested in this question, as well as related questions about causal inference.
parallel programming and massive scientific databases
our motivating datasets are large or massive ("large" data can fit in disk-space on a workstation, "massive" data cannot). thus, to perform inference in reasonable time, we take two complementary strategies. first, our code runs in parallel on multi-processor machines. sometimes this follows from embarrassingly parallel code (e.g., looping over different data sets). other times we develop/utilize parallel algorithms (e.g., alternating direction method of multipliers). second, we run our algorithms on scientific databases (e.g., SciDB) designed specifically for our queries.
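the embarrassingly parallel case is simple enough to sketch with python's standard library. this is only an illustration, not our actual code: `analyze` is a hypothetical stand-in for a real per-data-set analysis, and the data sets are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(dataset):
    """hypothetical per-data-set analysis; here just the mean."""
    return sum(dataset) / len(dataset)


def analyze_all(datasets, max_workers=4, executor_cls=ThreadPoolExecutor):
    """run analyze on every data set independently; results keep the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        # no data set depends on any other, so the loop parallelizes trivially
        return list(pool.map(analyze, datasets))


datasets = [[1, 2, 3], [10, 20, 30, 40], [5]]
print(analyze_all(datasets))  # [2.0, 25.0, 5.0]
```

threads keep the sketch easy to run anywhere. for cpu-bound pure-python work, passing `executor_cls=ProcessPoolExecutor` (same interface, same module) is usually what spreads the work across cores, with the call placed under an `if __name__ == "__main__":` guard.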
if any of this stuff is interesting to you, and you might want to work with me, shoot me an email. we are always hiring smart people that we like working with :)