Katelyn Gan

OpenCV for Exoplanet Detection

In this proof-of-concept project, we use OpenCV to extract stellar radial velocities that could contribute to the detection of Earth 2.0!

Advanced · Full instructions provided · Over 4 days · 1,617 views

Things used in this project

Software apps and online services

OpenCV (extensively used in this project)
Bitcraze Crazyflie Python Client
MATLAB

Story


Code

OpenCV for Exoplanet Detection

Python
Use the code to:
1) Read and process the astronomy star measurement data files (.fits);
2) Build star images to be used for OpenCV analysis;
3) Enhance the images for OpenCV feature detection;
4) Call OpenCV functions to perform feature detection, description, and matching;
5) Calculate the image Doppler wavelength shift from the OpenCV matching results;
6) Extract the radial velocity.
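For orientation before the full listing: steps 5 and 6 rest on the Doppler relation. Because the chunk images are built on a log-wavelength axis, a pixel shift measured by OpenCV maps to a radial velocity as RV = (e^(Δln λ) − 1)·c. A minimal sketch of that conversion, using illustrative stand-in values for quantities the script derives from the data:

import numpy as np
from scipy.constants import c  # speed of light [m/s]

# Stand-in values for illustration only; the full script computes these from
# the average-spectrum pickle and the image-generation settings.
shift_per_pixel = 2.0e-6  # log-wavelength step per image pixel for one order
w_oversampling = 2        # wavelength oversampling used when building the images
shift = 0.8               # pixel shift reported by the OpenCV feature matcher

RV = (np.exp(shift_per_pixel / w_oversampling * shift) - 1) * c
print(f"RV = {RV:.1f} m/s")  # ~239.8 m/s for these stand-in numbers

The full script follows.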
import numpy as np
import cv2
#matplotlib.use('qtAgg') # do not display plot unless plt.show()
#matplotlib.use('Agg') # do not display plot unless plt.show()
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 100})
SMALL_SIZE = 14
MEDIUM_SIZE = 20
BIGGER_SIZE = 24
plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=SMALL_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title
from scipy import interpolate
from astropy.io import fits
import os
import pickle
from scipy.constants import c
import random
from scipy.signal import find_peaks
import re
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import time
from astropy.timeseries import LombScargle
import scipy.signal as signal
# import matplotlib.dates as mdates
# from matplotlib.ticker import AutoMinorLocator

# *************** FUNCTIONS ***************************************************************************************************************
def gen_chunks(DataSet, Avg_Spectrum_OverSampling, n_start, n_stop, cutoffRatio, widthMin, widthMax, heightminRatio):
    # function to pick the top chunks on orders from n_start to n_stop. 
    # use cutoffRatio to determine the start_pixel and stop_pixel for each chunk
    
    # open average spectrum
    with open(DataSet+'templates_ao'+str(Avg_Spectrum_OverSampling)+'_n'+str(n_start)+'_n'+str(n_stop)+'.pickle','rb') as f:
        totTemplate, wavelength, shift_per_pixel=pickle.load(f)
    chunkIndices = []
    start_pixel = []
    stop_pixel = []
    imgFolder = DataSet+'ChunkSelectionImg//v'+Chunk_Algorithm_Version+'_ao'+str(Avg_Spectrum_OverSampling)+'_nMin'+str(n_start)+'_nMax'+str(n_stop)+'_cutoff'+str(cutoffRatio)+'_widthMax'+str(widthMax)+'_heightMin'+str(heightminRatio)+'//'
    make_directories(imgFolder)
    for n in range(0, n_stop-n_start+1):
        print("Working on n={} ".format(n))
        x,y,z = find_local_extremum(totTemplate[n,:], 0, cutoffRatio, widthMin, widthMax, heightminRatio, n, imgFolder, Avg_Spectrum_OverSampling,wavelength)
        chunkIndices.append(x)
        start_pixel.append(y)
        stop_pixel.append(z)    
    # save with the avg template into a new file
    AvgSpectrum = 'AvgSpectrum_with_Chunks_ao'+str(Avg_Spectrum_OverSampling)+'_nMin'+str(n_start)+'_nMax'+str(n_stop)+'_cutoff'+str(cutoffRatio)+'_widthMax'+str(widthMax)+'_heightMin'+str(heightminRatio)+'_'+Chunk_Algorithm_Version
    savetofile = DataSet + AvgSpectrum + '.pickle'
    with open(savetofile,'wb') as f:
        pickle.dump([totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel, stop_pixel], f)
    return
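# Example call (illustrative values, consistent with the Step 2 settings below;
# note gen_chunks also reads the global Chunk_Algorithm_Version):
# gen_chunks('101501//', 8, 31, 70, cutoffRatio=0.75, widthMin=16, widthMax=96, heightminRatio=0.5)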

def find_local_extremum(input_array, max_or_min, cutoffRatio, widthMin, widthMax, heightminRatio, n,imgFolder, Avg_Spectrum_OverSampling,wavelength):
    if max_or_min:
        array = input_array
    else:
        array = -input_array[0::]+max(input_array)
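        # (when max_or_min is falsy the spectrum is flipped because absorption
        #  lines are dips; find_peaks below then looks for peaks, not valleys)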
    
    goodPeaks = []
    start_pixel = []
    stop_pixel = []
    
    # Find indices of local peaks    
    heightmin = max(array)*heightminRatio # set minimum peak height
    margin_pixel=50
    if Chunk_Algorithm_Version == '1.1':
        peaks, _ = find_peaks(array[margin_pixel:-margin_pixel], height=heightmin , distance=200, width=[1,200]) # find the wavelength lines
    else:
        peaks, _ = find_peaks(array[margin_pixel:-margin_pixel], height=heightmin , distance=25*Avg_Spectrum_OverSampling, width=[1,25*Avg_Spectrum_OverSampling]) # find the wavelength lines
        
    if not peaks.size == 0: # not empty
        peaks += margin_pixel
        
        # #Plot the spectra and candidate peaks
        # plt.figure(figsize=(8,4),dpi=200)
        # plt.plot(array)
        # plt.plot(peaks, array[peaks], 'x')
        # plt.show() 
        
        # Calculate start_pixel and stop_pixel for each peak
        for j, peak in enumerate(peaks):
            peak_value = array[peak]
    
            # Find left and right indices 
            for i in range(peak - 1, -1, -1):
                if array[i] < peak_value*cutoffRatio:
                    leftPixel = i
                    break                
                else:
                    leftPixel = 0
            for i in range(peak + 1, len(input_array)):
                if array[i] < peak_value*cutoffRatio:
                    rightPixel = i
                    break   
                else:
                    rightPixel = array.size-1
            if rightPixel-leftPixel <= widthMax and rightPixel-leftPixel >= widthMin:
                goodPeaks.append(peak)
                start_pixel.append(leftPixel)
                stop_pixel.append(rightPixel)      
        
        #Plot the spectra and selected lines
        fig = plt.figure(figsize=(6,4),dpi=200)
        plt.plot(wavelength[n,:], -array/max(array)+1)
        plt.plot(wavelength[n, goodPeaks], -array[goodPeaks]/max(array)+1, 'x')
        plt.xlabel(r'$\ln$(Wavelength) [$\AA$]', size = 18)
        plt.ylabel('Normalized flux', size = 18)
        filename = imgFolder+str(n)+'.png'        
        plt.tight_layout()
        fig.savefig(filename)
        plt.close(fig)
        
    return goodPeaks, start_pixel, stop_pixel

def make_directories(path):
    try:
        os.makedirs(path)
        print(f"Directories created successfully at path: {path}")
    except FileExistsError:
        print(f"Directories already exist at path: {path}")
    except Exception as e:
        print(f"An error occurred while creating directories: {str(e)}")
        
def make_order_directories(parent_directory):
    # Create new directories
    for i in range(1, 86):
        directory_name = str(i)
        directory_path = os.path.join(parent_directory, directory_name)
        os.makedirs(directory_path)
        
def genRawSpectra_byChunks(AvgSpectrum_pickle_file, fitsFilelist, n_start, n_stop, v_Pixel, w_oversampling, uncertainty, savetoFolder):    
    # Make directory to save images    
    if not os.path.exists(savetoFolder):
        make_directories(savetoFolder)
        make_order_directories(savetoFolder)
        
    # load template and chunk data
    with open(AvgSpectrum_pickle_file,'rb') as f:
        totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)  
    wavelengthOption = 'bary_wavelength'
    
    
    for n in range(0, n_stop-n_start+1):
        N = n + n_start
        print("N={}".format(N))
        # Create avgSpectrum images
        start_pixel_array_byorder = start_pixel_array[n]
        for i, start_pixel in enumerate(start_pixel_array_byorder): 
            stop_pixel = stop_pixel_array[n][i]
            # read the start and stop wavelength
            logwavelength_low = wavelength[n, start_pixel]
            logwavelength_high = wavelength[n, stop_pixel]
            # Perform oversampling  
            numWavelength = (stop_pixel-start_pixel)*w_oversampling+1
            logwavelength = np.linspace(logwavelength_low, logwavelength_high, numWavelength)             
            f = interpolate.interp1d(wavelength[n,:], totTemplate[n,:]) 
            spectrum_interp_img1 = f(logwavelength)
            # build file name to be saved
            ImgFileName1 = savetoFolder+'//'+str(N)+'//Raw_v' + str(v_Pixel) + '_w'+'_n' + str(N) + '_p' + str(start_pixel) + '_to_p' + str(stop_pixel) + '_OS' + str(w_oversampling)+'_AvgTemplate.jpg'
            if not os.path.exists(ImgFileName1):                          
                # Construct image
                spec_image_img1 = np.zeros((v_Pixel, numWavelength))   
                spec_image_img1[:] = spectrum_interp_img1
                normalized_image = (spec_image_img1 - np.min(spec_image_img1)) / (np.max(spec_image_img1) - np.min(spec_image_img1))
                grayscale_image = cv2.cvtColor((normalized_image*255).astype(np.uint8), cv2.COLOR_GRAY2BGR)
                plt.imsave(ImgFileName1, grayscale_image)
        
        # Create individual spectrum images
        for fitsFile in fitsFilelist:            
            fitsFilePath = os.path.join(fitsFile_folder, fitsFile)
            with fits.open(fitsFilePath) as ff:
                data = ff[1].data
            # Interpolate the test spectrum at the common wavelength    
            spectrum0 = data['spectrum'][N,:]/data['continuum'][N,:]        
            logwavelength0 = np.log(data[wavelengthOption][N,:])        
            if uncertainty:
                uncertainty0 =  data['uncertainty'][N,:]/data['continuum'][N,:] 
             
            for i, start_pixel in enumerate(start_pixel_array_byorder):
                stop_pixel = stop_pixel_array[n][i]
                # read the start and stop wavelength
                logwavelength_low = wavelength[n, start_pixel]
                logwavelength_high = wavelength[n, stop_pixel]
                # Perform oversampling  
                numWavelength = (stop_pixel-start_pixel)*w_oversampling+1
                logwavelength = np.linspace(logwavelength_low, logwavelength_high, numWavelength)
                f = interpolate.interp1d(logwavelength0, spectrum0) 
                spectrum_interp_img2 = f(logwavelength)                 
                ImgFileName2 = savetoFolder+'//'+str(N)+'//Raw_v' + str(v_Pixel) + '_w'+'_n' + str(N) + '_p' + str(start_pixel) + '_to_p' + str(stop_pixel) + '_OS' + str(w_oversampling)+ '_'+fitsFile[0:-5]+'.jpg'
                # Construct image
                spec_image_img2 = np.zeros((v_Pixel, numWavelength))     
                if uncertainty:             
                    f = interpolate.interp1d(logwavelength0, uncertainty0) 
                    uncertainty_interp_img2 = f(logwavelength)
                    # add uncertainty noise
                    for row in range(v_Pixel):
                        spec_image_img2[row,:] = spectrum_interp_img2 + np.random.randn(numWavelength)*uncertainty_interp_img2
                else:
                    # No noise. Set every row to spectrum_interp_img2
                    spec_image_img2[:] = spectrum_interp_img2
                            
                normalized_image = (spec_image_img2 - np.min(spec_image_img2)) / (np.max(spec_image_img2) - np.min(spec_image_img2))
                grayscale_image = cv2.cvtColor((normalized_image*255).astype(np.uint8), cv2.COLOR_GRAY2BGR)
                plt.imsave(ImgFileName2, grayscale_image) 
    return 0
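# Note: for every selected chunk, genRawSpectra_byChunks writes one template image
# (from the average spectrum) plus one image per FITS file. Each image row repeats
# the interpolated 1-D spectrum so OpenCV can treat the chunk as a 2-D picture,
# optionally adding per-row Gaussian noise scaled by the measured uncertainty.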

def find_multiples(numbers, limit):
    multiples = set()
    for number in numbers:
        for i in range(1, limit):
            if number * i < limit:
                multiples.add(number * i)
    return multiples
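# Example with illustrative inputs:
# find_multiples([7, 11], 30) -> {7, 11, 14, 21, 22, 28}
# i.e. every multiple of the linegap factors that fits within the image height.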

def mask(grayscale2DArray1, grayscale2DArray2, MaskType, LocationType, intensity_max, linegap):
    # Apply mask lines to the two input grayscale 2D arrays
    # Input: [MaskType, LocationType, intensity_max, linegap] 
    # - MaskTypes = [0, 1, 2] # 0: fixed; 1: gradient; 2: random 
    # - LocationTypes = [0, 1] # 0: multiples of the factors in linegap (evenly distributed); 1: random
    
    # Find the multiples of numbers in linegap
    rowNum, columnNum = grayscale2DArray1.shape
    lineSet = find_multiples(linegap, rowNum)     
    lineCount = len(lineSet)   
    if LocationType==0: 
        lines = np.array(list(lineSet))
    else:            
        lines = np.array(random.sample(range(rowNum), lineCount))
        
    if MaskType==0: # fixed          
        intensity = np.ones(lineCount)*intensity_max  
    elif MaskType==1: # gradient
        intensity = np.linspace(0, intensity_max, lineCount).astype(np.uint16)
    else: # random
        intensity = np.random.randint(intensity_max+1, size=lineCount)
    
    lines.sort()
    for i, row in enumerate(lines):  
        grayscale2DArray1[row,:] = intensity[i]
        grayscale2DArray2[row,:] = intensity[i]
    return [grayscale2DArray1, grayscale2DArray2]
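# Usage sketch for mask() with assumed inputs (illustrative only):
# img1 = np.full((100, 60), 128, dtype=np.uint8)  # template chunk image
# img2 = np.full((100, 60), 128, dtype=np.uint8)  # test chunk image
# m1, m2 = mask(img1, img2, 0, 0, 200, [7, 11, 23, 37, 47])
# Identical horizontal lines are drawn on both images, presumably to give SIFT
# two-dimensional structure to key on, since the raw spectrum rows are identical.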

def features_matching(Image1, Image2, g_match_threshold, showImg):   
    # Function for Feature Matching + Perspective Transformation
    # Check input type - img file path or grayscale 2D array
    if isinstance(Image1, str):
        img1 = cv2.imread(Image1, 0)   # read template image in grayscale
    else:
        img1 = Image1
        
    if isinstance(Image2, str):
        img2 = cv2.imread(Image2, 0)   # read test image in grayscale
    else:
        img2 = Image2
        
    featureCount = []  # initialize so the except handler below cannot hit a NameError
    try:
        if showImg==2:        
            plt.figure(figsize=(18,10), dpi=100)
            plt.imshow(img1)
            plt.title('Avg Spectrum')
            plt.figure(figsize=(18,10), dpi=100)
            plt.imshow(img2)
            plt.title('Individual Spectrum')
    
        min_match=1
        
        # SIFT detector
        sift = cv2.SIFT_create()
    
        # extract the keypoints and descriptors with SIFT
    
        kps1, des1 = sift.detectAndCompute(img1,None)
        kps2, des2 = sift.detectAndCompute(img2,None)
        
        if showImg==2:
            # Display key points for reference image in green color
            imgWithKP = cv2.drawKeypoints(img1, kps1, 0, (0,255,0), None)
            imgWithKP1 = imgWithKP[:,:,0]
            imgWithKP = cv2.drawKeypoints(img2, kps2, 0, (0,255,0), None)
            imgWithKP2 = imgWithKP[:,:,0]
            imgshow = np.concatenate((imgWithKP1, imgWithKP2), axis=1)
            plt.figure(figsize=(18,10), dpi=100)
            plt.imshow(imgshow)
            plt.title('Spectrum w/ KeyPoint')
        
        featureCount = [len(kps1),len(kps2)]
        #print(featureCount)
        
        FLANN_INDEX_KDTREE = 0
        index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
        search_params = dict(checks = 50)
    
        flann = cv2.FlannBasedMatcher(index_params, search_params)
    
        matches = flann.knnMatch(des1, des2, k=2)
        
        # Need to draw only good matches, so create a mask
        matchesMask = [[1,0] for i in range(len(matches))]    
        
    
        # store all the good matches (g_matches) as per Lowe's ratio 
        g_match_m = []
        g_match_mn = []
        for i, (m,n) in enumerate(matches):
            if m.distance < g_match_threshold* n.distance:
                g_match_mn.append([m,n])
                g_match_m.append(m)
                #matchesMask[i]=[1,0]
        
        num_g_match = len(g_match_m)
        #print('Good Match = {}'.format(num_g_match))
        # Draw all matches
        draw_params = dict(matchColor = (0,255,0),singlePointColor = (255,0,0),matchesMask = matchesMask,flags = 0)
        img3 = cv2.drawMatchesKnn(img1, kps1, img2, kps2, matches, None, **draw_params)
            
        # Draw good matches only
        matchesMask = [[1,0] for i in range(len(g_match_mn))]  
        draw_params = dict(matchColor = (0,255,0),singlePointColor = (255,0,0),matchesMask = matchesMask,flags = 0)
        img3 = cv2.drawMatchesKnn(img1, kps1, img2, kps2, g_match_mn, None, **draw_params)
            
        #cv2.imshow('Image Match', img3)
        
        matchesMask = [[1,0] for i in range(len(matches))]  
        
        if num_g_match>min_match:
            src_pts = np.float32([ kps1[m.queryIdx].pt for m in g_match_m ]).reshape(-1,1,2)
            dst_pts = np.float32([ kps2[m.trainIdx].pt for m in g_match_m ]).reshape(-1,1,2)
    
            M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC,5.0)
            matchesMask = mask.ravel().tolist()
    
            h,w = img1.shape
            pts = np.float32([ [0,0],[0,h-1],[w-1,h-1],[w-1,0] ]).reshape(-1,1,2)
            dst = cv2.perspectiveTransform(pts,M)
    
            # img2 = cv2.polylines(img2, [np.int32(dst)], True, (0,255,255) , 3, cv2.LINE_AA)
            
            draw_params = dict(matchColor = (0,255,255), singlePointColor = (0,255,0), matchesMask = matchesMask, flags = 2)  # only inliers   
            # #draw_params = dict(matchColor = (0,255,0),   singlePointColor = (255,0,0), matchesMask = matchesMask, flags = 0)               
            # # region corners    
            # cpoints=np.int32(dst)
            # a, b,c = cpoints.shape
    
            # # reshape to standard format
            # c_p=cpoints.reshape((b,a,c))   
            
            # Get the shift
            # Old method 
            # shift = (w-1)/2-np.sum(c_p[0,:,0])/4
            # delta_shift = abs(c_p[0,2,0]-c_p[0,3,0])            
            # new method to calculate the shift (use the raw data after homography transformation)
            shift = (w-1)/2-np.average(dst[:,0,0])
            # delta_shift = np.sqrt(np.mean((pts[:,0,0]-dst[:,0,0])**2))
            
            # crop matching region and show image
            # matching_region = crop_region(path_train, c_p)
            if showImg:
                img3 = cv2.drawMatches(img1, kps1, img2, kps2, g_match_m, None, **draw_params)
                plt.figure(figsize=(18,10), dpi=100)
                plt.imshow(img3)
            # print("Image1 feature detected: {}; Image2 feature detected: {}. Good matches found: {}".format(featureCount[0], featureCount[1], len(g_match_m)))
            # return (shift, delta_shift, featureCount, num_g_match, M)    
            return (shift, featureCount, num_g_match, M)    
        else:
            # print("Image1 feature detected: {}; Image2 feature detected: {}. Not enough matches have been found! - {}/{}".format(featureCount[0], featureCount[1], len(g_match_m), min_match))
            matchesMask = None
            # return (9999, 9999, featureCount, num_g_match, 0)
            return (9999,  featureCount, num_g_match, 0)
    except Exception:
        if featureCount:
            # return (9999, 9999, featureCount, 0, 0)
            return (9999, featureCount, 0, 0)
        else:
            # return (9999, 9999, [0,0], 0, 0)
            return (9999, [0,0], 0, 0)
            
            

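# Usage sketch for features_matching() (hypothetical file names):
# shift, featureCount, num_g_match, M = features_matching(
#     'Raw_template_chunk.jpg', 'Raw_test_chunk.jpg', 0.9, 0)
# A returned shift of 9999 signals that matching failed for the chunk.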

def calculateSingleRV(RVPickleFile, RV_cutoff, fitsNum):
    with open('AvgSpectrum_with_Chunks_' + Chunk_Algorithm_Version + '.pickle','rb') as f:
        totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)  
    del totTemplate, wavelength, chunkIndices, start_pixel_array, stop_pixel_array
    
    with open(RVPickleFile,'rb') as f:
        # shift_stats, delta_shift_stats, featureCount_stats, num_g_match_stats = pickle.load(f)
        shift_stats, featureCount_stats, num_g_match_stats = pickle.load(f)
        
    # extract parameters from the filename
    strArray = RVPickleFile.split('_')
    w_oversampling = int(re.findall(r'\d+', strArray[2])[0])
    n_start = int(re.findall(r'\d+', strArray[4])[0])
    n_stop = int(re.findall(r'\d+', strArray[4])[1])
    orderCount = n_stop-n_start
    RV_avg = np.zeros(4)
    
    i = fitsNum
    RV_fitsfile = np.zeros([orderCount, 4])
    for j, n in enumerate(range(n_start, n_stop)):
        shift_array = np.array(shift_stats[i][j])
        RVs = (np.exp(shift_per_pixel[n]/w_oversampling*shift_array)-1)*c
        RVs_good = RVs[abs(RVs)<RV_cutoff]
        RV_fitsfile[j,0] = np.average(RVs)
        RV_fitsfile[j,1] = np.var(RVs)
        RV_fitsfile[j,2] = np.average(RVs_good)
        RV_fitsfile[j,3] = np.var(RVs_good)
    RV_avg[0] = np.average(RV_fitsfile[:,0]) # avg of all RVs
    RV_weights = 1/RV_fitsfile[:,1] # use variance as weights
    RV_avg[1] = np.average(RV_fitsfile[:,0], weights=RV_weights) # weighted avg of all RVs
    
    RV_avg[2] = np.average(RV_fitsfile[:,2]) # avg of good RVs
    RV_good_weights = 1/RV_fitsfile[:,3] # use variance of good RVs as weights
    RV_avg[3] = np.average(RV_fitsfile[:,2], weights=RV_good_weights) # weighted avg of good RVs
        
    return -RV_avg

def calculateRV(RVPickleFile_fullpath, RV_cutoff, w_oversampling, n_start, n_stop, shift_per_pixel):   
    with open(RVPickleFile_fullpath,'rb') as f:
        shift_stats, delta_shift_stats, featureCount_stats, num_g_match_stats = pickle.load(f)
            # shift_stats,  featureCount_stats, num_g_match_stats = pickle.load(f)
        
    orderCount = n_stop-n_start+1
    fitsCount = len(shift_stats)
    RV_avg = np.zeros([fitsCount, 4])
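    # Column layout of RV_avg (one row per FITS epoch):
    #   0: plain mean over orders;              1: inverse-variance weighted mean;
    #   2: mean of good RVs (|RV| < RV_cutoff); 3: inverse-variance weighted mean of good RVs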
    
    for i in range(fitsCount):
        RV_fitsfile = np.zeros([orderCount, 4])
        for j, n in enumerate(range(n_start, n_stop+1)):
            # print('i={}, j={}'.format(i,j))
            shift_array = np.array(shift_stats[i][j])
            RVs = (np.exp(shift_per_pixel[j]/w_oversampling*shift_array)-1)*c
            RVs_good = RVs[abs(RVs)<RV_cutoff]
            RV_fitsfile[j,0] = np.average(RVs)
            RV_fitsfile[j,1] = np.var(RVs)
            RV_fitsfile[j,2] = np.average(RVs_good)
            RV_fitsfile[j,3] = np.var(RVs_good)
        RV_avg[i, 0] = np.average(RV_fitsfile[:,0]) # avg of all RVs
        RV_weights = 1/RV_fitsfile[:,1] # use variance as weights
        RV_avg[i, 1] = np.average(RV_fitsfile[:,0], weights=RV_weights) # weighted avg of all RVs
        
        RV_avg[i, 2] = np.average(RV_fitsfile[:,2]) # avg of good RVs
        RV_good_weights = 1/RV_fitsfile[:,3] # use variance of good RVs as weights
        RV_avg[i, 3] = np.average(RV_fitsfile[:,2], weights=RV_good_weights) # weighted avg of good RVs
        
    return -RV_avg

def check_string_in_nested_list(nested_list, search_string):
    for sublist in nested_list:
        if isinstance(sublist, list):
            if check_string_in_nested_list(sublist, search_string):
                return True
        elif isinstance(sublist, str) and search_string in sublist:
            return True
    return False
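# Example with illustrative inputs:
# check_string_in_nested_list([['run_a', 1.2], ['run_b', 0.8]], 'run_b') -> True
# Used in Step 5 to skip pickle files already recorded in RV_summary.pickle.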

# *************** END OF FUNCTIONS ************************************************************************************************************

## *************** Code Settings ***************************************************************************************************************
# General Settings
EXPRES_folder = 'EXPRES//'                                              # EXPRES Data Folder
DataSavingRootFolder = 'RV//'                                           # Root folder to save the result data
Star = '101501'                                                         # Define which star to analyze ('101501', '26965', '10700', '34411', etc.)
# Star = '34411'  
# Star = '26965'  
# Star = '10700'

# Step 1: Average Spectrum Calculation Settings
startInd = 1000                                                         # Starting index of spectrum calculation
stopInd = 7000                                                          # Stop index of spectrum calculation
Avg_Spectrum_OverSamplings = [8]                                        # List of Target Average Spectrum Over Samplings. Example: [1, 2, 4, 8]

# Step 2: Absorption Lines/Chunks Selection Settings
# n_pairs = [[20, 80],[20, 70], [30, 80], [30, 75]]
n_pairs = [[31, 70]]                                                    # List of [n_start, n_stop] pairs: the range of spectral orders to analyze
Chunk_Algorithm_Version = '1.8'                                                           # Algorithm Version #
cutoffRatio = 0.75                                             # find the start and stop pixels at which the spectrum height is cutoffRatio*chunk height
heightminRatio = 0.5                                                   # Minimum chunk height = heightminRatio * maximum spectrum height

# Step 3: Raw Spectrum Image Generation Settings                                                   
v_Pixels = [100]                                                        # List of Vertical Pixels. Example: [100, 200]
w_oversamplings = [2]                                                   # List of Wavelength Over Samplings. Example: [1, 4, 8]
uncertainty = False                                                     # Use uncertainty as noise

# Step 4: OpenCV feature matching and shift calculation Settings
intensities = [200]                                                     # grayscale intensity for the horizontal lines
linegap = [7, 11, 23, 37, 47]                                      # location factors for the horizontal lines. 
# linegap = [3,29] 
MaskTypes = [0]                                                            # Mask type: 0: fixed; 1: gradient; 2: random 
LocationType = 0                                                        # Horizontal lines location: 0: multiples of several factors; 1: random
discretization_ratios = [0.1]                                           # discretization settings: the smaller the ratio, the heavier the discretization
showImg = 0                                                             # Image plotting: 0: no plotting; 1: Plot matching only; 2: Plot all template, test, and matching images
g_match_thresholds = [1]                                                # Lowe's-ratio matching threshold. A match whose distance is below threshold x the second-best distance is considered a good match and is used for the homography matrix calculation
useContrastEnhancement = True                                           # Use contrast enhancement or not
   
## ************** END OF Code Settings *********************************************************************************************************



# ## ************* Code initialization. Do Not Modify *****************************************************************************************************************
# Build AvgSpectrum keywords
DataSet = Star+'//'                                                                 # Star's sub-directory
fitsFile_folder = EXPRES_folder + DataSet + "Spectra " + Star + "//"
fitsFiles = [f for f in os.listdir(fitsFile_folder) if f.endswith('.fits')]         # Get a list of all the FITS files in the folder 
# fitsFiles = fitsFiles[0:10]
fitsCount = len(fitsFiles)
RawSpectrumImg_folder = DataSet+'RawSpectrumImg//'
RV_Results_folder = DataSavingRootFolder + DataSet
folders = ['ChunkSelectionImg', 'RawSpectrumImg'] # Setup Directories
make_directories(Star)
make_directories(DataSavingRootFolder)
for folder in folders:
    make_directories((Star+'//' + folder))      
# # ************************************************************************************************************************************************************

# Record the start time
start_time = time.time()

for n_start, n_stop in n_pairs:
    AvgSpectrums = []
    for Avg_Spectrum_OverSampling in Avg_Spectrum_OverSamplings:
        widthMin = 2*Avg_Spectrum_OverSampling
        widthMax = 12*Avg_Spectrum_OverSampling
        AvgSpectrums.append('AvgSpectrum_with_Chunks_ao'+str(Avg_Spectrum_OverSampling)+'_nMin'+str(n_start)+'_nMax'+str(n_stop)+'_cutoff'+str(cutoffRatio)+'_widthMax'+str(widthMax)+'_heightMin'+str(heightminRatio)+'_'+Chunk_Algorithm_Version)
    EXPRES_Num_Orders=n_stop - n_start + 1  
    
    # Step 1: Calculate average spectrum and save data to a pickle file
    print('******************************** Calculate average spectrum and save data to a pickle file *******************************************************************')
    
    for Avg_Spectrum_OverSampling in Avg_Spectrum_OverSamplings:
        numWavelength = (stopInd-startInd-1)*Avg_Spectrum_OverSampling+1
        totTemplate = np.zeros((EXPRES_Num_Orders,numWavelength))
        LogWavelengths = np.zeros((EXPRES_Num_Orders,numWavelength))
        shift_per_pixel = np.zeros(EXPRES_Num_Orders)   
        
        # Loop over each FITS file, read its data, and accumulate it on the common wavelength grid to build the template
        for i in range(EXPRES_Num_Orders):
            order = i+n_start
            template=np.zeros(numWavelength)
            print('i={}'.format(i))
            for j, file_name in enumerate(fitsFiles):       
                file_path = os.path.join(fitsFile_folder, file_name)
                hdulist = fits.open(file_path)
                data = hdulist[1].data
                hdulist.close()
                
                if j==0:
                    logWavelength = np.log(data['bary_wavelength'][order][startInd:stopInd])
                    normalizedData = data['spectrum'][order][startInd:stopInd]/data['continuum'][order][startInd:stopInd]
                    startLogWavelength = logWavelength[0]
                    stopLogWavelength = logWavelength[-1]
                    LogWavelengths[i] = np.linspace(startLogWavelength, stopLogWavelength, numWavelength)
                    shift_per_pixel[i] = (stopLogWavelength-startLogWavelength)/numWavelength
                else:
                    logWavelength = np.log(data['bary_wavelength'][order])
                    normalizedData = data['spectrum'][order]/data['continuum'][order]
                    
                f = interpolate.interp1d(logWavelength, normalizedData)
                template = template + f(LogWavelengths[i])
                
            totTemplate[i]=template
            
        filename = DataSet+'templates_ao'+str(Avg_Spectrum_OverSampling)+'_n'+str(n_start)+'_n'+str(n_stop)+'.pickle'
        with open(filename, 'wb') as f:
            pickle.dump([totTemplate, LogWavelengths , shift_per_pixel], f)
    
    # Step 2: Find all good absorption lines (i.e. chunks) in each order of the average spectra and save the data along with the average spectrum to a new pickle file
    print('******************************** Find all good absorption lines of the average spectra and save the data  ****************************************************')
    for Avg_Spectrum_OverSampling in Avg_Spectrum_OverSamplings:
        widthMin = 2*Avg_Spectrum_OverSampling
        widthMax = 12*Avg_Spectrum_OverSampling # Maximum chunk width to consider. Ignore all chunks that have a width > widthMax
        gen_chunks(DataSet, Avg_Spectrum_OverSampling, n_start, n_stop, cutoffRatio, widthMin, widthMax, heightminRatio)
    
    # Step 3. Generate raw average spectrum and individual fitsfile spectrum images for good chunks
    print('******************************** Generate raw average spectrum and individual spectrum images for good chunks ***********************************************')     
    for Avg_Spectrum_OverSampling in Avg_Spectrum_OverSamplings:
        widthMin = 2*Avg_Spectrum_OverSampling
        widthMax = 12*Avg_Spectrum_OverSampling
        AvgSpectrum ='AvgSpectrum_with_Chunks_ao'+str(Avg_Spectrum_OverSampling)+'_nMin'+str(n_start)+'_nMax'+str(n_stop)+'_cutoff'+str(cutoffRatio)+'_widthMax'+str(widthMax)+'_heightMin'+str(heightminRatio)+'_'+Chunk_Algorithm_Version
        for v_Pixel in v_Pixels:
            AvgSpectrum_pickle_file = DataSet + AvgSpectrum +'.pickle'
            for w_oversampling in w_oversamplings:
                print('AvgSpectrum = {}, w_oversampling = {}'.format(AvgSpectrum, w_oversampling))
                savetoFolder = RawSpectrumImg_folder + AvgSpectrum +'_v' + str(v_Pixel) + '_w'+'_OS' + str(w_oversampling)+'_uncrty'+str(uncertainty)+'_grayscale//'
                genRawSpectra_byChunks(AvgSpectrum_pickle_file, fitsFiles, n_start, n_stop, v_Pixel, w_oversampling, uncertainty, savetoFolder)
                
    # Step 4. Calculate shift on all good chunks and save the data to a pickle file
    print('******************************** Calculate shift on all good chunks and save the data to a pickle file *******************************************************')
    for MaskType in MaskTypes:
        for Avg_Spectrum_OverSampling in Avg_Spectrum_OverSamplings:
            widthMin = 2*Avg_Spectrum_OverSampling
            widthMax = 12*Avg_Spectrum_OverSampling
            AvgSpectrum ='AvgSpectrum_with_Chunks_ao'+str(Avg_Spectrum_OverSampling)+'_nMin'+str(n_start)+'_nMax'+str(n_stop)+'_cutoff'+str(cutoffRatio)+'_widthMax'+str(widthMax)+'_heightMin'+str(heightminRatio)+'_'+Chunk_Algorithm_Version
            
            for v_Pixel in v_Pixels:   
                with open(DataSet + AvgSpectrum +'.pickle','rb') as f:
                    totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)        
                for intensity in intensities:
                    for discretization_ratio in discretization_ratios:
                        for w_oversampling in w_oversamplings:
                            for g_match_threshold in g_match_thresholds:
                                print('n_start={}, n_stop={}, chunk_file={}, WO={}, g_threshold={}'.format(n_start, n_stop, AvgSpectrum, w_oversampling, g_match_threshold))
                                RV_save_folder =DataSavingRootFolder + DataSet + AvgSpectrum
                                make_directories(RV_save_folder)    
                                
                                pickleFileName = RV_save_folder+'//'+'v' + str(v_Pixel) + '_w'+'_OS' + str(w_oversampling)+'_uncrty'+str(uncertainty)+'_n'+str(n_start)+'-'+str(
                                        n_stop)+'_LnGap'+str(linegap)+'_gThrhld'+str(g_match_threshold)+'_MTyp'+str(MaskType)+'_Loc'+str(
                                            LocationType)+ '_discrt'+str(discretization_ratio)+'_int'+str(intensity)+'_fitsCnt'+str(fitsCount)+'_CE'+str(useContrastEnhancement)+'.pickle'
                                
                                shift_stats = []
                                # delta_shift_stats = []
                                featureCount_stats = []
                                num_g_match_stats = []
                                
                                RawSpectrumImg_folder_work = RawSpectrumImg_folder + AvgSpectrum + '_v' + str(v_Pixel) + '_w'+'_OS' + str(w_oversampling)+'_uncrty'+str(uncertainty)+'_grayscale//'
                                for j, fitsFile in enumerate(fitsFiles):
                                    print('Working on fits file #{} - {}...'.format(j, fitsFile))
                                    shift_byFitFiles = []
                                    # delta_shift_byFitFiles = []
                                    featureCount_byFitFiles = []
                                    num_g_match_byFitFiles = []
                                    for n in range(0, n_stop-n_start+1):   
                                        N=n+n_start
                                        shift_byFitFiles_n = []
                                        # delta_shift_byFitFiles_n = []
                                        featureCount_byFitFiles_n = []
                                        num_g_match_byFitFiles_n = []
                                        start_pixel_array_byorder = start_pixel_array[n]
                                        for i, start_pixel in enumerate(start_pixel_array_byorder):
                                            stop_pixel = stop_pixel_array[n][i]
                                            rawSpectrumTemplateFile = RawSpectrumImg_folder_work+str(N)+'//Raw_v' + str(v_Pixel) + '_w'+'_n' + str(N) + '_p' + str(start_pixel) + '_to_p' + str(stop_pixel) + '_OS' + str(w_oversampling)+'_AvgTemplate.jpg'
                                            rawSpectrumTestImgFile  = RawSpectrumImg_folder_work+str(N)+'//Raw_v' + str(v_Pixel) + '_w'+'_n' + str(N) + '_p' + str(start_pixel) + '_to_p' + str(stop_pixel) + '_OS' + str(w_oversampling)+ '_'+fitsFile[0:-5]+'.jpg'                
                                        
                                            rawSpectrumTemplate = cv2.imread(rawSpectrumTemplateFile, cv2.IMREAD_GRAYSCALE) 
                                            rawSpectrumTestImg = cv2.imread(rawSpectrumTestImgFile, cv2.IMREAD_GRAYSCALE)                         
                                            
                                            # add mask to images
                                            rawSpectrumTemplate_masked, rawSpectrumTestImg_masked =  mask(rawSpectrumTemplate, rawSpectrumTestImg, MaskType, LocationType, intensity, linegap)
                                            
                                            rawSpectrumTemplate_masked_discretized = (rawSpectrumTemplate_masked*discretization_ratio).astype(np.uint8)
                                            # convert it to full grayscale
                                            rawSpectrumTemplate_masked_discretized = ((rawSpectrumTemplate_masked_discretized - rawSpectrumTemplate_masked_discretized.min())/(rawSpectrumTemplate_masked_discretized.max() - rawSpectrumTemplate_masked_discretized.min())*255).astype(np.uint8)
                                            
                                            rawSpectrumTestImg_masked_discretized = (rawSpectrumTestImg_masked*discretization_ratio).astype(np.uint8)
                                            # convert it to full grayscale
                                            rawSpectrumTestImg_masked_discretized = ((rawSpectrumTestImg_masked_discretized - rawSpectrumTestImg_masked_discretized.min())/(rawSpectrumTestImg_masked_discretized.max() - rawSpectrumTestImg_masked_discretized.min())*255).astype(np.uint8)
                                            
                                            if useContrastEnhancement: # perform histogram equalization to increase contrast
                                                rawSpectrumTemplate_masked_discretized_equalized = cv2.equalizeHist(rawSpectrumTemplate_masked_discretized)
                                                rawSpectrumTestImg_masked_discretized_equalized = cv2.equalizeHist(rawSpectrumTestImg_masked_discretized)
                                            else: # No contrast enhancement
                                                rawSpectrumTemplate_masked_discretized_equalized = rawSpectrumTemplate_masked_discretized
                                                rawSpectrumTestImg_masked_discretized_equalized = rawSpectrumTestImg_masked_discretized
                                            
                                            # test feature detection
                                            MaskSettings = [MaskType, LocationType]
                                            # shift, delta_shift, featureCount, num_g_match, M= features_matching(rawSpectrumTemplate_masked_discretized, rawSpectrumTestImg_masked_discretized, g_match_threshold, showImg)    
                                            shift,  featureCount, num_g_match, M= features_matching(rawSpectrumTemplate_masked_discretized_equalized, rawSpectrumTestImg_masked_discretized_equalized, g_match_threshold, showImg)    
                                            RV = (np.exp(shift_per_pixel[n]/w_oversampling*shift)-1)*c
                                            # print('{}, n={},{}-{}, Mask={}: shift = {}, delta shift = {}, discretization = {}, feature count = {}, good match = {}, M={}, RV_mask = {}'.format(fitsFile, n, pixel_start, pixel_stop, MaskSettings, shift, delta_shift, discretization_ratio, featureCount, num_g_match, M, RV))                
                                            # print('{}, n={},{}-{}, Mask={}: shift = {}, delta shift = {}, discretization = {}, feature count = {}, good match = {}, RV_mask = {}'
                                            #        .format(fitsFile, n, start_pixel, stop_pixel, MaskSettings, shift, delta_shift, discretization_ratio, featureCount, num_g_match, RV))     
                                            shift_byFitFiles_n.append(shift)
                                            # delta_shift_byFitFiles_n.append(delta_shift)
                                            featureCount_byFitFiles_n.append(featureCount)
                                            num_g_match_byFitFiles_n.append(num_g_match)
                                          
                                        shift_byFitFiles.append(shift_byFitFiles_n)
                                        # delta_shift_byFitFiles.append(delta_shift_byFitFiles_n)
                                        featureCount_byFitFiles.append(featureCount_byFitFiles_n)
                                        num_g_match_byFitFiles.append(num_g_match_byFitFiles_n)
                                        
                                        
                                    shift_stats.append(shift_byFitFiles)
                                    # delta_shift_stats.append(delta_shift_byFitFiles)
                                    featureCount_stats.append(featureCount_byFitFiles)
                                    num_g_match_stats.append(num_g_match_byFitFiles)
                                            
                                with open(pickleFileName,'wb') as f:
                                    # pickle.dump([shift_stats, delta_shift_stats, featureCount_stats, num_g_match_stats], f)
                                    pickle.dump([shift_stats, featureCount_stats, num_g_match_stats], f)



# Step 5a. Plot the industry (EXPRES CBC) pipeline RVs
print('******************************** Plot industry pipeline ****************************************************************************')
# Load, plot, and save the activity csv
data_file = EXPRES_folder + DataSet+Star+'_activity.csv'
X = pd.read_csv(data_file)
if Star == '26965':
    yli = [-13, 13]
elif Star == '10700':
    yli = [-9, 9]
elif Star == '101501':
    yli = [-13, 13]
else:
    yli = [-13, 13] # fallback (assumed default) so yli is always defined for plt.ylim below

fig = plt.figure(figsize=(5,2.5), dpi=100)
plt.plot(X['Time [MJD]'], X['CBC RV [m/s]'], '.', color='red')
titletext = 'HD '+Star+' EXPRES RVs'
plt.title(titletext,fontsize=16)
plt.xlabel('Time [MJD]')
plt.ylabel('RV [m/s]')
plt.ylim(yli)
plt.tight_layout()

savetofile=DataSet + 'Original_EXPRES_CBC_dot.png'
fig.savefig(savetofile)

fig = plt.figure(figsize=(5,2.5), dpi=100)
plt.plot(X['Time [MJD]'], X['CBC RV [m/s]'], color='red')
plt.title(titletext,fontsize=16)
plt.xlabel('Time [MJD]')
plt.ylabel('RV [m/s]')
plt.tight_layout()
plt.ylim(yli)
plt.tight_layout()
savetofile=DataSet + 'Original_EXPRES_CBC_line.png'
fig.savefig(savetofile)
# ****************************************************************************************

# Step 5. Calculate RV for all pickle files in a folder
print('******************************** Calculate RV for all pickle files in a folder ****************************************************************************')
RV_cutoff = 5000
# Load, plot, and save the activity csv
data_file = EXPRES_folder + DataSet+Star+'_activity.csv'
X = pd.read_csv(data_file)

# get a list of all directories under a DataSet
RVPickle_folders = [os.path.join(RV_Results_folder, name)+'//' for name in os.listdir(RV_Results_folder) if os.path.isdir(os.path.join(RV_Results_folder, name))]
for RVPickle_folder in RVPickle_folders:    
    # extract the average spectrum chunk info file location
    AvgSpectrum_Chunks_File_Name = RVPickle_folder.split('//')[2]
    # load the shift_per_pixel data
    AvgSpectrum_Chunks_File = DataSet + AvgSpectrum_Chunks_File_Name + '.pickle'
    with open(AvgSpectrum_Chunks_File,'rb') as f:
        totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)  
    del totTemplate, wavelength, chunkIndices, start_pixel_array, stop_pixel_array
    
    make_directories(RVPickle_folder+'RV_Plots') # make a directory to save plots
    PickleFiles = [p for p in os.listdir(RVPickle_folder) if p.endswith('.pickle')]  # Get a list of all the pickle files in the folder
    
    # check if there is a saved RV_data file
    RV_summary_file = RVPickle_folder + 'RV_summary.pickle'
    if os.path.exists(RV_summary_file): # there is a saved RV_data file
        with open(RV_summary_file, 'rb') as f:
            RV_data = pickle.load(f)    
    else:
        RV_data = [] # No previously saved RV_data, start with an empty list
    
    for RVPickleFile in PickleFiles:
        if RVPickleFile != 'RV_summary.pickle':
            RVPickleFile_noext = RVPickleFile.split('.pickle')[0]
            # check if this file already processed previously        
            if not check_string_in_nested_list(RV_data, RVPickleFile_noext):
                print('Pickle File = {}'.format(RVPickleFile))    
                # extract w_oversampling, n_start, n_stop from file name
                strArray = RVPickleFile.split('_')
                w_oversampling = int(re.findall(r'\d+', strArray[2])[0])
                n_start = int(re.findall(r'\d+', strArray[4])[0])
                n_stop = int(re.findall(r'\d+', strArray[4])[1])
                
                avgRV = calculateRV(RVPickle_folder+RVPickleFile, RV_cutoff, w_oversampling, n_start, n_stop, shift_per_pixel)
                
                
                RV_info = RVPickleFile.split('.pickle')[0]
                if not np.isnan(avgRV[:,1]).any(): # all RV data are good
                    RV_RMS = np.sqrt(np.mean(avgRV[:,1]**2))
                    fig = plt.figure(figsize=(5,2.5), dpi=100)
                    plt.plot(X['Time [MJD]'], avgRV[:,1], '.', color='blue')                    
                    titletext = 'HD '+Star+' EXPRES RVs'
                    plt.title(titletext,fontsize=16)
                    plt.xlabel('Time [MJD]')
                    plt.ylabel('RV [m/s]')
                    yl_low = int(min(avgRV[:,1])*1.8)
                    yl_high = int(max(avgRV[:,1])*1.8)
                    plt.ylim([yl_low, yl_high])
                    plt.tight_layout()
                    savetofile=RVPickle_folder + 'RV_Plots//' + RV_info + '_RVRMS' + str(round(RV_RMS, 3)) + '_dot.png'
                    fig.savefig(savetofile)
                    
                    fig = plt.figure(figsize=(5,2.5), dpi=100)
                    plt.plot(X['Time [MJD]'], avgRV[:,1], color='blue')
                    plt.title(titletext,fontsize=16)
                    plt.xlabel('Time [MJD]')
                    plt.ylabel('RV [m/s]')
                    plt.ylim([yl_low, yl_high])
                    plt.tight_layout()
                    savetofile=RVPickle_folder + 'RV_Plots//' + RV_info + '_RVRMS' + str(round(RV_RMS, 3)) + '_line.png'
                    fig.savefig(savetofile)
                    
                    print('    RV (RMS) = {}'.format(RV_RMS))
                    RV_data.append([RV_info, RV_RMS])
                    plt.close('all')
                else:
                    print('    RV (RMS) = NaN detected.')
                    RV_RMS = 99999
                    RV_data.append([RV_info, RV_RMS])
                    
    # save RV summary                
    with open(RV_summary_file,'wb') as f:
        pickle.dump(RV_data, f)

# Record the stop time
stop_time = time.time()

# Calculate the elapsed time
elapsed_time = stop_time - start_time

# Print the elapsed time
print(f"Elapsed time: {elapsed_time:.6f} seconds")

########################################## Other Stuff #############################################################################################################
# Compare Images with and without uncertainty noise
uncertaintyOffImg = 'C://Temp0//Raw_v200_w_n50_p13482_to_p13541_OS2_101501_190210.1141_off.jpg'
uncertaintyOnImg =  'C://Temp0//Raw_v200_w_n50_p13482_to_p13541_OS2_101501_190210.1141_on.jpg'
a = plt.imread(uncertaintyOffImg)
plt.imshow(a[:,:,0], cmap='viridis')
plt.figure()
a = plt.imread(uncertaintyOnImg)
plt.imshow(a[:,:,0], cmap='viridis')

# # Generate demo images for paper
img1 = 'C://Temp0//Raw_v100_w_n50_p1670_to_p1725_OS1_101501_190210.1141.jpg'
a = plt.imread(img1)
plt.imshow(a[:,:,0], cmap='viridis')


img2 = 'C://Temp0//Raw_v100_w_n50_p1670_to_p1725_OS2_101501_190210.1141.jpg'
plt.figure()
a = plt.imread(img2)
plt.imshow(a[:,:,0], cmap='viridis')

img1 = 'C://Temp0//Raw_v100_w_n50_p1670_to_p1725_OS4_101501_190210.1141.jpg'
plt.figure()
a = plt.imread(img1)
plt.imshow(a[:,:,0], cmap='viridis')


img2 = 'C://Temp0//Raw_v100_w_n50_p1670_to_p1725_OS8_101501_190210.1141.jpg'
plt.figure()
a = plt.imread(img2)
plt.imshow(a[:,:,0], cmap='viridis')

# --------------------------  Periodogram -----------------------------------------------------------------------------------------------------------------------------------------------------------------
# 1) 101501
star = '101501'
# Plot EXPRES CBC pipeline RV periodogram
data_file = EXPRES_folder + DataSet+Star+'_activity.csv'
X = pd.read_csv(data_file)
x = X['Time [MJD]']
y = X['CBC RV [m/s]']
# # use astropy.timeseries's LombScargle function to draw the periodogram
# frequency, power = LombScargle(x, y).autopower()
# plt.plot(frequency, power)   

# Calculate OpenCV RV
RVPickle_folder = 'RV//101501//AvgSpectrum_with_Chunks_ao8_nMin31_nMax70_cutoff0.75_widthMax96_heightMin0.3_1.3//'
AvgSpectrum_Chunks_File_Name = RVPickle_folder.split('//')[2]
w_oversampling = 2
n_start = 31
n_stop = 70
RV_cutoff = 5000
AvgSpectrum_Chunks_File = DataSet + AvgSpectrum_Chunks_File_Name + '.pickle'
with open(AvgSpectrum_Chunks_File,'rb') as f:
        totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)  
del totTemplate, wavelength, chunkIndices, start_pixel_array, stop_pixel_array   

PickleFiles = [p for p in os.listdir(RVPickle_folder) if p.endswith('.pickle')]  # Get a list of all the pickle files in the folder
for RVPickleFile in PickleFiles:
    if RVPickleFile != 'RV_summary.pickle':
        strArray = RVPickleFile.split('_')
        w_oversampling = int(re.findall(r'\d+', strArray[2])[0])
        n_start = int(re.findall(r'\d+', strArray[4])[0])
        n_stop = int(re.findall(r'\d+', strArray[4])[1])
        avgRV = calculateRV(RVPickle_folder+RVPickleFile, RV_cutoff, w_oversampling, n_start, n_stop, shift_per_pixel)
        y1 = avgRV[:,1]
        
        # nout = 10000
        # w = np.linspace(0.0001, 0.2, nout)  # signal.lombscargle use angular frequency as default
        # f = w/(2*np.pi) # calculate frequency (i.e. 1/T)
        # pgram = signal.lombscargle(x, y, w, normalize=True)
        # pgram1 = signal.lombscargle(x, y1, w, normalize=True)
        
        frequency = np.linspace(0, 20, 100000)
        ls = LombScargle(x, y)
        ls1 = LombScargle(x, y1)
        power = ls.power(frequency)         
        power1= ls1.power(frequency) 
        probabilities = [0.07, 0.05, 0.005, 0.0005]
        FAP = ls.false_alarm_level(probabilities)  
        FAP1 = ls1.false_alarm_level(probabilities)  
        
        fontsize = 18
        fig, ax = plt.subplots(layout='constrained', figsize=(15, 7))
        ax.plot(frequency, power, color = 'red', linewidth=2, alpha=0.7, label='Original EXPRES CBC Method')
        ax.plot(frequency, power1, color = 'blue', linewidth=2, alpha=0.7, label='Computer Vision Method')
        ax.set_xlabel('Frequency [1/Day]', size = fontsize+5)
        ax.set_ylabel('Normalized Power', size = fontsize+5)
        ax_xticks = np.linspace(0.0, 0.10, 11)
        ax.set_xticks(ax_xticks)
        # ax.set_title(RVPickleFile)
        
        def one_over(x):
            """Vectorized 1/x, treating x==0 manually"""
            x = np.array(x, float)
            near_zero = np.isclose(x, 0)
            x[near_zero] = np.inf
            x[~near_zero] = 1 / x[~near_zero]
            return x        
        # the function "1/x" is its own inverse
        inverse = one_over
        
        secax = ax.secondary_xaxis('top', functions=(one_over, inverse))
        secax_xticks=[1000000, 100.0,50.0,33.3,25.0,20.0,16.7,14.3,12.5,11.1,10.0]
        secax_xlabels=secax_xticks.copy()
        secax_xlabels[0]=r"$\infty$"
        secax.set_xticks(secax_xticks)
        secax.set_xticklabels(secax_xlabels)
        secax.get_xticklabels()[0].set_fontsize(21)
        secax.set_xlabel('Period [Day]', size = fontsize+5)
        ax.set_xlim([-0.000001, 0.10])
        plt.show()
        plt.tight_layout()
        plt.legend(loc='upper left', fontsize = str(fontsize))
        
        # add FAP lines
        plt.axhline(y=FAP1[1], color = 'black', linestyle = '--')
        plt.text(0.088, FAP[1]+0.0045, 'FAP=5%', ha='left', va='bottom', size = fontsize)
        plt.axhline(y=FAP1[2], color = 'green', linestyle = '--')
        plt.text(0.088, FAP[2]+0.0045, 'FAP=0.5%', ha='left', va='bottom', size = fontsize)
        plt.axhline(y=FAP1[3], color = 'orange', linestyle = '--')
        plt.text(0.088, FAP[3]+0.0045, 'FAP=0.05%', ha='left', va='bottom', size = fontsize)
        savetofile=RVPickle_folder + 'RV_Plots//' + RVPickleFile+'_Periodogram.png'
        fig.savefig(savetofile)
        plt.close(fig)

# 2) 26965 - v1.1
Star = '26965'
DataSet = Star+'//'   
# Plot EXPRES CBC pipeline RV periodogram
data_file = EXPRES_folder + DataSet+Star+'_activity.csv'
X = pd.read_csv(data_file)
x = X['Time [MJD]'].to_numpy()
y = X['CBC RV [m/s]'].to_numpy()
# # use astropy.timeseries's LombScargle function to draw the periodogram
# frequency, power = LombScargle(x, y).autopower()
# plt.plot(frequency, power)   

# Calculate OpenCV RV
# RVPickle_folder = 'RV//26965//AvgSpectrum_with_Chunks_ao8_nMin0_nMax85_cutoff0.75_widthMax96_heightMin0.3_v1.1//'
# AvgSpectrum_Chunks_File = '26965//AvgSpectrum_with_Chunks_ao8_nMin0_nMax85_cutoff0.75_widthMax96_heightMin0.3_v1.1.pickle'
# RVPickle_folder = 'RV//26965//AvgSpectrum_with_Chunks_ao8_nMin31_nMax70_cutoff0.75_widthMax96_heightMin0.3_1.3//'
# AvgSpectrum_Chunks_File = '26965//AvgSpectrum_with_Chunks_ao8_nMin31_nMax70_cutoff0.75_widthMax96_heightMin0.3_1.3.pickle'
RVPickle_folder = 'RV//26965//AvgSpectrum_with_Chunks_ao4_nMin0_nMax85_cutoff0.75_widthMax48_heightMin0.3_v1.1//'
AvgSpectrum_Chunks_File = '26965//AvgSpectrum_with_Chunks_ao4_nMin0_nMax85_cutoff0.75_widthMax48_heightMin0.3_v1.1.pickle'
w_oversampling = 2
n_start = 31
n_stop = 70
RV_cutoff = 5000
with open(AvgSpectrum_Chunks_File,'rb') as f:
        totTemplate, wavelength, shift_per_pixel, chunkIndices, start_pixel_array, stop_pixel_array = pickle.load(f)  
del totTemplate, wavelength, chunkIndices, start_pixel_array, stop_pixel_array   

PickleFiles = [p for p in os.listdir(RVPickle_folder) if p.endswith('.pickle')]  # Get a list of all the pickle files in the folder
for RVPickleFile in PickleFiles:
    if RVPickleFile != 'RV_summary.pickle':
        avgRV = calculateRV(RVPickle_folder+RVPickleFile, RV_cutoff, w_oversampling, n_start, n_stop, shift_per_pixel)
        y1 = avgRV[:,1]
        
        # nout = 10000
        # w = np.linspace(0.0001, 0.2, nout)  # signal.lombscargle use angular frequency as default
        # f = w/(2*np.pi) # calculate frequency (i.e. 1/T)
        # pgram = signal.lombscargle(x, y, w, normalize=True)
        # pgram1 = signal.lombscargle(x, y1, w, normalize=True)
        
        frequency = np.linspace(0, 20, 100000)
        probabilities = [0.07, 0.05, 0.005, 0.0005]
        ls = LombScargle(x, y)
        power0 = ls.power(frequency)       
        FAP = ls.false_alarm_level(probabilities)  
        ls1 = LombScargle(x, y1)  
        power1= ls1.power(frequency) 
        FAP1 = ls1.false_alarm_level(probabilities)  
        
        fontsize = 18
        fig, ax = plt.subplots(layout='constrained', figsize=(15, 7))
        ax.plot(frequency, power0, color = 'red', linewidth=2, alpha=0.7, label='Original EXPRES CBC Method')
...

This file has been truncated, please download it to see its full contents.

Credits

Katelyn Gan
A junior at Sage Hill School in California, I am passionate about computational technology, math, physics, and environmental studies
Thanks to Vinesh Maguire Rajpaul.
